Admond Lee

My First Data Scientist Internship

Updated: Jan 14, 2020

At the time of writing, it is the day before the last day of my Data Scientist internship at Quantum Inventions. Sitting in front of my laptop and reflecting on the learning journey of the past few months, I can truthfully say it has been hard yet fulfilling.

At the end of the day, the questions remain after a journey comes to an end — What have you learned? Is that what you want?

Pardon me: I am a physics guy who believes in always asking the right questions and seeking the truth by answering them with due reasoning. In fact, asking the right questions is undoubtedly important for a Data Scientist (as will be explained later…).

This post is divided into three sections (Before, During, and After the internship) to give you an overview of my journey. Feel free to jump to any of these sections, depending on your current stage of learning. Let the journey begin!

Thank you! I was emotionally uplifted when my first ever post on Medium received overwhelming support from different people and even got featured and published by Towards Data Science. This has really become my motivation to continue sharing my learning experience with more people, simply because learning is fun and helping others is even better!

Before the Internship Started

I still vividly remember that I started reading the textbook An Introduction to Statistical Learning: with Applications in R the day after my final exam paper in November 2017. That was my first contact with machine learning at a very fundamental, statistical level.

Once I got the hang of the concepts, I moved on to one of the popular courses: Machine Learning, taught by Andrew Ng on Coursera. Things were not as easy as they seemed at first, but Andrew Ng has a natural ability to hold people’s attention regardless of how complex a concept is, and to simplify concepts for digestion like no one else. I guess this was how I really got hooked on machine learning. Just give it a try and you will find that this buzzword is truly not as complex as it sounds. I bet.

Meanwhile, I was also learning about another focused area of artificial intelligence: Deep Learning. To get a sense of what this seemingly foreign term means, please take a look at the explanation of neural networks and how neural networks are used to compute any function. Okay, if you have read the suggested articles and, like me, always need some sort of visualization to understand how things work, then click here. Press the ‘Play’ button, sit back, relax, and watch how neural networks are used for classification and regression. Cool, isn’t it? Feel free to let me know in the comments below for discussion and clarification.

All the reading, doing and learning had (hopefully) prepared me before I commenced my internship in December 2017.

During the Internship

Quantum Inventions specializes in providing mobility intelligence to consumers, corporations and governments by leveraging its integrated suite of mobility applications, enterprise logistics and analytics platforms. I was the first data scientist intern to join the R&D and analytics team.

Within the first few days, I was introduced to amazing colleagues, various traffic terms used in the industry, and the exciting ongoing projects. One of the things that I liked the most about my internship was the trust and freedom given to me as an intern to choose the project that I was interested in and just go all-in for it!

To my surprise, I realized that I was pioneering the project, as no one had done it before. When nobody has done something before, research comes in, and this is what I was grateful for, despite the uncertainties and difficulties. Why? Simply because I had the opportunity to experience most (if not all) of the real data science workflow from scratch.

Allow me to briefly list the steps of the workflow that I went through, as these are what built my foundation in Data Science. I hope you will find it useful in some way. 😊

1. Understanding the Business Problem

The project chosen was about Short Term Freeway Travel Time Prediction. However, like I said, asking the right questions is very important for a Data Scientist. A lot of questions were raised to really understand the real business problem before the project was finalized, such as the data sources available and the end goals of the project (even after I left). Essentially, our objective was to predict the travel time for a freeway in Singapore N minutes ahead more accurately than the current baseline estimation.
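
To make that objective a bit more concrete, here is a minimal sketch of how "predict N minutes ahead" might be framed as a supervised learning problem. Everything in it is an assumption for illustration: the helper name `make_supervised`, the window size, the horizon, the toy travel times, and the naive "last observed value" baseline (the actual baseline used in the project is not described in this post).

```python
def make_supervised(series, n_lags, horizon):
    """Turn a travel-time series into (features, target) pairs.

    Each feature row holds the last `n_lags` observations; the target is
    the value `horizon` steps after the last observation in that row.
    """
    X, y = [], []
    for t in range(n_lags - 1, len(series) - horizon):
        X.append(series[t - n_lags + 1 : t + 1])
        y.append(series[t + horizon])
    return X, y


# Toy travel times in minutes, one reading per time step (made-up values).
travel_times = [10, 11, 13, 16, 15, 14, 12, 11]
X, y = make_supervised(travel_times, n_lags=3, horizon=2)

# Hypothetical naive baseline: assume the travel time stays at the
# most recent observed value.
baseline = [row[-1] for row in X]
```

Any model trained on `X` and `y` would then have to beat `baseline` on held-out data to be worth deploying.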

2. Collecting Data Source

Excited about the new project, I started collecting data sources from databases and colleagues (basically walking around the office to ask questions about data sources). Collecting the right data sources is similar to scraping data from different websites for preprocessing later. It is so important that it can affect the accuracy of the models that you build at a later stage.

3. Data Preprocessing

Real world data is dirty. We can’t expect nicely formatted, clean data like that provided by Kaggle. Therefore, data preprocessing (other people might call it data munging or data cleaning) is so crucial that I can’t stress enough how important it is. It is the most important stage, as it can occupy 40%-70% of the whole workflow just to clean the data to be fed to your models.

Garbage in, Garbage out

One of the things that I like about data science is that you have to be honest with yourself. When you don’t know what you don’t know, and you think the preprocessed data is already clean enough and ready to feed to your models, therein lies the risk of building the correct models with the wrong data. In other words, always question whether the data is technically correct given the domain knowledge that you have, and scrutinize it against stringent thresholds to check for outliers and for missing or inconsistent data across the whole dataset.

I became particularly careful about this after I made the mistake of feeding the models the wrong data, all because of a simple flaw in one of the preprocessing steps.
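
As a rough illustration of the kind of sanity check described above, here is a small sketch that flags missing, physically implausible, and outlying readings. It is not the check used in the project: the function name `flag_suspect_readings`, the valid range, the toy readings, and the choice of a modified z-score (based on the median and the median absolute deviation, with the commonly used cutoff of 3.5) are all assumptions.

```python
import math
import statistics


def flag_suspect_readings(values, min_valid, max_valid, z_threshold=3.5):
    """Return indices of missing, out-of-range, and outlying readings."""
    missing, out_of_range, valid = [], [], []
    for i, v in enumerate(values):
        if v is None or math.isnan(v):
            missing.append(i)
        elif not (min_valid <= v <= max_valid):
            # Domain check: outside the physically plausible range.
            out_of_range.append(i)
        else:
            valid.append((i, v))

    # Modified z-score using the median and the median absolute deviation
    # (MAD), which is less distorted by extreme values than mean and std.
    median = statistics.median(v for _, v in valid)
    mad = statistics.median(abs(v - median) for _, v in valid)
    outliers = [
        i for i, v in valid
        if mad > 0 and 0.6745 * abs(v - median) / mad > z_threshold
    ]
    return {"missing": missing, "out_of_range": out_of_range, "outliers": outliers}


# Made-up travel times in minutes; the 0 to 120 range is a hypothetical bound.
readings = [12.0, 11.5, None, 12.5, -3.0, 13.0, 60.0, 12.0, float("nan"), 11.0]
report = flag_suspect_readings(readings, min_valid=0, max_valid=120)
```

A flagged index is only a prompt to look closer; whether a 60-minute reading is an error or a real traffic jam is exactly where domain knowledge comes in.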

4. Building Models

After some research, I proposed four models to be used in our project: Support Vector Regression (SVR), Multilayer Perceptron (MLP), Long Short-Term Memory (LSTM), and State Space Neural Networks (SSNN). For the sake of brevity, I will not cover them here; you can find detailed explanations of each model on various websites.

Building different models from scratch was a steep learning curve for me as a person who was still learning from MOOCs and textbooks. Fortunately, Scikit-learn and Keras (with a TensorFlow backend) came to my rescue, as they are easy to learn and allow fast model prototyping and implementation in Python. In addition, I also learned how to optimize the models and fine-tune the hyperparameters of each model using several techniques.
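
To give a flavor of what prototyping and tuning can look like, here is a minimal Scikit-learn sketch for the SVR case. It uses grid search, which is just one common tuning technique and not necessarily one used in the project. The synthetic data, the parameter grid, and the choice of time-ordered cross-validation are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: 200 samples, 3 lagged features (made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.1, size=200)

# SVR is sensitive to feature scale, so scaling goes inside the pipeline
# and is re-fitted on each training fold.
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Small illustrative grid; sensible ranges depend on the real data.
param_grid = {"svr__C": [0.1, 1.0, 10.0], "svr__epsilon": [0.01, 0.1]}

# TimeSeriesSplit keeps every validation fold later in time than its
# training fold, so the model is never validated on the past.
search = GridSearchCV(
    pipeline,
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
best_params = search.best_params_
```

After fitting, `search.best_estimator_` holds the pipeline refitted on all the data with the best parameter combination found.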

5. Model Evaluation

To evaluate the performance of each model, I mainly used three metrics:

  1. Mean Absolute Error (MAE)

  2. Mean Squared Error (MSE)

  3. Coefficient of Determination (R²)

At this stage, Steps 3–5 were repeated iteratively until we found the best model that could outperform the baseline estimation.
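
For readers who want to see what the three metrics above actually compute, here is a plain-Python sketch of their standard definitions. The function names and the toy numbers are my own for illustration; in practice, Scikit-learn ships equivalent functions in `sklearn.metrics`.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference; penalizes large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up actual and predicted travel times in minutes.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 17.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
r2 = r2_score(actual, predicted)
```

Lower MAE and MSE are better, while R² closer to 1 is better. Computing the same metrics for the baseline predictions gives the numbers a model has to beat.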

After the Internship

Well, the internship has definitely reaffirmed my passion for Data Science, and I am grateful that my work gained some traction as a foundation for future work. The research and development phase, the communication skills required to talk to different stakeholders, and the curiosity and passion to solve business problems using data (just to name a few) have all contributed to my interest in this field.

The Data Science industry is still very young, and its job descriptions can seem vague and ambiguous to job seekers like us. It’s perfectly normal to not possess all the skills listed, as most job descriptions are written idealistically to match employers' best expectations.

When in doubt, just learn the fundamentals from MOOCs, books, and articles (which I am still doing) and apply what you have learned through your own personal projects or internships. Be patient. The learning journey does take time. Learn from your journey with relish. Because…

At the end of the day, the questions remain after a journey comes to an end — What have you learned? Is that what you want?

Thank you for reading. I hope that this article has given you a brief (not exhaustive) look at the Data Science workflow and a record of my journey.

If you have any questions, feel free to leave your comments below!
