Why Coding is Important for Data Scientists


As a Data Scientist in training, I have spent much of my orientation to the field learning which skills are needed to become one. In my research and exposure to data science, the areas of knowledge and experience that keep coming up are domain expertise, computer science, and statistics. It appears that the most successful data scientists are competent in all 3 areas, in addition to having deep knowledge of one or more specific areas. So in my pursuit of training as a data scientist, I keep these 3 areas in mind when filling in the gaps in my knowledge and experience.

Master’s Program

I recently had an interview for a master’s program in data science, and I asked the interviewers about the focus of their program. The statistics professor answered by detailing their focus on statistics and machine learning: how to apply appropriate models to specific problems, and how to optimize and test those models. I was impressed with this answer, as it is important for data scientists to understand the algorithms that they are applying to datasets. However, there seem to be some in the field who treat the algorithms and analysis as a black box, where the only things that matter are the selection of the model and the output of the analysis. They don’t care how the analysis was performed, or why the chosen algorithm works better than the others. Fortunately, this master’s program was all about understanding the models and optimizing them, which are important skills for a data scientist.

To Code, Or Not To Code

However, when I asked about their approach to computer science and coding, the statistics professor’s reply was:

‘Coding is cheap…we just outsource that, so it is not that important.’

What the heck?

How can you say that understanding analysis and algorithms is important, and that they should not be treated like a black box, then come out and say that coding is cheap?

I have a different opinion. Coding is a basic necessity for all data scientists. Code is the language of the profession: if you don’t speak it, how are you supposed to communicate your solution, when the data scientist is the liaison between the business analysts and the backend developers? If you can’t code, then you are not able to harness the power of computers, and thus not able to take advantage of that computing power via elegant and sophisticated algorithms. If you can’t code, then you can’t be innovative, and you can’t create new models for the CPUs and data lakes that keep growing rapidly in power and storage capacity.

Manual Versus Automated Analysis

If you can’t code, then you will be forced to analyze your data manually, and you will spend enormous chunks of time just extracting, cleansing, transforming, and migrating your data (the work usually called ETL, for extract, transform, load), because you can’t automate those processes. If you can’t code, then you waste too much time on prepping your data. If you can’t code, you have no time left to perform a business requirements analysis, no time left to choose an appropriate model, and no time left to adequately train, optimize, and fine-tune that model with different features and dimensions.
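To make the contrast with manual prepping concrete, here is a minimal sketch of an automated extract, transform, load pass using only the Python standard library. The sample data, the column names `name` and `amount`, and the three function names are all hypothetical, chosen purely for illustration; a real pipeline would read from and write to actual files or databases.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, a missing value, a duplicate row.
RAW = """name, amount
 Alice , 10.5
Bob,
 Alice , 10.5
Carol, 7
"""

def extract(text):
    """Parse CSV text into a list of dicts, trimming header and cell whitespace."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    return [dict(zip(header, (c.strip() for c in row))) for row in reader if row]

def transform(rows):
    """Drop rows with a missing amount, convert amounts to float, remove duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        if not row.get("amount"):
            continue  # skip incomplete records
        record = (row["name"], float(row["amount"]))
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned

def load(records):
    """Write the cleaned records back out as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "amount"])
    writer.writerows(records)
    return out.getvalue()

cleaned = transform(extract(RAW))
print(load(cleaned))
```

Once the steps are functions, the same cleaning can be rerun on every new export instead of being repeated by hand.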

How do I know all this? Well, I have tried to do data prepping and analysis manually, because I previously lacked coding expertise. I spent too much time prepping, manipulating, parsing, and migrating data by hand, and it took time away from my other roles as a data scientist, which include model selection and fine-tuning.

Data Science and Coding

Now I know better. I have since become proficient at Python, and consider myself a Python programmer. Life is much easier now that I can code and use computing power to do all the things I used to do manually, like scraping websites, extracting and migrating data, ETL, and analysis. I can now spend most of my time on solution documents, requirements analysis, model selection, model fine-tuning and optimization, implementation planning, project management, and communication of the solution.

Basically, if you can’t code, then you will not be an effective Data Scientist. I’m not saying you have to be an expert coder, as you should leave the complex coding to the software engineers who can code in their sleep. What I’m saying is that you do have to speak the language of your profession, as you can’t be effective as a problem solver, designer, analyst, and communicator if you can’t code.

Summary

In summary, real Data Scientists are coders. Some choose not to code, but they still know how. Those who choose not to code are most likely thought leaders in the field, whose high-level expertise is more valuable than their coding skills. Don’t hire a Data Scientist without coding skills. And if you can’t code, then you are not a Data Scientist.

 


Analysis of Lottery Draws Between 2009 and 2017


This project analyzes a dataset of historical lottery draws from 2009 to 2017 inclusive, scraped from a lottery’s website by my colleague, Gregory Horne. We wanted to know whether the winning numbers could be predicted from past draws, but first we needed to know whether the winning numbers cluster or are drawn at random.

In this lottery, each ping-pong ball is labeled with one number from 1 to 49, and one ball for each number is placed in a barrel. The barrel is spun to mix up all the balls, then one ball is drawn. This is repeated 5 more times, giving a set of 6 winning numbers. In addition, there is a bonus draw, which brings the total to 7 winning numbers.
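The draw procedure described above can be sketched as a short simulation. This assumes that drawn balls are not returned to the barrel (so all 7 numbers in a draw are distinct) and that the bonus ball comes from the same barrel after the main 6; the function name `draw_lottery` and its parameters are hypothetical, not taken from the original analysis.

```python
import random

def draw_lottery(rng, pool_size=49, main_count=6, bonus_count=1):
    """Simulate one draw: sample distinct balls, then split main from bonus.

    Assumes sampling without replacement from balls numbered 1..pool_size.
    """
    balls = rng.sample(range(1, pool_size + 1), main_count + bonus_count)
    return balls[:main_count], balls[main_count:]

rng = random.Random(2009)  # fixed seed so the simulation is repeatable
main, bonus = draw_lottery(rng)
print("main numbers:", sorted(main), "bonus:", bonus)
```

A simulator like this gives a baseline of what truly random draws look like, which can be compared against the scraped history.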

We will first analyze the winning numbers from 2009 to 2015, then add the winning numbers from 2016 to 2017, to see how the analysis changes with new data. Thus, we will analyze two lottery datasets, one from 2009 to 2015, and the other from 2016 to 2017.

We propose to perform cluster analysis on this lottery dataset. We hypothesize that the draws are random, so the cluster analysis should find no meaningful structure and the datapoints should be spread uniformly across the feature space. This hypothesis rests on the premise that this specific lottery draw is indeed random in nature. However, if our analysis finds significant clustering, that would call for further analysis and speculation on how winners are determined for this specific lottery.
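As a simpler companion to the uniformity hypothesis, one could check whether each number comes up about equally often. The sketch below is not the cluster analysis proposed here; it swaps in a Pearson chi-square statistic on number frequencies, and it runs on simulated draws rather than the scraped dataset. The function name `chi_square_uniform` is hypothetical. Because balls within a draw are assumed distinct, the counts are not fully independent, so the statistic should be read only as a rough indicator.

```python
import random
from collections import Counter

def chi_square_uniform(draws, pool_size=49):
    """Pearson chi-square statistic of number frequencies against a uniform expectation."""
    counts = Counter(n for draw in draws for n in draw)
    total = sum(counts.values())
    expected = total / pool_size  # every number equally likely under the hypothesis
    return sum(
        (counts.get(n, 0) - expected) ** 2 / expected
        for n in range(1, pool_size + 1)
    )

# Simulated stand-in for the real history: 1000 draws of 6 distinct numbers.
rng = random.Random(0)
simulated = [rng.sample(range(1, 50), 6) for _ in range(1000)]
stat = chi_square_uniform(simulated)
print(f"chi-square statistic: {stat:.1f} (48 degrees of freedom)")
```

For random draws the statistic should land in the general neighborhood of the degrees of freedom; a value many times larger would suggest that some numbers are favored and would justify a closer look.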

Please click on the following link for the detailed analysis: Lottery analysis.
