Why Coding is Important for Data Scientists

As a Data Scientist in training, I have spent much of my orientation to the field learning which skills are needed to become one. From my research and exposure to data science, the knowledge and skillsets that data scientists draw on fall into three areas: domain expertise, computer science, and statistics. It appears that the most successful data scientists have expertise in all three, on top of deep knowledge of one or more specialties. So in my training, I keep these three areas in mind when filling in the gaps in my knowledge and experience.

Master’s Program

I recently interviewed for a master’s program in data science, and I asked the interviewers what the program focuses on. The statistics professor answered by detailing their focus on statistics and machine learning: how to apply appropriate models to specific problems, and how to optimize and test those models. I was impressed with this answer, as it is important for data scientists to understand the algorithms they apply to datasets. However, there seem to be some in the field who treat the algorithms and analysis as a black box, where the only things that matter are the selection of the model and the output of the analysis. They don’t care how the analysis was performed, or why the chosen algorithm works better than the others. Fortunately, this master’s program was all about understanding the models and optimizing them, which are important skills for a data scientist.

To Code, Or Not To Code

However, when I asked about their approach to computer science and coding, the statistics professor’s reply was:

‘Coding is cheap…we just outsource that, so it is not that important.’

What the heck?

How can you say that it is important to understand the analysis and algorithms rather than treat them as a black box, and then turn around and say that coding is cheap?

I have a different opinion. Coding is a basic necessity for all data scientists. If you don’t speak the language of your profession, how are you supposed to communicate your solution when you are the liaison between the business analysts and the backend developers? If you can’t code, you can’t harness the power of computers or take advantage of it through elegant and sophisticated algorithms. If you can’t code, you can’t be innovative, and you can’t create new models for the CPUs and data lakes that keep growing rapidly in power and storage capacity.

Manual Versus Automated Analysis

If you can’t code, you will be forced to analyze your data manually, and you will spend enormous chunks of time just extracting, cleansing, transforming, and loading your data (the work known as ETL: extract, transform, load), because you can’t automate those processes. If you can’t code, you waste too much time prepping your data. If you can’t code, no time is left to perform a business requirements analysis, to choose an appropriate model, or to adequately train, optimize, and fine-tune that model with different features and dimensions.
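To make the contrast concrete, here is a minimal sketch of what automating those extract, transform, and load steps can look like in Python, using only the standard library. The sample data, the column names, the `sales` table, and the function names are all made up for illustration; a real pipeline would read from actual files or databases and would need its own cleaning rules.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the columns and values are invented for this sketch.
RAW_CSV = """customer,region,amount
 Alice ,north,100.50
Bob,SOUTH,
carol,South,75
Alice,North,20.25
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, normalize case, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if not amount:
            continue  # skip incomplete records instead of fixing them by hand
        customer = row["customer"].strip().title()
        region = row["region"].strip().lower()
        cleaned.append((customer, region, float(amount)))
    return cleaned


def load(records, connection):
    """Load: write the cleaned records into a SQLite table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, region TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    connection.commit()


def run_pipeline(text):
    """Run all three steps and return total sales per region."""
    connection = sqlite3.connect(":memory:")
    load(transform(extract(text)), connection)
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    totals = dict(connection.execute(query).fetchall())
    connection.close()
    return totals


if __name__ == "__main__":
    print(run_pipeline(RAW_CSV))
```

Once the steps live in functions like these, rerunning the whole cleanup on next month’s export is a single call rather than another afternoon of manual editing.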

How do I know all this? I have tried to do data prep and analysis manually, because I previously lacked coding expertise. I spent too much time preparing, manipulating, parsing, and migrating data by hand, and it took time away from my other roles as a data scientist, which include model selection and fine-tuning.

Data Science and Coding

Now I know better. I have since become proficient at Python, and consider myself a Python programmer. Life is much easier now that I can code and use computing power to do all the things I used to do manually, like scraping websites, extracting and migrating data, ETL, and analysis. I can now spend most of my time on solution documents, requirements analysis, model selection, model fine-tuning and optimization, implementation planning, project management, and communication of the solution.

Basically, if you can’t code, then you will not be an effective Data Scientist. I’m not saying you have to be an expert coder, as you should leave the complex coding to the software engineers who can code in their sleep. What I’m saying is that you do have to speak the language of your profession, as you can’t be effective as a problem solver, designer, analyst, and communicator if you can’t code.

Summary

In summary, real Data Scientists are coders. Some choose not to code, but they still know how. Those who choose not to code are most likely thought leaders in the field, whose high-level expertise is more valuable than their coding skills. Don’t hire a Data Scientist without coding skills. And if you can’t code, then you are not a Data Scientist.

 

