Retrieving, Processing, and Visualizing Enormous Email Data


In this project, I retrieve a large email data set from the web, place it into a database, and then analyze and clean up the data so that it can be visualized. I start by running the Python script gmane.py to retrieve the email data and place it into a database called content.sqlite:
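To give a flavor of what this step does, here is a minimal sketch of the kind of retrieval loop a script like gmane.py performs; the archive URL, table name, and columns below are simplified placeholders of my own, not the actual py4e code:

    # Simplified sketch of a message-retrieval loop (illustration only)
    import sqlite3
    import urllib.request

    BASE_URL = 'http://example.com/mbox/'   # placeholder archive URL (assumption)

    conn = sqlite3.connect('content.sqlite')
    cur = conn.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS Messages (id INTEGER PRIMARY KEY, email TEXT)')

    # Resume after the last message already stored, so the script can be restarted
    cur.execute('SELECT MAX(id) FROM Messages')
    row = cur.fetchone()
    start = (row[0] or 0) + 1

    for msg_id in range(start, start + 10):      # small batch for illustration
        url = BASE_URL + str(msg_id)
        try:
            text = urllib.request.urlopen(url).read().decode(errors='replace')
        except Exception as err:
            print('Stopping at message', msg_id, err)
            break
        cur.execute('INSERT INTO Messages (id, email) VALUES (?, ?)', (msg_id, text))
        conn.commit()

    conn.close()

Because each message is committed as it arrives, the retrieval can be interrupted and restarted without losing work.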

Next, I clean up the email data in content.sqlite by running gmodel.py, which normalizes the data into structured tables with primary and foreign keys and writes the result to a second database, index.sqlite:
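The normalization idea is easy to illustrate: raw sender strings become integer keys in a lookup table, and each message row references that key. The table and column names below are simplified assumptions for illustration, not the actual index.sqlite schema:

    # Simplified sketch of normalizing senders into a lookup table (illustration only)
    import sqlite3

    conn = sqlite3.connect('index.sqlite')
    cur = conn.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS Senders (id INTEGER PRIMARY KEY, sender TEXT UNIQUE)')
    cur.execute('''CREATE TABLE IF NOT EXISTS Messages
                   (id INTEGER PRIMARY KEY, sender_id INTEGER, subject TEXT, sent_at TEXT)''')

    def sender_key(cur, sender):
        """Return the primary key for a sender, inserting it if new."""
        cur.execute('INSERT OR IGNORE INTO Senders (sender) VALUES (?)', (sender,))
        cur.execute('SELECT id FROM Senders WHERE sender = ?', (sender,))
        return cur.fetchone()[0]

    # Example of one row being normalized
    sid = sender_key(cur, 'someone@example.com')
    cur.execute('INSERT INTO Messages (sender_id, subject, sent_at) VALUES (?, ?, ?)',
                (sid, 'Re: a thread', '2015-06-01 10:00:00'))
    conn.commit()
    conn.close()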

I then run the gbasic.py program to calculate basic histogram data on the email messages that were retrieved. I compute the top 5 email list participants and the top 5 email list organizations:
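Under the simplified Senders/Messages layout sketched above (again, an assumption for illustration, not the real index.sqlite schema), the "top participants" histogram boils down to a GROUP BY query like this:

    # Simplified sketch of a "top participants" query (illustration only)
    import sqlite3

    conn = sqlite3.connect('index.sqlite')
    cur = conn.cursor()
    cur.execute('''SELECT Senders.sender, COUNT(*) AS cnt
                   FROM Messages JOIN Senders ON Messages.sender_id = Senders.id
                   GROUP BY Senders.sender
                   ORDER BY cnt DESC LIMIT 5''')
    for sender, count in cur.fetchall():
        print(sender, count)
    conn.close()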

Next, I produce a word cloud visualization of the retrieved messages, showing the most-used words, by running gword.py and viewing the result in gword.htm:
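The word counting behind such a cloud can be sketched as below; the subject-line source, stop-word handling, and output variable name are my own simplifications, not the actual gword.py:

    # Simplified sketch of counting the most-used words for a word cloud (illustration only)
    import sqlite3
    from collections import Counter

    conn = sqlite3.connect('index.sqlite')
    cur = conn.cursor()
    cur.execute('SELECT subject FROM Messages')

    counts = Counter()
    for (subject,) in cur.fetchall():
        if not subject:
            continue
        for word in subject.lower().split():
            if len(word) > 3:               # crude filter for short/stop words
                counts[word] += 1

    # Emit the top words as a JavaScript array for the word-cloud page to read
    with open('gword.js', 'w') as out:
        out.write('gword = ' + repr([[w, c] for w, c in counts.most_common(100)]) + ';\n')

    conn.close()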

And finally, I produce a timeline visualization of the retrieved email messages by running the gline.py program and viewing the results in gline.htm:

This retrieval, processing, and visualization project was part of a capstone for a Python certification course. All source code and data are open source, with credits below.

Photo credit: kOoLiNuS Sparrow App review – 005 via photopin (license).

Source code credit: obtained from www.py4e.com/code3/gmane/ under a Creative Commons Attribution license.

Google Geocoding API with a Database and Visualization on Google Maps


Today I will use the Google Geocoding API with a database and visualize the mapping data on Google Maps. In this project, I have a list of university names with no location information. I will add location coordinates to the universities using the Google Geocoding API, load the university names with their location data into a database, and then visualize that data on Google Maps. Here are the university names in where.data (I added Saint Mary’s University in Halifax in the second row – Go Huskies!):


I then run the Python script geoload.py to look up each university entry in where.data, call the Google Geocoding API to add location coordinates, and place all of this data into a database called geodata.sqlite:
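A minimal sketch of this kind of geocoding loop is below; the endpoint is Google’s public Geocoding API, but the API key, table layout, and error handling are simplified assumptions on my part, not the actual geoload.py:

    # Simplified sketch of a geocoding loop (illustration only; needs your own API key)
    import json
    import sqlite3
    import urllib.parse
    import urllib.request

    API_URL = 'https://maps.googleapis.com/maps/api/geocode/json'
    API_KEY = 'YOUR_KEY_HERE'   # assumption: supply a valid Google API key

    conn = sqlite3.connect('geodata.sqlite')
    cur = conn.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS Locations (address TEXT PRIMARY KEY, geodata TEXT)')

    with open('where.data') as handle:
        for line in handle:
            address = line.strip()
            if not address:
                continue
            # Skip addresses that were already geocoded on a previous run
            cur.execute('SELECT geodata FROM Locations WHERE address = ?', (address,))
            if cur.fetchone() is not None:
                continue
            params = urllib.parse.urlencode({'address': address, 'key': API_KEY})
            data = urllib.request.urlopen(API_URL + '?' + params).read().decode()
            if json.loads(data).get('status') == 'OK':
                cur.execute('INSERT INTO Locations (address, geodata) VALUES (?, ?)',
                            (address, data))
                conn.commit()

    conn.close()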

I then run geodump.py to read the database and produce where.js:
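Producing where.js from the database is a small transformation; the sketch below assumes the Locations table layout from the previous sketch and a myData variable name, which may differ from the real geodump.py:

    # Simplified sketch of dumping geocoded rows to where.js (illustration only)
    import json
    import sqlite3

    conn = sqlite3.connect('geodata.sqlite')
    cur = conn.cursor()
    cur.execute('SELECT address, geodata FROM Locations')

    entries = []
    for address, geodata in cur.fetchall():
        result = json.loads(geodata)['results'][0]
        loc = result['geometry']['location']
        entries.append([loc['lat'], loc['lng'], address])

    with open('where.js', 'w') as out:
        out.write('myData = ' + json.dumps(entries) + ';\n')

    conn.close()
    print('Wrote', len(entries), 'entries to where.js')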

Finally, I open where.html to visualize the location markers in Google Maps:

(Embedded Google Map, “A Map of Information,” showing the geocoded university locations on carlocarandang.com.)

This project was performed as part of a Python certification course. All materials in this project are open source, with credits below.

Photo credit: bertboerland Mapping geocoded tweets and flicks pics from me via photopin (license).

Source code and script credit: obtained from www.py4e.com/code3/geodata/ under a Creative Commons Attribution license.

Web Crawl and PageRank of a Website


Today, I will demonstrate a web crawl and PageRank computation on a website. For the parser, I’m using a Python script, spider.py, which incorporates BeautifulSoup, a Python library for pulling data out of HTML and XML files. I’ll limit the number of pages to crawl to 100, and I will crawl AnxietyBoss.com, a leading website for anxiety with thousands of posts. Here is a snippet of the web crawl:
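For readers who want to see the shape of such a crawler, here is a minimal sketch of a BeautifulSoup crawl capped at 100 pages; the seed URL and link filtering are simplified placeholders, not the actual spider.py:

    # Simplified sketch of a bounded BeautifulSoup crawl (illustration only)
    from urllib.parse import urljoin
    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    seed = 'https://example.com/'        # placeholder seed URL (assumption)
    to_visit = [seed]
    visited = set()
    links = {}                           # page -> set of outbound links within the site

    while to_visit and len(visited) < 100:
        url = to_visit.pop()
        if url in visited:
            continue
        try:
            html = urlopen(url).read()
        except Exception:
            continue
        visited.add(url)
        soup = BeautifulSoup(html, 'html.parser')
        outbound = set()
        for anchor in soup.find_all('a', href=True):
            link = urljoin(url, anchor['href']).split('#')[0]
            if link.startswith(seed):    # stay within the site being crawled
                outbound.add(link)
                to_visit.append(link)
        links[url] = outbound

    print('Crawled', len(visited), 'pages')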

After running the spider.py script, I ran a PageRank script, sprank.py, to rank the crawled pages based on the number and quality of the links pointing to each page, using 100 iterations. Here is a snippet of the page-ranking run:
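The idea behind the ranking step can be shown with a simplified PageRank update over a link graph like the one the crawler builds; this is a sketch of the algorithm itself, not the actual sprank.py, and it ignores details such as dangling pages:

    # Simplified sketch of iterative PageRank over a link graph (illustration only)
    def pagerank(links, iterations=100, damping=0.85):
        """links: dict mapping each page to the set of pages it links to."""
        pages = set(links) | {p for outs in links.values() for p in outs}
        rank = {page: 1.0 for page in pages}
        for _ in range(iterations):
            new_rank = {page: (1.0 - damping) for page in pages}
            for page, outs in links.items():
                if not outs:
                    continue
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new_rank[target] += share
            rank = new_rank
        return rank

    # Tiny hand-made graph to show the ranking in action
    demo = {'a': {'b', 'c'}, 'b': {'c'}, 'c': {'a'}}
    for page, score in sorted(pagerank(demo).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))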

Next, I use spdump.py to display the PageRank values of the crawled pages:

Finally, I further visualize the top 25 links using force.html:

(Embedded force-directed layout visualization of the top 25 crawled links.)

You can play around with this visual by dragging the nodes (balls) around on the screen to see their connections (links) to other nodes in different configurations. You can also click on each node to go to the corresponding page.

This project was performed as part of a capstone project for a Python certification course. All materials in this project are open source, with credits below.

Photo credit: kolacc20 Very simplified PageRank distribution graph via photopin (license).

Source code and script credit: spider.py, sprank.py, spdump.py, force.js, force.css, and force.html obtained from www.py4e.com/materials under a Creative Commons Attribution license.

A Tour of Data Science Educational Programs


In my quest to become a data scientist, I have embarked on a series of educational journeys, including both formal, in-school programs and self-paced MOOCs (massive open online courses). Let me start at the beginning of my path to getting an education and training in data science.

Business Intelligence Analytics Advanced Diploma Program

I previously worked as a process engineer at an oil refinery, then switched careers and became a physician, a psychiatrist. After many years of seeing patients, teaching medical students, and performing clinical research, I decided it was time to switch to yet another career.

As I was helping my son pick out courses for his upcoming enrollment at a local community college, I stumbled upon a Business Intelligence Analytics program at that same college. As I read through the program description and courses, it looked very interesting to me, and I decided to enroll in the program to learn about business intelligence analytics.

As I went through the first term of the program, it occurred to me that I was more interested in machine learning and predictive analytics than in just looking at historical data and presenting descriptive statistics about what happened last quarter. Although the program gave me a good overview of databases, I found it lacking in many ways, as the school did not put in the resources to connect students with co-op jobs in the local data analytics industry. So when I was unable to find a co-op job in the summer after the first term, I decided that I would need to augment my education rather than just rely on the school’s program and let the summer go to waste.

MOOCs, Machine Learning, R, and Python

Over the summer break, since I could not find a co-op job in data analytics, I decided to look into online learning through MOOCs (massive open online courses). I started with a machine learning MOOC, which was quite informative and hands-on, and I became skilled at using R to apply regression, classification, and clustering models to different datasets. After that first success with machine learning and R, I decided to take more courses in computer science and programming.

I particularly liked the MOOC that introduced me to Python programming. Once I was introduced to Python, I was hooked, as I saw its versatility for web scraping, data parsing, databases, analysis, and visualization. Within just 4 weeks of daily Python immersion through MOOCs and solving problems on hackerrank.com, I became good enough to call myself a Python programmer.

It was also around that time that it dawned on me that I had the requisite computer science skills to apply to data science master’s programs. Applying to master’s programs was now realistic, since I had accumulated sufficient skills in databases, machine learning, and coding, which many of the master’s programs in data science require. Also, when I looked at the various job listings for data scientists, almost all of the companies required either a master’s degree or 2 to 3 years of experience as a data scientist.

That cemented my decision to apply to grad school. Fortunately, I did quite well in the various database courses I took at the community college and was able to obtain excellent faculty references, which is one of the main requirements for applying to the master’s programs.

MS Analytics Program

As I mentioned in a previous post, I interviewed at an MS Analytics program, but it did not include computer science and coding in its curriculum and was too focused on statistics. Given the program’s total disregard for coding and computer science, I decided against it (and that decision was mutual, as they did not offer me a position, citing stiff competition). That statistics professor was way off base, though, and obviously did not know the definition of a data scientist, confusing data scientists with statisticians. Data scientists are not only experts at statistics…they are also experts at computer science and coding, in addition to having domain expertise.

This being my first exposure to graduate programs in data science, I began to question their ROI (return on investment), especially since I already have two university degrees, one of which is a technical degree in chemical engineering. I wondered whether certificate and diploma programs in data science might be the better option for me, given my industry experience and engineering degree. And of course, I could take more MOOCs, as that is how I learned machine learning and how to code! But fortunately, my next interview with a graduate program really impressed me, and I impressed them (how do I know this…read on!).

MSc in Computing and Data Analytics

Good news! I was accepted into a Master of Science program in Computing and Data Analytics. This program is a well-balanced mix of computer science, statistics, and business intelligence. I had to pass a programming test to get in, as they only accept data science grad students who can actually code…imagine that! I’m very glad to be in this program and am looking forward to starting grad school this fall. For me, this program was the missing piece in my data science training: since my other degrees and industry experience were not in the IT industry, I need a master’s degree in data science.

EMC Data Science Certifications

Even though I am slated to start grad school in a few weeks, I still decided to take the EMC Data Science Associate (EMCDSA) course, as many practicing data scientists hold the EMCDSA certification. Once I obtain my EMCDSA this summer, I plan to continue to the next level and work on obtaining the EMC Data Science Specialist (EMCDSS) certification. The great thing about these certifications is that they are also offered in MOOC format. Fortunately, I have a study group for these courses, and we meet in person weekly to go over the material we learned during the week.

Summary

So that is my tour of data science educational programs. For me, getting a master’s degree in data science from a well-balanced program is key to my education and training as a data scientist. I don’t believe everyone needs to take the same path, but it is an example of how one person is getting trained in a field that currently has no official standards for training. The master’s degree will serve as my foundation, while MOOCs and the various data science certifications will augment and enhance my training and experience.

As a word of caution, if you are looking into a master’s program in data science, please pick a program that is well balanced across all the core areas of data science: computer science (algorithms and coding), statistics, and business intelligence. Skip the ones that ignore computer science, and skip the ones that ignore statistics.

Good luck on your journey to becoming a data scientist, and please contact me should you have any questions.

photo credit: velkr0 classroom via photopin (license)

Why Coding is Important for Data Scientists


As a Data Scientist in training, much of my orientation to the field has been about what skills are needed to become one. In my research and exposure to the field of data science, the knowledge, experience, and skill sets that data scientists need fall into three areas: domain expertise, computer science, and statistics. The most successful data scientists appear to have expertise in all three areas, in addition to deep knowledge of one or more specific domains. So in my pursuit of training as a data scientist, I keep these three areas in mind when looking to fill in the gaps in my knowledge and experience.

Master’s Program

I recently had an interview for a master’s program in data science, and I asked about the focus of their program. The statistics professor answered by detailing their focus on statistics and machine learning: how to apply appropriate models to specific problems, and how to optimize and test those models. I was impressed with this answer, as it is important for data scientists to understand the algorithms that they apply to datasets. However, there seem to be some in the field who treat the algorithms and analysis as a black box, where the only things that matter are the selection of the model and the output of the analysis…they don’t care how the analysis was performed, or why the chosen algorithm works better than the others. Fortunately, this master’s program was all about understanding the models and optimizing them, which are important skills for a data scientist.

To Code, Or Not To Code

However, when I asked about their approach to computer science and coding, the statistics professor’s reply was:

‘Coding is cheap…we just outsource that, so it is not that important.’

What the heck?

How can you say that understanding analysis and algorithms is important, and that they should not be treated as a black box, and then turn around and say that coding is cheap?

I have a different opinion. Coding is a basic necessity for all data scientists…if you can’t speak the language of your field, then how are you supposed to communicate your solution as the data scientist who acts as the liaison between the business analysts and the back-end developers? If you can’t code, then you are not able to harness the power of computers, and thus not able to take advantage of that computing power through elegant and sophisticated algorithms. If you can’t code, then you can’t be innovative, and you can’t create new models to run on the CPUs and data lakes whose power and storage capacity are increasing at an exponential rate.

Manual Versus Automated Analysis

If you can’t code, then you are forced to analyze your data manually, and you spend enormous chunks of time just extracting, cleansing, transforming, and migrating your data (also known as ETL), because you can’t automate those processes. If you can’t code, you waste too much time prepping your data, leaving no time to perform a business requirements analysis, choose an appropriate model, or adequately train, optimize, and fine-tune that model with different features and dimensions.

How do I know all this? Well, I used to prep and analyze data manually, due to my previous lack of coding expertise. I spent too much time manipulating, parsing, and migrating data by hand, and it took time away from my other roles as a data scientist, including model selection and fine-tuning.

Data Science and Coding

Now I know better…I have since become proficient at Python and consider myself a Python programmer. Life is much easier when you can code and use computing power to do all the things I used to do manually, like scraping websites, extracting and migrating data, ETL, and analysis. I can now spend most of my time on solution documents, requirements analysis, model selection, model fine-tuning and optimization, implementation planning, project management, and communication of the solution.

Basically, if you can’t code, then you will not be an effective Data Scientist. I’m not saying you have to be an expert coder; you should leave the complex coding to the software engineers who can code in their sleep. What I’m saying is that you do have to speak the language of your profession, because you can’t be effective as a problem solver, designer, analyst, and communicator if you can’t code.

Summary

In summary, real Data Scientists are coders. Some choose not to code, but they still know how; those who choose not to code are most likely thought leaders in the field, whose high-level expertise is more valuable than their coding skills. Data Scientists are coders. Don’t hire one without coding skills. And if you can’t code, then you are not a Data Scientist.


photo credit: markus spiske html php java source code via photopin (license)

Analysis of Lottery Draws Between 2009 and 2017


This project entails the analysis of a dataset of historical lottery draws between 2009 and 2017 inclusive, scraped from the lottery’s website by my colleague, Gregory Horne. We wanted to know whether the winning numbers could be predicted from past draws, which first required determining whether the winning numbers clustered or were drawn at random.

In this lottery, each ping-pong ball is labeled with one number from 1 to 49, and one ball of each number is placed in a barrel. The barrel is spun to mix up the balls, then one ball is drawn. This is repeated 5 more times for a winning set of 6 numbers. In addition, there is a bonus draw, bringing the total to 7 winning numbers.

We will first analyze the winning numbers from 2009 to 2015, then add the winning numbers from 2016 to 2017 to see how the analysis changes with new data. Thus, we will analyze two lottery datasets, one covering 2009 to 2015 and the other covering 2016 to 2017.

We propose to perform cluster analysis on this lottery dataset. We hypothesize that the draws are random, and therefore that the data points should be distributed uniformly in the feature space with no meaningful clusters. This hypothesis is based on the premise that this specific lottery draw is indeed random in nature. However, if our analysis finds significant clustering, that would prompt further analysis and speculation about the method used to determine winners for this specific lottery.
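As a hedged illustration of the approach, the sketch below runs k-means on draw data and compares the within-cluster inertia against the same analysis on freshly generated uniform draws; the number of clusters, the sample size, and the stand-in data are my own assumptions, not the analysis linked below:

    # Simplified sketch of a k-means clustering check on lottery draws (illustration only)
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    def random_draws(n):
        """Stand-in data: n draws of 6 distinct numbers from 1-49, sorted per draw."""
        return np.sort(np.array([rng.choice(np.arange(1, 50), size=6, replace=False)
                                 for _ in range(n)]), axis=1)

    # In the real analysis, `draws` would be loaded from the scraped dataset instead.
    draws = random_draws(500)
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(draws)
    print('Inertia on draw data:', round(kmeans.inertia_, 1))

    # Reference: if the real draws are random, their inertia should be of similar
    # magnitude to that of a uniform reference sample of the same size.
    reference = random_draws(500)
    ref_kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(reference)
    print('Inertia on uniform reference:', round(ref_kmeans.inertia_, 1))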

Please click on the following link for the detailed analysis: Lottery analysis.

photo credit: chrisjtse 41:366:2016 via photopin (license)