Retrieving, Processing, and Visualizing Enormous Email Data

In this project, I retrieve a large email data set from the web and store it in a database, then clean up and analyze the data so that it can be visualized. First, I run the Python script gmane.py to retrieve the email data and place it into a database called content.sqlite.
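To illustrate the storage step, here is a minimal sketch of saving retrieved messages in SQLite so that a rerun does not duplicate rows. It is not the gmane.py source: the table layout, the store_messages helper, and the sample rows are all assumptions made for the example, and an in-memory database stands in for content.sqlite.

```python
import sqlite3

def store_messages(conn, messages):
    """Insert (id, sender, subject, sent_at, body) rows, skipping ids already stored."""
    # Hypothetical schema; the real content.sqlite layout may differ.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS Messages (
               id INTEGER PRIMARY KEY,
               sender TEXT, subject TEXT, sent_at TEXT, body TEXT)"""
    )
    # OR IGNORE leaves existing rows alone, so an interrupted run can be restarted.
    conn.executemany(
        "INSERT OR IGNORE INTO Messages (id, sender, subject, sent_at, body) "
        "VALUES (?, ?, ?, ?, ?)",
        messages,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM Messages").fetchone()[0]

# Made-up rows standing in for messages retrieved from the web.
sample_messages = [
    (1, "alice@example.edu", "Hello", "2006-01-05T10:00:00", "first message"),
    (2, "bob@example.org", "Re: Hello", "2006-01-06T11:30:00", "a reply"),
]
raw_conn = sqlite3.connect(":memory:")
stored = store_messages(raw_conn, sample_messages)
stored_again = store_messages(raw_conn, sample_messages)  # second run adds nothing
```

Keying each row on the message id is what makes a long retrieval job safe to stop and resume.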

Next, I clean up the email data in content.sqlite by running gmodel.py, which normalizes it into structured tables linked by primary and foreign keys and writes the result to the database index.sqlite.
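The idea behind normalization can be sketched as follows: each distinct sender is stored once in its own table, and every message refers to that sender by key. This is only an illustration under assumed names (the Senders and Messages tables, the normalize function, and the sample rows are hypothetical), not the schema that gmodel.py actually builds.

```python
import sqlite3

def normalize(raw_rows):
    """Build a small normalized database from (message_id, sender_email, subject) rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE Senders (id INTEGER PRIMARY KEY, email TEXT UNIQUE);
        CREATE TABLE Messages (
            id INTEGER PRIMARY KEY,
            sender_id INTEGER REFERENCES Senders(id),
            subject TEXT
        );
    """)
    for msg_id, email, subject in raw_rows:
        # Store each address once, then look up its primary key.
        db.execute("INSERT OR IGNORE INTO Senders (email) VALUES (?)", (email,))
        sender_id = db.execute(
            "SELECT id FROM Senders WHERE email = ?", (email,)
        ).fetchone()[0]
        # The message keeps only the foreign key, not the address text.
        db.execute(
            "INSERT INTO Messages (id, sender_id, subject) VALUES (?, ?, ?)",
            (msg_id, sender_id, subject),
        )
    db.commit()
    return db

# Made-up raw rows for the example.
raw_rows = [
    (1, "alice@example.edu", "Hello"),
    (2, "bob@example.org", "Re: Hello"),
    (3, "alice@example.edu", "Another thread"),
]
index_db = normalize(raw_rows)
sender_count = index_db.execute("SELECT COUNT(*) FROM Senders").fetchone()[0]
alice_messages = index_db.execute(
    "SELECT COUNT(*) FROM Messages JOIN Senders ON Messages.sender_id = Senders.id "
    "WHERE Senders.email = ?",
    ("alice@example.edu",),
).fetchone()[0]
```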

I then run the gbasic.py program to compute basic histogram data on the retrieved email messages: the top 5 email list participants and the top 5 email list organizations.
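A histogram like this amounts to counting how often each sender appears and keeping the largest counts. The sketch below is my own illustration, not gbasic.py itself; the top_n function and the sample addresses are hypothetical, and it assumes that the domain part of an address can stand in for the sender's organization.

```python
from collections import Counter

def top_n(senders, n=5):
    """Return the n most frequent senders and the n most frequent domains."""
    people = Counter(senders)
    # Assumption: the text after "@" identifies the organization.
    orgs = Counter(address.split("@")[-1] for address in senders)
    return people.most_common(n), orgs.most_common(n)

# Made-up sender list, one entry per message.
sample_senders = (
    ["a@x.org"] * 3 + ["b@x.org"] * 2 + ["c@y.edu"]
)
top_people, top_orgs = top_n(sample_senders)
```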

Next, I produce a word cloud visualization showing the most frequently used words in the retrieved messages, by running gword.py and viewing the result in gword.htm.
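The data behind a word cloud is a word count scaled to font sizes, so that frequent words are drawn larger. Here is a rough sketch of that calculation; the word_sizes function, its thresholds (minimum word length, size range), and the sample text are assumptions for illustration and may not match what gword.py does.

```python
import re
from collections import Counter

def word_sizes(texts, top=100, smallest=20, largest=80, min_len=4):
    """Map the most common words to font sizes between smallest and largest."""
    counts = Counter(
        word
        for text in texts
        for word in re.findall(r"[a-z]+", text.lower())
        if len(word) >= min_len  # skip short words such as "the" or "re"
    )
    common = counts.most_common(top)
    if not common:
        return {}
    high, low = common[0][1], common[-1][1]
    spread = max(high - low, 1)  # avoid dividing by zero when all counts match
    return {
        word: int(smallest + (count - low) / spread * (largest - smallest))
        for word, count in common
    }

# Made-up message text for the example.
sample_texts = ["sakai release plan", "sakai build error", "release notes"]
sizes = word_sizes(sample_texts)
```

A page such as gword.htm could then read these sizes and lay the words out; that part is left to the visualization library.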

Finally, I produce a timeline visualization of the retrieved email messages by running the gline.py program and viewing the results in gline.htm.
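A timeline chart needs the messages grouped into time buckets, with one count per series in each bucket. The sketch below groups by month and by sender; this is an assumed layout for illustration (the timeline_rows function, the ISO-style timestamps, and the sample data are hypothetical), and gline.py may bucket the data differently.

```python
from collections import Counter

def timeline_rows(messages):
    """Turn (sent_at, sender) pairs into a header and one row of counts per month."""
    counts = Counter()
    for sent_at, sender in messages:
        month = sent_at[:7]  # "YYYY-MM" prefix of an ISO timestamp
        counts[(month, sender)] += 1
    months = sorted({month for month, _ in counts})
    senders = sorted({sender for _, sender in counts})
    header = ["month"] + senders
    # Missing (month, sender) pairs count as zero.
    rows = [[month] + [counts[(month, s)] for s in senders] for month in months]
    return header, rows

# Made-up messages for the example.
sample_timeline = [
    ("2006-01-05T10:00:00", "a@x.org"),
    ("2006-01-20T09:15:00", "a@x.org"),
    ("2006-02-01T14:45:00", "b@y.edu"),
]
header, rows = timeline_rows(sample_timeline)
```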

This retrieval, processing, and visualization project was part of a capstone for a Python certification course. All source code and data are open source, with credits below.

Photo credit: kOoLiNuS☸ Sparrow App review – 005 via photopin (license).

Source code credit: obtained from www.py4e.com/code3/gmane/ under a Creative Commons Attribution license.
