Retrieving, Processing, and Visualizing Enormous Email Data


In this project, I will retrieve a large email data set from the web, place it into a database, and then analyze and clean the email data so that it can be visualized. First, I run the Python script gmane.py to retrieve the email data and place it into a database called content.sqlite:
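Under the hood, the storage step looks roughly like the sketch below. The table layout is my assumption modeled on the script's behavior (see www.py4e.com/code3/gmane/); the real script writes to content.sqlite, while ':memory:' is used here only to keep the example self-contained.

```python
import sqlite3

# Sketch of how gmane.py stores each raw retrieved message.
# Schema is an assumption; gmane.py writes to content.sqlite.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS Messages
               (id INTEGER UNIQUE, email TEXT, sent_at TEXT,
                subject TEXT, headers TEXT, body TEXT)''')

def store_message(msg_id, email, sent_at, subject, headers, body):
    """Insert one raw message; the UNIQUE id silently skips duplicates,
    so the retrieval can be restarted without double-counting."""
    cur.execute('''INSERT OR IGNORE INTO Messages
                   (id, email, sent_at, subject, headers, body)
                   VALUES (?, ?, ?, ?, ?, ?)''',
                (msg_id, email, sent_at, subject, headers, body))
    conn.commit()
```

The INSERT OR IGNORE pattern is what makes the retrieval restartable: already-stored message ids are simply skipped.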

Next, I clean up the email data in content.sqlite by running gmodel.py, which normalizes the email data into structured tables linked by primary and foreign keys, and writes the result to a second database, index.sqlite:
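The core idea of the normalization step can be sketched as follows: repeated strings (sender addresses, subjects) each become a single row in a lookup table, and Messages stores integer foreign keys instead. The table and column names here are my assumptions modeled on the script, not its exact schema.

```python
import sqlite3

# Sketch of gmodel.py's normalization: string values are replaced
# by integer keys into lookup tables. Names are assumptions.
conn = sqlite3.connect(':memory:')  # gmodel.py writes index.sqlite
cur = conn.cursor()
cur.executescript('''
CREATE TABLE Senders  (id INTEGER PRIMARY KEY, sender TEXT UNIQUE);
CREATE TABLE Subjects (id INTEGER PRIMARY KEY, subject TEXT UNIQUE);
CREATE TABLE Messages (id INTEGER PRIMARY KEY, sender_id INTEGER,
                       subject_id INTEGER, sent_at TEXT);
''')

def lookup(table, column, value):
    """Return the primary key for value, inserting a new row if needed."""
    cur.execute(f'INSERT OR IGNORE INTO {table} ({column}) VALUES (?)',
                (value,))
    cur.execute(f'SELECT id FROM {table} WHERE {column} = ?', (value,))
    return cur.fetchone()[0]

def add_message(sender, subject, sent_at):
    """Store one message using foreign keys instead of raw strings."""
    cur.execute('''INSERT INTO Messages (sender_id, subject_id, sent_at)
                   VALUES (?, ?, ?)''',
                (lookup('Senders', 'sender', sender),
                 lookup('Subjects', 'subject', subject),
                 sent_at))
    conn.commit()
```

Because each sender string is stored only once, the normalized index.sqlite is much smaller than the raw content.sqlite.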

I then run the gbasic.py program to calculate basic histogram data on the email messages that were retrieved. I compute the top 5 email list participants and the top 5 email list organizations:
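The histogram itself is just a frequency count. A minimal sketch of that computation (the real script reads the data from index.sqlite; here the sender list is passed in directly):

```python
from collections import Counter

# Sketch of the kind of histogram gbasic.py computes:
# count messages per sender and take the top N.
def top_senders(senders, n=5):
    """senders: iterable of sender addresses.
    Returns [(sender, message_count), ...] sorted by count."""
    return Counter(senders).most_common(n)
```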

Next, I produce a word cloud visualization for the messages that were retrieved, showing the most used words, using gword.py, and visualizing via gword.htm:
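Behind the word cloud is a word-frequency pass over the message text, written out as JavaScript for gword.htm to load. This sketch shows the idea; the exact output format gword.py emits is an assumption on my part.

```python
import string
from collections import Counter

# Sketch of the word-frequency step behind gword.py.
def word_counts(text, n=100):
    """Strip punctuation, lowercase, and count the words."""
    cleaned = text.translate(str.maketrans('', '', string.punctuation))
    return Counter(cleaned.lower().split()).most_common(n)

def to_gword_js(counts):
    """Render counts as a JavaScript array for the word-cloud page.
    (Format is an assumption, not necessarily gword.py's exact output.)"""
    entries = ',\n'.join(f"{{text: '{w}', size: {c}}}" for w, c in counts)
    return 'gword = [\n' + entries + '\n];\n'
```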

And finally, I produce a timeline visualization of the retrieved email messages by running the gline.py program and visualizing the results in gline.htm:
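The timeline is built by bucketing message dates by month for each sender, so the page can draw one line per sender over time. A minimal sketch of that aggregation (the grouping key is my assumption about how gline.py organizes the data):

```python
from collections import defaultdict

# Sketch of the aggregation behind gline.py: count messages
# per (sender, month) so each sender becomes one line over time.
def monthly_counts(messages):
    """messages: iterable of (sender, 'YYYY-MM-DD') tuples.
    Returns {(sender, 'YYYY-MM'): count}."""
    buckets = defaultdict(int)
    for sender, sent_at in messages:
        buckets[(sender, sent_at[:7])] += 1  # 'YYYY-MM' prefix of the date
    return dict(buckets)
```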

This retrieval, processing, and visualization project was part of a capstone for a Python certification course. All source code and data are open source, with credits below.

Photo credit: kOoLiNuS Sparrow App review – 005 via photopin (license).

Source code credit: obtained from www.py4e.com/code3/gmane/ under a Creative Commons Attribution license.

Google Geocoding API with a Database and Visualization on Google Maps


Today I will use the Google Geocoding API with a database and visualize the mapping data on Google Maps. In this project, I have a list of university names with no location information. I will add location coordinates to each university using the Google Geocoding API, load the names and location data into a database, and then visualize that data on Google Maps. Here are the university names in where.data (I added Saint Mary's University in Halifax in the second row – Go Huskies!):


I then run the Python script geoload.py to look up each university entry in where.data, call the Google Geocoding API to add location coordinates, and place all of this data into a database called geodata.sqlite:
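Each lookup boils down to building a request URL and pulling the coordinates out of the JSON response. The URL and response shape below follow the Google Geocoding API; the API-key handling is an assumption (the py4e version of geoload.py actually routes through a proxy service rather than calling Google directly).

```python
import urllib.parse

# Sketch of the per-entry lookup geoload.py performs.
def geocode_url(address, key='YOUR_API_KEY'):
    """Build a Google Geocoding API request URL for one address.
    (Key handling is an assumption; py4e uses a proxy service.)"""
    params = urllib.parse.urlencode({'address': address, 'key': key})
    return 'https://maps.googleapis.com/maps/api/geocode/json?' + params

def extract_location(js):
    """Pull (lat, lng) out of a parsed Geocoding API response,
    or return None if the lookup failed."""
    if js.get('status') != 'OK':
        return None
    loc = js['results'][0]['geometry']['location']
    return (loc['lat'], loc['lng'])
```

The extracted coordinates are what get written alongside each university name into geodata.sqlite.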

I then run geodump.py to read the database and produce where.js:
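The dump step just serializes the stored rows as a JavaScript array that where.html can load. This sketch shows the idea; the exact format where.js uses is an assumption on my part.

```python
# Sketch of how geodump.py turns rows from geodata.sqlite
# into the where.js file that where.html reads.
def write_where_js(rows):
    """rows: [(lat, lng, name), ...] -> JavaScript source string.
    (Variable name and layout are assumptions.)"""
    entries = ',\n'.join(f'[{lat},{lng}, {name!r}]'
                         for lat, lng, name in rows)
    return 'myData = [\n' + entries + '\n];\n'
```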

Finally, I open where.html to visualize the location markers in Google Maps:


This project was performed as part of a Python certification course. All materials in this project are open source, with credits below.

Photo credit: bertboerland Mapping geocoded tweets and flicks pics from me via photopin (license).

Source code and script credit: source code and scripts obtained from www.py4e.com/code3/geodata/ under a Creative Commons Attribution license.

Webcrawl and Pagerank of a Website


Today, I will demonstrate a web crawl and PageRank of a website. For the parser, I'm using a Python script, spider.py, which incorporates BeautifulSoup, a Python library for pulling data out of HTML and XML files. I'll limit the crawl to 100 pages, and I'll crawl AnxietyBoss.com, a leading website for anxiety with thousands of posts. Here is a snippet of the web crawl:
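The heart of the crawl is extracting the href links from each fetched page. spider.py does this with BeautifulSoup; the stdlib-only sketch below shows the equivalent extraction step so you can see what the parser is doing.

```python
from html.parser import HTMLParser

# Stdlib sketch of the link extraction spider.py does with
# BeautifulSoup: collect every href from the <a> tags on a page.
class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def extract_links(html):
    """Return all href values found in an HTML string."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links
```

Each extracted link is queued for crawling until the 100-page limit is reached, and the page-to-page link structure is stored for the ranking step.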

After running the spider.py script, I ran a PageRank script, sprank.py, which ranks each crawled page based on the number and quality of the links pointing to it. I went through 100 iterations. Here is a snippet of the page-ranking run:
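A simplified version of the iteration looks like this: on each pass, every page splits its current rank among the pages it links to, with a damping factor mixing in a small uniform share. The damping value 0.85 is the standard choice in the PageRank literature, not necessarily what sprank.py uses.

```python
# Simplified sketch of the iteration sprank.py runs.
def pagerank(links, iterations=100, d=0.85):
    """links: {page: [pages it links to]}; returns {page: rank}.
    Ranks start uniform and converge over the iterations."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page keeps a small uniform share (the damping term)...
        new = {p: (1 - d) / len(pages) for p in pages}
        # ...and each page distributes its rank among its out-links.
        for page, outs in links.items():
            if outs:
                share = d * rank[page] / len(outs)
                for out in outs:
                    new[out] += share
        rank = new
    return rank
```

Pages with many incoming links from high-ranked pages end up with the highest scores, which is exactly the ordering the next step prints.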

Next, I use spdump.py to view the PageRank values of the links crawled:
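Conceptually the dump step just joins each rank back to its URL and sorts. A minimal sketch (the real spdump.py reads these from the crawl database; the names here are my assumptions):

```python
# Sketch of the spdump.py step: join ranks back to URLs
# and list the top pages by rank.
def dump_top(ranks, urls, n=25):
    """ranks: {page_id: rank}; urls: {page_id: url}.
    Returns the top n (url, rank) pairs, highest rank first."""
    top = sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return [(urls[pid], rank) for pid, rank in top]
```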

Finally, I further visualize the top 25 links using force.html:

Force-Directed Layout

You can play around with this visual by dragging the nodes (balls) around on the screen to see their connections to other nodes in different configurations. You can also click on a node to go to its specific link.

This project was performed as part of a capstone project for a Python certification course. All materials in this project are open source, with credits below.

Photo credit: kolacc20 Very simplified PageRank distribution graph via photopin (license).

Source code and script credit: spider.py, sprank.py, spdump.py, force.js, force.css, and force.html obtained from www.py4e.com/materials under a Creative Commons Attribution license.