Introduction

Over the last three years I’ve held three different roles at Cochise College: the Dean of Extended Learning; the Dean of Institutional Research, Effectiveness, and Planning; and the Dean of Special Projects for the Vice President of Instruction/Provost. I wondered how my email messages changed as I changed roles. The ultimate goal of this project is to create a word cloud of the “To” addresses in the email messages I’ve sent to see if changes in my personal network are evident from my sent email patterns. Since it would be difficult to download the message header for every message I have sent over the past three years, I decided to consider only my email traffic for the month of October in 2016, 2017, and 2018. I selected October because by that time of the year I was firmly entrenched in my new role and October is a relatively busy month at a community college.

The Google API can be used to access many user services, like Google Drive and Google Docs, but for this tutorial I’m using it to access Gmail. While the Gmail API is very robust and can do things like send messages and apply labels to messages, I am interested in only downloading a list of message IDs that match a specified search. I posted another file to show how I created my Google API access. Note: the Google API demonstration is part of the R demonstrations but the process creates a generic API access point that can be used with any scripting language.

For this project I decided to use Python because I wanted a small, well-defined project with a solid end point that would let me explore the language. For the record, I am using Anaconda, a distribution of Python 3.7 that installs about 1400 Python packages related to data science along with several important additional tools, like the Spyder Integrated Development Environment.

Full Disclosure: Python is new to me and, like a lot of programmers, I freely steal (er, repurpose) code that I find online. This project, like Frankenstein’s Monster, was assembled from pieces found at the Google Developer’s site, in blogs, at Stack Overflow, and other locations.

The first step in a Python program is to load the packages needed beyond the base Python language, which is similar to loading libraries at the start of an R script. I loaded these seven packages.

  • errors (from apiclient). The apiclient library is provided by Google as a simple way to use the Google APIs. The errors module defines the exceptions, such as HttpError, that are raised when an API call fails, which permits script writers to catch and manage those situations.

  • build (from apiclient.discovery). This function builds a service object whose methods are used to make requests to a Google API.

  • Http (from httplib2). This is the HTTP client class from the httplib2 library. Once authorized, it carries the requests sent to the Google API.

  • file, client, and tools (from oauth2client). The oauth2client library provides tools needed to access resources protected by OAuth 2.0, like the Google API.

  • WordCloud, STOPWORDS (from wordcloud). This is a library of functions that build word cloud displays from a provided list of words.

  • matplotlib.pyplot. This is a standard Python plot library. It is given an alias of plt in this script (which is rather common).

  • re. This is a library of regular expression (regex) functions.

from apiclient import errors
from apiclient.discovery import build
from httplib2 import Http
from oauth2client import file, client, tools
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import re 

For simplicity, I set up the Gmail search string as a variable that I initialize at the top of my script. That way, if I reuse this script later it is easy for me to find and change the search string I send to Gmail. I crafted the Gmail search string by opening Gmail in my browser, using the “Advanced Search” options to look for messages, and then copying the search string Gmail reports at the top of the results screen.

I also set up a variable that is used to generate two different output file names. For example, files named “dec18.txt” and “dec18.png” are created without my having to find the specific file name lines buried in the script.

msgQry = 'is:sent after:2016/09/30 and before:2016/11/01'
fDate = 'dec18' # monYY, used to name saved files
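
Since the same script is run once for each October, the search string and the file label could also be generated from a year and month. The following is only a sketch of that idea; the function names sent_query_for_month and file_label are my own inventions for illustration and are not part of the original script. It also assumes that Gmail combines search terms separated by a space, so the “and” is left out.

```python
import calendar
from datetime import date, timedelta

# Fixed month labels so the result does not depend on the system locale.
MONTHS = ('jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec')


def sent_query_for_month(year, month):
    """Build a Gmail search string for mail sent during one month."""
    first_day = date(year, month, 1)
    last_day = date(year, month, calendar.monthrange(year, month)[1])
    after = first_day - timedelta(days=1)   # day before the month starts
    before = last_day + timedelta(days=1)   # day after the month ends
    return 'is:sent after:{:%Y/%m/%d} before:{:%Y/%m/%d}'.format(after, before)


def file_label(year, month):
    """Build a monYY label, for example 'oct16'."""
    return '{}{:02d}'.format(MONTHS[month - 1], year % 100)


print(sent_query_for_month(2016, 10))
print(file_label(2016, 10))
```

With helpers like these, msgQry and fDate would always describe the same month.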

Now it is time to actually set up and call the Gmail API. This code was copied from the Google site and I made only a couple of minor changes.

  • SCOPES establishes the scope of this API access. Since I only needed read-only access to retrieve message headers, I set the scope to readonly. The Google API has a number of potential scopes, so if I ever want to manipulate my Gmail through a Python script I may have to request higher privileges.

  • store This creates an object that manages the access tokens saved in a local file named credentials.json.

  • creds This uses the “get” function from the store object to attempt to read previously saved access and refresh tokens from that local file. If successful, it stores those tokens in the “creds” object.

  • if not… If no saved credentials are found, or they are no longer valid, new credentials are requested using my local “secret” file.

  • flow This sets up the authorization request from the client_secret.json file. That file was generated by Google when I created my API account and I downloaded and stored it on my local computer. Note: I also included a line in my .gitignore file so that file is not synchronized.

  • creds The run_flow function walks me through the authorization, saves the result in the store, and returns an object that contains the access and refresh tokens needed by OAuth2 to access my Gmail data.

  • service The credentials are used to build a link to the Gmail API.

SCOPES = 'https://www.googleapis.com/auth/gmail.readonly'
store = file.Storage('credentials.json')
creds = store.get()
if not creds or creds.invalid:
    flow = client.flow_from_clientsecrets('client_secret.json', SCOPES)
    creds = tools.run_flow(flow, store)
service = build('gmail', 'v1', http=creds.authorize(Http()))

The next Python block is a function named list_messages_matching_query, and it is the core of the entire script. Once the API is available, the next step is to find the identification numbers for a queried group of email messages, and this function takes care of that chore. The function returns only a list of identification numbers, so another step is necessary to request the header data for each of those messages; I decided to put that job in a separate function. This function was pretty much copied from the Google API Developer’s page and I will not offer any further explanation of the various lines.

def list_messages_matching_query(service=service, user_id='me', query=''):
    try:
        response = service.users().messages().list(
            userId=user_id, q=query).execute()
        messages = []
        if 'messages' in response:
            messages.extend(response['messages'])
        while 'nextPageToken' in response:
            page_token = response['nextPageToken']
            response = service.users().messages().list(
                userId=user_id, q=query, pageToken=page_token).execute()
            messages.extend(response.get('messages', []))
        return messages
    except errors.HttpError as error:
        print('An error occurred: %s' % error)
        return [] # Return an empty list so the caller can still loop
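
The paging logic in that function can be tried without touching Gmail at all. The sketch below is my own illustration, not Google’s code: FakeMessages and FakeRequest are hypothetical stand-ins that imitate only the list(...).execute() call pattern, and the message IDs and page token are made up.

```python
class FakeRequest:
    """Imitates the request object returned by list(); holds one response."""

    def __init__(self, payload):
        self._payload = payload

    def execute(self):
        return self._payload


class FakeMessages:
    """Hypothetical stand-in for service.users().messages()."""

    def __init__(self, pages):
        self._pages = pages  # maps a page token (or None) to a response

    def list(self, userId, q, pageToken=None):
        return FakeRequest(self._pages[pageToken])


def collect_all(messages_api, user_id='me', query=''):
    """Follow nextPageToken until the last page and gather every message."""
    found = []
    token = None
    while True:
        kwargs = {'userId': user_id, 'q': query}
        if token:
            kwargs['pageToken'] = token
        response = messages_api.list(**kwargs).execute()
        found.extend(response.get('messages', []))
        token = response.get('nextPageToken')
        if not token:
            return found


# Two canned pages: the first one points at the second.
pages = {
    None: {'messages': [{'id': 'a1', 'threadId': 't1'}],
           'nextPageToken': 'p2'},
    'p2': {'messages': [{'id': 'b2', 'threadId': 't2'}]},
}
all_msgs = collect_all(FakeMessages(pages), query='is:sent')
print([m['id'] for m in all_msgs])
```

The loop keeps asking for the next page as long as a response carries a nextPageToken, which is the same idea the copied function relies on.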

I created another function called get_one_msg that will retrieve the headers from a single message whose identification number is passed to the function. Notice that I request only the To, CC, and BCC header information be returned.

def get_one_msg(msg_id):
    try:
        metaHeaders = ['To','CC','BCC']
        message = service.users().messages().get(
            userId = 'me', 
            id=msg_id, 
            format='metadata', 
            metadataHeaders=metaHeaders).execute()
        return message
    except errors.HttpError as error:
        print('An error occurred: %s' % error)

Now it is time for me to call the list_messages_matching_query function. I created a variable named msg_id_and_thread that will contain the list of email identification numbers returned from the function. I am sending service, which is the Gmail API service object built above; ‘me’, which the API treats as the authenticated user making the call; and the query string defined early in the script.

msg_id_and_thread = list_messages_matching_query(service, 'me', msgQry)

The list that is returned from Google contains a “dictionary” object for each email message found. Each dictionary object contains two elements: a message identification number and a thread identification number. Since I am only interested in the message identification number, I ran a short for loop to extract that number for each message returned into a list I named msg_ids.

msg_ids = []
for item in msg_id_and_thread:
    msg_ids.append(item['id']) # keep the message ID, ignore 'threadId'
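
The shape of that list is easier to see with a small example. The IDs below are made up for illustration, and the list comprehension is simply a compact way to pull out the same values.

```python
# Made-up sample of what the list call returns: one dictionary per message.
sample_response = [
    {'id': '166a1f', 'threadId': '166a1f'},
    {'id': '166b2c', 'threadId': '1669aa'},
]

# Keep only the message identification number from each dictionary.
sample_ids = [item['id'] for item in sample_response]
print(sample_ids)
```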

The next step is to use the message identification numbers in msg_ids to retrieve each message header from Gmail. The following code calls the get_one_msg() function defined above once per message. If a header value includes a Cochise College email address, that value is appended to the email_addr variable.

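email_addr = ''
for one_msg in msg_ids:
    msg_header = get_one_msg(one_msg) # Get a message header
    if not msg_header:
        continue # Skip a message whose header could not be retrieved
    # A message with no To, CC, or BCC header may lack a 'headers' entry
    for get_email_addr in msg_header['payload'].get('headers', []):
        # Include only cochise addresses
        if get_email_addr['value'].find('cochise.edu') != -1:
            email_addr += (get_email_addr['value'])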
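
To see what that loop is filtering, it helps to look at a single header. The sketch below uses a made-up message with the same payload and headers layout; the function name matching_header_values is hypothetical and only wraps the filter so it can be tried without an API call.

```python
def matching_header_values(msg_header, domain='cochise.edu'):
    """Return the header values that mention the given domain."""
    if not msg_header:
        return []
    headers = msg_header.get('payload', {}).get('headers', [])
    return [h['value'] for h in headers if domain in h.get('value', '')]


# Made-up message in the layout returned for format='metadata'.
sample_header = {
    'id': '166a1f',
    'payload': {
        'headers': [
            {'name': 'To', 'value': 'Pat Doe <doep@cochise.edu>'},
            {'name': 'Cc', 'value': 'someone@example.com'},
        ]
    },
}
print(matching_header_values(sample_header))
```

Only the “To” value survives here because the “Cc” value does not contain the college domain.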

Once the Cochise College email addresses are all added to the email_addr string, I use a regular expression to find all instances of a “<” followed by characters then a “@”. I extract all of the characters between those two delimiters and place that in a variable named usr_ids. These are the user identifiers taken from the email string, in other words, the “selfg” part of my email address.

regex_exp = re.compile(r'(?<=\<)(.*?)(?=\@)')
usr_ids = re.findall(regex_exp,email_addr)
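
One limitation worth knowing: the pattern only finds addresses wrapped in angle brackets. The sketch below runs it on a made-up header value and, as an alternative that is not part of the original script, uses getaddresses from Python’s standard email.utils module to pick up bare addresses as well. The function name local_parts is my own.

```python
import re
from email.utils import getaddresses

# Made-up header value: one bracketed address and one bare address.
sample = 'Alice Smith <smitha@cochise.edu>, jonesb@cochise.edu'

# The tutorial's pattern: the text between "<" and "@".
pattern = re.compile(r'(?<=<)(.*?)(?=@)')
bracketed_only = pattern.findall(sample)
print(bracketed_only)


def local_parts(header_value, domain='cochise.edu'):
    """Return the part before '@' for every address at the given domain."""
    ids = []
    for _name, addr in getaddresses([header_value]):
        local, _, host = addr.partition('@')
        if host.lower().endswith(domain):
            ids.append(local.lower())
    return ids


print(local_parts(sample))
```

The second approach also drops any address from another domain that happens to share a header with a college address.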

While the next block was not necessary for creating a word cloud, I decided to save the list of identifiers in a text file for future use. By doing that, I can create new word cloud images by resubmitting the same list of names without having to rerun the API call.

fName = fDate + '.txt'
with open(fName, 'w') as fh:
    for item in usr_ids:
        fh.write("%s\n" % item)
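
Reading the file back in is just as short. This sketch is my own addition: save_ids mirrors the block above, load_ids is a hypothetical helper that rebuilds the list, and the demonstration writes to a temporary folder with made-up identifiers.

```python
import os
import tempfile


def save_ids(path, ids):
    """Write one identifier per line."""
    with open(path, 'w') as fh:
        for item in ids:
            fh.write('%s\n' % item)


def load_ids(path):
    """Read the identifiers back, skipping blank lines."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


demo_path = os.path.join(tempfile.mkdtemp(), 'demo_ids.txt')
save_ids(demo_path, ['smitha', 'jonesb', 'smitha'])
reloaded = load_ids(demo_path)
print(reloaded)
```

A list loaded this way could be handed to the word cloud code in place of usr_ids.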

The next block sets up the word cloud and then generates it. First, the identifiers are conditioned a bit to allow for potential odd identifiers. The word cloud call is fairly easy to understand and the various parameters are set in the function call.

comment_words = ' ' # Words to map in the word cloud
stopwords = set(STOPWORDS) # Words to ignore 
# iterate through the list of user identifiers
for val in usr_ids: 
    val = str(val) # cast val to string
    tokens = val.split() # split val
    for i in range(len(tokens)): 
        tokens[i] = tokens[i].lower() # convert tokens to lc
    for words in tokens: 
        comment_words = comment_words + words + ' '
wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='white',
                stopwords = stopwords,
                min_font_size = 10).generate(comment_words) 
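
A word cloud sizes each word by how often it appears, so it can be useful to look at the counts directly. The sketch below is an illustration with made-up identifiers; the function name id_frequencies is hypothetical. It uses Counter from the standard library, and the resulting mapping is the kind of input that the WordCloud method generate_from_frequencies accepts.

```python
from collections import Counter


def id_frequencies(ids):
    """Count how many times each lowercased identifier appears."""
    return Counter(str(i).lower() for i in ids if str(i).strip())


freqs = id_frequencies(['SmithA', 'jonesb', 'smitha'])
print(freqs.most_common())
```

The most frequent identifiers in this count are the ones that should appear largest in the image.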

The wordcloud variable is sent to matplotlib where it is plotted and eventually saved as a .PNG file.

plt.figure(figsize = (6, 6), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
fName = fDate + '.png'
plt.savefig(fName)
plt.close()

Summary

The three word clouds that I created with this script can be found in my blog post for December 4: Personal Email Network, and readers are welcome to take a look.

Other than my own comments in the Python file, which were not included in this tutorial, the code found in this tutorial is complete. Users who want to do so can copy/paste the chunks one after the other and recreate the entire script. This was a splendid little Python project and was a great way to begin to incorporate Python in my daily work.