Email-Counting-SQLite.ipynb

by tirthajyoti

notebooks/Email-Counting-SQLite.ipynb

Simple SQLite demo: Reading a text file, inserting email id's into the Database, and counting average emails received from a particular domain

Dr. Tirthajyoti Sarkar, Fremont, CA 94536

Import SQLite library of Python (it is built-in)

  • Create a connection
  • Create a cursor object
import sqlite3
conn = sqlite3.connect('emaildb.sqlite')
cur=conn.cursor()

Run SQL commands through the cursor

  • Drop the previous Table if it exists
  • Create a Table with email and counts as attributes
cur.execute('DROP TABLE IF EXISTS Counts')
cur.execute('''
CREATE TABLE Counts (email TEXT, count INTEGER)''')
<sqlite3.Cursor at 0x7fcb6c0eea40>

Use urllib method to establish a connection with a remote server to read the inbox data

import urllib.request, urllib.parse, urllib.error

print("Opening the file connection...")
# Following example reads Project Gutenberg EBook of Pride and Prejudice
fhand = urllib.request.urlopen('http://data.pr4e.org/mbox.txt')
Opening the file connection...

Read the text data from the remote server to create a local text file

txt_dump = ''
line_count=0
word_count=0
# Iterate over the lines in the file handler object and dump the data into the text string. 
# Also increment line and word counts
for line in fhand:
# Use decode method to convert the UTF-8 to Unicode string
    txt_dump+=line.decode()
    line_count+=1
    # Count the length of words in the line and add to the running count
    word_count+=len(line.decode().split(' '))

Prints basic informationn about the text data

print("Printing some info on the text dump\n"+"-"*60)
print("Total characters:",len(txt_dump))
print("Total words:",word_count)
print(f"Total lines: {line_count}")
Printing some info on the text dump
------------------------------------------------------------
Total characters: 6687002
Total words: 630427
Total lines: 132045

Open a local file handler with the text file

file = open('mbox.txt','w') 
file.write(txt_dump)
file.close()
fh=open('mbox.txt')

Show first few lines of the text data

show_text=fh.read(1000)
print(show_text)
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiproject.org>
Received: from murder (mail.umich.edu [141.211.14.90])
	 by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
	 Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
	 by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
	 Sat, 05 Jan 2008 09:14:16 -0500
Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])
	by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;
	Sat, 5 Jan 2008 09:14:15 -0500
Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])
	BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ; 
	 5 Jan 2008 09:14:10 -0500
Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])
	by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;
	Sat,  5 Jan 2008 14:10:05 +0000 (GMT)
Message-ID: <200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
R

Read the text file line by line to extract the email address and INSERT INTO/UPDATE the SQL Table

for line in fh:
    if not line.startswith('From: '): continue
    pieces = line.split()
    email = pieces[1]
    cur.execute('SELECT count FROM Counts WHERE email = ? ', (email,))
    row = cur.fetchone()
    if row is None:
        cur.execute('''INSERT INTO Counts (email, count)
                VALUES (?, 1)''', (email,))
    else:
        cur.execute('UPDATE Counts SET count = count + 1 WHERE email = ?',
                    (email,))
conn.commit()

Execute SELECT-FROM command to read email count from the database and ORDER BY the count

sqlstr = 'SELECT email,count FROM Counts ORDER BY count DESC LIMIT 20'

for row in cur.execute(sqlstr):
    print(str(row[0]), row[1])
zqian@umich.edu 195
mmmay@indiana.edu 161
cwen@iupui.edu 158
chmaurer@iupui.edu 111
aaronz@vt.edu 110
ian@caret.cam.ac.uk 96
jimeng@umich.edu 93
rjlowe@iupui.edu 90
dlhaines@umich.edu 84
david.horwitz@uct.ac.za 67
ajpoland@iupui.edu 48
wagnermr@iupui.edu 44
gjthomas@iupui.edu 44
sgithens@caret.cam.ac.uk 43
ray@media.berkeley.edu 32
stephen.marquard@uct.ac.za 29
gsilver@umich.edu 28
josrodri@iupui.edu 28
nuno@ufp.pt 28
bkirschn@umich.edu 27

Run AVG command with a LIKE matching string to count the average number of emails from a particular source

sqlstr = 'SELECT AVG(count) FROM Counts WHERE email LIKE "%umich%"'
for row in cur.execute(sqlstr):
    print("Average email received from Univ. of Michigan address:",float(row[0]))
Average email received from Univ. of Michigan address: 44.63636363636363

Close the remote and local file handler and the cursor connection

fh.close()
cur.close()
fhand.close()