CS 5293 Spring 19


This is the web page for Text Analytics at the University of Oklahoma.

View the Project on GitHub oudalab/cs5293sp19

CS 5293, Spring 2019 Project 0

Due 2/21 end of day

In this activity, we are going to stretch the superpowers you have learned thus far. Use your knowledge of Python and the Linux command line tools to extract information from a scraped file and add it to an SQLite database.

The Norman, Oklahoma police department regularly reports incidents, arrests, and other activity. This data is hosted on their website and is distributed to the public in the form of PDF files.

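The website contains three types of summaries: arrests, incidents, and case summaries. Your assignment in this project is to collect just the arrests. To do so, you need to write code to download the data, extract the fields, create an SQLite database, insert the data, and print a status row.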

Below we describe the assignment structure and each required function. Please read through this whole document before starting!

README.md

The README file name should be all uppercase with either no extension or a .md extension. It should contain your name, an example of how to run the code, directions on how to install and use it, and a list of any web or external resources that you used for help. Describe all functions and your approach to developing the database. Also list any known bugs and any assumptions made while writing the program, and cite any sources or people you used for help. Note that you should not copy code from any website not provided by the instructor.

COLLABORATORS file

This file should contain a comma-separated list of who you worked with and a short text description of the nature of the collaboration. This information should be listed in three fields, as in the example below:

Katherine Johnson, kj@nasa.gov, Helped me understand calculations
Dorothy Vaughan, doro@dod.gov, Helped me with multiplexed time management

Project Descriptions

Your code structure should be in a directory with the following format:

cs5293sp19-project0/
├── COLLABORATORS
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── README.md
├── project0
│   ├── __init__.py
│   └── main.py
├── docs
├── setup.cfg
├── setup.py
└── tests
    ├── test_download.py
    ├── test_fields.py
    └── ... 

Feel free to combine or optimize functions as long as your code preserves the behavior of main.py.

main.py
# -*- coding: utf-8 -*-
# Example main.py
import argparse

import project0

def main(url):
    # Download data
    project0.fetchincidents(url)

    # Extract Data
    incidents = project0.extractincidents()
	
    # Create Database
    db = project0.createdb()
	
    # Insert Data
    project0.populatedb(db, incidents)
	
    # Print Status
    project0.status(db)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--arrests", type=str, required=True, 
                         help="The arrest summary url.")
     
    args = parser.parse_args()
    if args.arrests:
        main(args.arrests)

Your code should take a url from the command line and perform each operation. After the code is installed, you should be able to run the code using the command below.

pipenv run python project0/main.py --arrests <url>

Each run of the above command should create a new normanpd.db database file. You can add other command line parameters to test each operation.

Below is a discussion of each interface. Note that the function names are suggestions and can be changed to suit the programmer.

Download Data

The function fetchincidents(url) takes the URL passed on the command line and uses the Python urllib.request library to download one of the arrest summary PDFs from the Norman police report webpage.

You can use the code below to download a daily arrest summary PDF.

import urllib.request

url = ("http://normanpd.normanok.gov/filebrowser_download/"
       "657/2019-01-24%20Daily%20Arrest%20Summary.pdf")

# data is a bytes object holding the raw PDF
data = urllib.request.urlopen(url).read()

The downloaded file can be saved to the file system, perhaps in the /tmp directory, and its location can be kept in a local variable, a config file, or any other method of your choosing, as long as the next function can read the data from the arrest summary.
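The download step described above might be sketched as follows. This is only an illustration under stated assumptions: the `dest` parameter and the default file name `arrests.pdf` are hypothetical and not part of the assignment.

```python
import os
import tempfile
import urllib.request


def fetchincidents(url, dest=None):
    """Download the PDF at `url` and return the path it was saved to."""
    if dest is None:
        # Hypothetical default location; any path the next step can read works.
        dest = os.path.join(tempfile.gettempdir(), "arrests.pdf")

    # urlopen returns a file-like response; read() yields the raw bytes.
    with urllib.request.urlopen(url) as response:
        data = response.read()

    # Write in binary mode because PDF data is not text.
    with open(dest, "wb") as fp:
        fp.write(data)
    return dest
```

Returning the path is one way to let the extraction step find the file without a shared global or config file.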

Extract Data

The function extractincidents() takes no parameters. It reads data from the PDF files and extracts the incidents. Each incident includes a date/time, case number, arrest location, offense, arrestee name, arrestee birthday, arrestee address, status, and officer. A city, state, and zip code will typically be available if the arrested person is not homeless or transient. This data is hidden inside of a PDF file.

To extract the data from the PDF files, use the PyPDF2.PdfFileReader class. It allows you to pull pages out of the PDF file and search their text for the rows. Extract each row and add it to a list.

Here is an example Python script that takes a bytes object containing PDF data, writes it to a temporary file, and reads it using the PyPDF2 module.

import tempfile

from PyPDF2 import PdfFileReader

fp = tempfile.TemporaryFile()

# Write the PDF data (a bytes object) to a temp file
fp.write(data)

# Set the cursor of the file back to the beginning
fp.seek(0)

# Read the PDF
pdfReader = PdfFileReader(fp)
pdfReader.getNumPages()

# Get the text of the first page
page1 = pdfReader.getPage(0).extractText()

This function can return a list of rows so another function can easily insert the data into a database. For this assignment you only need to consider the first page of any pdf.
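One possible way to group the extracted text into rows is sketched below. It rests on an assumption that is not confirmed by the assignment: that the extracted page text places one cell per line and that every record begins with a date/time such as "1/22/2019 20:13". The helper name `split_rows` is hypothetical.

```python
import re

# Assumption: each record starts with a date/time such as "1/22/2019 20:13".
# A birthday like "6/12/2000" has no time part, so it does not match.
ROW_START = re.compile(r"^\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}$")


def split_rows(page_text):
    """Group the lines of one extracted page into per-arrest lists of cells."""
    rows = []
    for line in page_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ROW_START.match(line):
            rows.append([line])      # a new record begins here
        elif rows:
            rows[-1].append(line)    # a cell belonging to the current record
        # Lines before the first date/time (page headers) are skipped.
    return rows
```

Cells that wrap across several lines will still appear as separate entries, so a further merge step would be needed before every row has exactly the nine expected fields.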

Create Database

The createdb() function creates an SQLite database file named normanpd.db and creates a table in it with the schema below.

CREATE TABLE arrests (
    arrest_time TEXT,
    case_number TEXT,
    arrest_location TEXT,
    offense TEXT,
    arrestee_name TEXT,
    arrestee_birthday TEXT,
    arrestee_address TEXT,
    status TEXT,
    officer TEXT
);

Note that all the columns correspond directly to the columns in the arrest PDFs. The arrestee_address column combines the arrestee address, city, state, and zip code. Notice that some “cells” have information on multiple lines; your code should take care of this.
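A minimal sketch of the database creation step, using the standard sqlite3 module and the schema above. The `path` parameter and the choice to drop an existing table so that each run starts empty are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE arrests (
    arrest_time TEXT,
    case_number TEXT,
    arrest_location TEXT,
    offense TEXT,
    arrestee_name TEXT,
    arrestee_birthday TEXT,
    arrestee_address TEXT,
    status TEXT,
    officer TEXT
);
"""


def createdb(path="normanpd.db"):
    """Create a fresh arrests table at `path` and return the open connection."""
    db = sqlite3.connect(path)
    # Assumption: remove any table left by an earlier run so each run starts empty.
    db.execute("DROP TABLE IF EXISTS arrests")
    db.execute(SCHEMA)
    db.commit()
    return db
```

Passing ":memory:" as the path gives a throwaway database, which can be convenient in tests.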

Insert Data

The populatedb(db, incidents) function takes the rows created by the extractincidents() function and adds them to the normanpd.db database. Again, the signature of this function can be changed as needed.
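The insert step could look roughly like the sketch below. It assumes that `db` is an open sqlite3 connection whose arrests table already exists and that each element of `incidents` is a sequence of nine values in schema order. Returning the row count is an addition for convenience, not a requirement.

```python
import sqlite3


def populatedb(db, incidents):
    """Insert nine-field rows into the arrests table; return the row count."""
    # Placeholders let sqlite3 handle quoting of names and addresses.
    db.executemany(
        "INSERT INTO arrests VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", incidents
    )
    db.commit()
    return db.execute("SELECT COUNT(*) FROM arrests").fetchone()[0]
```

Using placeholders rather than string formatting avoids breaking the statement when a field contains a quote character.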

Status Print

The status() function prints a random row from the database to standard out. Each field of the row should be separated by the thorn character (þ).

1/22/2019 20:13þ2019-00005814þ811 E MAIN STþWARRANT-COUNTYþJIMMY GARRETT DYEþ6/12/2000þ811 E MAIN ST Norman  OK  73071 þFDBDC (Jail) þ1816 - Ross;
...
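A sketch of the status step, assuming `db` is an open sqlite3 connection with a populated arrests table. Selecting the random row with SQLite's RANDOM() function is one option among several, and returning the printed line is an addition that makes the function easier to test.

```python
import sqlite3


def status(db):
    """Print one random row with fields joined by the thorn character."""
    row = db.execute(
        "SELECT * FROM arrests ORDER BY RANDOM() LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # empty table: nothing to print
    line = "þ".join(str(field) for field in row)
    print(line)
    return line
```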

Test

We expect you to create your own test files to test each function. Some tests involve downloading and processing data. To create your own tests, you can download a file and save it locally. This is recommended, particularly because Norman PD removes the arrest files at irregular intervals. Tests should be runnable using pipenv run pytest. You should discuss your tests in your README.

Create a repository for your project on your instance and GitHub

Create a private repository on GitHub called cs5293sp19-project0

Add collaborators cegme and chanukyalakamsani by going to Settings > Collaborators.

Then go to your instance, create a new folder /project if it is not already created, and clone the repository into that folder. For example:

cd /
sudo mkdir /project
sudo chown `whoami`:`whoami` /project
chmod 777 /project
cd project
git clone https://github.com/cegme/cs5293sp19-project0

Submitting your code

When ready to submit, create a tag on your repository using git tag on the latest commit:

git tag v1.0
git push origin v1.0

The version v1.0 lets us know when and what version of code you would like us to grade. If you would like to submit a second version before the 24 hour deadline, use the tag v2.0.

If you need to update a tag, view the commands in the following StackOverflow post.

Deadline

Your submission/release is due on Thursday, February 21st at 11:45pm. Submissions arriving between 11:45:01pm and 11:45pm on the following day will receive a 10% penalty, meaning these submissions will receive only 90% of their score. Any later submissions will not receive credit. You must have your instance running and your code available; otherwise you will not receive credit.

Grading

Grades will be assessed according to the following distribution:

Addendum

2019-02-11

2019-02-13

2019-02-18

2019-02-20

