CS 5293 Spring 21

This is the web page for Text Analytics at the University of Oklahoma.

CS 5293, Spring 2022 Project 0

In this activity, we are going to stretch the superpowers you have learned thus far. Use your knowledge of Python and the Linux command line tools to extract information from a PDF file on the web.

The Norman, Oklahoma police department regularly reports incidents, arrests, and other activities. This data is hosted on the department's website and distributed to the public in the form of PDF files.

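The website contains three types of summaries: arrests, incidents, and case summaries. Your assignment in this project is to build a function that collects only the incidents. To do so, you need to write Python function(s) to do each of the following: download an incident PDF, extract the incident fields, create an SQLite database, insert the incidents into the database, and print a summary of incident natures and their counts.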

Below we describe the assignment structure and each required function. Please read through this whole document before starting!

README.md

The README file name should be all uppercase with the .md extension. It should contain your name, directions on how to install and use the Python package, an example of how to run it, and a list of any web or external libraries that are used. Describe all functions and your approach to developing the database. We know your code will not be perfect, so be sure to list any known bugs and any assumptions you made for your solution. Note that you should not copy code from any website not provided by the instructor.

COLLABORATORS file

This file should contain a pipe-separated list describing who you worked with and a short text description of the nature of the collaboration. If you visited a website for inspiration, include the website. This information should be listed in three fields, as in the example below:

Katherine Johnson | kj@nasa.gov | Helped me understand calculations
Dorothy Vaughan | doro@dod.gov | Helped me with multiplexed time management
Stack Overflow | https://example | Helped me with a compilation bug

The collaborator file is mainly used to ensure that code similarities are coincidental.

Project Descriptions

Your code should be in a private GitHub repository called (cs5293sp22-project0). Your code structure should be in a directory with the following format:

cs5293sp22-project0/
├── COLLABORATORS
├── LICENSE
├── requirements.txt
├── README.md
├── project0
│   ├── __init__.py
│   ├── main.py
│   └── ...
├── docs/
├── setup.cfg
├── setup.py
└── tests
    ├── test_download.py
    ├── test_date_times.py
    └── ...

Feel free to combine or optimize functions as long as your code preserves the behavior of main.py. You may have more or fewer files in your directory as needed. You may have several additional tests and modules in your code.

setup.py / setup.cfg

from setuptools import setup, find_packages

setup(
	name='project0',
	version='1.0',
	author='Your Name',
	author_email='your ou email',
	packages=find_packages(exclude=('tests', 'docs')),
	setup_requires=['pytest-runner'],
	tests_require=['pytest']	
)

Note, the setup.cfg file should have at least the following text inside:

[aliases]
test=pytest

[tool:pytest]
norecursedirs = .*, CVS, _darcs, {arch}, *.egg, venv

main.py

Here is an example main.py file; we expect that yours will be different. Calling the main function should download the data, insert it into a database, and print a summary of the incidents. You may have more or fewer individual steps. Below is an outline; it shows how to use argparse to pass parameters to the code.

# -*- coding: utf-8 -*-
# Example main.py
import argparse

import project0

def main(url):
    # Download data
    incident_data = project0.fetchincidents(url)

    # Extract data
    incidents = project0.extractincidents(incident_data)
	
    # Create new database
    db = project0.createdb()
	
    # Insert data
    project0.populatedb(db, incidents)
	
    # Print incident counts
    project0.status(db)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--incidents", type=str, required=True, 
                         help="Incident summary url.")
     
    args = parser.parse_args()
    if args.incidents:
        main(args.incidents)

Your code should take a URL from the command line and perform each operation. After the code is installed, you should be able to run the code using the command below.

pipenv run python project0/main.py --incidents <url>

Each run of the above command should create a new normanpd.db database file. You can add other command line parameters to test each operation, but the --incidents <url> flag is required.

Below is a discussion of each interface. Note that the function names are suggestions and may be changed to suit your program.

Download Data

The function fetchincidents(url) takes a URL string and uses the Python urllib.request library to grab one incident pdf from the Norman police report webpage.

Below is an example snippet that grabs an incident pdf document from the URL.

import urllib.request

url = ("https://www.normanok.gov/sites/default/files/documents/"
       "2022-02/2022-02-21_daily_incident_summary.pdf")

# Send a browser-like User-Agent with the request
headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"

# Download the pdf as a bytes object
data = urllib.request.urlopen(urllib.request.Request(url, headers=headers)).read()

The downloaded file can be stored in the file system (perhaps in the /tmp directory), and its location can be kept in a local variable, a config file, or any other method of your choosing, as long as the next function can read the data from the incident page. For standard ways of handling temporary files and packaged config files, see the tempfile and importlib.resources modules. Please discuss your choice in your README file.
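
As one possible shape for this step, here is a minimal sketch of fetchincidents(url) that saves the download to a named temporary file and returns its path. The shortened User-Agent string, the .pdf suffix, and the choice to return a path rather than the bytes are assumptions made for illustration, not requirements of the assignment.

```python
import tempfile
import urllib.request

# Assumption: a short browser-like User-Agent is sufficient for the server.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def fetchincidents(url):
    """Download the file at `url` and return the path of a temporary copy."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request) as response:
        data = response.read()

    # delete=False keeps the file on disk after it is closed, so the
    # extraction step can reopen it by name. The caller is responsible
    # for removing the file when it is no longer needed.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as fp:
        fp.write(data)
        return fp.name
```

Returning a path keeps the download and extraction steps loosely coupled. Because urllib.request also accepts file:// URLs, the same function can be pointed at a locally saved PDF in tests.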

Extract Data

The function extractincidents(incident_data) takes data from a pdf file and extracts the incidents. Each incident includes a date_time, incident number, incident location, nature, and incident ori. To extract the data from the pdf files, use the PyPDF2.pdf.PdfFileReader class. It will allow you to extract pages from pdf files and search for the rows. Extract each row and add it to a list.

Here is an example Python snippet that takes a bytes object, data, containing pdf data, writes it to a temporary file, and reads it using the PyPDF2 module. To install the module, use the command pipenv install PyPDF2.

import tempfile

import PyPDF2

# Write the pdf data (the bytes downloaded earlier) to a temp file
fp = tempfile.TemporaryFile()
fp.write(data)

# Set the cursor of the file back to the beginning
fp.seek(0)

# Read the PDF
pdfReader = PyPDF2.pdf.PdfFileReader(fp)
pagecount = pdfReader.getNumPages()

# Get the first page
page1 = pdfReader.getPage(0).extractText()
# ...

# Now get all the other pages
for pagenum in range(1, pagecount):
    p = pdfReader.getPage(pagenum).extractText()
    # ...

This function can return a list of rows so another function can easily insert the data into a database. For this assignment, you will need to consider EACH PAGE of the linked pdf.
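
Turning the extracted page text into rows could look like the sketch below. It rests on assumptions you should verify against the real output: that extractText() puts each cell on its own line, that every row starts with a date/time such as "2/21/2022 0:04", and that any extra lines inside a row belong to the location. The names rows_from_text and normalize are hypothetical helpers, not part of PyPDF2.

```python
import re

# Assumed shape of the first field in every row, e.g. "2/21/2022 0:04".
DATE_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}$")


def normalize(fields):
    """Force one grouped row into exactly five fields (heuristic)."""
    if len(fields) < 5:
        # A cell was empty in the pdf; pad at the end to keep the shape.
        return fields + [""] * (5 - len(fields))
    date_time, number, *middle = fields
    # Assumption: extra lines come from a location wrapped over several lines.
    *location, nature, ori = middle
    return [date_time, number, " ".join(location), nature, ori]


def rows_from_text(page_text):
    """Group the lines of one page into rows, one row per date/time line."""
    rows, current = [], None
    for line in page_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if DATE_RE.match(line):
            # A date/time line closes the previous row and opens a new one.
            if current:
                rows.append(current)
            current = [line]
        elif current is not None:
            current.append(line)
        # Lines before the first date/time (column headers) are skipped.
    if current:
        rows.append(current)
    return [normalize(row) for row in rows]
```

Calling rows_from_text on the text of every page and concatenating the results yields one list for the whole document. Note that footer or title text on the final page would be absorbed into the last row under this heuristic, so that case needs separate handling.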

Create Database

The createdb() function creates an SQLite database file named normanpd.db and inserts a table with the schema below.

CREATE TABLE incidents (
    incident_time TEXT,
    incident_number TEXT,
    incident_location TEXT,
    nature TEXT,
    incident_ori TEXT
);

Note that some “cells” have information on multiple lines; your code should take care of these edge cases.

You will need to access sqlite3 from Python, so be sure to look at the official docs: https://docs.python.org/3.10/library/sqlite3.html
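
A minimal sketch of createdb() using the standard sqlite3 module might look like this. The path parameter and the decision to delete an existing file first (so that each run starts from an empty database) are assumptions for illustration.

```python
import os
import sqlite3

SCHEMA = """
CREATE TABLE incidents (
    incident_time TEXT,
    incident_number TEXT,
    incident_location TEXT,
    nature TEXT,
    incident_ori TEXT
);
"""


def createdb(path="normanpd.db"):
    """Create a fresh database file at `path` and return an open connection."""
    # Remove any file left over from a previous run.
    if os.path.exists(path):
        os.remove(path)
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    db.commit()
    return db
```

Returning the open connection lets the later steps reuse it. Returning the file name and reconnecting in each function would work equally well.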

Insert Data

The function populatedb(db, incidents) takes the rows created by the extractincidents() function and adds them to the normanpd.db database. Again, the signature of this function can be changed as needed.
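
One way to write the insert step, assuming db is an open sqlite3 connection and incidents is a list of five-field rows in schema order, is sketched below. Returning the number of inserted rows is an addition for convenience, not something the assignment asks for.

```python
import sqlite3


def populatedb(db, incidents):
    """Insert five-field rows into the incidents table; return the row count."""
    rows = [tuple(row) for row in incidents]
    # The "?" placeholders let sqlite3 handle quoting, so values such as
    # addresses containing apostrophes are stored verbatim.
    db.executemany("INSERT INTO incidents VALUES (?, ?, ?, ?, ?)", rows)
    db.commit()
    return len(rows)
```

Using executemany with placeholders avoids building SQL strings by hand and sends all rows in a single call.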

Status Print

The status() function prints to standard out a list of the natures of incidents and the number of times each has occurred. The list should be sorted first by the total number of incidents, in descending order as in the example below, and secondarily alphabetically by nature. The two fields of each row should be separated by the pipe character (|).

Abdominal Pains/Problems|100
Cough|20
Sneeze|20
Breathing Problems|19
Noise Complaint|4
...
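
The ordering described above can be delegated to SQL. The sketch below assumes db is an open sqlite3 connection to a database with the incidents table. Returning the printed lines in addition to printing them is an assumption made so the function is easy to test.

```python
import sqlite3


def status(db):
    """Print one 'nature|count' line per nature and return the lines."""
    query = """
        SELECT nature, COUNT(*) AS total
        FROM incidents
        GROUP BY nature
        ORDER BY total DESC, nature ASC
    """
    # Highest counts come first; ties are broken alphabetically by nature.
    lines = [f"{nature}|{total}" for nature, total in db.execute(query)]
    print("\n".join(lines))
    return lines
```

Sorting in the query keeps the Python side to formatting only. The same result could be produced with collections.Counter and sorted() if you prefer to do the work in Python.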

Test

We expect you to create your own test files to test each function. Some tests involve downloading and processing data. To create your own tests, you can download and save test files locally. This is recommended, particularly because Norman PD irregularly removes the posted files. Tests should be runnable using pipenv run python -m pytest. You should test each function at least once and discuss your tests in your README.

Create a repository for your project on your instance and GitHub

Create a private repository on GitHub called cs5293sp22-project0

Add collaborators cegme and jasdehart by going to Settings > Collaborators.

Submitting your code

When ready to submit, create a tag on your repository using git tag on the latest commit:

git tag v1.0
git push origin v1.0

Version v1.0 lets us know when and what version of code you would like us to grade. If you would like to submit a second version before the deadline, use the tag v2.0.

If you need to update a tag, view the commands in the following StackOverflow post.

We will update the submission instructions so keep an eye out!

Deadline

Your submission/release is due on Tuesday, March 8th. The late policy of the course will be followed. We will have more information on submission closer to the submission deadline.

Grading

Grades will be assessed according to the following distribution:

Addendum

