CS 5293 Spring 19

This is the web page for Text Analytics at the University of Oklahoma.

CS 5293, Spring 2019 Project 1

Due 3/8 end of day

The Redactor

Introduction

Whenever sensitive information is shared with the public, the data must go through a redaction process. That is, all names, places, and other sensitive information must be hidden. Documents such as police reports, court transcripts, and hospital records all contain sensitive information. Redacting this information is often expensive and time-consuming.

Task Overview

In this project, you will use your knowledge of Python and Text Analytics to design a system that accepts plain text documents, then detects and redacts “sensitive” items. Below is an example execution of the program.

pipenv run python redactor.py --input '*.txt' \
                    --input 'otherfiles/*.txt' \
                    --names --dates --addresses --phones \
                    --concept 'kids' \
                    --output 'files/' \
                    --stats stderr

Running the program with this command line argument should read all files ending with .txt in the current folder and also all files ending in .txt from the folder called otherfiles/. All these files will be redacted by the program. The program will look to redact all names, dates, addresses, and phone numbers. Notice the flag --concept; this flag asks the system to redact all portions of text that have anything to do with a particular concept. In this case, all paragraphs or sentences that contain information about “kids” should be redacted. It is up to you to determine what represents a concept. All the redacted files should be transformed to new .txt files and written to the location described by the --output flag. The final parameter, --stats, describes the file or location to write the statistics of the redacted files. Below we discuss each of the parameters in additional detail.
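
One way to wire these flags together is with Python's argparse module. The sketch below only suggests a structure; the --genders flag is inferred from the gender note in the Redaction flags section, and none of the function names are required.

# redactor.py -- a minimal argument-parsing sketch (structure is a suggestion, not a requirement)
import argparse

def get_parser():
    parser = argparse.ArgumentParser(description="Detect and redact sensitive items in plain text files.")
    parser.add_argument("--input", action="append", required=True,
                        help="glob of input files; may be repeated")
    parser.add_argument("--names", action="store_true", help="redact names")
    parser.add_argument("--dates", action="store_true", help="redact dates")
    parser.add_argument("--phones", action="store_true", help="redact phone numbers")
    parser.add_argument("--genders", action="store_true", help="redact gender-revealing terms")
    parser.add_argument("--addresses", action="store_true", help="redact addresses")
    parser.add_argument("--concept", action="append", default=[],
                        help="concept word or phrase to redact; may be repeated")
    parser.add_argument("--output", required=True, help="directory for the redacted files")
    parser.add_argument("--stats", default="stderr",
                        help="file name, or 'stderr'/'stdout', for the redaction summary")
    return parser

if __name__ == "__main__":
    args = get_parser().parse_args()
    print(args)  # placeholder: hand args off to your redaction pipeline here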

Parameters

--input

This parameter takes a glob that represents the files that can be accepted. More than one --input flag may be used to specify groups of files. If a file cannot be read or redacted, an appropriate error message should be displayed to the user.
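
A minimal sketch of how the input globs might be expanded and read, assuming UTF-8 text files (the gather_files name is a placeholder, not a requirement):

import glob
import sys

def gather_files(patterns):
    """Expand each --input glob and return (path, text) pairs for readable files."""
    files = []
    for pattern in patterns:
        for path in glob.glob(pattern):
            try:
                with open(path, encoding="utf-8") as f:
                    files.append((path, f.read()))
            except (OSError, UnicodeDecodeError) as e:
                # Report unreadable files to the user instead of crashing.
                print(f"Skipping {path}: {e}", file=sys.stderr)
    return files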

--output

This flag should specify a directory to store all the redacted files. The redacted files, regardless of their input type, should be written as .txt files. Each file should have the same name as the original file with the extension .redacted appended to the file name.
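
One possible way to build the output paths, assuming the original base name plus a .redacted suffix (the helper name is a placeholder):

import os

def write_redacted(output_dir, input_path, redacted_text):
    """Write redacted text to <output_dir>/<original file name>.redacted."""
    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, os.path.basename(input_path) + ".redacted")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(redacted_text)
    return out_path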

Redaction flags

The redaction flags list the entity types that should be redacted from all the input documents. The flags you are required to implement are:

--names
--dates
--phones
--genders
--addresses

You are free to add your own! Note: gender should be any term that reveals the gender of a person (e.g., him, her, etc.). The definitions of the remaining terms should be straightforward. In your README.md discussion file, clearly give the parameters you apply to each of the flags; be clear about what constitutes an address and a phone number. The redacted characters in the document should be replaced with a character of your choice. A popular choice is the Unicode full block character (U+2588). You should redact only the words, not the whitespace. If you believe that one should also redact the whitespace between words (e.g., in a first and last name), please discuss why in your README.md.
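
As an illustration of one possible approach (not the required one), names could be located with nltk's named-entity chunker and overwritten with full block characters of equal length. Here redact_names is a hypothetical helper, and the nltk data downloads it relies on are shown later on this page:

import nltk

BLOCK = "\u2588"  # Unicode full block character

def redact_names(text):
    """Replace tokens inside PERSON chunks with block characters of equal length."""
    redacted = text
    for sentence in nltk.sent_tokenize(text):
        tokens = nltk.word_tokenize(sentence)
        tree = nltk.ne_chunk(nltk.pos_tag(tokens))
        for subtree in tree.subtrees(lambda t: t.label() == "PERSON"):
            for word, _tag in subtree.leaves():
                # Simplification: str.replace hits every occurrence of the token;
                # a real implementation should track character offsets instead.
                redacted = redacted.replace(word, BLOCK * len(word))
    return redacted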

--concept

This flag, which can be repeated one or more times, takes one word or phrase that represents a concept. A concept is an idea or theme. Any section of the input files that refers to this concept should be redacted. For example, if the given concept word is prison, then any sentence (or paragraph) containing either the word itself or a related concept, such as jail or incarcerated, should be redacted in its entirety. In your README.md file, make your definition of a concept clear. Also, clearly state how you create the context of a concept and justify your method. You may be creative here!
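
One possible (not prescribed) way to build the context of a concept is to expand the concept word with its WordNet synonyms and redact any sentence that mentions one of them. A minimal sketch, noting that synonyms alone may be too narrow (hyponyms or word vectors could widen the context):

import nltk
from nltk.corpus import wordnet

BLOCK = "\u2588"

def concept_terms(concept):
    """Collect the concept word plus its WordNet synonym lemmas."""
    terms = {concept.lower()}
    for synset in wordnet.synsets(concept):
        for lemma in synset.lemma_names():
            terms.add(lemma.lower().replace("_", " "))
    return terms

def redact_concept(text, concept):
    """Replace every sentence mentioning the concept (or a synonym) with block characters."""
    terms = concept_terms(concept)
    out = []
    for sentence in nltk.sent_tokenize(text):
        if any(term in sentence.lower() for term in terms):
            out.append(BLOCK * len(sentence))
        else:
            out.append(sentence)
    # Note: joining on spaces discards the original line breaks; adjust as needed.
    return " ".join(out)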

--stats

Stats takes either the name of a file or a special file (stderr, stdout), and writes a summary of the redaction process there. Some statistics to include are the types and counts of redacted terms and the statistics for each redacted file. Be sure to describe the format of your stats output in your README file. Stats should help you while developing your code.
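
A sketch of how the --stats target could be resolved; the per-file summary format shown here is only an example, and you should document whatever format you choose:

import sys

def write_stats(stats, destination):
    """Write a redaction summary to stderr, stdout, or a named file.

    stats is assumed to map file names to {entity_type: count} dictionaries.
    """
    if destination == "stderr":
        out = sys.stderr
    elif destination == "stdout":
        out = sys.stdout
    else:
        out = open(destination, "w", encoding="utf-8")
    try:
        for filename, counts in stats.items():
            out.write(f"{filename}\n")
            for entity_type, count in counts.items():
                out.write(f"    {entity_type}: {count}\n")
    finally:
        if out not in (sys.stderr, sys.stdout):
            out.close()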

Submission

README.md

The README file name should be uppercase with an .md extension. You should write your name in it, an example of how to run the program, and a list of any web or external resources that you used for help. The README file should also contain a list of any bugs or assumptions made while writing the program. Note that you should not copy code from any website not provided by the instructor. You should include directions on how to install and use the code. You should describe all functions and your approach to developing the redaction system. You should describe any known bugs and cite any sources or people you used for help. Be sure to include any assumptions you make for your solution.

COLLABORATORS file

This file should contain a comma-separated list describing who you worked with and a small text description describing the nature of the collaboration. This information should be listed in three fields, as in the example below:

Katherine Johnson, kj@nasa.gov, Helped me understand calculations
Dorothy Vaughan, doro@dod.gov, Helped me with multiplexed time management

Project Descriptions

Your code should be organized in a directory similar to the following format:

cs5293p19-project1/
├── COLLABORATORS
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── README.md
├── project1
│   ├── __init__.py
│   └── main.py
│   └── ... 
├── docs/
├── setup.cfg
├── setup.py
└── tests/
    ├── test_names.py
    └── test_genders.py
    └── ... 

setup.py

from setuptools import setup, find_packages

setup(
	name='project1',
	version='1.0',
	author='Your Name',
	author_email='your OU email',
	packages=find_packages(exclude=('tests', 'docs')),
	setup_requires=['pytest-runner'],
	tests_require=['pytest']	
)

Note, the setup.cfg file should have at least the following text inside:

[aliases]
test=pytest

[tool:pytest]
norecursedirs = .*, CVS, _darcs, {arch}, *.egg, venv

Tests

The general rule is that you should aim to have a test for each feature. Including tests helps people understand how your code works, in addition to verifying assumptions during development. Tests should be runnable by using pipenv run python -m pytest. You should discuss your tests in your README.
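
For example, a test for the name redactor might look like the sketch below; it assumes a redact_names function in project1/main.py (both names are placeholders, so adjust to your own module and function names):

# tests/test_names.py -- example test; module and function names are placeholders
from project1 import main

def test_names_are_redacted():
    text = "John Smith went to the store."
    redacted = main.redact_names(text)
    assert "John" not in redacted
    assert "Smith" not in redacted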

It is expected that you will use nltk to complete this assignment. However, you are welcome to use other popular APIs. Some of these APIs may require a larger instance; please let us know if you plan to do this. Also, some APIs require specialized keys; please let the TA know how you plan to use your keys. Below is the information for using Google tools.

Creating API Keys https://cloud.google.com/docs/authentication/api-keys

Google NLP https://googlecloudplatform.github.io/google-cloud-python/latest/language/usage.html#annotate-text
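
If you use nltk, note that its tokenizers, tagger, chunker, and WordNet interface rely on data packages that must be downloaded once. A minimal setup sketch (the resource names assume the standard nltk data distribution):

import nltk

# One-time downloads for sentence/word tokenizing, POS tagging,
# named-entity chunking, and WordNet lookups.
for resource in ("punkt", "averaged_perceptron_tagger",
                 "maxent_ne_chunker", "words", "wordnet"):
    nltk.download(resource)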

Create a repository for your project on your instance and GitHub

Create a private repository on GitHub called cs5293sp19-project1

Add collaborators cegme and chanukyalakamsani by going to Settings > Collaborators.

Then go to your instance, create a new folder /projects if it is not already created, and clone the repository into that folder. For example:

cd /
sudo mkdir /projects
sudo chown `whoami`:`whoami` /projects
chmod 777 /projects
cd projects
git clone https://github.com/<your-username>/cs5293sp19-project1

You should regularly git add, git commit -m, and git push origin master your code changes to GitHub.

Submitting your code

When ready to submit, create a tag on your repository using git tag on the latest commit:

git tag v1.0
git push origin v1.0

The version v1.0 lets us know when and what version of code you would like us to grade. If you need to submit an updated version, you can use the tag v1.1. If you would like to submit a second version before the 24-hour deadline, use the tag v2.0.

If you need to update a tag, view the commands in the following StackOverflow post.

Deadline

Your submission/release is due on Friday, March 8th at 11:45pm. Submissions arriving between 11:45:01pm and 11:45pm on the following day will receive a 10% penalty, meaning these scores will only receive 90% of their value. Any later submissions will not receive credit. You must have your instance running and your code available; otherwise you will not receive credit.

Grading

Grades will be assessed according to the following distribution:

Addenda

