CS 5293 Spring 19

This is the web page for Text Analytics at the University of Oklahoma.

View the Project on GitHub oudalab/cs5293sp19

Project 2 (Due April 30th)

The Unredactor

Introduction

Whenever sensitive information is shared with the public, the data must go through a redaction process. That is, all names, places, and other sensitive information must be hidden. Documents such as police reports, court transcripts, and hospital records all contain sensitive information. Redacting this information is often expensive and time consuming.

As part of phase 2, you will be creating the Unredactor. The unredactor will take a redacted document and the redaction flag as input; in return, it will give the most likely candidates to fill in each redacted location. The unredactor only needs to unredact names. To predict the name, you have to solve the Entity Resolution problem.

As you can guess, discovering names is not easy. To discover the best names, we will have to train a model to help us predict missing words. For this assignment, we will use the Large Movie Review Data Set. Please use this link to download the data set. This is a data set of movie reviews from IMDB. The initial goal of the data set is to discover the sentiment of each review. For this project, we will only use the reviews for their textual content. Below is some sample code to extract and print names from the movie reviews.

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import glob
import io
import sys

import nltk
from nltk import sent_tokenize
from nltk import word_tokenize
from nltk import pos_tag
from nltk import ne_chunk


def get_entity(text):
    """Prints the entity inside of the text."""
    for sent in sent_tokenize(text):
        for chunk in ne_chunk(pos_tag(word_tokenize(sent))):
            if hasattr(chunk, 'label') and chunk.label() == 'PERSON':
                print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))


def doextraction(glob_text):
    """Get all the files from the given glob and pass them to the extractor."""
    for thefile in glob.glob(glob_text):
        with io.open(thefile, 'r', encoding='utf-8') as fyl:
            text = fyl.read()
            get_entity(text)


if __name__ == '__main__':
    # Usage: python3 entity-extractor.py 'train/pos/*.txt'
    doextraction(sys.argv[-1])

Note that in the code above we restrict the entity type to PERSON. Given a new document, such as the one below, that contains at least one redacted name, create Python code to help you predict the most likely unredacted name. If you are curious, the redacted name is Ashton Kutcher.

'''This movie was sadly under-promoted but proved to be truly exceptional.
Entering the theatre I knew nothing about the film except that a friend wanted to see it.

I was caught off guard with the high quality of the film.
I couldn't image ██████ ███████ in a serious role, but his performance truly 
exemplified his character.
This movie is exceptional and deserves our monetary support, unlike so many other movies.
It does not come lightly for me to recommend any movie, 
but in this case I highly recommend that everyone see it.

This films is Truly Exceptional!'''
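A first step toward unredacting a review like the one above is locating each redacted span and measuring it. The following is a minimal sketch, assuming the redaction flag is the full block character (U+2588) shown in the sample and that the words of one name are separated by single spaces; the names `REDACTION_CHAR` and `find_redactions` are hypothetical and not part of any provided code.

```python
import re

# Assumption: the redaction flag is the full block character from the sample.
REDACTION_CHAR = "\u2588"

# One redacted name: runs of block characters separated by single spaces.
REDACTION_PATTERN = re.compile("{0}+(?: {0}+)*".format(REDACTION_CHAR))


def find_redactions(text):
    """Return (start, end, word_lengths) for every redacted span in text."""
    spans = []
    for match in REDACTION_PATTERN.finditer(text):
        word_lengths = tuple(len(word) for word in match.group().split(" "))
        spans.append((match.start(), match.end(), word_lengths))
    return spans


# The redacted line from the sample review, rebuilt with explicit lengths.
sample = "I couldn't image {} {} in a serious role".format(
    REDACTION_CHAR * 6, REDACTION_CHAR * 7
)
print(find_redactions(sample))  # [(17, 31, (6, 7))]
```

The word lengths (6, 7) match "Ashton Kutcher", which is why the shape of a redaction is a useful feature on its own.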

To discover the redacted item, you need to create a function that takes a file with redacted items in it and outputs the most likely name for each redaction. Below is pseudocode for a naive solution:

# Training step:
for file in list_of_files:
    collect all entities in the file
    for each entity, increment the features associated with that entity

# Prediction step:
for file in list_of_files:
    extract all candidate entities
    extract features associated with each candidate entity
    find the top-k matching entities
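One possible reading of the pseudocode above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the single feature is the "shape" of a name (the length of each word), the entities per file are assumed to be already extracted, and the training names below are made up for the example. The functions `name_shape`, `train`, and `predict` are hypothetical names.

```python
from collections import Counter, defaultdict


def name_shape(name):
    """Feature for a name: word lengths, e.g. 'Ashton Kutcher' -> (6, 7)."""
    return tuple(len(word) for word in name.split())


def train(names_per_file):
    """Training step: count each name, grouped by its shape."""
    counts_by_shape = defaultdict(Counter)
    for names in names_per_file:
        for name in names:
            counts_by_shape[name_shape(name)][name] += 1
    return counts_by_shape


def predict(counts_by_shape, redacted_shape, k=3):
    """Prediction step: up to k names with the same shape, most frequent first."""
    candidates = counts_by_shape.get(redacted_shape, Counter())
    return [name for name, _ in candidates.most_common(k)]


# Made-up entity lists, standing in for the names found in three files.
training_names = [
    ["Ashton Kutcher", "Kevin Costner"],
    ["George Clooney", "Ashton Kutcher"],
    ["Robert De Niro"],
]
model = train(training_names)
print(predict(model, (6, 7), k=2))  # ['Ashton Kutcher', 'George Clooney']
```

Shape alone is rarely enough to pick a single name, so in practice it would be combined with context features from the surrounding text.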

There are about 90K files available in the IMDB data set, split into test and train folders. Use the train folders to create features and evaluate your techniques. It is up to you to generate a training set of redacted files. One of the most important aspects of this project is creating the appropriate set of features. Features may include n-grams in the document, the number of letters in the redacted word, the number of spaces in the redacted word, etc. Creative development of the feature vector is important for entity extraction.
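As a rough illustration of the two tasks described above, the sketch below redacts a known name to produce a training example and then builds a feature dictionary for the redacted span. It assumes the block character U+2588 as the redaction flag; `redact_name`, `span_features`, and the feature names are hypothetical choices, not a required design.

```python
import re

BLOCK = "\u2588"  # assumed redaction flag


def redact_name(text, name):
    """Replace every non-space character of `name` in `text` with a block."""
    masked = re.sub(r"\S", BLOCK, name)
    return text.replace(name, masked)


def span_features(text, start, end, window=2):
    """Build a feature dict for the redacted span text[start:end]."""
    span = text[start:end]
    before = text[:start].split()[-window:]
    after = text[end:].split()[:window]
    features = {
        "n_letters": span.count(BLOCK),  # letters hidden by the redaction
        "n_spaces": span.count(" "),     # spaces inside the redacted name
    }
    # Context words, numbered outward from the redaction.
    for i, word in enumerate(reversed(before), start=1):
        features["prev_{}".format(i)] = word.lower()
    for i, word in enumerate(after, start=1):
        features["next_{}".format(i)] = word.lower()
    return features


original = "I couldn't image Ashton Kutcher in a serious role"
redacted = redact_name(original, "Ashton Kutcher")
start = redacted.index(BLOCK)
end = redacted.rindex(BLOCK) + 1
features = span_features(redacted, start, end)
print(features)
```

Because the label ("Ashton Kutcher") is known at redaction time, each redacted training file yields (features, name) pairs for free.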

The Google Knowledge Graph Search API may also prove useful. Using this, you can find a list of all important entities (Entity Search). Given a set of candidate matches, you can use the information in the Google Knowledge Graph to produce a ranked set of people.
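The ranking idea can be sketched as follows. The code builds a request URL for the Knowledge Graph Search API's `entities:search` endpoint and ranks the people in a response by `resultScore`, keeping only names whose word lengths match the redaction. The helper names are hypothetical, the API key is a placeholder you must supply, and the response used in the example is made up to show the expected structure; no network request is made here.

```python
from urllib.parse import urlencode

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_search_url(query, api_key, limit=10):
    """Build an Entity Search URL restricted to the Person type."""
    params = {"query": query, "types": "Person", "limit": limit, "key": api_key}
    return KG_ENDPOINT + "?" + urlencode(params)


def rank_people(response, word_lengths):
    """Return matching names from a parsed JSON response, best score first."""
    ranked = []
    for item in response.get("itemListElement", []):
        name = item.get("result", {}).get("name", "")
        if tuple(len(word) for word in name.split()) == word_lengths:
            ranked.append((item.get("resultScore", 0.0), name))
    return [name for _, name in sorted(ranked, reverse=True)]


# To fetch real results (requires a valid key and network access):
#   import json, urllib.request
#   response = json.load(urllib.request.urlopen(build_search_url("actor", KEY)))

# Made-up response with made-up scores, for illustration only.
fake_response = {
    "itemListElement": [
        {"result": {"name": "George Clooney"}, "resultScore": 10.0},
        {"result": {"name": "Ashton Kutcher"}, "resultScore": 25.0},
        {"result": {"name": "Tom Hanks"}, "resultScore": 99.0},
    ]
}
print(rank_people(fake_response, (6, 7)))  # ['Ashton Kutcher', 'George Clooney']
```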

In this project, we would like you to create a function to execute the unredactor. It is your task to create a reproducible method of training, testing, and validating your code. Use the README to explain all assumptions. Be sure to give examples of usage. Give clear directions. Add tests for each part of your code so we can better evaluate your code.

Note: there are several techniques for creating the unredactor. You can use a simple rule-based approach, you can create a classifier using the DictVectorizer in sklearn, or you may use a different package to support your work. We are looking for well-thought-out and reasoned approaches. Use the tests and your README file to evaluate your code. Unredacted texts should go in the output folder location.
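For the classifier option, a minimal sketch with sklearn's DictVectorizer might look like the following. The choice of MultinomialNB is an assumption made for brevity (any classifier with `predict_proba` would fit the same pattern), and the training features and names are made up; a real model would be trained on features extracted from the redacted training files.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training examples: one feature dict per redacted name.
train_features = [
    {"n_letters": 13, "n_spaces": 1, "prev_1": "image"},
    {"n_letters": 13, "n_spaces": 1, "prev_1": "starring"},
    {"n_letters": 13, "n_spaces": 1, "prev_1": "with"},
    {"n_letters": 8, "n_spaces": 1, "prev_1": "starring"},
]
train_names = ["Ashton Kutcher", "Ashton Kutcher", "George Clooney", "Tom Hanks"]

# DictVectorizer one-hot encodes string features and keeps numbers as-is.
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(train_features)
classifier = MultinomialNB().fit(X_train, train_names)


def top_k_names(features, k=2):
    """Return the k most probable (name, probability) pairs for one redaction."""
    probabilities = classifier.predict_proba(vectorizer.transform([features]))[0]
    ranked = sorted(zip(probabilities, classifier.classes_), reverse=True)
    return [(str(name), float(prob)) for prob, name in ranked[:k]]


print(top_k_names({"n_letters": 13, "n_spaces": 1, "prev_1": "image"}))
```

Feature values never seen during training (for example, a new context word) are silently ignored by `vectorizer.transform`, so the model still returns a ranking for unfamiliar contexts.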

Submission

For this project, supply one README document and one code package, compressed together as a .tgz file. Ensure that your code can be downloaded, extracted, and re-executed on your instance. We will not ask you to keep your code hosted on your instance.

Package your code using the setup.py and directory structure discussed earlier in the course. You should additionally add tests to this project submission.

from setuptools import setup, find_packages

setup(
    name='unredactor',
    version='1.0',
    author='Your Name',
    author_email='your ou email',
    packages=find_packages(exclude=('tests', 'docs')),
    setup_requires=['pytest-runner'],
    tests_require=['pytest']
)

When ready to submit, create a tag on your repository using git tag on the latest commit:

git tag v1.0
git push origin v1.0

The version v1.0 lets us know when and what version of code you would like us to grade. If you need to submit an updated version, you can use the tag v1.1. If you would like to submit a second version before the 24-hour deadline, use the tag v2.0.

If you need to update a tag, view the commands in the following StackOverflow post.

In the tests folder, add a set of files that test the different features of your code. Tests do not have to be too creative, but they should show that your code works as expected. There are several testing frameworks for Python; for this project, use the pytest framework. For questions, use the message board and see the pytest documentation for more examples: http://doc.pytest.org/en/latest/assert.html . This tutorial gives the best discussion of how to write tests: https://semaphoreci.com/community/tutorials/testing-python-applications-with-pytest .

redactor/
    redactor/
        __init__.py
        redactor.py
        unredactor.py
    tests/
        test_download.py
        test_flagrecognition.py
        test_stats.py
        test_train.py
        test_unredact.py
        ...
    docs/
    Pipfile
    Pipfile.lock
    README
    requirements.txt
    setup.cfg
    setup.py
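As an illustration of what one of the test files in this layout could contain, here is a sketch of tests/test_flagrecognition.py. To keep the example self-contained, the function under test is defined inline as a stand-in; in a real submission it would be imported from your package. The function name and the assumption that the redaction flag is the block character U+2588 are hypothetical.

```python
# tests/test_flagrecognition.py (sketch)
import re

BLOCK = "\u2588"  # assumed redaction flag


def count_redacted_words(text):
    """Stand-in for a function that would normally live in the package."""
    return len(re.findall(BLOCK + "+", text))


def test_counts_each_redacted_word():
    text = "I saw {} {} on stage.".format(BLOCK * 6, BLOCK * 7)
    assert count_redacted_words(text) == 2


def test_unredacted_text_has_no_flags():
    assert count_redacted_words("No names are hidden here.") == 0
```

pytest discovers functions whose names start with `test_` and reports each failing `assert` with the values involved, so plain assertions like these are enough.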

Note, the setup.cfg file should have at least the following text inside:

[aliases]
test=pytest

[tool:pytest]
norecursedirs = .*, CVS, _darcs, {arch}, *.egg, venv

Typing pipenv run python setup.py test should execute your tests using the pytest-runner.

If you use external APIs that require keys or special permission, please include instructions on how we can also obtain access.

The README file should be a full discussion of your approach, what works, and what doesn’t work. Please remove any extra environment files (e.g. env/venv) before submission. They are not needed and can make your .tgz file too large to grade. In-class students: upload a .tgz to Canvas, and commit your code to a private GitHub repository.

Note:

Grading: Grading is done in person during TA hours for both online and offline (in-class) students. Online students can also schedule a Google Hangouts session on Saturday or Sunday. In-class students can meet me during TA hours or after class.

Submission: File upload (.tar.gz format) is mandatory for both online and offline students.

Evaluations should be done within one week of the due date.

Extra links and Notes

Note: to use some APIs, a larger instance may be needed. If you decide to use a larger instance, please let us know and be sure to add this to your README.

Creating API Keys https://cloud.google.com/docs/authentication/api-keys

Google NLP https://googlecloudplatform.github.io/google-cloud-python/latest/language/usage.html#annotate-text