This is the web page for Text Analytics at the University of Oklahoma.
Whenever sensitive information is shared with the public, the data must go through a redaction process. That is, all sensitive names, places, and other sensitive information must be hidden. Documents such as police reports, court transcripts, and hospital records all contain sensitive information. Redacting this information is often expensive and time consuming.
For project 2, you will be creating an Unredactor. The unredactor will take redacted documents (that you generate) and the redaction flag as input and, in return, give the most likely candidates to fill in each redacted location. The unredactor only needs to unredact names of people; however, if there is another category that you would prefer to unredact, please discuss it with the instructor. To predict a name, you have to solve the Entity Resolution problem.
As you can guess, discovering names is not easy. To discover the best names, we have to train a model to help us predict missing words. For this assignment, you are expected to use the Large Movie Review Data Set. Please use this link to download the data set. This is a data set of movie reviews from IMDB. The initial goal of the data set is to discover the sentiment of each review; for this project, we will only use the reviews for their textual content. Below is some sample code that uses the default NLTK model to extract and print names from the movie reviews.
#!/usr/bin/python3
# -*- coding: utf-8 -*-
# grabthenames.py
# grabs the names from the movie review data set
import glob
import io
import os
import pdb
import sys
import nltk
from nltk import sent_tokenize
from nltk import word_tokenize
from nltk import pos_tag
from nltk import ne_chunk
def get_entity(text):
    """Prints the PERSON entities inside of the text."""
    for sent in sent_tokenize(text):
        for chunk in ne_chunk(pos_tag(word_tokenize(sent))):
            if hasattr(chunk, 'label') and chunk.label() == 'PERSON':
                print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))

def doextraction(glob_text):
    """Get all the files from the given glob and pass them to the extractor."""
    for thefile in glob.glob(glob_text):
        with io.open(thefile, 'r', encoding='utf-8') as fyl:
            text = fyl.read()
            get_entity(text)

if __name__ == '__main__':
    # Usage: python3 grabthenames.py 'train/pos/*.txt'
    doextraction(sys.argv[-1])
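Note that sent_tokenize, pos_tag, and ne_chunk rely on NLTK's downloadable data packages. If you have not used NLTK before, a one-time setup like the following (these are the standard NLTK resources those functions use) should be run first:
import nltk

nltk.download('punkt')                        # sentence/word tokenizer models
nltk.download('averaged_perceptron_tagger')   # part-of-speech tagger
nltk.download('maxent_ne_chunker')            # named-entity chunker
nltk.download('words')                        # word list used by the chunker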
Note that in the code above we restrict the entity type to PERSON. Given a new document, such as the one below, that contains at least one redacted name, create Python code to help you predict the most likely unredacted name. If you are curious, the redacted name is Ashton Kutcher.
'''This movie was sadly under-promoted but proved to be truly exceptional.
Entering the theatre I knew nothing about the film except that a friend wanted to see it.
I was caught off guard with the high quality of the film.
I couldn't image ██████ ███████ in a serious role, but his performance truly
exemplified his character.
This movie is exceptional and deserves our monetary support, unlike so many other movies.
It does not come lightly for me to recommend any movie,
but in this case I highly recommend that everyone see it.
This films is Truly Exceptional!'''
In order to discover the redacted name, you need to create a function that takes a file with redacted items in it and outputs the most likely name for each redaction. Below is pseudocode for a naive solution:
# Training step:
for file in list_of_files:
    collect all entities in the file
    for each entity, increment the features associated with that entity

# Prediction phase:
for file in list_of_files:
    extract all candidate entities
    extract features associated with each candidate entity
    find the top-k matching entities
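To make the idea concrete, here is a minimal runnable sketch of that naive solution. It assumes the redaction character is '█' and uses only two illustrative features, the span length and the number of spaces; a real feature set should be much richer.
from collections import defaultdict

def train(names):
    """Count how often each name appears under each feature tuple."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        features = (len(name), name.count(' '))
        counts[features][name] += 1
    return counts

def predict(redacted_span, counts, k=3):
    """Return the top-k names whose features match the redacted span."""
    features = (len(redacted_span), redacted_span.count(' '))
    candidates = counts.get(features, {})
    return sorted(candidates, key=candidates.get, reverse=True)[:k]

model = train(['Ashton Kutcher', 'Tom Hanks', 'Ashton Kutcher'])
print(predict('██████ ███████', model))  # ['Ashton Kutcher']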
There are about 90K files available in the IMDB data set, split into test and train folders. Use the train folders to create features and evaluate your techniques. It is up to you to generate a training set of redacted files; you could use your previous code or develop a new method. One of the most important aspects of this project is creating an appropriate set of features. Features may include n-grams in the document, the number of letters in the redacted word, the number of spaces in the redacted word, the sentiment score of the review, and so on. Creative development of the feature vector is important for entity resolution.
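As one illustration (the feature names here are assumptions for the example, not requirements), a feature extractor might turn each redacted span into a dictionary:
def make_features(sentence, span):
    """Build a feature dict for one redacted span ('█' marks redacted text)."""
    before, _, after = sentence.partition(span)
    return {
        'length': len(span),        # characters in the redacted span
        'spaces': span.count(' '),  # spaces in the redacted span
        'prev_word': before.split()[-1] if before.split() else '',
        'next_word': after.split()[0] if after.split() else '',
    }

print(make_features("I couldn't imagine ██████ ███████ in a serious role.",
                    '██████ ███████'))
# {'length': 14, 'spaces': 1, 'prev_word': 'imagine', 'next_word': 'in'}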
Many students will use spaCy for this assignment, but you may use other Python libraries with approval from the instructor. scikit-learn has a DictVectorizer that may be very useful. The Google Knowledge Graph Search API may also prove useful. Using it, you can find a list of all important entities (Entity Search). Given a set of candidate matches, you can use the information in the Google Knowledge Graph to give you a ranked set of people.
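For example, feature dictionaries like the one above can be fed to scikit-learn's DictVectorizer and then to any classifier; the choice of MultinomialNB below is just an assumption for the sketch:
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

features = [
    {'length': 14, 'spaces': 1, 'prev_word': 'imagine'},
    {'length': 9, 'spaces': 1, 'prev_word': 'starring'},
]
labels = ['Ashton Kutcher', 'Tom Hanks']

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(features)  # sparse feature matrix
clf = MultinomialNB().fit(X, labels)

test = vectorizer.transform({'length': 14, 'spaces': 1, 'prev_word': 'imagine'})
print(clf.predict(test))  # ['Ashton Kutcher']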
In this project, we would like you to create a function to execute the unredaction process. It is your task to create a reproducible method of training, testing, and validating your code. Use the README to explain all assumptions. Be sure to give examples of usage. Give clear directions. Add tests for each part of your code so we can better evaluate it.
We are looking for well-thought-out and reasoned approaches; use the tests and your README file to evaluate your code. There are several techniques for creating the unredactor: you can use a rule-based approach, you can create a classifier using the DictVectorizer in sklearn, you could use word2vec-style prediction, or you may use a different package to support your work. Unredacted texts should go in the output folder.
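As a hedged illustration of the word2vec-style option, a library such as gensim (an assumption; any embedding package would do) can be trained on tokenized review sentences and then asked which vocabulary words best fit the context around a redaction:
from gensim.models import Word2Vec

# Toy corpus; in practice, use tokenized sentences from the IMDB reviews.
sentences = [
    ['i', 'could', 'not', 'imagine', 'ashton', 'kutcher', 'in', 'a', 'serious', 'role'],
    ['ashton', 'kutcher', 'stars', 'in', 'this', 'film'],
]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1)

# Rank vocabulary words by how well they fit the context around a redaction.
print(model.predict_output_word(['imagine', 'in', 'serious', 'role'], topn=3))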
For this project, supply one README document, code package, and demonstration video of you walking through all the features. Package your code using the setup.py and directory structure discussed earlier in the course. You should add tests to this project submission.
from setuptools import setup, find_packages

setup(
    name='unredactor',
    version='1.0',
    author='Your Name',
    author_email='your OU email',
    packages=find_packages(exclude=('tests', 'docs')),
    setup_requires=['pytest-runner'],
    tests_require=['pytest'],
)
When ready to submit, create a tag on your repository using git tag on the latest commit:
git tag v1.0
git push origin v1.0
The tag v1.0 lets us know when and what version of the code you would like us to grade. If you need to submit an updated version, you can use the tag v1.1. If you would like to submit a second version before the 24-hour deadline, use the tag v2.0.
If you need to update a tag, view the commands in the following StackOverflow post.
In the tests folder, add a set of files that test the different features of your code. Tests do not have to be too creative, but they should show that your code works as expected. There are several testing frameworks for Python; for this project, use the py.test framework. For questions, use the message board and see the pytest documentation for more examples: http://doc.pytest.org/en/latest/assert.html. This tutorial gives a thorough discussion of how to write tests: https://semaphoreci.com/community/tutorials/testing-python-applications-with-pytest.
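As a shape for such tests, a minimal pytest file might look like the following; the helper under test is hypothetical and stands in for your own functions:
# tests/test_stats.py (illustrative)
def span_length(span):
    """Length of a redacted span, a feature an unredactor might use."""
    return len(span)

def test_span_length_counts_block_characters():
    assert span_length('██████ ███████') == 14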
redactor/
    redactor/
        __init__.py
        redactor.py
        unredactor.py
    tests/
        test_download.py
        test_flagrecognition.py
        test_stats.py
        test_train.py
        test_unredact.py
        ...
    docs/
    Pipfile
    Pipfile.lock
    README
    requirements.txt
    setup.cfg
    setup.py
Note: the setup.cfg file should have at least the following text inside:
[aliases]
test=pytest
[tool:pytest]
norecursedirs = .*, CVS, _darcs, {arch}, *.egg, venv
Typing pipenv run python setup.py test should execute your tests using the pytest-runner.
If you use external APIs that require keys or special permission please include instructions on how we can also obtain access.
The README file should be a full discussion of your approach, what works, and what doesn't work. We ask that you upload your README, code package, and demonstration video.
To use some APIs, a larger instance may be needed. If you decide to use a larger instance, please let us know and be sure to note this in your README.
Creating API Keys: https://cloud.google.com/docs/authentication/api-keys
Google NLP: https://googlecloudplatform.github.io/google-cloud-python/latest/language/usage.html#annotate-text