CS 5293 Spring 22

This is the web page for Text Analytics at the University of Oklahoma.

View the Project on GitHub oudalab/cs5293sp22

Project 3

The Unredactor

Introduction

Whenever sensitive information is shared with the public, the data must go through a redaction process. That is, all sensitive names, places, and other sensitive information must be hidden. Documents such as police reports, court transcripts, and hospital records all contain sensitive information. Redacting this information is often expensive and time-consuming.

For Project 3, you will be creating an Unredactor. The unredactor will take redacted documents and return the most likely candidates to fill in the redacted location. The unredactor only needs to unredact people's names. However, if there is another category that you would prefer to unredact, please discuss this with the instructor.

As you can guess, discovering names is not easy. To discover the best names, we have to train a model to help us predict missing words. For this assignment, you are expected to use the Large Movie Review Data Set. Please use this link to download the data set. This is a data set of movie reviews from IMDB. The original goal of the data set is to discover the sentiment of each review. For this project, we will only use the reviews for their textual content.

Unredactor Training Set

The class will collaboratively create a training, test, and validation unredaction dataset. Your full set of redacted data examples must be submitted before the start of finals. We encourage you to submit it earlier.

Pull Request

You must add your 50 training sentences, 30 validation sentences, and 10 test sentences to the unredactor.tsv file. To add your data to the file, you must submit a pull request. The unredactor.tsv is a tab-separated file. The first column is the GitHub username of the person who added the row. The second column specifies whether the example is in training, testing, or validation. The third column is the name of the entity that was redacted. The last column is the redaction context.
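As a rough sketch of how the four columns described above could be read, the following uses Python's standard csv module. The field names (username, split, name, context), the helper read_examples, and the sample row are assumptions made for illustration; they are not part of the assignment.

```python
import csv
import io
from typing import NamedTuple


class Example(NamedTuple):
    """One row of unredactor.tsv (field names are illustrative)."""
    username: str  # GitHub username of the contributor
    split: str     # training, testing, or validation
    name: str      # the entity that was redacted
    context: str   # text containing the redaction block


def read_examples(handle):
    """Parse tab-separated rows, skipping any row without exactly four columns."""
    # QUOTE_NONE keeps apostrophes and quotation marks in reviews as plain text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [Example(*row) for row in reader if len(row) == 4]


# Hypothetical row, used only to show the expected shape.
sample = (
    "someuser\ttraining\tAshton Kutcher\t"
    "I couldn't image " + "\u2588" * 14 + " in a serious role\n"
)

for example in read_examples(io.StringIO(sample)):
    print(example.split, example.name, len(example.context))
```

Skipping malformed rows silently is a design choice for this sketch; in a shared file it may be more useful to report them so the contributor can fix the pull request.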

The Redaction Context Column

The redaction context should have a single set of block characters representing the redacted text. The length of the redaction block should be equal to the number of characters in the redacted name. There should be no spaces if the redaction is multiple words. The length of the redaction context window should be reasonable (please keep it below 1024 characters). You could use your previous redactor.py code to generate the 90 redactions you will add to the data set. You can use redactions from any test or train dataset in the IMDB dataset to create test, validation, or train samples in the unredactor.tsv.
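The rules above can be sketched in a few lines of Python. This is a minimal illustration, not your redactor.py: the function name redact_name and its max_len parameter are hypothetical, and it only handles the first occurrence of the name.

```python
BLOCK = "\u2588"  # the full block character


def redact_name(text, name, max_len=1024):
    """Replace the first occurrence of `name` in `text` with one unbroken block.

    Returns None when the name is absent or the context is not below max_len.
    """
    start = text.find(name)
    if start == -1 or len(text) >= max_len:
        return None
    # One block character per character of the name, spaces included,
    # so a multi-word name becomes a single run with no gaps.
    return text[:start] + BLOCK * len(name) + text[start + len(name):]


context = redact_name(
    "I couldn't image Ashton Kutcher in a serious role", "Ashton Kutcher"
)
print(context)
```

Because each character of the name maps to exactly one block character, the redacted context has the same length as the original text.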

The Task

The task for this project is to train a model that learns from the training set and is evaluated on the validation set. Your README.md must describe your code (as usual) and give instructions clear enough to be followed. The key to this task is to (1) make it easy for a peer to use your code to execute the model on the validation set; and (2) generate the precision, recall, and F1-score of the model on the dataset. Note that to do this, you will have to write code that locates the example redaction in the training set, generates features, and runs the model.
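To make the three scores concrete, here is a small pure-Python sketch that treats each name as a class and macro-averages over the classes. The choice of macro averaging and the function name macro_scores are assumptions; the assignment does not specify an averaging scheme. scikit-learn's precision_recall_fscore_support and classification_report in sklearn.metrics compute similar summaries.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) over all observed labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # the guessed name was wrong
            fn[truth] += 1   # the true name was missed
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0, 0.0
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Placeholder labels, invented for illustration.
precision, recall, f1 = macro_scores(["A", "A", "B", "C"], ["A", "B", "B", "C"])
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With many distinct names and few examples per name, macro-averaged scores can be low even for a reasonable model, so it is worth stating in the README which averaging you report.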

Helpful code

Below is sample code that uses the default NLTK model to extract and print names from the movie reviews. (Errors may exist)

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# grabthenames.py
# grabs the names from the movie review data set
#
# Requires NLTK data for tokenizing, tagging, and chunking. Resource names
# vary by NLTK version; the LookupError raised on first use names any
# missing resource, which can then be fetched with nltk.download().

import glob
import sys

from nltk import sent_tokenize
from nltk import word_tokenize
from nltk import pos_tag
from nltk import ne_chunk


def get_entity(text):
    """Prints every PERSON entity found in the text."""
    for sent in sent_tokenize(text):
        # ne_chunk returns a tree; named entities are subtrees with a label.
        for chunk in ne_chunk(pos_tag(word_tokenize(sent))):
            if hasattr(chunk, 'label') and chunk.label() == 'PERSON':
                print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))


def doextraction(glob_text):
    """Get all the files from the given glob and pass them to the extractor."""
    for thefile in glob.glob(glob_text):
        with open(thefile, 'r', encoding='utf-8') as fyl:
            text = fyl.read()
        get_entity(text)


if __name__ == '__main__':
    # Usage: python3 grabthenames.py 'train/pos/*.txt'
    if len(sys.argv) != 2:
        sys.exit("Usage: python3 grabthenames.py 'train/pos/*.txt'")
    doextraction(sys.argv[1])

Note that in the code above we restrict the type to PERSON. Given a new document, such as the one below, that contains at least one redacted name, create Python code to help you predict the most likely unredacted name. If you are curious, the redacted name is Ashton Kutcher.

'''This movie was sadly under-promoted but proved to be truly exceptional.
Entering the theatre I knew nothing about the film except that a friend wanted to see it.

I was caught off guard with the high quality of the film.
I couldn't image ██████████████ in a serious role, but his performance truly 
exemplified his character.
This movie is exceptional and deserves our monetary support, unlike so many other movies.
It does not come lightly for me to recommend any movie, 
but in this case I highly recommend that everyone see it.

This films is Truly Exceptional!'''

There are about 90K files available in the IMDB data set, split into test and train folders. One of the most important aspects of this project is creating the appropriate set of features. Features may include n-grams in the document, the number of letters in the redacted word, the number of spaces in the redacted word, the sentiment score of the review, the previous word, and the next word.
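A few of the features listed above can be sketched as a function that returns a dictionary per redaction. The function name extract_features, the feature keys, and the <START>/<END> placeholders are all illustrative assumptions. The number of spaces is left out because it cannot be recovered from a block written without gaps, and sentiment and n-grams are omitted to keep the sketch short.

```python
import re

BLOCK = "\u2588"  # the full block character


def extract_features(context):
    """Build a feature dict for the first redaction block in `context`.

    Returns None when the context contains no redaction block.
    """
    match = re.search(BLOCK + "+", context)
    if match is None:
        return None
    # Words on either side of the block; punctuation is dropped by \w+.
    before = re.findall(r"\w+", context[:match.start()])
    after = re.findall(r"\w+", context[match.end():])
    return {
        "redaction_length": match.end() - match.start(),
        "previous_word": before[-1].lower() if before else "<START>",
        "next_word": after[0].lower() if after else "<END>",
        "context_length": len(context),
    }


features = extract_features(
    "I couldn't image " + BLOCK * 14 + " in a serious role"
)
print(features)
```

A dictionary per example is a convenient shape because it mixes numeric and string-valued features without fixing a column order in advance.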

Many students will use spaCy for this assignment, but you may use other Python libraries with approval from the instructor. Sklearn has a DictVectorizer that may be very useful. The Google Knowledge Graph Search API may also prove useful. Using this, you can find a list of all important entities (Entity Search). Given a set of candidate matches, you can use the information in the Google Knowledge Graph to give you a ranked set of people.
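One possible way to use sklearn's DictVectorizer is to feed feature dictionaries straight into a classifier through a pipeline. The sketch below is an assumption-laden toy: the training rows and names are invented, LogisticRegression is just one classifier that exposes probabilities, and rank_candidates is a hypothetical helper.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration only.
train_features = [
    {"redaction_length": 14, "previous_word": "image", "next_word": "in"},
    {"redaction_length": 14, "previous_word": "starring", "next_word": "as"},
    {"redaction_length": 10, "previous_word": "director", "next_word": "made"},
    {"redaction_length": 10, "previous_word": "by", "next_word": "is"},
    {"redaction_length": 8, "previous_word": "actress", "next_word": "shines"},
    {"redaction_length": 8, "previous_word": "with", "next_word": "and"},
]
train_names = [
    "Ashton Kutcher", "Ashton Kutcher",
    "John Smith", "John Smith",
    "Jane Doe", "Jane Doe",
]

# DictVectorizer one-hot encodes string values and passes numbers through.
model = make_pipeline(
    DictVectorizer(sparse=False),
    LogisticRegression(max_iter=1000),
)
model.fit(train_features, train_names)


def rank_candidates(features, top_k=3):
    """Return up to top_k (name, probability) pairs, most likely first."""
    probabilities = model.predict_proba([features])[0]
    classes = model.steps[-1][1].classes_
    ranked = sorted(
        zip(classes, probabilities), key=lambda pair: pair[1], reverse=True
    )
    return [(str(name), float(p)) for name, p in ranked[:top_k]]


print(rank_candidates({"redaction_length": 14, "previous_word": "image", "next_word": "in"}))
```

Feature values that were never seen during training are ignored by DictVectorizer at prediction time, so an unfamiliar previous word does not raise an error; it simply contributes nothing.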

In this project, we would like you to create a function to execute the unredaction process. It is your task to create a reproducible method of training, testing, and validating your code. Use the README to explain all assumptions. Be sure to give examples of usage. Give clear directions. Add tests for each part of your code so we can better evaluate your code.

We are looking for well-thought-out and reasoned approaches. Use the tests and your README file to evaluate your code. There are several techniques for creating the unredactor. You can use a rule-based approach, you can create a classifier using the DictVectorizer in sklearn, you could use word2vec style prediction, or you may use a different package to support your work.
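As one example of the rule-based option, a baseline can exploit the fact that the redaction block has exactly as many characters as the hidden name. The sketch below keeps only the known names of matching length and ranks them by how often they appear in the training data. The function name rule_based_candidates and the name list are hypothetical, and the code assumes a single redaction block per context.

```python
from collections import Counter

BLOCK = "\u2588"  # the full block character


def rule_based_candidates(context, known_names, top_k=5):
    """Rank names whose length equals the block length, most frequent first."""
    # Assumes the context holds a single redaction block.
    block_length = context.count(BLOCK)
    counts = Counter(known_names)
    matches = [name for name in counts if len(name) == block_length]
    # Descending training frequency, then alphabetical for a stable order.
    matches.sort(key=lambda name: (-counts[name], name))
    return matches[:top_k]


# Invented list standing in for the names column of the training rows.
names = ["Ashton Kutcher", "Jennifer Smith", "Ashton Kutcher", "Jane Doe"]
context = "I couldn't image " + BLOCK * 14 + " in a serious role"
print(rule_based_candidates(context, names))
```

A baseline like this cannot propose a name it has never seen, but it gives a reference score that a learned model should beat.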

Submission

For this project, supply one README document, a code package, and a demonstration walkthrough of your code that is less than 5 minutes long. We have given you several tools in class. You may choose how you would like to develop and present your approach. All code and a link to the video should be posted in a private GitHub repository cs5293sp22-project3; cegme and jasdehart should be added as collaborators.

Peer evaluations of the unredactor code must be performed. If you use external APIs that require keys or special permission, please include instructions on how we can also obtain access. The README file should be a full discussion of your approach, what works, and what doesn't work. It is your job to make peer evaluations as easy as possible. Evaluation of your project should be doable by a peer within 30 minutes, including all execution time.

Peer evaluation assignments and evaluation rubrics will be given at a later date.

When ready to submit, create a tag on your repository using git tag on the latest commit:

git tag v1.0
git push origin v1.0

The version v1.0 lets us know when and what version of code you would like us to grade. If you need to submit an updated version, you can use the tag v1.1.

We will also ask you to submit all code files on Gradescope.

In summary you should submit:

Grading

You will receive follow-up instructions with rubrics in the following week.

Task                     Percentage
Peer Grading             30%
Labeled Data             10%
README and Evaluation    60%
Total                    100%

Extra links and Notes

To use some APIs, a larger instance may be needed. If you decide to use a larger instance, please let us know and be sure to note this in your README. Ease of use is important.

Extracting features with sklearn DictVectorizer

An additional helpful DictVectorizer example with NLTK

FeatureUnions may also be helpful in combining vectorizers.

Addendum