Assignment 5

Classification Activity

Below is a set of activities to be completed next class. Submit your assignment as a single PDF file. ~~You will be given the first ~30 minutes of class to complete and submit the assignment.~~

Task 1

Given the labels below, calculate precision, recall, accuracy, and F1 score (assume spam is the positive label).

actual_labels = ['spam', 'ham', 'spam', 'spam', 'spam',
               'ham', 'ham', 'spam', 'ham', 'spam',
               'spam', 'ham', 'ham', 'ham', 'spam',
               'ham', 'ham', 'spam', 'spam', 'ham']

predicted_labels = ['spam', 'spam', 'spam', 'ham', 'spam',
                    'spam', 'ham', 'ham', 'spam', 'spam',
                    'ham', 'ham', 'spam', 'ham', 'ham',
                    'ham', 'spam', 'ham', 'spam', 'spam']
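
As a reminder of how the four metrics are defined, here is a minimal sketch that counts true/false positives and negatives by hand. The helper name `binary_metrics` and the toy label lists are made up for illustration and are deliberately different from the assignment data.

```python
def binary_metrics(actual, predicted, positive):
    """Return precision, recall, accuracy, and F1 for one positive label."""
    pairs = list(zip(actual, predicted))
    # Confusion-matrix counts relative to the chosen positive label.
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)

    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical toy labels (not the assignment data).
toy_actual = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'spam']
toy_predicted = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'ham', 'ham']

toy_metrics = binary_metrics(toy_actual, toy_predicted, positive='spam')
print(toy_metrics)
```

The same function can be called on actual_labels and predicted_labels; working the counts out by hand first is a useful check.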

Task 2

Create a sample corpus by creating an array of sentences. Using NLTK and/or sklearn, create one function that does the following:

Create a bag of words for each sentence.
Create a bag of words using 3-grams.
Compute a TF-IDF value for each 3-gram in each sentence.

Task 3

Review the sentiment analysis classifier creation code. Recall that sentiment analysis is a classification task with positive and negative classes. Use the sad.thorn file. This file has three columns: the id, the label (positive or negative), and the text of the tweet. Split the data into a 70%/30% train/test split. (Note: The better practice is to also create a validation set, but that is not necessary here.) Vectorize each tweet using the TfidfVectorizer. Then, use the sklearn MultinomialNB classifier to classify the sentiment of tweets as positive or negative. Report the precision, recall, and f1-score.

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X, y) # X, y: TF-IDF features and labels of the training tweets
clf.predict(Z) # Z: TF-IDF features of the held-out (test) tweets
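
To show how the pieces of this task fit together, here is a minimal end-to-end sketch. It does not read the data file: the `texts` and `labels` lists are invented stand-ins for the tweet and label columns, and the `random_state` and `stratify` settings are choices made for this example rather than requirements of the assignment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in data; in the assignment these come from the file.
texts = [
    "i love this sunny day", "what a great game tonight",
    "so happy with my new phone", "best coffee ever", "this movie was wonderful",
    "i hate waiting in traffic", "worst service ever", "so sad about the news",
    "this phone is terrible", "what an awful game tonight",
]
labels = ["positive"] * 5 + ["negative"] * 5

# 70%/30% split; stratify keeps both classes in each part.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=0, stratify=labels)

# Fit the vocabulary and IDF weights on training tweets only,
# then reuse them to transform the test tweets.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Treat "positive" as the positive class when scoring.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, pos_label="positive", average="binary", zero_division=0)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Fitting the vectorizer on the training split only avoids leaking information from the test tweets into the features. With a dataset this small the scores are not meaningful; the point is the order of the steps.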