Python Project 1-TextAnalyzer Solved


Introduction

What if there was a way to identify the author of an anonymous text? In this project, we will build the infrastructure for analyzing texts so they can be compared for similarity.

Data scientists have successfully achieved this by comparing the frequency of common words in an author’s writings. These frequencies form a descriptor of an author’s style, which tends to stay constant across their works. We can compare the frequencies of different writings, to see which writings are similar.

This method was able to identify “Robert Galbraith” as the pen name of JK Rowling (note the overlap of the light blue and dark blue dots in the graph).

(To try different word combinations on this graph, see Can You Identify an Author By How Often They Use the Word “The”?)

In this project…

In this project, you will develop a TextAnalyzer class. A TextAnalyzer object will read in a file and do all of the analysis needed to create the frequency “fingerprint” for that text.

Here’s an example of how the TextAnalyzer works on a short text, tinyfile_3.txt:

sentence_count
3
word_count
12
vocabulary
['coffee', 'hate', 'i', 'juice', 'love', 'much', 'tea']
frequencies
{'i': 3, 'love': 2, 'coffee': 1, 'much': 3, 'tea': 1, 'hate': 1, 'juice': 1}
frequency of 'much'
3
percent_frequencies
{'i': 0.25, 'love': 0.1666666, 'coffee': 0.0833333, 'much': 0.25, 'tea': 0.0833333, 'hate': 0.0833333, 'juice': 0.0833333}
most_common
['i', 3]
five_least_common
[('coffee', 1), ('tea', 1), ('hate', 1), ('juice', 1), ('love', 2)]
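The fingerprint above can be reproduced with collections.Counter. A minimal sketch, assuming the already-cleaned token list for tinyfile_3.txt and a stopword set of {'is', 'so'} (in the real project, stopwords come from stopwords.txt):

```python
from collections import Counter

# Cleaned, lowercased tokens of tinyfile_3.txt (punctuation already stripped).
tokens = ['i', 'love', 'coffee', 'so', 'so', 'so', 'so', 'much',
          'i', 'love', 'tea', 'so', 'so', 'so', 'so', 'much',
          'i', 'hate', 'juice', 'so', 'so', 'so', 'so', 'much']

stopwords = {'is', 'so'}  # assumption: the real set is read from stopwords.txt
words = [w for w in tokens if w not in stopwords]

frequencies = Counter(words)
word_count = len(words)                  # 12
vocabulary = sorted(frequencies)         # unique words, alphabetical
percent_frequencies = {w: c / word_count for w, c in frequencies.items()}

print(word_count)               # 12
print(frequencies['i'])         # 3
print(percent_frequencies['i'])  # 0.25
```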

The code

You will create a class called TextAnalyzer with the following methods. Implement the methods so that all provided test cases pass:

class TextAnalyzer:
    def __init__(self, filepath):
        """Initializes the TextAnalyzer object, using the file at filepath.
        Initialize the following instance variables: filepath (string),
        words (list)."""

    def sentence_count(self):
        """Returns the number of sentences in the file (separated by '.').
        Note that if there is no '.' in the text, return 1."""

    def words(self):
        """Returns a list of words without punctuation and all lowercase.
        For example, 'Cat!' should become 'cat'."""

    def remove_stopwords(self, words):
        """Takes in the list of lowercase, punctuation-free words.
        Returns a list of words with the stopwords provided by the file
        'stopwords.txt' removed."""

    def word_count(self):
        """Returns the number of words in the file, not including stopwords.
        A word is defined as any text that is separated by whitespace
        (spaces, newlines, or tabs)."""

    def vocabulary(self):
        """Returns a list of the unique words in the text, sorted in
        alphabetical order. Capitalization, punctuation, and stopwords should
        be ignored, so 'Cat!' is the same word as 'cat'. The returned words
        should be all lowercase, without punctuation or stopwords."""

    def frequencies(self):
        """Returns a dictionary of the words in the text and the count of how
        many times they appear. The words are the keys, and the counts are the
        values. All the words should be lowercase, without punctuation, and
        should not include stopwords. The order of the keys doesn't matter."""

    def frequency_of(self, word):
        """Returns the number of times the word appears in the text.
        Capitalization, punctuation, and stopwords should be ignored, so
        'Cat!' is the same word as 'cat'. If the word does not exist in the
        text, return 0."""

    def percent_frequencies(self):
        """Returns a dictionary of the words in the text and the frequency of
        each word as a percentage of the text. The words are the keys, and the
        percentages are the values. All the words should be lowercase, without
        punctuation or stopwords. The order of the keys doesn't matter."""

    def most_common(self):
        """Returns the most common word in the text and its frequency, as a
        list. If multiple words share the highest frequency, return any one of
        them. The word should be lowercase, without punctuation or stopwords."""
        # Example output: ['officer', 6]

    def five_least_common(self):
        """Returns the five least common words in the text and their
        frequencies, as a list of tuples. If there are fewer than five words
        in the text, return all of the least common words. If multiple words
        share the same frequency, return any of the least common words, which
        should be lowercase, without punctuation or stopwords."""
        # Example output: [('ants', 1), ('apple', 1), ('bat', 1), ('cat', 2)]

    def read_sample_csv(self):
        """Reads the sample.csv file and returns the list of fieldnames."""
        # Output format: filepath, total words, word count removing stopwords,
        # line count, most common word

    def write_analysis_details(self, csvfile):
        """Writes the details of the textual analysis to csvfile.
        Refer to sample.csv for an example of how this should look.
        Note that for most_common, write just the word, not its frequency."""
        # Output format: filepath, total words, word count removing stopwords,
        # line count, most common word

    def similarity_with(self, other_text_analyzer):
        """Extra credit. Calculates the similarity between this text and the
        other text using cosine similarity. Words should be lowercase, without
        punctuation or stopwords."""
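For the ranking methods, collections.Counter does most of the work. A minimal sketch, assuming the word frequencies are already available as a plain dict (the variable names here are illustrative, not part of the required API):

```python
from collections import Counter

# Example frequency dict (as produced for tinyfile_3.txt).
freqs = {'i': 3, 'love': 2, 'coffee': 1, 'much': 3,
         'tea': 1, 'hate': 1, 'juice': 1}

counter = Counter(freqs)

# most_common: the highest-frequency word and its count, as a list.
word, count = counter.most_common(1)[0]
most = [word, count]

# five_least_common: five lowest-frequency (word, count) tuples.
# counter.most_common() sorts high-to-low, so a reversed slice of the
# tail yields the five least common words, rarest first.
least = counter.most_common()[:-6:-1]

print(most)
print(least)
```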

 

Work on one method at a time. Choose the one that you think is the easiest, and work on it until you can get all the tests related to that method to pass. This is a great strategy, since the solution to some methods can be used to quickly complete other methods.

Make sure you are using Python 3! Some of the tests won’t pass if you are using Python 2.

Extra Credit: Calculating similarity – 15 points

Now let’s see how one text compares to another. Here are two different texts:

File tinyfile_1.txt:
['coffee', 'is', 'so', 'good']

File tinyfile_3.txt:
['i', 'love', 'coffee', 'so', 'so', 'so', 'so', 'much', 'i', 'love', 'tea', 'so', 'so', 'so', 'so', 'much', 'i', 'hate', 'juice', 'so', 'so', 'so', 'so', 'much']

One way to measure their similarity is to compare the frequencies of the different words in these texts. We can use the frequencies calculated by the TextAnalyzer, but we have to be careful about which words the two texts share.

Frequency for tinyfile_1.txt:

word        frequency
'coffee'    1
'good'      1

Frequency for tinyfile_3.txt:

word        frequency
'i'         3
'love'      2
'coffee'    1
'much'      3
'tea'       1
'hate'      1
'juice'     1

Only one of the words is used in both texts, so we don’t expect these texts to be very similar. These frequencies form a sort of vector for each text. We can measure the similarity of two vectors using something called the cosine similarity, so we can measure the similarity between two texts the same way.

The cosine similarity of two vectors is:

the dot product of the two vectors / (magnitude of the first vector * magnitude of the second vector)

Step 1: Get the most common word’s frequency from both texts.
Step 2: Create a list of the shared words.
Step 3: Calculate the dot product over the shared words. In this case the only shared word is 'coffee', so the dot product is 1 * 1 = 1. If there were two shared words, say 'coffee': 1 vs. 'coffee': 2 and 'good': 1 vs. 'good': 1, then the dot product would be 1*2 + 1*1 = 3.

More info : https://simple.wikipedia.org/wiki/Dot_product
(Tip: Note that only the words that both texts have in common actually matter in this calculation)

How to calculate the magnitude:

Step 4: Use the most common word’s frequency as the magnitude for each text.
Step 5: Compute dot product / (text1 magnitude * text2 magnitude).

Here the answer would be 1 / (2 * 1) = 0.5.
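For comparison, the textbook cosine-similarity formula uses the Euclidean magnitude (square root of the sum of squared counts) rather than the simplified magnitude in the steps above, so match whatever the provided tests expect. A sketch of the standard formula over two frequency dicts:

```python
import math

def cosine_similarity(freq_a, freq_b):
    """Standard cosine similarity between two word-frequency dicts."""
    shared = set(freq_a) & set(freq_b)  # only shared words contribute to the dot product
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    mag_a = math.sqrt(sum(c * c for c in freq_a.values()))
    mag_b = math.sqrt(sum(c * c for c in freq_b.values()))
    return dot / (mag_a * mag_b)

f1 = {'coffee': 1, 'good': 1}
f3 = {'i': 3, 'love': 2, 'coffee': 1, 'much': 3,
      'tea': 1, 'hate': 1, 'juice': 1}
print(round(cosine_similarity(f1, f3), 3))  # 0.139
```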

Miscellaneous

Useful string methods: split() and strip()

>>> s = "I love cats. I love every kind of cat!\n"
>>> s.split()
['I', 'love', 'cats.', 'I', 'love', 'every', 'kind', 'of', 'cat!']
>>> s = 'cats.'
>>> s.strip(".!")
'cats'
>>> s = 'cat!'
>>> s.strip(".!")
'cat'
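Put together, split() and strip() are enough for a first pass at the word cleanup. A sketch (string.punctuation is one convenient set of characters to strip; the helper name here is illustrative):

```python
import string

def clean_words(text):
    """Lowercase each whitespace-separated token and strip surrounding punctuation."""
    return [w.strip(string.punctuation).lower() for w in text.split()]

print(clean_words("I love cats. I love every kind of cat!\n"))
```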

Useful function: sorted()

>>> l = ['love', 'every', 'kind', 'of', 'cat']
>>> sorted(l)
['cat', 'every', 'kind', 'love', 'of']
>>> sorted(l, reverse=True)
['of', 'love', 'kind', 'every', 'cat']
>>> sorted(l, key=lambda x: x[-1])  # sort by the last letter
['kind', 'love', 'of', 'cat', 'every']

According to the Python documentation: “It is best to think of a dictionary as an unordered set of key: value pairs”.
