Description
P4: Text Statistics
Project Overview
Booklamp was a recent Boise startup (two of our CS department’s alumni were part of it!) that created a sophisticated system for analyzing the text in books using techniques similar to what you will be doing in this lab. Booklamp software matches readers to books through an analysis of writing styles. The Booklamp technology makes it possible to find books that are written with a similar tone, tense, perspective, action level, description level and dialogue level. Booklamp was acquired by Apple on 25th July, 2014.
Objectives
- Use arrays.
- Use command-line arguments.
- Read from text files using the Scanner class.
- Parse and analyze text input.
- Implement an interface.
Getting Started
- Create a new Eclipse project for this assignment.
- The easiest way to import all the files into your project is to download and unzip the starter files directly into your project workspace directory (using the command line or the Dolphin file manager). The starter files are available here: http://cs.boisestate.edu/~cs121/projects/p4/stubs (You should download p4-stubs.zip).
- After you unzip the files into your workspace directory outside of Eclipse, go back to Eclipse and refresh your project.
- Create ProcessText and TextStatistics classes in your project.
- Start by implementing ProcessText. You can ignore the command-line arguments to start. Just hard-code a file name so you can test your TextStatistics class as you write it. Create a File object and check to see that the file actually exists.
  - If the file does exist, your program will create a TextStatistics object for that file and print out the statistics for the file to the console.
  - If the file does not exist, a meaningful error message needs to be printed to the user.
- Next you can start implementing TextStatistics according to the specifications.
- At this point, you should go back and add command-line argument processing to ProcessText as described in the specifications. To make sure it correctly handles command line arguments, run it from the command line with no arguments, files that don’t exist, and files that do exist.
- Make sure to test your program thoroughly. We are giving you the test program and scripts we will use to grade your program, so take advantage of this and make sure they pass!
Specification
Project files
For this assignment you are going to implement two classes. TextStatistics will be the class that reads a text file, parses it, and stores the information about the words and characters in the file. ProcessText is the driver class that gets a list of one or more filenames from the command line and collects statistics on each of the files using an instance of the TextStatistics class.
- Classes that you will create: ProcessText.java, TextStatistics.java
- Existing interface and class that you will use: TextStatisticsInterface.java, TextStatisticsTest.java
You should be able to develop this program incrementally in such a way that you can turn in a program that runs even if you don’t succeed in implementing all the specified functionality.
ProcessText.java
This class contains the main method, which processes one or more files to determine some interesting statistics about them.
- Command-line validation

  The names of the files to process will be given as command line arguments. Make sure to validate the number of command line arguments. There should be at least one file name given.

  If no files are given on the command line, your program must print a usage message and exit the program immediately. The message should read as follows.

      Usage: java ProcessText file1 [file2 ...]

  This lets the user know how they should run the program without having to go look up the documentation.

- Processing command-line arguments

  If valid filenames are given on the command line, your program will process each command line argument by creating a File object from it and checking to see that the file actually exists.

  - If a file does exist, your program will create a TextStatistics object for that file and print out the statistics for the file to the console.
  - If a file does not exist, a meaningful error message needs to be printed to the user. Continue processing the next file. An invalid file in the list should not result in the program crashing or exiting before all files have been processed.

  The example, CmdLineArgs.java, shows how to use command line arguments in your program. The args parameter of the main method is an array of String objects that contains the command line arguments to the program. For your program, the array should contain the names of the files to be processed.
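The two steps above can be sketched roughly as follows. This is a minimal illustration, not the required ProcessText: the class name ArgsSketch and the helper describe are hypothetical, and the "Would process" message is a placeholder for creating a TextStatistics object and printing it.

```java
import java.io.File;

/** Hypothetical demo of argument handling; not the required ProcessText class. */
public class ArgsSketch {
    /** Returns the message this sketch prints for one command-line argument. */
    public static String describe(String path) {
        File file = new File(path);
        if (file.exists()) {
            // Placeholder: the real program would build and print statistics here.
            return "Would process: " + path;
        }
        return "Invalid file path: " + path;
    }

    public static void main(String[] args) {
        // Validate the argument count before doing anything else.
        if (args.length == 0) {
            System.err.println("Usage: java ArgsSketch file1 [file2 ...]");
            System.exit(1);
        }
        // A bad file name only produces a message; the loop keeps going.
        for (String arg : args) {
            System.out.println(describe(arg));
        }
    }
}
```

Because the check happens inside the loop, one missing file does not stop the remaining files from being handled.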
TextStatistics.java
- Implement the Interface

  Your TextStatistics class must implement the given TextStatisticsInterface (don’t modify the interface; it just provides a list of methods that your class must include).

  To implement an interface, you must modify your class header as follows:

      public class TextStatistics implements TextStatisticsInterface { }

  Adding “implements TextStatisticsInterface” will cause an error in Eclipse, because the class does not yet define the methods the interface lists. Select the quick fix option to “Add unimplemented methods” and it will stub out the required methods for you.
- Instance variables

  Include a reference to the processed File. Include variables for all of the statistics that are computed for the file. Look at the list of accessor methods in the TextStatisticsInterface to determine which statistics will be stored.

- Constructor

  Takes a File object as a parameter. The constructor should open the file and read the entire file line-by-line, processing each line as it reads it. You should only have to read through each file once if you are doing this program properly. By the end of the constructor, the TextStatistics object should have collected all of its statistics and calls to its accessor methods will simply return the stored values.

  - Your constructor needs to handle the FileNotFoundException that can occur when the File is opened in a Scanner. Use a try-catch statement to do this. Don’t just throw the exception.
  - As each line is read, collect the following statistics:
    - The number of characters and lines in the file. The number of characters should include all whitespace characters, punctuation, etc. The number of lines should include any blank lines in the file.
    - The number of words in the file. You must use a Scanner on each line to count the number of words in each line of the text file. To ensure everyone’s results are consistent, you must use the exact delimiter given below rather than making up your own.

          private static final String DELIMITERS = "[\\W\\d_]+";

      Use useDelimiter(DELIMITERS) on your line Scanner to set the delimiters that the Scanner will use for separating words in the file. The scanner will not return any of the delimiter characters. For example, calling lineScan.next() repeatedly on the string

          .scheme, and the "plan" (for us)

      will give the following tokens.

          scheme and the plan for us

      The UseScannerDelimiter.java example shows the different results you get using the default delimiters and user-specified delimiters.
    - The number of words of each length that appear in the file. Assume that the maximum word length is 23. You do not need to print lengths that have a count of zero.
    - The average word length for the file.
    - The number of times each letter appears in the file. Do not separate upper and lower case; just convert all characters to lower case before counting. See LetterCount.java for a similar approach.
- Getter (accessor) methods

  Implement the accessor methods for the number of characters, number of words, number of lines, average word length and for the arrays that contain the number of words of each length and the number of times each letter occurs in the file.
- toString method

  Write a toString() method that generates and returns a String that can be printed to summarize the statistics for the file as shown in the sample output.
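To make the pieces above concrete, here is a minimal sketch of how an interface, a single-pass constructor, the required delimiter, and a toString() method can fit together. Everything named here is hypothetical: LineScanInterface stands in for the real TextStatisticsInterface (whose exact method list is not reproduced), LineScanSketch is not the required class, and the column widths in toString() are a guess rather than the graded format. Character counting and the average word length are left out on purpose.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Locale;
import java.util.Scanner;

/** Hypothetical stand-in for the provided interface. */
interface LineScanInterface {
    int getLineCount();
    int getWordCount();
}

/** Hypothetical sketch; it is not the required TextStatistics class. */
public class LineScanSketch implements LineScanInterface {
    // Delimiter pattern given in the specification.
    private static final String DELIMITERS = "[\\W\\d_]+";
    // The specification says to assume words are at most 23 letters long.
    private static final int MAX_WORD_LENGTH = 23;

    private int lineCount;
    private int wordCount;
    private final int[] wordLengthCount = new int[MAX_WORD_LENGTH + 1];
    private final int[] letterCount = new int[26];

    /** Reads the file once, line by line, and handles a missing file itself. */
    public LineScanSketch(File file) {
        try (Scanner fileScan = new Scanner(file)) {
            while (fileScan.hasNextLine()) {
                processLine(fileScan.nextLine());
            }
        } catch (FileNotFoundException e) {
            System.err.println("Invalid file path: " + file.getPath());
        }
    }

    /** Updates every counter for a single line of text. */
    public void processLine(String line) {
        lineCount++; // blank lines are counted too
        try (Scanner lineScan = new Scanner(line)) {
            lineScan.useDelimiter(DELIMITERS);
            while (lineScan.hasNext()) {
                String word = lineScan.next().toLowerCase(Locale.ROOT);
                wordCount++;
                if (word.length() <= MAX_WORD_LENGTH) {
                    wordLengthCount[word.length()]++;
                }
                for (char c : word.toCharArray()) {
                    if (c >= 'a' && c <= 'z') { // guard the array index
                        letterCount[c - 'a']++;
                    }
                }
            }
        }
    }

    @Override
    public int getLineCount() { return lineCount; }

    @Override
    public int getWordCount() { return wordCount; }

    /** Number of words seen with exactly the given length. */
    public int getWordLengthCount(int length) { return wordLengthCount[length]; }

    /** Number of times the given lower-case letter was seen. */
    public int getLetterCount(char letter) { return letterCount[letter - 'a']; }

    /** Summarizes the counts, skipping word lengths that never occurred. */
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        sb.append(lineCount).append(" lines\n");
        sb.append(wordCount).append(" words\n");
        for (int len = 1; len <= MAX_WORD_LENGTH; len++) {
            if (wordLengthCount[len] > 0) {
                sb.append(String.format("%6d %9d\n", len, wordLengthCount[len]));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "testfile.txt";
        System.out.print(new LineScanSketch(new File(name)));
    }
}
```

The accessors only return stored values, so all of the work happens once, while the constructor reads the file.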
Testing
TextStatisticsTest.java: Automated testing based on the interface.
We have provided a test program that tests your TextStatistics class using three sample text files. This test program will not compile unless you have properly implemented the required interface. This is available in your starter files.
autograde.sh: Testing based on program output.
- autograde.sh: A shell script (a program made up of shell commands) that you can run to see if your program is going to work with the shell script used for grading the programs.
- testfile.txt and etext: Sample text files used by autograde.sh.
- testresults: The expected output of autograde.sh. Your output should match the contents of this file.
To use these files to test your program, copy them into your program directory on onyx and make sure that autograde.sh is executable. (Running ls -l on the file should show -rwx------. If the x is missing, type chmod +x autograde.sh.)
Now run the test by typing

    ./autograde.sh
If your program does not compile and run, you need to fix it if you want any points for the program. Make sure all the files have the names specified.
Sample Sessions
Sample output for bad arguments/non-existing files
    [you@onyx p4]$ java ProcessText
    Usage: java ProcessText file1 [file2 ...]
    [marissa@onyx p4]$ java ProcessText not-a-file.txt
    Invalid file path: not-a-file.txt
    [marissa@onyx p4]$ java ProcessText not-a-file.txt testfile.txt
    Invalid file path: not-a-file.txt
    Statistics for testfile.txt
    ==========================================================
    11 lines
    79 words
    465 characters
    ------------------------------
    a = 27     n = 25
    b = 1      o = 26
    c = 11     p = 5
    d = 10     q = 0
    e = 33     r = 21
    f = 9      s = 30
    g = 7      t = 35
    h = 24     u = 7
    i = 25     v = 1
    j = 0      w = 10
    k = 2      x = 1
    l = 18     y = 2
    m = 5      z = 0
    ------------------------------
    length frequency
    ------ ---------
         1         3
         2        13
         3        24
         4        13
         5        10
         6         2
         7         5
         8         3
         9         1
        10         3
        11         2
    Average word length = 4.24
    ==========================================================
Sample output for the input file testfile.txt
    [you@onyx p4]$ java ProcessText testfile.txt
    Statistics for testfile.txt
    ==========================================================
    11 lines
    79 words
    465 characters
    ------------------------------
    a = 27     n = 25
    b = 1      o = 26
    c = 11     p = 5
    d = 10     q = 0
    e = 33     r = 21
    f = 9      s = 30
    g = 7      t = 35
    h = 24     u = 7
    i = 25     v = 1
    j = 0      w = 10
    k = 2      x = 1
    l = 18     y = 2
    m = 5      z = 0
    ------------------------------
    length frequency
    ------ ---------
         1         3
         2        13
         3        24
         4        13
         5        10
         6         2
         7         5
         8         3
         9         1
        10         3
        11         2
    Average word length = 4.24
    ==========================================================
Expected output for TextStatisticsTest
    [you@onyx p4]$ java TextStatisticsTest
    Testing on data file:testfile.txt
    Passed! getCharCount()
    Passed! getWordCount()
    Passed! getLineCount()
    Passed! getAverageWordLength()
    Passed! Arrays frequencies
    Passed! Letter frequencies
    Testing on data file:etext/Gettysburg-Address.txt
    Passed! getCharCount()
    Passed! getWordCount()
    Passed! getLineCount()
    Passed! getAverageWordLength()
    Passed! Arrays frequencies
    Passed! Letter frequencies
    Testing on data file:etext/Alice-in-Wonderland.txt
    Passed! getCharCount()
    Passed! getWordCount()
    Passed! getLineCount()
    Passed! getAverageWordLength()
    Passed! Arrays frequencies
    Passed! Letter frequencies
Submitting Your Project
Documentation
Javadoc Comments
If you haven’t already, add javadoc comments to your program. They should be located immediately before the class header and before each method. If you have forgotten how to do this, see the Documenting Your Program section from lab.
- Have a class javadoc comment before the class.
- Have javadoc comments before every method that you wrote. Comments must include @param and @return tags as appropriate.
- To build and view your comments, run the following commands.

      javadoc -author -d doc *.java
      google-chrome doc/index.html
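As a reminder of the layout, here is a small sketch of a class comment and a method comment. The class JavadocSketch, its method, and the author name are made up for illustration; only the comment structure matters.

```java
/**
 * Hypothetical utility class used only to illustrate javadoc layout.
 *
 * @author yourname
 */
public class JavadocSketch {
    /**
     * Computes the average of the given word lengths.
     *
     * @param lengths the length of each word; may be empty
     * @return the mean length, or 0.0 if there are no words
     */
    public static double averageLength(int[] lengths) {
        if (lengths.length == 0) {
            return 0.0;
        }
        int total = 0;
        for (int length : lengths) {
            total += length;
        }
        return (double) total / lengths.length;
    }
}
```

The @author tag in the class comment is what the -author option picks up when the pages are generated.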
README
Include a plain-text file called README that describes your program and how to use it. Expected formatting and content are described in README_TEMPLATE. See README_EXAMPLE for an example.