Homework Assignment 5; CS 450

The goal for this assignment is to improve your existing program and to prepare it for automated testing by adhering to certain standards for consistency.

Part 1: Modify your program in response to everyone's comments

In the CHANGE_LOG, describe what you changed and, if applicable, what you didn't change in response to your peer reviews and grading comments.

Part 2: Modify your program to support the following user interface

In order to support standardized testing, you must modify your user interface to match a specific requested interface. (Hopefully, you modularized the user interface portion, so this won't require too many changes. The user interface is probably the most likely source of change requests, so keeping it isolated from the rest of the code is always a good idea.)

cs450words [-l] [-t] [-d] [-c] [infile]

cs450words [--help]

The -t flag is optional and if present specifies that the total number of words should be output in the following format:

TOTAL: "count"

The -d flag is optional and if present specifies that the number of distinct words should be output in the following format:

DISTINCT: "count"

The -c flag is optional and if present specifies that the most common word and its count should be output in the following format:

MOST COMMON WORD: "word" "count"

If there is a tie for most common word, you should output a MOST COMMON WORD line for each tied word.

The -l flag is optional and if present specifies that a list of distinct words with their frequencies should be output in the following format:

"word" "count"

If no flags are specified, then flags -l, -t, -d and -c should be invoked and output will occur in the order: word list, total count, distinct count and most common word. For example:

bar 1
baz 1
foo 2
TOTAL: 4
DISTINCT: 3
MOST COMMON WORD: foo 2

In all cases, output will occur in this order. However, only a subset of it may be output depending on the flags specified by the user. For example, if the user specifies "-t -d", then only the total count and distinct count will be given (in that order).

There should be no extraneous blank lines in the output - either before, between or after the specified output lines.
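The ordering rules above can be sketched in a few lines. This is only an illustration in Python (the assignment does not prescribe a language); the function name format_report and its keyword arguments are hypothetical, and the alphabetical ordering of the word list is an assumption taken from the example, not a stated requirement.

```python
from collections import Counter


def format_report(counts, show_list=False, show_total=False,
                  show_distinct=False, show_common=False):
    """Return output lines in the fixed order: list, total, distinct, most common."""
    # No flags at all means every section is printed.
    if not (show_list or show_total or show_distinct or show_common):
        show_list = show_total = show_distinct = show_common = True

    lines = []
    if show_list:
        # Assumption: alphabetical order, as in the handout's example.
        for word in sorted(counts):
            lines.append(f"{word} {counts[word]}")
    if show_total:
        lines.append(f"TOTAL: {sum(counts.values())}")
    if show_distinct:
        lines.append(f"DISTINCT: {len(counts)}")
    # Assumption: with empty input there is no most common word, so no line.
    if show_common and counts:
        top = max(counts.values())
        # One line per tied word.
        for word in sorted(counts):
            if counts[word] == top:
                lines.append(f"MOST COMMON WORD: {word} {top}")
    return lines
```

For instance, format_report(Counter(["foo", "bar", "foo", "baz"]), show_total=True, show_distinct=True) returns the two lines "TOTAL: 4" and "DISTINCT: 3". Joining the lines with "\n" and printing them once is one way to avoid stray blank lines.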

If the --help option is present, the program should print out a helpful message describing this interface and then exit without any further processing.

The name of the input file may be specified on the command line. If not, the input should be taken from STDIN (this is very important because our automated testing script will work this way). For example:
cat inputFile | cs450words -t > outputFile

Output should be sent to STDOUT. Error messages should be sent to STDERR.
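One possible way to keep the three streams separate is sketched below in Python. The helper names read_input and report_error are hypothetical, and the "cs450words:" prefix on error messages is an assumption, not something the interface requires.

```python
import sys


def read_input(infile=None, stdin=None):
    """Return the whole input text from infile if given, otherwise from STDIN."""
    if infile is None:
        stream = sys.stdin if stdin is None else stdin
        return stream.read()
    with open(infile, "r") as handle:
        return handle.read()


def report_error(message, stderr=None):
    """Write an error message to STDERR so that STDOUT holds only results."""
    stream = sys.stderr if stderr is None else stderr
    # Assumption: prefixing the program name is a convention, not a requirement.
    print(f"cs450words: {message}", file=stream)
```

The optional stdin and stderr parameters exist only so the helpers can be exercised without a real terminal; a program would normally call read_input(infile) and report_error(message).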

If it helps, you may assume that the flags, when present, will appear in the order given above (i.e. -l comes before -t, which comes before -d, which comes before -c).

Your code can accept other arguments as desired, but these 4 must be supported.

You should call your executable cs450words.

IMPORTANT NOTE: This is the only input you can get from the user. You may not query the user for additional information.

You may find the getopt or getopt_long functions helpful. Here are some references http://www.tac.eu.org/cgi-bin/man-cgi?getopt+3 and http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_22.html .
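As a rough illustration of the getopt approach, here is a sketch using Python's standard getopt module, which is modeled on the C function. The function name parse_args, the USAGE string layout and the returned tuple are hypothetical choices; a C program using getopt or getopt_long would follow the same outline.

```python
import getopt

USAGE = ("usage: cs450words [-l] [-t] [-d] [-c] [infile]\n"
         "       cs450words [--help]")


def parse_args(argv):
    """Parse argv (without the program name) into (flags, infile, want_help)."""
    # getopt.getopt raises getopt.GetoptError on an unknown option and stops
    # scanning at the first non-option argument, so flags must precede infile.
    opts, rest = getopt.getopt(argv, "ltdc", ["help"])
    flags = {"l": False, "t": False, "d": False, "c": False}
    want_help = False
    for opt, _value in opts:
        if opt == "--help":
            want_help = True
        else:
            flags[opt[1]] = True  # "-t" -> "t"
    # No flags given means all four are in effect.
    if not any(flags.values()):
        flags = dict.fromkeys(flags, True)
    if len(rest) > 1:
        raise getopt.GetoptError("at most one input file may be given")
    infile = rest[0] if rest else None
    return flags, infile, want_help
```

A caller would typically pass sys.argv[1:], print USAGE and exit when want_help is true, and send the text of any GetoptError to STDERR.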

Part 3: What is a word.

In addition to standardizing the command line interface and the output format, we will standardize on the following definition of a word:
A word is a string of alphanumeric characters. All letters in the input should be converted to lowercase. In addition, some word-internal punctuation is allowed. Specifically, the following six characters may occur in the middle of a word: / _ . - ' @ (slash, underscore, period, hyphen, apostrophe and at sign). However, only one punctuation mark may occur between any pair of alphanumeric characters; two or more in a row break the word (i.e. the run is treated like white space). In other words, the only time punctuation should appear in a word is when there is only one punctuation mark in a row *and* it is one of the valid punctuation characters.

No punctuation mark may occur at the beginning of a word. There is one punctuation mark that may occur at the end of a word: '
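The definition above can be approximated with a single regular expression. The sketch below is one reading of the rules, not an official reference: the name split_words is hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and a trailing apostrophe is assumed to be kept only when white space or the end of input follows it (so that an apostrophe followed by another punctuation mark counts as two marks in a row).

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits after lowercasing.
# Pattern: alphanumerics, then any number of (one valid mark + alphanumerics),
# then optionally one apostrophe that is followed by white space or end of input.
_WORD = re.compile(r"[a-z0-9]+(?:[/_.\-'@][a-z0-9]+)*(?:'(?=\s|$))?")


def split_words(text):
    """Return the lowercase words found in text, in order of appearance."""
    return _WORD.findall(text.lower())
```

Under these assumptions, split_words("don't foo--bar dogs' -baz") returns ["don't", "foo", "bar", "dogs'", "baz"]: the single apostrophe stays inside the word, the double hyphen breaks it, the trailing apostrophe is kept, and the leading hyphen is dropped.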

It would be nice to handle some special cases. For example, a word ending in - followed by white space (tab, space or carriage return) should be appended to the word following the white space with the hyphen removed. (It is actually a good test for whether your modular decomposition of the problem matches the problem domain.) This will not, however, be a part of our word definition. So, for automated testing it is *important* that this feature not be enabled.

Here are some examples.

Part 4: Submit your assignment.

Remember to update your CHANGE_LOG, README, MANPAGE, TIME_LOG, BUG_LOG, Makefile, etc. to reflect these newest changes. You may submit all files to /afs/clarkson.edu/class/cs450/students/YOUR_USERNAME/hw4.

These names *must* be exact!!! As we approach the automated testing phase, I reserve the right not to grade any file with the wrong name or with incorrect contents.
