cs450words [-l] [-t] [-d] [-c] [infile]
cs450words [--help]
The -t flag is optional and if present specifies that the total number of words should be output in the following format:
TOTAL: "count"
The -d flag is optional and if present specifies that the number of distinct words should be output in the following format:
DISTINCT: "count"
The -c flag is optional and if present specifies that the most common word and its count should be output in the following format:
MOST COMMON WORD: "word" "count"
If there is a tie for most common word, you should output a MOST COMMON WORD line for each tied word.
The -l flag is optional and if present specifies that a list of distinct words with their frequencies should be output in the following format:
"word" "count"
If no flags are specified, then the program should behave as if -l, -t, -d and -c were all given, and output will occur in the order: word list, total count, distinct count and most common word. For example:
bar 1
baz 1
foo 2
TOTAL: 4
DISTINCT: 3
MOST COMMON WORD: foo 2
In all cases, output will occur in this order. However, only a subset of it may be output depending on the flags specified by the user. For example, if the user specifies "-t -d", then only the total count and distinct count will be given (in that order).
There should be no extraneous blank lines in the output - either before, between or after the specified output lines.
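The fixed output order described above can be sketched in C as follows. This is only a minimal sketch, not a required design: the names print_report, struct word_count and the OPT_* bit flags are hypothetical, and it assumes the distinct words have already been counted and stored in the order in which the list should be printed.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical bit flags, one per command line option. */
enum { OPT_LIST = 1, OPT_TOTAL = 2, OPT_DISTINCT = 4, OPT_COMMON = 8 };

/* Hypothetical record: one distinct word and how often it occurred. */
struct word_count {
    const char *word;
    unsigned long count;
};

/*
 * Write the selected sections to out, always in the order
 * word list, total, distinct, most common word.
 * Nothing is printed for a section whose flag is not set,
 * so no blank lines can appear in the output.
 */
void print_report(FILE *out, unsigned flags,
                  const struct word_count *entries, size_t n)
{
    unsigned long total = 0, best = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        total += entries[i].count;
        if (entries[i].count > best)
            best = entries[i].count;
    }

    if (flags & OPT_LIST)
        for (i = 0; i < n; i++)
            fprintf(out, "%s %lu\n", entries[i].word, entries[i].count);

    if (flags & OPT_TOTAL)
        fprintf(out, "TOTAL: %lu\n", total);

    if (flags & OPT_DISTINCT)
        fprintf(out, "DISTINCT: %zu\n", n);

    /* One line per word that reaches the highest count (handles ties). */
    if (flags & OPT_COMMON)
        for (i = 0; i < n; i++)
            if (entries[i].count == best)
                fprintf(out, "MOST COMMON WORD: %s %lu\n",
                        entries[i].word, entries[i].count);
}
```

Calling print_report(stdout, OPT_TOTAL | OPT_DISTINCT, entries, n) would print only the TOTAL and DISTINCT lines, in that order, whatever order the flags were typed in.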
If the --help option is present, the program should print out a helpful message describing this interface and then exit without any further processing.
The name of the input file may be specified on the command line.
If not, the input should be taken from STDIN (this is very
important because our automated testing script will work
this way). For example:
cat inputFile | cs450words -t > outputFile
Output should be sent to STDOUT. Error messages should be sent to STDERR.
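One way to choose between a named input file and STDIN is sketched below. The helper name open_input is hypothetical, and the sketch assumes the caller passes NULL when no file name was given on the command line.

```c
#include <stdio.h>

/*
 * Return the stream to read words from.
 * path == NULL means no input file was named, so read STDIN.
 * On failure the reason goes to STDERR and NULL is returned.
 */
FILE *open_input(const char *path)
{
    FILE *in;

    if (path == NULL)
        return stdin;

    in = fopen(path, "r");
    if (in == NULL)
        perror(path);   /* perror writes to STDERR, not STDOUT */
    return in;
}
```

With this shape the rest of the program reads from a FILE * and never needs to know whether the data came from a file or from a pipe.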
If it helps, you may assume that the flags, when present, will appear in the order given above (i.e. -l comes before -t, which comes before -d, which comes before -c).
Your code can accept other arguments as desired, but these four must be supported.
You should call your executable cs450words.
IMPORTANT NOTE: This is the only input you can get from the user. You may not query the user for additional information.
You may find the getopt or getopt_long functions helpful. Here are some references: http://www.tac.eu.org/cgi-bin/man-cgi?getopt+3 and http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_22.html
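As an illustration, the interface could be parsed with getopt_long roughly as follows. This is a sketch under assumptions: it presumes a platform whose <getopt.h> provides getopt_long, and the names parse_cli, struct cli and the OPT_* bit flags are hypothetical rather than part of the assignment.

```c
#include <getopt.h>
#include <stddef.h>

/* Hypothetical bit flags, one per command line option. */
enum { OPT_LIST = 1, OPT_TOTAL = 2, OPT_DISTINCT = 4, OPT_COMMON = 8 };

/* Hypothetical result of parsing the command line. */
struct cli {
    unsigned flags;      /* OR of the OPT_* values requested */
    int help;            /* nonzero if --help was given */
    const char *infile;  /* NULL means read from STDIN */
};

/* Returns 0 on success, -1 if an unknown option was seen. */
int parse_cli(int argc, char **argv, struct cli *out)
{
    static const struct option longopts[] = {
        { "help", no_argument, NULL, 'h' },
        { NULL, 0, NULL, 0 }
    };
    int c;

    out->flags = 0;
    out->help = 0;
    out->infile = NULL;

    /* "ltdc": four short options, none of which takes an argument. */
    while ((c = getopt_long(argc, argv, "ltdc", longopts, NULL)) != -1) {
        switch (c) {
        case 'l': out->flags |= OPT_LIST;     break;
        case 't': out->flags |= OPT_TOTAL;    break;
        case 'd': out->flags |= OPT_DISTINCT; break;
        case 'c': out->flags |= OPT_COMMON;   break;
        case 'h': out->help = 1;              break;
        default:  return -1;  /* getopt_long already reported it on STDERR */
        }
    }

    /* The first non-option argument, if any, names the input file. */
    if (optind < argc)
        out->infile = argv[optind];

    /* No flags at all means the same as -l -t -d -c. */
    if (out->flags == 0)
        out->flags = OPT_LIST | OPT_TOTAL | OPT_DISTINCT | OPT_COMMON;

    return 0;
}
```

Because the options are collected into a bit mask, the order in which the user types them does not matter; the output order is decided later, when the report is printed.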
In addition to standardizing the command line interface and
the output format, we will standardize on the following
definition of a word:
A word is a string of alpha-numeric characters. All letters
in the input should be converted to lowercase. In addition,
some word-internal punctuation is allowed. Specifically,
the following six characters may occur in the middle of a word:
/ _ . - ' @
However, only one punctuation mark may occur between
any pair of alphanumeric characters (i.e. one in a row)
or it breaks the word (i.e. is treated like white space).
In other words, the only time punctuation should appear
in a word is when there is only one punctuation mark
in a row *and* it is one of the valid punctuation characters.
No punctuation mark may occur at the beginning of a word. There is one punctuation mark that may occur at the end of a word: the apostrophe (').
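The word definition above can be turned into a small scanner. The following is a sketch of one possible reading, not the reference implementation: the names is_inner_punct and next_word are hypothetical, it works on a string already held in memory, and it assumes that a trailing apostrophe is kept only when the character after it is not another punctuation mark.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if c is one of the six marks allowed inside a word. */
int is_inner_punct(int c)
{
    return c != '\0' && strchr("/_.-'@", c) != NULL;
}

/*
 * Copy the next word of text, lowercased, into out (at most cap - 1
 * characters plus a terminating NUL; cap must be at least 1).
 * *pos is the scan position and is advanced past the word.
 * Returns the word length, or 0 when no words remain.
 */
size_t next_word(const char *text, size_t *pos, char *out, size_t cap)
{
    size_t i = *pos, len = 0;

    /* A word cannot start with punctuation, so skip to an alphanumeric. */
    while (text[i] != '\0' && !isalnum((unsigned char)text[i]))
        i++;

    while (text[i] != '\0') {
        unsigned char c = (unsigned char)text[i];
        unsigned char next = (unsigned char)text[i + 1];

        if (isalnum(c)) {
            c = (unsigned char)tolower(c);
        } else if (is_inner_punct(c) && isalnum(next)) {
            /* A single allowed mark between two alphanumerics: keep it. */
        } else if (c == '\'' && !ispunct(next)) {
            /* A lone trailing apostrophe: keep it and end the word. */
            if (len + 1 < cap)
                out[len++] = (char)c;
            i++;
            break;
        } else {
            break;  /* anything else acts like white space */
        }

        if (len + 1 < cap)
            out[len++] = (char)c;
        i++;
    }

    out[len] = '\0';
    *pos = i;
    return len;
}
```

Under these assumptions, the input "Foo's e-mail--list dogs' @home" yields the words foo's, e-mail, list, dogs' and home: the doubled hyphen breaks the word, and the leading @ is dropped.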
It would be nice to handle some special cases. For example, a word ending in - followed by white space (tab, space or carriage return) should be joined to the word following the white space, with the hyphen removed. (It is actually a good test for whether your modular decomposition of the problem matches the problem domain.) This will not, however, be a part of our word definition. So, for automated testing it is *important* that this feature not be enabled.
Here are some examples.
These names *must* be exact!!! As we approach the automated testing phase, I reserve the right not to grade any file with the wrong name or with incorrect contents.