Computer Science 1511
Computer Science I
Programming Assignment 6
Text Files (35 points)
Due Tuesday, November 23, 1999
Introduction
In this assignment you will read and analyze a text file.
The problem is to read a text file character by character and print out
a "token map" of the file, where tokens are meaningful objects from the
file like words, numbers or punctuation marks.
In the token map you will print out a message Xlength for each word
or number (and just X for each punctuation mark), where length is the
number of characters in the token and X is:
- 's' - if the token is a word containing three or fewer letters
- 'L' - if the token is a word that contains four or more letters and
starts with a capital letter
- 'l' - if the token is a word that contains four or more letters and
starts with a lower case letter
- 'n' - if the token is a number
- 'p' - if the token is a punctuation mark
After each number token you should print out, in parentheses, the value
of that number. Also, at the end of the program you should print out the sum
of the values of all the number tokens in the file.
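The choice among 's', 'L' and 'l' can be sketched as a small function. This is only an illustration, assuming the program is written in C (the assignment does not name a language); the function name word_code and its parameters are hypothetical.

```c
#include <ctype.h>

/* Hypothetical helper: returns the code letter for a word token,
   given the word's first character and its length in letters. */
char word_code(char first, int length)
{
    if (length <= 3)
        return 's';                 /* three or fewer letters */
    /* four or more letters: the case of the first letter decides */
    return isupper((unsigned char)first) ? 'L' : 'l';
}
```

For instance, word_code('C', 12) gives 'L' for "Constitution", and word_code('W', 2) gives 's' for "We", since short words are 's' regardless of case.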
For example, suppose the text file contains the following:
Constitution of the United States of America
(In Convention, September 17, 1787)

Preamble
We the people of the United States, in order to form a more
perfect union, establish justice, insure domestic tranquility,
provide for the common defense, promote the general welfare, and
secure the blessing of liberty to ourselves and our posterity, do
ordain and establish the Constitution of the United States of
America.
Your program should produce the following response:
1: L12 s2 s3 L6 L6 s2 L7
2: p s2 L10 p L9 n2(17) p n4(1787) p
3:
4: L8
5: s2 s3 l6 s2 s3 L6 L6 p s2 l5 s2 l4 s1 l4
6: l7 l5 p l9 l7 p l6 l8 l11 p
7: l7 s3 s3 l6 l7 p l7 s3 l7 l7 p s3
8: l6 s3 l8 s2 l7 s2 l9 s3 s3 l9 p s2
9: l6 s3 l9 s3 L12 s2 s3 L6 L6 s2
10: L7 p
11:
Sum of numbers in file: 1804
Note that:
- Words consist of sequences of the characters
'A' to 'Z' or 'a' to 'z'.
- Numbers are sequences of characters consisting of digit characters
'0' to '9'.
- Punctuation characters are: . , ! ? ; : ( ) " '
- White-space (spaces, newlines, and tab characters) should be ignored.
- You do not have to deal with any other characters.
- You will read the data from a test file that you create.
You should create more than one test file to test all aspects of
your code.
- You may write the "token map" to an output text file.
- You may (if you wish) print an extra line number at the end of the
output, as line 11 of the example above does.
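The character classes listed above could be tested with helpers like the following. This is a sketch assuming C; the function names are hypothetical, and explicit ranges are used so that only the characters named in the notes are accepted.

```c
#include <string.h>

/* Hypothetical helpers; ch is a character value as returned by getc(). */

int is_word_char(int ch)
{
    return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
}

int is_digit_char(int ch)
{
    return ch >= '0' && ch <= '9';
}

int is_punct_char(int ch)
{
    /* strchr() would also match the terminating '\0', so rule it out */
    return ch != '\0' && strchr(".,!?;:()\"'", ch) != NULL;
}
```

Any character for which all three functions return 0 is either white-space or a character the assignment says you need not handle.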
How To Proceed
I suggest that you proceed in stages:
- Implement the program given in class to count the number of
characters in a line (it is in the on-line class notes) and run it to
get practice.
- Change the program to print out the line number 1 before the first
line and then print out a new line number after each newline
character is encountered.
- Change the program to print out a temporary code w every time a word
starts, an n every time a number starts, and a p every time you
encounter punctuation. (You are at the start of a word or number if
the current character belongs to one and the previous character is NOT
part of a word or number.)
- Add code to count the length of the word or number (you will need
mechanisms to determine that a word or number continues to be read).
Print out these lengths at the end of the word/number.
- When printing out the length of words, print out both the length
and an l or s for a long or short word. (Then get rid of the temporary
w code by simply not printing it out).
- When starting a word, store the first character of that word. Use
this character to print out an L instead of an l when appropriate.
- While processing a number add code to determine the integer "value"
of the number (see the end of the Chapter 7 notes). Print this value
after printing out the length of numbers.
- Add code to add the values of these numbers.
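The last two stages (finding the value of each number and adding the values up) can be sketched in isolation as follows. This is an illustration in C under the assumption that characters are read with getc(); the function name sum_numbers is hypothetical, and the sketch deliberately ignores words, punctuation and line numbers. It may differ from the method in the Chapter 7 notes.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: reads `in` character by character and returns
   the sum of the values of all digit sequences it contains. */
long sum_numbers(FILE *in)
{
    long sum = 0;       /* running total of all numbers seen so far */
    long value = 0;     /* value of the number currently being read */
    int in_number = 0;  /* nonzero while inside a digit sequence */
    int ch;

    while ((ch = getc(in)) != EOF) {
        if (isdigit(ch)) {
            if (!in_number) {       /* first digit of a new number */
                in_number = 1;
                value = 0;
            }
            value = value * 10 + (ch - '0');
        } else if (in_number) {     /* the number has just ended */
            sum += value;
            in_number = 0;
        }
    }
    if (in_number)                  /* file ended in the middle of a number */
        sum += value;
    return sum;
}
```

Each new digit shifts the value read so far one decimal place to the left, so reading '1' and then '7' gives 0*10+1 = 1 and then 1*10+7 = 17.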
While you are free to design your program any way you wish, you must follow
good top-down design principles.
For example, you might write your program so that each time the
start of a word or number is read, a function (or functions) is
called to read to the end of that word or number.
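One possible shape for such a function is sketched below, assuming C and input read with getc(). The names word_length and is_letter are hypothetical. The character that ends the word is pushed back with ungetc() so that the main loop still gets to see it (it might be punctuation or a newline).

```c
#include <stdio.h>

static int is_letter(int ch)
{
    return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
}

/* Hypothetical helper: call this after the first letter of a word has
   been read from `in`. Reads to the end of the word and returns the
   total length of the word, counting that first letter. */
int word_length(FILE *in)
{
    int length = 1;     /* the first letter was read by the caller */
    int ch;

    while ((ch = getc(in)) != EOF && is_letter(ch))
        length++;
    if (ch != EOF)
        ungetc(ch, in); /* leave the terminating character unread */
    return length;
}
```

A matching function for numbers would follow the same pattern, with the main loop deciding which one to call based on the first character of the token.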
What To Hand In
Hand in a lab report with a copy of each of your test data files and the output
for each.
Include a second copy of each test data file.
On this copy underline each of the words and the numbers using different colors
of ink (for example, you might underline short words in red, long words
starting with capital letters in green, long words starting with lower case
letters in blue, numbers in black and punctuation in purple).
Also write a value indicating the length of the word/number over the word/number.
EXTRA CREDIT
2 extra points - make it so that one single quote character may appear
in a word (though not as the first character).
For example, don't would count as one word of length 5 rather than as
one word of length 3, a punctuation mark, and then another word of length 1.
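The single-quote rule can be sketched as follows. For brevity this illustration works on a string rather than a file; it assumes C, and the function name word_length_with_quote is hypothetical.

```c
static int is_letter(int ch)
{
    return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
}

/* Hypothetical helper: returns the length of the word at the start of s,
   allowing at most one single quote, and never as the first character. */
int word_length_with_quote(const char *s)
{
    int length = 0;
    int seen_quote = 0; /* becomes 1 once the word's one quote is used */

    while (is_letter(s[length]) ||
           (s[length] == '\'' && length > 0 && !seen_quote)) {
        if (s[length] == '\'')
            seen_quote = 1;
        length++;
    }
    return length;
}
```

With this sketch, "don't" has length 5, while a quote at the very start of the text gives length 0, meaning no word starts there and the quote is ordinary punctuation.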
2 extra points - allow multiple dashes (and ONLY dashes) as in --
(2 consecutive dashes) or --- (three consecutive dashes) to be treated
as a single punctuation mark.
If such a punctuation mark has more than one character, your program
should print out not only p but also the number of characters in the
mark. A single-character punctuation mark is still printed as just p.
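The printing rule for punctuation could be sketched like this, assuming C. The function name format_punct is hypothetical, and the output form p2, p3, and so on is an assumption based on the Xlength format used for words and numbers.

```c
#include <stdio.h>

/* Hypothetical helper: writes the token-map text for a punctuation mark
   of the given length into out, which has room for size characters. */
void format_punct(int length, char *out, size_t size)
{
    if (length > 1)
        snprintf(out, size, "p%d", length); /* e.g. "p3" for --- */
    else
        snprintf(out, size, "p");           /* single character: just "p" */
}
```

The length passed in would be the number of consecutive dashes read, or 1 for any ordinary punctuation character.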