Assignment 03

DUE: Tuesday, April 30, 2024 by 11:59 p.m.
(willing to accept this without penalty by May 8 at 11:59)

CS209 - Data Structures and Mathematical Foundations
Skidmore College
Instructor: Michael Eckmann

=============================================================================

0. save paper.txt, words.txt, mostCommon5001.txt and config.txt to a local directory and make sure your python
   code files are in the same directory for this assignment.

1. You are to write a program to create a spell checker.  You must be able 
   to store and search words in near constant time utilizing hash tables.  You 
   must write your own code for the hash table (you are encouraged to start from
   the hash table code we wrote in class); you mustn't use python's dictionary built-in types.
   You may start from the htqp.py file containing the class hashtable_qp code posted on April 15
   on class notes page.
   
   You must implement your hash table class with double hashing collision strategy.
   
   You may assume that table sizes in the config file are the larger of twin primes.
   Notice these two pairs of twin primes 600071, 600073 and 
                                         10037,  10039
          
   Your program must read a configuration file named config.txt that contains 
   five lines in order:

1. Hash table size 1
2. Input text file name 1
3. Hash table size 2
4. Input text file name 2
5. Check text file name

Example configuration file:

TABLESIZE1 600073
ALLWORDS words.txt
TABLESIZE2 10039
COMMONWORDS mostCommon5001.txt
CHECKFILE paper.txt

Note: you can expect these five lines to be in that order, therefore you can just 
ignore the first word on each line.

--------------------------------------------------------------------------------------

For this assignment you may not use the hash functions we wrote in class exactly as they are.
You must write 2 hash functions different from the ones we wrote in class.  They should function well enough to cause not too many collisions.  Read on for suggestions.

Note well:  Since you will want the words spread out well in the hash table and the large hash table will be of size > 600000 you'll want the words to hash to a large variety of values 
0 .. 600072.  If you used the first hash function that we wrote in class that simply adds the
ord of each character together (and then mods by the table size of course), you can expect
to get many of your words hashing to a small range of values (that are all low).  
Notice that even long words will have quite small hash values: 'ambidextrous' would hash 
to 1303.  The bulk of your hash values with be less than 1000.  That would cause tons of collisions and hence poor performance.

Similarly the one that we wrote in class that adds in a division and even the one that multiplies
the sum by the last character will aslo generate numbers in too small a range and cause poor 
performance.  The only one that we wrote that generates large numbers (before modding by the
table size) is the one that concatentes the numbers into a string and then takes the int version
of it.

Double hashing requires two hash functions.  For the first hash function, a suggestion is to come up with a way to generate large values (even values larger than 600000 because when modded by table size it will be a valid index) from words.  Not only should you make these numbers large, they should be fairly unique.  Those two features will hopefully cause you to have few collisions.
The second hash function should generate a large variety of values as well, but it is only used if there is a collision with the first hash function.

--------------------------------------------------------------------------------------

Your program should create 2 hash tables.  One hash table of size TABLESIZE1 containing all the words in the ALLWORDS file and the other hash table of size TABLESIZE2 containing all the words in the COMMONWORDS file.  You are then to read the CHECKFILE file one line at a time.  On each line may be multiple words and you should look up each word in the ALLWORDS hash table.  If all of the words in that line exist in the ALLWORDS hash table, then show nothing to the user for that line, and continue to the next line of CHECKFILE.  If one or more of the words in the line are not found in the ALLWORDS file, then those words are considered to be misspelled.  Show the line number, the original line and possible correct word spellings (see section titled HOW TO FIND POTENTIAL CORRECT SPELLINGS) for each of these words in turn as close as you can to the example below.  Then continue on to the next line of CHECKFILE.

--------------------------------------------------------------------------------------

Example output of a line with 2 words incorrectly spelled:

Line number 2013
the time for all Goood men to come to the akd of 
Goood is spelled incorrectly, possible correct word spellings are:
good
good
good
good

Line number 2013
the time for all Goood men to come to the akd of 
akd is spelled incorrectly, possible correct word spellings are:
ad
add
aid
and

--------------------------------------------------------------------------------------

HOW TO FIND POTENTIAL CORRECT SPELLINGS

Given an incorrectly spelled word, make it all lowercase by using something like:
word = word.lower() 
and then generate possible correct word spellings for the word by doing all of the following.  Utilize the code you wrote for lab 11 for this assignment.

1. create all words that are one letter shorter than the original word by removing each 
   letter
2. create all words that are one letter longer than the original word by adding one each, 
   'a' through 'z', before each letter in the word (and after the last letter in the word)
3. create all words that have the same length as the original word but have one consecutive 
   pair of letters swapped
4. create all words that have the same length as the original word but have one letter 
   replaced by a through z
5. create all pairs of words from the word by separating the original word into two in all 
   possible locations (that is, check if the user mistakenly didn't type a space) --- but IMPORTANTLY only display if BOTH of the pair of words are found as actual words in COMMONWORDS file

Look each of these up in the COMMONWORDS hash table and if they exist, display them.  The reason you are to look these up in the COMMONWORDS hash table instead of the ALLWORDS hash table, is because the ALLWORDS hash table contains so many uncommon words that the procedure above would generate many words that the user probably will not be interested in seeing.

However, if NONE of your manipulations exist in the COMMONWORDS file, then check them in the ALLWORDS file and suggest those.

Hints

1. The config file must be read line by line and you can split each line with a 
space " " as the delimiter.  The split method already works this way.
e.g. 
line = "TABLESIZE1 600073"
list_of_items = line.split()
# list_of_items would be: ['TABLESIZE1', '600073']

2. The two word list files must be read line by line.  Each line is a word.  You should use strip() to remove the '\n' character at the end of each line.

e.g.
#if word = 'hello\n'
word = word.strip()
# now word contains 'hello' without the \n character

3. I suggest reading the CHECKFILE file which will be spellchecked, line by line and 
for each line split it by using something like the following:

#if the variable containing the line is named line then the 
#following code will split it into separate words

list_of_words = re.split('\s|\.|,|\-|\?|!|;|\$|\*|\(|\)', line)

Note: you will need to loop through the list_of_words list returned by split in each 
line and some of these might be empty string which you are to ignore.

4. The words in the COMMONWORDS file contain only lowercase words.  Therefore, for words 
   that you determine are misspelled, you should use lower() on them BEFORE generating 
   the new possible spellings from them.  

   Suppose you have a string variable named word which contains your misspelled word (that 
   is you already searched for it in the ALLWORDS hash table and it wasn't there), you
   should then do:

   word = word.lower()
   
   # then pass this word into the functions that add a letter, remove a letter, 
   # swap consecutive letters, etc. so that you are only using words with lowercase letters

5. The words in the ALLWORDS file (the one containing 479623 words) contain words that are 
   capitalized and even words that contain apostrophes, hyphens, etc. so you should be   
   searching the ALLWORDS hash table for the word as is (not converted to lowercase first) and if found,
   consider it spelled correctly.  However if it is not found, then convert it to lowercase and search
   ALLWORDS again.  If it is found this time, consider it spelled correctly.  If it not found
   either as is or lowercased then consider it not spelled correctly.  The reason for this is words at the 
   beginning of sentences are capitalized and they may be spelled correctly but do not appear in ALLWORDS file
   capitalized --- they may only appear in the ALLWORDS file all lowercase.
   

--------------------------------------------------------------------------------------

Extra Credit:

Don't generate duplicate words for the suggested spellings list (or remove duplicates from being displayed from the suggested spellings list before output.)  You could consider using python set data type.

Try out at least three different hash functions for hash function 1 (all that should work fairly well - that is generates a variety of values in the range up to 600000).  Count up the collisions for each when building the all words hash table.  Report on the number of collisions for each.
You then should consider three different hash functions for hash function 2 and use them with the best of the hash function 1.  Then report on the number of collisions caused by each of the hash function 2's.

=======================================================================

Submit .py file(s) for ALL the code you wrote for this assignment (and lab 11) on the Spring
I should be able to save your .py file(s) in a folder on my computer and as long as my folder already contains the config.txt, and the 3 text files specified in config.txt I should be able to run your program without any edits.

Note: I should be able to run your program using a different config file (e.g. one that uses
a different size for the common words hash table and a different file name for the common words file.) and it should work.