## Description

# Introduction

In Part 1, the project explores the performance of different hash functions with two types of open addressing for collision management. You will explore the impact of the hash functions and the load factor on the number of collisions.

Part 2 introduces you to a new data structure, Bloom Filters. You are asked to implement the basic operations of a Bloom filter and to experimentally analyze the impact of the size of the Bloom filter and the number of hash functions used on the false positive rate of lookup operations. Both parts of the project ask you to prepare a report on your experimental results.

# Learning Objectives

- Explain practical differences between the hash functions and the collision management strategies.
- Identify the advantages and disadvantages of Bloom filters.
- Demonstrate how the different parameters used by Bloom filters impact performance and false positive rate.
- Give examples of applications that should use/should not use Bloom filters.
- Visualize and explain results generated by experiments in a clear and meaningful way.

# Part 1: HashTable

In Part 1, you will implement a HashTable data structure that holds a collection of String objects.

The table may contain up to *n *elements, none of which may be duplicates, and it must support insertion and lookup operations. The HashTable will use different hash functions, and your implementation and experimental work will observe and report differences between them.

Two open addressing approaches are used to resolve collisions in an insertion: linear probing and double hashing. In both approaches_{, }the index into the table is incremented by step size *delta *(see class slides for more details).

- In linear probing, the step size
*delta*is set to 1, and an insertion probes consecutive locations in the array to find a free location. - In double hashing, the step size
*delta=h*_{2}(*X*) is determined by a secondary hash function*h*_{2}. When inserting element*X*, the locations probed are (*h*_{1}(*X*) +*j h*_{2}(*X*)) mod*m*, j=0,1,2,…, where*h*_{1 }and*h*_{2 }are the primary and secondary hash functions, and*m*is the size of the table.

Note that no deletions have to be handled. Both collision management strategies need to ensure that probing does not turn into an infinite loop as the table fills up. No resizing of the hash table is done.

## 1.1 Hash Functions

Provided with the template/starter code is the Hasher class, which implements the hash functions, listed below. ** Do not modify this class**.

Each hash function is a static method that takes a String as input and returns a signed 32-bit integer hash value. Note that the string is converted to an integer inside the hash function.

public static int crc32(String str)

Returns the cyclic redundancy check (CRC) of the input string, using the Guava library implementation. The CRC is a checksum typically used to verify data integrity, but here we use it as a hash function.

public static int adler32(String str)

Returns the Adler-32__ __checksum__ __of the input string, using the Guava library implementation.

public static int murmur3_32(String str, int seedIdx)

Returns the 32-bit MurmurHash3 of the input string, using the Guava library implementation. MurmurHash3 requires a random seed as input. The Hasher class provides a static array of seeds to use, and the seedIdx argument indexes into this array.

public static int polynomial(String str, int primeIdx)

Computes a polynomial rolling hash of the input string. It uses polynomial accumulation to generate an integer and then uses the division method to generate the hash value.

Specifically, the function computes:

,

where *p *is a prime number and *c*_{1},…,*c** _{k }*are the characters in the string. Similar to MurmurHash3, the Hasher class provides a static array of prime numbers to use for

*p*, and the primeIdx argument indexes into this array.

## 1.2 HashTable Implementation

The template code includes the HashTable class. Your task is to implement the class methods listed below. Internally, the HashTable maintains an array of String objects of a fixed size, which is used for storing and locating elements of the table.

public HashTable(int capacity, String hashType)

HashTable constructor. Allocate the String array with the given capacity (i.e., size of the hashtable). The hashType argument determines which hash functions from the Hasher class to use, as well as the collision resolution strategy.

Possible values for this argument are “crc32-lp”, “crc32-dh”, “adler32-lp”,

“adler32-dh”, “murmur3-lp”, “murmur3-dh”, “poly-lp”, and “poly-dh”. The “-lp” suffixes indicate linear probing, whereas the “-dh” suffixes indicate double hashing.

Note: Only valid arguments will be passed to test your code.

public int size()

Return the number of elements currently inserted into the table.

public int capacity()

Return the maximum number of elements that can be stored in the table.

public float loadFactor()

Return the ratio of inserted elements to table capacity. Expected values between 0 and

1.

private int hash1(String str) & private int hash2(String str)

Compute the primary and secondary hash indices of the input string using the hash functions specified by the constructor’s hashType argument.

Refer to the table below for the functions to use. The returned index must be a valid index into the String array, that is, the possible values should be in the range [0, *m*-1], where *m *is the array length. The hash2() function is only needed for double hashing.

hashType | hash1(str) | hash2(str) |

“crc32-lp” | Hasher.crc32(str) | |

“crc32-dh” | Hasher.crc32(str) | Hasher.adler32(str) |

“adler32-lp” | Hasher.adler32(str) | |

“adler32-dh” | Hasher.adler32(str) | Hasher.crc32(str) |

“murmur3-lp” | Hasher.murmur3_32(str, 0) | |

“murmur3-dh” | Hasher.murmur3_32(str, 0) | Hasher.murmur3_32(str, 1) |

“poly-lp” | Hasher.polynomial(str, 0) | |

“poly-dh” | Hasher.polynomial(str, 0) | Hasher.polynomial(str, 1) |

Java does not have an unsigned integer type. Thus all of the hash functions in Hasher return a signed 32-bit integer. In Java, the modulus operator % returns a negative value when the dividend is negative. Instead, use the Integer.remainderUnsigned() method provided by Java, which interprets a signed integer as though it were unsigned.

public boolean insert(String str)

Attempt to insert the given string into the table, and return whether the insertion was successful. Unsuccessful insertions occur if the element is already inserted, if the table is full, or if the collision resolution runs into an infinite loop. It is your responsibility to detect infinite loops if and when they occur. You must also keep track of the number of collisions encountered during the insertion, to be returned by numCollisions().

True is returned for a successful insertion and false is returned for an unsuccessful insert.

public int numCollisions()

Return the number of collisions encountered during the last call to insert(), even if the insertion failed.

public boolean contains(String str)

Return whether the given string is already contained in the table.

# Testing your code

We provide a set of *white box unit tests *for the HashTable class. We strongly recommend that you proceed to Section 1.3 only after passing all of these tests. The unit tests verify both the internal state of the hash table and the return values of its methods as insertions are performed. Due to the white-box nature of the tests, your HashTable implementation must conform exactly to the specification described in this handout. In particular, your collision resolution logic must correspond to the exact open addressing schemes described on Page 2.

The project root should contain the following provided files: An input directory, an expected directory, a HashTableTest.java class containing JUnit methods and nested classes, and a Utils.java utility class. Obtain the input and expected directories by downloading and decompressing the local_data.zip file from BrightSpace. Before running the tests, set Utils.ROOT to the absolute path of the project root on your computer. Below, we describe the contents of these files:

- The input directory contains
*input*text files. Each file contains a list of six-letter lowercase strings, one string per line. - The expected directory contains pairs of
*output*and*state*text files. Each line of an output file contains three or four space-delimited values. Each line of a state file contains a sequence of space-delimited six-character strings. - The HashTableTest class contains the test cases as JUnit methods. The test suite can be run using IntelliJ’s JUnit runner. We set reasonable timeouts for each test case; if your code takes longer to execute, you likely have an infinite loop. Below, we will further describe the nested classes InsertTest and ContainsTest, which respectively implement
*insertion tests*and*containment tests*. - The Utils class contains methods which compare two given values and throw an

AssertionError if the values differ.

Each test case contains a collection of *tests *that check one aspect of the HashTable implementation, described in the test case’s DisplayName. Associated with a test are four properties: inName, capacity, hashType, and outName. When each test starts running, it prints a message listing its properties. The test loads the input file ${inName}.txt, instantiates a hash table with the associated capacity and hashType, loads the output file

${outName}_output.txt and the state file ${outName}_state.txt, inserts the input strings into the hash table one by one, and checks the hash table against the output and state files after each insertion.

- The value returned by the𝑘 insert() call matches the𝑘 first value in the𝑘 𝑘th output line; An
*insertion test*will, after the th insertion, check that - size() returns the second value in the th output line;
- loadFactor() returns a value within 1e-4 of the third value in the th output line;
- numCollisions() returns the fourth value in the th output line
^{[1]}; and - The table array matches the 𝑘th line of the state𝑘 file, where null values in the array correspond to horizontal lines in the state file.
- Calling contains() with each string in the hash table𝑘 returns true; and A
*containment test*will, after the th insertion, check that - Calling contains() with some strings (listed in inputs/small_excluded.txt) not present in the hash table returns false.

**Note: We use JUNIT test cases for this project. For this you have to add JUNIT 5.2 to your classpath as well. You can do this from the project structure panel. ( File > Project Structure > Modules. (Shortcut – cmd + ;)). We trust that you will be able to resolve this on your own.**

## 1.3 Comparing Experimental Performance

In this section we will be evaluating/measuring the performance of different hash table implementations (hash function – collision management pairs).

You are provided two input files – randstrs.txt, words_ordered.txt. These are on Brightspace as they were too large for Vocareum to handle. These files have strings which you will use to evaluate your hash table implementations and then analyze your findings to answer the questions asked in the report section. For the report, remember to draw graphs and analyze the results.

Modify your main()method in HashTable.java to generate the data for the experimental analysis of every pair of hash function and collision management method.

For each data file record the number of collisions for every pair of hash function-collision management method.

- You will insert strings (one by one) into a hash table of size m until the load factor is 0.9.
- To record the number of collisions, partition the load factors from 0 to 0.9 into 90 bins(intervals) and sum up collisions per bin. (more in the example below).
- Note that double hashing needs to handle
**not**finding a free location even when there is one.

An example with CRC32 hash function with linear probing (hashType=”crc32-lp”) has been provided below.

- Given an input file and capacity
*m*, first, create a HashTable of size*m*with hashType=”crc32-lp”. - Next, read and insert strings from the input file into the HashTable while the load factor is < 0.9. To analyze the relationship between load factor and collisions, proceed as follows:

○ Divide the range of load factors into 90 bins and accumulate the total number of collisions in each bin.

■ For example, bin 0 covers the load factor range [0, 0.01).

■ All collisions for load factors in the range [0, 0.01) are accumulated in bin

0.

■ Be sure to include collisions for insertions that fail as well.

○ For each insertion, record the load factor prior to the insertion to find the bin. Add the number of collisions for the insertion to the corresponding bin.

Repeat this process for each hash function type and collision management method. This will result in 4(hash functions) x 2(collision management methods) = 8 sets of 90(bins) values.

Print these values as a table in Comma-Separated Value (CSV) format with 9 columns and 91 rows.

- The first line should print the hashType parameter for each column.
- The first column should print the load factor bin’s upper-bound, printed to 2 decimals, as shown in the example below (the X’s in the last two rows are placeholder values):

bin,crc32-lp,crc32-dh,adler32-lp,adler32-dh,murmur3-lp,murmur3-dh,poly-lp,poly-dh, 0.01,1,1,1,1,1,1,2,2,

0.02,3,3,5,5,6,6,7,6,

0.03,7,6,8,10,11,11,11,9, .

.

.

0.89,X,X,X,X,X,X,X,X,

0.90,X,X,X,X,X,X,X,X,

Note: There should be no new line character and the end of the last line. Do not change the order of the hashTypes from above. The order is crc32-lp, crc32-dh, adler32-lp, adler32-dh, murmur3-lp, murmur3-dh, poly-lp, poly-dh.

The first line should begin with “bin” to indicate the first column is the bin, followed by 8 strings indicating the 8 pairs of hash function and collision management method. The second line should start with 0.01 for the first bin [0, 0.01), followed by a list of 8 numbers representing the numbers of collisions corresponding to 8 pairs of hash function and collision management method for the bin [0, 0.01). The next 89 lines represent the other bins, and collisions with respect to the bin for the corresponding hash function – collision management methods pair.

You will run your program with two command line arguments: the input filename and capacity *m*, in that order.

Perform the above procedure and print the results to the standard output stream System.out.

Save the output to a .csv file by using the output redirection operator >:

$ java HashTable input_file.txt 30000 > output_file.csv

The above command line means after you compile HashTable.java, you need to specify two arguments to run the HashTable. The first argument is the input_file.txt, which contains the

String objects. The second argument is a positive integer, indicating the table capacity.

Do not change the order of the hash functions in the csv file. Provided with the template code are two input files:

- txt, containing 1,000,000 unique strings of 32 random alphanumeric characters.
- txt containing 370,107 unique English words in sorted order..

Run your program on each of these two input files twice, with *m *chosen as 30,000 and 300,000, and save each of the resulting outputs in a separate .csv file. You should end up with 4 .csv files in total. These four files do not need to be submitted with your code.

# Report

Using spreadsheet software or another external library (e.g., matplotlib), create 4 different line charts (one from each of the .csv files you generated). These charts should be included into the report along with a discussion answering the question below. Each chart should ● plot all 8 hashTypes as separate lines.

- color the lines according to the hash function, and make linear probing solid lines and double hashing dashed lines. For example, “crc32-lp” and “crc32-dh” should be the same color, but the second one should be dashed.
- have the x-axis be the load factor and the y-axis be the number of collisions in logarithmic scale.
- include titles, axis labels, and at least one legend. ● be readable and clear.

Your report should include all relevant plots. Include your responses to the following questions in a clear and precise way.

- What do the experimental results tell you about the different hash functions?
- Do the experimental results provide insight into the performance of the two collision handling strategies? Explain and justify any conclusions made.
- What are your recommendations on the maximum load factor? Justify your recommendations.
- Clearly explain how your code avoids an infinite loop when using double hashing.
- Do infinite loops appear on your graphs? Explain.
- Are your recommendations on the load factor the same for all hash functions and both collision strategies? Explain.

Explain your reasoning clearly and in a well-articulated way. Upload your report_p3_part1.pdf file to Gradescope before the deadline for Part 1.

# Code Requirements for Experimental Work

The Hasher class relies on the Guava library, included with the template code. To compile it, you will need to specify the path to the .jar file as part of the classpath:

$ javac -classpath .:guava-31.1-jre.jar Hasher.java

Alternatively, we recommend you set the CLASSPATH environment variable to simplify the command:

$ export CLASSPATH=.:guava-31.1-jre.jar

$ javac Hasher.java

To do this on IntelliJ go to the File > Project Structure > Modules. (Shortcut – cmd + 😉 Under the export click the + sign and add guava-31.1-jre.jar then click apply.

Compile the remaining classes similarly:

$ javac HashTable.java

To run the experiments in section 1.3, run the HashTable program with two arguments: the input filename and table capacity *m*. Redirect the output to a csv file as shown below:

$ java HashTable input_file.txt 30000 > output_file.csv

The above examples assume the environment variable CLASSPATH has been set.

# Structure of the Report

**Overall report guidelines and expectations**

- The questions the report for part 1 should address are stated in section 1.3.
- The report needs to be typed (use LaTeX or Word).
- Your name (Last, First), your Purdue email, and your section number ([email protected]:30 or [email protected]:30) need to be on top of page 1.
- The RC statement needs to be on page 1 under your name.
- All illustrations and figures need to be clearly readable. They cannot be handwritten or drawn, for example, on an iPad. Handwritten tables are not acceptable.
- Reports are uploaded to Gradescope and code is uploaded to Vocareum.
- The suggested length of the report is up to 6 pages, with font size 12 and 1.5 spacing. If you have additional supporting tables exceeding the page limit, you can include them as an Appendix (which has no page limit). However, only your report will be graded. The appendix may be consulted by the grader for additional information.