This question is on frequent itemset mining. Implement the Apriori algorithm to mine frequent itemsets. Apply it to the dataset: http://fimi.uantwerpen.be/data/webdocs.dat.gz. You may assume all item IDs are non-negative integers.
a. Please name your bash file RollNo.sh. For example, if CS19M100 is your roll number,
your file should be named CS19M100.sh. Executing the command “./RollNo.sh retail.dat
X -apriori <filename>” should generate a file filename.txt containing the frequent
itemsets at a support threshold of >=X%, mined with the Apriori algorithm. Note that X is a
percentage, not an absolute count. Your implementation must ensure that the
transactions are not loaded into main memory: you may not parse the complete
input data and store it in an array or similar data structure.
However, the frequent patterns and candidate sets can be stored in memory. (20 points)
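The constraint above (transactions stay on disk, candidates and frequent itemsets stay in memory) can be illustrated with a minimal Python sketch. It is one possible shape of a solution, not a reference implementation: the function names `read_transactions`, `generate_candidates` and `apriori` are hypothetical, the input is assumed to hold one whitespace-separated transaction per line, and no effort is made to be fast.

```python
from itertools import combinations


def read_transactions(path):
    """Yield one transaction at a time as a sorted tuple of unique item IDs."""
    with open(path) as handle:
        for line in handle:
            items = sorted({int(tok) for tok in line.split()})
            if items:
                yield tuple(items)


def generate_candidates(frequent, k):
    """Join frequent (k-1)-itemsets that share a (k-2)-prefix, then prune."""
    ordered = sorted(frequent)
    candidates = set()
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if a[:k - 2] != b[:k - 2]:
                break  # sorted order: no later itemset shares this prefix
            cand = a + (b[-1],)
            # Apriori pruning: every (k-1)-subset must itself be frequent.
            if all(sub in frequent for sub in combinations(cand, k - 1)):
                candidates.add(cand)
    return candidates


def apriori(path, support_percent):
    """Return all itemsets whose support is >= support_percent (in %)."""
    # Pass 1: count transactions and single items in one scan of the file.
    n = 0
    counts = {}
    for transaction in read_transactions(path):
        n += 1
        for item in transaction:
            counts[(item,)] = counts.get((item,), 0) + 1

    def is_frequent(count):
        # Compare count/n >= X/100 without dividing.
        return count * 100 >= support_percent * n

    frequent = {s for s, c in counts.items() if is_frequent(c)}
    result = sorted(frequent)
    k = 2
    while frequent:
        candidates = generate_candidates(frequent, k)
        if not candidates:
            break
        counts = dict.fromkeys(candidates, 0)
        useful = {item for cand in candidates for item in cand}
        # Pass k: re-read the file from disk; only the counts live in memory.
        for transaction in read_transactions(path):
            kept = [item for item in transaction if item in useful]
            for subset in combinations(kept, k):
                if subset in counts:
                    counts[subset] += 1
        frequent = {s for s, c in counts.items() if is_frequent(c)}
        result.extend(sorted(frequent))
        k += 1
    return result
```

Each level re-scans the file instead of caching transactions, which is what keeps memory use bounded by the candidate set. Enumerating every k-subset of a long transaction is expensive; a competitive submission would likely replace that inner loop with a hash tree or trie over the candidates.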
filename.txt should strictly follow the following format.
i. Each frequent itemset must be on a new line.
ii. The items must be space separated and in ascending order of item IDs.
Your grade will be (F-score)*20
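The output format and the grading rule can be sketched as follows. The handout does not define the F-score, so this sketch assumes the standard F1 measure (harmonic mean of precision and recall) computed over whole itemsets; the helper names `format_itemsets` and `f_score` are hypothetical.

```python
def format_itemsets(itemsets):
    """One itemset per line, items space separated in ascending order."""
    return "".join(
        " ".join(str(item) for item in sorted(itemset)) + "\n"
        for itemset in itemsets
    )


def f_score(predicted, expected):
    """F1 between two sets of itemsets (assumed definition, see lead-in)."""
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)
```

Under this assumed definition, an itemset counts as correct only if it matches a reference itemset exactly, so a line with items out of order or with a missing item would count as both a false positive and a false negative.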
b. Evaluate how the running time of the Apriori algorithm grows as the support
threshold changes. Executing the command “./RollNo.sh retail.dat -plot” should generate a plot
using matplotlib where the x axis varies the support threshold and the y axis shows the
corresponding running times. It should plot the running times of the Apriori algorithm at
support thresholds of 10%, 25%, 50%, 70% and 90%. Explain the results that you
observe. (20 points)
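A minimal sketch of the timing and plotting step for part b is given below. It assumes a callable `run(x)` that performs one complete Apriori run at threshold x% (for example by invoking your compiled binary); that callable, the function names and the output file name `apriori_times.png` are hypothetical and not prescribed by the assignment.

```python
import time

THRESHOLDS = [10, 25, 50, 70, 90]  # support thresholds (%) listed in part b


def time_runs(run, thresholds=THRESHOLDS):
    """Return the wall-clock seconds taken by run(x) for each threshold x."""
    timings = []
    for x in thresholds:
        start = time.perf_counter()
        run(x)
        timings.append(time.perf_counter() - start)
    return timings


def plot_timings(thresholds, timings, out_path="apriori_times.png"):
    """Save a line plot of running time against support threshold."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display is required
    import matplotlib.pyplot as plt

    plt.figure()
    plt.plot(thresholds, timings, marker="o")
    plt.xlabel("Support threshold (%)")
    plt.ylabel("Running time (s)")
    plt.title("Apriori running time vs. support threshold")
    plt.savefig(out_path)
    plt.close()
```

A typical use would be `plot_timings(THRESHOLDS, time_runs(run))`. Timing each threshold only once is noisy, so averaging several runs per threshold may give a more trustworthy curve.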
c. Efficiency Competition: We will hold a competition among all submitted
implementations of the Apriori algorithm. The fastest submission gets full points. If X =
(running_time_of_fastest_submission/running_time_of_your_submission), then your
score will be X*(total_marks). You take part in this competition only if you get full
points in part (a), i.e., you have a correct implementation of the Apriori algorithm. (20
points)
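The scoring rule in part c is plain arithmetic. The sketch below works one example, assuming total_marks is the 20 points stated for this part; the function name and the running times used are made up for illustration.

```python
def competition_score(fastest_seconds, your_seconds, total_marks=20):
    """Score = (fastest running time / your running time) * total_marks."""
    ratio = fastest_seconds / your_seconds
    return ratio * total_marks


# Hypothetical timings: the fastest submission takes 10 s, yours takes 40 s.
print(competition_score(10, 40))  # → 5.0
```

The ratio is 1 only for the fastest submission, so every other submission scores proportionally less: a run four times slower than the fastest earns a quarter of the marks.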
Bash scripts you need to provide:
• compile.sh, which compiles the code for all of your implementations. Specifically, running
./compile.sh in your submission folder should create all the binaries that you require. Any
optimization flags, such as -O3 for g++, should be included here.
• RollNo.sh as specified earlier
Submission Instructions:
• Submit all your files as a zip file. The root folder should have the same name as the zip file. This
folder should contain all the source files and all the bash scripts. In addition, it should also contain
a README.txt describing all the files you bundled, and the report explaining your results for part b.
• <Rollno>.sh is the main script that will be used in parts a, b, and c.
• Since your submissions will be auto-graded, it is essential that they conform to
the format specified.
Compiler Specification:
• GCC version 7.1.0
• Java version 1.8
• Python3 version 3.6.5
• Python2 version 2.7.13