Problem statement:
You are given two CSV data sets:
(a) A course dataset containing details of courses offered
(b) A job description dataset containing a list of job descriptions
(Note: Each field of a job description record is demarcated by " ")
Design and implement a distributed recommendation system that uses these data sets to recommend the best courses for up-skilling for a given job description. You can use the data sets to train the system and hold out some job descriptions that are not in the training set for testing. How you select the necessary features and build the training step that matches courses to job profiles is left up to you.
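As one possible starting point for the feature selection mentioned above, the sketch below turns free text into term-frequency features and compares a job description with a course using cosine similarity. The stop-word list, the tokenization rule, and the function names are illustrative assumptions, not part of the assignment.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "with", "on"}


def tokenize(text):
    """Lowercase the text and keep tokens that are not stop words.

    '+' and '#' are kept so that skills such as 'c++' and 'c#' survive.
    """
    return [t for t in re.findall(r"[a-z0-9+#]+", text.lower()) if t not in STOP_WORDS]


def term_frequencies(text):
    """Return a bag-of-words feature vector as a term -> count mapping."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty vector matches nothing.
    return dot / (norm_a * norm_b)
```

A job description and a course that share most of their terms score close to 1, while texts with no terms in common score 0.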
These are the suggested steps to follow:
Step 1: Set up a Hadoop cluster and store the data sets on its data nodes.
Step 2: Implement a content-based recommendation system using MapReduce, i.e., given a job description, you should be able to suggest a set of applicable courses.
Step 3: Execute the training step of your MapReduce program using the data set stored in the cluster. You can use a subset of the data depending on the system capacity of your Hadoop cluster. You have to use an appropriate subset of features in the data set for effective training.
Step 4: Test your recommendation system using a set of requests that execute in a distributed fashion on the cluster. You can pick a set of 3-5 job descriptions in the data set to show how they are executed in parallel to provide corresponding course recommendations.
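To make the training step in Steps 2 and 3 more concrete, here is a minimal local sketch of a map function and a reduce function that build a TF-IDF weighted inverted index over courses. TF-IDF is only one possible weighting and is not prescribed by the assignment. The two-column layout (course_id, description) is a hypothetical assumption about the course data set, and run_local is a stand-in for the Hadoop shuffle, not a Hadoop API. On a real cluster the total number of courses would have to come from a counter or an earlier job.

```python
import csv
import io
import math
import re
from collections import Counter, defaultdict


def map_course(line):
    """Map step: one CSV course record -> (term, (course_id, tf)) pairs.

    Assumes a hypothetical two-column layout: course_id, description.
    """
    row = next(csv.reader(io.StringIO(line)))
    course_id, description = row[0], row[1]
    counts = Counter(re.findall(r"[a-z0-9+#]+", description.lower()))
    for term, tf in counts.items():
        yield term, (course_id, tf)


def reduce_term(term, postings, total_courses):
    """Reduce step: all postings of one term -> TF-IDF weighted posting list."""
    postings = list(postings)
    # Terms that occur in every course get idf = 0 and so carry no weight.
    idf = math.log(total_courses / len(postings))
    return term, {course_id: tf * idf for course_id, tf in postings}


def run_local(lines):
    """Local stand-in for the shuffle: group map output by term, then reduce."""
    lines = list(lines)
    grouped = defaultdict(list)
    for line in lines:
        for term, posting in map_course(line):
            grouped[term].append(posting)
    return dict(reduce_term(t, p, len(lines)) for t, p in sorted(grouped.items()))
```

With Hadoop Streaming, the same two functions could be wrapped in scripts that read records from standard input and write tab-separated key-value pairs to standard output.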
Output:
1. Document the design of your logic, including training, querying, and feature engineering. Submit a Word file with the details.
2. Set up a distributed Hadoop cluster with 2-3 data nodes and store the data sets on it. Include your Hadoop environment and data layout details in the Word document mentioned in (1) above. You can include screenshots if needed.
3. Code files with comments for your MapReduce implementation of training and query steps.
4. Demonstrate the training step and the parallel execution of multiple queries to generate recommendations. Submit a short video of the execution steps: a clear screen recording that shows the working code for training, followed by 3-5 parallel queries.
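The parallel query execution in item 4 could be prototyped locally before moving to the cluster. The sketch below scores courses against several job descriptions concurrently, using a thread pool as a stand-in for jobs submitted to the cluster. The index shape {term: {course_id: weight}}, the toy weights, and the function names are assumptions made for illustration only.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def recommend(index, job_description, k=3):
    """Score each course by the summed weights of the terms it shares with the query."""
    scores = defaultdict(float)
    for term in set(re.findall(r"[a-z0-9+#]+", job_description.lower())):
        for course_id, weight in index.get(term, {}).items():
            scores[course_id] += weight
    # Sort by descending score, then by course id for a deterministic order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [course_id for course_id, _ in ranked[:k]]


def run_queries(index, job_descriptions, k=3, workers=4):
    """Run all queries concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda jd: recommend(index, jd, k), job_descriptions))


# Toy index with made-up weights, in the shape {term: {course_id: weight}}.
TOY_INDEX = {
    "python": {"c1": 1.0, "c2": 0.5},
    "spark": {"c3": 2.0},
    "sql": {"c2": 1.5},
}
```

For example, run_queries(TOY_INDEX, ["Python and SQL developer", "Spark engineer"]) ranks course c2 ahead of c1 for the first query and returns only c3 for the second.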
You will be evaluated mainly on the data engineering and system implementation aspects, not on the accuracy of your recommendation engine.
Output instructions:
1. The output should be a zip file containing the PDF document, the working code files, and the demo video.
2. Do not attach data files, even if you used only subsets of the data because of system limitations.