Practise the principle of MapReduce and try out an example on Azure HDInsight.
Demonstrate an understanding of the potential of MapReduce in speeding up tasks on big data sets.
Carry out the tasks described below and answer the questions in your submission.
This task should be completed in the fourth lab.
This task should take no more than 2 hours to complete.
Discuss your answers with the tutorial instructor.
Get started on module 5.
Pass Task 4 — Submission Details and Assessment Criteria
Write down the questions and your answers in a text or Word document and upload it to Doubtfire. Your tutor will give online feedback and discuss the tasks with you in the lab when they are complete.
Run the wordcount MapReduce example that is already available on Azure to count the words in a file that you choose and upload, using the following steps:
As you know, Hadoop tasks run on a cluster of nodes. First, you have to create the cluster.
Assuming you are already logged in, go to your dashboard. In the search field at the top centre of the page, type HDInsight. Choose HDInsight clusters from the options.
Click on ‘+Create’.
Select the correct subscription (containing COS80023) and your resource group.
Set the cluster name to s<yourstudentnumber>cluster (no upper case letters allowed).
Choose Australia East as the location.
Choose Hadoop 3.1 as cluster type. Leave the default cluster username and ssh user. Choose a password with upper case, lower case, numbers and a special character.
Question 1: Do you think the choice of location matters? Why/why not?
Click Next to proceed to Storage. Select Azure Storage. Click Create new. Name your new storage <yourstudentnumber>storage.
For the container choose <yourstudentnumber>container. Leave the other options as default.
Click Next to proceed to Security and Networking. Do not change the default options.
Click Next to proceed to Configuration+pricing. Examine the default resources for the cluster. There are head nodes, Zookeeper nodes and worker nodes.
You cannot change the number of nodes for the first two options, but you must change the number of worker nodes to 2.
For the nodes choose the following options
Observe the information about available cores in Australia East.
Question 2: How many cores are available in total in this area? Did you expect more/less?
Do not make changes in the Script actions section.
Click Review+create. On the summary page, you get to create the cluster. It typically takes a few minutes for the cluster to be up and running.
To find out about the progress (and possible errors), click on notifications on the top right (bell-shaped icon).
To analyse a file using MapReduce, you have to put the file where MapReduce can find it. There are two options, Data Lakes and Azure Storage. We will use the Azure Storage account that we created earlier.
Go to the storage account when it has been created. Click on Storage browser (preview).
Click on Blob Container. You should see the container you created earlier. Click on it. Click Upload and select, from your file system, the file whose words you want to count. This is what the dashboard should look like:
When the deployment has completed, open a command window and type (or copy) the command for an ssh connection:
ssh sshuser@<yourstudentnumber>cluster-ssh.azurehdinsight.net
Type the password. If you type it correctly, you will see:
This is to tell you that your computer has never had any dealings with this host and does not recognise its signature. You can safely say yes.
Invoke the wordcount example already on HDInsight:
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount
Question 3: What does the interface tell you? How do you think you can fix this?
The file you are using should be here:
You can use this as an output directory:
Question 4: What does the wasb prefix mean, and how does it relate to HDFS?
If the wordcount example runs successfully, it creates a file called part-r-00000 in the output directory (a job that runs more than one reducer creates further files with different numbers, one per reducer).
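To make the job's output easier to interpret, here is a minimal local sketch in Python of what a word count does in MapReduce terms. It is not the code that runs on HDInsight: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are made up for illustration, and it assumes that words are separated by whitespace and are case-sensitive. It prints one word and its count per line, separated by a tab and sorted by word, which is roughly the layout you can expect in part-r-00000.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the list of counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle(pairs)
    # Sort by key so the result is ordered by word, like reducer output.
    return [reduce_phase(word, groups[word]) for word in sorted(groups)]


if __name__ == "__main__":
    # Hypothetical sample input standing in for the uploaded file.
    sample = ["the cat sat", "the cat"]
    for word, total in word_count(sample):
        print(f"{word}\t{total}")
```

On a real cluster the map calls run in parallel on different parts of the input and the shuffle moves data between nodes; this sketch runs everything in one process purely to show the data flow.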
Show the part-r-00000 file on the command line. Use the command:
hdfs dfs -cat wasb://<directory-path>/output/part-r-00000
Take a screenshot of the command and the beginning of the file and put it into your answer file for Doubtfire. Example:
Different MapReduce code example
Find a different MapReduce example and include its code in your submission document.
Draw a diagram of its Map and Reduce phases (and the Combine phase, if there is one), similar to this:
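As background for the diagram, the role of an optional Combine phase can be sketched in Python. This is an illustrative, single-process sketch under assumed names (map_split, combine, reduce_all) and made-up input splits, not Hadoop's API: each split is mapped separately, the combiner pre-sums counts within one mapper's output, and the reducer then produces the same totals from fewer intermediate pairs.

```python
from collections import Counter


def map_split(lines):
    """Map: emit (word, 1) for every token in one input split."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Combine: pre-sum the counts within a single mapper's output."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


def reduce_all(pair_lists):
    """Reduce: sum the counts per word across all mappers' outputs."""
    total = Counter()
    for pairs in pair_lists:
        for word, count in pairs:
            total[word] += count
    return dict(total)


if __name__ == "__main__":
    # Two hypothetical input splits, each handled by its own mapper.
    splits = [["a b a", "a"], ["b b", "a"]]
    mapped = [map_split(split) for split in splits]
    combined = [combine(pairs) for pairs in mapped]

    print("pairs sent to reducer without combiner:", sum(len(m) for m in mapped))
    print("pairs sent to reducer with combiner:", sum(len(c) for c in combined))
    print("totals match:", reduce_all(mapped) == reduce_all(combined))
```

A combiner works here because addition is associative and commutative, so partial sums can be merged in any order; an operation without those properties would need a different design.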