Learn how to use Pig for information extraction on Azure HDInsight.
Demonstrate competency at extracting information from flat files using Pig.
Carry out the tasks described below and answer the questions in your submission.
This task should be completed in the fifth lab class or the week after and submitted to Doubtfire for feedback. It should be discussed and signed off in tutorial 6 or 7.
This task should take no more than 2 – 2 ½ hours to complete.
Discuss your answers with the tutorial instructor.
Get started on module 6.
Pass Task 5 — Submission Details and Assessment Criteria
Write down the questions and answers in a text or Word document and upload it to Doubtfire. Your tutor will give online feedback and discuss the tasks with you in the lab when they are complete.
Run Pig Latin commands on an Azure shell. Pig is a part of the Hadoop distribution and you have access to it when you create a Hadoop cluster using HDInsight.
As you know, in Hadoop tasks run on a cluster of nodes. First, you have to create the cluster.
Assuming you are already logged in, go to your dashboard. In the search field at the top centre of the page, type HDInsight and choose HDInsight clusters from the results.
Click on ‘+Create’.
Select the correct subscription (the one containing COS80023) and your resource group.
Set the cluster name to s<yourstudentnumber>cluster (no upper-case letters allowed).
Choose Australia East as the location.
Choose Hadoop 2.7.3 as the cluster type. Leave the default cluster username and ssh user. Choose a password containing upper-case letters, lower-case letters, numbers and a special character.
Make a note of the password; you will need it later.
Click Next to proceed to Storage. Select Azure Storage. Click Create new. Name your new storage <yourstudentnumber>storage.
For the container choose <yourstudentnumber>container. Leave the other options as default.
Click Next to proceed to Security and Networking. Do not change the default options.
Click Next to proceed to Configuration+pricing. Examine the default resources for the cluster. There are head nodes, Zookeeper nodes and worker nodes.
Choose the following options.
Notice that this time we are using only 1 worker node.
Click Review + create. On the summary page, check your settings and click Create. It typically takes a few minutes for the cluster to be up and running.
To find out about the progress (and possible errors), click on notifications on the top right (bell-shaped icon).
In your Dashboard, go to Storage Accounts.
You should see the storage account that you created. If you can't see it, try selecting the correct subscription. Click on your storage account.
On the left-hand side of the page you should see “Storage Browser”; click on it, then click on your blob container to see the folders inside.
Create a new folder by clicking +Add Directory, name it pigdata and press OK. This takes you inside the pigdata folder. Now upload the file SalesJan2009.csv; you can find it on Doubtfire, under the resources for this task. Note that if you navigate away without uploading a file, the empty pigdata folder will not be saved.
Your Storage Explorer should look similar to this:
Open a terminal and type the command for an ssh connection.
Alternatively, when your cluster is open in your dashboard, you can click on ‘SSH + Cluster login’ to copy the connect string.
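The exact connect string depends on your cluster name. Assuming you kept the default ssh user (sshuser) and used the naming scheme above, it should look like this (substitute your own student number):

```
ssh sshuser@s1234567cluster-ssh.azurehdinsight.net
```

You will be prompted for the ssh password you chose when creating the cluster.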
Once you see the command prompt for sshuser, type pig and press Enter. If your command prompt changes to grunt>, you are connected to Pig.
First, inspect the content of the SalesJan2009.csv file and find out what it represents. Run the following line:
rawsales = LOAD '/pigdata/SalesJan2009.csv' USING PigStorage(',') AS (Transaction_date:chararray, Product:chararray, Price:int, Payment_type:chararray, Name:chararray, City:chararray, State:chararray, Country:chararray);
Question 1: What does each part of this line do?
It may help to visualise the outcome of the command by typing DUMP rawsales;
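If DUMP produces too much output, two other standard Pig operators are useful for inspection: DESCRIBE shows just the schema Pig has assigned, and LIMIT lets you look at a few rows only. For example, from the grunt> prompt:

```
-- show the schema of the relation
DESCRIBE rawsales;

-- inspect only the first five rows
firstrows = LIMIT rawsales 5;
DUMP firstrows;
```
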
The following command strips the header (first line) from the table. Because the header's Price field contains the text 'Price' rather than a number, it cannot be cast to int and loads as null, so filtering out null prices removes the header row:
salesdata = FILTER rawsales BY Price IS NOT NULL;
Question 2: What does each line do (explain every detail)?
salesandcountry = FOREACH salesdata GENERATE (chararray) Country AS country, (int) Price AS price;
countrygroups = GROUP salesandcountry BY country;
salespercountry = FOREACH countrygroups GENERATE group AS country, SUM(salesandcountry.price) AS totalsales;
Question 3: What do these two lines do?
sortedsales = ORDER salespercountry BY totalsales DESC;
STORE sortedsales INTO '/pigdata/SalesByCountry' USING PigStorage(' ');
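To check that the STORE worked, you can list and view the output directory without leaving grunt: Pig's fs command runs HDFS shell commands. The part file name shown below is typical for a MapReduce job but may differ on your cluster, so list the directory first:

```
fs -ls /pigdata/SalesByCountry
fs -cat /pigdata/SalesByCountry/part-r-00000
```
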
Use the same data to extract how the sales transactions are distributed across the payment types (the Payment_type column).
Run the code in Pig using the terminal. Store the answer in a different directory inside pigdata (not SalesByCountry but some other name you choose).
Question 4: Document your code in your answer document.
Take a screenshot of the content of the new folder you created in pigdata and add to your answer document.
The code for this is not provided; there are plenty of examples online. Run them using the file you used for MapReduce (or make a new one if you have lost it).
Explain the steps and provide a screenshot of the content of the result (file or variable).
When you have finished, delete the cluster. Observe the notifications (bell icon); when the cluster has been deleted, refresh and put a screenshot of the resources without the cluster in your answer document.