Pass Task 3: Extraction, Transformation and Loading
Practise ETL with an example that starts from a .csv (comma-separated values) file, uses a Hadoop cluster and Apache Hive to transform the data by applying the relational model, and stores the result in a relational database with the help of Apache Sqoop.
Demonstrate an understanding of how to use Hadoop tools to automate the transformation of data into the relational model.
Carry out the tasks described below and answer the questions in your submission.
This task should be completed in the fourth lab class or before and submitted to Doubtfire for feedback. It should be discussed and signed off in tutorial 3 or 4.
This task should take no more than 2 – 2 ½ hours to complete.
Discuss your answers with the tutorial instructor.
Get started on module 4.
Pass Task 3 — Submission Details and Assessment Criteria
Write down the questions and answers in a text or Word document and upload to Doubtfire. Your tutor will give online feedback and discuss the tasks with you in the lab when they are complete.
Create a Data Lake Gen2 on Azure, upload a data file in tabular format, clean it and upload it to a relational database using Hive and a Hadoop cluster on Azure.
The process has the following steps:
A managed identity avoids having to use passwords for authentication to different services. In this exercise, we are using a managed identity to authorise Hadoop Hive to work on a Data Lake we create. Managed identities are stored in Active Directory (AD). AD is a repository of users and their passwords that lets people use the same username and password for several services. We have a similar setup at Swinburne: you can access all your services at Swinburne with one SIMS password. (Please do not use your SIMS password for these exercises.)
Type Managed Identity in the search field in the top middle of the portal. Click on the entry with the same name under ‘Services’ to start creating a managed identity. Press +Create or click on Create managed identity.
Select the correct subscription (containing COS80023) and your resource group (your student number).
Select Australia East as your region.
In the resource name field, type <yourstudentnumber>managedidentity. Click Review+Create.
Observe the notifications (bell-shaped icon). Once it tells you the resource has been created, click on Refresh. You should now see your new managed identity.
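The portal steps above can also be done from the command line. This is only a sketch: it assumes the Azure CLI (az) is installed and you are logged in, and that your resource group is named after your student number, as in this unit.

```shell
# Sketch only: substitute your own student number.
az identity create \
  --resource-group <yourstudentnumber> \
  --location australiaeast \
  --name <yourstudentnumber>managedidentity
```

Either route produces the same user-assigned managed identity; use whichever you are more comfortable with.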
Question 1: Follow this link: https://www.blue-granite.com/blog/bid/402596/top-five-differences-between-data-lakes-and-data-warehouses
Answer these questions: what is a data warehouse, and how does a data lake differ from it?
To use tools on your data, you have to store it in a data store that Hadoop can access. There are two options, Data Lake and Azure Storage. We will use Data Lake Gen2.
In your Azure Portal, go to Storage Accounts. Press +Create or click on the Create storage account in the middle.
Choose the correct subscription and resource group. Name your new storage <yourstudentnumber>datalake. Set the location to Australia East.
Leave the default for all other settings on this tab.
Move to the Advanced tab. Enable the Data Lake Storage Gen2 hierarchical namespace.
Create the storage account.
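The same storage account can be created from the Azure CLI. A hedged sketch (names are illustrative; the key point is that the hierarchical namespace must be enabled to get Data Lake Storage Gen2 behaviour):

```shell
# Sketch only: a Gen2-capable storage account needs the hierarchical namespace on.
az storage account create \
  --resource-group <yourstudentnumber> \
  --location australiaeast \
  --name <yourstudentnumber>datalake \
  --enable-hierarchical-namespace true
```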
In your new storage account, open Access control (IAM) and add a role assignment. Select the Storage Blob Data Owner role.
In the box Assign role to, choose User assigned managed identity.
Click +Select members. In the Select managed identities section, under Managed identity select “User-assigned managed identity”; in the select box, choose the managed identity you created earlier and click Select.
You have created an identity that has access to your data lake.
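This role assignment can also be scripted. The sketch below is hedged: the scope string assumes the resource names used in this task, and <subscriptionId> is a placeholder you must supply yourself.

```shell
# Sketch only: look up the identity's principal ID, then grant it
# Storage Blob Data Owner on the data lake storage account.
principalId=$(az identity show \
  --resource-group <yourstudentnumber> \
  --name <yourstudentnumber>managedidentity \
  --query principalId -o tsv)
az role assignment create \
  --assignee "$principalId" \
  --role "Storage Blob Data Owner" \
  --scope "/subscriptions/<subscriptionId>/resourceGroups/<yourstudentnumber>/providers/Microsoft.Storage/storageAccounts/<yourstudentnumber>datalake"
```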
(This is repetition from last week)
Go to your Dashboard. Choose SQL Databases under Azure Services (top middle). Click +Create above the (empty) listing, or click ‘Create SQL database’ in the middle of the page.
Select the correct subscription (containing COS80023) and your resource group.
Name your database <yourstudentnumber>rdb.
Create a new server <yourstudentnumber>server. Choose Australia East as a location. Leave the default authentication method untouched.
Choose user<yourstudentnumber>, e.g. user12345678, as your admin user. (The user name cannot start with a number.)
Choose a strong password that is more than 8 characters long, has special characters and upper and lower case letters. (The system will tell you immediately whether a password is acceptable).
Click Ok to create the server.
Choose this <yourstudentnumber>server for your database. Select No for using an elastic pool.
Leave the other options untouched.
Click ‘Review+Create’, review the settings, and create the database.
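If you prefer the command line, the server and database can be created with the Azure CLI. A sketch under the naming assumptions of this task (the password shown is a placeholder; use your own strong password):

```shell
# Sketch only: create the logical SQL server, then the database on it.
az sql server create \
  --resource-group <yourstudentnumber> \
  --location australiaeast \
  --name <yourstudentnumber>server \
  --admin-user user<yourstudentnumber> \
  --admin-password '<yourstrongpassword>'
az sql db create \
  --resource-group <yourstudentnumber> \
  --server <yourstudentnumber>server \
  --name <yourstudentnumber>rdb
```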
When the database has been deployed, click SQL Databases in the leftmost navigation pane. Choose the <yourstudentnumber>rdb. Click on ‘Overview’ in the menu on the left. Above the entry of your server, there is an option Set server firewall. Click it.
Click “Selected networks” option.
Under the ‘Rule’ field, click the Add your client IPv4 address option and save.
Alternatively, you can click +Add a firewall rule, type a rule name (e.g. sp_set_firewall_rule) in the ‘Rule name’ field, and type your IP address into both the Start IP and End IP fields.
Important: Tick Allow Azure services and resources to access this server. (This is different from last lab).
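These firewall settings can also be scripted. A hedged sketch (rule names are illustrative; the special 0.0.0.0–0.0.0.0 rule is how “Allow Azure services and resources to access this server” is represented):

```shell
# Sketch only: one rule for your own client IP...
az sql server firewall-rule create \
  --resource-group <yourstudentnumber> \
  --server <yourstudentnumber>server \
  --name MyClientIP \
  --start-ip-address <your_ip> --end-ip-address <your_ip>
# ...and the 0.0.0.0 rule that lets Azure services (e.g. Sqoop on the cluster) in.
az sql server firewall-rule create \
  --resource-group <yourstudentnumber> \
  --server <yourstudentnumber>server \
  --name AllowAllWindowsAzureIps \
  --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0
```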
Now you can create a table (as you did last week, by logging into the query editor). The table structure is this:
CREATE TABLE [dbo].[accidents](
    [day_week_description] [nvarchar](50) NOT NULL,
    [no_of_vehicles] float,
    CONSTRAINT [PK_accidents] PRIMARY KEY CLUSTERED ([day_week_description] ASC)
);
This is where our ETL result eventually goes.
In the search field top centre of the page, type HDInsight. Choose HDInsight clusters from the options.
Click on ‘Add’.
In the ‘Basic’ tab, set the cluster name to s<yourstudentnumber>cluster (no upper case letters allowed), e.g. s12345678cluster.
Choose the subscription and resource group you selected in the previous tasks. Choose Australia East as the location.
Choose Hadoop 2.7.3 as cluster type. Leave the default cluster username and ssh user. Choose a password with upper case, lower case, numbers and a special character.
Leave the admin users for cluster login and ssh as is. Provide a cluster password (you can use the same password throughout this exercise).
Click Next to proceed to Storage. Choose Azure Data Lake Storage Gen2. Choose the
<yourstudentnumber>datalake you created before.
Change the file system name to <yourstudentnumber>filesystem. Select the identity you created earlier.
Security and Networking
In the User-assigned managed identity field, choose the identity you created earlier. If it is not visible, type its name yourself.
Leave all other options as they are. Click next.
Leave all defaults and click Next.
Choose the smallest possible node sizes. There is no need to add a script action.
On the summary page, you get to create the cluster after the settings have been validated. It typically takes a few minutes for the cluster to be up and running.
Examine the staging script
The resource file on Doubtfire is reporting.zip. Unzip it on your hard drive. It contains reporting.csv and staging.hql.
First, open reporting.csv and find out what data we are working with. What information does the file contain?
The file staging.hql is written in HiveQL, but if you can read SQL, you can guess what the file is doing. We will be studying the theory of Hive a little later. For now, just open the script in a text editor and examine it, to answer the following question:
Question 2: What is the script going to do if you run it?
Run the staging script
In the script staging.hql, change the name of the file system to your file system on Azure, and the name of the data lake to the name of the data lake you created. Be careful to use the correct names.
After this, rezip staging.hql and reporting.csv into a file called reporting.zip. Do not include a directory.
Open a terminal (cmd on Windows) and upload the data from Doubtfire onto the cluster using scp (secure copy protocol).
Navigate to the directory on your own file system where you have stored the .zip file from Doubtfire, using the cd command. (E.g. cd .. takes you up one level in the file structure; cd data takes you into a subdirectory called data.)
scp reporting.zip sshuser@s<yourstudentnumber>cluster-ssh.azurehdinsight.net:reporting.zip
You will be prompted for a password. Type the password you specified when you created the cluster.
Connect to the cluster using ssh (Secure Shell).
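The connect string has the same shape as the scp target above; a sketch, assuming the cluster name you chose earlier:

```shell
ssh sshuser@s<yourstudentnumber>cluster-ssh.azurehdinsight.net
```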
If your connect string and password were correct, you now see a command prompt:
This means you are talking to the head node (hn0) of your cluster. You can now issue commands to the cluster. First, unzip your data:
unzip reporting.zip
As you are now working on a Linux system, you can check the contents of the current directory by running:
ls
You should now see the three files reporting.zip, staging.hql and reporting.csv. (If you do not, you have to troubleshoot before continuing.)
Use the following command to create a directory in the Data Lake Storage Gen2 file system:
hdfs dfs -mkdir -p abfs://<yourstudentnumber>filesystem@<yourstudentnumber>datalake.dfs.core.windows.net/accidents/data
Change the command to create a second directory called script, which is also a subdirectory of accidents.
Use the following command to copy the reporting.csv file to the data directory:
hdfs dfs -put "reporting.csv" abfs://<yourstudentnumber>filesystem@<yourstudentnumber>datalake.dfs.core.windows.net/accidents/data/
Check if the file is really there:
hdfs dfs -ls abfs://<yourstudentnumber>filesystem@<yourstudentnumber>datalake.dfs.core.windows.net/accidents/data
Change the hdfs dfs -put command to put the staging.hql file into the script directory.
To run the staging.hql script, we have to call Hive:
beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http' -f staging.hql
You should have an idea what the script is doing at this stage. If you don’t, discuss with a mate or your tutor.
After the staging.hql script has finished running, use the following command to open an interactive Hive session:
beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http'
You should see the beeline prompt, which looks something like 0: jdbc:hive2://localhost:10001/>.
Now you can extract data using:
INSERT OVERWRITE DIRECTORY '/accidents/output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT regexp_replace(day_week_description, '''', ''),
       sum(no_of_vehicles)
FROM accidents_in_hive
WHERE no_of_vehicles IS NOT NULL
GROUP BY day_week_description;
Question 3: What is the ‘accidents_in_hive’ object that comes after the ‘FROM’ keyword? What information do you think the query is extracting?
Of course this is no fun if you can’t check the outcome of the HiveQL. Use the
hdfs dfs -ls
command you used before to inspect the contents of the output directory. You should find a file there. Use this command:
hdfs dfs -cat abfs://
(complete the abfs:// path yourself so that it points at the file in the output directory) to display the content of the file. Is this what you expected to see when you examined the HiveQL you ran interactively?
In this final step, you load the data you just extracted with Hive into the SQL Server database you created earlier. We are going to use Sqoop for this.
First, check if sqoop can see the database and table you created earlier:
sqoop list-databases --connect jdbc:sqlserver://<yourstudentnumber>server.database.windows.net:1433 --username user<yourstudentnumber> --password <yourstrongpassword>
You should see the rdb database you created earlier. If you do, you are good to load the data:
sqoop export --connect 'jdbc:sqlserver://<yourstudentnumber>server.database.windows.net:1433;database=<yourstudentnumber>rdb' --username user<yourstudentnumber> --password <yourstrongpassword> --table 'accidents' --export-dir 'abfs://<yourstudentnumber>filesystem@<yourstudentnumber>datalake.dfs.core.windows.net/accidents/output' --fields-terminated-by '\t' -m 1
Now you should be able to go back to your Azure Portal and find the SQL database created at the start. Go back to the Query Editor you used before when you created the accidents table.
Run a select-all query:
SELECT * FROM accidents;
Question 4: On what weekday do the most accidents happen? How many vehicles are involved? Document the answer with a screenshot.
Question 5: What steps and tools were involved in this process? Draw a diagram that shows the tools and files used. You can draw in Powerpoint or by hand and take a photo to document the workflow.