AIM416: Build an ETL pipeline to analyze customer data

Introduction

In this workshop we will use AWS Glue ETL to enrich the Amazon product reviews dataset. To do that, we will pass the product review text to Amazon Comprehend to detect sentiment and extract entities and key phrases, which will be added to the original review dataset. We will then save the dataset to Amazon S3 in Apache Parquet format so it can be queried with Amazon Athena and visualized with Amazon QuickSight.

NOTE: This workshop assumes you are running in the us-east-1 region. If you prefer to run in another region, you will need to update the accompanying scripts. Also be aware that the Amazon product reviews dataset is hosted in us-east-1; accessing it from another region may incur additional data transfer costs.

Helpful links

  1. Apache Spark API documentation
  2. Amazon Comprehend Boto3 documentation

Authorization and permissions

Before we can start, we need to define the appropriate policies and permissions for the different services to use. In this workshop I assume you are logged in as a root user or a user with enough privileges to create IAM roles and assign policies. If you don't have that level of permission, either ask the owner of your account to help or create a personal account where you have more permissions.

If you get stuck, there is more information here:

  1. Setting up IAM Permissions for AWS Glue
  2. Overview of Managing Access Permissions for Amazon Comprehend Resources

Create AWS Glue service role

Open the IAM console, select Roles and click on Create Role. In the list of services select Glue and click next. Add the following managed policies: AmazonS3FullAccess, AWSGlueServiceRole, ComprehendFullAccess, CloudWatchLogsReadOnlyAccess. Click next and finish creating the role. In the list of roles, select your new role. Click the Trust Relationships tab, click Edit Trust Relationship and make sure your trust policy looks like the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "comprehend.amazonaws.com",
          "ec2.amazonaws.com",
          "glue.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Getting Started

Start by downloading comprehend_api.py to your local computer. Then open the AWS Console and navigate to the S3 console. If you don't already have one, create a bucket with a unique name. S3 uses a flat namespace, which means bucket names must be unique across all S3 customers. Inside the bucket, create a subfolder named deps and upload comprehend_api.py into it. The comprehend_api.py file includes a few functions that allow us to take a row field from our dataset and pass it to the Comprehend API.
For these functions to work as UDFs (user-defined functions) in Apache Spark, which AWS Glue uses under the hood, each one needs to be wrapped in Spark's UDF function factory. To keep the code readable and extensible, these functions are broken out into their own file.
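
To give a sense of the pattern (the authoritative code is comprehend_api.py in this repo; the snippet below is only a sketch), a UDF that passes review text to the Comprehend detect_sentiment API could look roughly like this:

import boto3
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def detect_sentiment(text):
    # A Comprehend client is created and the API is called for each row;
    # fine for the small sample used in this workshop.
    client = boto3.client('comprehend', region_name='us-east-1')
    response = client.detect_sentiment(Text=text, LanguageCode='en')
    return response['Sentiment']

# Wrapping the plain Python function with udf() lets Spark apply it to a DataFrame column.
getSentiment = udf(detect_sentiment, StringType())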

Crawling Source Dataset

The next thing we need to do is use an AWS Glue crawler to discover and catalog our source dataset, which will create a schema for us in the AWS Glue Data Catalog. Open the AWS Glue console and navigate to Crawlers under the Data Catalog section on the left side of the console. Click Add Crawler and follow the wizard, accepting all of the defaults. When it asks you to specify an include path, select Specified path in another account and enter s3://amazon-reviews-pds/parquet/ in the include path text box. Next, when asked to choose an IAM role, select Create an IAM role and give it a name. Note that this role will be created with an IAM S3 policy that grants access only to the S3 path listed as the include path. If you previously used AWS Glue and already created a generic service role, feel free to use it. Next, when asked to configure the crawler output, create a new database and give your source table a prefix such as source_. Accept the rest of the defaults and save the crawler. Now, in the list of crawlers, check the checkbox next to your crawler name and click Run crawler.
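
If you would rather script this step than click through the wizard, the equivalent boto3 calls look roughly like this (the crawler, role, and database names below are placeholders, not values from the workshop):

import boto3

glue = boto3.client('glue', region_name='us-east-1')

glue.create_crawler(
    Name='aim416-source-crawler',        # placeholder crawler name
    Role='AWSGlueServiceRole-aim416',    # an IAM role with access to the include path
    DatabaseName='aim416',               # placeholder database name
    TablePrefix='source_',
    Targets={'S3Targets': [{'Path': 's3://amazon-reviews-pds/parquet/'}]},
)
glue.start_crawler(Name='aim416-source-crawler')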

Once the crawler is done, a new table will be created for you in the database you selected. Open the Athena console and select your database from the dropdown on the left-hand side. You will then see a list of tables, in which you should find the one the crawler created; remember, it has a prefix of source_. Click the three dots to the right of the table name and select Preview table. If everything went well, you should see some data.

Creating AWS Glue ETL job

The next thing we need to do is create an AWS Glue ETL job that, when executed, will read the product review dataset, enrich it with Comprehend and write it back out to S3 so we can later analyze and visualize the results. Open the Glue console and select Jobs from the left-hand side. Click Add Job and follow along with the wizard. For the IAM Role, select the role you created earlier. Select A new script to be authored by you in the This job runs section. Under the Advanced Properties dropdown, enable Job metrics. Under the Security configuration, script libraries and job parameters dropdown, enter the full S3 path to the comprehend_api.py you uploaded previously in the Python library path textbox, e.g. s3://my-bucket/aim416/deps/comprehend_api.py. Further down, change Concurrent DPUs per job run to 5. A Data Processing Unit, or DPU, is a measure of the resources we can assign to our job; a single DPU equates to 4 vCPUs and 16 GB of memory. For this exercise, and to reduce cost, 5 DPUs are sufficient. Continue through the wizard accepting the defaults and finally save the job.
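
For reference, the same job definition can be created with boto3; the snippet below is a sketch with placeholder names and paths, showing where the extra Python file and the 5-DPU setting fit:

import boto3

glue = boto3.client('glue', region_name='us-east-1')

glue.create_job(
    Name='aim416-enrich-reviews',                        # placeholder job name
    Role='AWSGlueServiceRole-aim416',                    # the Glue service role created earlier
    Command={'Name': 'glueetl',
             'ScriptLocation': 's3://my-bucket/aim416/workshop_job.py'},
    DefaultArguments={
        '--extra-py-files': 's3://my-bucket/aim416/deps/comprehend_api.py',
        '--enable-metrics': '',                          # equivalent of enabling Job metrics
    },
    AllocatedCapacity=5,                                 # 5 DPUs, matching the console setting
)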

Editing the ETL script

Now that your job has been created, you should have the ETL script editor open in the console. Select all of the text in the current script and delete it. Come back to this repo, open workshop_job.py, copy its contents, and paste them into the AWS Glue ETL script editor. Click Save. Review the script, note the TODO comments, and make the appropriate changes.

As you will see, the script does the following (a condensed sketch appears after the list):

  1. Reads the raw dataset as defined in the AWS Glue Data Catalog
  2. Filters the dataset to only return rows for the US marketplace (so we have English-only text) where the review is long enough to yield meaningful results, and grabs only a few rows so we can quickly see results
  3. Adds a column called sentiment that holds the sentiment value detected by Comprehend on the review text; the getSentiment function is declared in comprehend_api.py
  4. Drops review_body from the results, just to make the resulting table easier to explore; you can remove that line if you like
  5. Writes the resulting dataset to S3 in Apache Parquet format
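
Putting those five steps together, here is a condensed sketch of the job. The real logic lives in workshop_job.py; the database, table, filter threshold, and output bucket below are placeholders you would replace with your own values:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import length

from comprehend_api import getSentiment   # shipped to the job via the Python library path

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# 1. Read the source table the crawler registered in the Data Catalog.
reviews = glueContext.create_dynamic_frame.from_catalog(
    database='aim416', table_name='source_parquet').toDF()

# 2. Keep US-marketplace reviews that are long enough to analyze, and sample a few rows.
sample = (reviews
          .filter(reviews['marketplace'] == 'US')
          .filter(length(reviews['review_body']) > 200)   # placeholder length threshold
          .limit(1000))                                    # placeholder row limit

# 3. Add the Comprehend-detected sentiment and 4. drop the long review text column.
enriched = (sample
            .withColumn('sentiment', getSentiment(sample['review_body']))
            .drop('review_body'))

# 5. Write the result back to S3 in Apache Parquet format.
enriched.write.mode('overwrite').parquet('s3://my-bucket/aim416/output/')

job.commit()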

Querying our results

Once the script finishes executing, the output Parquet data will be in the S3 location you defined in the script. We now need to create another AWS Glue crawler to discover and catalog the new data. Follow the previous instructions to create a crawler, and make sure you create a new IAM role so it has permission to access your output location. Once the crawler completes, open the Athena console, select your table and preview it. Scroll all the way to the right to see the new sentiment column.
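
Beyond previewing the table, you can start asking questions of the enriched data. For example (a sketch using boto3; the table, database, and result bucket names are placeholders), the query below compares detected sentiment against star ratings:

import boto3

athena = boto3.client('athena', region_name='us-east-1')

query = """
SELECT star_rating, sentiment, COUNT(*) AS reviews
FROM enriched_reviews                 -- placeholder: the table your second crawler created
GROUP BY star_rating, sentiment
ORDER BY star_rating, reviews DESC
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'aim416'},    # placeholder database name
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'},
)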

Next Steps

Assuming you completed all of the previous steps you should be ready to move forward with exploring how we can further leverage Comprehend to enrich our dataset.

Step 1

As you may have already noticed, I included a getEntities function in the comprehend_api.py file that you should now add to your ETL script. Go ahead and try importing it and adding it to your script, similar to how getSentiment is used with the withColumn API. Note that once the new data is written to S3 (same output location), you will need to first delete the old table in the AWS Glue Data Catalog and then rerun the crawler so it can create a new table with the new schema.
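
If getEntities follows the same shape as getSentiment, the change to the ETL script is small. Continuing from the job sketch above (where sample is the filtered DataFrame), it would look something like this:

from comprehend_api import getSentiment, getEntities

enriched = (sample
            .withColumn('sentiment', getSentiment(sample['review_body']))
            .withColumn('entities', getEntities(sample['review_body']))
            .drop('review_body'))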

Step 2

Amazon Comprehend also includes an API to extract key phrases from text. I've created a skeleton UDF in comprehend_api.py to get you started. In this part of the workshop, go ahead and implement the getKeyPhrases UDF. You will need to edit comprehend_api.py on your local machine, make the code changes and upload it back to S3. You can either overwrite the existing file or upload it under a different name. Don't forget to confirm you have the correct path and filename in the Python library path field of the job settings.
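
One possible implementation, assuming the skeleton follows the same pattern as the sentiment UDF (the exact skeleton in comprehend_api.py may be shaped differently):

import boto3
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def detect_key_phrases(text):
    client = boto3.client('comprehend', region_name='us-east-1')
    response = client.detect_key_phrases(Text=text, LanguageCode='en')
    # Join the detected phrases into one string so the new column stays a simple string type.
    return ', '.join(phrase['Text'] for phrase in response['KeyPhrases'])

getKeyPhrases = udf(detect_key_phrases, StringType())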

Step 3

The product review dataset has reviews in other languages. Both German (DE) and French (FR) are supported by Comprehend and are also available in our dataset. In your AWS Glue ETL script, create another DataFrame representing only German-language reviews. You will then need to further extend the comprehend_api.py UDFs to allow you to pass a second parameter representing the language. Go ahead and try this with sentiment analysis and see what you get.
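
One way to thread the language code through, again as a sketch on top of the earlier UDF pattern (reviews here is the source DataFrame from the job sketch): the UDF takes a second argument, and the caller passes the language as a literal column with lit() because it is a constant rather than a DataFrame column.

import boto3
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import StringType

def detect_sentiment(text, language_code):
    client = boto3.client('comprehend', region_name='us-east-1')
    response = client.detect_sentiment(Text=text, LanguageCode=language_code)
    return response['Sentiment']

getSentiment = udf(detect_sentiment, StringType())

# German-language reviews only; 'de' is the Comprehend language code for German.
german = reviews.filter(reviews['marketplace'] == 'DE')
german = german.withColumn('sentiment',
                           getSentiment(german['review_body'], lit('de')))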

Step 4

Open Amazon QuickSight and configure the appropriate IAM permissions. Once set up, add a Data Set representing the table we created in Step 3. Go ahead and explore creating visualizations based on the data.
