Purpose of the article: To explain how to automate a complete AWS Glue ETL process in one click
Intended Audience: Those exploring AWS Glue and its components, and those looking for a serverless, automated ETL solution on the AWS cloud
Tools and Technology: AWS Glue (AWS Cloud)
Keywords: AWS Glue Workflow, Glue Workflow
Author: Aman Maheshwari
AWS Glue Workflows: A Way of Automating Glue Processing
- Discuss what is AWS Glue Workflow with its advantages
- An example to show the working of AWS Glue Workflow
AWS Glue introduction:
AWS Glue is a fully managed ETL service that makes it simple to categorize your data, clean it, transform it, and move it reliably between various data stores and streams in a cost-effective manner.
Glue provides a central metadata repository (the AWS Glue Data Catalog), an ETL engine that runs Python or Scala code, and features such as dependency resolution, job monitoring, and retries. Because it is a serverless service, there is no infrastructure to provision or manage.
AWS Glue workflows:
In AWS Glue, workflows are used for creating and visualizing complex ETL activities that involve multiple crawlers, jobs, and triggers. Each workflow manages the execution of all of its components and monitors every job and crawler added to it. A workflow:
- Runs each component, either a job or a crawler, via a trigger
- Records the execution of each component, along with its progress and status
You can create a workflow in two ways:
- From an AWS Glue blueprint
- Manually (building the workflow in the AWS Management Console or through the AWS Glue API)
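As a sketch of the API route, an empty workflow can be created programmatically with boto3. The workflow name and description below are illustrative placeholders, not values from this article:

```python
# Minimal sketch: creating an empty Glue workflow with boto3.
# "raw-to-cleansed" is a placeholder name.

def workflow_params(name: str, description: str = "") -> dict:
    """Build the parameter dict for glue.create_workflow()."""
    return {"Name": name, "Description": description}

def create_workflow() -> None:
    import boto3  # imported here so the sketch is readable without AWS access
    glue = boto3.client("glue")
    glue.create_workflow(**workflow_params(
        "raw-to-cleansed",
        "Crawl raw data, run an ETL job, then crawl the cleansed output",
    ))
```

Jobs, crawlers, and triggers are then attached to the workflow by name, as shown in the sections that follow.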
Components of an AWS Glue workflow:
- Triggers: Triggers within workflows can start both jobs and crawlers, and can in turn be fired when jobs or crawlers complete. The available trigger start options are:
- Schedule – The workflow starts on a schedule, which can be a periodic preset or customized with a cron expression.
- On-demand – The workflow is started manually from the AWS Glue console, API, or AWS CLI.
- EventBridge event – The workflow starts upon the occurrence of a single Amazon EventBridge event or a batch of Amazon EventBridge events. A common use case is the upload, copy, or deletion of an object in an Amazon S3 bucket.
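For illustration, the parameter shapes for a scheduled and an on-demand trigger look roughly like this with boto3; the trigger names, crawler name, and cron expression are assumptions for the sketch:

```python
# Sketch of glue.create_trigger() parameters for two start options.
# Trigger/crawler names and the schedule are illustrative placeholders.

def scheduled_trigger(workflow: str, crawler: str) -> dict:
    """A trigger that starts a crawler every day at 02:00 UTC."""
    return {
        "Name": "daily-raw-crawl",
        "WorkflowName": workflow,
        "Type": "SCHEDULED",
        "Schedule": "cron(0 2 * * ? *)",  # Glue uses 6-field cron expressions
        "Actions": [{"CrawlerName": crawler}],
        "StartOnCreation": True,
    }

def on_demand_trigger(workflow: str, crawler: str) -> dict:
    """A trigger fired manually from the console, API, or CLI."""
    return {
        "Name": "start-raw-crawl",
        "WorkflowName": workflow,
        "Type": "ON_DEMAND",
        "Actions": [{"CrawlerName": crawler}],
    }
```

Either dict can be passed as `glue.create_trigger(**params)`.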
- Crawlers: This Glue component crawls data in different data sources, such as S3, Redshift, or DynamoDB, and extracts a schema based on the selected classifier, ultimately creating tables in the Data Catalog.
- ETL Jobs: Using the metadata in the Data Catalog, AWS Glue can automatically generate Scala or PySpark scripts with Glue extensions that you can use and modify to perform ETL operations.
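A job can also be registered through the API once its script is in S3. In this sketch, the job name, IAM role ARN, script path, Glue version, and worker sizing are all illustrative placeholders:

```python
# Sketch of glue.create_job() parameters for a PySpark ETL job.
# The role ARN, script path, version, and worker sizing are placeholders.

def etl_job_params(name: str, role_arn: str, script_s3_path: str) -> dict:
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",            # Spark ETL job (vs. "pythonshell")
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
    }
```

The resulting dict is passed as `glue.create_job(**params)`.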
- Workflow run properties: These are key-value pairs set on a workflow run and made available to a job before it runs within the workflow.
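A sketch of setting run properties through the API; the workflow name and property keys are placeholders:

```python
# Sketch: setting workflow run properties with boto3.
# Workflow name and property keys are illustrative placeholders.

def run_properties_params(workflow: str, run_id: str, props: dict) -> dict:
    """Build the parameter dict for glue.put_workflow_run_properties()."""
    return {"Name": workflow, "RunId": run_id, "RunProperties": props}

def set_run_properties(workflow: str, run_id: str, props: dict) -> None:
    import boto3  # deferred so the sketch is readable without AWS access
    glue = boto3.client("glue")
    glue.put_workflow_run_properties(
        **run_properties_params(workflow, run_id, props))
```

Jobs in the run can read these back with `glue.get_workflow_run_properties(Name=..., RunId=...)`.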
- Workflow graph: The graph depicts the order in which your Glue components run. The first trigger is on-demand, scheduled, or event-based; later triggers fire based on the status of the preceding component's execution (as shown in the diagram).
Steps to create a Glue workflow:
We will now create a Glue workflow that comprises three steps in a single flow:
STEP 1: A trigger (on-demand, scheduled, or EventBridge-based) starts a crawler that creates a Data Catalog over the data in the raw bucket.
STEP 2: On the crawler's successful completion (STEP 1), another trigger fires a Glue ETL job that performs transformations on the data and stores the result back in S3.
STEP 3: On the Glue ETL job's successful completion (STEP 2), a final trigger starts another crawler, this time over the cleansed data, to create its Data Catalog.
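The conditional chaining in STEP 2 and STEP 3 corresponds to triggers with predicates. A sketch of the two trigger definitions, with all workflow, crawler, and job names as placeholders:

```python
# Sketch: conditional triggers that chain the three steps of the workflow.
# Workflow, crawler, and job names are illustrative placeholders.

def after_crawler_succeeds(workflow: str, crawler: str, job: str) -> dict:
    """STEP 2: run the ETL job once the raw-data crawler succeeds."""
    return {
        "Name": "run-etl-after-crawl",
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "Predicate": {"Conditions": [{
            "LogicalOperator": "EQUALS",
            "CrawlerName": crawler,
            "CrawlState": "SUCCEEDED",
        }]},
        "Actions": [{"JobName": job}],
        "StartOnCreation": True,
    }

def after_job_succeeds(workflow: str, job: str, crawler: str) -> dict:
    """STEP 3: crawl the cleansed data once the ETL job succeeds."""
    return {
        "Name": "crawl-cleansed-after-etl",
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "Predicate": {"Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": job,
            "State": "SUCCEEDED",
        }]},
        "Actions": [{"CrawlerName": crawler}],
        "StartOnCreation": True,
    }
```

Each dict is passed as `glue.create_trigger(**params)`; the console steps below achieve the same wiring visually.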
Setting up the workflow:
- Create a Glue workflow.
- Add a trigger that will start the first crawler (STEP 1).
- Add the crawler to be run by the above trigger.
- Add a new trigger that fires on the successful completion of the crawler run from the previous step. Its trigger type is therefore conditional, not on-demand as in the first step.
- Attach a Glue job to this trigger.
- Add a final trigger that starts a Glue crawler once the job completes successfully; this crawler crawls the new cleansed data in S3.
- Finally, attach the crawler to that trigger.
The workflow setup is now complete, and it looks like this:
Running the workflow:
The image below shows how you can run a workflow.
Note: On-demand means running the workflow manually; otherwise, it runs automatically when scheduled or when triggered by an event.
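Starting an on-demand workflow from code is a single call; a sketch, with the workflow name as a placeholder:

```python
# Sketch: starting an on-demand workflow run manually with boto3.
# The workflow name is an illustrative placeholder.

def start_workflow(workflow: str) -> str:
    """Start a workflow run and return its run ID."""
    import boto3  # deferred so the sketch is readable without AWS access
    glue = boto3.client("glue")
    resp = glue.start_workflow_run(Name=workflow)
    return resp["RunId"]  # e.g. start_workflow("raw-to-cleansed")
```

The same can be done from the AWS CLI with `aws glue start-workflow-run --name <workflow-name>`.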
Workflow run history:
- When this workflow runs, you can check the running crawlers and jobs, along with their status and logs (if enabled).
- Glue crawler:
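The run status shown in the history can also be polled through the API; a sketch:

```python
# Sketch: checking the status of a workflow run with boto3.
# The workflow name and run ID come from start_workflow_run or the console.

def workflow_run_status(workflow: str, run_id: str) -> str:
    """Return the current status of a workflow run."""
    import boto3  # deferred so the sketch is readable without AWS access
    glue = boto3.client("glue")
    resp = glue.get_workflow_run(Name=workflow, RunId=run_id)
    return resp["Run"]["Status"]  # e.g. RUNNING, COMPLETED, ERROR
```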