Stepping out of the comfort zone
I was recently tasked with setting up a data pipeline to feed a data warehouse built on Redshift in AWS. One of the tools used in the process was AWS Glue, and its purpose was to perform ETL (Extract, Transform, Load). In this post, my goal is to outline my experience using AWS Glue and some of the things you should look out for. This is most applicable if you’re already leveraging AWS in your org.
What is AWS Glue?
“AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months.” (source: AWS)
One of the main use cases for Glue is creating event-driven ETL pipelines. If you’re familiar with AWS Lambda, or serverless applications in general, then the concept behind Glue will feel familiar: you provision some resources to execute a given task based on some input. This is a good segue into contrasting ETL and ELT.
In ETL, you transform the data between the source and the destination storage; filtering rows, modifying column types, and removing duplicated data are just a few examples. In ELT, the source data is written to the destination storage as is, and a process running at the destination transforms the data before it’s used.
Now, I’ll admit I’m no expert in the field of data science, so this post is by no means the be-all and end-all. Also, this project was greenfield; there was no pre-existing infrastructure. That influenced the decisions greatly, and what worked here may not work for you. With that out of the way, let’s continue.
The two pieces of the puzzle
AWS Glue is split into two main pieces: the catalog and the transforms.
AWS Data Catalog
This is essentially a pointer to the underlying data. It’s described as “metadata” about the data; it’s AWS’s way of understanding your data, and setting it up is the first step in building an ETL pipeline. Glue stores this metadata in tables. There are a couple of ways of defining your tables, and a crawler is the method we chose. A crawler removes some of the manual work of setting up the metadata tables by looking at the underlying data and trying to infer its schema. (There are other ways of doing this as well, which the AWS documentation lists.) Your ETL script then uses the table(s) in the catalog as the input to its transforms.
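To make that a bit more concrete, here is a minimal sketch of pointing a crawler at a MySQL database using boto3. Every name here (the crawler, the IAM role, the Glue connection, the catalog database) is a placeholder I made up for illustration; you’d create the connection and role beforehand, and in practice we did most of this through the console rather than code.

    import boto3

    glue = boto3.client("glue")

    # Create a crawler that infers the schema of the tables reachable through a
    # pre-existing Glue JDBC connection (all names are placeholders).
    glue.create_crawler(
        Name="shop-mysql-crawler",
        Role="GlueCrawlerServiceRole",   # IAM role with Glue + connection permissions
        DatabaseName="shop_catalog",     # catalog database the inferred tables land in
        Targets={
            "JdbcTargets": [
                {"ConnectionName": "shop-mysql-connection", "Path": "shop/%"}
            ]
        },
    )

    # Run it; once it finishes, the inferred tables show up in the Data Catalog.
    glue.start_crawler(Name="shop-mysql-crawler")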
AWS Glue ETL
An ETL job essentially has three parts: a data source, a transformation, and a destination. You can join data sets, filter, and even write to multiple destinations, but fundamentally you’re doing these three things.
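Here is a minimal sketch of what a Glue job script with those three parts can look like: read a table the crawler created in the catalog, apply a simple mapping, and write Parquet to S3. The database, table, column, and bucket names are all placeholders, so treat it as the general shape rather than something to copy verbatim.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # 1. Source: a table the crawler created in the Data Catalog
    source = glueContext.create_dynamic_frame.from_catalog(
        database="shop_catalog",
        table_name="shop_orders",
        transformation_ctx="source",
    )

    # 2. Transform: keep, rename, and cast only the columns we care about
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("id", "int", "order_id", "int"),
            ("created_at", "string", "created_at", "timestamp"),
        ],
        transformation_ctx="mapped",
    )

    # 3. Destination: write Parquet to S3 (it could just as well be Redshift)
    glueContext.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/warehouse/orders/"},
        format="parquet",
        transformation_ctx="sink",
    )

    job.commit()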
Sample use case
Let’s say you have some relational data living in a MySQL database and you would like to get it into your data warehouse. I want to walk you through a thought process, from the “ideal” solution to the solution that was actually implemented.
Based on my research, the conventional approach in AWS is to push all the data to S3 and query it with Redshift Spectrum, forming a sort of data lake-house: your data lives in S3 and is queried through Redshift. This is typically the ideal because it separates processing from storage. Things don’t scale well when you have terabytes or even petabytes of data and the two aren’t separated, and storing that much in Redshift itself can get expensive. But what if your data is nowhere close to the petabyte level? What if it’s just hundreds of gigabytes, or a couple of terabytes? In that case it seemed more economical to put the data directly into Redshift, with RA3 nodes there to offset the cost of scaling processing and storage. Time is also a key factor: how fast do you want to see updates in your system? If your update interval is every 24 hours, then the architecture of your ETL and storage is simplified greatly.
Summary: we tried to put the model outlined below into practice and immediately found that this architecture wasn’t suited for us. Maybe you don’t have enough data to justify the cost of this approach, or the size of technical team required to deliver it.
This is a simple ETL pipeline:
Some things to consider when using AWS Glue
- It was difficult to get a developer pipeline set up (this could have been down to inexperience as well). Initially we followed this post from AWS; what worked for us may not work for you. Depending on your organization, testing with AWS credentials may not be an option when developing locally, so Docker may be best. I recommend against using the AWS Glue Studio ETL library, as we ran into quite a few path issues with it. Using Jupyter notebooks was helpful to get started, but became a bit of a pain point, since we had to move between our Spark script (in CI/CD) and the notebook manually and keep a copy of each.
- That brings me to my second issue: you don’t know what you don’t know. These services seem to be built with the assumption that you’re in the data science space, and if you’re not, like me, then it’s difficult to get started. Since these services are quite niche, finding information online is currently a bit difficult; I suspect this will get better with time. Just a tip: to understand Glue, you need foundational knowledge of Python, PySpark, and Apache Spark. Glue also has its own abstractions on top of Spark, e.g. the DynamicFrame class.
- If your data is messy, the abstractions that Glue provides may be insufficient to resolve data issues. In that case you can drop from the functions Glue provides down to plain Spark functions for more flexibility, as in the sketch below.
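Roughly, that looks like this: convert the DynamicFrame to a Spark DataFrame, do the heavier clean-up there, and wrap it back up so the rest of the Glue script is unchanged. The frame and column names are placeholders, and glueContext is the context object from earlier in the script.

    from awsglue.dynamicframe import DynamicFrame
    from pyspark.sql import functions as F

    # Drop down to a plain Spark DataFrame for transforms Glue's helpers don't cover.
    df = my_dynamic_frame.toDF()

    df = (
        df.dropDuplicates(["id"])                                       # de-duplicate by key
          .withColumn("amount", F.col("amount").cast("decimal(10,2)"))  # force a concrete type
    )

    # Wrap it back into a DynamicFrame so the rest of the Glue script works as before.
    cleaned = DynamicFrame.fromDF(df, glueContext, "cleaned")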
- Resolving ambiguous types: as a rule of thumb, I recommend casting fields at the beginning of your script. I had the most trouble with integer primary key fields from my RDS database, as well as some varchar fields. Keep in mind that Glue only supports a limited set of data types (listed in the AWS documentation).
data_res1 = myDataSource.resolveChoice(
    specs=[("dbIntColumn", "cast:int")],
    transformation_ctx="data_res1",
)
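For completeness, cast is only one of the actions resolveChoice accepts. As far as I can tell, there are also make_cols (which splits an ambiguous column into one column per type, which is exactly where names like dbIntColumn_int and dbIntColumn_float in a later point come from), make_struct, and project. Casting early is what kept our output tables predictable.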
- Debugging is a bit difficult. There are three places you need to look for logs when running scripts in the AWS console. I usually have all three tabs open when debugging and bounce among them.
- Controlling the tables of the output DB: Glue can actually manage the database schema for you. This is a double-edged sword, as depending on how you resolve ambiguous types, it may result in dropped columns or duplicate columns, each with its own data type, e.g. dbIntColumn_int, dbIntColumn_float. If there’s one thing I wish AWS had big and bold in their documentation, it’s the fact that Glue can manage your output tables when using glueContext.write_dynamic_frame.from_jdbc_conf (there’s a sketch of this after the list). There’s no need to create the tables yourself in Redshift. This was a major pain point for us, especially with columns containing JSON data that had to be extracted: the statement the script generated would almost never match our pre-defined schema.
- Lastly, spend the time and prepare your data. This was the most difficult part, the part I’ll admit I didn’t do well, and arguably the most important. Spend time understanding, prepping, and cleaning your data as much as possible before embarking on your ETL journey; it also helps keep your ETL script simple.
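Since that from_jdbc_conf behaviour was such a surprise to us, here is a rough sketch of what the Redshift write looks like. The connection name, schema, table, and S3 temp path are placeholders; the point (per the pain point above) is that Glue handles the target table itself, using the S3 temp directory to stage data for the load into Redshift.

    # "warehouse-redshift" is a Glue connection defined in the console; all names are placeholders.
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=cleaned,                        # the DynamicFrame produced by the transforms above
        catalog_connection="warehouse-redshift",
        connection_options={
            "dbtable": "public.orders",       # Glue manages this table (see the point above)
            "database": "warehouse",
        },
        redshift_tmp_dir="s3://my-bucket/glue-temp/",  # staging area for the Redshift load
        transformation_ctx="redshift_sink",
    )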
That’s a wrap for now,
As you can see, there is still quite a bit for me to learn, but hopefully I’ve outlined enough things for you to look out for that it saves you and your team some time.
Have you used AWS Glue for ETL before? What’s your experience? I would love to hear about it.
Leave a comment down below.