What's an ETL/ELT
ETL stands for Extract Transform Load, while ELT stands for Extract Load Transform.
They are two kinds of tools that are specialised in Data processing and transformation. They are mainly used by Data Engineer, BI engineer and some Data Scientists.
They are specialized in
- extracting data from different kind of databases and datasources (eg. API and files)
- transforming the data : rename, aggregate, compute, filter, clean, ...
- loading the data into another data source (or database)
The main difference between an ETL and ELT is "where" the processing occures:
- The ETL extracts the data and process it on a dedicated server, then loads the result on a target database. Extract -> Transform -> Load
- The ELT moves the data from a source to a target, and then use the power of the database server to perform the transformations. Extract -> Load -> Transform
That's the philosophical difference and it defines two big families of tools. Each family has many competing tools. I mainly used Stambia (ELT) and Talend (ETL) and this article is mainly based off my experience with those two softwares.
On a hardware point of view, the main difference between the ETL and the ELT is the power of your middle server. An ETL processes the data on its own server, so it will need to be very powerful. Talend has multithreading capabilities that can dramatically speedup the processing, but will need a strong server. On the other hand, the ELT relies on the database power. The server hosting Stambia, for example, won't need to be that strong, as it is only a SQL queries orchestrator.
Why are they used for
They are mainly used for data pipelines. See, your company has valuable data all over the place : in operational databases mainly, and in user files (you know, all those infamous Excel files!). Let's say you are working in a retail company. Your company will probably have :
- a Warehouse Management System
- a Customer order management System
- a Transport Management System
- and much more
If your company wants to know the value of sold good that were transported on a specific day, you'll need to retrieve and mix data from all those sources in order to provide an answer. ETL/ELT will dramatically help you !
Main advantage :
Speed ! If I had to name only one advantage of those tools, it would be speed. Development speed, learning speed, speed to undestand a flow, speed to produce documentation.
Complexity is hidden through the use of components or templates. You don't actually "code". You simply choose the right component for your purpose and configure it a bit. An expert already took a lot of time to give you a ready to use component that's designed for your usecase.
So you just have to fill the right parameters and don't worry about the "how".
There's also something called "metadata". They are descriptions of data structures, such as a CSV file, a database table, a json structure. Those descriptions may be infered from the data, or database dictionnary, with a single click. That's a major dev speedup and one less source of mistakes.
BI engineers can focus on understanding the data and processing it, instead of having to worry about how to retrieve the data and push it to a database. They also keep working mainly with their usual language : SQL.
General Pros & Cons:
Pros:
- Easy to read : an experienced user can grasp the global concept of a data flow in a minute.
- Easy to document :
- being graphical interfaces, those tools allow different level of documentation.
- they are usually able to generate pdf or html docs on their own
- Fast development
- No need to know the specific tweaks of your database
- Can connect to almost any database without problem
- Stambia and Talend are based on Eclipse, giving you access to all the Eclipse plugins
- Components are easier to use and remember than functions
- Errors and logs are usually clear and helpful
Cons:
- There's a learning curve, as always
- ETL may be slow, if you don't have the right server or don't use the best practices
- ELT are more limited in terms of functionnalities than ETL
- IDE (Eclipse) is heavy
- You don't really dive into the code. It's generated and bloated.
One more word about Talend :
Talend is the first ETL I learned to use. It's under continuous development and offer a very large amount of components (more than 300), and can do much more than just connecting to databases. There are orchestrations, logs, file managements, error management and system dedicated components. It's very versatile and even allow you to write your own (java) code. There is a free open source version called Talend Open Studio which will allow anyone interested to try it and replicate my next posts.
Conclusion :
Clearly those kind of tools are productivity tools. They aim at getting the job done fast and stay readable. If you love and want to code, then they're probably not for you. But if coding is just a way to achieve a goal, and that this goal revolve around processing data, then they'll be your best friends.
Future posts
I have some ideas I'd like to test on Hive or which are crypto related. I'll write about those here and go deeper in the inners of Talend with you.
I'd like to fiddle with :
- Hive engine and API interactions
- Google Drive connectivity
- BigQuery connectivity
- Interaction with a python machine-learning model
- ...
Little disclaimer : my job and family life does not leave me with a lot of free time, so posts won't be on a regular schedule !