
Matillion Avancée
Passez à la vitesse supérieure avec Matillion : cette formation vous permet d’aller plus loin dans l’intégration de données cloud avec des traitements avancés, des API et la gestion de versions.
Article written by Eric Moss – ActinVision BI Consultant
Matillion has created a data processing tool that extracts data, loads it into a database and then transforms it. Unlike other tools with similar uses, such as Talend, Informatica PowerCenter or Oracle Data Integrator, Matillion is 100% Cloud-based. So there’s no need to install a server. Simply rent a virtual machine via Amazon Web Services, Google Cloud Platform or Microsoft Azure.
In this article
This tool is known as ELT: Extract, Load and Transform. Unlike other tools with similar uses, such as Talend, Informatica PowerCenter or Oracle Data Integrator, Matillion is 100% Cloud-based. So there’s no need to install a server. Simply rent a virtual machine via Amazon Web Services, Google Cloud Platform or Microsoft Azure.
To understand how Matillion works, we first need to understand the difference between an ELT and an ETL(Extract, Transform and Load).
In both approaches, the first common phase is the extraction of data from one or more sources. Sources can be as diverse as flat files, an API, a database, ERP, etc. The difference between ETL and ELT lies in the second step. The difference between ETL and ELT lies in the second stage: unlike an ETL, an ELT loads the data directly into the data warehouse(DWH). It inserts the data in raw form into a part of the warehouse known as the staging base. In this way, the data are already in the target database where they will be processed. In the case of ETL, it is generally necessary to install a server on which the data processing tool is located. In this case, the data is first loaded, then transformed and finally inserted into the data warehouse. With an ELT, the data is loaded once, then transformed.
For transformations, an ETL uses the resources of the server where it is located; it therefore needs its own computing power, independently of that allocated to the database – processing is “externalized”.ELT, on the other hand, makes direct use of the database’s computing power, since processing is carried out in situ, within the target database. Processing is written in the database’s proprietary language (SQL), as the data never leaves the database. TheELT uses the target database as a tool to perform the required manipulations. It pools computing resources for processing, exposure and storage.
The Matillion data processing tool is easy to use. From the interface, you select the components to be placed in the workspace. Then, simply order your requirements. With the tool’s various components, you can :
The interface consists of 5 parts:
1- The tree structure of created Jobs;
2- The list of available components;
3- The workspace of the open Job;
4- The characteristics area of the selected component;
5- And finally, the execution list of tasks and Logs.
Matillion offers what several other tools on the market (Informatica PowerCenter, Talend…) do, but in ELT. All the actions available in the interface are translated into SQL code and passed on to the Data Warehouse, which is responsible for executing this code. Matillion only generates the SQL code, so it cannot function without a data warehouse. Matillion is the processing and transformation orchestrator.
What sets Matillion apart is that it is 100% cloud-based, and relies exclusively on cloud-based data warehouse solutions. Matillion makes it easy to connect to these data warehouses and take advantage of their benefits: available computing power at lower cost, customized pricing, coupling with other services also managed in the Cloud.
Matillion comes in three solutions, all connected to a Data Warehouse Cloud service. They include :
These solutions provide different functionalities specific to the data warehousing services concerned. This enables users to choose their version of Matillion according to their Cloud host and preferences; each of the data warehouses has its own advantages and disadvantages. By way of example, Snowflake supports machine auto-shutdown after a user-selectable period of inactivity, which saves a considerable amount of money, unlike Redshift which, for the moment, does not.
What’s more, all three Matillion solutions have exactly the same interface; the orchestration and processing components are also the same. They can, however, be differentiated by the functionalities that each of the three Cloud warehouses offers.
The advantages of Matillion :
Matillion has 2 categories of jobs: orchestration jobs and transformation jobs. Orchestration jobs are used to retrieve data from one or more sources outside the data warehouse, to modify the data model, and to load and orchestrate the various processes. Transformation Jobs, on the other hand, are concerned with the reading of retrieved data, the various joins, the calculations to be performed and the writing.
As with Talend, an orchestration job can call other orchestration or transformation jobs. This means that a process can be reused several times, with or without conditions. The components used by Jobs differ according to the type of Job used. A component used in an orchestration job is unlikely to be available in a transformation job, and vice versa.
Matillion has a large number of different components. The major difficulty lies in finding the right component to meet your needs, as it can be easy to get lost or take a while to figure out exactly which component to use. For example, if we need to retrieve data from an ERP, it’s necessary to rummage through the various folders containing components to try and figure out which one is right for us. What’s more, since the solution is relatively new, there are currently few users in the community to turn to for help.
Matillion is similar to other data processing tools (Talend, PowerCenter…):
Drag & Drop of different components to build flows;
Availability of numerous native functions;
Ease of orchestrating different processes.
Anyone familiar with one or more similar tools will quickly get the hang of Matillion. In the final analysis, Matillion is very visual, and it’s easy to get an idea of the data flow involved. It’s also easy to see what data is loaded and where it’s processed, what manipulations are carried out, and all this step by step. For this reason, I believe that those new to data processing tools can quickly understand the logic behind Matillion.
One of the tool’s plus points is that there is a validation “check” for each modification made. There’s also the option of sampling processed data: this enables you to quickly validate your work, check that processing meets your needs, and correct any errors.
Matillion makes every effort to connect to the various existing data sources. The range of components available is an effective response to this problem. The only resource that Matillion can’t access is local data; the tool will inevitably need to be able to connect online to the source in order to retrieve the data. It’s important to bear in mind, however, that processing is carried out by the data warehouse on which Matillion relies. Not everything is possible, therefore, and some projects may be better carried out with a particular integration tool.
From my point of view, the strengths of the solution are :
Areas for improvement :