Wednesday, 16 July 2014

A Typical ETL (Extract, Transform, Load) Architecture for Data Integration

A Typical ETL (Extract, Transform, Load) Architecture for Data Integration

It is pretty much a common question in interviews to ask about ETL architecture. Let this not be confused with the various data warehouse architectures which is a bigger universe and we will discuss about data warehouse architecture in our next post. The diagram below is simple illustration of the ETL architecture. The ETL tool used here can be Informatica or any other tool such as SSIS or Data Stage.

The diagram below shows that ETL as a data integration tool can pull data from multiple sources. The source can be flat files, VSAM, web services, XML, or relational databases such as Oracle, DB2,SQL server etc.
Landing Area: The data from these sources in a normal ETL architecture is loaded first into a landing area. This landing area will be table(s) in some database that holds the source data as is. Not many transformations is applied to data while being loaded into landing area. If this involves daily loads, the batch keys can be used to identify the load for a particular day. Once the data is in the landing area, various transformations and business logic can be applied and data moved to staging area.

Staging Area: The staging area contains data to which the application logic is applied. Data from a single table or multiple tables from landing area can be combined to create table(s) in the staging area. The data is transformed, aggregated, filtered, and all the business logic applied before loading the data into staging area. For example: If there are daily loads and if the data from the latest batch updates the historical data, then the staging area is supposed to hold the most latest version. If data from multiple columns in landing table are used to derive a column for a business, then this transformation is done in the mappings that loads data to the staging area.

There can be instances where landing and staging are merged into one table if separating them is not required. Having a separate landing area helps to recreate the staging area any time from all the data in the landing area. So that means that landing area has all the historical data which is data from the first load upto to the latest batch of data.

Datawarehouse Area: Call it the data warehouse or datamart area is the place that business users and the reporting layer have access to. The data in the data mart is the final version of the data. Usually data from staging to datamart is passed with minimal transformation. It is a good practice to make data in datamart recoverable any time from the data in staging.Landing and staging are intermediate areas before the data lands up in datamarts.

This is quiet simple isn’t it? If any questions, drop a comment below.

See also: