A Typical ETL (Extract, Transform, Load) Architecture for Data Integration
It is
pretty much a common question in interviews to ask about ETL architecture. Let
this not be confused with the various data warehouse architectures which is a bigger universe and we will
discuss about data warehouse architecture in our next post. The diagram below
is simple illustration of the ETL architecture. The ETL tool used here can be Informatica or any other tool such as SSIS or Data Stage.
The
diagram below shows that ETL as a data integration tool can pull data from
multiple sources. The source can be flat files, VSAM, web services, XML, or
relational databases such as Oracle, DB2,SQL server etc.
Landing
Area: The data from these sources in a normal ETL architecture is loaded first
into a landing area. This landing area will be table(s) in some database that
holds the source data as is. Not many transformations is applied to data while
being loaded into landing area. If this involves daily loads, the batch keys
can be used to identify the load for a particular day. Once the data is in the
landing area, various transformations and business logic can be applied and
data moved to staging area.
Staging
Area: The staging area contains data to which the application logic is applied.
Data from a single table or multiple tables from landing area can be combined
to create table(s) in the staging area. The data is transformed, aggregated,
filtered, and all the business logic applied before loading the data into
staging area. For example: If there are daily loads and if the data from the
latest batch updates the historical data, then the staging area is supposed to
hold the most latest version. If data from multiple columns in landing table
are used to derive a column for a business, then this transformation is done in
the mappings that loads data to the staging area.
There can
be instances where landing and staging are merged into one table if separating
them is not required. Having a separate landing area helps to recreate the
staging area any time from all the data in the landing area. So that means that
landing area has all the historical data which is data from the first load upto
to the latest batch of data.
Datawarehouse
Area: Call it the data warehouse or datamart area is the place that business
users and the reporting layer have access to. The data in the data mart is the
final version of the data. Usually data from staging to datamart is passed with
minimal transformation. It is a good practice to make data in datamart
recoverable any time from the data in staging.Landing and staging are
intermediate areas before the data lands up in datamarts.
This is
quiet simple isn’t it? If any questions, drop a comment below.
See also: http://dwbitechguru.blogspot.ca/2014/07/business-intelligence-simple-definition.html
No comments:
Post a Comment