User Tools

Site Tools


concepts:processes:etl

ETL

The ETL process describes the data extraction, transformation and loading process within a data warehouse or an OLAP application. In detail, in means that data must be extracted from the source systems, then validated and finally loaded into a database of the data warehouse and in the OLAP cube. Therefore, the ETL process is the basis of the whole data warehouse architecture.

Extract

Extraction is the first step of the ETL process. 1) It includes the extraction of required data from previously defined source systems. As there can be many source systems, the data organization and format is not everywhere the same. Therefore, it is part of the extraction process to transform the data into one consistent organization and format. The constant actuality of the data is given through synchronic or asynchronous extraction from the source systems.

Synchronic extraction: changes in the data of the source systems are immediately forwarded to the data warehouse. This is also called real-time-data-warehousing. Asynchronous extraction: With this extraction type, data can be received in three different variations:

  • Periodically: in certain intervals the data warehouse asks for changes in the source systems
  • Event-driven: the source systems create an abstract of the data when a defined event occurs and the data is forwarded to the data warehouse
  • Request-driven: the ongoing data of the source system can be retrieved by a manual request

Because the extraction is mostly placed during time frames when the source system is not fully loaded, which is a proportionally short amount of time, this process has to be on a high-performance level. Therefore there are just small data transformations in the extraction phase.

The main goal of the extraction is to put all ongoing data into a common format and organization for the following data transformation process. Thereby, the data are getting parsed, that means they are being validated and eventually rejected.

Transform

The extracted data are transformed into the required target data model during the transformation phase. The given data quality gets analyzed and improved with the data cleansing process of the database. The consistency check requires that all of the heterogeneous extracted data are getting associated with each other. Even with a small amount of data, this process can be very time-consuming, e.g. for so called lookups. There is a lookout when during a transformation a further request is executed, to receive additional information for the dataset that’s currently been dealt with. Generally the transformation can be distinguished in two steps:

  • Syntactical transformation: Improvement, realisation and correction of the data based on syntactic requirements of the target data model (e.g. consistent date format)
  • Semantic transformation: the data are being analyzed with regards to its content (e.g. aggregation, deleting duplicates)

Load

The transformed data gets integrated into the data warehouse by shifting it physically to the target database. The data which is between the source systems and the data warehouse is placed in the data staging area. To improve the technical performance only the updated data (delta) is loaded in the data warehouse. With an implemented version control it is possible to restore former data pools.

References

  • Jarke, Matthias; Lenzerini, Maurizio; Vassiliou, Yannis; Vassiliadis, Panos: Fundamentals of Data Warehouses.Springer; 2nd, rev. and extended ed. edition, 2003
  • Kimball, Ralph; Caserta, Joe: The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleanin. Wiley, 2004
  • Messerschmidt, Hartmut; Schweinsberg, OLAP mit dem SQL-Server. Kassel: Dpunkt , 2003
  • Ponniah, Paulraj: Data Warehousing Fundamentals for IT Professionals. Wiley; 2 edition, 2010
  • Ullrey, Bruce: Implementing a Data Warehouse: A methodology that worked. AuthorHouse, 2007
1)
Kimball, Ralph; Caserta, Joe: The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleanin. Wiley, 2004
concepts/processes/etl.txt · Last modified: 2020/08/20 13:27 (external edit)