The ETL process describes the extraction, transformation, and loading of data within a data warehouse or an OLAP application. In detail, it means that data must be extracted from the source systems, then validated, and finally loaded into a database of the data warehouse and into the OLAP cube. The ETL process is therefore the basis of the whole data warehouse architecture.
Extraction is the first step of the ETL process. It covers the extraction of the required data from previously defined source systems. As there can be many source systems, the organization and format of the data is not the same everywhere. It is therefore part of the extraction process to transform the data into one consistent organization and format. The data is kept constantly up to date through synchronous or asynchronous extraction from the source systems.
Synchronous extraction: changes in the data of the source systems are immediately forwarded to the data warehouse. This is also called real-time data warehousing. Asynchronous extraction: with this extraction type, data can be received in three different variations.
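The contrast between the two extraction types can be sketched in a few lines. This is a minimal illustration, not a real connector: the `SourceSystem` class and its methods are hypothetical, standing in for whatever change-capture or batch interface a source actually offers.

```python
# Hypothetical source system illustrating the two extraction types.
class SourceSystem:
    def __init__(self):
        self.rows = []            # current state of the source table
        self.listeners = []       # callbacks for synchronous push

    def subscribe(self, callback):
        """Synchronous extraction: register a callback that is invoked
        immediately on every change (real-time data warehousing)."""
        self.listeners.append(callback)

    def write(self, row):
        self.rows.append(row)
        for cb in self.listeners:   # forward the change at once
            cb(row)

    def snapshot(self):
        """Asynchronous extraction: the warehouse pulls the data later,
        e.g. on a nightly schedule."""
        return list(self.rows)

staged = []                      # staging area on the warehouse side
src = SourceSystem()
src.subscribe(staged.append)     # synchronous: every change arrives instantly

src.write({"id": 1, "amount": 10})
src.write({"id": 2, "amount": 25})

print(staged)                    # pushed in real time
print(src.snapshot())            # the same data, pulled asynchronously
```

In practice the synchronous path corresponds to triggers or change-data-capture streams, while the asynchronous path corresponds to scheduled batch pulls.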
Because extraction is mostly scheduled in time frames when the source system is not under full load, which is a proportionally short amount of time, this process has to perform at a high level. There are therefore only minor data transformations in the extraction phase.
The main goal of extraction is to bring all incoming data into a common format and organization for the subsequent transformation process. In doing so, the data is parsed, that is, validated and possibly rejected.
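The parsing step described above can be sketched as follows. The field names and validation rules here are illustrative assumptions, not part of the original text: two sources deliver the same data in different representations, and records that cannot be validated are rejected.

```python
# Sketch of parsing during extraction: heterogeneous source records are
# brought into one common format; records failing validation are rejected.
def parse(record):
    """Return a normalized record, or None if validation fails."""
    try:
        return {
            "customer_id": int(record["customer_id"]),
            "amount": round(float(record["amount"]), 2),
        }
    except (KeyError, ValueError):
        return None                      # reject malformed data

raw = [
    {"customer_id": "17", "amount": "99.90"},   # source A: strings
    {"customer_id": 18, "amount": 12.5},        # source B: numbers
    {"customer_id": "x", "amount": "1.00"},     # invalid -> rejected
]

clean = [r for r in (parse(rec) for rec in raw) if r is not None]
print(clean)   # two valid records in one consistent format
```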
During the transformation phase, the extracted data is transformed into the required target data model. The given data quality is analyzed and improved through the data cleansing process of the database. The consistency check requires that all of the heterogeneous extracted data be related to each other. Even with a small amount of data, this process can be very time-consuming, e.g. because of so-called lookups. A lookup occurs when, during a transformation, a further request is executed to retrieve additional information for the dataset currently being processed. Generally, the transformation can be divided into two steps.
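A lookup of this kind can be sketched with an in-memory dimension table. The customer dimension and field names are hypothetical; in a real warehouse the lookup is typically a query against another table, which is exactly why this step can dominate transformation time.

```python
# Sketch of a lookup during transformation: for each fact record a
# further request resolves additional information for the dataset
# currently being processed.
customer_dim = {                        # hypothetical dimension table
    17: {"name": "Acme GmbH", "region": "EMEA"},
    18: {"name": "Globex", "region": "APAC"},
}

def transform(fact):
    dim = customer_dim.get(fact["customer_id"], {})   # the lookup
    return {**fact, "region": dim.get("region", "UNKNOWN")}

facts = [{"customer_id": 17, "amount": 99.9},
         {"customer_id": 99, "amount": 5.0}]          # 99 has no match

enriched = [transform(f) for f in facts]
print(enriched)
```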
The transformed data is integrated into the data warehouse by physically moving it to the target database. Data that is in transit between the source systems and the data warehouse is held in the data staging area. To improve technical performance, only the changed data (the delta) is loaded into the data warehouse. With version control in place, it is possible to restore former data pools.
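The delta load and version control ideas can be sketched together. All structures here are illustrative assumptions: a timestamped staging area, a warehouse keyed by row id, and a snapshot taken after every load so that former data pools can be restored.

```python
# Sketch of a delta load with simple version control.
staging = [
    {"id": 1, "amount": 10, "updated_at": 1},
    {"id": 2, "amount": 25, "updated_at": 3},
]

warehouse = {}         # target database, keyed by row id
versions = []          # version control: one snapshot per load

def load_delta(last_load_ts):
    """Load only rows changed since the last load (the delta)."""
    delta = [r for r in staging if r["updated_at"] > last_load_ts]
    for row in delta:
        warehouse[row["id"]] = row
    versions.append(dict(warehouse))   # snapshot for later restore
    return len(delta)

n1 = load_delta(0)     # initial load: everything is "new"
staging.append({"id": 2, "amount": 30, "updated_at": 5})  # change arrives
n2 = load_delta(3)     # subsequent load: only the changed row

print(n1, n2)          # full load vs. delta load row counts
print(versions[0])     # former data pool, restorable before the update
```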