Just as a data warehouse is not an ordinary database, the star schema is not an ordinary database design. In a standard – say, operational – relational database, the third normal form (3NF) is a widely accepted requirement for data consistency and minimal resource consumption. A data warehouse, however, has different requirements, of which data access speed is the most important. That is why data warehouse design deviates from the principle of normalization1) and that is where the star schema comes into play.
The name derives from the picture that emerges when looking at the table model of a database organized in this form 2):
Figure 1: Star Schema Data Model
As this picture makes obvious, the dimension tables (Time, Geography and Product) are grouped around the central fact table (Sales) like the points of a star around its center 3). A star schema has none of the additional table hierarchies that are usual in standard databases, which results in both low complexity and high performance. Typically, each dimension table has one primary key, which is referenced by the fact table, resulting in one foreign key per dimension table. The combination of these foreign keys uniquely identifies a fact and thus forms the primary key of the fact table. The relationship between the fact table and the dimension tables is n:1. A fact – in this case a sale – always refers to exactly one specific product (one physical piece, to be precise) sold in exactly one location at exactly one point in time. In technical terms: an entry of the fact table always refers to exactly one entry of each dimension table. Conversely, a given product is not sold just once but many times, in many locations and at many points in time, so a product is typically affected by multiple sales transactions – in technical terms: an entry of a dimension table is referenced by multiple entries of the fact table.
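The key layout described above can be sketched in a few lines of DDL. The following Python/SQLite snippet is a minimal illustration; the column names and sample values are invented for this sketch, not taken from the figures:

```python
import sqlite3

# Minimal sketch of the star schema from Figure 1, using SQLite.
# Column names and sample data are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product   (product_id INTEGER PRIMARY KEY, name TEXT, product_group TEXT);
CREATE TABLE geography (geo_id     INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE time_dim  (time_id    INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);

-- The fact table holds one foreign key per dimension; their combination
-- uniquely identifies a fact and serves as the composite primary key.
CREATE TABLE sales (
    product_id INTEGER REFERENCES product(product_id),
    geo_id     INTEGER REFERENCES geography(geo_id),
    time_id    INTEGER REFERENCES time_dim(time_id),
    revenue    REAL,
    PRIMARY KEY (product_id, geo_id, time_id)
);
""")

# n:1 in action: one dimension entry (the product) is referenced by
# several fact entries, while each fact refers to exactly one product.
conn.execute("INSERT INTO product VALUES (1, 'Ryzen 7', 'CPU')")
conn.execute("INSERT INTO geography VALUES (1, 'Berlin', 'DE')")
conn.executemany("INSERT INTO time_dim VALUES (?,?,?,?)",
                 [(1, '01', 'Jan', 2024), (2, '02', 'Jan', 2024)])
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)",
                 [(1, 1, 1, 329.0), (1, 1, 2, 329.0)])
```

Note that the same `product_id` appears in two fact rows but each fact row carries exactly one `product_id`, `geo_id` and `time_id` – the n:1 relationship in miniature.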
As one can see, the dimension tables have no foreign keys of their own referencing further tables. This is the result of denormalization: whereas in a snowflake schema each dimension is distributed across multiple tables, breaking down the attributes of, say, a product so that each attribute value appears just once throughout the database, in the star schema a single dimension table contains all the relevant attributes. The consequence is redundancy: in contrast to the snowflake schema, the star schema accepts redundancy in the dimension tables, which becomes obvious when looking into, for example, the dimension table “Product”:
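To make the contrast concrete, the following sketch shows the same product dimension twice: once normalized, as a snowflake schema would store it, and once denormalized, as the star schema does. The table names (`product_sf`, `product_star`, `product_group`) are invented for this illustration:

```python
import sqlite3

# Illustrative contrast between snowflake and star storage of one dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake: the group attribute is normalized into its own table,
-- so each group name is stored exactly once ...
CREATE TABLE product_group (group_id INTEGER PRIMARY KEY, group_name TEXT);
CREATE TABLE product_sf    (product_id INTEGER PRIMARY KEY, name TEXT,
                            group_id INTEGER REFERENCES product_group(group_id));

-- ... whereas the star variant keeps the group name inline in the
-- dimension table, repeated for every product in that group.
CREATE TABLE product_star  (product_id INTEGER PRIMARY KEY, name TEXT,
                            group_name TEXT);
""")
conn.execute("INSERT INTO product_group VALUES (1, 'CPU')")
conn.executemany("INSERT INTO product_sf VALUES (?,?,?)",
                 [(1, 'Ryzen 5', 1), (2, 'Ryzen 7', 1)])
conn.executemany("INSERT INTO product_star VALUES (?,?,?)",
                 [(1, 'Ryzen 5', 'CPU'), (2, 'Ryzen 7', 'CPU')])
```

Reading the group of a product from `product_star` needs no join; reading it from `product_sf` requires joining `product_group` – exactly the tradeoff discussed next.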
Figure 2: Redundancy in dimension tables
As the table contains all relevant attributes of a product, there will inevitably be multiple products with identical values for certain attributes: for example, an online hardware shop sells more than one type of CPU, so there are multiple products belonging to the same group, “CPU”. Here the typical tradeoff in IT – memory consumption vs. speed – becomes apparent: for the sake of fast data retrieval, selecting product information from just one table without joining others4), the higher memory consumption for storing redundant attribute values is accepted. Besides this drawback, redundancy causes another one: database anomalies. Imagine IBM decides to enter the consumer hardware market by buying AMD: if the update of “AMD” to “IBM” in the product table is carried out incompletely, changing only some of the AMD CPU entries, the table is left in an inconsistent state, and a report on all IBM CPUs in the shop's product range would return a false result. In a data warehouse, however, both potential difficulties are less serious than they would be in a production database: first, the data volume, i.e. the number of entries, of the dimension tables is typically low compared to that of the fact table. Second, the dimension tables change relatively little compared to the fact table; in addition, they are updated in a controlled process managed by the ETL engine, which minimizes the risk of update anomalies.
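The update anomaly can be reproduced in a few lines. The following sketch (again with invented names and data) changes the vendor on only one of two AMD entries and then runs the report described above:

```python
import sqlite3

# Illustrative reproduction of the update anomaly: an incomplete update
# of a redundant attribute leaves the dimension table inconsistent.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE product
                (product_id INTEGER PRIMARY KEY, name TEXT,
                 vendor TEXT, product_group TEXT)""")
conn.executemany("INSERT INTO product VALUES (?,?,?,?)", [
    (1, "Ryzen 5", "AMD", "CPU"),
    (2, "Ryzen 7", "AMD", "CPU"),
])

# Incomplete update: only one of the two AMD rows is changed ...
conn.execute("UPDATE product SET vendor = 'IBM' WHERE product_id = 1")

# ... so a report on all IBM CPUs misses the second entry, although
# after the acquisition both products should be reported as IBM.
ibm_cpus = conn.execute(
    "SELECT COUNT(*) FROM product WHERE vendor = 'IBM'").fetchone()[0]
```

In a snowflake schema the vendor would live in a single row of its own table, so one `UPDATE` would be sufficient by construction.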
In consequence, there are significant arguments for preferring the star schema over the snowflake schema. Still, no general recommendation can be given: pros and cons exist for both schemas 5). Whether sacrificing memory efficiency and data consistency for speed makes sense therefore depends strongly on the individual characteristics and requirements of a data warehouse project.