Naaz Vilantakath: March 2012

Dimensional modeling is the name of a logical design technique often used for data-warehouses. Dimensional modeling provides the best results for both ease of use and high performance.

Data Warehouse is loaded from the source database. So we design Dimensional model from the existing OLTP/ Normalized source DB model.

Before going in to the terms and theories, first let us see how an OLTP to Dimensional model will look like:

In Dimensional model, tables are divided in to either fact tables or dimensional tables.

FACT table

Fact table contains columns which are numeric data (measures). In the ORDER FACT, other than PKs and FKs, rest of the columns are measures like QUANTITY, QUOTED_PRICE, UNIT_PRICE, etc.

DIMENSION table

Dimension tables on the other hand have only textual data or non-text data that is used for textual purposes. This data is used for descriptive purposes only. In the above example there are 4 Dimension tables with each table containing only descriptive data.

STAR schema / Snowflake: Dimensional Models are mostly star schema since it looks like star with many dimensions surrounding a Fact. When one or more dimension tables are normalized, then the model will be referred to as Snowflake (Note: Fact table won’t be normalized).

Steps for ER to Dimensional Models

1Separate each entity into the business process that it represents.
2Create fact tables by selecting numeric and additive non-key facts. Fact tables may be a detail level or at an aggregated level depending on business needs.
3Create Dimensions by de-normalizing all the remaining tables into flat tables with atomic keys that connect directly to the fact tables.

Data Mart: According to the different business process, there will different Facts like Product, Order, Sales, etc and a data mart is essentially a coordinated set of fact tables, all with similar structures. Reporting tools and Business Intelligence tools normally works on top of data marts/OLAPs.

Why we need a separate Data Warehouse when all data is present in the source system?

When we use source system for reporting purpose, then transactions like Insert, delete, update will take more time to complete which may affect thousands of users.
2There might be different type of databases (Oracle, SQL Server, CSV, XML, etc) used in source system and to extract reports from all these sources can be done only after bringing those to a common platform.

Why we need Data Warehouse to be in Dimensional Model? Just another copy of source system in one platform isn’t enough?

Suppose you want to find total no. of orders done through a particular Sales representative,

In ER model, query will be like following:

select SUM(od.QUANTITY) from SALESREP s, CUSTOMER c, ORDERS o, ORDER-DETAIL od, PARTS p

where s.SLSREP_NR=c.SLSREP_NR and c.CUSTOMER_NR=o.CUSTOMER_NR and o.ORDER_NR=od.ORDER_NR and od.PART_NR=p.PART_NR

and p.PART_NR=3321223 and s.SLSREP_NR=8769

In Dimensional Model, below query will fetch the same output as produced by above query:

select SUM(QUANTITY) from ORDER_FACT

where PART_NR=3321223 and SLSREP_NR=8769

Did you gain performance? If yes, then where?

Yes. You saved 4 table joins in Dimensional model! (Join operation is considered to be one of the hectic processes for Database software).

Isn’t Dimensional model worth or not?

There are lot more concepts in Data warehouse model which convince the necessity of having Dimensional model for data Warehouses. This model was founded by Ralph Kimball.

This paper is just an introduction for beginners and as I said before, there are lot of concepts still to be discussed.

In the design of a relational database management system (RDBMS), the process of organizing data to minimize redundancy is called Normalization

The normalization rules are designed to prevent update anomalies and data inconsistencies. But with respect to performance tradeoffs, Normalization tend to penalize retrieval, since data which may have been retrievable from one record in an de-normalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.

Why do we need to do normalization?

To eliminate redundancy of data i.e. having same information stored at multiple places, which eventually be difficult to maintain and will also increase the size of our database.

With normalization we will have tables with fewer columns which will make data retrieval and insert, update and delete operations more efficient.

How many normal forms are there?

There are seven normal forms.

They are

First Normal Form
Second Normal Form
Third Normal Form
Boyce-Codd Normal Form
Fourth Normal Form
Fifth Normal Form
Sixth or Domain-key Normal form

What do we mean when we say a table is not in normalized form?

Let’s take an example to understand this,

Say I want to create a database which stores my friends name and their top three favorite dishes.

This database would be quite a simple so initially. I’ll be having only one table in it say FRIENDS table. Here F_ID is the Primary key.

FRIENDS

F_ID	F_NAME	Favorite_Dish
1	Srihari	Pizza, Burger, Noodles
2	Arvind	Hot Dog, Wedges, Fries

This table is not in normal form why?

Favorite_Dishes column is not atomic or doesn’t have scalar value i.e. it have more than one value.

Let’s modify this table

FRIENDS

F_ID	F_NAME	Favorite_Dish_1	Favorite_Dish_2	Favorite_Dish_3
1	Srihari	Pizza	Burger	Noodles
2	Arvind	Hot Dog	Wedges	Fries

This table is still not in normal form. Why?

We have now changed our table and now each column has only one value!! (So what’s left?)

Because here we are having multiple columns with same kind of value.

I.e. repeating group of data or repeating columns.

So what we need to do to make it normal or at least bring it in First Normal Form?

We’ll first break our single table into two.
Each table should have information about only one entity so it would be nice if we store our friend’s information in one table and his favorite dish’s information in another

(For simplicity we are working with few columns but in real world scenario there could be column like friend’s phone no, email , address and favorites artists albums, awards received by them, country etc. So in that case having two different tables would make complete sense)

FRIENDS

F_ID	F_NAME
1	Srihari
2	Arvind

FAVORITE_DISHES

F_ID (FOREIGN KEY ON FRIENDS.F_ID)	FAVORITE_DISH
1	Akon.
1	The Corrs
1	Robbie Williams
2	Enigma
2	Chicane
2	Shania Twain

F_ID is foreign key in FAVORITE_DISHES table which refers to Primary Key F_ID in our FRIENDS Table.

Now we can say that our table is in First Normal Form.

Remember For First Normal Form,

Column values should be atomic, scalar or should be holding single value

No repetition of information or values in multiple columns.

So what does Second Normal Form means?

For second normal form our database should already be in First Normal Form and every non-key column must depend on entire Primary Key.

Here we can say that our Friend database was already in Second Normal Form.

Why?

Because we don’t have Composite Primary Key in our FRIENDS and FAVORITE_DISHES table.

Composite Primary Keys are primary keys made up of more than one column. But there is no such thing in our database.

But still let’s try to understand Second Normal Form with another example

This is our new table

GADGETS

GADGET_NAME	SUPPLIER_NAME	COST	SUPPLIER_ADDRESS
Headphone	Abaci	123$	New York
Mp3 Player	Sagas	250$	California
Headphone	Mayas	100$	London

In above table (GADGET_NAME+SUPPLIER_NAME) together form a Composite Primary Key.
Let’s check for dependency for COST

If I know GADGET_NAME, can I know the COST?

No same gadget is provided by different supplier at different rate.

If I know SUPPLIER_NAME, can I know about the COST?

No because same supplier can provide me with different gadgets.

If I know both GADGET_NAME and SUPPLIER_NAME, can I know COST?

Yes, then we can.

So COST is fully/functionally dependent on our Composite Primary Key: (GADGET_NAME + SUPPLIER_NAME).

Now let’s check with another non-key column SUPPLIER_ADDRESS

If I know GADGET_NAME, will I come to know about the SUPPLIER_ADDRESS?

Obviously, No.

If I know who the SUPPLIER_NAME is, can I have their address?

Yes.

So here SUPPLIER_ADDRESS is not completely dependent on (ie. partial dependent) on Our Composite Primary Key (GADGET_NAME + SUPPLIER_NAME).

This table is surely not in Second Normal Form.

So what do we need to do to bring it to Second Normal Form?
Here again we’ll break the table in two.

SUPPLIER

SUPPLIER_NAME	SUPPLIER_ADDRESS
Abaci	New York
Sagas	California
Mayas	London

GADGETS

GADGET_NAME	SUPPLIER_NAME (FOREIGN KEY ON SUPPLIER.SUPPLIER_NAME)	COST
Headphone	Abaci	123$
Mp3 Player	Sagas	250$
Headphone	Mayas	100$

Now we know how to normalize till Second Normal Form.

But let’s take a break over here and learn some definitions and terms.

Composite Key: -Composite key is a primary key composed of multiple columns.

Functional Dependency – When value of one column is dependent on another column. ie if value of one column changes, the value of other column changes as well.

e.g. SUPPLIER_ADDRESS is functionally dependent on SUPPLIER_NAME. If supplier’s name is changed in a record we need to change the supplier address as well.

Partial Functional Dependency – A non-key column is dependent on some, but not all the columns in a Composite Primary Key.

e.g. SUPPLIER_ADDRESS was partially dependent on our Composite Key Columns (GADGET_NAME + SUPPLIER_NAME)

Transitive Dependency - A type of functional dependency in which the value in a non-key column is determined by the value in another non-key column.

With these definitions in mind let’s move to Third Normal Form.

For a table in Third Normal Form

It should already be in Second Normal Form.
There should be no Transitive Dependency, i.e. we shouldn’t have any non-key column depending on any other non-key column.

Again we need to make sure that the non-key columns depend upon the Primary Key and not on any other non-key column.

LITERATURE

BOOK_NAME	AUTHOR (a NON KEY)	NO_OF_SECTIONS	COUNTRY (NON KEY DEPENDENT ON AUTHOR)
Aaron Trow	Anthony Trollope	11	Canada
The Deerslayer	James Fenimore Cooper	15	USA
The Belton Estate	Anthony Trollope	11	Canada
Five Go Adventuring Again	Enid Blyton	8	Spain
Five On a Treasure Island	Enid Blyton	10	Spain

Although the above table looks fine but still there is something in it because of which we will normalize it further.

BOOK_NAME is the Primary Key of the above table.

AUTHOR and NO_OF_SECTIONS are functionally dependent on the BOOK_NAME (Primary Key).

But can we say the same with COUNTRY as well?

In the above table, COUNTRY value is getting repeated because of AUTHOR.

So in our above table, COUNTRY column is depended on AUTHOR column which is a non-key column.

So we will move that information in another table and could save table from redundancy i.e. repeating values of COUNTRY column.

AUTHORS

AUTHOR_ID	AUTHOR	COUNTRY
1	Anthony Trollope	Canada
2	James Fenimore Cooper	USA
3	Enid Blyton	Spain

LITERATURE

BOOK_NAME	AUTHOR_ID (FOREIGN KEY TO AUTHORS.AUTHOR_ID)	NO_OF_SECTIONS
Aaron Trow	1	11
The Deerslayer	2	15
The Belton Estate	1	11
Five Go Adventuring Again	3	8
Five On a Treasure Island	3	10

Now this is in 3^rd Normal Form.

Informally, a relational database table is often described as "normalized" if it is in the Third Normal Form. Most 3NF tables are free of insertion, update, and deletion anomalies, i.e. in most cases 3NF tables adhere to BCNF, 4NF, and 5NF (but typically not 6NF).

A standard piece of database design guidance is that the designer should create a fully normalized design. However, in Data Warehouse Design (Dimensional modeling approach) recommend non-normalized designs, i.e. designs that in large part do not adhere to 3NF.

Naaz Vilantakath

Tuesday, March 20, 2012

DIMENSIONAL MODELING - for beginners

Monday, March 19, 2012

Normalization - In Simple words

Enid Blyton

About Me