Tuesday, March 20, 2012

DIMENSIONAL MODELING - for beginners


Dimensional modeling is the name of a logical design technique often used for data-warehouses. Dimensional modeling provides the best results for both ease of use and high performance.
Data Warehouse is loaded from the source database. So we design Dimensional model from the existing OLTP/ Normalized source DB model.
Before going in to the terms and theories, first let us see how an OLTP to Dimensional model will look like:


In Dimensional model, tables are divided in to either fact tables or dimensional tables.
FACT table
Fact table contains columns which are numeric data (measures). In the ORDER FACT, other than PKs and FKs, rest of the columns are measures like QUANTITY, QUOTED_PRICE, UNIT_PRICE, etc.

DIMENSION table
Dimension tables on the other hand have only textual data or non-text data that is used for textual purposes.  This data is used for descriptive purposes only.  In the above example there are 4 Dimension tables with each table containing only descriptive data.

STAR schema / Snowflake: Dimensional Models are mostly star schema since it looks like star with many dimensions surrounding a Fact. When one or more dimension tables are normalized, then the model will be referred to as Snowflake (Note: Fact table won’t be normalized).

Steps for ER to Dimensional Models
  • 1Separate each entity into the business process that it represents.
  • 2Create fact tables by selecting numeric and additive non-key facts.  Fact tables may be a detail level or at an aggregated level depending on business needs.
  • 3Create Dimensions by de-normalizing all the remaining tables into flat tables with atomic keys that connect directly to the fact tables.


Data Mart: According to the different business process, there will different Facts like Product, Order, Sales, etc and a data mart is essentially a coordinated set of fact tables, all with similar structures. Reporting tools and Business Intelligence tools normally works on top of data marts/OLAPs.

Why we need a separate Data Warehouse when all data is present in the source system?
  • When we use source system for reporting purpose, then transactions like Insert, delete, update will take more time to complete which may affect thousands of users.
  • 2There might be different type of databases (Oracle, SQL Server, CSV, XML, etc) used in source system and to extract reports from all these sources can be done only after bringing those to a common platform.


Why we need Data Warehouse to be in Dimensional Model? Just another copy of source system in one platform isn’t enough?
Suppose you want to find total no. of orders done through a particular Sales representative,
In ER model, query will be like following:
select SUM(od.QUANTITY) from SALESREP s, CUSTOMER c, ORDERS o, ORDER-DETAIL od, PARTS p
where s.SLSREP_NR=c.SLSREP_NR and c.CUSTOMER_NR=o.CUSTOMER_NR and o.ORDER_NR=od.ORDER_NR and od.PART_NR=p.PART_NR
and p.PART_NR=3321223 and s.SLSREP_NR=8769     

In Dimensional Model, below query will fetch the same output as produced by above query:
select SUM(QUANTITY) from ORDER_FACT
where PART_NR=3321223 and SLSREP_NR=8769

Did you gain performance? If yes, then where?
Yes. You saved 4 table joins in Dimensional model! (Join operation is considered to be one of the hectic processes for Database software).

Isn’t Dimensional model worth or not?
There are lot more concepts in Data warehouse model which convince the necessity of having Dimensional model for data Warehouses. This model was founded by Ralph Kimball.
This paper is just an introduction for beginners and as I said before, there are lot of concepts still to be discussed.

Monday, March 19, 2012

Normalization - In Simple words

In the design of a relational database management system (RDBMS), the process of organizing data to minimize redundancy is called Normalization
The normalization rules are designed to prevent update anomalies and data inconsistencies. But with respect to performance tradeoffs, Normalization tend to penalize retrieval, since data which may have been retrievable from one record in an de-normalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.

Why do we need to do normalization?
To eliminate redundancy of data i.e. having same information stored at multiple places, which eventually be difficult to maintain and will also increase the size of our database.
With normalization we will have tables with fewer columns which will make data retrieval and insert, update and delete operations more efficient.

How many normal forms are there?
There are seven normal forms.
They are
  • First Normal Form
  • Second Normal Form
  • Third Normal Form
  • Boyce-Codd Normal Form
  • Fourth Normal Form
  • Fifth Normal Form
  • Sixth or Domain-key Normal form

What do we mean when we say a table is not in normalized form?
Let’s take an example to understand this,
Say I want to create a database which stores my friends name and their top three favorite dishes.
This database would be quite a simple so initially. I’ll be having only one table in it say FRIENDS table. Here F_ID is the Primary key.

FRIENDS
F_ID
F_NAME
Favorite_Dish
1
Srihari
Pizza, Burger, Noodles
2
Arvind
Hot Dog, Wedges, Fries

This table is not in normal form why?
Favorite_Dishes column is not atomic or doesn’t have scalar value i.e. it have more than one value.
Let’s modify this table
FRIENDS
F_ID
F_NAME
Favorite_Dish_1
Favorite_Dish_2
Favorite_Dish_3
1
Srihari
Pizza
Burger
Noodles
2
Arvind
Hot Dog
Wedges
Fries

This table is still not in normal form. Why?
We have now changed our table and now each column has only one value!! (So what’s left?)
Because here we are having multiple columns with same kind of value.
I.e. repeating group of data or repeating columns.
So what we need to do to make it normal or at least bring it in First Normal Form?
  1. We’ll first break our single table into two.
  2. Each table should have information about only one entity so it would be nice if we store our friend’s information in one table and his favorite dish’s  information in another
(For simplicity we are working with few columns but in real world scenario there could be column like friend’s phone no, email , address and favorites artists albums, awards received by them, country etc. So in that case having two different tables would make complete sense)

FRIENDS
F_ID
F_NAME
1
Srihari
2
Arvind

FAVORITE_DISHES
F_ID
(FOREIGN KEY ON FRIENDS.F_ID)
FAVORITE_DISH
1
Akon.
1
The Corrs
1
Robbie Williams
2
Enigma
2
Chicane
2
Shania Twain


F_ID is foreign key in FAVORITE_DISHES table which refers to Primary Key F_ID in our FRIENDS Table.
Now we can say that our table is in First Normal Form.
Remember For First Normal Form,
Column values should be atomic, scalar or should be holding single value
No repetition of information or values in multiple columns.

So what does Second Normal Form means?
For second normal form our database should already be in First Normal Form and every non-key column must depend on entire Primary Key.
Here we can say that our Friend database was already in Second Normal Form.
Why?
Because we don’t have Composite Primary Key in our FRIENDS and FAVORITE_DISHES table.
Composite Primary Keys are primary keys made up of more than one column. But there is no such thing in our database.
But still let’s try to understand Second Normal Form with another example
This is our new table

GADGETS
GADGET_NAME
SUPPLIER_NAME
COST
SUPPLIER_ADDRESS
Headphone
Abaci
123$
New York
Mp3 Player
Sagas
250$
California
Headphone
Mayas
100$
London

In above table (GADGET_NAME+SUPPLIER_NAME) together form a Composite Primary Key.
Let’s check for dependency for COST
If I know GADGET_NAME, can I know the COST?
No same gadget is provided by different supplier at different rate.
If I know SUPPLIER_NAME, can I know about the COST?
No because same supplier can provide me with different gadgets.
If I know both GADGET_NAME and SUPPLIER_NAME, can I know COST?
Yes, then we can.
So COST is fully/functionally dependent on our Composite Primary Key: (GADGET_NAME + SUPPLIER_NAME).
Now let’s check with another non-key column SUPPLIER_ADDRESS
If I know GADGET_NAME, will I come to know about the SUPPLIER_ADDRESS?
Obviously, No.
If I know who the SUPPLIER_NAME is, can I have their address?
Yes.
So here SUPPLIER_ADDRESS is not completely dependent on (ie. partial dependent) on Our Composite Primary Key (GADGET_NAME + SUPPLIER_NAME).
This table is surely not in Second Normal Form.
So what do we need to do to bring it to Second Normal Form?
Here again we’ll break the table in two.
SUPPLIER
SUPPLIER_NAME
SUPPLIER_ADDRESS
Abaci
New York
Sagas
California
Mayas
London

GADGETS
GADGET_NAME
SUPPLIER_NAME
(FOREIGN KEY ON SUPPLIER.SUPPLIER_NAME)
COST
Headphone
Abaci
123$
Mp3 Player
Sagas
250$
Headphone
Mayas
100$


Now we know how to normalize till Second Normal Form.
But let’s take a break over here and learn some definitions and terms.
Composite Key: -Composite key is a primary key composed of multiple columns.
Functional Dependency – When value of one column is dependent on another column. ie if value of one column changes, the value of other column changes as well.
e.g. SUPPLIER_ADDRESS is functionally dependent on SUPPLIER_NAME. If supplier’s name is changed in a record we need to change the supplier address as well.
Partial Functional DependencyA non-key column is dependent on some, but not all the columns in a Composite Primary Key.
e.g. SUPPLIER_ADDRESS was partially dependent on our Composite Key Columns (GADGET_NAME + SUPPLIER_NAME)
Transitive Dependency - A type of functional dependency in which the value in a non-key column is determined by the value in another non-key column.
With these definitions in mind let’s move to Third Normal Form.

For a table in Third Normal Form
  • It should already be in Second Normal Form.
  • There should be no Transitive Dependency, i.e. we shouldn’t have any non-key column depending on any other non-key column.
Again we need to make sure that the non-key columns depend upon the Primary Key and not on any other non-key column.
LITERATURE
BOOK_NAME
AUTHOR
(a NON KEY)
NO_OF_SECTIONS
COUNTRY
(NON KEY DEPENDENT ON AUTHOR)
11
Canada
15
USA
11
Canada
Enid Blyton
8
Spain

Enid Blyton

10
Spain

Although the above table looks fine but still there is something in it because of which we will normalize it further.
BOOK_NAME is the Primary Key of the above table.
AUTHOR and NO_OF_SECTIONS are functionally dependent on the BOOK_NAME (Primary Key).
But can we say the same with COUNTRY as well?
In the above table, COUNTRY value is getting repeated because of AUTHOR.
So in our above table, COUNTRY column is depended on AUTHOR column which is a non-key column.
So we will move that information in another table and could save table from redundancy i.e. repeating values of COUNTRY column.

AUTHORS

AUTHOR_ID
AUTHOR
COUNTRY
1
Canada
2
USA
3
Enid Blyton
Spain


LITERATURE
BOOK_NAME
AUTHOR_ID
(FOREIGN KEY TO AUTHORS.AUTHOR_ID)
NO_OF_SECTIONS
1
11
2
15
1
11
3
8
3
10


Now this is in 3rd Normal Form.
Informally, a relational database table is often described as "normalized" if it is in the Third Normal Form. Most 3NF tables are free of insertion, update, and deletion anomalies, i.e. in most cases 3NF tables adhere to BCNF, 4NF, and 5NF (but typically not 6NF).
A standard piece of database design guidance is that the designer should create a fully normalized design. However, in Data Warehouse Design (Dimensional modeling approach) recommend non-normalized designs, i.e. designs that in large part do not adhere to 3NF.