Monday, March 19, 2012

Normalization - In Simple words

In the design of a relational database management system (RDBMS), the process of organizing data to minimize redundancy is called Normalization
The normalization rules are designed to prevent update anomalies and data inconsistencies. But with respect to performance tradeoffs, Normalization tend to penalize retrieval, since data which may have been retrievable from one record in an de-normalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.

Why do we need to do normalization?
To eliminate redundancy of data i.e. having same information stored at multiple places, which eventually be difficult to maintain and will also increase the size of our database.
With normalization we will have tables with fewer columns which will make data retrieval and insert, update and delete operations more efficient.

How many normal forms are there?
There are seven normal forms.
They are
  • First Normal Form
  • Second Normal Form
  • Third Normal Form
  • Boyce-Codd Normal Form
  • Fourth Normal Form
  • Fifth Normal Form
  • Sixth or Domain-key Normal form

What do we mean when we say a table is not in normalized form?
Let’s take an example to understand this,
Say I want to create a database which stores my friends name and their top three favorite dishes.
This database would be quite a simple so initially. I’ll be having only one table in it say FRIENDS table. Here F_ID is the Primary key.

FRIENDS
F_ID
F_NAME
Favorite_Dish
1
Srihari
Pizza, Burger, Noodles
2
Arvind
Hot Dog, Wedges, Fries

This table is not in normal form why?
Favorite_Dishes column is not atomic or doesn’t have scalar value i.e. it have more than one value.
Let’s modify this table
FRIENDS
F_ID
F_NAME
Favorite_Dish_1
Favorite_Dish_2
Favorite_Dish_3
1
Srihari
Pizza
Burger
Noodles
2
Arvind
Hot Dog
Wedges
Fries

This table is still not in normal form. Why?
We have now changed our table and now each column has only one value!! (So what’s left?)
Because here we are having multiple columns with same kind of value.
I.e. repeating group of data or repeating columns.
So what we need to do to make it normal or at least bring it in First Normal Form?
  1. We’ll first break our single table into two.
  2. Each table should have information about only one entity so it would be nice if we store our friend’s information in one table and his favorite dish’s  information in another
(For simplicity we are working with few columns but in real world scenario there could be column like friend’s phone no, email , address and favorites artists albums, awards received by them, country etc. So in that case having two different tables would make complete sense)

FRIENDS
F_ID
F_NAME
1
Srihari
2
Arvind

FAVORITE_DISHES
F_ID
(FOREIGN KEY ON FRIENDS.F_ID)
FAVORITE_DISH
1
Akon.
1
The Corrs
1
Robbie Williams
2
Enigma
2
Chicane
2
Shania Twain


F_ID is foreign key in FAVORITE_DISHES table which refers to Primary Key F_ID in our FRIENDS Table.
Now we can say that our table is in First Normal Form.
Remember For First Normal Form,
Column values should be atomic, scalar or should be holding single value
No repetition of information or values in multiple columns.

So what does Second Normal Form means?
For second normal form our database should already be in First Normal Form and every non-key column must depend on entire Primary Key.
Here we can say that our Friend database was already in Second Normal Form.
Why?
Because we don’t have Composite Primary Key in our FRIENDS and FAVORITE_DISHES table.
Composite Primary Keys are primary keys made up of more than one column. But there is no such thing in our database.
But still let’s try to understand Second Normal Form with another example
This is our new table

GADGETS
GADGET_NAME
SUPPLIER_NAME
COST
SUPPLIER_ADDRESS
Headphone
Abaci
123$
New York
Mp3 Player
Sagas
250$
California
Headphone
Mayas
100$
London

In above table (GADGET_NAME+SUPPLIER_NAME) together form a Composite Primary Key.
Let’s check for dependency for COST
If I know GADGET_NAME, can I know the COST?
No same gadget is provided by different supplier at different rate.
If I know SUPPLIER_NAME, can I know about the COST?
No because same supplier can provide me with different gadgets.
If I know both GADGET_NAME and SUPPLIER_NAME, can I know COST?
Yes, then we can.
So COST is fully/functionally dependent on our Composite Primary Key: (GADGET_NAME + SUPPLIER_NAME).
Now let’s check with another non-key column SUPPLIER_ADDRESS
If I know GADGET_NAME, will I come to know about the SUPPLIER_ADDRESS?
Obviously, No.
If I know who the SUPPLIER_NAME is, can I have their address?
Yes.
So here SUPPLIER_ADDRESS is not completely dependent on (ie. partial dependent) on Our Composite Primary Key (GADGET_NAME + SUPPLIER_NAME).
This table is surely not in Second Normal Form.
So what do we need to do to bring it to Second Normal Form?
Here again we’ll break the table in two.
SUPPLIER
SUPPLIER_NAME
SUPPLIER_ADDRESS
Abaci
New York
Sagas
California
Mayas
London

GADGETS
GADGET_NAME
SUPPLIER_NAME
(FOREIGN KEY ON SUPPLIER.SUPPLIER_NAME)
COST
Headphone
Abaci
123$
Mp3 Player
Sagas
250$
Headphone
Mayas
100$


Now we know how to normalize till Second Normal Form.
But let’s take a break over here and learn some definitions and terms.
Composite Key: -Composite key is a primary key composed of multiple columns.
Functional Dependency – When value of one column is dependent on another column. ie if value of one column changes, the value of other column changes as well.
e.g. SUPPLIER_ADDRESS is functionally dependent on SUPPLIER_NAME. If supplier’s name is changed in a record we need to change the supplier address as well.
Partial Functional DependencyA non-key column is dependent on some, but not all the columns in a Composite Primary Key.
e.g. SUPPLIER_ADDRESS was partially dependent on our Composite Key Columns (GADGET_NAME + SUPPLIER_NAME)
Transitive Dependency - A type of functional dependency in which the value in a non-key column is determined by the value in another non-key column.
With these definitions in mind let’s move to Third Normal Form.

For a table in Third Normal Form
  • It should already be in Second Normal Form.
  • There should be no Transitive Dependency, i.e. we shouldn’t have any non-key column depending on any other non-key column.
Again we need to make sure that the non-key columns depend upon the Primary Key and not on any other non-key column.
LITERATURE
BOOK_NAME
AUTHOR
(a NON KEY)
NO_OF_SECTIONS
COUNTRY
(NON KEY DEPENDENT ON AUTHOR)
11
Canada
15
USA
11
Canada
Enid Blyton
8
Spain

Enid Blyton

10
Spain

Although the above table looks fine but still there is something in it because of which we will normalize it further.
BOOK_NAME is the Primary Key of the above table.
AUTHOR and NO_OF_SECTIONS are functionally dependent on the BOOK_NAME (Primary Key).
But can we say the same with COUNTRY as well?
In the above table, COUNTRY value is getting repeated because of AUTHOR.
So in our above table, COUNTRY column is depended on AUTHOR column which is a non-key column.
So we will move that information in another table and could save table from redundancy i.e. repeating values of COUNTRY column.

AUTHORS

AUTHOR_ID
AUTHOR
COUNTRY
1
Canada
2
USA
3
Enid Blyton
Spain


LITERATURE
BOOK_NAME
AUTHOR_ID
(FOREIGN KEY TO AUTHORS.AUTHOR_ID)
NO_OF_SECTIONS
1
11
2
15
1
11
3
8
3
10


Now this is in 3rd Normal Form.
Informally, a relational database table is often described as "normalized" if it is in the Third Normal Form. Most 3NF tables are free of insertion, update, and deletion anomalies, i.e. in most cases 3NF tables adhere to BCNF, 4NF, and 5NF (but typically not 6NF).
A standard piece of database design guidance is that the designer should create a fully normalized design. However, in Data Warehouse Design (Dimensional modeling approach) recommend non-normalized designs, i.e. designs that in large part do not adhere to 3NF.

1 comment:

  1. Great article, Please can you explain 4NF and 5NF with example.

    ReplyDelete