Database Normalization - Normalized Forms , ER diagram and Relational algebra

Relational Algebra

Select Operation
select tuples that satisfy given predicate
the predicate is the where clause
lowest number of tuples selection operation can return is zero and at most n when relation r has n tuples

Project operation
Project operation selects the columns which should be retrieved
that is similiar to select A,B,C from .... where to select only column A B C we need to use project
eliminates duplicates
so a projection on relation r may return at most n tuples, if relation r has n tuples, if all distinct tuples and will return 1 if all same tuples
it restricts the columns unlike selection which restricts rows

Union operation
tuples that appear in either of two relations or both of them
union should be taken on compatible relations
conditions or union opearator to be valid
a) relation must have same number of attributes
b)domain of the attributes must be same

Set difference
tuples in one relation but not in another
same condition like union holds for set difference operation also that is arity and domain condition

Cross product
argument should have distinct names, distinguish same attribute name with table alias
tuple for each pair of tuple in r and s
n1*n2 number of tuples in crossproduct
to get the correct result from cross product we need to identiy tuples in r with tuples in s
that is some condition of the form (r.A = r.B and .. )

Rename operation
allows us to give name to the relation expression
example
a way to find the maximum balance like max function in sql is like
select all those balances which are not greatest that is they are less than some other value
and subtract them from the available balance values
we can use positional parameters $1 $2 to address the attributes instead o giving them names

Additional operations this can be expressed in terms of fundamental operations
Set intersection
can be rewritten with set difference like
r intersection s = r- (r-s)

Natural join operation
allows to combine certain selection and cartesian product in single operation
in terms of fundamental operation
1)first get the same column names R intersection S which gives same attributes that is {A1,A2..}|
2)perform selection with matching these columns like {r.A1 = s.A1 and r.A2 = s.A2..} in predicate
3) perform projection for R union S attributes
does not returns duplicate tuples
natural join is associative
when no attributes in common that is R intersection S = empty then it returns cartesian product
natural join would return only one attribute for common attribute from r and s

Theta join

Division operation
for queries with statement like 'for all'
S attribute set is subset of R every attribute of S is also in R
r / s is relation on the schema R-S( all attributes in R not in S) that is it contains only attributes R-S
conditions
1)for every tupple ts in s there is a tuple tr in r satisfying

find tuples that donot appears in r that is from r x s - r that is all those tuples for which value in r exists for all values in s will be eliminated in this step then we can subtract is from r to get the required tuples
example
r be relation below and R={x,c}
x c
-------
x1 c1
x2 c2
x3 c1
x3 c2

and S be relation below and S= {c}
c
--
c1
c2
we want to find those x for which there all c are mapped like x3 which has both c1 and c2 relation tuple
1)find those tuples which are not mapped
R-S = {x}
1)we find cross product projection(r)_R-S * s_c which will give all possible relations that is for each x there will be mapping for all c's
3) now we subtract r relation from above cross product which will eliminate all the tuples from r which are in s
we will be left with tuples which donot have all c mapped to x

Outer join

Tuple relational calculas

non procedural
{t| p(t)} set of all tuple t such that predicate P is true for t
{t | e r(Q(t))} r(Q (t)) is the relation of which t is tuple r is created from expression on relation
t[A] value of tuple on attribute A
the attribute on which we specify the condition is only contained in the final expression result

example
find loan number of each loan with amt > 1200
predicate t[loan_number] = s[loan_number] and s[loan_number] > 1200
so the result will have only single attribute loan_number

if there are two relations involved in the statement then we use there exist clause for each and connect them
implications if p then q

Translating to Tuple calculas form
1)if asked to find only A attribute then in the complete predicate we will use t on only A
2)if n relations are involved we need n tuple relation like u element of r, x element o s ..
3)if we are using implication then we also need to specify what happen when this is not true like if we are asked to find all the customers who have account at all branches in brooklyn
if they have branch we can specify will predicall for all and implication but we also need to connect this relation with expression that tell include all customer names if no branch at brooklyn
4)for every wheare clause involving two relations we write it like
w is tuple such that on relation r {r[A] = s[B]}
5) If we are asked to find something like
find all projects which supplier S supplies entirely
find all the supplier which do not sell green color parts
first we find the complement that is in first case we find the projects which other suppliers also supply
and then subtract it from the project supplier S supplies
in second case we find suppliers which sells green color parts and subtract them from supplier list

we write query like divison of RA in tuple calculas as
we use 'for all' clause on the relation s whose all attribute values should be mapped and we check for each value in s if there exist a tuple in r mapped to it

SQL

All aggregate function except count(*) ignores null values in input
all aggregate functions returns value of null when applied to empty collection
Nested queries
Set membership

some comparison returns true if value of tuple is greater that atleast one
=some is identical to IN
<>some is not same as NOT IN

Integrity constraint
Referential integrity
Child table or referencing table stores the foreign key that references the parent table or referenced table primary key
no insert can be done in the child table if corresponding foreign key does not exist in parent table
Rules
Restrict disallow the deletion or update of referenced data that is deletion from parent table if not allowed
Set to Null set to null in referencing table associated data on update or delete
Cascade on update in referenced table update the associated data in referencing table and on delete in referenced table delete all associated dependent rows
No action differs from Restrict in sense that checked at end of statement

Normalization

why use normalization
repetition of inormation, data will be repeated for each instance used in the relation,wastage of space
updation would be problem we need to check all rows to update the value, costly updates
inability to represent certain inormation, if certain attributes are not available we cannot insert data other attributes and
while deleting we may delete all the data not leaving any master record trace

Functional dependency

type of constraint on set of relations
a ---> b
set of attributes a, b belong to relation r
holds on schema R, if any legal relation r(R) for all pairs of tuple t1 and t2 in r if t1[a] = t2[a] then t1[b] = t2[b] that is for a value of 'a' there corresponds a unique value of 'b' one to many not allowed

superkey
if K-->R that is for all pair of tuples if t1[K] = t2[K] then t1=t2

Normal forms

First Normal form
there should not be set of values assigned to attribute that is
no field of relation should be like {address1.1,address1.2} that would require to parse the value with extra programming effort

Trivial dependency
dependency that are satisfied by all realtions
A -> A for attributes A
or a ---> b where b is subset of a

Closure of functional dependecy
F+ set of all functional dependecy logically implied by F

Axioms
1) Reflexivity
trivial dependecy A--->B if B is subset of A
2)Augmentation rule
If A--->B and C is set of attributes then AC-->BC
or
If If A--->B and C is set of attributes then AC-->B
or
If A--->B and C->D then AC-->BD
3)transitivity
If A--->B and B--->C then A ->C

Rules
Union rule:
If A->B and A->C then A->BC
2)Decomposition rule reverse of union
If A->BC holds then A->B and A-> must also hold
3)pseudo transitive
If A->B and BC->D then AC->D

Closure of attribute set

Finding the set of attributes determined by A

algorithm
set result =A
while result changes do
for each functional dependency B->C
if B is subset of result include C in the result //

Problem
A, E ---> D -- 1
A --> C -- 2
E --> A -- 3
C ---> E -- 4

find (BA)+

Solution

result =BA
1)find lhs attribute set, superset of BA
2 so we include C, result= BAC
2) repeat A with new result
i)result=BACE   from 4
ii)result = BACED from 1

Problems Find all CK for
A, D ---> C -- 1
D ---> E      -- 2
E, B ---> A -- 3
A ---> B      -- 4
Solution:
1) find the attribute set that determines all other
A,D -> CEB
2)can we remove any of the A or D from the above dependency

since D doesnot appear on right side of any functional dependecy that cannot be determined by any other attribute so D cannot be removed
A is determined by EB and B is determined by A so we cannot remove A also
3)checking for other CK
find which other attributes derive the first CK i.e AD
applying psuedotransitivity on 2 and 3 we get DB->A

A ---> C, D -- 1
C, D ---> E -- 2
E, A ---> B -- 3

Solution

1) since A doesnot appear on right of any FD so A should be in CK
2) A can derive any other attribute set if we check
A ->E
A-> EA
A->B
so A is the CK

A, E ---> D -- 1
A --> C -- 2
E --> A -- 3
C ---> E -- 4

Solution

BA ->ABCED
AB is the CK
we see A is determined by E so BE is also CK
and E can be determined from C so EC is also CK

d)
B --> A,E   -- 1
E, B --> C   -- 2
A, C ---> D -- 3
A ---> B      -- 4

Solution

B->ABECD
so is A

Cannonical Cover
extraneous attribute : if attribute can be removed from functional dependency without changing the closure
1) attribute a of A set is extraneous if for FD A -->B
F logically implies F-{A->B} U (A-a) -->B

example
F: AB->C and A->C
Lets see if A is extraneous or not in AB->C
F' = B->C U A->C
does F logically implies above F'
no F does not implies B->C

Check if B is extraneous in AB->C
A->C U A->C =A->C
F has the same

try reverse does F-{A->B} U (A-a) -->B logically implies F for
AB->C and A->E
let B be extraneous in AB->C then
F'=A->C , A->E
does F' logically implies F
well it does and will always imply F irrespective of FD's

Method
A' = A-a
check if A' -> B can be inferred from F that is compute closure A'+ under F

2) attribute b of B set is extraneous in FD A-->B
if F-{A->B} U (A) -->(B-b) logically implies F
example
F: AB->CD and A->C
let C be extraneous in AB->CD then
F'= AB->D , A->C
does F' implies F, yes
reveres would be always true in this case also that F would always imply F'
Method
check if A ->b is implied by F' that is computer A+ on F' and check if right hand side has b

Cannonical cover
set of depndencies Fc such that F logically implies Fc
1)no functional dependency in Fc contains an extraneous attribute
2)each lef t side of functional dependency in Fc is unique

cannonical cover may not be unique

Decomposition

A set of relation schemas { R1, R2,…, Rn } is a
decomposition of R if
R = R1 U R2 U …..U Rn
each Ri is a subset of R ( for i = 1,2…,n)

Given instances of the decomposed relations,
we may not be able to reconstruct the
corresponding instance of the original relation
– information loss
Lossy decomposition or lossy join bad design gets more tuples but less information
After Natural Join, we get extra tuples. Thus, there is loss of information.
A decomposition {R1, R2,…, Rn} of a relation R is called a lossless decomposition for R if the natural join of
R1, R2,…, Rn produces exactly the relation R.

     R(A, B, C)
          |
     Decompose
R1(A, B) R2(A, C)
   |
    Recover
   R’(A, B, C)

Thus,
R’ = R

R : relation
F : set of functional dependencies on R
X,Y : decomposition of R

Decomposition is lossles if :
if X ∩ Y forms a superkey of either X or Y, the decomposition of R is a lossless decomposition

Superkey
no distinct tuple in the realtion have same values for the attributes in this set

Normal forms
1NF
all attributes directly or indirect depend on candidate key

2NF
1)should be in 1NF
2)Every non key attribute is fully dependent on candidate key

when not in 2NF its vulnerable to update anomaly

example AB is key for R(ABCD)
if we have values like
A B    C     D
A1 B1 C1   D1
A2 B1 C1 D2
A3 B1 C1   D3

here B ->C that is C is dependent only on B which is partial key
now if we have update C for B1 we need to make sure we update all B1 records

So we divide the table in to two BC and ABD for table to be in 2NF
3)case when mutual dependence between non key attribute
2NF eliminates certain update anomaly but not all
the case when non key attribute is dependent on another non key attribute is not handled here
example
R(supplier_no,status,city)
and functional dependency are
--------------------------
supplier_no --> status
supplier_no -->city
city ---> status
---------------------------
This relation is in 2NF but not in 3NF
Reason here both non key attributes status and city are dependent on whole of candidate key but there is dependency city -->status which can lead to update anomaly

4) Anomalies in 2NF
Insertion anomaly : we cannot maintain status of city in above case until we have supplier in that city
Deletion Anomaly:same way if we delete all supplier for a city we loose status information
Updation Anomaly:status occurs many times like if we update status for a particular city we have to update all rows with that city otherwise inconsistency

5)can exist only if table has composite key, always in 2NF if no composite key

6) Case when not proper subset of candidate key and not super key either
Consider the Relation R(pubId, title,pagecount,price)
if FD's are
PubId, title --> pagecount,price
PubId, pagecount --> price
This is still in 2NF because price depends not on the proper subset of candidate key{pubId,title} but on part of candidate key with some other non key attribute
so its not proper subet of key
and not superkey either
This is not in 3NF because
1)non key attribute price should be dependent on superkey or we can say
2) pubId,title -> pagecount,pubId -->price, which is transitive dependency that is non prime attribute price here is transitively dependent on candidate key {pubId,title}

reference

7)Check for all the candidate key for partial dependency

3NF
1)If relation has only one candidate key then no non key column should determine another non key column
2) 2NF say non key attributes depend on whole key, 3NF restraints it further and says non key attribute depends on nothing but the whole key
that is in 3NF cases like X->A,B where X is composite candidate key and A and B are non prime attribute then A->B is not allowed though B depends on whole key X but it also depends on A
2)For each functional dependency X-->A either of the condition should hold true
i)X->A non trivial dependency that is A is subset of X or
ii)X is superkey or
iii)A-X that is nontrivial attributes on right side is prime attribute that is part of some candidate key
3)Anomalies in 3NF
can suffer from all three in case when its not in BCNF
4)when all attributes are prime then its in 3NF

BCNF
1)more restrict than 3NF
2)for every functional dependency X->A either
i)X->A is trivial dependency or
ii)X is superkey
3)3NF with no overlapping candidate key is guaranteed to be in BCNF
4)Dcomposition in BCNF may not preserve certain dependency when attribute functionally dependent moves in to another table

Decomposition of BCNF

for A -> B where A is not key we divide the relation R in to AB and {R-B}
Any relation in BCNF is in 3NF

Problem

1)R(A, B, C)
FD's
--------------
AB -> C,
C -> B
-------------

check for BCNF
i)find closure of the left side attributes
for C->B
closure of C is B that is not = ABC so not key hence not in BCNF

Check for 3NF
i)find the minimal keys
AB and AC
since all right side attributes are prime its in 3NF or
ii)check 3NF conidtion for each dependecy
AB - >C AB is key so satisfies 3NF
C->B C is not key but B is part of key so it also satisfy 3NF

Problem R(A,B,C,D,E) and FD’s
A->B,
B->AE,
AC->D
why not in BCNF? Decompose in BCNF

solution

i) A->B check closure of A which is ABE so not in 3NF
ii) B->AE closure of B is BAE this also violates
iii) AC-> D closure of AC is ACDBE which satisfies
Now we decompose relation such that FD i and ii are satisfied
first rewrite FD like we check this on new FD whenever some transitive dependecy comes from present FD
A-> B , B->A , B->E,A->E, AC->D
for A->B decompose it in R1(AB) and R2(ACDE) and check if FDs hold
on R2 A ->E violates as A is not superkey so we decompose R2 as R3(AE) R4(ACD)
now every FD holds

Problem R(A,B,C,D,E) with functional dependencies A→ E, BC→ A, DE→ B.

Solution

find keys
ACE ACD ACB
since all right side are prime key attributes or part of superkey so it satisfies the 3NF
but left side are not the key so not in BCNF
BCNF decomposition
R1(AE) and R2(ABCD) holds A->E violates BC - A
R1(AE) R2(BCA) R3(CBD) only dependency in R3 is trivial BCD -> BCD
lost the dependency DE->B

Problem R1(A,B), R2(C,D,E,F) A→ B, C → D, D → EF

Solution

key for R1 is A and key for R2 is C
it holds 2NF condition because A-> B, B depends on complete key, for C->D, D depends on complete key and D->EF, EF doesnot depends on partial key either

Entity Relationship
coneptual representation : ER
logical: relational model with attributes datatype lenght constraints
physical: with schema def views triggers

Realtionship set
set of same type of relations
each entity plays a role in the relation
same entity set can participate in relation set more than one time with different roles
descriptive attributes on relation set
relationship instance in the relation set must be uniquely identifiable from its participating entity without descriptive attributes:
Mapping cardinality: number of entities to which another entity can be associated using relationship set
Participation :if every entity in E participates in atleast one relationship in R total participation

Key for relationship set
Ei entities participating in the realtionship
primarykey(Ei) and {a1,a2,..an} attributes associated with realtionship set

describes individual relationhip and union of keys forms the superkey

set of attributes primarykey(E1) U primarykey(E2) U primarykey(E3) ..U {a1,a2..an}

when realtionhship set is many to one
the primary key of the relationship set is identified from the key of the entity on the many side

Translating from ER model o relational model

Many to many relationship
employee[ssn]----------works_in(since)-----deptartment[did]
many employess can work in a department and
and a employee can work in many departments
in translating relation work_in it need to include primary key of both participating entities(which becomes it's PK) and its own attributes

One to Many relation
employee[ssn] <-----------manages(since) ----------department[did]
each employee can manage number of departments but (constraint)
each department is managed by at most one employee

since each deparment has a unique manager we can combine the department entity and manages relation without introducing any redundancy

manages seperate would have been like

ssn ,dId ,since , primary key (did), foreign key(ssn) reference Employees, foreign key (did) REFERENCES Departments)

with combined dept_manages for total participation from "many side"
attributes of department + attribute of manages above
ssn ,did ,since , primary key (did), foreign key(ssn) reference Employees

Total Participation
If every department must have a manager that is must appear in manages relation with non null ssn then
we represent relation manages and department in single table only but on deletion of employee or ssn we don't delete the dept_manages tuple
and ssn constrain of not null

Weak entity set
A weak entity can be identified uniquely only by considering the primary key of another (owner) entity.
– Owner entity set and weak entity set must participate in a one-to-many relationship set (1 owner, many weak entities).
– Weak entity set must have total participation in this identifying relationship set.

Weak entity set and identifying relationship set are translated into a single table.

employee ([ssn],name)-----------policy(cost) <-----------------------dependents ([pname],age)

Dep_policy

pname,age,cost,ssn
with primary key [ssn ,pname] and foreign key [ssn] referencing employee and on delete cascade

when owner entity is deleted all the owned weak entities must also be deleted

References
www.cs.sjsu.edu/faculty/lee/.../26Presentation_Jung_T_Chang.ppt
pages.cs.wisc.edu/~dbbook/openAccess/firstEdition/.../mod5l1-2.pdf
https://agora.cs.illinois.edu/display/cs411sp10/Assignments
www.cs.arizona.edu/classes/cs460/fall09/hmwk2.pdf

1 comment:

Cason LarussoAugust 16, 2011 at 10:52 AM
Ah, database normalization. There are quite a lot of small tweaks in normalizing databases, but this is still one of the trickiest aspects of database theory. Failing to normalize a table can result in errors and redundancy, so normalization should be mastered.

Computer Science Cafe

Pages

Monday, December 27, 2010

Database Normalization - Normalized Forms , ER diagram and Relational algebra

1 comment: