These are personal notes. You are welcome to read them.
Top-level home page | ||
Management hints
|
HTML, javascript
|
Other
|
|
||
More pages here ("technical" pages) | Ham radio pages | Contents of current page |
The Concepts
The Process
Some Products
Miscelaneous
|
Reference table, mandatory values in foreign key. The primary key is in the reference table, the foreign key points to the primary key. The reference table is the parent table; the child table holds the foreign key to the parent table. See below for more on the one-to-many relationships. Notations:
Note that if reference table contains a simple code and a meaning, then it could be removed and the foreign key could be assigned to a domain. |
Barker notation: IDEF1X: |
This models a super-type/sub-type.
Note: a single sub-type is possible. A subtype cluster is complete when all possible subtype entities have been described. The subtype cluster is incomplete when not all entities have been discovered in the modeling process. A subtype relationship is exclusive or non-overlapping if the super-type can be related to only one sub-type. It is inclusive or overlapping if a super-type can be related to more than one sub-type. Related notions are generalization, inheritance and specialization. |
The "short-cut" relationship is redundant because of transitivity: Remove the short-cut. | |
Recursive relationships can be also modelled with two tables. In one
of the tables, the primary key appears twice. |
|
In a hierarchy, each instance can have zero to many subordinates, but a maximum of one superior. Note that the top-most instance can also be modelled as self-referencing. Because hierarchies are difficult to query, consider replacing with separate entities if the hierarchy is only a few levels deep. |
|
In a network, each instance has zero to many superiors and zero to many subordinates. This is difficult to program and efforts should be made to continue talking to the users to discover another way to model the entities. |
|
In a chain, each instance is linked to at most one other instance. Model with a zero to many (zero to two) relationship. Another option is to make another entity that lists the instances in order. | |
Or constraint: one table on the right OR the other is joined to the table on the left. | |
An arc or exclusive or
constraint means that only one or the other of the tables
on the right is joined to the table on the left. Whenever possible, replace arcs with sub-types. Arc in the primary key: trick is to put a surrogate key and have the arc on non-key fields. |
|
In data modeling, aggregation and composition are modeled the same way. The notion is not as important as in UML. | |
Transferability: related to change over time. Suppose a relationship between a person and a radio license (see [2] P. 98 in bibliography). The relationship between a person and the amateur radio license is non-transferable. The relationship between a person and a commercial radio station license is transferable. For transferable relationships, a history should be kept. Note that a transferable one-to-one relationship is really a one-to-many when the time dimension is considered. And a transferable one-to-many relationship can likewise be seen as a many-to-many with changes over time. A non-transferable one-to-one relationship will probably refer to the same real-world concept. A structured key can be used to enforce nontransferability because nontransferability is equivalent to weakness (see below). | |
A weak entity is an entity that relies on another for its identification. In particular, the primary key will be composite and contain the foreign key to another entity. For example, the "Invoice Line" entity is a weak entity if the foreign key to the "Invoice" entity is included in the primary key. Note that nontransferability is equivalent to weakness (see below). See also ER Diagrams. Note that weak entities have rounded corners in IDEF1X notation. | |
An optional relationship means a partial participation in the relationship. A mandatory relationship means a total participation in the relationship. |
Some more rules:
Some aspects of the IDEF1X (Integration DEFinition for Information Modeling) notation are described below in the section on Erwin. Basically, the crow's feet are replaced with a black dot, because it is seen as the child end of the relationship. In one-to-many relationships, a diamond indicates that nulls are allowed on the "one" or parent side of the relationship.
(not to be confused with Idéfix:)
Hint for determining relationship cardinality: count lines on the other side. The difficulty is with the optionality of the Barker notation. The maxima are "on the other side" and the optionality is on the "same side".
Two types of relationship rules: structural or cadinality rules and referential integrity rules. The structural or cardinality rules define the existence (0 or 1 on the diagram). The referential integrity rules define 1 or many on the diagram.
In one-to-one relationships, two tables are joined. Generally, they share the same primary key. This leads to an issue in the primary key generation: care should be taken to generate the primary key for one table and copy the value to the other. In addition, a foreign key relationship may have to be built in to force integrity.
Very often, two tables with a one-to-one relationship can be merged into one table. Exceptions are situations with a super type and sub-types. Another exception is a transferable relationship which, after analysis, turns out to be one-to-many relationship over time (at a given moment, the relationship is one-to-one, but changes over time).
Example | Notation | Barker Notation | Comments |
---|---|---|---|
Loose relationship between the two tables (is it of any use?). | |||
This needs a constraint to enforce the mandatory part of the relationship.
The instead of one-to-one optional relationship, use one table with nullable columns. Exception: super-type/sub-type. |
|||
This needs a constraint to enforce the mandatory relationship on both
sides. A one-to-one mandatory relationship is very likely replaced by one table. |
Example | Notation | Barker Notation | Comments |
---|---|---|---|
This is the most common situation, in particular when one table
references another table as a reference table. Another use is when a fact
table (on the right) references a dimension table (on the left) in data
warehouses. Note that here not all the values in the reference/dimension
table are used. They are available for future use. Another notation uses an arrow: |
|||
When the relationship is optional on the side of the non-reference table / fact table, the foreign key can be null. This is not optimal and it is best to resolve non-mandatory foreign key relationships (see below). Resolving makes the diagram look like the previous row (1-to-1 with 0-to-many). | |||
In this case, the relationship is mandatory on the side of the reference/dimension table (compare with the first case above) and all rows in the reference/dimension table are linked to a row in the other table. This needs an extra constraint to be built in as the foreign key mechanism does not enforce this automatically. This also implies the use of transactions so as to insert new values at the same time on both sides. | |||
This case combines both the difficulty of enforcing the mandatory relationship on the side of the reference/dimension table (see previous row) and the need to resolve the non-mandatory foreign key relationship (see below). | |||
The reference table is missing a value. When referential integrity is enforced,
the database management system raises an error. To solve, add the missing value to the reference table and enable the foreign key constraint. |
An optional relationship from the perspective of the child (with the "0" showing on the parent side!) means that the child (fact table or non-reference table) is not existence dependent on the parent (or reference table). This is shown in the second and fourth cases above. A mandatory relationship from the child's perspective (no "0" showing on the parent side) means that the child is existence-dependent on the parent. This is shown in the first and third cases above. When the relationship is mandatory, two situations arise: in addition to being existence-dependent, the child may be or may not be identification-dependent on the parent. This is linked to the notion of an identifying relationship, which does not show in the IE and Barker notations. Identifying relationships show in the IDEF1X notation as full lines and non-identifying relationships show as dashed lines. The diagrams in this section do not show this, so refer to the ERwin section below.
It is best to resolve non-mandatory foreign key relationships by adding a default value in the reference/dimension table or by creating a sub-type:
Add a default value such as "unknown" or "non-existant" in the reference/fact table and make the null foreign key point to the row with the default value. |
|
Or make a sub-type in which all fields are mandatory so as to make the relationship mandatory. |
Example | Notation | Barker Notation | Comments |
---|---|---|---|
Resolve many-to-many relationships by inserting an associative entity to describe the relationship. Make the foreign keys into a primary key so as to force unicity of the key combinations. Needless to say, the foreign keys in the middle should not be null. |
Note that the foreign key mechanism does not enforce mandatory relationships.
A theta-join is a join using a comparison between fields of two tables. Theta-joins include all comparison operators such as > or <. An equi-join refers to the most common type of theta-join which uses only equality in the condition.
Relation | Table | File |
Tuple | Row | Record |
Attribute | Column name | Field |
Degree or arity | Number of columns | Number of fields |
Cardinality | Number of rows | Number of records |
R = {a1, a2, ..., an} is set of attribute
names for relation schema r(a1, a2, ... , an)
Schema or intension corresponds to instance or extension, which is a set of
n-tuples without duplicates.
Note: tuple rhymes with couple.
The smallest unit of data is a scalar, the values of which are part of a domain. Determining the domain of a an attribute (field) is an important step in the analysis. The data type (character or numeric) is often not enough to define the domain. One way of constraining values to a domain is to define a reference table. Is NULL or blank part of the domain? What are the maximum and the minimum values? For dates, are times included? For numbers, are they integers or decimals? Values are constrained by the DBMS by the data type definition, by the use of the foreign key mechanism, by constraints on NON NULL values and other types of constraints depending on the DBMS. These rules are integrity rules.
A relation on a collection of domains is composed of (see C.J.Date):
Superkey | Set of attributes with distinct values; can identify a tuple. |
Candidate key | Minimal superkey. A candidate key has uniqueness property and is irreducible (no subset has uniqueness property). |
Primary key | No null values |
Foreign key | References primary key of another table (same domain of values) |
Note that, because by definition relations cannot contain duplicate tuples, therefore there always exists a candidate key. The primary key is chosen amongst the candidate keys. (Generally, there is not much of a choice).
Foreign keys:
update option: either cascade (an update on the primary key cascades to matching
foreign keys) or restrict (an update is not possible if matching foreign keys
exist)
delete option: either cascade (delete rows with matching foreign keys) or restrict
(cannot delete if foreign keys exist).
Some DBMS' allow foreign keys to have nulls, others do not.
Total participation: each entity participates Partial participation: some values are null |
Strong Entity | ||
Weak Entity | ||
Relationship | ||
1 to 1 relationship | Same as for m to n relationship |
if e2 has total participation in e1 or allow null values for e1Key and rAttr |
1 to n relationship | ||
m to n relationship | Same as for 1 to 1 relationship |
Table | Alternative |
---|---|
Row | Record |
Column | Field / zone |
Table | File |
Database | Library |
Some basic issues before the normalization process:
Functional dependancies are the starting point for denormalization.
A → B if for each item in A, there exists one and
only one possible value in B.
A is generally a key of ID of some sort.
"→" reads "determines" or "is
a determinant of".
Armstrong's axioms:
X and Y are sets of attributes (or columns)
Closure F+ of F is a set of all functiona dependancies implied by F. This means that a key functionaly determines all attributes. And all attributes of the key are needed to fully determine all attributes in the relation.
Domain D = {values} (a domain D is a set of values)
Relation is included in D1 X D2 X D3 X ...
X Dn (a relation consists of tuples and each domain corresponds to
a column).
(Student, course1, course2, course3, course4) |
(Student, course1, course_type1) |
An entity is in 1st normal form if there are no repeating groups
of attributes.
Note that a single attribute containing a list also violates the 1st
normal form.
(Student, course, room, course_result) room is dependant on only part of the key, i.e. on the course |
(course, room) |
An entity is in 2nd normal form if it is in 1NF and if all the non-key attributes are fully dependant on the primary key, meaning that the non-key attributes are dependant on all of the primary key and on none of the subsets of the primary key. A partial dependancy is when a subset of the primary key determines one of the non-key attributes.
(Course, room, building) Building is dependant on a non-key field, i.e. on the room |
(course, room) ) |
An entity is in 3rd normal form if it is in 2NF and if every determinant of a nonkey attribute is a candidate key. Or, no non-key attribute determines another non-key attribute.
One special case of violations of the 3rd normal form are derived attributes that can be calculated from another attribute.
A trick to remember is that every attribute is dependant "on key, the whole key, and nothing but the key".
A and X are sets of attributes, and A is not in X
If X → A then X must be a candidate key.
That is all columns that determine other columns must be candidate keys.
In other words, every determinant of key items (BCNF) and of nonkey items (see
3NF) must be a candidate key.
Situations where a table is in 3NF but not in BCNF occur when there are overlapping candidate keys.
Note that BCNF may not preserve dependancies required by the business rules (that is each dependance can be reconstructed). The tradeoff is that normalization prevents redundant data. Program logic can be used to enforce additional business rules.
a | b | c |
a | b' | c |
a | b | c' |
A sort of cartesian product (but not complete) within a key-only table. Make
two relations. But watch because this may not preserve dependancies.
See 5NF because it includes 4NF and that it is easier to understand.
4NF and 5NF issues occur in tables whose columns are all part of the primary key. This means that there must be at least three columns in the key and no nonkey items. If there are nonkey elements in the table, then the table is in 5th normal form because the related issues do not apply.
According to Chris Date (and Simsion [2]), examples with nonkey items do not exist in practice.
As suggested by [2], handle 4NF with 5NF together because 5NF includes 4NF and it is easier to explain. It is explained by saying that "no further table splits are possible with different primary keys in the resulting tables". Basically, the normalization process involves various steps of identifying the need to split tables and assign new primary keys to the resulting tables. A correct normalization results in tables that can be re-joined so that the original tables are re-constituted. 5NF says that this process should continue until one of the two is true:
So, when in doubt, split the tables, with different primary keys in each, and try to re-constitute the original table.
Note that if the tables are split and each has the same primary key, this could still be useful if the underlying entities are in fact different but just identified (temporarily) with the same key. Simsion [2] gives the example of a bin where one table describes the physical dimensions of the bin and the other describes the contents.
All constraints are a consequence of domains or keys. This means that constaints should be enforced by keys (see normal forms above) or by limiting the allowable values (domain). This normal form is not used often. It is noted here just for information.
High level data model that captures the categories of data and their business rules. Stick to data that is to be tracked by the IT system.
The starting point is the Business Process Model. This describes what and how things happen in the business. It does not describe the data. This is because the users generally see things from the point of view of the process and not of the data.
The Conceptual Model refers to a high-level model showing entities and relationships but few or no attributes. It is useful in scope definition. See Data Modeling (conceptual model) for more details.
The Business Model describes the data elements from the point of view of the business users. Some argue that there is only one business model. Linked to this type of model are the external models, of which several can exist. The conceptual model and the business model may be the same.
The Logical Data Model fully implements all requirements. It is in third normal form. See Data Modeling (logical) for more details.
The DBA implements the Physical
Data Model. It is optimized for performance and therefore takes into
account denormalization for performance purposes, merging of tables, implementation
of sub-types/super-types, partitioning, ...
See Data Modeling (physical) for more details.
The whole diagram is an Entity Relationship Diagram (ERD).
Keep a central place for general business rules.
Also keep re-occuring definitions such as "Name = first name + last name"
This diagram shows three entities. A "has" 0 to many Bs and B is "belongs" to one and only one A. B "belongs" to 0 or 1 Cs and C "has" 1 to many Bs. |
See basics of data modeling in Data Modeling
See http://www.agiledata.org/essays/dataModeling101.html
In one presentation (sorry, I can't remember who), suggested three contradictory requirements: design elegance, processing speeed, and information requirements. I understood this to be similar to the "cheap, good, fast - pick one" dilema.
The conceptual model is based on the analysis of the business processes.
The conceptual model is a high-level overview of the data entities. It supports discussions during business and systems interviews. Accompanied by a glossary, it provides the business-related descriptions of the entities, the relationships between these entities, the main attributes, examples of values, and owners. Interviews with business users and eventual existing data models provide the information for building the conceptual model.
Verification of the conceptual model: the stakeholders should verify that:
To verify the model, [2] suggests using the assertions approach, described below.
Though the primary keys are formally defined in the logical data model, a clear idea of what constitutes a single row is necessary at this level. Define what makes an entity's occurrence unique.
Determine the scope[4]. This results in a statement such as "a data model for the accounting department's application in six months" or "a model of current business processes in the bank."
Taken from [4].
See also data modeling steps.
At an attribute level, the conceptual model defines the attributes in business terms and defines any related business rules. On the other hand, compared to the conceptual mode, the logical model defines the type of data, the constraints (mandatory, allowed values, ranges), eventual formatting, and the composition of the primary key. The logical model includes:
The main types of attributes are:
See notes on profiling in Data Quality, element analysis
See also data modeling steps.
The physical model is derived from the logical model in a fairly automatic way. The main decisions concern performance and implementation of the structures that have been designed. In particular, decisions have to be taken about indexes, storage (table space usage, free space, ...), memory, locking strategies, creation of views, ...
Some decisions also impact the data model:
Verify that the server has enough cpu, memory, disk space. Do the same for the SAN.
disk storage capacity
Disk throughput capacity
Databases and locations (in the case of remote locations):
Is some data used only in the remote connections.
CPU performance
Measure memory with
Network
Regulatory issues
Data backup and recovery
Data archiving
Server Consolidation
Data Distribution
Main issue is connectivity
Growth
This section describes the steps for going from an enterprise data model, which is generally fully denormalized, to a data model appropriate for a data warehouse (see Silverston [5]).
More notes on designing a data warehouse are in Project Management Notes.
Starting with the normalized enterprise data model, complete with conceptual and logical models, we take the following steps. This constitutes the physical model of the data warehouse.
Note that, according to Silverston [5], we do not need to model the processes in the context of a data warehouse model. The enterprise data model gives the structure. Kimball insists on knowing how the business works. I do not think that means building the Business Process Model.
Simsion [2] proposes an approach using assertions for verifying the data model with the users. Ask the users to validate the follwing assertions:
Once a version of the data model has published, accompany any new version of the data model with a list of the changes so that it is easier for people to validate the changes. Or best, produce a list of changes before implementing them so as to get feedback before messing up the existing model. This includes [2]:
Check that the data model can hold the data necessary to support each process. This can be done with a "CRUD" matrix. Done the left side are the processes, along the top are the entities. In the cells are the indication of which processes Create, Read, Update or Delete which entities.
Entity A | Entity 2 | Entity AB | Entity CD | |
---|---|---|---|---|
Process 1 |
C
|
|
|
R
|
Process B |
R
|
C
|
U
|
R
|
Process XYZ |
R
|
|
D
|
C
|
Some reminders:
Identifying relationships: the child depends on the parent for identification
(and existence as a consequence). The child is identification-dependent.
(Notice how the foreign key is put automatically into the primary key of the
child).
Strictly speaking, the notion of dependant and independant entities is not part
of the IE notation but ERwin includes it anyway.
Non-identifying relationships: the child does not depend on the parent for
identification. However, it may depend on the parent for existence and in this
case, nulls are not allowed on the parent side (see further below).
(notice how the foreign key is put automatically added to the child, as a non-key
column.
Also note that the line is dashed.)
Cardinalities for one-to-many relationships
The left-most column shows the additional indications ("P", "Z",
"3") when the "format > relationship display > cardinality"
option is chosen in the IE notation.
(The parent is on the right side, the child on the left).
For non-identifying relationships, the parent side can be null too. If the
child is existence-dependent on the parent, then no nulls are allowed in the
foreign key attribute of the child. Otherwise, nulls are allowed.
(The parent is on the right side, the child on the left)
Many to many relationships are shown as follows. Note that few options are possible in the relationship properties.
The notation for super types and sub types is as follows. The discriminator is shown next to the symbol. Exclusive/complete sub-typing is shown below. An inclusive sub-type relationship is without the X in the IE notation (the term is not used in IDEF1X). An incomplete relationship is shown with a single horizontal line with the IDEF1X notation (complete and incomplete relationships are not noted in the IE notation). The terms "inclusive" and "exclusive" are part of the IE notation. "Complete" and "incomplete" are part of the IDEF1X notation. Because the sub-types are weak entities, the boxes are rounded in IDEF1X notation.
Below are the (non-intuitive) icons for building the header and footer on the printed page (and no tooltip appears when hovering over the icon):
Referential Integrity options:
Some starting suggestions. Use in the report builder (menu > tools > Report Template Builder > Report Builder) and create a new template.
The proposed methodology is:
Tutorials: