Project Management
These are personal notes. You are welcome to read them.
Choosing Software and Services
The product
- Teamwork and working in a network
- Management of access rights
- Audit trail (history of changes)
- Ability to keep a history
- Centralized repository
- Cross views showing the same information
- Training materials
- Integration of documents in common formats (HTML, PDF, Excel, or Word)
- Good integration with existing tools
- Open repository
- Languages the product is available in
- Ability to back up the data
Software publisher and/or supplier
- Long-term viability of the company
- Number of existing customers
- Business domain knowledge
- Methodological support
- Local presence
- References from other customers in the same field
- Is the workload such that they will not be able to deliver [2]?
Cost
- Licenses (initial purchase)
- Corrective maintenance (generally an annual cost)
- Enhancement maintenance (generally an annual cost)
- Training
- Consulting for the project
Buying
Steps:
- Define needs --> business case (see below)
- Study
- The processes (talk to the employees involved)
- The existing system(s)
- The expectations for budget and time
- Determine selection criteria (company stability and track record, mature
technology, fit to needs)
- Make a long list of potential vendors
- Send them a request for information
- Gather information about:
- the company (solidity, location, # of employees, ...)
- the solution (functionalities, hardware, operating system, technology
used, data storage, third party products)
- the services (third party services, support and maintenance)
- and pricing models (purchase and annual maintenance).
- Evaluate the companies and solutions so as to establish a short list
- Send out a Request for Proposal (for a list of questions, see Free
Software Selection Course)
- Requirements
- Where and when the deliverables should be transferred [2]
- Quality / acceptance criteria [2]
Total Cost of Ownership
Costs vs. investments
A purchase can incur an immediate cost or it can be considered an investment.
In this case, the cost is spread over several years (the write-off period) with
a write-off percentage for each year (the percentage may not be the same for
each year). The write-offs correspond to depreciation.
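For example, if a 30,000 purchase is treated as an investment and written off evenly over three years, the depreciation is 10,000 per year instead of a single 30,000 cost in the year of purchase.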
Types of Costs
Purchase costs are one thing. Project costs are another: remember to add costs
for analyzing, developing, installing, testing, and rolling out hardware and
software. Additional costs come in the form of annual maintenance on software licences.
Delegation
Give the opportunity, motives and means
(See the project management diagram)
P: Define the objective
A: Select the person/team and transfer:
- Request
- Information / give training
- Tools
- Budget
- Deadline
- Give authority
C: Follow-up
A: Coach
Delegator's Dozen [5]
- Set a clear objective, no details yet
- Select the delegatee
- Train the delegatee
- Get input from the delegatee, get people involved, get commitment
- Assign the project and deadline
- Provide necessary guidance:
- Provide critical information and data
- Suggest approaches, insisting that they are just suggestions; refrain if they would be taken as directives
- Set standards or describe the results desired (details linked to objectives)
- Give responsibility, authority (including budget), and accountability (the delegatee can be fired or get praise).
Praise behavior you want repeated
- Make a delegation contract: (1) take action; (2) stay in touch, and how often; (3) get approval for the next phase; (4) do only what I tell you to do.
- Establish controls
- Maintain controls
- Provide feedback
- Identify lessons learned
- Evaluate performance
Motivation
- Assign responsibility and give ownership of results.
- Meaningful work
- Recognition
- Growth
- Allow for mistakes
- Feedback
- Achievement
Prevent job from coming back to me:
- Don't meddle
- Ask for clear statement of the problem
- Ask for options to solve
- Ask for a recommendation
Bridging the Gap between Business and IT
The gap is a cultural divide between analytical modelers and IT administrators
[4].
The way analytical modelers see IT:
- Gatekeepers to the data
- Bottlenecks
- Paralyzed by process
IT sees the business side as CPU hogs (because of big or runaway queries)
who slow query response times for other users
Solutions mostly involve changing attitudes and behaviors:
- Liaisons who understand and talk the language of both sides and who can
build partnerships with people on both sides.
- Foster dialogue between the two groups through formal meetings, with an
exchange of grievances and the building of common objectives.
- Compromise by reaching out to the other side to increase mutual understanding,
through socializing: analysts should learn more IT, and IT should give more
access to the data.
- Training: SQL for analysts, and better tuning for DBAs in a mixed environment
with casual users, power users, and analysts.
- Establish an analytic sandbox: read-only access to any data in the warehouse,
in a separate environment.
Do not get in a situation where the IT sponsors outnumber the business users.
Internet Best Practices
Web site indicators
- Number of pages read
- Number of pages not found
- Number of commands executed
- Average response time
- Number of incidents in operations
- System down-time
Even for a web site, group changes into versions and test each version fully
before deploying it.
Web site success factors:
- Ease of use
- Legibility of the site
- Speed of operations
- Minimal number of clicks
- Resistance to high load
- Security of transactions
- Confidentiality and non-repudiation
Two aspects of age-old work methods are disappearing: secrecy and restrictive
access to information.
Data Warehouse Best Practices
Notes from Inmon [6]
Four levels in the DWH architecture:
- Operational level: these are the OLTP systems, which are the source for the data warehouse
- Atomic (or data warehouse) level: subject oriented, integrated, non-volatile,
and time-variant
- Departmental level: the data marts
- Individual level: ad hoc reports and data extracts
The requirements for the data warehouse are not known in advance. The users
(analysts) discover what is available as the data is loaded into the data
warehouse. I guess this means that we can be more agile. With my experience
at FMC, this means bringing data into an area that looks like the source, with
as few transformations as possible:
- Every current row in the source has a row in this area: no filters.
"If the users enter data in the source, we want to see it in the data warehouse."
- No aggregating: get the lowest granularity into the data warehouse,
along with the source's logical key
- Deleted rows in the source have a row in this area
- As few mandatory fields as possible, so as to have as few reasons to
reject a row as possible
- The staging areas are visible to the users, at least the power users
- Integrate data from various sources; create new fields as necessary,
filling them with null for the sources that do not have an equivalent
- But, don't just dump the data: try to understand it and present
something meaningful
- If you have lookup tables, and if maintenance of these tables is an issue,
then present the original value and the result of the lookup side by side
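As an illustration of the last point, a minimal sketch (the table and column names are assumptions), showing the original code next to the looked-up label:
SELECT s.order_id,
       s.status_code,   -- original value as received from the source
       l.status_label   -- result of the lookup; NULL if the lookup table is not up to date
FROM staging_orders s
LEFT OUTER JOIN status_lookup l ON l.status_code = s.status_code;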
Monitoring of the data warehouse includes monitoring of the data and of the
users, to understand how the warehouse is growing, what data is used and by whom,
and what performance is like (query response times).
Documentation includes:
- Catalog of tables (detailed and summary)
- A profile of the contents
- A list of indexes
- How the data is growing
- The source of each field
- Optionally: the business meaning of the main fields
Granularity affects the volume of data and the type of queries.
A lower level of granularity, with more detail, allows more
types of analysis of the data. Many details are lost in summarized data.
One option is to store detailed data in less expensive storage and
summarized data in storage with faster access.
Partitioning breaks data up into smaller, more manageable chunks.
This is often done by date, although other criteria can be used too.
Partitioning can be done at the system or database level, or in the
application. Partitioning in the application makes it easier
to have a different physical layout for different years.
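A minimal sketch of database-level partitioning by date, assuming Oracle-style syntax and made-up names:
CREATE TABLE sales_fact (
  sale_date DATE   NOT NULL,
  amount    NUMBER NOT NULL
)
PARTITION BY RANGE (sale_date) (
  PARTITION p_2021 VALUES LESS THAN (TO_DATE('2022/01/01', 'YYYY/MM/DD')),
  PARTITION p_2022 VALUES LESS THAN (TO_DATE('2023/01/01', 'YYYY/MM/DD'))
);
Partitioning in the application would instead mean separate tables (one per year, for example), which is what makes a different physical layout for different years easier.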
Although auditing can be done in the data warehouse, it makes
more sense to do it elsewhere, in particular in the source systems.
No big bang approach:
[6, p.39]: Data warehouses are not built all at once. Instead, they are
designed and populated one step at a time, and, as such, are evolutionary,
not revolutionary. The costs of building a data warehouse all at once,
the resources required, and the disruption to the environment all dictate
that the data warehouse be built in an orderly, iterative, step-at-a-time
fashion. The "big bang" approach to data warehouse development is simply
an invitation to disaster and is never an appropriate alternative.
[6, p.41]: The path ... through the building of data marts first ...
is short-sighted and leads to a great deal of waste.
Some More Tips
- Add time to the key of the entity: as effective date (with end effective
date) for snapshots, or as a time period for counts or balances
(month, year, ...)
- For derived elements, be sure to load the source elements
- When in doubt about how much to load for transactional data:
load more because you might not be able to get it again
- When in doubt about how much to load for reference data:
load less because it remains in the source and more elements can always be loaded
- Primary key for dimensions: source system key or surrogate key.
Primary key for fact tables: always a combination of foreign keys;
no surrogate key.
- Three contradictory requirements: design elegance, processing
speed, and information requirements
- "If it is not in all source systems,
then it is not a value used by all the enterprise."
- Free text values are of little value. Drop-downs are better.
- No single big order table, even though the order types have many fields in common,
and no main order table with sub-tables for each area, forcing the users to join
the main table to other tables each time.
Create separate tables for each type of order as understood by the users.
- Include facility everywhere
- Constantly improve processes by updating the templates, even if we do not go back and change past projects.
Keep templates as such, or keep a list of sample projects that can serve for the next project.
- Technical specifications should contain server names, scheduler job names, and process names.
And, yes, these details change from time to time, so the technical specifications should be short
enough that they are easy to update. Technical specifications should not repeat what the code does,
nor give table columns, just table names. If the functional specifications do not have
a source to target mapping, the technical specifications should have it.
- Thoughts on completely redoing a data warehouse, or other system,
which some people refer to as the "Big Rewrite".
Instead of redoing, try these:
- Common wisdom is to make small improvements to the existing code and steer it towards improvements.
Every time you touch part of the code, clean it up, refactor it, and improve it.
Do not just patch the code.
- Make small pieces that are independent from each other
- Start by putting in version control
- Set up automated testing
- Each component should do one thing, and do it well
Overview of the Data Warehouse Project
Phases: Scope, Assessment of needs, Design, Development, Rollout,
Maintenance and further development.
- Design: technical architecture; dimensional model, then physical model; user application
- Development: installation of products; development of loading processes; user applications
- Rollout: put it all together and roll out
Notes
- The effort of the design and development of data staging is often under-estimated
- Data staging is designed for (1) the first loading and (2) regular loadings.
- An important issue is about corrections that come after data is loaded, for example
corrections for a given month that come when the following month is being loaded.
These corrections may not be explicitly visible, except that they have dates
that are before the "normal" loading range. The questions are: can the records
be changed after being imported? How far back in time can this happen?
- In relation to the previous point, the data that is received can correspond to
(1) snapshots, (2) updates (the delta) or (3) corrections. The problem is
that updates look like corrections but the difference is that the old data
is correct in the case of updates but the old data is wrong in the case of
corrections. History should be corrected in the case of corrections, but
it is also necessary to leave an audit trail of applied corrections.
- Some analysts may want to put use cases in the design document. Data
warehouse projects involve only batch processes and do not involve screens or
user interaction. The user-visible output consists of reports and possibly
email notifications of process completion status.
- Conceptually, there are only two functions: a one-time load of historical data
and a periodically recurring update thereafter. You could argue for a third function:
selectively re-load a small number of rows for bug corrections.
- The data warehouse team should manage the loading of the data
for reasons of scheduling, efficiency, code maintenance, etc.
Some Thoughts on Organizing Business Intelligence
Books have been written on this subject. Here is my humble contribution:
- Self-service BI: power users create ad-hoc reports, and casual users need
flexibility in navigating existing reports, mostly in the form of drill-down.
- Power users need access to all the data, and should ideally have a sandbox
environment. Prevent a large number of ad-hoc reports showing up alongside
standard reports by having a committee of stakeholders decide what becomes a
standard report.
- IT should continue to create and maintain the standard reports.
- Super users -- power users with technical abilities -- are capable of
suggesting changes to existing standard reports because they understand how
these reports work
- Executive users should use the fully-tested, standard reports
Design
Debate of dimensional model versus relational (normalized) model: Over the long
term, a relational model is more robust and can handle changing requirements
better. Use views to make a relational model look like a dimensional model.
However, this assumes that the database server is powerful enough to support
the use of many joins. Selective denormalization could be an option, knowing
that it is risky to undo normalization efforts. Note that, if data marts are
considered sub-sets of a data warehouse, each data mart should be in
a dimensional model (star or snowflake).
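A minimal sketch of such a view, with assumed normalized tables (customer, region, segment):
CREATE VIEW customer_dim_v AS
SELECT c.customer_id,
       c.customer_name,
       r.region_name,   -- denormalized from the region table
       s.segment_name   -- denormalized from the segment table
FROM customer c
JOIN region  r ON r.region_id  = c.region_id
JOIN segment s ON s.segment_id = c.segment_id;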
There is a lot of data out there. But people want information. Yet, we
(the designers) do not know what the users want, and nor do the users.
The process is iterative. [Loosely from Michael Scofield]
- Start with as much data as possible, down to the lowest level of granularity.
- Be prepared with appropriate reporting tools, with graph producing tools.
- Summarize to the highest level possible and show to the users.
- Iteratively enter into lower levels of detail as requested by the users.
Note that some reports serve the purpose of making people feel good that
everything is under control. This is fine. But if the report is expensive to
produce, then point it out. [Michael Scofield]
Integration: when two or more databases are involved, agree on
labels (the labels or categories have to have the same meanings) and on measures.
Labels are the descriptions of the dimensions and these labels
should be consistent across the enterprise.
IT cannot enforce consistent labels, only senior management can.
Master dimensions that contain common labels are called conformed dimensions.
Ideally, the most detailed or atomic dimensions should be stored. You can always summarize,
but you can't re-create details. And don't mix the granularities.
Produce a diagram with the fact table in the middle, and its grain clearly shown,
and the dimensions laid around.
Other important parts of the design are treated elsewhere:
- business needs
- compliance
- data profiling
- security
- Backup and recovery
- "single source of truth", see data quality
- machines with enough capability (both disk size and CPU speed)
- Company naming standards
International situations:
- Choose a single company currency and store each value in the reporting currency and the local currency.
- Use a single master language.
- Privacy and compliance best practices should be handled by the legal and financial executives.
- Implement Unicode
Thoughts about Kimball's kitchen in times of the horsemeat scandal
Kimball's kitchen analogy allows too much complication. Restaurant
diners want taste and pleasure. Data warehouse users want to know what is
on their plate. With complicated and refined dishes, it is difficult to
establish lineage for the data elements.
I like horsemeat, but I want it labelled as such. I like other meats, but
I want to know which one it is.
--> SIMPLICITY
Yes, it is hard to model the real world.
So simplify the concepts and establish clear source to target mappings.
Some Thoughts on Master Data Management
Master data management (MDM) helps maintain reference data about
important business entities across an organization: a single and consistent set of data.
The MDM can be the source of data or it can simply store the keys to equivalent records
(loosely taken from TDWI, multiple documents)
Dan Power in "Ten Mistakes to Avoid When Designing and Building Your MDM and Data
Governance Initiative" (TDWI publication) says that the "build it and they shall come"
"does not work in the real world of Master Data Management" (page 2). Instead, start with the
problem and address existing issues. Corporate sponsorship is also very important.
Concerning ownership, Dan Power says what I have seen very clearly in my
experience: "The business has to own MDM and data governance. As
tempting as it is to start and finish with the technology, it
just doesn’t work. When MDM is driven by the IT group, the
business may not understand or buy in (or even realize
it’s there)." The business has to own the overall process, and
IT simply provides a tool. The MDM is part of data
governance "that happens to be facilitated by MDM
technology."
Some Thoughts on Agile Development in Business Intelligence and Data Integration
Report writing can more easily use agile methods than data integration. Data integration
spends a lot of time mapping source to target, both fields and keys. In addition, to be agile,
test data for data integration is needed and this is not always easy to generate. (Notes loosely taken from TDWI material.)
Conformed dimensions enable more agile data warehouse development. If senior management,
who wants the projects to go faster, supports the effort needed to define conformed dimensions,
then this will help speed up development. A minimal number of agreed-upon attributes is enough to start
and more can be added as time goes on. Start with a small number of attributes from a few sources,
for the customer dimension for example.
(See Kimball tips #73, #106, #111 and #135.)
Some notable aspects of agile development
- Builds fast, reveals errors early, and prompts corrective action soon in the development cycle: needs test cases up front with automated test suites
- Gets business and IT together in the same room, or at least a lot of interaction
- Small, simple increments; get a working system up and running so as to get feedback from users
- More emphasis on adapting to change as opposed to meeting requirements
A user review meeting:
"Developers should
not solve problems, troubleshoot,
or make commitments based on
user feedback. The action taken
as a result of feedback is prioritized
based on an assessment
of the effort required to make
changes. Others may observe but
should not participate in these
meetings."[5]
In iteration 0 (see Highsmith according to [5]),
the team sets up the environment
In data warehousing, the users are not interested in the ETL, but in the resulting reports. Yet, technical features
need to be implemented too. Keep them small so that they fit in one iteration.
Highsmith in [5] says that
each cycle should leave a complete product so that "if project funding were
stopped immediately, users would
at least have a working product
that meets some requirements." "Done" means "done."
For testing, build a copy of the source systems and keep a subset of meaningful samples.
Keep a catalog of test cases with descriptions.
Difficulty with evolving data models: developers need them locked down, the process needs flexibility.
So, keep the model simple without contingencies for all sorts of possible things in the future.
Shortcuts and bad design increase maintenance efforts: try to continuously improve the design.
Use refactoring techniques and automated testing to successfully evolve database models.
Having a customer on the team is essential. He/she must be the appropriate customer.
Facts
One of the first steps is to determine the grain.
Facts with different grains can be compared if the dimensions are conformed,
but not stored on the same fact table.
There are only three types of grain in a data warehouse [Kimball]:
- Transaction grain: each row describes one transaction (at a date or time)
- Periodic snapshot grain: a snapshot is taken at regular intervals:
end-of-day, end-of-month, end-of-year...
- Accumulating snapshot grain: similar to previous, but with dates for
status such as order placed date,
order fulfilled date, payment received date, ...
Each date represents one step in the process.
Generally, the updates are destructive, meaning that the record is overwritten.
Generally, facts are additive. Inventories and bank balances
are semi-additive, that is,
the data is summed and then averaged over the periods by dividing the
sum by the number of periods (or by returning the max or min).
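A minimal sketch of averaging a semi-additive measure over periods, with assumed table and column names:
SELECT account_id,
       SUM(balance) / COUNT(DISTINCT snapshot_month) AS average_monthly_balance
FROM account_balance_snapshot
WHERE snapshot_month BETWEEN 202201 AND 202212
GROUP BY account_id;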
Events are represented by factless facts where the fact table contains only
foreign keys and no facts.
In the bus matrix, the columns are the nouns and correspond to the dimensions.
The rows are verbs describing activities or processes and correspond to the facts.
Be sure to group all elements of a process together even if they cross departments.
In the case of hierarchies, the column header shows the most detailed level; a
fact requiring a rollup can be shown with a note in the cell.
Combining two different measures in a query (drilling across in [Kimball])
is best done by sending two queries to the database and combining the results.
The dimensions need to be conformed (leitmotiv of Kimball). The results
are sort/merged.
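A minimal sketch of drilling across, with assumed fact and dimension names; each query is sent separately and the two result sets are then sort/merged on the conformed attribute (product_name):
SELECT p.product_name, SUM(s.sales_amount) AS total_sales
FROM sales_fact s
JOIN product_dim p ON p.product_key = s.product_key
GROUP BY p.product_name;

SELECT p.product_name, SUM(i.quantity_on_hand) AS total_inventory
FROM inventory_fact i
JOIN product_dim p ON p.product_key = i.product_key
GROUP BY p.product_name;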
The accumulating snapshot has multiple timestamps. Note that if a time dimension is used,
then join multiple aliases of the time dimension table to the fact table.
For each of the dates, add a numeric "fact", a 0 or 1, indicating whether
the step has been completed. These additive numeric facts make reporting easier.
We can also add a field with the lag or time difference between important steps.
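A minimal sketch of querying an accumulating snapshot, with assumed names; note the multiple aliases of the date dimension, the 0/1 completion fact, and the lag fact:
SELECT d_placed.calendar_date  AS order_placed_date,
       d_shipped.calendar_date AS order_shipped_date,
       f.shipped_flag,                -- 0 or 1: has this step been completed?
       f.placed_to_shipped_lag_days   -- lag between two important steps, stored as a fact
FROM order_fulfillment_fact f
JOIN date_dim d_placed ON d_placed.date_key = f.order_placed_date_key
LEFT OUTER JOIN date_dim d_shipped ON d_shipped.date_key = f.order_shipped_date_key;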
Counts:
A way of getting around counts in an outer join: create CASE expressions
to count when nothing is found in the join on one side, when nothing
is found on the other side in the case of a full outer join, and
when a row would have been returned by an inner join.
Kimball's Tip #12
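A minimal sketch along those lines, with assumed table names; the CASE expressions distinguish the three situations in a full outer join:
SELECT COUNT(CASE WHEN t.logical_key IS NULL THEN 1 END) AS only_in_staging,
       COUNT(CASE WHEN s.logical_key IS NULL THEN 1 END) AS only_in_target,
       COUNT(CASE WHEN s.logical_key IS NOT NULL AND t.logical_key IS NOT NULL THEN 1 END) AS in_both
FROM staging_orders s
FULL OUTER JOIN dwh_orders t ON t.logical_key = s.logical_key;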
Dimensions
See diagrams for types further below.
Type 1 consists of overwriting any existing values with the new value.
Type 2 SCD (Slowly Changing Dimension): use a field for begin_effective_date
and end_effective_date. The end_effective_date contains null or a maximum date
such as 2099 (preferred by [Kimball]). For ease of query writing, add a flag
indicating the current row (redundant with end effective date = 2099).
An advantage of the type 2 SCD is that it partitions history
(see Kimball's Tip #8).
This means that a fact record knows what the dimension looked like when the fact
record was inserted. The historical rows in the fact tables reference the non-current rows
in the dimension. Therefore, do not exclude the historical records with a
condition on the flag mentioned above. The historical rows in the dimension table are NOT updated.
Therefore, to get the current values for historical
rows, see below.
The most a type 2 slowly changing dimension would need is:
- All useful fields from the source
- Type 6 attributes (see below) that contain the current value for each attribute.
- A surrogate primary key. You want to keep track of the source's primary key (the natural key).
The unique key of the dimension could be the natural key along with the insert timestamp.
- Begin effective time stamp (a real timestamp, not a surrogate key; at least an integer of the form yyyymmddhh24miss).
Instead of just "SYSDATE", try to make it the same for all records in a batch so that
you can link back to a control table
- End effective time stamp. For the current row, a date in the future
- Effective date surrogate key (daily grain) connecting to date dimension as a snowflake [Kimball]
- Change description field (text): this is somewhat of a luxury
- Current row flag for ease of querying, instead of querying on the most
recent insert timestamp or on a null end timestamp
To report on a dimension with current values, add extra attributes to the dimension that are updated
each time a new row is added to the dimension.
Kimball calls this type 6 (Kimball Tip #15).
This is in contrast to the type 2 SCD, which reports on facts linked to the dimensions
as they were at the time of insertion (see above).
Aggregate dimension: pre-calculate facts with one dimension less or higher
up in the hierarchy
"Junk" dimension: dimension table with miscellaneous, unrelated columns
Snowflake compared to star: snowflakes conform to third normal form and stars to
second normal form.
Star schemas are more optimal for DW/BI query performance. The update issues linked to the
second normal form are handled through the data warehouse update process.
Space saved through snowflakes is generally negligible compared to the increased
query performance.
Outriggers are dimension tables linked to other dimension tables, but not in a
snowflake schema. They have their purpose, but use with moderation.
Bridge tables are used to represent many-to-many relationships or ragged hierarchies.
They increase complexity. Using a primary sales person or a top member in the
hierarchy represents a possible compromise to remove the need for a bridge table in most cases.
Degenerate dimensions are dimensions with just one attribute. It is not worth
building a table with this attribute and a surrogate key. So put the value
directly in the fact table.
Store the data in the lowest granularity possible, and build summary reports.
If only summaries are stored, you will never be able to drill down.
If each department defines dimensions in the Kimball approach, yes, there is
a risk of having silos. That is why, if I understand correctly, Kimball et al.
advocate building process-centric dimensional models.
Kimball et al. say that starting the data warehouse design from a report puts you
at risk of unrealistic expectations from the users. Instead, add dimensions one at a time.
Type 1
KEY | VALUE |
123 | ABC |
789 | XYZ |
Insert 456, update 123, delete 789
SELECT KEY, VALUE FROM TABL;
No history
Type 2
KEY | VALUE | BEGIN_EFF_DT | END_EFF_DT | CURR_IND |
123 | ABC | 1988/01/01 | 2099/12/31 | Y |
789 | XYZ | 1988/01/01 | 2099/12/31 | Y |
Insert 456, update 123, delete 789, on 23 Apr 2022
KEY | VALUE | BEGIN_EFF_DT | END_EFF_DT | CURR_IND |
123 | ABC | 1988/01/01 | 2022/04/23 | N |
123 | BCD | 2022/04/23 | 2099/12/31 | Y |
456 | GHI | 2022/04/23 | 2099/12/31 | Y |
789 | XYZ | 1988/01/01 | 2022/04/23 | N |
To get the same results as for type 1:
SELECT KEY, VALUE FROM TABL WHERE CURR_IND='Y';
To get the historical results from a previous date, e.g. 2020:
SELECT KEY, VALUE FROM TABL WHERE TO_DATE('2020/01/01', 'YYYY/MM/DD') BETWEEN BEGIN_EFF_DT AND END_EFF_DT;
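A minimal sketch of how the changes of 23 Apr 2022 could be applied to the type 2 table above (same table and column names as in the example; the delete of 789 would be handled in the same way as the update, by closing its current row):
UPDATE TABL
SET END_EFF_DT = TO_DATE('2022/04/23', 'YYYY/MM/DD'), CURR_IND = 'N'
WHERE KEY = 123 AND CURR_IND = 'Y';

INSERT INTO TABL (KEY, VALUE, BEGIN_EFF_DT, END_EFF_DT, CURR_IND)
VALUES (123, 'BCD', TO_DATE('2022/04/23', 'YYYY/MM/DD'), TO_DATE('2099/12/31', 'YYYY/MM/DD'), 'Y');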
Type 3
KEY | VALUE | OLD_VALUE |
123 | ABC | |
789 | XYZ | |
Insert 456, update 123, delete 789
KEY | VALUE | OLD_VALUE |
123 | BCD | ABC |
456 | GHI | |
789 | | XYZ |
To get the same results as for type 1:
SELECT KEY, VALUE FROM TABL;
To get the historical results, but without any distinction of date:
SELECT KEY, OLD_VALUE FROM TABL;
Data Warehouse Loading Processes
- Extraction from sources to staging area
- Transformation, including cleaning, combination of data sources (exact
or approximate matching on keys), creation of surrogate
keys (so as not to depend on "real" keys that may change), construction
of aggregates
- Loading from staging to data warehouse; indexing
- Quality control
- Define data latency (24h, several minutes for real-time)
- The Source-to-Target specifications define the full path from source to the
main data warehouse, and eventually to the data marts.
- Define intermediate hops for quality control and monitoring. Include the following for each step:
- Daily count of rows loaded, including count of rows in error
- Queries to tie back to the previous step or to the source, to ensure
that the number of rows per date (or other grouping) matches, based on the
logical key (a sample tie-back query is sketched after this list).
- Queries to verify the values in each attribute between the steps:
for each value of the logical key, ensure that values in a given step tie back to the previous step.
- Document error conditions, and what to do in each case
- Queries for one given logical key across all steps to see the progression
- Design the process in modules, with clearly documented inputs and outputs.
Do not allow shared variables.
If developers want to get creative inside the modules, it is their choice.
By having modules that are small enough, spaghetti code can be confined to
small, independent boxes that put a fence around the bad code.
For ETL, make small hops, as described above.
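A minimal sketch of the tie-back query mentioned above, with assumed staging and warehouse table names; it lists the load dates where the row counts, based on the logical key, do not match:
SELECT COALESCE(s.load_date, w.load_date) AS load_date,
       s.row_count AS staging_rows,
       w.row_count AS warehouse_rows
FROM (SELECT TRUNC(load_ts) AS load_date, COUNT(DISTINCT logical_key) AS row_count
      FROM staging_orders GROUP BY TRUNC(load_ts)) s
FULL OUTER JOIN
     (SELECT TRUNC(load_ts) AS load_date, COUNT(DISTINCT logical_key) AS row_count
      FROM dwh_orders GROUP BY TRUNC(load_ts)) w
ON w.load_date = s.load_date
WHERE s.row_count IS NULL OR w.row_count IS NULL OR s.row_count <> w.row_count;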
Date and Time Dimension
Use a calendar dimension (with holidays, etc). When hours/minutes/sec is needed,
then have a separate timestamp in the fact table: consider this a special type
of fact. Therefore two fields: foreign key to the calendar dimension and timestamp
field. [Kimball]
If there is no need to keep the time of day in the data warehouse, then
keep the date rounded to the day as a "meaningful" surrogate key
(the only exception to non-meaningful keys) and point to a date dimension
that keeps track of the days of the week and the company holidays.
Note that Ralph Kimball says that even the time dimension should not be meaningful, but
he does admit that "the time dimension is the ONLY dimension that has any logic to the
surrogate keys and is the only one we dare place application constraints."
Tip #14.
Although Kimball insists that the keys should be meaningless surrogate keys, I think that
dates should be an exception. It is tedious to have to join to a separate table to get the date
for every date field. Handle dates like degenerate dimensions and keep the meaningful dates in
the fact table, with a calendar dimension that you can join to if necessary.
If there is a need to keep track of the time of day, then keep one
field as above with the date without the time and another field either
with just the time or with the date and time. If the field contains just the
time, then consider a time dimension with indications of work hours, shifts, etc.
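A minimal sketch of the two fields described above, with assumed names: a foreign key to the calendar dimension plus the full timestamp treated as a special kind of fact:
CREATE TABLE call_fact (
  call_date_key INTEGER   NOT NULL,  -- e.g. 20220423, joins to the calendar dimension
  call_ts       TIMESTAMP NOT NULL,  -- full date and time, treated as a fact
  duration_sec  NUMBER    NOT NULL,
  CONSTRAINT call_fact_date_fk FOREIGN KEY (call_date_key) REFERENCES calendar_dim (date_key)
);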
History
The tracking of history requires that each piece of data have
defined begin and end dates of validity.
Source System Analysis
It is generally a good idea to create the role of Source Data Steward, especially
for complex systems with data important to other business applications.
Terms related to next generation data warehouses (DWH)
- Real-time data warehouses:
- Need trickle or streaming loads
- High availability is not required, but long down times don't make sense with a real-time DWH
- Data federation: get data from source when it is needed. Alternative to trickle or streaming load
- P.N.: note the difference between real-time, where the change triggers the load, and near real-time, where the batch load happens at short intervals
- Master Data Management (MDM) and Data Quality (DQ) into DWH
- Advanced Analytics: moving beyond basic online analytic processing (OLAP), such as
data mining, predictive analytics, statistics, and artificial intelligence. Ad-hoc SQLs put
a strain on resources, yet this is what most analysts do.
- Open Source Software. The main objections are linked to getting support and maintenance and to
learning the tool.
- Cloud Computing and Software-as-a-Service (SaaS). Note, a private cloud could address issues of security, volatility of public cloud providers, and the difficulty of moving data into and out of a cloud
- In-Memory Processing and 64-Bit Computing
- Service-oriented architecture (SOA) and Web services
- Data Warehouse Appliances
Detecting Changes in the Source (Changed Data Capture)
- Last update date on the source tables: note that this may miss manual
changes to the data, in particular for rows deleted from the source table,
and also inserted or modified rows if the last update date is not set appropriately.
- Use a timestamp on source data to determine the delta to load
- Load the full set of data, and compare with a previous set in the staging area: complete, but takes a long time
- File comparison: complete, but takes a long time
- Detect change in incoming data by using a checksum: if the checksum changed,
then the row changed. However, in some cases, the checksum is the same for different
values; therefore, for the rows with an unchanged checksum, compare all the columns
(a sketch is given after this list). A timestamp could also be used, but it is not ideal.
- Is the source data composed of a full data set, is it a feed
of incremental data (with updates), or is it a feed with deltas (debits/credits, deletes/inserts)?
- Have a status in the source table. Set it to "loaded" when the row is processed and loaded into the target.
Note that this works well for inserted and modified records, but not for deleted records: the records deleted from
the source would not have any placeholders for the status field.
- Are soft or hard deletes used in the source database? Adjust CDC accordingly.
- Row create dates can be used easily for replication;
row update dates are trickier (you have to look for the row in the target to update it)
- Does the design include detecting when data is changed by the DBA, outside of the application?
Work with technical staff who know how the source system updates data
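A minimal sketch of the checksum-based change detection above, assuming Oracle's ORA_HASH function, made-up tables, and a checksum stored in the target:
SELECT s.cust_id
FROM staging_cust s
JOIN dwh_cust d ON d.cust_id = s.cust_id
WHERE ORA_HASH(s.cust_name || '|' || s.cust_city || '|' || s.cust_status) <> d.row_checksum;
-- Because two different rows can produce the same checksum, the rows whose
-- checksum did not change can still be compared column by column as a safety net.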
Lessons Learnt from the Trenches
- Define the single source of truth for every data element.
It should ideally be where the data element is entered by users.
- Clearly identify the unique key for a single row.
- Using the same granularity as the source system allows better data quality.
It also improves loading speed because aggregation and key calculation are not necessary.
Design the target at the lowest level of granularity, and without aggregation
- If possible, use the source system's identifier to determine a single row
(For example: visit id from ecc, with the other system's identifier as an attribute.
Another example for a key: PAN worked well from PRO. We should have used visit_id from ecc)
- Prefer ELT over ETL. Show most source data before transformation in intermediate tables.
Always have an intermediate table with the same granularity as source,
before merges or creation of additional rows (examples: setting and modality in kcng hard to troubleshoot,
successful case: easier troubleshooting with visit in kc)
- Get access to source. Create queries on source that allow comparisons to data in target.
These queries validate the ETL/ELT work, and help fill holes in the documentation.
- Get a golden copy of all specifications in a single place. When new modifications come,
it should be clear where to add them. (bad examples: visit and adm document holding other interfaces,
multiple places for vao, missed version in TALR FSD. )
- Implement an easy way to re-trigger the loading of specific data elements:
- For source data elements, when the usual CDC is date based, maybe have an alternate way of selecting
data to load based on the key
- For intermediate tables, when the CDC is based on a status indicator, implement an easy way of setting
the status to "pending" (see the sketch after this list)
- Keep all MDM merge and unmerge decisions in one place, not scattered in emails
- In the "do not let them see you sweat" category: do not share how difficult it would be to implement
a user request during the requirements gathering. Hold until later, and present options, or delay
the implementation. A thought: if the user requirements box you into a corner for implementation,
then maybe the requirements should be taken up a level, in the sense that the requirements are
about what is to be implemented, not about how.
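A minimal sketch of the re-trigger mentioned above, with assumed table, column, and status names:
UPDATE staging_orders
SET load_status = 'pending'
WHERE logical_key = 12345      -- the specific element to reload
  AND load_status = 'loaded';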
Data Lakes
- Large amounts of data
- Keep in original format
- Load fast, transform later, less data cleansing
- No requirements, just dump data
- As opposed to data warehouses, data lakes do not seem to be subject oriented and integrated. They may be time-variant (have historical data).
- Data lakes are nonvolatile like data warehouses, meaning modifications are not done in the data but in the sources
The methodology below covers Source System Analysis, Target System Analysis, and mapping.
See also notes on Structured Analysis and
Data Warehouse Best Practices.
Methodology
- Top-level meeting to define the subject area and the related source table(s).
Determine with client what tables or data feeds are required to cover the subject area.
Define the conditions for the specific data that
we want, and any links to other entities of interest.
Outcome is a commitment on the
client side to provide specific data that covers the subject area:
arrange to get a data extract or a full dump. Try to get information on which fields are
important and which fields are not used, empty or do not matter
- Is the loading effort a one-time conversion or is it an interface that is
accessed at regular intervals?
- Define the access to the data:
database tables to be exported, files to be exported via application, a feed received
from an external company?
- Do basic profiling of the source data to obtain
measurements of important qualifiers such as the percentage
of null values in data fields, the degree of duplicate records, the percentage
of bad data formats (especially dates) and the distribution of values.
See element analysis above
and data profiling.
- Then comes attribute-level analysis: dive in, do data type analysis, filter out un-used fields.
Try to find out what each field means.
- Determine how to detect change (Changed Data Capture): See above
- Determine what initial load is necessary. This can consist of a full load of current customer or product
data, or an historical load of recent transactions going back a pre-determined number of months. The
initial load could also consist of the migration from an old data structure.
- Follow-up meeting with source system expert to ask additional questions at attribute level.
- In the absence of, or as a complement to, the data profiling, ask for:
- data types
- descriptions and definitions
- domains of values, ranges of values
- how null, unknown or non-applicable values are indicated,
- rules that have to be applied to the data
- what constitutes an invalid value, constraints on the values
- definition/meaning of the codes
- unique/primary keys, or what defines a single row
- are historical rows kept in the source?
- do some fields have multiple uses, possibly depending on other pieces of data
- Top-level meeting with the target system expert for target-level analysis. Define the target tables or APIs and ask for documentation.
Try to map the source entities to the target entities.
Only talk about attributes if they are keys or if they hold important data.
- Do preliminary mapping
- Meet target system expert and start the mapping
- Up to here, about two weeks, maybe three
- Validate target data model against expected data warehouse design.
Provide the following:
- Mockups of the expected reports
- Matrix of measures and the dimensions they will be linked to
- Typical business questions
- All required target fields need data; all source fields with data should have a destination
unless it is explicitly known that the data is not needed. Add a column in the spreadsheet to
say if the field is used in the target and label each source field with one of the
following: no significant data, same data as another field, internal primary key or internal foreign key, or the target field.
- Make sure that there is information for the source table name, source field name, transformation,
lookups (table, condition, returned fields)
- Point out the cases where strings feed into date or number fields. Validation rules have
to be set up. Look for cases where data will be truncated.
- Define error conditions for invalid values and/or null values, field by field.
Pay special attention to null values for fields that are part of the key.
- Define what constitutes a new row in the target in the case of new, changed, and deleted
rows in the source. Ideally, the logical key in the target is the same as in the source. If the
keys are different, then extra care is needed to map out the scenarios. Also validate which rows
are kept as history
- Validate the design with IT architects and DBAs. Then talk to:
- source system developers and DBAs who know the source system model well.
However, they may not be familiar with dimensional modeling.
- power users, who are generally technical
- core business users, who may be less technical
- and representatives of the broader user community,
focusing on how the data will be used (examples of questions and the corresponding queries as answers)
(tip #108 from www.kimballgroup.com).
- Push the small "nice-to-have" changes to the next version.
- After development, show the data in the target. Be prepared to answer questions about where each piece
of data comes from.
Documentation should contain:
- Field by field source to target mapping, with transformation rules
- Lookup / reference tables
- Error severity (information, warning, severe)
- Error handling level: pass on (with warning), suspend the row and reprocess until conditions are met for success, pass the row but reprocess
- The primary, logical, and natural keys in the target. Diagram a sample
combination of keys. The sample contains all key fields;
no need to put in the non-key fields unless there is some pertinence.
- Define when to insert, update, or delete a row in the target
- Diagram(s)
- Use cases / examples
- Graphical data model
Zachman Framework for Enterprise Architecture
Columns: What (Data), How (Function), Where (Network), Who (People), When (Time), Why (Motivation).
Rows, by layer and stakeholder group:
1. Scope (contextual boundary); stakeholder: Visionary / Planner to the business
- What: list of things important to the business (entity = a business thing)
- How: list of processes the business performs (process = a business process)
- Where: list of locations in which the business operates (node = major business location)
- Who: list of organizations important to the business (people = major organizational unit)
- When: list of events significant to the business (time = major business event or cycle)
- Why: list of business goals/strategies (ends/means = major business goal or strategy)
2. Conceptual Business Model; stakeholder: Owner
- What: semantic or entity-relationship model (business entities and business relationships)
- How: business process model (business processes and I/O as business resources)
- Where: business logistics system (business locations as nodes and business linkages as links)
- Who: work flow model (organizational units as people and work products)
- When: master schedule (business events as time, business cycles)
- Why: business plan (end = business objective; means = business strategy)
3. Logical System Model; stakeholder: Designer
- What: logical data model (data entities and data relationships)
- How: application architecture (process = application function; I/O = user views)
- Where: distributed system architecture (nodes are processors, storage units, etc.; links are line characteristics)
- Who: human interface architecture (people = roles; work = deliverables)
- When: processing structure (time = system event; cycle = processing cycle)
- Why: business rule model (end = structural assertion; means = action assertion)
4. Physical Technology Model; stakeholder: Builder
- What: physical data model (segments/tables and pointers, DB relationships)
- How: system design (computer functions and data elements)
- Where: technology architecture (nodes are hardware or software systems; links are line specifications)
- Who: presentation architecture (people are users; work is the screen formats)
- When: control structure (time = execute; cycle = component cycle)
- Why: rule design (end = condition; means = action)
5. Component Configuration; stakeholder: Implementer
- What: data definition; How: program; Where: network architecture;
Who: security architecture; When: timing definition; Why: rule specification
6. Functioning Enterprise Instances; stakeholder: Worker
- What: data; How: function; Where: network; Who: organization; When: schedule; Why: strategy
Risks
- Space
- Logistics such as food and lodging
- Transportation
- Permissions to use equipment, property, ...
- International borders
- Licenses, permits and clearances
- Insurances
- Weather
- Unavailable resources
IT Risks
- Cannot handle the production load
- Incompatibility between components
- Advertised performance does not correspond to real performance
- Inadequate skills for supporting the application
Project Risks
- Not enough financing, or financing that is not solid enough
- Poorly thought-out business model
- Excessive expenses
- Absent clients
- Deadlines are not respected
- Slipping schedule
Attitude towards risk
In relation to technological innovation, there are the following types of profiles:
- Visionaries, who are enthused by new technology and envision new possibilities.
- Those with a more pragmatic approach, adopting new technology if proven
- Those with a conservative approach, accepting a new system only if widely
used, generally choosing expensive but guaranteed solutions
- Skeptics
Risks
Risk = impact (e.g. delay or costs) * probability
This should lead to action to avert the risk or reduce the impact.
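For example, a risk with a potential cost of 50,000 and an estimated probability of 10% represents an exposure of 5,000, which can be compared with the cost of the action that would avert the risk or reduce its impact.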
Stakeholders
- Have I identified all stakeholders?
- Have I identified a leader or spokesperson for each group of stakeholders?
- Have I ever met the stakeholders?
- Can I describe the fears of the stakeholders in relation to the project?
- Do I know what the stakeholders want?
- Are my stakes included?
Requirements
- Unclear goals and scope.
- No room for negotiation
- Unreliable estimations
- People without a clear role in the project
- The project smells like a trap
Risk Response Strategies:
- Avoid: change the plan to eliminate the possibility that the risk will occur.
- Transfer: transfer consequences of the risk to a third party. An example is insurance.
This means that the impact or consequences will be felt by the third party.
- Mitigate: reduce the possibility that the risk will occur or reduce the
impact/consequence of the risk.
- Accept: accept that the risk will occur and develop a plan to deal with the consequences
should they occur.
Miscellaneous
The amateur knows what to do;
the professional knows what not to do.
Sources:
- [1] Henri-Pierre Maders: Manager une équipe projet. Leadership, grilles
de lecture, cas concrets. Éditions d'Organisation, 2003
- [2] Sunny and Kim Baker: The Complete Idiot's Guide to Project Management.
Alpha Books, 1998
- [3] Free Software Selection Course (looks like a personal web site, but good reading;
it inspired many of the comments above)
- [4] Wayne W. Eckerson: Five Steps to Bridge the Business/IT Gap.
The Data Warehouse Institute
- [5] Dr. Ken Collier (Senior Consultant, Cutter Consortium), with Jim Highsmith
(Director, Cutter Consortium Agile Product & Project Management Practice):
Agile Data Warehousing: Incorporating Agile Principles
- [6] William H. Inmon: Building the Data Warehouse, Fourth Edition. Wiley
Some language tips:
- Best practice = "I recommend it, but you don't have to do it"
- "If you want to ..., click on ..."