Project Management
These are personal notes. You are welcome to read them.
Choosing Software and Services
The product
- Teamwork and working in a network
- Management of access rights
- Audit trail (history of changes)
- Ability to keep a history
- Centralized repository
- Cross views showing the same information
- Training materials
- Integration of documents in common formats (HTML, PDF, Excel, or Word)
- Good integration with existing tools
- Open repository
- Languages the product is available in
- Ability to back up the data
Software publisher and/or supplier
- Long-term viability of the company
- Number of existing customers
- Business domain knowledge
- Methodological support
- Local presence
- References from other customers in the same field
- Is the workload such that they will not be able to deliver [2]?
Cost
- Licenses (initial purchase)
- Corrective maintenance (generally an annual cost)
- Enhancement maintenance (generally an annual cost)
- Training
- Consulting for the project
Buying
Steps:
- Define needs --> business case (see below)
- Study
- The processes (talk to the employees involved)
- The existing system(s)
- The expectations for budget and time
- Determine selection criteria (company stability and track record, mature
technology, fit to needs)
- Make a long list of potential vendors
- Send them a request for information
- Gather information about:
- the company (solidity, location, # of employees, ...)
- the solution (functionalities, hardware, operating system, technology
used, data storage, third party products)
- the services (third party services, support and maintenance)
- and pricing models (purchase and annual maintenance).
- Evaluate the companies and solutions so as to establish a short list
- Send out a Request for Proposal (for a list of questions, see Free
Software Selection Course)
- Requirements
- Where and when the deliverables should be transferred [2]
- Quality / acceptance criteria [2]
Total Cost of Ownership
Costs vs. investments
A purchase can incur an immediate cost or it can be considered an investment.
In this case, the cost is spread over several years (the write-off period) with
a write-off percentage for each year (the percentage may not be the same for
each year). The write-offs correspond to depreciation.
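For example, if a 30,000 purchase is treated as an investment and written off evenly over three years, the depreciation is 10,000 per year instead of a single 30,000 cost in the year of purchase.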
Types of Costs
Purchase costs are one thing. Project costs are another: remember to add costs
for analyzing, developing, installing, testing, and rolling out hardware and
software. Additional costs come in the form of annual maintenance on software licences.
Delegation
Give the opportunity, motives and means
(See the project management diagram)
P: Define the objective
A: Select the person/team and transfer:
- Request
- Information / give training
- Tools
- Budget
- Deadline
- Give authority
C: Follow-up
A: Coach
Delegator's Dozen [5]
- Set a clear objective, no details yet
- Select the delegatee
- Train the delegatee
- Get input from the delegatee, get people involved, get commitment
- Assign the project and deadline
- Provide necessary guidance:
- Provide critical information and data
- Suggest approaches, insisting that they are just suggestions; refrain if they would be taken as directives
- Set standards or describe the results desired (details linked to objectives)
- Give responsibility, authority (including budget), and accountability (the delegatee can be fired or get praise).
Praise behavior you want repeated
- Make a delegation contract: (1) take action; (2) stay in touch, and how often; (3) get approval for the next phase; (4) do only what I tell you to do.
- Establish controls
- Maintain controls
- Provide feedback
- Identify lessons learned
- Evaluate performance
Motivation
- Assign responsibility and give ownership of results.
- Meaningful work
- Recognition
- Growth
- Allow for mistakes
- Feedback
- Achievement
Prevent job from coming back to me:
- Don't meddle
- Ask for clear statement of the problem
- Ask for options to solve
- Ask for a recommendation
Bridging the Gap between Business and IT
The gap is a cultural divide between analytical modelers and IT administrators
[4].
The way analytical modelers see IT:
- Gatekeepers to the data
- Bottlenecks
- Paralyzed by process
IT sees the business side as CPU hogs (because of big or runaway queries)
who slow query response times for other users
Solutions mostly involve changing attitudes and behaviors:
- Liaisons who understand and talk the language of both sides and who can
build partnerships with people on both sides.
- Foster dialogue between the two groups through formal meetings, with an
exchange of grievances and the building of common objectives.
- Compromise by reaching out to the other side to increase mutual understanding,
through socializing: analysts should learn more IT, and IT should give more
access to the data.
- Training: SQL for analysts, and better tuning for DBAs in a mixed environment
with casual users, power users, and analysts.
- Establish an analytic sandbox: read-only access to any data in the warehouse,
in a separate environment.
Do not get in a situation where the IT sponsors outnumber the business users.
Internet Best Practices
Web site indicators
- Number of pages read
- Number of pages not found
- Number of commands executed
- Average response time
- Number of incidents in operations
- System down-time
Even for a web site, group changes into versions and test each version fully
before deploying it.
Web site success factors:
- Ease of use
- Legibility of the site
- Speed of operations
- Minimal number of clicks
- Resistance to high load
- Security of transactions
- Confidentiality and non-repudiation
Two aspects of age-old work methods are disappearing: secrecy and restrictive
access to information.
Data Warehouse Best Practices
Notes from Inmon [6]
Four levels in the DWH architecture:
- Operational level: these are the OLTP systems, which are the source for the data warehouse
- Atomic (or data warehouse) level: subject oriented, integrated, non-volatile,
and time-variant
- Departmental level: the data marts
- Individual level: ad hoc reports and data extracts
The requirements for the data warehouse are not known in advance. The users
(analysts) discover what is available as the data is loaded into the data
warehouse. I guess this means that we can be more agile. With my experience
at FMC, this means bringing data into an area that looks like the source, with
as few transformations as possible:
- Every current row in the source has a row in this area: no filters.
"If the users enter data in the source, we want to see it in the data warehouse."
- No aggregating: get the lowest granularity into the data warehouse,
along with the source's logical key
- Deleted rows in the source have a row in this area
- As few mandatory fields as possible, so as to have as few reasons to
reject a row as possible
- The staging areas are visible to the users, at least the power users
- Integrate data from various sources; create new fields as necessary,
filling them with null for the sources that do not have an equivalent
- But, don't just dump the data: try to understand it and present
something meaningful
- If you have lookup tables, and if maintenance of these tables is an issue,
then present the original value and the result of the lookup side by side
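As an illustration of the last point, a minimal sketch (the table and column names are assumptions), showing the original code next to the looked-up label:
SELECT s.order_id,
       s.status_code,   -- original value as received from the source
       l.status_label   -- result of the lookup; NULL if the lookup table is not up to date
FROM staging_orders s
LEFT OUTER JOIN status_lookup l ON l.status_code = s.status_code;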
Monitoring of the data warehouse includes monitoring of the data and of the
users, to understand how the warehouse is growing, what data is used and by whom,
and what performance is like (query response times).
Documentation includes:
- Catalog of tables (detailed and summary)
- A profile of the contents
- A list of indexes
- How the data is growing
- The source of each field
- Optionally: the business meaning of the main fields
Granularity affects the volume of data and the type of queries.
A lower level of granularity, with more detail, allows more
types of analysis of the data. Many details are lost in summarized data.
One option is to store detailed data in less expensive storage and
summarized data in storage with faster access.
Partitioning breaks data up into smaller, more manageable chunks.
This is often done by date, although other criteria can be used too.
Partitioning can be done at the system or database level, or in the
application. Partitioning in the application makes it easier
to have a different physical layout for different years.
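A minimal sketch of database-level partitioning by date, assuming Oracle-style syntax and made-up names:
CREATE TABLE sales_fact (
  sale_date DATE   NOT NULL,
  amount    NUMBER NOT NULL
)
PARTITION BY RANGE (sale_date) (
  PARTITION p_2021 VALUES LESS THAN (TO_DATE('2022/01/01', 'YYYY/MM/DD')),
  PARTITION p_2022 VALUES LESS THAN (TO_DATE('2023/01/01', 'YYYY/MM/DD'))
);
Partitioning in the application would instead mean separate tables (one per year, for example), which is what makes a different physical layout for different years easier.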
Although auditing can be done in the data warehouse, it makes
more sense to do it elsewhere, in particular in the source systems.
No big bang approach:
[6, p.39]: Data warehouses are not built all at once. Instead, they are
designed and populated one step at a time, and, as such, are evolutionary,
not revolutionary. The costs of building a data warehouse all at once,
the resources required, and the disruption to the environment all dictate
that the data warehouse be built in an orderly, iterative, step-at-a-time
fashion. The "big bang" approach to data warehouse development is simply
an invitation to disaster and is never an appropriate alternative.
[6, p.41]: The path ... through the building of data marts first ...
is short-sighted and leads to a great deal of waste.
Some More Tips
- Add time to the key of the entity: as effective date (with end effective
date) for snapshots, or as a time period for counts or balances
(month, year, ...)
- For derived elements, be sure to load the source elements
- When in doubt about how much to load for transactional data:
load more because you might not be able to get it again
- When in doubt about how much to load for reference data:
load less because it remains in the source and more elements can always be loaded
- Primary key for dimensions: source system key or surrogate key.
Primary key for fact tables: always a combination of foreign keys;
no surrogate key.
- Three contradictory requirements: design elegance, processing
speed, and information requirements
- "If it is not in all source systems,
then it is not a value used by all the enterprise."
- Free text values are of little value. Drop-downs are better.
- No single big order table, even though the order types have many fields in common,
and no main order table with sub-tables for each area, forcing the users to join
the main table to other tables each time.
Create separate tables for each type of order as understood by the users.
- Include facility everywhere
- Constantly improve processes by updating the templates, even if we do not go back and change past projects.
Keep templates as such, or keep a list of sample projects that can serve for the next project.
- Technical specifications should contain server names, scheduler job names, and process names.
And, yes, these details change from time to time, so the technical specifications should be short
enough that they are easy to update. Technical specifications should not repeat what the code does,
nor give table columns, just table names. If the functional specifications do not have
a source to target mapping, the technical specifications should have it.
- Thoughts on completely redoing a data warehouse, or other system,
which some people refer to as the "Big Rewrite".
Instead of redoing, try these:
- Common wisdom is to make small improvements to the existing code and steer it towards improvements.
Every time you touch part of the code, clean it up, refactor it, and improve it.
Do not just patch the code.
- Make small pieces that are independent from each other
- Start by putting in version control
- Set up automated testing
- Each component should do one thing, and do it well
Overview of the Data Warehouse Project
Phases: Scope, Assessment of needs, Design, Development, Rollout,
Maintenance and further development.
- Design: technical architecture; dimensional model, then physical model; user application
- Development: installation of products; development of loading processes; user applications
- Rollout: put it all together and roll out
Notes
- The effort of the design and development of data staging is often under-estimated
- Data staging is designed for (1) the first loading and (2) regular loadings.
- An important issue is about corrections that come after data is loaded, for example
corrections for a given month that come when the following month is being loaded.
These corrections may not be explicitly visible, except that they have dates
that are before the "normal" loading range. The questions are: can the records
be changed after being imported? How far back in time can this happen?
- In relation to the previous point, the data that is received can correspond to
(1) snapshots, (2) updates (the delta) or (3) corrections. The problem is
that updates look like corrections but the difference is that the old data
is correct in the case of updates but the old data is wrong in the case of
corrections. History should be corrected in the case of corrections, but
it is also necessary to leave an audit trail of applied corrections.
- Some analysts may want to put use cases in the design document. Data
warehouse projects involve only batch processes and do not involve screens or
user interaction. The user-visible output consists of reports and possibly
email notifications of process completion status.
- Conceptually, there are only two functions: a one-time load of historical data
and a periodically recurring update thereafter. You could argue for a third function:
selectively re-load a small number of rows for bug corrections.
- The data warehouse team should manage the loading of the data
for reasons of scheduling, efficiency, code maintenance, etc.
Some Thoughts on Organizing Business Intelligence
Books have been written on this subject. Here is my humble contribution:
- Self-service BI: power users create ad-hoc reports, and casual users need
flexibility in navigating existing reports, mostly in the form of drill-down.
- Power users need access to all the data, and should ideally have a sandbox
environment. Prevent a large number of ad-hoc reports showing up alongside
standard reports by having a committee of stakeholders decide what becomes a
standard report.
- IT should continue to create and maintain the standard reports.
- Super users -- power users with technical abilities -- are capable of
suggesting changes to existing standard reports because they understand how
these reports work
- Executive users should use the fully-tested, standard reports
Design
Debate of dimensional model versus relational (normalized) model: Over the long
term, a relational model is more robust and can handle changing requirements
better. Use views to make a relational model look like a dimensional model.
However, this assumes that the database server is powerful enough to support
the use of many joins. Selective denormalization could be an option, knowing
that it is risky to undo normalization efforts. Note that, if data marts are
considered sub-sets of a data warehouse, each data mart should be in
a dimensional model (star or snowflake).
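A minimal sketch of such a view, with assumed normalized tables (customer, region, segment):
CREATE VIEW customer_dim_v AS
SELECT c.customer_id,
       c.customer_name,
       r.region_name,   -- denormalized from the region table
       s.segment_name   -- denormalized from the segment table
FROM customer c
JOIN region  r ON r.region_id  = c.region_id
JOIN segment s ON s.segment_id = c.segment_id;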
There is a lot of data out there. But people want information. Yet, we
(the designers) do not know what the users want, and nor do the users.
The process is iterative. [Loosely from Michael Scofield]
- Start with as much data as possible, down to the lowest level of granularity.
- Be prepared with appropriate reporting tools, with graph producing tools.
- Summarize to the highest level possible and show to the users.
- Iteratively enter into lower levels of detail as requested by the users.
Note that some reports serve the purpose of making people feel good that
everything is under control. This is fine. But if the report is expensive to
produce, then point it out. [Michael Scofield]
Integration: when two or more databases are involved, agree on
labels (the labels or categories have to have the same meanings) and on measures.
Labels are the descriptions of the dimensions and these labels
should be consistent across the enterprise.
IT cannot enforce consistent labels, only senior management can.
Master dimensions that contain common labels are called conformed dimensions.
Ideally, the most detailed or atomic dimensions should be stored. You can always summarize,
but you can't re-create details. And don't mix the granularities.
Produce a diagram with the fact table in the middle, and its grain clearly shown,
and the dimensions laid around.
Other important parts of the design are treated elsewhere:
- business needs
- compliance
- data profiling
- security
- Backup and recovery
- "single source of truth", see data quality
- machines with enough capability (both disk size and CPU speed)
- Company naming standards
International situations:
- Choose a single company currency and store each value in the reporting currency and the local currency.
- Use a single master language.
- Privacy and compliance best practices should be handled by the legal and financial executives.
- Implement Unicode
Thoughts about Kimball's kitchen in times of the horsemeat scandal
Kimball's kitchen analogy allows too much complication. Restaurant
diners want taste and pleasure. Data warehouse users want to know what is
on their plate. With complicated and refined dishes, it is difficult to
establish lineage for the data elements.
I like horsemeat, but I want it labelled as such. I like other meats, but
I want to know which one it is.
--> SIMPLICITY
Yes, it is hard to model the real world.
So simplify the concepts and establish clear source to target mappings.
Some Thoughts on Master Data Management
Master data management (MDM) helps maintain reference data about
important business entities across an organization: a single and consistent set of data.
The MDM can be the source of data or it can simply store the keys to equivalent records
(loosely taken from TDWI, multiple documents)
Dan Power in "Ten Mistakes to Avoid When Designing and Building Your MDM and Data
Governance Initiative" (TDWI publication) says that the "build it and they shall come"
"does not work in the real world of Master Data Management" (page 2). Instead, start with the
problem and address existing issues. Corporate sponsorship is also very important.
Concerning ownership, Dan Power says what I have seen very clearly in my
experience: "The business has to own MDM and data governance. As
tempting as it is to start and finish with the technology, it
just doesn’t work. When MDM is driven by the IT group, the
business may not understand or buy in (or even realize
it’s there)." The business has to own the overall process, and
IT simply provides a tool. The MDM is part of data
governance "that happens to be facilitated by MDM
technology."
Some Thoughts on Agile Development in Business Intelligence and Data Integration
Report writing can more easily use agile methods than data integration. Data integration
spends a lot of time mapping source to target, both fields and keys. In addition, to be agile,
test data for data integration is needed and this is not always easy to generate. (Notes loosely taken from TDWI material.)
Conformed dimensions enable more agile data warehouse development. If senior management,
who wants the projects to go faster, supports the effort needed to define conformed dimensions,
then this will help speed up development. A minimal number of agreed-upon attributes is enough to start
and more can be added as time goes on. Start with a small number of attributes from a few sources,
for the customer dimension for example.
(See Kimball tips #73, #106, #111 and #135.)
Some notable aspects of agile development
- Builds fast, reveals errors early, and prompts corrective action soon in the development cycle: needs test cases up front with automated test suites
- Gets business and IT together in the same room, or at least a lot of interaction
- Small, simple increments; get a working system up and running so as to get feedback from users
- More emphasis on adapting to change as opposed to meeting requirements
A user review meeting:
"Developers should
not solve problems, troubleshoot,
or make commitments based on
user feedback. The action taken
as a result of feedback is prioritized
based on an assessment
of the effort required to make
changes. Others may observe but
should not participate in these
meetings."[5]
In iteration 0 (see Highsmith according to [5]),
the team sets up the environment
In data warehousing, the users are not interested in the ETL, but in the resulting reports. Yet, technical features
need to be implemented too. Keep them small so that they fit in one iteration.
Highsmith in [5] says that
each cycle should leave a complete product so that "if project funding were
stopped immediately, users would
at least have a working product
that meets some requirements." "Done" means "done."
For testing, build a copy of the source systems and keep a subset of meaningful samples.
Keep a catalog of test cases with descriptions.
Difficulty with evolving data models: developers need them locked down, the process needs flexibility.
So, keep the model simple without contingencies for all sorts of possible things in the future.
Shortcuts and bad design increase maintenance efforts: try to continuously improve the design.
Use refactoring techniques and automated testing to successfully evolve database models.
Having a customer on the team is essential. He/she must be the appropriate customer.
Facts
One of the first steps is to determine the grain.
Facts with different grains can be compared if the dimensions are conformed,
but not stored on the same fact table.
There are only three types of grain in a data warehouse [Kimball]:
- Transaction grain: each row describes one transaction (at a date or time)
- Periodic snapshot grain: a snapshot is taken at regular intervals:
end-of-day, end-of-month, end-of-year...
- Accumulating snapshot grain: similar to previous, but with dates for
status such as order placed date,
order fulfilled date, payment received date, ...
Each date represents one step in the process.
Generally, the updates are destructive, meaning that the record is overwritten.
Generally, facts are additive. Inventories and bank balances
are semi-additive, that is,
the data is summed and then averaged over the periods by dividing the
sum by the number of periods (or by returning the max or min).
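A minimal sketch of averaging a semi-additive measure over periods, with assumed table and column names:
SELECT account_id,
       SUM(balance) / COUNT(DISTINCT snapshot_month) AS average_monthly_balance
FROM account_balance_snapshot
WHERE snapshot_month BETWEEN 202201 AND 202212
GROUP BY account_id;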
Events are represented by factless facts where the fact table contains only
foreign keys and no facts.
In the bus matrix, the columns are the nouns and correspond to the dimensions.
The rows are verbs describing activities or processes and correspond to the facts.
Be sure to group all elements of a process together even if they cross departments.
In the case of hierarchies, the column header shows the most detailed level; a
fact requiring a rollup can be shown with a note in the cell.
Combining two different measures in a query (drilling across in [Kimball])
is best done by sending two queries to the database and combining the results.
The dimensions need to be conformed (leitmotiv of Kimball). The results
are sort/merged.
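A minimal sketch of drilling across, with assumed fact and dimension names; each query is sent separately and the two result sets are then sort/merged on the conformed attribute (product_name):
SELECT p.product_name, SUM(s.sales_amount) AS total_sales
FROM sales_fact s
JOIN product_dim p ON p.product_key = s.product_key
GROUP BY p.product_name;

SELECT p.product_name, SUM(i.quantity_on_hand) AS total_inventory
FROM inventory_fact i
JOIN product_dim p ON p.product_key = i.product_key
GROUP BY p.product_name;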
The accumulating snapshot has multiple timestamps. Note that if a time dimension is used,
then join multiple aliases of the time dimension table to the fact table.
For each of the dates, add a numeric "fact", a 0 or 1, indicating whether
the step has been completed. These additive numeric facts make reporting easier.
We can also add a field with the lag or time difference between important steps.
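A minimal sketch of querying an accumulating snapshot, with assumed names; note the multiple aliases of the date dimension, the 0/1 completion fact, and the lag fact:
SELECT d_placed.calendar_date  AS order_placed_date,
       d_shipped.calendar_date AS order_shipped_date,
       f.shipped_flag,                -- 0 or 1: has this step been completed?
       f.placed_to_shipped_lag_days   -- lag between two important steps, stored as a fact
FROM order_fulfillment_fact f
JOIN date_dim d_placed ON d_placed.date_key = f.order_placed_date_key
LEFT OUTER JOIN date_dim d_shipped ON d_shipped.date_key = f.order_shipped_date_key;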
Counts:
A way of getting around counts in an outer join: create CASE expressions
to count when nothing is found in the join on one side, when nothing
is found on the other side in the case of a full outer join, and
when a row would have been returned by an inner join.
Kimball's Tip #12
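A minimal sketch along those lines, with assumed table names; the CASE expressions distinguish the three situations in a full outer join:
SELECT COUNT(CASE WHEN t.logical_key IS NULL THEN 1 END) AS only_in_staging,
       COUNT(CASE WHEN s.logical_key IS NULL THEN 1 END) AS only_in_target,
       COUNT(CASE WHEN s.logical_key IS NOT NULL AND t.logical_key IS NOT NULL THEN 1 END) AS in_both
FROM staging_orders s
FULL OUTER JOIN dwh_orders t ON t.logical_key = s.logical_key;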
Dimensions
See diagrams for types further below.
Type 1 consists of overwriting any existing values with the new value.
Type 2 SCD (Slowly Changing Dimension): use a field for begin_effective_date
and end_effective_date. The end_effective_date contains null or a maximum date
such as 2099 (preferred by [Kimball]). For ease of query writing, add a flag
indicating the current row (redundant with end effective date = 2099).
An advantage of the type 2 SCD is that it partitions history
(see Kimball's Tip #8).
This means that a fact record knows what the dimension looked like when the fact
record was inserted. The historical rows in the fact tables reference the non-current rows
in the dimension. Therefore, do not exclude the historical records with a
condition on the flag mentioned above. The historical rows in the dimension table are NOT updated.
Therefore, to get the current values for historical
rows, see below.
The most a type 2 slowly changing dimension would need is:
- All useful fields from the source
- Type 6 attributes (see below) that contain the current value for each attribute.
- A surrogate primary key. You want to keep track of the source's primary key (the natural key).
The unique key of the dimension could be the natural key along with the insert timestamp.
- Begin effective time stamp (a real timestamp, not a surrogate key; at least an integer of the form yyyymmddhh24miss).
Instead of just "SYSDATE", try to make it the same for all records in a batch so that
you can link back to a control table
- End effective time stamp. For the current row, a date in the future
- Effective date surrogate key (daily grain) connecting to date dimension as a snowflake [Kimball]
- Change description field (text): this is somewhat of a luxury
- Current row flag for ease of querying, instead of querying on the most
recent insert timestamp or on a null end timestamp
To report on a dimension with current values, add extra attributes to the dimension that are updated
each time a new row is added to the dimension.
Kimball calls this type 6 (Kimball Tip #15).
This is in contrast to the type 2 SCD, which reports on facts linked to the dimensions
as they were at the time of insertion (see above).
Aggregate dimension: pre-calculate facts with one dimension less or higher
up in the hierarchy
"Junk" dimension: dimension table with miscellaneous, unrelated columns
Snowflake compared to star: snowflakes conform to third normal form and stars to
second normal form.
Star schemas are more optimal for DW/BI query performance. The update issues linked to the
second normal form are handled through the data warehouse update process.
Space saved through snowflakes is generally negligible compared to the increased
query performance.
Outriggers are dimension tables linked to other dimension tables, but not in a
snowflake schema. They have their purpose, but use with moderation.
Bridge tables are used to represent many-to-many relationships or ragged hierarchies.
They increase complexity. Using a primary sales person or a top member in the
hierarchy represents a possible compromise to remove the need for a bridge table in most cases.
Degenerate dimensions are dimensions with just one attribute. It is not worth
building a table with this attribute and a surrogate key. So put the value
directly in the fact table.
Store the data in the lowest granularity possible, and build summary reports.
If only summaries are stored, you will never be able to drill down.
If each department defines dimensions in the Kimball approach, yes, there is
a risk of having silos. That is why, if I understand correctly, Kimball et al.
advocate building process-centric dimensional models.
Kimball et al. say that starting the data warehouse design from a report puts you
at risk of unrealistic expectations from the users. Instead, add dimensions one at a time.
Type 1
KEY | VALUE |
123 | ABC |
789 | XYZ |
Insert 456, update 123, delete 789
SELECT KEY, VALUE FROM TABL;
No history
Type 2
KEY | VALUE | BEGIN_EFF_DT | END_EFF_DT | CURR_IND |
123 | ABC | 1988/01/01 | 2099/12/31 | Y |
789 | XYZ | 1988/01/01 | 2099/12/31 | Y |
Insert 456, update 123, delete 789, on 23 Apr 2022
KEY | VALUE | BEGIN_EFF_DT | END_EFF_DT | CURR_IND |
123 | ABC | 1988/01/01 | 2022/04/23 | N |
123 | BCD | 2022/04/23 | 2099/12/31 | Y |
456 | GHI | 2022/04/23 | 2099/12/31 | Y |
789 | XYZ | 1988/01/01 | 2022/04/23 | N |
To get the same results as for type 1:
SELECT KEY, VALUE FROM TABL WHERE CURR_IND='Y';
To get the historical results from a previous date, e.g. 2020:
SELECT KEY, VALUE FROM TABL WHERE TO_DATE('2020/01/01', 'YYYY/MM/DD') BETWEEN BEGIN_EFF_DT AND END_EFF_DT;
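A minimal sketch of how the changes of 23 Apr 2022 could be applied to the type 2 table above (same table and column names as in the example; the delete of 789 would be handled in the same way as the update, by closing its current row):
UPDATE TABL
SET END_EFF_DT = TO_DATE('2022/04/23', 'YYYY/MM/DD'), CURR_IND = 'N'
WHERE KEY = 123 AND CURR_IND = 'Y';

INSERT INTO TABL (KEY, VALUE, BEGIN_EFF_DT, END_EFF_DT, CURR_IND)
VALUES (123, 'BCD', TO_DATE('2022/04/23', 'YYYY/MM/DD'), TO_DATE('2099/12/31', 'YYYY/MM/DD'), 'Y');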
Type 3
KEY | VALUE | OLD_VALUE |
123 | ABC | |
789 | XYZ | |
Insert 456, update 123, delete 789
KEY | VALUE | OLD_VALUE |
123 | BCD | ABC |
456 | GHI | |
789 | | XYZ |
To get the same results as for type 1:
SELECT KEY, VALUE FROM TABL;
To get the historical results, but without any distinction of date:
SELECT KEY, OLD_VALUE FROM TABL;
Data Warehouse Loading Processes
- Extraction from sources to staging area
- Transformation, including cleaning, combination of data sources (exact
or approximate matching on keys), creation of surrogate
keys (so as not to depend on "real" keys that may change), construction
of aggregates
- Loading from staging to data warehouse; indexing
- Quality control
- Define data latency (24h, several minutes for real-time)
- The Source-to-Target specifications define the full path from source to the
main data warehouse, and eventually to the data marts.
- Define intermediate hops for quality control and monitoring. Include the following for each step:
- Daily count of rows loaded, including count of rows in error
- Queries to tie back to the previous step or to the source, to ensure
that the number of rows per date (or other grouping) matches, based on the
logical key (a sample tie-back query is sketched after this list).
- Queries to verify the values in each attribute between the steps:
for each value of the logical key, ensure that values in a given step tie back to the previous step.
- Document error conditions, and what to do in each case
- Queries for one given logical key across all steps to see the progression
- Design the process in modules, with clearly documented inputs and outputs.
Do not allow shared variables.
If developers want to get creative inside the modules, it is their choice.
By having modules that are small enough, spaghetti code can be confined to
small, independent boxes that put a fence around the bad code.
For ETL, make small hops, as described above.
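A minimal sketch of the tie-back query mentioned above, with assumed staging and warehouse table names; it lists the load dates where the row counts, based on the logical key, do not match:
SELECT COALESCE(s.load_date, w.load_date) AS load_date,
       s.row_count AS staging_rows,
       w.row_count AS warehouse_rows
FROM (SELECT TRUNC(load_ts) AS load_date, COUNT(DISTINCT logical_key) AS row_count
      FROM staging_orders GROUP BY TRUNC(load_ts)) s
FULL OUTER JOIN
     (SELECT TRUNC(load_ts) AS load_date, COUNT(DISTINCT logical_key) AS row_count
      FROM dwh_orders GROUP BY TRUNC(load_ts)) w
ON w.load_date = s.load_date
WHERE s.row_count IS NULL OR w.row_count IS NULL OR s.row_count <> w.row_count;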
Date and Time Dimension
Use a calendar dimension (with holidays, etc). When hours/minutes/sec is needed,
then have a separate timestamp in the fact table: consider this a special type
of fact. Therefore two fields: foreign key to the calendar dimension and timestamp
field. [Kimball]
If there is no need to keep the time of day in the data warehouse, then
keep the date rounded to the day as a "meaningful" surrogate key
(the only exception to non-meaningful keys) and point to a date dimension
that keeps track of the days of the week and the company holidays.
Note that Ralph Kimball says that even the time dimension should not be meaningful, but
he does admit that "the time dimension is the ONLY dimension that has any logic to the
surrogate keys and is the only one we dare place application constraints."
Tip #14.
Although Kimball insists that the keys should be meaningless surrogate keys, I think that
dates should be an exception. It is tedious to have to join to a separate table to get the date
for every date field. Handle dates like degenerate dimensions and keep the meaningful dates in
the fact table, with a calendar dimension that you can join to if necessary.
If there is a need to keep track of the time of day, then keep one
field as above with the date without the time and another field either
with just the time or with the date and time. If the field contains just the
time, then consider a time dimension with indications of work hours, shifts, etc.
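A minimal sketch of the two fields described above, with assumed names: a foreign key to the calendar dimension plus the full timestamp treated as a special kind of fact:
CREATE TABLE call_fact (
  call_date_key INTEGER   NOT NULL,  -- e.g. 20220423, joins to the calendar dimension
  call_ts       TIMESTAMP NOT NULL,  -- full date and time, treated as a fact
  duration_sec  NUMBER    NOT NULL,
  CONSTRAINT call_fact_date_fk FOREIGN KEY (call_date_key) REFERENCES calendar_dim (date_key)
);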
History
The tracking of history requires that each piece of data have
defined begin and end dates of validity.
Source System Analysis
It is generally a good idea to create the role of Source Data Steward, especially
for complex systems with data important to other business applications.
Terms related to next generation data warehouses (DWH)
- Real-time data warehouses:
- Need trickle or streaming loads
- High availability is not required, but long down times don't make sense with a real-time DWH
- Data federation: get data from source when it is needed. Alternative to trickle or streaming load
- P.N.: note the difference between real-time, where the change triggers the load, and near real-time, where the batch load happens at short intervals
- Master Data Management (MDM) and Data Quality (DQ) into DWH
- Advanced Analytics: moving beyond basic online analytic processing (OLAP), such as
data mining, predictive analytics, statistics, and artificial intelligence. Ad-hoc SQLs put
a strain on resources, yet this is what most analysts do.
- Open Source Software. The main objections are linked to getting support and maintenance and to
learning the tool.
- Cloud Computing and Software-as-a-Service (SaaS). Note, a private cloud could address issues of security, volatility of public cloud providers, and the difficulty of moving data into and out of a cloud
- In-Memory Processing and 64-Bit Computing
- Service-oriented architecture (SOA) and Web services
- Data Warehouse Appliances
Detecting Changes in the Source (Changed Data Capture)
- Last update date on the source tables: note that this may miss manual
changes to the data, in particular for rows deleted from the source table,
and also inserted or modified rows if the last update date is not set appropriately.
- Use a timestamp on source data to determine the delta to load
- Load the full set of data, and compare with a previous set in the staging area: complete, but takes a long time
- File comparison: complete, but takes a long time
- Detect change in incoming data by using a checksum: if the checksum changed,
then the row changed. However, in some cases, the checksum is the same for different
values; therefore, for the rows with an unchanged checksum, compare all the columns
(a sketch is given after this list). A timestamp could also be used, but it is not ideal.
- Is the source data composed of a full data set, is it a feed
of incremental data (with updates), or is it a feed with deltas (debits/credits, deletes/inserts)?
- Have a status in the source table. Set it to "loaded" when the row is processed and loaded into the target.
Note that this works well for inserted and modified records, but not for deleted records: the records deleted from
the source would not have any placeholders for the status field.
- Are soft or hard deletes used in the source database? Adjust CDC accordingly.
- Row create dates can be used easily for replication;
row update dates are trickier (you have to look for the row in the target to update it)
- Does the design include detecting when data is changed by the DBA, outside of the application?
Work with technical staff who know how the source system updates data
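A minimal sketch of the checksum-based change detection above, assuming Oracle's ORA_HASH function, made-up tables, and a checksum stored in the target:
SELECT s.cust_id
FROM staging_cust s
JOIN dwh_cust d ON d.cust_id = s.cust_id
WHERE ORA_HASH(s.cust_name || '|' || s.cust_city || '|' || s.cust_status) <> d.row_checksum;
-- Because two different rows can produce the same checksum, the rows whose
-- checksum did not change can still be compared column by column as a safety net.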
Lessons Learnt from the Trenches
- Define the single source of truth for every data element.
It should ideally be where the data element is entered by users.
- Clearly identify the unique key for a single row.
- Using the same granularity as the source system allows better data quality.
It also improves loading speed because aggregation and key calculation are not necessary.
Design the target at the lowest level of granularity, and without aggregation
- If possible, use the source system's identifier to determine a single row
(For example: visit id from ecc, with the other system's identifier as an attribute.
Another example for a key: PAN worked well from PRO. We should have used visit_id from ecc)
- Prefer ELT over ETL. Show most source data before transformation in intermediate tables.
Always have an intermediate table with the same granularity as source,
before merges or creation of additional rows (examples: setting and modality in kcng hard to troubleshoot,
successful case: easier troubleshooting with visit in kc)
- Get access to source. Create queries on source that allow comparisons to data in target.
These queries validate the ETL/ELT work, and help fill holes in the documentation.
- Get a golden copy of all specifications in a single place. When new modifications come,
it should be clear where to add them. (bad examples: visit and adm document holding other interfaces,
multiple places for vao, missed version in TALR FSD. )
- Implement an easy way to re-trigger the loading of specific data elements:
- For source data elements, when the usual CDC is date based, maybe have an alternate way of selecting
data to load based on the key
- For intermediate tables, when the CDC is based on a status indicator, implement an easy way of setting
the status to "pending" (see the sketch after this list)
- Keep all MDM merge and unmerge decisions in one place, not scattered in emails
- In the "do not let them see you sweat" category: do not share how difficult it would be to implement
a user request during the requirements gathering. Hold until later, and present options, or delay
the implementation. A thought: if the user requirements box you into a corner for implementation,
then maybe the requirements should be taken up a level, in the sense that the requirements are
about what is to be implemented, not about how.
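A minimal sketch of the re-trigger mentioned above, with assumed table, column, and status names:
UPDATE staging_orders
SET load_status = 'pending'
WHERE logical_key = 12345      -- the specific element to reload
  AND load_status = 'loaded';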
Data Lakes
- Large amounts of data
- Keep in original format
- Load fast, transform later, less data cleansing
- No requirements, just dump data
- As opposed to data warehouses, data lakes do not seem to be subject oriented and integrated. They may be time-variant (have historical data).
- Data lakes are nonvolatile like data warehouses, meaning modifications are not done in the data but in the sources
The methodology below covers Source System Analysis, Target System Analysis, and mapping.
See also notes on Structured Analysis and
Data Warehouse Best Practices.
Methodology
- Top-level meeting to define the subject area and the related source table(s).
Determine with client what tables or data feeds are required to cover the subject area.
Define the conditions for the specific data that
we want, and any links to other entities of interest.
Outcome is a commitment on the
client side to provide specific data that covers the subject area:
arrange to get a data extract or a full dump. Try to get information on which fields are
important and which fields are not used, empty or do not matter
- Is the loading effort a one-time conversion or is it an interface that is
accessed at regular intervals?
- Define the access to the data:
database tables to be exported, files to be exported via application, a feed received
from an external company?
- Do basic profiling of the source data to obtain
measurements of important qualifiers such as the percentage
of null values in data fields, the degree of duplicate records, the percentage
of bad data formats (especially dates) and the distribution of values.
See element analysis above
and data profiling.
- Then comes attribute-level analysis: dive in, do data type analysis, filter out un-used fields.
Try to find out what each field means.
- Determine how to detect change (Changed Data Capture): See above
- Determine what initial load is necessary. This can consist of a full load of current customer or product
data, or an historical load of recent transactions going back a pre-determined number of months. The
initial load could also consist of the migration from an old data structure.
- Follow-up meeting with source system expert to ask additional questions at attribute level.
- In the absence of, or as a complement to, the data profiling, ask for:
- data types
- descriptions and definitions
- domains of values, ranges of values
- how null, unknown or non-applicable values are indicated,
- rules that have to be applied to the data
- what constitutes an invalid value, constraints on the values
- definition/meaning of the codes
- unique/primary keys, or what defines a single row
- are historical rows kept in the source?
- do some fields have multiple uses, possibly depending on other pieces of data
- Top-level meeting with the target system expert for target-level analysis. Define the target tables or APIs and ask for documentation.
Try to map the source entities to the target entities.
Only talk about attributes if they are keys or if they hold important data.
- Do preliminary mapping
- Meet target system expert and start the mapping
- Up to here, about two weeks, maybe three
- Validate target data model against expected data warehouse design.
Provide the following:
- Mockups of the expected reports
- Matrix of measures and the dimensions they will be linked to
- Typical business questions
- All required target fields need data; all source fields with data should have a destination
unless it is explicitly known that the data is not needed. Add a column in the spreadsheet to
say if the field is used in the target and label each source field with one of the
following: no significant data, same data as another field, internal primary key or internal foreign key, or the target field.
- Make sure that there is information for the source table name, source field name, transformation,
lookups (table, condition, returned fields)
- Point out the cases where strings feed into date or number fields. Validation rules have
to be set up. Look for cases where data will be truncated.
- Define error conditions for invalid values and/or null values, field by field.
Pay special attention to null values for fields that are part of the key.
- Define what constitutes a new row in the target in the case of new, changed, and deleted
rows in the source. Ideally, the logical key in the target is the same as in the source. If the
keys are different, then extra care is needed to map out the scenarios. Also validate which rows
are kept as history
- Validate the design with IT architects and DBAs. Then talk to:
- source system developers and DBAs who know the source system model well.
However, they may not be familiar with dimensional modeling.
- power users, who are generally technical
- core business users, who may be less technical
- and representatives of the broader user community,
focusing on how the data will be used (examples of questions and the corresponding queries as answers)
(tip #108 from www.kimballgroup.com).
- Push the small "nice-to-have" changes to the next version.
- After development, show the data in the target. Be prepared to answer questions about where each piece
of data comes from.
Documentation should contain:
- Field by field source to target mapping, with transformation rules
- Lookup / reference tables
- Error severity (information, warning, severe)
- Error handling level: pass on (with warning), suspend the row and reprocess until conditions are met for success, pass the row but reprocess
- The primary, logical, and natural keys in the target. Diagram a sample
combination of keys. The sample contains all key fields;
no need to put in the non-key fields unless there is some pertinence.
- Define when to insert, update, or delete a row in the target
- Diagram(s)
- Use cases / examples
- Graphical data model
Zachman Framework for Enterprise Architecture
Columns: What (Data), How (Function), Where (Network), Who (People), When (Time), Why (Motivation).
Rows, by layer and stakeholder group:
1. Scope (contextual boundary); stakeholder: Visionary / Planner to the business
- What: list of things important to the business (entity = a business thing)
- How: list of processes the business performs (process = a business process)
- Where: list of locations in which the business operates (node = major business location)
- Who: list of organizations important to the business (people = major organizational unit)
- When: list of events significant to the business (time = major business event or cycle)
- Why: list of business goals/strategies (ends/means = major business goal or strategy)
2. Conceptual Business Model; stakeholder: Owner
- What: semantic or entity-relationship model (business entities and business relationships)
- How: business process model (business processes and I/O as business resources)
- Where: business logistics system (business locations as nodes and business linkages as links)
- Who: work flow model (organizational units as people and work products)
- When: master schedule (business events as time, business cycles)
- Why: business plan (end = business objective; means = business strategy)
3. Logical System Model; stakeholder: Designer
- What: logical data model (data entities and data relationships)
- How: application architecture (process = application function; I/O = user views)
- Where: distributed system architecture (nodes are processors, storage units, etc.; links are line characteristics)
- Who: human interface architecture (people = roles; work = deliverables)
- When: processing structure (time = system event; cycle = processing cycle)
- Why: business rule model (end = structural assertion; means = action assertion)
4. Physical Technology Model; stakeholder: Builder
- What: physical data model (segments/tables and pointers, DB relationships)
- How: system design (computer functions and data elements)
- Where: technology architecture (nodes are hardware or software systems; links are line specifications)
- Who: presentation architecture (people are users; work is the screen formats)
- When: control structure (time = execute; cycle = component cycle)
- Why: rule design (end = condition; means = action)
5. Component Configuration; stakeholder: Implementer
- What: data definition; How: program; Where: network architecture;
Who: security architecture; When: timing definition; Why: rule specification
6. Functioning Enterprise Instances; stakeholder: Worker
- What: data; How: function; Where: network; Who: organization; When: schedule; Why: strategy
Risks
- Space
- Logistics such as food and lodging
- Transportation
- Permissions to use equipment, property, ...
- International borders
- Licenses, permits and clearances
- Insurances
- Weather
- Unavailable resources
IT Risks
- Cannot handle the production load
- Incompatibility between components
- Advertised performance does not correspond to real performance
- Inadequate skills for supporting the application
Project Risks
- Not enough financing, or financing that is not solid enough
- Poorly thought-out business model
- Excessive expenses
- Absent clients
- Deadlines are not respected
- Slipping schedule
Attitude towards risk
In relation to technological innovation, there are the following types of profiles:
- Visionaries, who are enthused by new technology and envision new possibilities.
- Those with a more pragmatic approach, adopting new technology if proven
- Those with a conservative approach, accepting a new system only if widely
used, generally choosing expensive but guaranteed solutions
- Skeptics
Risks
Risk = impact (e.g. delay or costs) * probability
This should lead to action to avert the risk or reduce the impact.
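For example, a risk with a potential cost of 50,000 and an estimated probability of 10% represents an exposure of 5,000, which can be compared with the cost of the action that would avert the risk or reduce its impact.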
Stakeholders
- Have I identified all stakeholders?
- Have I identified a leader or spokesperson for each group of stakeholders?
- Have I ever met the stakeholders?
- Can I describe the fears of the stakeholders in relation to the project?
- Do I know what the stakeholders want?
- Are my stakes included?
Requirements
- Unclear goals and scope.
- No room for negotiation
- Unreliable estimations
- People without a clear role in the project
- The project smells like a trap
Risk Response Strategies:
- Avoid: change the plan to eliminate the possibility that the risk will occur.
- Transfer: transfer consequences of the risk to a third party. An example is insurance.
This means that the impact or consequences will be felt by the third party.
- Mitigate: reduce the possibility that the risk will occur or reduce the
impact/consequence of the risk.
- Accept: accept that the risk will occur and develop a plan to deal with the consequences
should they occur.
Miscellaneous
The amateur knows what to do;
the professional knows what not to do.
Sources:
- [1] Henri-Pierre Maders: Manager une équipe projet. Leadership, grilles
de lecture, cas concrets. Éditions d'Organisation, 2003
- [2] Sunny and Kim Baker: The Complete Idiot's Guide to Project Management.
Alpha Books, 1998
- [3] Free Software Selection Course (looks like a personal web site, but good reading;
it inspired many of the comments above)
- [4] Wayne W. Eckerson: Five Steps to Bridge the Business/IT Gap.
The Data Warehouse Institute
- [5] Dr. Ken Collier (Senior Consultant, Cutter Consortium), with Jim Highsmith
(Director, Cutter Consortium Agile Product & Project Management Practice):
Agile Data Warehousing: Incorporating Agile Principles
- [6] William H. Inmon: Building the Data Warehouse, Fourth Edition. Wiley
Some language tips:
- Best practice = "I recommend it, but you don't have to do it"
- "If you want to ..., click on ..."