Testing Data Warehouse DSS

Here are my notes on how to manage data in computers within a large enterprise.

Functional Testing Decision Support Systems

A "black box" approach to testing would start from the end result of all the processing: summary reports created from multi-dimensional cubes, which are the equivalent of a table in a relational database. The decision support system's output would be compared against detail data found in the cubes, using a tool that displays detailed data in multi-dimensional cubes (the Excel Add-in for Hyperion Essbase Analytic Services).

The add-in may connect to an Essbase Spreadsheet Web Service (a Java application), part of Hyperion's Essbase Spreadsheet Services product (since Release 7.1.2).

Because existing software to compare multi-dimensional cubes is not reliable, verification of multi-dimensional cubes requires generating a format that is common to both the data mart and the cubes so that the data can be compared. Such an approach is problematic because two sets of complex extract code would need to be created (an extract from a relational database and an export from a multi-dimensional database). Oracle 10g Data Pump technology can be used to simplify the extract.
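Once both extracts share a common format, the comparison itself can be small. The sketch below assumes each extract has already been reduced to a mapping from a tuple of dimension members to a measure value. The function name, the sample rows, and the tolerance are all hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, ...]   # one member per dimension, e.g. (product, region, month)


def compare_extracts(mart: Dict[Key, float],
                     cube: Dict[Key, float],
                     tolerance: float = 0.005) -> List[str]:
    """Return human-readable differences between two extracts."""
    problems = []
    for key in sorted(set(mart) | set(cube)):
        if key not in cube:
            problems.append(f"{key}: missing from cube export")
        elif key not in mart:
            problems.append(f"{key}: missing from data mart extract")
        elif abs(mart[key] - cube[key]) > tolerance:
            problems.append(f"{key}: mart={mart[key]} cube={cube[key]}")
    return problems


# Made-up sample data standing in for the two extracts.
mart_rows = {("Widget", "East", "2006-01"): 1200.0,
             ("Widget", "West", "2006-01"): 800.0}
cube_rows = {("Widget", "East", "2006-01"): 1200.0,
             ("Widget", "West", "2006-01"): 850.0,
             ("Gadget", "East", "2006-01"): 50.0}

differences = compare_extracts(mart_rows, cube_rows)
for line in differences:
    print(line)
```

The tolerance allows for rounding differences between the two systems; how large it should be depends on the measures being compared.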

Oracle's Warehouse Builder has a Data Quality Option used to determine, correct, and remove bad data.

Performance Testing Decision Support Systems

Data warehouses are commonly built with materialized views.

Unlike an ordinary view, which does not take up any storage space because it is generated on the fly, a materialized view provides indirect access to table data by storing the results of an aggregation query in a separate schema object. A materialized view definition can include any number of aggregations (SUM, COUNT(x), COUNT(*), COUNT(DISTINCT x), AVG, VARIANCE, STDDEV, MIN, and MAX).
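The difference between the two kinds of view can be seen in a small sketch. SQLite (used here through Python's sqlite3 module) has no materialized views, so a summary table created with CREATE TABLE ... AS stands in for one; the table and column names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (store TEXT, amount REAL);
    INSERT INTO sales VALUES ('A', 10.0), ('A', 15.0), ('B', 7.5);

    -- An ordinary view stores only the query text.
    CREATE VIEW sales_by_store_v AS
        SELECT store, SUM(amount) AS total, COUNT(*) AS n
        FROM sales GROUP BY store;

    -- Stand-in for a materialized view: the aggregated rows are stored.
    CREATE TABLE sales_by_store_mv AS
        SELECT store, SUM(amount) AS total, COUNT(*) AS n
        FROM sales GROUP BY store;
""")

# A new detail row arrives after the summary was built.
conn.execute("INSERT INTO sales VALUES ('B', 2.5)")

view_rows = conn.execute(
    "SELECT store, total, n FROM sales_by_store_v ORDER BY store").fetchall()
stored_rows = conn.execute(
    "SELECT store, total, n FROM sales_by_store_mv ORDER BY store").fetchall()

print(view_rows)    # computed on the fly, so it includes the new row
print(stored_rows)  # stale until the summary is refreshed
```

The stored summary answers queries without touching the detail rows, at the cost of storage and of needing a refresh when the detail data changes.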

Materialized summary tables use aggregate keys to define a hierarchy of aggregation. Several dimensions can be contained in each aggregate key.

Do retrievals at different levels of aggregation exhibit different performance?

Materialized views can be stored as summary tables in the same database as their base tables. This can improve query performance within OLTP systems (when Oracle query rewrite is enabled and the init.ora COMPATIBLE parameter is set accordingly).

Most databases do not have enough CPU capacity to handle both OLTP and the heavy demands of OLAP processing.

So to ensure good performance, data marts generally make materialized views available in a separate schema on a different machine from the OLTP system. The two databases are synchronized on a nightly (rather than instantaneous) basis.

Updates to Data

When changes to data occur, materialized views also need to be updated.

Materialized views stored in the same Oracle database as the master table's data use a materialized view log schema object to record changes, so that the views can be refreshed incrementally.
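The idea of logging changes and applying only those changes at refresh time can be sketched in a few lines. This is a conceptual illustration in plain Python, not how Oracle implements its logs; the names summary, change_log, record_sale, and fast_refresh are hypothetical.

```python
from collections import defaultdict

summary = defaultdict(float)   # store -> total sales (the stored aggregate)
change_log = []                # pending (store, delta_amount) entries


def record_sale(store: str, amount: float) -> None:
    """The write to the base table would go here; this sketch only logs it."""
    change_log.append((store, amount))


def fast_refresh() -> int:
    """Apply the logged changes to the summary, then empty the log."""
    applied = len(change_log)
    for store, delta in change_log:
        summary[store] += delta
    change_log.clear()
    return applied


record_sale("A", 10.0)
record_sale("B", 7.5)
record_sale("A", 15.0)
applied = fast_refresh()
print(applied, dict(summary))
```

Because only the logged deltas are processed, the cost of a refresh follows the volume of changes rather than the size of the base table. This simple form only works for aggregates such as SUM and COUNT that can be updated from deltas.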

DSS Performance Benchmarks

Picking the right size machine for DSS is made easier because computer manufacturers publish results after running the Transaction Processing Performance Council's TPC-H benchmark software (not ESSBASE nor OWB). The common TPC-H metrics are QphH (composite Queries per Hour) and the price/performance figure $/QphH. QphH is a composite metric obtained from the geometric mean of the single-user Power test metric and the multi-user Throughput test metric.
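As a worked example of the composite metric described above, the sketch below takes the geometric mean of a single-user and a multi-user result and then divides a system price by it. The function names and all figures are made up for illustration and are not published benchmark results.

```python
import math


def composite_queries_per_hour(power: float, throughput: float) -> float:
    """Geometric mean of the single-user and multi-user results."""
    return math.sqrt(power * throughput)


def price_performance(system_price: float, composite: float) -> float:
    """Dollars per composite query-per-hour."""
    return system_price / composite


# Made-up figures, not published benchmark results.
composite = composite_queries_per_hour(power=4000.0, throughput=2500.0)
print(round(composite, 1))
print(round(price_performance(250_000.0, composite), 2))
```

A geometric mean keeps a very high score on one test from hiding a weak score on the other, which an arithmetic mean would allow.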

The TPC-H benchmark software includes a random data generation program that can be handy even if you're not running the benchmark.

Basic Concepts

Members are the base elements of a dimension; they contain values.
Up to 63 dimensions are supported by MS-SQL7/2000.
Virtual dimensions are based on member properties, such as the number of children in the household.
Aggregations are totals of member values, such as sales by store manager.
Write-back is the ability to change member values in order to analyze "what if" effects.

Decision Support Services

A Logical Cube uses measures and dimensions from physical cubes, usually to compartmentalize security permissions. An example is payroll information for department x.

Each table is used to store rows and columns of information.
A column (field) is a single item of information kept for each record in the table.
A row of columns (also called a record) contains the information for a single entry.
A cursor points to the record being processed.

A primary key is a field, or set of fields, that is used to uniquely identify a specific entry record in a table.
A foreign key is used to create a link between the table it is in and another table. The relationship between a primary and foreign key is that a foreign key in one table 'points' to a primary key in another table to create a link or relationship between a record from each table.
Referential Integrity describes the validity of the relationship between parent and child data.
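The relationship between primary keys, foreign keys, and referential integrity can be tried out with Python's built-in sqlite3 module. The customer and invoice tables below are hypothetical; note that SQLite enforces foreign keys only after the PRAGMA shown is issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,      -- primary key: identifies one record
        name        TEXT NOT NULL
    );
    CREATE TABLE invoice (
        invoice_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
                    REFERENCES customer(customer_id),   -- foreign key to the parent
        amount      REAL
    );
    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO invoice  VALUES (100, 1, 250.0);
""")

# A child row pointing at a parent that does not exist violates
# referential integrity and is rejected by the database engine.
try:
    conn.execute("INSERT INTO invoice VALUES (101, 99, 10.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```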

A variable is a container to store values which can be changed.
Variables must be declared before being referenced (no forward referencing).

SQL statements can be embedded into C, COBOL, FORTRAN, and other programming source between EXEC SQL EXECUTE and END-EXEC for processing by the pre-compiler.
References to host variables from within a SQL block begin with a colon (:).
Database engines such as "SQL Server" do the queries, additions, deletions, etc. that are performed on the data.
SQL = Structured Query Language. SQL was first implemented in IBM's System R research prototype, the forerunner of IBM's DB2 mainframe database.

A data warehouse collects historical subject-oriented data for non-volatile time-series analysis using analytical engines such as Excel 2000. Such databases are kept separate from operational data stores to avoid degrading the performance of on-line transactions. This separation also allows for integration of data from various sources.

A trigger is a procedure that is stored in the database and executed automatically (implicitly) before or after an insert, update, or delete command is issued against the associated table. Triggers are used to transparently create event logs, which can be used to generate statistics or to review security enforcement. Trigger conditions must be a SQL condition and not a PL/SQL condition.
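A trigger that writes an event log can be sketched with SQLite through Python's sqlite3 module. The syntax differs in detail from Oracle's, and the account and account_log tables are made up for illustration, but the shape is the same: the WHEN clause is a SQL condition, and the body runs implicitly after each qualifying update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE account_log (
        account_id  INTEGER,
        old_balance REAL,
        new_balance REAL,
        logged_at   TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- Fires implicitly after each update that actually changes the balance.
    CREATE TRIGGER account_audit
    AFTER UPDATE OF balance ON account
    WHEN NEW.balance <> OLD.balance
    BEGIN
        INSERT INTO account_log (account_id, old_balance, new_balance)
        VALUES (OLD.id, OLD.balance, NEW.balance);
    END;

    INSERT INTO account VALUES (1, 100.0);
""")

conn.execute("UPDATE account SET balance = 80.0 WHERE id = 1")
conn.execute("UPDATE account SET balance = 80.0 WHERE id = 1")  # no change, not logged

log_rows = conn.execute(
    "SELECT account_id, old_balance, new_balance FROM account_log").fetchall()
print(log_rows)
```

The application code issues only the UPDATE statements; the log rows appear without the application knowing the trigger exists.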

Scalar datatypes have no internal components.
Composite datatypes have internal components which can be manipulated.

Manual filing systems
Database Diagramming with Visual Studio 6.0 and SQL Server 7.0 covers the purpose of the database diagram, how to create a database diagram using the Visual Studio 6.0 database diagram tool and how to version your schema in Visual SourceSafe.
Building Data Warehousing Applications w/SQL Server2000 by Cesar Larrea from Microsoft

Architectures Compared

Microsoft's SQL Server Analysis Services uses Decision Support Objects (DSO) to communicate with a PivotTable Service on the client. Its repository metadata is stored by default in the msmdrep.mdb Access database.

On clients, VB applications communicate with the Pivot Table service via OLE DB or ADO MD (Microsoft's ActiveX Data Objects Multi-Dimensional). This is the way web servers access the database.

Microsoft looks to Excel's Pivot Table. However, Excel worksheets (before Excel 2007) are limited to 65,536 rows.

Inflows

Data from OLTP and legacy systems provide Inflow into staging servers of a data warehouse.

For example, in a bank, data is gathered from loan processing, pass book processing, and accounting systems. In a retail store, data is gathered from point-of-sale devices, cash registers, and entry/exit monitors.

The first step is typically data cleansing.

MDDs (Multi-Dimensional Databases) address the use of data generated by on-line transaction processing (OLTP) systems.

Neil Raden's Modeling the Data Warehouse article excerpted in the January 29, 1996 issue of Information Week, identified these differences:

Relational Model          Multi-Dimensional Model
Transaction View          Slice of Time View
Local Consistency         Global Consistency
Audit Trail               Big Picture
Explicit Relationships    Implied Relationships

Collected data typically go through cleaning and transformation before being mapped and loaded into the warehouse. Examples of data cleaning include removing inconsistencies, adding missing fields, and cross-checking for data integrity. Examples of data transformation include adding date/time stamp fields, summarizing detailed data, and deriving new fields to store calculated data.
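The cleaning and transformation steps listed above can be sketched as two small functions. The field names, the cleaning rules, and the 1000 threshold for the derived field are all hypothetical; a real warehouse would drive these from its own mapping rules.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def clean(record: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Remove inconsistencies and fill in a missing field."""
    state = (record.get("state") or "").strip().upper()
    return {
        "customer": (record.get("customer") or "").strip().title(),
        "state": state or "UNKNOWN",                           # add missing field
        "amount": (record.get("amount") or "0").replace(",", ""),
    }


def transform(record: Dict[str, str], loaded_at: datetime) -> dict:
    """Add a date/time stamp and derive a calculated field."""
    amount = float(record["amount"])
    return {
        **record,
        "amount": amount,
        "amount_band": "large" if amount >= 1000 else "small",  # derived field
        "loaded_at": loaded_at.isoformat(),                     # time stamp
    }


# Made-up source rows with typical inconsistencies.
raw = [
    {"customer": "  acme corp ", "state": "ca", "amount": "1,250.50"},
    {"customer": "Bolt Ltd", "state": None, "amount": "75"},
]
loaded_at = datetime(2006, 1, 31, tzinfo=timezone.utc)
warehouse_rows = [transform(clean(r), loaded_at) for r in raw]
total = sum(r["amount"] for r in warehouse_rows)   # summarizing detailed data

for row in warehouse_rows:
    print(row)
print(total)
```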

Information about the data in the warehouse and its data structures (such as location, description, and other details) is referred to as metadata. Like a library card catalog, metadata helps the reader determine whether an item of information exists in the library and, if it does, provides a description of it and points to its location. The technical description for each item of information includes file name, file type, location, data source, rules used in cleaning or mapping, and date of creation or last access. The business catalog enables end users to interpret the contents of a data warehouse by:

listing pre-defined queries and reports,
presenting non-technical descriptions of data sources and formulas for defining computed fields used in ad hoc querying, and
providing a tool (GUI) for exploring information (an Information Explorer).

Upflows

Data warehouses provide an Upflow of summaries rolled-up from the detailed data. Averages and summaries are pre-calculated and stored as a separate unit to answer common queries raised in decision-making.

Companies offering products to perform ETL (Extraction, Transformation, and Loading) include:

Ascential, acquired by IBM
Oracle Warehouse Builder (OWB), which takes advantage of Oracle database set-based and row-based operations, PL/SQL bulk processing and table functions, foreign key constraint manipulation, use of inline views to speed loading, partition exchange loading, external table support, multi-table insert, merge, direct path insert, and parallel operations.
Sunopsis (acquired by Oracle in late 2006) favors the "E-LT" approach.
Informatica
Cognos

These tools are faster and less error prone than manual scripting because they generate SQL code based on metadata about changing sources or targets. NOTE: The OWB repository can be exported as a metadata loader (.mdl) file type.

ETL tools also provide a visual record of the process that can be adjusted when sources or targets change.

A variant of ETL is ELT (Extract, Load, and then Transform).

Outflows

The greatest value (payback) from a data warehouse is in the way it provides quick and flexible Outflows of transformed data that lead its users to actionable insights useful for forecasting and planning decisions. This is why such systems are called "Business Intelligence".

Zyga consultants say:

Data Warehousing with OLAP tools also enables companies to manage by exception. Managers today are deluged with status reports on company operations. Often this information either comes too late or does not require managers to take any corrective action. By using threshold analysis and intelligent agents that trigger exception alarms, a Data Warehouse provides managers timely access to only the critical information they require in order to take action.

The heart of a data warehouse is its planning and analysis applications called On-Line Analytical Processing (OLAP). The term "MOLAP" for Multi-dimensional OLAP is also used because, unlike OLTP entity-relationship models consisting of two-dimensional tables, data warehouses use a multi-dimensional model for storing data.

Information in a Data Warehouse is organized into various dimensions. For example,

a sales analysis database is organized by product, time, territory, and other dimensions.
an invoice database could use time, customer, product, and supplier dimensions.

A flat dimension has dimension members that are equivalent (such as a Category of Actual, Budget, Forecast, What-if).
A hierarchical dimension has dimension members that are parts of a whole, such as PreTaxIncome and Taxes totaling to NetIncome.

The most common bases for summarization are these dimensions:

Measures
allow changes to the view of data, such as Periodic, Week To Date (WTD), Month To Date (MTD), Quarter To Date (QTD), Year To Date (YTD). Daily transactions are aggregated to provide consolidated weekly or monthly comparisons viewed using a Calendar interface.
Dimensions
such as Service or Product and their properties (such as COLOR and SIZE) can be totaled into a hierarchy of BRAND, MANUFACTURER, CATEGORY, or other aggregate.
Currency
(InputCurrency and RptCurrency).
Location (Geography)
each retail store's data is summarized into CITY, REGION, STATE, or COUNTRY levels of granularity.

To group applications such as Finance, Sales, etc. which can share common dimensions, SAP's Business Planning and Consolidation (BPC) application uses the concept of Application Sets, equivalent to a single MS Analysis Services database.

Each dimension, such as time, can be structured in a hierarchy of consolidation levels -- years, quarters, months, weeks, individual days, or other level of data granularity. But "day of the week" is an extended attribute.
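A time hierarchy like this can be sketched by mapping each day to its label at every consolidation level and then totaling by the chosen level. The function names, label formats, and sample figures are hypothetical, and the week level is left out to keep the sketch short.

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def time_levels(d: date) -> dict:
    """Map one calendar day to its label at each consolidation level."""
    quarter = (d.month - 1) // 3 + 1
    return {
        "year": str(d.year),
        "quarter": f"{d.year}-Q{quarter}",
        "month": f"{d.year}-{d.month:02d}",
        "day": d.isoformat(),
        # Extended attribute: describes the day but is not a level above it.
        "day_of_week": DAY_NAMES[d.weekday()],
    }


def roll_up(daily_sales: dict, level: str) -> dict:
    """Total daily values at the requested level of the hierarchy."""
    totals = defaultdict(float)
    for d, amount in daily_sales.items():
        totals[time_levels(d)[level]] += amount
    return dict(totals)


# Made-up daily figures.
daily_sales = {
    date(2006, 1, 30): 100.0,
    date(2006, 1, 31): 50.0,
    date(2006, 2, 1): 25.0,
    date(2006, 4, 3): 10.0,
}

print(roll_up(daily_sales, "month"))
print(roll_up(daily_sales, "quarter"))
print(roll_up(daily_sales, "year"))
```

Months nest inside quarters and quarters inside years, so totals at one level can be derived from the level below. Day of the week cuts across that nesting, which is why it is treated as an attribute rather than a level.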

The lower (finer, more detailed) the level of granularity available for analysis, the more costly it is to store and process the data.

Other dimensions depend on business needs:

Customer Entity
who purchased, which can be totaled into a hierarchy of INDIVIDUALBUYER, MARKET CHANNEL, LIFESTAGE, or other aggregate.
Diagnosis / Need:
(e.g., medical ICD codes).
Entity Organization
providing the service/product, which can be totaled into a hierarchy of DEPARTMENT, STORE, DISTRICT, DIVISION, CORPORATION, or other grouping.
Supplier
Account
(Income Statement, Balance Sheet, Cash Flow, KPI)
Product Inventory
Product Returns
Event
DataSrc
(for Data Source).
IntCo
Inter-Company Eliminations

At the center of the data model, measures (numeric attributes such as sales dollars, Invoice Amount, etc.) are stored in a fact table. To make information accessible multi-dimensionally, fact tables also contain several foreign keys used to join facts to several dimension tables.

Dimension tables organize and index the data stored in a fact table. A visual representation of this connection between fact tables and dimension tables appears as a star.
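A minimal star can be built and queried with Python's sqlite3 module. The table names, columns, and figures below are hypothetical; the point is the shape: one fact table holding the measure, with a foreign key to each surrounding dimension table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, brand  TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, month  TEXT);

    -- The fact table holds the measure plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        product_id    INTEGER REFERENCES dim_product(product_id),
        store_id      INTEGER REFERENCES dim_store(store_id),
        time_id       INTEGER REFERENCES dim_time(time_id),
        sales_dollars REAL,
        PRIMARY KEY (product_id, store_id, time_id)
    );

    INSERT INTO dim_product VALUES (1, 'Alpha'), (2, 'Beta');
    INSERT INTO dim_store   VALUES (10, 'East'), (20, 'West');
    INSERT INTO dim_time    VALUES (200601, '2006-01'), (200602, '2006-02');
    INSERT INTO fact_sales VALUES
        (1, 10, 200601, 100.0), (1, 20, 200601, 40.0),
        (2, 10, 200601, 60.0),  (1, 10, 200602, 30.0);
""")

# Slice the facts by two of the three dimensions.
rows = conn.execute("""
    SELECT s.region, t.month, SUM(f.sales_dollars)
    FROM fact_sales f
    JOIN dim_store s ON s.store_id = f.store_id
    JOIN dim_time  t ON t.time_id  = f.time_id
    GROUP BY s.region, t.month
    ORDER BY s.region, t.month
""").fetchall()
print(rows)
```

Any combination of dimensions can be used in the GROUP BY without changing the schema, which is what makes the star layout convenient for ad hoc analysis.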

Larry Greenfield's Data Warehousing Information Center is the ultimate portal on data warehousing, decision support, and data mining. Included is a "rant and rave" on the definition of data warehousing.

Neil Raden offers Star Schema 101 and Data, Data Everywhere, a white paper excerpted in the October 30, 1995 issue of Information Week on selecting OLAP technology.

Ralph Kimball, Data Warehouse Consultant.

Chuo-Han's Data Warehousing resource site

The Data Warehousing Institute.

Data Warehousing.com.

Data Warehousing.org.

Data-Warehouse.com.

Microsoft's Technet

Directory of Data Warehouse, Data Mining, and Decision Support Resources

Technology Guides for Data Warehousing Professionals

Seth Grimes' OLAP Pages has links to resources specifically related to On-Line Analytical Processing (OLAP).

TeraCLIN provides the application of star schemas to health care.

High-end databases:

Mercury from Business Objects

PowerPlay from Cognos

PaBLO from Andyne

SAS

Microsoft

High-end database engines:

Express from Oracle,

Acumate ES from Kenan,

Gentium from Planning Sciences

Holos from Holistic Systems

MDD engines:

Brio

Fusion from Information Builders

Essbase "hypercubes"

Seagate Software

LightShip Server from D&B/Pilot

TM/1 from Sinper

A surrogate key is used to maintain a hierarchy. For example, the LOCATION table's Store_id consists of 3 hierarchical levels: Area, Region, and Store codes.

Population values are summarized at each level in the hierarchy.

The fact table's primary key is a concatenated key which consists of the foreign keys from every dimension table.

This star model is the end-user's view of data.

The star model is not "normalized". Normalization would turn the data model into a snowflake, where dimension tables are joined to other dimension tables.

Most warehouses end up with a mixed model to balance speed and complexity.

For faster queries, mini-dimensions contain a subset of a larger dimension: just current data, a filtered set of rows, or a subset of attributes.

Updates to Slowly Changing Data

Unlike operational OLTP (On-Line Transaction Processing) systems, which may hold only sixty to ninety days of data, a typical data warehouse stores data from the last several years.

How do materialized views keep up with changes to data such as people changing their names?
How can like-for-like comparisons over periods of time be performed when underlying business definitions for products, geographies, and other attributes change?

Ralph Kimball and others came up with a classification of 3 types of slowly changing dimensions implemented in OWB version 10.2 and MS-SQL 2005:

Type 1: A new record replaces the original record. No trace of the old record exists. Only the most recent dimensional values are stored. Previous (historical) values are no longer available.

Type 2: A new record is added to the dimension table. To differentiate the versions, a surrogate key is used together with the primary key and effectivity dates. This enables storage and tracking of every change in value, and is used when dimensional values have changed but the records keep the same primary key. This complicates the ETL process.

Type 3: The original record is modified to reflect the change. Extra fields are added to the record to contain the previous (historical) dimensional value and the effectivity of that data. This cannot keep all history, only the most recent change.
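The three approaches can be compared side by side on one change: a customer's name changing. This is a minimal sketch over in-memory rows, in the order the approaches are described above; the column names (surrogate_key, valid_from, valid_to, previous_name, name_changed_on) are assumptions made for illustration.

```python
from copy import deepcopy
from datetime import date


def apply_type1(rows, key, new_name):
    """Overwrite in place; the old value is lost."""
    for row in rows:
        if row["customer_id"] == key:
            row["name"] = new_name


def apply_type2(rows, key, new_name, effective):
    """Close the current row and add a new version with a new surrogate key."""
    for row in rows:
        if row["customer_id"] == key and row["valid_to"] is None:
            row["valid_to"] = effective
    rows.append({
        "surrogate_key": max(r["surrogate_key"] for r in rows) + 1,
        "customer_id": key,
        "name": new_name,
        "valid_from": effective,
        "valid_to": None,
    })


def apply_type3(rows, key, new_name, effective):
    """Keep only the previous value, in extra columns on the same row."""
    for row in rows:
        if row["customer_id"] == key:
            row["previous_name"] = row["name"]
            row["name_changed_on"] = effective
            row["name"] = new_name


original = [{"surrogate_key": 1, "customer_id": "C7", "name": "Ann Smith",
             "valid_from": date(2000, 1, 1), "valid_to": None}]
changed_on = date(2006, 6, 1)

t1 = deepcopy(original)
apply_type1(t1, "C7", "Ann Jones")
t2 = deepcopy(original)
apply_type2(t2, "C7", "Ann Jones", changed_on)
t3 = deepcopy(original)
apply_type3(t3, "C7", "Ann Jones", changed_on)

print(len(t1), len(t2), len(t3))
```

Only the second approach grows the dimension table, which is the price of keeping every version available for historical comparisons.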

Downflows of Obsolete Data

Data warehouses archive their Downflow of obsolete or infrequently used data.

Partitioning

When warehouses are partitioned by month, the least current month is dropped each month so that the same number of months is always available. The most current table is the only one actively updated; the other tables are read-only. Oracle 10g can handle up to 64,000 partitions.
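The rolling window of monthly partitions can be modeled in a few lines. The class below is a hypothetical in-memory illustration of the policy, not database code: adding a month drops the oldest one once the limit is exceeded, and only the newest partition accepts rows.

```python
from collections import OrderedDict


class RollingMonthlyWindow:
    """Keeps a fixed number of monthly partitions, oldest first."""

    def __init__(self, months_to_keep: int):
        self.months_to_keep = months_to_keep
        self.partitions = OrderedDict()   # month label -> list of rows

    def add_month(self, month: str) -> list:
        """Open a new current partition; return the labels that were dropped."""
        self.partitions[month] = []
        dropped = []
        while len(self.partitions) > self.months_to_keep:
            oldest, _rows = self.partitions.popitem(last=False)
            dropped.append(oldest)
        return dropped

    def insert(self, month: str, row: dict) -> None:
        """Only the most current partition is updated; the rest are read-only."""
        current = next(reversed(self.partitions))
        if month != current:
            raise ValueError(f"only the current partition {current} accepts rows")
        self.partitions[month].append(row)


window = RollingMonthlyWindow(months_to_keep=3)
for label in ("2006-01", "2006-02", "2006-03"):
    window.add_month(label)

dropped = window.add_month("2006-04")          # pushes out the oldest month
window.insert("2006-04", {"sales": 10.0})

try:
    window.insert("2006-03", {"sales": 5.0})   # older partition is read-only
    old_month_rejected = False
except ValueError:
    old_month_rejected = True

print(dropped, list(window.partitions), old_month_rejected)
```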

Companies can end up with a mixed model to balance speed and complexity.
