Here are my notes on managing data within a large enterprise.
A "Black box" approach to testing would start from the end result of all the processing: summary reports created from multi-dimensional cubes, which are equivalent to tables in a relational database. The decision support system's output would be compared against detail data displayed from the cubes by a tool such as the Excel Add-in for Hyperion Essbase Analytic Services.
The add-in may connect to an Essbase Spreadsheet Web Service (a Java application), part of Hyperion's Essbase Spreadsheet Services product (since Release 7.1.2).
Because existing software to compare multi-dimensional cubes is not reliable, verification of multi-dimensional cubes requires generating a format common to both the data mart and the cubes so that the data can be compared. Such an approach is problematic because two sets of complex extract code would need to be created (an extract from a relational database and an export from a multi-dimensional database). Oracle 10g Data Pump technology can be used to simplify the relational extract.
Oracle's Warehouse Builder has a Data Quality Option used to detect, correct, and remove bad data.
Data warehouses are built with materialized views.
Unlike an ordinary view, which takes up no storage space because it is generated on the fly, materialized views provide indirect access to table data by storing the results of an aggregation query in a separate schema object. A materialized view definition can include any number of aggregations (SUM, COUNT(x), COUNT(*), COUNT(DISTINCT x), AVG, VARIANCE, STDDEV, MIN, and MAX).
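The idea of storing aggregation results instead of recomputing them can be sketched in plain Python. This is an illustration only, not Oracle's implementation; the table layout and names are invented.

```python
# Hypothetical sketch: what a materialized view stores, assuming a detail
# table of (region, amount) rows summarized per region.
from collections import defaultdict

detail = [
    ("EAST", 100.0), ("EAST", 50.0),
    ("WEST", 75.0), ("WEST", 25.0),
]

def build_summary(rows):
    """Precompute SUM, COUNT, and AVG per region, as a materialized
    view would, so queries read the small summary instead of the detail."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for region, amount in rows:
        sums[region] += amount
        counts[region] += 1
    return {r: {"sum": sums[r], "count": counts[r], "avg": sums[r] / counts[r]}
            for r in sums}

summary = build_summary(detail)
print(summary["EAST"])  # {'sum': 150.0, 'count': 2, 'avg': 75.0}
```

A query against `summary` touches one row per region instead of rescanning every detail row, which is the whole point of a materialized summary.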
Materialized summary tables use aggregate keys to define a hierarchy of aggregation. Several dimensions can be contained in each aggregate key.
Do retrievals at different levels of aggregation exhibit different performance?
Materialized views can be stored as summary tables in the same database as their base tables. This can improve query performance within OLTP systems (when Oracle QUERY REWRITE is enabled and the COMPATIBLE init.ora parameter is set).
Most databases do not have enough CPU capacity to handle both OLTP and the heavy demands of OLAP processing.
So to ensure good performance, data marts generally make materialized views available in a separate schema on a different machine than the OLTP system. The two databases are synchronized on a nightly (rather than instantaneous) basis.
Materialized views stored in the same Oracle database as their master tables use a materialized view log schema object to record changes so that the view can be refreshed incrementally.
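The incremental-refresh idea can be illustrated with a short sketch: a change log is applied to an existing summary without rescanning the base table. All names and the log format are invented for illustration; Oracle's actual materialized view log works differently internally.

```python
# Hypothetical sketch of incremental ("fast") refresh: changes recorded
# in a log are applied to an existing per-region summary.

def apply_log(summary, mv_log):
    """Apply logged (region, delta) changes to a running total,
    then purge the log, as happens after a successful refresh."""
    for region, delta in mv_log:
        summary[region] = summary.get(region, 0.0) + delta
    mv_log.clear()
    return summary

totals = {"EAST": 150.0, "WEST": 100.0}
log = [("EAST", 25.0), ("NORTH", 10.0)]
apply_log(totals, log)
print(totals)  # {'EAST': 175.0, 'WEST': 100.0, 'NORTH': 10.0}
```

Only the two logged changes are processed, regardless of how large the base table has grown.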
The TPC-H benchmark software includes a random data generation program that can be handy even if you're not running the benchmark.
Microsoft's SQL Server Analysis Services uses Decision Support Objects (DSO) to communicate with the PivotTable Service on the client. Its repository metadata is stored by default in the msmdrep.mdb Access database.
On clients, VB applications communicate with the PivotTable Service via OLE DB or ADO MD (Microsoft's ActiveX Data Objects Multi-Dimensional). This is also the way web servers access the database.
Microsoft looks to Excel's PivotTable as the client interface. However, Excel (prior to the 2007 release) is limited to 65,536 rows per worksheet.
Data from OLTP and legacy systems provides the Inflow into the staging servers of a data warehouse.
For example, in a bank, data is gathered from loan processing, pass book processing, and accounting systems. In a retail store, data is gathered from point-of-sale devices, cash registers, and entry/exit monitors.
The first step is typically data cleansing.
MDDs (Multi-Dimensional Databases) address the use of data generated by on-line transaction processing (OLTP) systems.
Collected data typically go through cleaning and transformation before being mapped and loaded into the warehouse. Examples of data cleaning include removing inconsistencies, adding missing fields, and cross-checking for data integrity. Examples of data transformation include adding date/time stamp fields, summarizing detailed data, and deriving new fields to store calculated data.
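The cleaning and transformation steps listed above can be sketched in a few lines. This is an invented example (field names and rules are assumptions), showing one record getting a missing field filled, a date/time stamp added, and a derived field calculated.

```python
# Hypothetical sketch of cleaning/transformation before loading:
# fill a missing field, add a load timestamp, derive a calculated field.
from datetime import datetime, timezone

def clean_and_transform(record):
    row = dict(record)
    row.setdefault("branch", "UNKNOWN")                        # add missing field
    row["loaded_at"] = datetime.now(timezone.utc).isoformat()  # date/time stamp
    row["balance"] = row["credits"] - row["debits"]            # derived field
    return row

raw = {"account": "A-100", "credits": 500.0, "debits": 120.0}
print(clean_and_transform(raw)["balance"])  # 380.0
```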
Information about data in the warehouse (such as its location, description, and structure) is referred to as Metadata. Like a library card catalog, metadata helps the reader determine whether an item of information exists and, if it does, provides a description of it and points to its location. The technical description for each item of information includes file name, file type, location, data source, rules used in cleaning or mapping, and date of creation or last access. The business catalog enables end users to interpret the contents of a data warehouse.
Data warehouses provide an Upflow of summaries rolled up from the detailed data. Averages and summaries are pre-calculated and stored as a separate unit to answer common queries raised in decision-making.
Companies offering products to perform ETL (Extraction, Transformation, and Loading) include:
These tools are faster and less error prone than manual scripting because they generate SQL code based on metadata about changing sources or targets. NOTE: The OWB repository can be exported as a metadata loader (.mdl) file type.
ETL tools also provide a visual record of the process that can be adjusted when sources or targets change.
A variant of ETL is ELT (Extract, Load, and then Transform).
The greatest value (payback) from a data warehouse is in the way it provides quick and flexible Outflows of transformed data that lead its users to actionable insights useful for forecasting and planning decisions. This is why such systems are called "Business Intelligence".
At the heart of a data warehouse are its planning and analysis applications, called On-Line Analytical Processing (OLAP). The term "MOLAP" for Multi-dimensional OLAP is also used because, unlike OLTP entity-relationship models consisting of two-dimensional tables, data warehouses use a multi-dimensional model for storing data.
Information in a Data Warehouse is organized into various dimensions. For example,
The most common bases for summarization are these dimensions:
To group applications such as Finance, Sales, etc. which can share common dimensions, SAP's Business Planning and Consolidation (BPC) application uses the concept of Application Sets, equivalent to a single MS Analysis Services database.
Each dimension, such as time, can be structured in a hierarchy of consolidation levels -- years, quarters, months, weeks, individual days, or other levels of data granularity. But "day of the week" is an extended attribute.
The lower (finer, more detailed) the level of granularity available for analysis, the more costly it is to store and process the data.
Other dimensions depend on business needs:
At the center of the data model, measures (numeric attributes such as sales dollars, Invoice Amount, etc.) are stored in a fact table. To make information accessible multi-dimensionally, fact tables also contain several foreign keys used to join facts to several dimension tables.
Dimension tables organize and index the data stored in a fact table. A visual representation of this connection between fact tables and dimension tables appears as a star.
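The fact-to-dimension join can be illustrated with a tiny star. All table names, keys, and values here are invented; the point is only that the fact row carries foreign keys and a measure, and the dimensions supply the descriptive attributes.

```python
# Hypothetical star-schema sketch: one fact table joined to two
# dimension tables via its foreign keys.

time_dim = {1: {"year": 2005, "quarter": "Q1"}}
store_dim = {10: {"region": "EAST"}}

fact_sales = [
    {"time_id": 1, "store_id": 10, "sales_dollars": 99.0},
]

def star_query(facts):
    """Join each fact row to its dimensions via foreign keys."""
    return [
        {**time_dim[f["time_id"]], **store_dim[f["store_id"]],
         "sales_dollars": f["sales_dollars"]}
        for f in facts
    ]

print(star_query(fact_sales)[0]["region"])  # EAST
```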
Larry Greenfield's Data Warehousing Information Center is the ultimate portal on data warehousing, decision support, and data mining. Included is a "rant and rave" on the definition of data warehousing.
Ralph Kimball, Data Warehouse Consultant.
Chuo-Han's Data Warehousing resource site
Seth Grimes' OLAP Pages has links to resources specifically related to On-Line Analytical Processing (OLAP).
TeraCLIN provides the application of star schemas to health care.
A surrogate key is used to maintain a hierarchy. For example, the LOCATION table's Store_id consists of 3 hierarchical levels: Area, Region, and Store codes.
Population values are summarized at each level in the hierarchy.
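The rollup along such a hierarchical key can be sketched as prefix summarization. The fixed-width encoding below (2 characters each for Area, Region, and Store) is an invented assumption for illustration.

```python
# Hypothetical sketch of a Store_id hierarchy: Area (2 chars) +
# Region (2 chars) + Store (2 chars), summarized by key prefix.
from collections import defaultdict

sales = {"010207": 100.0, "010208": 50.0, "010399": 25.0}
# "010207" and "010208" share area "01", region "02"; "010399" is region "03"

def rollup(by_store, width):
    """Summarize store-level values at a higher level by key prefix."""
    out = defaultdict(float)
    for store_id, amount in by_store.items():
        out[store_id[:width]] += amount
    return dict(out)

print(rollup(sales, 4))  # by area+region: {'0102': 150.0, '0103': 25.0}
print(rollup(sales, 2))  # by area:        {'01': 175.0}
```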
The fact table's primary key is a concatenated key consisting of the foreign keys from every dimension table.
This star model is the end-user's view of data.
The star model is not "normalized". Normalization would turn the data model into a snowflake, where dimension tables are joined to other dimension tables.
Most warehouses end up with a mixed model to balance speed and complexity.
For faster queries, mini-dimensions contain a subset of a larger dimension: just current data, a filtered set of rows, or a subset of attributes.
Unlike operational OLTP (On-Line Transaction Processing) systems, which may hold only sixty to ninety days of data, a typical data warehouse stores data from the last several years.
How do materialized views keep up with changes to data such as people changing their names?
Ralph Kimball and others came up with a classification of 3 types of slowly changing dimensions (Type 1: overwrite the old value; Type 2: add a new row with effective dates; Type 3: add a new column holding the previous value), implemented in OWB version 10.2 and MS-SQL 2005.
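The difference between Type 1 and Type 2 handling can be sketched for a customer name change. The table layout, keys, and dates below are invented for illustration.

```python
# Hypothetical sketch of slowly changing dimensions:
# Type 1 overwrites in place (history lost); Type 2 expires the
# current row and appends a new versioned row (history kept).

def scd_type1(dim, key, new_name):
    """Type 1: overwrite in place; no history survives."""
    dim[key]["name"] = new_name

def scd_type2(rows, key, new_name, effective):
    """Type 2: close out the current row, append a new current row."""
    for row in rows:
        if row["key"] == key and row["end"] is None:
            row["end"] = effective
    rows.append({"key": key, "name": new_name,
                 "start": effective, "end": None})

dim = {7: {"name": "A. Smith"}}
scd_type1(dim, 7, "A. Jones")  # old name is gone

customers = [{"key": 7, "name": "A. Smith",
              "start": "2004-01-01", "end": None}]
scd_type2(customers, 7, "A. Jones", "2005-06-01")
print(len(customers))  # 2 rows: the expired one and the current one
```

Type 2 is what lets old facts keep joining to the dimension values that were true when the facts were recorded.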
Partitioning: When warehouses are partitioned by month, the least current month is dropped each month so that the same number of months remains available, and the most current partition is the only one actively updated, with the other partitions read-only. Oracle 10g can handle up to 64,000 partitions.
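The rolling monthly window described above can be sketched as a fixed-length queue of partitions. Partition names and the 12-month window size are invented for illustration.

```python
# Hypothetical sketch of a rolling partition window: keep a fixed
# number of monthly partitions, dropping the oldest as a new one loads.
from collections import deque

WINDOW = 12  # months retained

partitions = deque(f"2004-{m:02d}" for m in range(1, 13))  # 12 months loaded

def add_month(parts, month):
    """Add the newest partition; drop and return the least current one."""
    dropped = None
    if len(parts) == WINDOW:
        dropped = parts.popleft()
    parts.append(month)
    return dropped

print(add_month(partitions, "2005-01"))  # 2004-01
print(len(partitions))                   # 12
```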