Search: 
 
 
 

Mondrian Home
Download
Documentation
What is OLAP
Installation (Spanish)
MDX
Architecture
Writing a Schema
Configuration
Performance
Aggregate Tables
Cache Control
Workbench
Command Runner
FAQ
Roadmap
Components
API
Developer's Guide
Developer Notes
Help
Forum
Julian Hyde's Blog
Links
SourceForge
Mailing List
JPivot
ResGen


Aggregate Tables

Contents

  1. Introduction
  2. What are aggregate tables?
    1. A simple aggregate table
    2. Another aggregate table
  3. Defining aggregate tables
  4. Building aggregate tables
  5. How Mondrian recognizes aggregate tables
    1. Rules
    2. Explicit aggregates
  6. Aggregate tables and parent-child hierarchies
    1. Aggregate tables at the leaf level of a parent-child hierarchy
    2. Combined closure and aggregate tables
  7. How Mondrian uses aggregate tables
    1. Choosing between aggregate tables
    2. Distinct count
  8. Tools for designing and maintaining aggregate tables
    1. AggGen (aggregate generator)
    2. Aggregate table populater
    3. Script generator
    4. Recommender
    5. Online/offline control
  9. Properties that affect aggregates
  10. References

1. Introduction 

Unlike many OLAP servers, Mondrian does not store data on disk: it just works on the data in the RDBMS, and once it has read a piece of data once, it stores that data in its cache. This greatly simplifies the process of installing Mondrian, but it puts limits on Mondrian's performance when Mondrian is applied to a huge dataset.

Consider what happens when the CEO runs her Sales Report first thing on a Monday morning. This report contains a single number: the total sales of all products, in all regions, this year. In order to get this number, Mondrian generates a query something like this:

SELECT sum(store_sales)
FROM sales_fact,
    time
WHERE sales_fact.time_id = time.time_id
AND time.year = 2005

and sends it to the DBMS. The DBMS takes several minutes to execute it: which is understandable because the DBMS has to read all of this year's records in the fact table (a few million sales, say) and aggregate them into a single total. Clearly, what is needed in this case, and in others like it, is a pre-computed summary of the data: an aggregate table.

An aggregate table coexists with the base fact table, and contains pre-aggregated measures build from the fact table. It is registered in Mondrian's schema, so that Mondrian can choose to use whether to use the aggregate table rather than the fact table, if it is applicable for a particular query.

Designing aggregate tables is a fine art. There is extensive research, both empirical and theoretical, available on the web concerning different ways to structure aggregate tables and we will not attempt to duplicate any of it here.

2. What are aggregate tables? 

To explain what aggregate tables are, let's consider a simple star schema.

The star schema has a single fact table Sales, two measure columns (units and dollars) and four dimension tables (Product, Mfr, Customer, Time, and Customer).

On top of this star schema, we create the following multidimensional model:

  • Cube [Sales] has two measures [Unit sales] and [Dollar sales]
  • Dimension [Product] has levels [All Products], [Manufacturer], [Brand], [Prodid]
  • Dimension [Time] has levels [All Time], [Year], [Quarter], [Month], [Day]
  • Dimension [Customer] has levels [All Customers], [State], [City], [Custid]
  • Dimension [Payment Method] has levels [All Payment Methods], [Payment Method]

Most of the dimensions have a corresponding dimension table, but there are two exceptions. The [Product] dimension is a snowflake dimension, which means that it is spread across more than one table (in this case Product and Mfr). The [Payment Method] dimension is a degenerate dimension; its sole attribute is the payment column in the fact table, and so it does not need a dimension table.

2.1 A simple aggregate table 

Now let's create an aggregate table, Agg_1:

See how the original star schema columns have been combined into the table:

  • The Time dimension has been "collapsed" into the aggregate table, omitting the month and day columns.
  • The two tables of the Product dimension has been "collapsed" into the aggregate table.
  • The Customer dimension has been "lost".
  • For each measure column in the fact table (units, dollars), there are one or more measure columns in the aggregate table (sum units, min units, max units, sum dollars).
  • There is also a measure column, row count, representing the "count" measure.

Agg_1 would be declared like this:

<Cube name="Sales">
  <Table name="sales">
    <AggName name="agg_1">
      <AggFactCount column="row count"/>
      <AggMeasure name="[Measures].[Unit Sales]" column="sum units"/>
      <AggMeasure name="[Measures].[Min Units]" column="min units"/>
      <AggMeasure name="[Measures].[Max Units]" column="max units"/>
      <AggMeasure name="[Measures].[Dollar Sales]" column="sum dollars"/>
      <AggLevel name="[Time].[Year]" column="year"/>
      <AggLevel name="[Time].[Quarter]" column="quarter"/>
      <AggLevel name="[Product].[Mfrid]" column="mfrid"/>
      <AggLevel name="[Product].[Brand]" column="brand"/>
      <AggLevel name="[Product].[Prodid]" column="prodid"/>
    </AggName>
  </Table>
 
  <!-- Rest of the cube definition -->
</Cube>

2.2 Another aggregate table 

Another aggregate table, Agg_2:

and the corresponding XML:

<Cube name="Sales">
  <Table name="sales">
    <AggName name="agg_1" ... />
    <AggName name="agg_2">
      <AggFactCount column="row count"/>
      <AggForeignKey factColumn="prodid" aggColumn="prodid"/>
      <AggMeasure name="[Measures].[Unit Sales]" column="sum units"/>
      <AggMeasure name="[Measures].[Min Units]" column="min units"/>
      <AggMeasure name="[Measures].[Max Units]" column="max units"/>
      <AggMeasure name="[Measures].[Dollar Sales]" column="sum dollars"/>
      <AggLevel name="[Time].[Year]" column="year"/>
      <AggLevel name="[Time].[Quarter]" column="quarter"/>
      <AggLevel name="[Time].[Month]" column="month"/>
      <AggLevel name="[Payment Method].[Payment Method]" column="payment"/>
      <AggLevel name="[Customer].[State]" column="state"/>
    </AggName>
  </Table>

  <Dimension name="Product">
    <Hierarchy hasAll="true" primaryKey="prodid" primaryKeyTable="Product">
      <Join leftKey="mfrid" rightKey="mfrid">
        <Table name="Product"/>
        <Table name="Mfr"/>
      </Join>
      <Level name="Manufacturer" table="Mfr" column="mfrid"/>
      <Level name="Brand" table="Product" column="brand"/>
      <Level name="Name" table="Product" column="prodid"/>
    </Hierarchy>
  </Dimension>

  <!-- Rest of the cube definition -->
</Cube>

Several dimensions have been collapsed: [Time] at the [Quarter] level; [Customer] at the [State] level; and [Payment Method] at the [Payment Method] level. But the [Product] dimension has been retained in its original snowflake form.

The <AggForeignKey> element is used to  declare that the column prodid links to the dimension table, but all other columns remain in the Product and Mfr dimension tables.

3. Defining aggregate tables 

A fact table can have zero or more aggregate tables. Every aggregate table is associated with just one fact table. It aggregates the fact table measures over one or more of the dimensions. As an example, if a particular column in the fact table represents the number of sales of some product on a given day by a given store, then an aggregate table might be created that sums the information so that applies at a month level rather than by day. Such an aggregate might reasonably be 1/30th the size of the fact table (assuming comparable sales for every day of a month). Now, if one were to execute a MDX query that needed sales information at a month (or quarter or year) level, running the query against the aggregate table is faster but yields the same answer as if it were run against the base fact table.

Further, one might create an aggregate that not only aggregates at the month level but also, rather than at the individual store level, aggregates at the state level. If there were, say, 20 stores per state, then this aggregate table would be 1/600th the size of the original fact table. MDX queries interested only at the month or above and state or above levels would use this table.

When a MDX query runs, what aggregate should be used? This comes down to what measures are needed and with which dimension levels. The base fact table always has the correct measures and dimension levels. But, it might also be true that there is one or more aggregate tables that also have the measures and levels. Of these, the aggregate table with the lowest cost to read, the smallest number of rows, should be the table used to fulfill the query.

Mondrian supports two aggregation techniques which are called "lost" dimension and "collapsed" dimension. For the creation of any given aggregate table these can be applied independently to any number of different dimensions.

A "lost" dimension is one which is completely missing from the aggregate table. The measures that appear in the table have been aggregated across all values of the lost dimension. As an example, in a fact table with dimensions of time, location, and product and measure sales, for an aggregate table that did not have the location dimension that dimension would be "lost". Here, the sales measure would be the aggregation over all locations. An aggregate table where all of the dimensions are lost is possible - it would have a single row with the measure aggregated over everything - sales for all time, all locations and all products.

fact table
    time_id
    product_id
    location_id
    measure
 
lost (time_id) dimension table
    product_id
    location_id
    measure (aggregated over time)
    fact_count
 
fully lost dimension table
    measure (aggregated over everything)
    fact_count

Note the "fact_count" column in the aggregate table. This additional column is a general feature of aggregate tables. It is a count of how many fact table columns were aggregated into the one aggregate table row. As an example, if for a particular choice of product_id and location_id, the time_id occurred 5 times in the fact table, then in the aggregate table the fact_count column would contain 5 for that product_id/location_id pair (a given product was sold at a given location at 5 different times).

The second supported aggregation technique provides a finer level of control, the "collapsed" dimension technique. Recall that the dimension key in the fact table refers (more or less) to the lowest level in the dimension hierarchy. For a collapsed dimension, the dimension key in the aggregate table is replaced with a set of dimension levels; the dimension key column is replaced with a set of columns; a fully denormalized summary table for that dimension. As an example, if the time dimension with base fact table foreign key time_id had the levels: day, month, quarter and year, and in an aggregate it was collapsed to the month level, then the aggregate table would not have a time_id column but rather columns for month, quarter and year. The SQL generated for a MDX query for which this aggregate table can be used, would no longer refer to the time dimension's table but rather all time related information would be gotten from the aggregate table.

time dimension table
    time_id
    day
    month
    quarter
    year
 
fact table
    time_id
    measure
 
collapsed dimension table
    month
    quarter
    year
    measure (aggregated to month level)
    fact_count

In the literature, there are other ways of creating aggregate tables but they are not supported by Mondrian at this time.

4. Building aggregate tables 

Aggregate tables must be built. Generally, they not real-time; they are rebuilt, for example, every night for use the following day by the analysts. Considering the lost and collapsed dimension technique for aggregate table definition, one can estimate that for a dimension with N levels, there are N+1 possible aggregate tables (N collapsed and 1 lost). Also, dimensions (with different dimension tables) can be aggregated independently. For the FoodMart Sales cube there are 1400 different possible aggregate tables.

Clearly, one does not want to create all possible aggregate tables. Which ones to create depends upon two considerations. The first consideration is application dependent: the nature of the MDX queries that will be executed. If many of the queries deal with per month and per state questions, then an aggregate at those levels might be created. The second consideration is application independent: per dimension aggregating from the lowest level to the next lowest generally gives greater bang for the buck than aggregating from the N to the N+1 (N>1) level. This is because 1) a first level aggregation can be used for all queries at that level and above and 2) dimension fan-out tends to increase for the lower levels. Of course, your mileage may vary.

In a sense, picking which aggregate tables to build is analogous to picking which indexes to build on a table; it is application dependent and experience helps.

The hardest part about the actually creation and population of aggregate tables is figuring out how to create the first couple; what the SQL looks like. After that they are pretty much all the same.

Four examples will be given. They all concern building aggregate tables for the sales_fact_1997 fact table. As a reminder, the sales_fact_1997 fact table looks like:

sales_fact_1997
    product_id
    time_id
    customer_id
    promotion_id
    store_id
    store_sales
    store_cost
    unit_sales

The first example is a lost time dimension aggregate table, the time_id foreign key is missing.

CREATE TABLE agg_l_05_sales_fact_1997 (
    product_id INTEGER NOT NULL,
    customer_id INTEGER NOT NULL,
    promotion_id INTEGER NOT NULL,
    store_id INTEGER NOT NULL,
    store_sales DECIMAL(10,4) NOT NULL,
    store_cost DECIMAL(10,4) NOT NULL,
    unit_sales DECIMAL(10,4) NOT NULL,
    fact_count INTEGER NOT NULL);
 
CREATE INDEX i_sls_97_cust_id ON agg_l_05_sales_fact_1997 (customer_id);
CREATE INDEX i_sls_97_prod_id ON agg_l_05_sales_fact_1997 (product_id);
CREATE INDEX i_sls_97_promo_id ON agg_l_05_sales_fact_1997 (promotion_id);
CREATE INDEX i_sls_97_store_id ON agg_l_05_sales_fact_1997 (store_id);
 
INSERT INTO agg_l_05_sales_fact_1997 (
    product_id,
    customer_id,
    promotion_id,
    store_id,
    store_sales,
    store_cost,
    unit_sales,
    fact_count)
SELECT
    product_id,
    customer_id,
    promotion_id,
    store_id,
    SUM(store_sales) AS store_sales,
    SUM(store_cost) AS store_cost,
    SUM(unit_sales) AS unit_sales,
    COUNT(*) AS fact_count
FROM
    sales_fact_1997
GROUP BY
    product_id,
    customer_id,
    promotion_id,
    store_id;

A couple of things to note here.

The above is in MySQL's dialect of SQL, and may not work for your database - but I hope the general idea is clear. The aggregate table "looks like" the base fact table except the time_id column is missing and there is a new fact_count column. The insert statement populates the aggregate table from the base fact table summing the measure columns and counting to populate the fact_count column. This done while grouping by the remaining foreign keys to the remaining dimension tables.

Next, some databases recognize star joins - Oracle for instance. For such database one should not create indexes, not on the fact table and not on the aggregate tables. On the other hand, databases that do not recognize star joins will require indexes on both the fact table and the aggregate tables.

For our purposes here, the exact name of the aggregate table is not important; the "agg_l_05_" preceding the base fact table's name sales_fact_1997. First, the aggregate table name must be different from the base fact table name. Next, the aggregate table name ought to be related to the base fact table name both for human eyeballing of what aggregate is associated with which fact table, but also, as described below, Mondrian employs mechanism to automagically recognize which tables are aggregates of others.

The following example is a collapsed dimension aggregate table where the time dimension has been rolled up to the month level.

CREATE TABLE agg_c_14_sales_fact_1997 (
    product_id INTEGER NOT NULL,
    customer_id INTEGER NOT NULL,
    promotion_id INTEGER NOT NULL,
    store_id INTEGER NOT NULL,
    month_of_year SMALLINT(6) NOT NULL,
    quarter VARCHAR(30) NOT NULL,
    the_year SMALLINT(6) NOT NULL,
    store_sales DECIMAL(10,4) NOT NULL,
    store_cost DECIMAL(10,4) NOT NULL,
    unit_sales DECIMAL(10,4) NOT NULL,
    fact_count INTEGER NOT NULL);
 
CREATE INDEX i_sls_97_cust_id ON agg_c_14_sales_fact_1997 (customer_id);
CREATE INDEX i_sls_97_prod_id ON agg_c_14_sales_fact_1997 (product_id);
CREATE INDEX i_sls_97_promo_id ON agg_c_14_sales_fact_1997 (promotion_id);
CREATE INDEX i_sls_97_store_id ON agg_c_14_sales_fact_1997 (store_id);
 
INSERT INTO agg_c_14_sales_fact_1997 (
    product_id,
    customer_id,
    promotion_id,
    store_id,
    month_of_year,
    quarter,
    the_year,
    store_sales,
    store_cost,
    unit_sales,
    fact_count)
SELECT
    BASE.product_id,
    BASE.customer_id,
    BASE.promotion_id,
    BASE.store_id,
    DIM.month_of_year,
    DIM.quarter,
    DIM.the_year,
    SUM(BASE.store_sales) AS store_sales,
    SUM(BASE.store_cost) AS store_cost,
    SUM(BASE.unit_sales) AS unit_sales,
    COUNT(*) AS fact_count
FROM
    sales_fact_1997 AS BASE, time_by_day AS DIM
WHERE
    BASE.time_id = DIM.time_id
GROUP BY
    BASE.product_id,
    BASE.customer_id,
    BASE.promotion_id,
    BASE.store_id,
    DIM.month_of_year,
    DIM.quarter,
    DIM.the_year;

In this case, one can see that the time_id foreign key in the base fact table has been replaced with the columns: month_of_year, quarter, and the_year in the aggregate table. There is, as always, the fact_count column. The measures are inserted as sums. And, the group by clause is over the remaining foreign keys as well as the imported time dimension levels.

When creating a collapsed dimension aggregate one might consider creating indexes for the columns imported from the dimension that was collapsed.

Below is another aggregate table. This one has two lost dimensions (store_id and promotion_id) as well as collapsed dimension on time to the quarter level. This shows how aggregate techniques can be mixed.

CREATE TABLE agg_lc_100_sales_fact_1997 (
    product_id INTEGER NOT NULL,
    customer_id INTEGER NOT NULL,
    quarter VARCHAR(30) NOT NULL,
    the_year SMALLINT(6) NOT NULL,
    store_sales DECIMAL(10,4) NOT NULL,
    store_cost DECIMAL(10,4) NOT NULL,
    unit_sales DECIMAL(10,4) NOT NULL,
    fact_count INTEGER NOT NULL);
 
CREATE INDEX i_sls_97_cust_id ON agg_lc_100_sales_fact_1997 (customer_id);
CREATE INDEX i_sls_97_prod_id ON agg_lc_100_sales_fact_1997 (product_id);
 
INSERT INTO agg_lc_100_sales_fact_1997 (
    product_id,
    customer_id,
    quarter,
    the_year,
    store_sales,
    store_cost,
    unit_sales,
    fact_count)
SELECT
    BASE.product_id,
    BASE.customer_id,
    DIM.quarter,
    DIM.the_year,
    SUM(BASE.store_sales) AS store_sales,
    SUM(BASE.store_cost) AS store_cost,
    SUM(BASE.unit_sales) AS unit_sales,
    COUNT(*) AS fact_count
FROM sales_fact_1997 AS BASE,
    time_by_day AS DIM
WHERE
    BASE.time_id = DIM.time_id
GROUP BY
    BASE.product_id,
    BASE.customer_id,
    DIM.quarter,
    DIM.the_year;

In the above three examples, for the most part the column names in the aggregate are the same column names that appear in the fact table and dimension tables. These tables would all be recognized by the Mondrian default aggregate recognizer. It is possible to create an aggregate table and name the columns arbitrarily. For such an aggregate, an explicit Mondrian recognizer must be specified.

CREATE TABLE agg_c_special_sales_fact_1997 (
    PRODUCT_ID INTEGER NOT NULL,
    CUSTOMER_ID INTEGER NOT NULL,
    PROMOTION_ID INTEGER NOT NULL,
    STORE_ID INTEGER NOT NULL,
    TIME_MONTH SMALLINT(6) NOT NULL,
    TIME_QUARTER VARCHAR(30) NOT NULL,
    TIME_YEAR SMALLINT(6) NOT NULL,
    STORE_SALES_SUM DECIMAL(10,4) NOT NULL,
    STORE_COST_SUM DECIMAL(10,4) NOT NULL,
    UNIT_SALES_SUM DECIMAL(10,4) NOT NULL,
    FACT_COUNT INTEGER NOT NULL);
 
CREATE INDEX i_sls_97_cust_id ON agg_c_special_sales_fact_1997 (CUSTOMER_ID);
CREATE INDEX i_sls_97_prod_id ON agg_c_special_sales_fact_1997 (PRODUCT_ID);
CREATE INDEX i_sls_97_promo_id ON agg_c_special_sales_fact_1997 (PROMOTION_ID);
CREATE INDEX i_sls_97_store_id ON agg_c_special_sales_fact_1997 (STORE_ID);
 
INSERT INTO agg_c_special_sales_fact_1997 (
    PRODUCT_ID,
    CUSTOMER_ID,
    PROMOTION_ID,
    STORE_ID,
    TIME_MONTH,
    TIME_QUARTER,
    TIME_YEAR,
    STORE_SALES_SUM,
    STORE_COST_SUM,
    UNIT_SALES_SUM,
    FACT_COUNT)
SELECT
    BASE.product_id,
    BASE.customer_id,
    BASE.promotion_id,
    BASE.store_id,
    DIM.month_of_year,
    DIM.quarter,
    DIM.the_year,
    SUM(BASE.store_sales) AS STORE_SALES_SUM,
    SUM(BASE.store_cost) AS STORE_COST_SUM,
    SUM(BASE.unit_sales) AS UNIT_SALES_SUM,
    COUNT(*) AS FACT_COUNT
FROM
    sales_fact_1997 BASE, time_by_day DIM
WHERE
    BASE.time_id = DIM.time_id
GROUP BY
    BASE.product_id,
    BASE.customer_id,
    BASE.promotion_id,
    BASE.store_id,
    DIM.month_of_year,
    DIM.quarter,
    DIM.the_year;

This aggregate table has column names that are not identical to those found in the base fact table and dimension table. It is still a valid aggregate but Mondrian has to be told how to map its columns into those of the base fact table.

Sometimes with multiple aggregate tables, one aggregate table is an aggregate of not only the base fact table but also another aggregate table; an aggregate table with lost time and product dimensions (no time_id and product_id foreign keys) is an aggregate of the base fact table and an aggregate which only has a lost time dimension (no time_id foreign key). In this case, one might first build the aggregate with only the lost time dimension and then build the aggregate with both lost time and product dimensions from that first aggregate - it will be faster (in some cases, much faster) to populate the second aggregate from the first rather than from the base fact table.

One last note, when creating aggregate tables from the base fact table pay attention to the size of the numeric columns - what might be big enough in the base fact table might not be big enough in an aggregate.

5. How Mondrian recognizes Aggregate Tables 

Mondrian has to know about the aggregate tables in order to use them. You can either define an aggregate explicitly, or set up rules to recognize several aggregate tables at the same time.

How Mondrian recognizes aggregate table names and columns pretty much dictates how one must name those table names and columns when creating them in the first place!

5.1 Rules 

Rules are templates, designed to work for all fact table names and their column names. These rules are templates of regular expressions that are instantiated with the names of a fact table and its columns. In order to describe the rule templates, a name that instantiate a rule are represented in a rule by have the name bracketed by "${" and "}". As an example, "abc_${name}_xyz" is a rule parameterized by "name". When name is "john" the template becomes "abc_john_xyz".

The regular expression engine used here and a definition of the allowed regular expression grammar is found in the Java regular expression Pattern class: java.util.regex.Pattern.

In order that a table be recognized as an aggregate table, Mondrian must be able to map from the fact table foreign key columns and measure columns and those in the aggregate table. In addition, Mondrian must identify the fact count column in the aggregate and possible level columns (which would appear in an aggregate table if it had a "collapsed" dimension). What follows is a description of the steps taken in the identification of aggregate tables by the default recognizer. If at any step, a match fails, the table is rejected as an aggregate table.

Starting off, the candidate aggregate table's name must comply with the aggregate table name rule. Represented as a template regular expression the rule is:

agg_.+_${fact_table_name}

which is parameterized with the fact table's name. (In addition, this rule is applied in "ignore case" mode.) This means that an aggregate table's name must start with "agg_" (ignoring character case), followed by at least one character, then the '_' character and, lastly, the name of the fact table. The ".+" in the template has special meaning in a regular expression - it matches one or more characters.

As an example of applying the aggregate table name rule, let the fact table be called sales_fact_1997, the Sales cube's fact table from the FoodMart schema. Applying the specific fact table name to the regular expression template creates the following regular expression:

agg_.+_sales_fact_1997

This will match the following table names:

  • agg_l_05_sales_fact_1997
  • agg_c_14_sales_fact_1997
  • agg_lc_100_sales_fact_1997
  • agg_c_special_sales_fact_1997
  • AGG_45_SALES_FACT_1997
  • AGG_drop_time_id_sales_fact_1997

The aggregate table name recognition mechanism has one additional programatic feature, one can specify that only a portion of the base fact table name be used as the basis of template name. For instance, if the DBA demanded that all fact tables begin with the string "fact_", e.g., "fact_sales_fact_1997", one would certainly not want that string to have to be part of each aggregate table's name. The aggregate table name recognition mechanism allows one to specify a regular expression with one and only one group clause (a group clause is a pattern bracketed by '(' and ')'). Whatever is matched by the contents of the group clause is taken to be the part of the fact table name to be used in the matching template. This regular expression containing the group clause is specified as the "basename" attribute. The default Mondrian aggregate table recognizer does not use this feature. For more information see the associated developer's note link.

After the default recognizer determines that a table's name matches the aggregate table template regular expression for a given fact table, it then attempts to match columns. The first column tested for is the "fact count" column. Here the candidate aggregate table must have a column called "fact_count" (ignoring case) and this column's type must be numeric. The following examples would match as "fact count" columns.

fact_count
FACT_COUNT
fact_COUNT

Following matching the "fact count" column, the candidate aggregate table's columns are examined for possible foreign key matches. For each of the foreign key column names in the fact table it is determined if there are any character case independent matches of the aggregate table's columns. Those columns that match are noted. It is alright if no columns match; the aggregate might be a "collapsed" dimension aggregate with no fact table foreign keys remaining. If the fact table had foreign key columns "store_id" and "time_id", then the following aggregate table columns (for example) would match:

  • time_id
  • store_id
  • TIME_ID
  • STORE_ID
  • time_ID
  • STORE_id

At this point, matches are looked for the level and measure columns. Both of these matching rules are multi-part - has sub rules; each rule has more than one possible regular expression that might match where a match on any one is a match.

There are three sub rules for matching level columns. Each is a template which is parameterized with 1) the fact table's cube's dimension hierarchy's name, "hierarchy_name", 2) the fact table's cube's dimension hierarchy's level name, "level_name", 3) the dimension table's level column name, "level_column_name", and 4) a usage prefix, "usage_prefix", which in most cases is null":

  • ${hierarchy_name}_${level_name}
  • ${hierarchy_name}_${level_column_name}
  • ${usage_prefix}${level_column_name}
  • ${level_column_name}

The "usage_prefix" is the value of the DimensionUsage's or private Dimension's optional usagePrefix attribute. It can be the case that a "level_column_name", the name of a dimension's level column, is the same for more than one dimension. During aggregate recognition for collapsed dimension aggregates where the base fact table has two or more dimensions with common column names, the attempted recognition will fail unless in the schema catalog the usagePrefix attribute is used to disambiguate those column names. Of course, one must also remember to prefix the the column in the aggregate table with the same prefix.

As an example of usagePrefix, consider a fact table named ORDERS which has two DimensionUsages, one for the CUSTOMER dimension and the other for the WHOLESALER dimension where each dimension has a level column named CUST_NM. In this case, a collapsed aggregate table could not include a column named CUST_NM because there would be no way to tell which dimension to associate it with. But if in the CUSTOMER' DimensionUsage the usagePrefix had the value "CU_", while the WHOLESALER's usagePrefix had the value "WS_", and the aggregate table column was named WS_CUST_NM, then the recognizer could associate the column with the WHOLESALER dimension.

In the case of a private Dimension, a usagePrefix need only be used if there is a public, shared Dimension that has the same name and has a "level_column_name" that is also the same. Without the usagePrefix there would be no way of disambiguating collapsed dimension aggregate tables.

If any of these parameters have space characters, ' ', these are mapped to underscore characters, '_', and, similarly, dot characters, '.', are also mapped to underscores. So, if the hierarchy_name is "Time", level_name is "Month" and level_column_name is month_of_year, the possible aggregate table column names are:

  • time_month
  • time_month_of_year
  • month_of_year

For this rule, the "hierarchy_name" and "level_name" are converted to lower case while the "level_column_name" must match exactly.

Lastly, there is the rule for measures. There are three parameters to matching aggregate columns to measures: 1) the fact table's cube's measure name, "measure_name", 2) the fact table's cube's measure column name, "measure_column_name", and 3) the fact table's cube's measure's aggregator (sum, avg, max, etc.), "aggregate_name".

  • ${measure_name}
  • ${measure_column_name}
  • ${measure_column_name}_${aggregate_name}

where the measure name is converted to lower case and both the measure column name and aggregate name are matched as they appear. If the fact table's cube's measure name was, "Avg Unit Sales", the fact table's measure column name is "unit_sales", and, lastly, the fact table's cube's measure's aggregate name is "avg", then possible aggregate table column names that would match are:

  • avg_unit_sales
  • unit_sales
  • unit_sales_avg

For Mondrian developers there are additional notes describing the default rule recognition schema.

5.2 Explicit aggregates 

On a per cube basis, in a schema file a user can both include and exclude aggregate tables. A table that would have been include as an aggregate by the default rules can be explicitly excluded. A table that would not be include by the default rules can be explicitly included. A table that would have only been partially recognized by the default rules and, therefore, resulted in a warning or error message, can be explicitly include in rules specified in the cube's definition.

Below is an example for the FoodMart Sales cube with fact table sales_fact_1997. There are child elements of the Table element that deal with aggregate table recognition.

<Cube name="Sales">
  <Table name="sales_fact_1997">
    <AggExclude name="agg_c_14_sales_fact_1997" />
    <AggExclude name="agg_lc_10_sales_fact_1997" />
    <AggExclude name="agg_pc_10_sales_fact_1997" />
 
    <AggName name="agg_c_special_sales_fact_1997">
        <AggFactCount column="FACT_COUNT"/>
        <AggIgnoreColumn column="admin_one"/>
        <AggIgnoreColumn column="admin_two"/>
        <AggForeignKey factColumn="product_id" aggColumn="PRODUCT_ID" />
        <AggForeignKey factColumn="customer_id" aggColumn="CUSTOMER_ID" />
 
        <AggForeignKey factColumn="promotion_id" aggColumn="PROMOTION_ID" />
        <AggForeignKey factColumn="store_id" aggColumn="STORE_ID" />
        <AggMeasure name="[Measures].[Unit Sales]" column="UNIT_SALES_SUM" />
        <AggMeasure name="[Measures].[Store Cost]" column="STORE_COST_SUM" />
        <AggMeasure name="[Measures].[Store Sales]" column="STORE_SALES_SUM" />
        <AggLevel name="[Time].[Year]" column="TIME_YEAR" />
 
        <AggLevel name="[Time].[Quarter]" column="TIME_QUARTER" />
        <AggLevel name="[Time].[Month]" column="TIME_MONTH" />
    </AggName>
    <AggPattern pattern="agg_sales_fact_1997_.*">
    ....
    <AggExclude name="agg_sales_fact_1997_olddata" />
    <AggExclude pattern="agg_sales_fact_1997_test.*" />

     </AggPattern>

  </Table>
....
</Cube>

The AggExclude elements define tables that should not be considered aggregates of the fact table. In this case Mondrian is instructed to ignore the tables agg_c_14_sales_fact_1997, agg_lc_10_sales_fact_1997 and agg_pc_10_sales_fact_1997. Following the excludes is the AggName element which identifies the name of an aggregate table table, agg_c_special_sales_fact_1997, and rules for mapping names from the fact table and cube to it. The two AggIgnoreColumn elements are used to specifically state to Mondrian that the columns admin_one and admin_two are known and should be ignored. If these columns were not so identified, Mondrian at the end of determining the fitness of the agg_c_special_sales_fact_1997 table to be an aggregate of the sales_fact_1997 fact table would complain that there were extra unidentified columns and that the mapping was incomplete. The AggForeignKey elements define mappings from the sales_fact_1997 fact table foreign key column names into the agg_c_special_sales_fact_1997 aggregate table column names.

Both the AggMeasure and AggLevel elements map "logical" name, names defined in the cube's schema, to the aggregate table's column names. An aggregate table does not have to have all of the measures that are found in the base fact table, so it is not a requirement that all of the fact table measures appear as AggMeasure mappings, though it will certainly be the most common case. The most notable exception are distinct-count measures; such a measure can be aggregated, but one can not in general aggregate further on the measure - the "distinctness" of the measure has been lost during the first aggregation.

The AggLevel entries correspond to collapsed dimensions. For each collapsed dimension there is a hierarchy of levels spanning from the top level down to some intermediate level (with no gaps).

The AggName element is followed by an AggPattern element. This matches candidate aggregate table names using a regular expression. Included as child elements of the AggPattern element are two AggExclude elements. These specifically state what table names should not be considered by this AggPattern element.

In a given Table element, all of the AggExclude are applied first, followed by the AggName element rules and then the AggPattern rules. In the case where the same fact table is used by multiple cubes, the above still applies, but its across all of the aggregation rules in all of the multiple cube's Table elements. The first "Agg" element, name or pattern, that matches per candidate aggregate table name has its associated rules applied.

Most of the time, the scope of these include/exclude statements apply only to the cube in question, but not always. A cube has a fact table and it is the characteristics of the fact table (like column names) against which some of the aggregate table rules are applied. But, a fact table can actually be the basis of more than one cube. In the FoodMart schema the sales_fact_1997 fact table applies to both the Sales and the Sales Ragged cubes. What this means is that any explicit rules defined in the Sales cube also applies to the Sales Ragged cube and visa versa.

One feature of the explicit recognizer is very useful. With a single line in the cubes definition in the schema file, one can force Mondrian not to recognize any aggregate tables for the cube's fact table. As an example, for the FoodMart Sales cube the following excludes all aggregate tables because the regular expression pattern ".*" matches all candidate aggregate table names.

<Table name="sales_fact_1997" >     <AggExclude pattern=".*" /> </Table>

During aggregate table recognition, rather than fail silently, Mondrian is rather noisy about things it can not figure out.

6. Aggregate tables and parent-child hierarchies 

A parent-child hierarchy is a special kind of hierarchy where members can have arbitrary depth. The classic example of a parent-child hierarchy is an employee org-chart.

When dealing with parent-child hierarchies, the challenge is to roll up measures of child members into parent members. For example, when considering an employee Bill who is head of a department, we want to report not Bill's salary, but Bill's salary plus the sum of his direct and indirect reports (Eric, Mark and Carla). It is difficult to generate efficient SQL to do this rollup, so Mondrian provides a special structure called a closure table, which contains the expanded contents of the hierarchy.

A closure table serves a similar purpose to an aggregate table: it contains a redundant copy of the data in the database, organized in such a way that Mondrian can access the data efficiently. An aggregate table speeds up aggregation, whereas a closure table makes it more efficient to compute hierarchical rollups.

Supposing that a schema contains a large fact table, and one of the hierarchies is a parent-child hierarchy. Is is possible to make aggregate tables and closure tables work together, to get better performance? Let's consider a concrete example.

Cube:
    [Salary]

Dimensions:
    [Employee], with level [Employee]
    [Time], with levels [Year], [Quarter], [Month], [Day]

Fact table:
    salary (employee_id, time_id, dollars)

Parent-child dimension table:
    employee (employee_id, supervisor_id, name)
 
employee
supervisor_id employee_id name
null 1 Frank
1 2 Bill
2 3 Eric
1 4 Jane
3 5 Mark
2 6 Carla
Closure table:
    employee_closure (employee_id, supervisor_id, depth)
employee_closure
supervisor_id employee_id distance
1 1 0
1 2 1
1 3 2
1 4 1
1 5 3
1 6 2
2 2 0
2 3 1
2 5 2
2 6 1
3 3 0
3 5 1
4 4 0
5 5 0
6 6 0

Regular dimension table:
    time (year, month, quarter, time_id)

6.1 Aggregate tables at the leaf level of a parent-child hierarchy