What are some ways to implement a many-to-many relationship in a data warehouse?
In my experience, a recursive hierarchy is the most practical way of tackling this. It offers the following advantages:
- Unlimited depth.
- Compactness.
- Flexibility.
- Speed.
By contrast, it takes an extra table for each level of "-to-many" joins. This is hard coded and difficult to maintain against schema updates.
By using filtered indexes, a large table of hierarchical joins can perform at superior speed to dedicated tables. The reason is each join is only "parent-child" compared to "to join table to data table". The latter has more indexes to process and store.
I've been trying to solve this problem for many years. Recently, this is what I came up with.
Some scenarios for M:M relationships in a data warehouse model
Most OLAP servers and ROLAP systems have a means to deal with M:M data structures now, but there are some caveats about this that you will need to pay attention to. If you do implement M:M relationships you will need to keep an eye on your reporting layer and what tools you want to support.
Scenario 1: M:M dimension onto a fact table
An example of this might be multiple drivers on a motor policy. If you add or remove a driver, the policy adjustment transaction may have a relationship to a list of drivers that changes with the adjustment.
Option 1 - M:M driver-fact bridge table This will have quite a large volume of data, as it has drivers x transactions rows for a given policy. SSAS can consume this data structure directly, but it is slower to query through a ROLAP tool.
If your M:M relationship is based on entities that are specific to the fact row (e.g. drivers on a car) this might be appropriate for a ROLAP tool as well, providing your ROLAP tool supports M:M relationships (e.g. using contexts in Business Objects).
Option 2 - Dummy 'combinations' dimension table If you are mapping a list of common codes to a fact table (i.e. the linked entities are not peculiar to the fact row) then you can take another approach that will reduce the data volumes. An example of this type of scenario is ICD codes on an inpatient visit. Each inpatient visit will have one or more ICD diagnoses and/or procedures listed against it. The ICD codes are global.
In this case, you can make up a distinct list of the combinations of codes on each case. Make a dimension table with one row for each distinct combination, and have a link table between the combinations and the reference tables for the ICD codes themselves.
The fact table can have a dimension key to the 'combinations' dimension, and the dimension row has a list of references to actual ICD codes. Most ROLAP tools can consume this data structure. If your tool will only work with an actual M:M relationship then you can create a view that emulates the M:M relationship between the fact and the coding reference table. This would be the preferred approach with SSAS.
Advantages of option 1: - Appropriately indexed, queries based on selecting fact table rows with a certain relationship through the M:M table can be reasonably efficient.
- Slightly simpler conceptual model
Advantages of option 2: - Data storage is more compact
- You can emulate a straight 1:M relationship by presenting the combinations in a human-readable format as a code on the 'combinations' dimension. This might be more useful on dumber reporting tools that lack support for M:M relationships.
Scenario 2: M:M relationship between dimensions:
Harder to think of a use case, but one could envisage something out of healthcare with ICD codes again. On a cost analysis system, the inpatient visit may become a dimension, and will have M:M relationships between the visit (or consultant-episode in NHS-speak) and the codings.
In this case you can set up the M:M relationships, and possibly codify a human-readable rendering of them on the base dimension. The relationships can be done through straight M:M link tables or through a bridging 'combinations' table as before. This data structure can be queried correctly through Business Objects or better quality ROLAP tools.
Off the top of my head, I can't see SSAS being able to consume this without taking the relationship right down to the fact table, so you would need to present a view of the M:M relationship between the coding and the fact table rows to use SSAS with this data.
I would like to know exactly what kind of many-to-many relationship you have in mind in your model, either as it is in the transactional system or whatever data model it currently is in.
Typically, many-to-many relationships between dimensions are facts about dimensions. The fact that a customer orders from a several branch offices who service many customers, or something like that. Each of those is a fact. It would have an effective date or something like that, but the relationship could be "fact-less". The relationship itself may have other dimensions besides customer and branch office. So it's a typical star schema with a (potentially fact-less) fact table at the center. How this star can relate to other dimensional stars in the warehouse will obviously depend. Any time you combine different stars, you would do so on the business keys and have to ensure you aren't performing inadvertent cross-joins.
Typically one does not report on such dimension relationship tables to the same degree as larger fact tables and when they do, it's not always as much data, so it doesn't tend to affect performance. In the above case, you might look at customer/branch utilization over time, but better data about actual order quantities would be available in your order fact table, which would presumably also have dimensions for the customer, branch, etc. These are not what most people would consider many-to-many (although an order could be considered to define a many-to-many relationship from customer to branch), so are more typical in data warehouse environments. You would only be doing numerical aggregates on many-to-many models if you had rolled up summary information to that relationship level - i.e. customer, branch, month, total sales - a summary fact table, which looks more like a one fact-many dimension model now that it has numerical data.