What Is an OLAP Cube? An Exhaustive Explainer (holistics.io)
86 points by chotmat on Sept 4, 2021 | 19 comments


OLAP database is just a copy, replica, or archive of data with a schema designed for analytical queries and not for transactional speed and robustness.

Cube means you can answer questions across dimensions of time, region, and the like. Temporal queries are not always simple.

You don't query your live transactional db to answer a question like, "are third week of February margins on Acme Widget Q better in place A or B."

That's really it.
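For a concrete flavor, that kind of question against a simple star schema might look something like the sketch below (table and column names are hypothetical, purely for illustration):

    -- Hypothetical star schema: a sales fact joined to date and store dimensions.
    -- "Are third-week-of-February margins on Acme Widget Q better in place A or B?"
    SELECT st.region,
           SUM(f.revenue - f.cost) AS margin
    FROM   fact_sales f
    JOIN   dim_date   d  ON d.date_key   = f.date_key
    JOIN   dim_store  st ON st.store_key = f.store_key
    WHERE  f.product_name  = 'Acme Widget Q'
      AND  d.month_of_year = 2
      AND  d.week_of_month = 3
      AND  st.region IN ('A', 'B')
    GROUP  BY st.region;

You run that against the analytical copy, not the live transactional db.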


I remember hearing about OLAP cubes donkey's years ago (probably not far from 1993, as the article says) and read numerous descriptions of what they are (many sounding like this article). But technically there was nothing novel about the results they were producing; any programmer could have done it. There was nothing surprising about the underlying data structures. I walked away without much of a clue, but it wasn't just marketing fluff, because the idea kept hanging around.

That changed recently. I was asked to load some data into an OLAP package. Nothing special about that, but it meant I got to sit in on the training session for the people using it. They were the same people our IT had to help on a daily basis with the simplest of things. Yet within an hour it had clicked, and they were happily exploring how each dimension in multi-dimensional data affected the others.

That was my light bulb moment. The reason OLAP is still a thing 20 years later isn't some special algorithm or data structure; it's that it's a unique GUI for exploring very complex data that non-computer people find intuitive. That's my take on it anyway.


If you've used pivot tables in Excel or Google Sheets, you'll know that you have:

A) rows and columns

B) values

Dimensions are things you might include in A.

Facts (or measures) are things you can aggregate, and would be part of B.

In the old days, you often couldn't quickly (on demand) compute the answer to a question, due to memory and processing limitations.

But, if you could decide ahead of time what dimensions and facts you wanted, you could compute all the little slices, e.g. the value of sales in the north region by salesperson Bob in July 1996.

Then, when you needed some particular pivot table, your software could provide it to you just by summing these precomputed slices.
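As a rough sketch (hypothetical table and column names), the precomputed slices are just a GROUP BY over every dimension, and any later pivot is a cheap sum over that much smaller table:

    -- Precompute one row per (region, salesperson, sales_month) slice.
    CREATE TABLE sales_cube AS
    SELECT region, salesperson, sales_month, SUM(amount) AS sales
    FROM   sales
    GROUP  BY region, salesperson, sales_month;

    -- A pivot such as "sales by salesperson" is then a sum of precomputed slices.
    SELECT salesperson, SUM(sales) AS sales
    FROM   sales_cube
    GROUP  BY salesperson;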

The OL in OLAP stands for 'online', i.e. you'd do your analytical processing ('AP') live, rather than waiting for some long-running batch job.


My experience of OLAP was that it took enterprises so long to plan, design, and adopt OLAP systems that they were obsolete before they could ever be used.


OLAP Cubes have been disrupted by Column Stores. Unless you are interested in the history of Data Warehouses, the technology can safely be ignored.


They're different things entirely. Think of OLAP cubes as more of a model, a vocabulary (measures, dimensions) for expressing all kinds of aggregation without necessarily prescribing the how. A column store is the new "how", so you can still run OLAP queries against it.


OLAP cubes are the main use case for column stores.


100% here. We need to drop old terms and methods that are now obsolete.


So that we can now sell old things as exciting new technologies.


People can really get hung up on jargon like this. The real concept is dimensional modeling, which is a whole strategy and toolkit of ideas for doing online analytical processing. You can do it with anything; The Data Warehouse Toolkit was calling it "ROLAP" when you did it in a regular database. You would be dead lost trying to use a proper "OLAP" tool with special sparse-matrix data structures and MDX queries without understanding concepts like the various types of dimension tables and how they can be nested, etc. Claire was basically right, but maybe actually complaining more about the marketing of the OLAP products than any actual gatekeeping.


Very nice write-up. For the story from 2010 to now, I would add the emergence of MPP columnar processing systems like Vertica and in-memory distributed systems like MemSQL to the narrative. Of course, Kylin, Clickhouse, etc. are great open-source contenders (although at the time I looked into them, ~5-6 years ago, they were not mature enough).

In my experience, people often underestimate the continuous effort needed to maintain Kimball's "Enterprise Data Warehouse Bus Architecture" diagram, even with more powerful machines and modern distributed tooling.

In today's fast-evolving world of Internet apps, data use cases and scenarios change quickly. That brings its own set of challenges.

Having good, usable tools for managing the lifecycle of entity or event definitions, their variants (emitted/logged vs. cleaned/processed/synthesized), their data quality checks, etc., and ensuring they are easily discoverable and understandable by everyone in the org, is crucial and significantly under-appreciated.

Usually, strong systems engineers in charge of the data platform focus on building the data infra (job scheduling, data pipeline execution, storage, etc.), but the crucial work of defining the data dictionaries, event or entity models, etc. is left out. The data producers and data consumers spread throughout the organization have to muddle through this on their own without any centralized tooling to support the activity. This makes data use difficult and siloed.

Usually, there is a team of BI analysts tasked with getting answers out of the data for the questions asked of them by various data users. Funnily enough, these analysts also work in silos, assigned to those different data users. Inevitably, they become a super-inefficient intermediary between the data users and the data insights.

The pre-cooked data insights are presented in spreadsheets and slides in review meetings, where a narrative has already been prepared by the analysts.

This robs the data users of the opportunity to explore and ask data questions on their own in a fast iteration cycle, which would improve their intuition and understanding of their product/business environment.

IMO, these challenges remain largely unsolved even to this day, across organizations of all sizes and scales.


Well, an OLAP cube is a kind of table/view. There are two core properties: the cube contains only one value ('column' in RDBMS-speak), and the data is defined across a set of 'dimensions' (levels in OLAP; unique indices with a foreign key in RDBMS).

The core insight the article is missing, though, is that these dimensions are well-defined (and finite!) within the OLAP architecture, and so are the roll-ups (aggregations) to their parent dimensions. The classical example is a date: a date has to roll up to a week. Obvious. But the set of dates within the environment also has to (or should, anyway) be well defined: you can't just use any date; it has to fall, e.g., between 2019 and 2021.

Why? Because the structure of the dimensions makes algebra across OLAP cubes trivial. In fact, OLAP algebra is pretty much linear algebra, just defined slightly differently for a more fluid problem space. That's why I like to think that OLAP isn't obsolete: in fact, most databases are just now becoming OLAP-y :)

Source: 20+ years of OLAP systems development.
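To illustrate the "algebra across cubes" point, a minimal sketch with hypothetical tables: because both cubes are defined over the same well-defined dimensions, combining them element-wise is just a join on those dimensions.

    -- Two cubes over the same (date_key, region) dimensions, one value each:
    --   revenue_cube(date_key, region, revenue)
    --   units_cube(date_key, region, units)
    SELECT r.date_key,
           r.region,
           r.revenue / NULLIF(u.units, 0) AS avg_price  -- element-wise "division" of the two cubes
    FROM   revenue_cube r
    JOIN   units_cube   u
      ON   u.date_key = r.date_key
     AND   u.region   = r.region;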


This is more a set of research notes on the evolution of the concept than an explainer, but it is still very good.


Anyone have experience with Apache Kylin? It seemed to me like a way to blend the cube concept with columnar stores.

https://en.m.wikipedia.org/wiki/Apache_Kylin


Is there any advantage to an OLAP cube over plain SQL (a large historical database regularly updated with production data)?

It sounds like certain common queries may be extra fast since the results are kind of pre-computed?


It depends on whether you want to go the vendor route, where "cube" is very specific to the tool you're implementing. Dealing with semi-additive measures and drill-down/drill-through will differ between SSAS, Pentaho, or Cognos.

I think the real value in the process comes from all the data modeling decisions, which involve a lot of interaction with the business users who are asking for data/reports. Something as simple as an e-commerce order can look very different to CEO KPI reports, marketing, purchasing, and accounting. For example, accounting only cares about an order once the product ships, while marketing cares when it's sold. Multiply this by a hundred, and you end up with nuanced data pipelines that encapsulate all this domain-specific logic.

Multidimensional modeling is very useful even if you're not adopting cubes specifically and are just using dbt on a read-only replica of your database to create aggregate tables for a few dozen reports.
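A minimal sketch of what one such aggregate table might look like as a dbt-style model (the model and column names are made up; the domain decision that accounting counts an order when it ships is baked into the date column and the WHERE clause):

    -- Hypothetical dbt model: daily recognized revenue for accounting.
    SELECT date_trunc('day', shipped_at) AS ship_date,
           product_category,
           SUM(order_total)              AS recognized_revenue
    FROM   {{ ref('orders') }}
    WHERE  shipped_at IS NOT NULL
    GROUP  BY 1, 2;

Marketing's version of the same order data would group by the sale date instead.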


Thanks, this helps. It seems like it shines when you have a lot of non-technical staff requesting data. At my job, the data analysis folks have to write complex ad-hoc queries all the time. We store the queries for consistency reasons (e.g. to make sure we run the same query each year for the same report). It seems like OLAP wouldn't be a good fit there.


> non technical staff requesting the data

That's a good way of looking at it.

Consider the cube as the end product of a process of automating the data the business needs to make decisions. It starts with "How much did we sell yesterday?" and expands to include week-over-week/year-over-year comparisons, breakdowns by category, leaders/losers, and other frequently needed answers.
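For instance, once a daily aggregate exists, those questions become one-liners. A PostgreSQL-flavored sketch against a hypothetical daily_sales(sale_date, amount) table:

    -- "How much did we sell yesterday?" plus a week-over-week comparison.
    SELECT SUM(amount) FILTER (WHERE sale_date = CURRENT_DATE - 1) AS yesterday,
           SUM(amount) FILTER (WHERE sale_date = CURRENT_DATE - 8) AS same_day_last_week
    FROM   daily_sales
    WHERE  sale_date IN (CURRENT_DATE - 1, CURRENT_DATE - 8);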


Thank you for the help!



