
AFAIK the win of columnar storage comes from being able to scan an entire column across all rows very quickly, making efficient use of OS buffering etc., so queries like select a where b = 'x' are very quick.


> so queries like select a where b = 'x' are very quick

I wouldn’t say “very quick”. Such queries only need to read and look at the data for columns a and b, whereas a row-oriented layout on block-based storage also reads the other columns, often the entire dataset.

That’s faster, but for large datasets you need an index to make things “very quick”. This format supports that, but whether you have an index is orthogonal to being row- or column-oriented.
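A rough sketch of the difference, in case it helps. The table, the column c, and all function names are made up for illustration, and plain Python lists stand in for on-disk blocks:

```python
# Hypothetical three-column table, stored row-oriented.
rows = [
    {"a": 1, "b": "x", "c": "pad-0"},
    {"a": 2, "b": "y", "c": "pad-1"},
    {"a": 3, "b": "x", "c": "pad-2"},
]

# The same table stored column-oriented: one list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def select_a_where_b_row(rows, value):
    # Every whole row is visited, so column c is dragged along too.
    return [r["a"] for r in rows if r["b"] == value]


def select_a_where_b_col(columns, value):
    # Only columns a and b are touched; c is never read.
    return [a for a, b in zip(columns["a"], columns["b"]) if b == value]


def build_index(column):
    # Maps each value to the positions where it occurs.
    index = {}
    for pos, v in enumerate(column):
        index.setdefault(v, []).append(pos)
    return index


def select_a_where_b_indexed(columns, index, value):
    # With an index on b, the scan of b is skipped entirely.
    return [columns["a"][pos] for pos in index.get(value, [])]
```

The indexed variant would work just as well on top of the row layout, which is the sense in which having an index is a separate question from the layout.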


Sum(x) is a better example. Indexing x won’t help when you need all the values.
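To put rough numbers on it, here is a small sketch using a made-up fixed-width row layout (a as int64, b as 16 bytes, x as int64). Real formats differ, this only illustrates how many bytes each layout has to touch to sum x:

```python
import struct

# Hypothetical row layout: a (int64), b (16 bytes), x (int64) = 32 bytes per row.
ROW = struct.Struct("<q16sq")
values = [4, 8, 15, 16, 23, 42]

# Row store: whole rows packed back to back.
row_store = b"".join(ROW.pack(i, b"x", v) for i, v in enumerate(values))

# Column store for x: just the x values packed back to back.
col_store = struct.pack(f"<{len(values)}q", *values)


def sum_x_row(buf):
    # Has to walk all 32 bytes of every row to reach x.
    return sum(x for _a, _b, x in ROW.iter_unpack(buf))


def sum_x_col(buf):
    # Walks 8 bytes per row: only the x values exist in this buffer.
    return sum(struct.unpack(f"<{len(buf) // 8}q", buf))
```

Both functions return the same total, but the row version reads 192 bytes for these six rows and the column version reads 48.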


Another useful one is aggregations. Think sum(), concat(), max(), etc. You can operate on the column.

This is in contrast to row-based storage, where you have to scan every full row to get a single column. Think of how you'd usually read a CSV (read line, parse line).
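The CSV case, sketched with the standard csv module. The data and function names are invented for the example:

```python
import csv
import io

# Hypothetical CSV contents; in practice this would be a file on disk.
CSV_TEXT = "a,b,x\n1,x,10\n2,y,20\n3,x,30\n"


def max_x_from_rows(text):
    # Row-based: every line is read and fully parsed just to get at x.
    reader = csv.DictReader(io.StringIO(text))
    return max(int(row["x"]) for row in reader)


def to_columns(text):
    # One-time pivot into a column layout: one list per column name.
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    cols = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            cols[name].append(value)
    return cols


def max_x_from_columns(cols):
    # Column-based: the aggregate runs over one list, nothing else is touched.
    return max(int(v) for v in cols["x"])
```

A columnar file format effectively stores the result of to_columns, so the per-line parsing is not repeated for every aggregate.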


When b = 'x' is true for many rows and you select * or multiple columns, it's the opposite, because reading all of a row's data is slower in column-based data structures than in row-based ones.
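A minimal sketch of why, with a made-up table and function names. In the column layout each matching row has to be stitched back together from every column:

```python
# Hypothetical table in both layouts.
rows = [
    {"a": 1, "b": "x", "c": "p"},
    {"a": 2, "b": "y", "c": "q"},
    {"a": 3, "b": "x", "c": "r"},
]
columns = {
    "a": [1, 2, 3],
    "b": ["x", "y", "x"],
    "c": ["p", "q", "r"],
}


def select_star_row(rows, value):
    # A matching row is already stored together: one read per match.
    return [r for r in rows if r["b"] == value]


def select_star_col(columns, value):
    # Find matching positions in b, then gather from every column per match.
    # That is (matches x columns) separate lookups, each potentially in a
    # different region of storage.
    positions = [i for i, b in enumerate(columns["b"]) if b == value]
    return [{name: col[i] for name, col in columns.items()} for i in positions]
```

Both return the same rows; the difference is that the column version pays one lookup per column for every match.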

IMO it's easier to explain in terms of workload:

- OLTP (T = transactional workloads), row based, for operating on rows
- OLAP (A = analytical workloads), column based, for operating on columns (sum/min/max/...)



