A columnar database is optimized for reading and writing columns of data as opposed to rows of data. Column-oriented storage for database tables is an important factor in analytic query performance because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need to load from disk.

Like other NoSQL databases, column-oriented databases are designed to scale “out” using distributed clusters of low-cost hardware to increase throughput, making them ideal for data warehousing and Big Data processing.

What’s Unique About a Columnar Database?

Looking for the right database technology to use? Luckily there are many database technologies to choose from, including relational databases (MySQL, Postgres), NoSQL (MongoDB), columnar databases (Amazon Redshift, BigQuery), and others. Each choice has its own pros and cons, but today let’s walk through how columnar databases are unique, by comparing it against the more traditional row-oriented database (e.g., MySQL).

Row-Oriented Database

Traditional databases store data by each row. The fields for each record are sequentially stored. Let’s say you have a table like this:

| ID | name         | address              | zip code | phone       | city       | country   | age |
|  1 | Benny Smith  | 23 Workhaven Lane    | 52683    | 14033335568 | Lethbridge | Canada    |  43 |
|  2 | Keith Page   | 1411 Lillydale Drive | 18529    | 16172235589 | Woodridge  | Australia |  26 |
|  3 | John Doe     | 1936 Paper Blvd.     | 92512    | 14082384788 | Santa Clara| USA       |  33 | 

This two-dimensional table would be stored in a row-oriented database like this:

1,Benny Smith,23 Workhaven Lane,52683,14033335568,Lethbridge,Canada,43;2,Keith Page,1411 Lillydale Drive,18529,16172235589,Woodridge,Australia,26;3,John Doe,1936 Paper Blvd.,92512,14082384788,Santa Clara,USA,33;

As you can see, a record’s fieldsare stored one by one, then the next record’s fields are stored, then the next, and on and on…

Columnar Database

Contrast the above with how a columnar database would store this data:

1,2,3;Benny Smith,Keith Page,John Doe;23 Workhaven Lane,1411 Lillydale Drive,1936 Paper Blvd.;52683,18529,92512;14033335578,16172235589,14082384788;Lethbridge,Woodridge,Santa Clara;Canada,Australia,USA;43,26,33;

Each field is stored by the column so that each ‘id’ is stored, then the ‘name’ column, then the ‘zip codes’, etc. So what implications are there when storing data in a column-oriented fashion?

The Advantages of Columnar Databases, and the Disadvantages

The primary benefit you can get by storing data in a column-oriented database is that some of your queries could become really fast. Imagine, for example, that you wanted to know the average age of all of your users. Instead of looking up the age for each record row by row (row-oriented database), you can simply jump to the area where the “age” data is stored and read just the data you need. So when querying, columnar storage lets you skip over all the non-relevant data very quickly. Row-oriented:

1,Benny Smith,23 Workhaven Lane,52683,14033335568,Lethbridge,Canada,43;2,Keith Page,1411 Lillydale Drive,18529,16172235589,Woodridge,Australia,26;3,John Doe,1936 Paper Blvd.,92512,14082384788,Santa Clara,USA,33;


(skip) 43,26,33;

Hence, aggregation queries (queries where you only need to lookup subsets of your total data) could become really fast compared to row-oriented databases. Further, since the data type for each column is similar, you get better compression when running compression algorithms on each column (which would make queries even faster). And this is accentuated as your dataset becomes larger and larger. You might be thinking, “Well, what if I needed to query multiple values for each user? Aggregation is great, but my app needs to show data for each individual user??” You would be spot on. There are many cases where you actually do need multiple fields from each row. And columnar databases are generally not great for these types of queries. The more fields you need to read per record, the less benefits you get from storing in a column-oriented fashion. In fact, if your queries are for looking up user-specific values only, row-oriented databases usually perform those queries much faster.

Another thing to consider about columnar storage is that writing new data could take more time. If you’re inserting a new record into a row-oriented database, you can simply write that in one operation. But if you’re inserting a new record to a columnar database, you need to write to each column one by one. As a result, loading new data or updating many values in a columnar database could take much more time (perhaps, more than you expect). That’s why you would usually want a row-oriented database like MySQL running the back-end of your web app, etc. And once your app becomes huge, you would also want to consider having a columnar database like Amazon Redshift to run your BI (business intelligence) analytics queries (which usually consist of aggregation queries). We’ve seen many companies that make mobile games or web apps go through this same transition.

Columnar Databases on AWS

Amazon Web Services (AWS) provides a variety of columnar database options for developers. You can operate your own non-relational columnar data store in the cloud on Amazon EC2 and Amazon EBS, work with AWS solution providers, or take advantage of fully managed columnar database services.

Amazon Redshift

Amazon Redshift is a column-oriented, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. Amazon Redshift achieves efficient storage and optimum query performance through a combination of massively parallel processing, columnar data storage, and very efficient, targeted data compression encoding schemes. Learn more about Amazon Redshift » 

Columnar Databases on Amazon EC2 or Amazon EMR

Developers may install column-oriented databases of their choice on Amazon EC2 and Amazon EMR, which means developers avoid the friction of infrastructure provisioning while gaining access to a variety of standard columnar database engines.

Apache Cassandra

Cassandra is an open source, column-oriented database designed to handle large amounts of data across many commodity servers. Unlike a table in a relational database, different rows in the same table (column family) do not have to share the same set of columns.

See a multi-region Cassandra configuration with a look inside Vidora’s globally distributed, low-latency A.I.

Consider EBS when running Cassandra workloads (learn how CrowdStrike ran dense, cheaper Cassandra clusters with EBS). For more about working with Cassandra and running Cassandra on AWS, read the Apache Cassandra on AWS whitepaper and visit the AWS Marketplace »

Best Practices for Running Apache Cassandra on AWS
Apache HBase

Apache HBase is an open-source, column-oriented, distributed NoSQL database. HBase runs on the Apache Hadoop framework. HBase provides you a fault-tolerant, efficient way of storing large quantities of sparse data using column-based compression and storage.

You can deploy HBase on Amazon Elastic Cloud Compute (Amazon EC2) and manage it yourself or leverage Apache HBase as a managed service on Amazon Elastic MapReduce (Amazon EMR).  Learn more by reading the EMR Developer Guide and this post on the AWS Big Data Blog »