Over the past few decades, traditional row store databases have served many needs, but within the past 10 years we've seen tremendous advancements in computing power, disk performance, memory density and pricing that let us look to new ways of computing. A new crop of analytic databases provides performance at a scale and cost that seemed impossible only a few years ago. These databases are designed from the ground up to get answers back. Fast. We know the options and can help you pick the right one(s) for your needs.
Column Store Databases
Legacy RDBMSs usually store data in rows. Many analytic databases have flipped this approach on end, choosing to store data by column -- an approach called "vertical partitioning." When values from the same field are stored together, they can be compressed for a smaller disk footprint and much faster access. Additionally, most analytic queries select only a subset of the columns in a table, so performance improves even more because a column store retrieves only the columns a query needs.
Many modern analytic databases use some form of columnar storage engine, including Actian, Vertica, Redshift and EXASOL. There are also nice open source variants of this technology, including MonetDB and InfiniDB.
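To make the column store idea concrete, here is a minimal Python sketch. It is only an illustration, not how any of the products above is implemented: the sample table, the column names and the helper functions `to_columns` and `run_length_encode` are all hypothetical, and run-length encoding stands in for the more sophisticated compression schemes real engines use.

```python
from itertools import groupby

# Hypothetical sample table, stored row by row as a legacy RDBMS might.
rows = [
    ("2015-01-01", "US", 120),
    ("2015-01-01", "US", 80),
    ("2015-01-01", "DE", 50),
    ("2015-01-02", "DE", 70),
    ("2015-01-02", "DE", 30),
]
column_names = ("order_date", "country", "amount")


def to_columns(rows, names):
    """Pivot row tuples into one list per column (the columnar layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows, column_names)

# An analytic query such as SUM(amount) reads one column and skips the rest.
total_amount = sum(columns["amount"])

# Values in the same field tend to repeat, so they compress well.
encoded_country = run_length_encode(columns["country"])
```

In this sketch the sum touches a single list rather than every field of every row, and the five country values shrink to two pairs. Real engines apply the same two ideas to data on disk.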
Symmetric Multiprocessing (SMP) vs Massively Parallel Processing (MPP)
Analytic databases are generally architected as either SMP or MPP. SMP databases run on a single server with many CPUs/cores, taking advantage of shared memory to speed query performance. MPP databases leverage clusters of servers that spread the data and distribute query processing in order to boost performance through parallelism. As a result, SMP databases are often called "scale up" and MPP "scale out". If you want to process more data on SMP, then add capacity to your server or buy a bigger server. If you want to extend MPP databases, then add more servers.
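The "spread the data, distribute the query" pattern can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the design of any particular MPP product: threads stand in for servers, rows are assigned to nodes by a CRC32 hash of a hypothetical customer key, and the function names (`node_for`, `distribute`, `local_sum`, `mpp_sum_by_customer`) are ours.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def node_for(key, node_count):
    """Pick a node for a key using a stable hash (assumed distribution scheme)."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def distribute(rows, node_count):
    """Spread (customer, amount) rows across nodes by customer key."""
    nodes = [[] for _ in range(node_count)]
    for customer, amount in rows:
        nodes[node_for(customer, node_count)].append((customer, amount))
    return nodes


def local_sum(node_rows):
    """Work done independently on each node: aggregate its own rows."""
    partial = defaultdict(int)
    for customer, amount in node_rows:
        partial[customer] += amount
    return dict(partial)


def mpp_sum_by_customer(rows, node_count=4):
    """Scatter the rows, aggregate on every node in parallel, gather the results."""
    nodes = distribute(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(local_sum, nodes))
    merged = {}
    for partial in partials:
        # Each customer lives on exactly one node, so the partials never overlap.
        merged.update(partial)
    return merged
```

Calling `mpp_sum_by_customer(rows, node_count=8)` instead of `node_count=4` is the toy equivalent of scaling out: the per-node workload shrinks while the query stays the same. Scaling up, by contrast, would mean making the single `local_sum` worker faster.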
We've used SMP databases for many years as they are usually easier to configure, provide lower cost of entry and can sometimes scale to support interactive query on a billion rows. SMP databases generally can execute individual queries faster than MPP on a given size dataset, but have a ceiling on their scalability -- the capacity of a single server. We've worked with SMP analytic databases like Actian Vector, InfoBright, InfiniDB (single server) and MonetDB.
MPP databases are often used when record volumes enter the billions or even trillions. Due to their architecture, every query usually carries a small overhead cost, which makes them underperform traditional RDBMSs and SMP analytic databases on lower volume data. However, on a properly sized cluster, very few queries require an inordinate amount of time to execute. We've worked with MPP analytic databases like Redshift, Greenplum, Netezza and Teradata.
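The trade-off between per-query overhead and parallelism can be illustrated with a deliberately simple cost model. Everything here is an assumption made for illustration: scan cost is treated as linear in row count, the overhead is a fixed number of seconds per query, and the figures passed in below are hypothetical, not measurements of any product.

```python
def smp_seconds(rows, seconds_per_row):
    """Single server: time grows with the number of rows scanned."""
    return rows * seconds_per_row


def mpp_seconds(rows, seconds_per_row, nodes, overhead_seconds):
    """Cluster: fixed coordination overhead plus the scan split across nodes."""
    return overhead_seconds + rows * seconds_per_row / nodes


def break_even_rows(seconds_per_row, nodes, overhead_seconds):
    """Row count at which the two simplified models take the same time."""
    return overhead_seconds / (seconds_per_row * (1 - 1 / nodes))


# Hypothetical figures: 1 microsecond per row, 4 nodes, 0.3 s overhead per query.
small = 100_000
large = 10_000_000
small_smp = smp_seconds(small, 1e-6)
small_mpp = mpp_seconds(small, 1e-6, nodes=4, overhead_seconds=0.3)
large_smp = smp_seconds(large, 1e-6)
large_mpp = mpp_seconds(large, 1e-6, nodes=4, overhead_seconds=0.3)
crossover = break_even_rows(1e-6, nodes=4, overhead_seconds=0.3)
```

Under these made-up numbers the single server wins on the small table and the cluster wins on the large one, with the crossover at roughly 400,000 rows. Real crossover points depend on hardware, data distribution and query shape, so treat the model as a way to reason about the trade-off rather than a sizing tool.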
Cloud-Based Analytical Databases
Sometimes the need for a highly scalable database is there, but the budget for the required investment is not. We've used Amazon Redshift in these cases. Amazon licensed the ParAccel technology prior to the Actian purchase, and has taken it to new levels, offering a cloud-based alternative for those who would like to do high volume analytics but don't want to buy and administer a cluster of servers and disk. Amazon offers highly competitive pricing for these types of infrastructures, and we've seen excellent performance extending to billions of rows of data.
SQL on Hadoop
As the Hadoop ecosystem matures, many of our customers are looking at a new class of databases that leverage the massive storage, parallel compute and fault tolerance of Hadoop. These databases are often called SQL on Hadoop and can be categorized by their storage and compute architectures. Technologies such as Hive/Stinger, Impala and Spark SQL can access prepared data stored natively on HDFS using industry-standard file formats. These platforms provide seamless data access between Hadoop data processing and data query. A second class of SQL on Hadoop is represented by a growing number of database vendors that have ported their core engine to run in-cluster, usually managed by YARN. Most of these vendors, however, require you to load your HDFS data into their proprietary storage format in order for their query engines to perform.
As Hadoop continues its march toward becoming "the" data warehouse, we expect to see tremendous innovation in this class of analytic database. Rest assured that we'll help you stay on top of it.