Data warehouses are typically associated with traditional relational databases such as DB2, Oracle or SQL Server. Data is gathered from multiple sources, inconsistencies and duplication dealt with via some sort of pre-processing, and then a consistent set of data is loaded into a set of tables. There are usually one or more 'fact' tables (where the bulk of the data to be analysed is stored) and assorted 'dimension' tables (where master data, such as customer, product, location, asset is stored, giving context to the 'facts'). Once in the data warehouse, a wide array of business intelligence tools can be brought to bear, generating queries in SQL, the lingua franca of relational databases.
This is all well and good, but over recent years it has become apparent that not only are conventional data warehouses being stretched to handle growing data volumes - the Big Data of marketing speak - but that the licensing costs of the databases themselves, as well as the associated ETL and business intelligence tools, can be considerable.
Some organisations have started to experiment with an intriguing alternative approach. Big Data is usually associated with semi-structured data such as web logs and sensor data, and with the storage and processing of this data using Apache's open source software framework Hadoop and its associated file system HDFS. At first sight this offers little to interest data warehouse specialists, but there is a SQL implementation layer on top of Hadoop called Hive that is suitable for data warehouse applications.
Hive allows you to build a database schema and create tables, build indexes, and even partition tables using a SQL variant that any database administrator will be familiar with. Shielded from the internals by the Hive software layer, administrators can load data using similar commands, and once loaded, can query the data using HiveQL, which has familiar SQL constructions. The SQL queries are hived off to the underlying file system and an answer set returned.
Early results indicate that Hive is not as fast as traditional databases at real-time queries, but its inherent scalability means that it can perform well when executing batch queries, which are a major part of most data warehouse systems. Indeed some early comparative benchmarks have shown Hive outperforming a conventional SQL database running batch queries on a small test database, indicating linear performance as data volumes grew. It will be interesting to see such experiments tested on a much larger scale.
This approach is not just being used in laboratories. One US image-hosting company used Hive for its multi-terabyte data warehouse, and found it significantly outperformed their previous SQL-based warehouse on the most important queries. Another real-life case study has been with a US government agency, in which Hive successfully analysed billion of rows of spatial data which had previously crashed its SQL database.
Hive is in its infancy but presents an intriguing challenge to conventional approaches. No database licence fees are involved and it is likely that more tool support will follow for HiveQL, allowing a broader choice of query facilities than currently exists. There are challenges such as skills availability, but the pioneering efforts are certainly interesting.
An example of the growing Hive bandwagon can be seen with Quest Software, provider of the popular Toad database administration tools, already offering Hive support. Open source ETL company Pentaho also supports Hive data warehouses, providing an OLAP tool in the form of Mondrian and reporting capabilities. Analytics giant SAS has provided support for access to Hive tables as if they were native SAS data sets, allowing the rich set of data mining and predictive analytics that SAS provides to be applied to Hive data.
As experience grows and technologies mature, Hive could represent a serious open source alternative to conventional data warehouse approaches, initially at the lightweight end of the scale. Such competition will bring pressure on existing vendors to up their games, add greater value and review their pricing, which can only be a healthy thing for beleaguered customers. Roman Emperor Marcus Aurelius said: "That which is not good for the hive cannot be good for the bees." In this case, the growing success of Hive may well be good for the inhabitants of its ecosystem - the open source data warehouse community.