Around the world, the volume of data that scientists and corporations collect is growing rapidly. These volumes quickly outpace the traditional relational databases that have stored and queried such data in the past, and organizations are rapidly replacing these systems with NoSQL data stores such as Apache Accumulo.
Analysts who previously ran SQL queries to analyze their data have had to transform and adapt their analytics to more cumbersome languages and frameworks, such as Java with Hadoop MapReduce or Scala with Apache Spark. Today, the high-performance Apache Accumulo connector for Presto brings the accessibility and ease of traditional SQL queries back to the NoSQL Big Data age.
In this presentation, we will discuss how the connector abstracts away common design patterns for interacting with data stored in Accumulo. Through the use of server-side iterators and secondary indexing, the connector projects a SQL interface onto these large data sets, enabling applications, data consumers, and data scientists to run both production and ad hoc queries against them using SQL. When a query contains fine-grained predicates, the connector can skip scanning the entire Accumulo table and instead use the built-in indexing scheme to retrieve the matching data quickly. Additionally, through the use of external tables and the built-in or custom serializer framework, the connector can query data stored in existing Accumulo tables.
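As a brief illustration, a query of the following shape, using hypothetical table and column names rather than any from the talk, carries a fine-grained predicate that the connector could answer from the secondary index instead of a full table scan:

```sql
-- Presto uses catalog.schema.table naming; "users", "name", and "age"
-- are illustrative placeholders. With secondary indexing enabled on
-- the "age" column, this range predicate can be served from the index.
SELECT name, age
FROM accumulo.default.users
WHERE age BETWEEN 21 AND 30;
```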
Finally, we will cover the use cases at Bloomberg that led to the creation of the connector and discuss how it is being used in production today.