Accumulo Summit 2014 is proud to announce the following talk selections. The schedule has been posted.
Ely Kahn (Sqrrl) & Don Miner (CTO, ClearEdge IT Solutions)
A Tour of Internal Accumulo Testing
Bill Havanki (Solutions Architect, Cloudera)
Accumulo includes a remarkable breadth of testing frameworks that help ensure its correctness, performance, robustness, and protection of your vital data. This presentation takes you on a tour from Accumulo's basic unit testing up through performance and scalability testing exercised on running clusters. Learn the extent to which Accumulo is put through its paces before it is released, and get ideas for how you can similarly enhance testing of your own code.
Accumulo Backed Tinkerpop Implementation
Ryan Webb (Associate Professional Staff, JHU APL)
As graph processing grows as a field, standards will eventually emerge. The TinkerPop graph processing stack is one such potential standard. The TinkerPop stack contains an algorithm engine, a scripting engine, and a RESTful service for accessing graphs. At the base of TinkerPop is Blueprints: an interface for accessing and creating property graphs. Blueprints has already been implemented with several different backing technologies (e.g., relational databases, RDF triple stores, graph databases) and implementations (e.g., JDBC-based, OpenRDF Sail, and Neo4j). This presentation will discuss our implementation of the Blueprints API backed by Accumulo to enable storage of arbitrarily large, distributed graphs. Our implementation falls between the extremes of distributed graph processing systems, which require that the entire graph fit within the available RAM of the cluster, and batch-oriented systems, which incur significant disk I/O costs during execution and generally handle iterative algorithms poorly. We will discuss the benefits of supporting the TinkerPop API and the design and performance trade-offs we faced when developing the Accumulo backend and integrating with the Hadoop MapReduce framework. We aim to merge the advantages of the TinkerPop software ecosystem with the scalability and fault tolerance of Accumulo and provide a robust, turn-key solution for certain classes of large-scale, graph-related challenges.
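To make the Blueprints interface concrete, here is a minimal sketch of our own (not taken from the talk) using the in-memory TinkerGraph reference implementation; an Accumulo-backed graph implements this same `Graph` interface, so client code like this would be unchanged apart from how the `Graph` is constructed. The vertex names and property are invented for illustration.

```java
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;

public class BlueprintsSketch {
    static String demo() {
        Graph g = new TinkerGraph();             // swap in an Accumulo-backed Graph here
        Vertex alice = g.addVertex("alice");
        Vertex bob = g.addVertex("bob");
        alice.setProperty("role", "analyst");    // property graphs attach key/value pairs
        Edge knows = g.addEdge(null, alice, bob, "knows");
        String label = knows.getLabel();
        g.shutdown();
        return label;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the storage backend hides behind the `Graph` interface, the same code can run against an in-memory graph during development and a distributed Accumulo-backed graph in production.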
Accumulo on YARN
Billie Rinaldi (Senior Member of Technical Staff, Hortonworks)
In its OSDI 2006 paper, Google observes that "Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status." Until recently, no such system existed for Apache Accumulo to rely upon. Apache Hadoop 2 introduced the YARN resource management system to the Hadoop ecosystem. This talk will describe the benefits YARN can provide for Accumulo installations and how the Slider project (proposed for the Apache Incubator) makes it easier to deploy long-running applications on YARN. It will cover the details of the Accumulo App Package for Slider and how to use Slider to deploy an Accumulo instance, as well as how instances can be actively managed by other applications such as Apache Ambari.
Accumulo Visibility Labels and Pluggable Authorization Systems: A Love Story
John Vines (Founding Engineer at Sqrrl)
Labels in Accumulo provide great power and flexibility. However, nearly everyone makes the same set of mistakes when first applying labels to their data. In this talk, we will follow two data architects as they first come to the labeling system in Accumulo, and see how they work their way out of the pitfalls they create for themselves. Along the way, they'll learn about Accumulo's pluggable security architecture surrounding the core functionality of the labeling system.
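As a concrete illustration of the labeling system (our sketch, not the speaker's), the snippet below writes a cell with the visibility expression `admin&audit` and shows that a scan presenting only `admin` cannot see it, while a scan presenting both authorizations can. It uses Accumulo's in-memory `MockInstance` so no cluster is needed; the table, row, and label names are invented.

```java
import java.util.Map.Entry;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.mock.MockInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.security.ColumnVisibility;

public class VisibilityDemo {
    static int countVisible(Connector conn, Authorizations auths) throws Exception {
        int n = 0;
        Scanner s = conn.createScanner("records", auths);
        for (Entry<Key, Value> e : s) n++;
        return n;
    }

    // Returns {cells seen with "admin" only, cells seen with both auths}.
    static int[] demo() throws Exception {
        Connector conn = new MockInstance().getConnector("root", new PasswordToken(""));
        conn.tableOperations().create("records");
        // Grant the scanning user both authorization tokens.
        conn.securityOperations().changeUserAuthorizations("root",
            new Authorizations("admin", "audit"));

        BatchWriter bw = conn.createBatchWriter("records", new BatchWriterConfig());
        Mutation m = new Mutation("row1");
        // "admin&audit" means a reader needs BOTH tokens to see this cell.
        m.put("meta", "owner", new ColumnVisibility("admin&audit"), "alice");
        bw.addMutation(m);
        bw.close();

        return new int[] {
            countVisible(conn, new Authorizations("admin")),
            countVisible(conn, new Authorizations("admin", "audit"))
        };
    }

    public static void main(String[] args) throws Exception {
        int[] r = demo();
        System.out.println(r[0] + " " + r[1]);
    }
}
```

A common first-timer pitfall the talk alludes to is exactly this silent filtering: cells whose labels a scan cannot satisfy simply never appear, with no error raised.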
Accumulo with Distributed SQL Queries
Arshak Navruzyan (Argyle Data)
SQL queries are often the most requested feature of key/value stores. Argyle will present our integration of Accumulo with Facebook's PrestoDB distributed query engine. We will discuss:
- Data locality between PrestoDB and Accumulo
- Predicate pushdown for row keys
- Leveraging a secondary index for column based queries
The talk will include a live demonstration of big data benchmark queries.
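To illustrate the row-key predicate pushdown idea (our hypothetical sketch, not Argyle's actual connector), a SQL predicate such as `WHERE row BETWEEN 'user_100' AND 'user_200'` can be translated into an Accumulo scan `Range`, so tablet servers never read non-matching rows at all. The table and row names are invented; `MockInstance` keeps the example self-contained.

```java
import java.util.Map.Entry;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.mock.MockInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class PushdownSketch {
    static int demo() throws Exception {
        Connector conn = new MockInstance().getConnector("root", new PasswordToken(""));
        conn.tableOperations().create("users");

        BatchWriter bw = conn.createBatchWriter("users", new BatchWriterConfig());
        for (String row : new String[] {"user_050", "user_150", "user_250"}) {
            Mutation m = new Mutation(row);
            m.put("f", "q", "v");
            bw.addMutation(m);
        }
        bw.close();

        Scanner s = conn.createScanner("users", Authorizations.EMPTY);
        // The pushed-down predicate: only rows in [user_100, user_200] are scanned.
        s.setRange(new Range("user_100", "user_200"));
        int n = 0;
        for (Entry<Key, Value> e : s) n++;
        return n;  // only user_150 falls in the range
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

The same mechanism generalizes: any predicate the query planner can prove restricts the row key becomes a `Range` (or set of ranges), shrinking I/O before data ever reaches the SQL engine.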
Addressing Big Data Challenges Through Innovative Architecture, Databases and Software
Vijay Gadepally (Technical Staff, MIT Lincoln Laboratory)
Collecting and analyzing large amounts of data is a growing challenge within the scientific community. The growing gap between data and users calls for innovative tools that address the challenges posed by big data volume, velocity, and variety. The Massachusetts Institute of Technology's Lincoln Laboratory has taken a leading role in developing a set of tools to address these challenges. Big data volume stresses the storage, memory, and compute capacity of a computing system and requires access to a computing cloud. The velocity of big data stresses the rate at which data can be absorbed and meaningful answers produced, and can be addressed by the NSA-led Common Big Data Architecture (CBDA). Big data variety may present both the largest challenge and the greatest set of opportunities. The promise of big data is the ability to correlate diverse and heterogeneous data to generate new insights. The centerpiece of the CBDA is the NSA-developed Apache Accumulo database (capable of millions of entries per second) and the Lincoln Laboratory-developed D4M schema. This talk will concentrate on how we utilize innovative technologies in our mission to apply advanced technology to problems of national security.
Benchmarking Accumulo: How Fast Is Fast?
Mike Drob (Software Engineer, Cloudera)
Apache Accumulo has long held a reputation for enabling high-throughput operations in write-heavy workloads. In this talk, we use the Yahoo! Cloud Serving Benchmark (YCSB) to put real numbers on Accumulo performance. We then compare these numbers to previous versions, to other databases, and wrap up with a discussion of parameters that can be tweaked to improve them.
Data-Center Replication with Apache Accumulo
Josh Elser (Member of Technical Staff, Hortonworks)
Apache Accumulo presently lacks the ability to automatically replicate its contents to another Accumulo instance with low latency. The only options currently available involve quiescing a table, exporting it, copying it to the remote instance, and importing it there. This is unacceptable for a few reasons, the most important being that the table must be made unavailable for the export. This talk will outline the problems in designing a low-latency replication system for Accumulo tables, describe an implementation that leverages some useful features of Accumulo, and outline future work in the area.
Dynamically Scaling Accumulo using Docker
Sapan “Soup” Shah (Lead Engineer, 42six Solutions)
As a community we buy a lot of hardware, and currently we run Accumulo in a very static context: users provision servers up front, and many applications share the same database. As Accumulo adds more isolation features in newer versions, we take a somewhat different approach. We use Docker to provision new databases, allow all the databases to talk on a "local" network, and share a common ZooKeeper/HDFS cluster. What makes this solution even more attractive is the ability to dynamically spin up, and even better spin down, tablet servers as the database goes through peak load. Another nice advantage of this approach is that users can deploy iterators into this environment with little fear that someone else's iterator will take down their Accumulo. In the future we would like to hook into Accumulo even more deeply, using the JMX messages that the monitor currently uses to gather statistics.
Four Orders of Magnitude: Running Large Scale Accumulo Clusters
Aaron Cordova (Cofounder and Chief Technical Officer, Koverse)
Most users of Accumulo start developing applications on a single machine and want to scale up to four orders of magnitude more machines without having to rewrite. In this talk we describe techniques for designing applications for scale, planning a large-scale cluster, tuning the cluster for high-speed ingest, dealing with a large amount of data over time, and unique features of Accumulo for taking advantage of up to ten thousand nodes in a single instance. We also include the largest public metrics gathered on Accumulo clusters to date and a discussion of overcoming practical limits to scaling in the future.
Monitoring Apache Accumulo
Ravi Mutyala (Systems Architect, Hortonworks)
When we started using Apache Accumulo at large scale, our key concern was monitoring the health of the cluster. Accumulo exposes metrics through JMX. Ganglia and Nagios are the de facto metrics and monitoring tools for Hadoop clusters. We identified that integration with Ganglia, Nagios, and Apache Ambari would provide ease of use both for monitoring and for managing Accumulo clusters. We started with the Ganglia and Nagios integration, which reuses all of the existing Hadoop monitoring infrastructure for Accumulo. Our next target is Apache Ambari integration for Accumulo.
In this talk, we focus on why we need this integration and how it can be done. We will give a hands-on demonstration of the Ganglia and Nagios integration and share the status of the Ambari integration.
Open Source Graph Analysis and Visualization Powered by Accumulo
Jeff Kunkle (Director of Research and Development, Altamira Technologies)
Lumify is a relatively new open source platform for big data analysis and visualization, designed to help organizations derive actionable insights from the large volumes of diverse data flowing through their enterprise. Utilizing popular big data tools like Hadoop, Accumulo, and Storm, it ingests and integrates many kinds of data, from unstructured text documents and structured datasets, to images and video. Several open source analytic tools (including Tika, OpenNLP, CLAVIN, OpenCV, and ElasticSearch) are used to enrich the data, increase its discoverability, and automatically uncover hidden connections. All information is stored in a secure graph database implemented on top of Accumulo to support cell-level security of all data and metadata elements. A modern, browser-based user interface enables analysts to explore and manipulate their data, discovering subtle relationships and drawing critical new insights. In addition to full-text search, geospatial mapping, and multimedia processing, Lumify features a powerful graph visualization supporting sophisticated link analysis and complex knowledge representation.
This talk will blend a high-level use case demo with a more technical presentation of Lumify's underpinnings, focusing on its use of Accumulo to implement fine-grained access control of the graph data.
Past and Future Threats: Encryption and Security in Accumulo
Michael Allen (Security Architect, Sqrrl)
The early Accumulo developers made security a core part of Accumulo's codebase. As the open source community around Accumulo continues to thrive, this talk examines the current state of Accumulo's security features. The talk will detail some exciting developments in the upcoming 1.6 release, which include enhancements around encryption at rest and in motion. We will also take a broader look at new use cases suggesting a wider set of threats, and how current and future work addresses those threats.
Percolating with Accumulo
Keith Turner (Peterson Technologies)
A talk about conditional mutations and Accismus (a Percolator prototype), covering the following topics:
- Conditional mutation use cases and overview
- Conditional mutation implementation
- Percolator overview and use cases
- Percolator implementation
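The topics above can be grounded with a small sketch of our own (not from the talk) of the conditional-mutation building blocks: the update below applies only if column `meta:seq` currently holds `"1"`, giving the compare-and-swap primitive that Percolator-style transactions are built on. The column names and values are invented for illustration.

```java
import org.apache.accumulo.core.data.Condition;
import org.apache.accumulo.core.data.ConditionalMutation;

public class ConditionalSketch {
    static ConditionalMutation demo() {
        ConditionalMutation cm = new ConditionalMutation("row1");
        // The mutation is applied only if meta:seq currently equals "1".
        cm.addCondition(new Condition("meta", "seq").setValue("1"));
        cm.put("meta", "seq", "2");              // bump the sequence number
        cm.put("data", "payload", "new-value");  // the guarded update
        // On a live cluster, a ConditionalWriter obtained from
        // connector.createConditionalWriter(...) reports per-mutation status
        // (e.g. ACCEPTED or REJECTED) when asked to write cm.
        return cm;
    }

    public static void main(String[] args) {
        ConditionalMutation cm = demo();
        System.out.println(cm.getConditions().size() + " condition, "
            + cm.getUpdates().size() + " updates");
    }
}
```

If another client changes `meta:seq` first, the write is rejected atomically rather than silently clobbering the concurrent update, which is the property Percolator-style incremental processing depends on.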
SQL-on-Accumulo with Pivotal HAWQ and PXF
Oren Efraty (Architect, Pivotal Software), Zach Radtka (Senior Software Engineer, ClearEdge IT Solutions)
Pivotal Xtension Framework (PXF) support for Accumulo within HAWQ provides a fully featured and native SQL interface to data stored in Accumulo. The Accumulo/PXF module works by intelligently extracting data from Accumulo through iterators and the Accumulo APIs to deliver data to HAWQ's SQL execution engine. Data extraction is fully parallel and utilizes query predicate pushdowns for an additional performance boost. Additionally, it natively supports Accumulo's security labels functionality.
PXF is an external table interface in HAWQ, a SQL-on-Hadoop system, which allows you to read data stored within the Hadoop ecosystem. External tables can be used to load data into HAWQ from Hadoop or to query Hadoop data without materializing it into HAWQ. PXF enables analysis of HAWQ data and Hadoop data in a single query. It supports a wide range of data formats, such as Text, Avro, Hive, Sequence, and RCFile, as well as HBase and now Accumulo.
Using Accumulo to Implement Confidentiality Protection in Message Queuing
Rod Moten (Chief Scientist, PROARC)
Accumulo is primarily used as a Big Data storage facility in a clustered environment. Accumulo's columnar arrangement of rows, key-value pair indices, and cell-level security make it attractive for non-Big-Data applications as well. In this talk, we describe how to use Accumulo to implement message queuing that provides confidentiality protection. One feature of message queuing is broadcasting messages from a producer to multiple consumers. The messages could be part of a stream that the producer is providing to multiple consumers. In some cases, not all consumers should see every message in the stream. In a traditional queuing system, separate queues would be created for different levels of access, thereby duplicating the messages for each level of access. In this talk, we show how to use Accumulo to create a queuing system that does not require duplication. We also present results from experiments testing the performance of such a system under different loads, as well as results comparing the performance of streaming messages through a queuing system based on Accumulo to that of traditional queuing systems, such as Apache Qpid.