In this talk, we will show how Apache Accumulo can be used to provide quick and secure access to billions of genomic observations for clinical and research purposes.
We will start by introducing the precision medicine problem space.
Specifically, we will focus on critical challenges related to cohort analysis:
Essentially, these challenges are “two sides of the same coin”: mapping from genotype (an organism’s full hereditary information) to phenotype (an organism’s actual observed properties) and then back again. We will explore how you can define a key schema in Accumulo to move between these two “sides” easily and efficiently.
We will then demonstrate how the Accumulo SeekingFilter and well-understood constructs (such as a transpose table) can be used to address these core challenges.
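To make the transpose-table idea concrete, here is a minimal sketch in plain Python that models Accumulo's sorted keys with a sorted list. It does not use the Accumulo API, and it is not the schema presented in the talk: the row and qualifier layout, the sample IDs, and the variant strings are all hypothetical, chosen only to show how storing each entry twice (once per orientation) turns both directions of the lookup into a single seek followed by a short sequential read.

```python
from bisect import bisect_left

# Hypothetical observations as (sample_id, variant) pairs; not real data.
observations = [
    ("sample-001", "chr1:12345:A>G"),
    ("sample-001", "chr7:55555:C>T"),
    ("sample-002", "chr1:12345:A>G"),
    ("sample-003", "chr2:99999:G>A"),
]

# Main table: keys sorted by (row=sample, qualifier=variant),
# mimicking the lexicographic key order of a sorted key-value store.
main_table = sorted(observations)

# Transpose table: the same entries with row and qualifier swapped,
# so the variant becomes the leading part of the key.
transpose_table = sorted((variant, sample) for sample, variant in observations)


def scan_row(table, row):
    """Seek to the first key of `row`, then read until the row changes."""
    start = bisect_left(table, (row, ""))
    qualifiers = []
    for key_row, qualifier in table[start:]:
        if key_row != row:
            break  # keys are sorted, so the row's entries are contiguous
        qualifiers.append(qualifier)
    return qualifiers


# Which variants were observed in one sample?
variants = scan_row(main_table, "sample-001")
# Which samples carry one variant? Same scan, other table.
carriers = scan_row(transpose_table, "chr1:12345:A>G")
```

The design choice illustrated here is a trade of storage for read speed: every observation is written twice, but neither question requires a full-table scan.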
We will also discuss the access control requirements of the precision medicine domain, and how Accumulo’s cell-level security model can satisfy them from both a regulatory and an organizational perspective.
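The following sketch illustrates the idea behind cell-level security, again in plain Python rather than through the Accumulo API. It is a simplification under stated assumptions: Accumulo attaches a visibility expression to each cell (combining labels with `&` and `|`) and checks it against the authorizations of the scanning user, whereas this model stores each expression in a pre-parsed form as a list of label sets, any one of which must be fully held. The labels (`clinical`, `site_a`, `research`, `deidentified`) and the cells are hypothetical.

```python
def is_visible(visibility, authorizations):
    """Return True if the user may read a cell.

    `visibility` is a list of label sets (an OR of ANDs); the cell is
    readable when at least one set is fully contained in `authorizations`.
    An empty list models an unlabeled cell, which everyone may read.
    """
    if not visibility:
        return True
    return any(term <= authorizations for term in visibility)


# Hypothetical cells: (row, qualifier, visibility).
cells = [
    # Readable only with both clinical and site_a.
    ("sample-001", "chr1:12345:A>G", [{"clinical", "site_a"}]),
    # Readable by clinical staff at site A OR de-identified research users.
    ("sample-001", "diagnosis",
     [{"clinical", "site_a"}, {"research", "deidentified"}]),
    # Unlabeled cell.
    ("sample-002", "chr1:12345:A>G", []),
]


def scan(cells, authorizations):
    """Return the keys a user with the given authorizations can see."""
    return [(row, qual) for row, qual, vis in cells
            if is_visible(vis, authorizations)]


researcher_view = scan(cells, {"research", "deidentified"})
clinician_view = scan(cells, {"clinical", "site_a"})
anonymous_view = scan(cells, set())
```

Because the filtering happens per cell, two users running the same query over the same table simply receive different subsets of it, with no need for separate copies of the data per audience.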
Finally, we will demonstrate an implementation of these concepts using Spark and Zeppelin to analyze a dataset of several billion genomic observations. This will show how Accumulo’s distributed index gives sub-second responses to multi-criteria point queries, as well as interactive access to large datasets.