There has been a “Cambrian explosion” of big data systems proposed and evaluated in the last eight years, but relatively little understanding of how these systems, or the ideas they represent, compare with and complement one another. In enterprise and science settings, “one size is unlikely to fit all”: we see analytics teams running multiple systems simultaneously. However, the highest level of abstraction for interoperability achieved in practice is essentially the file system (for example, HDFS). At the same time, there has been some convergence around higher-level data models (relations, arrays, graphs) and higher-level computational models (relational algebra, parallel data-flow, iteration, linear algebra).
As a result, the design space seems narrower than the implementation space, suggesting an opportunity to build a common “complexity-hiding” interface to all these seemingly disparate systems, making them easier to compare, easier to use together, and perhaps faster overall by enabling cross-platform, federated optimization.
We are exploring a common programming model for big data systems, subscribing to three design principles:
Motivated by these ideas, we are working to expand the University of Washington Myria system to act as a comprehensive shared interface for big data systems, regardless of model, system, or task.
Myria is a hosted Big Data management and analytics service consisting of three components:
To register a system with MyriaMiddleware, the system designer provides four components by extending appropriate classes in the MyriaMiddleware Python library (currently called RACO):
Data scientists need to be insulated from the complexity and uncertainty that dominate systems research in big data today; we can’t ask them to learn and re-learn a new API every month, along with the algorithmic tricks and configuration practices needed for decent performance. But a shared interface to big data systems will not only make things easier for end users; it is also critical to advancing the science. Doing big data systems research today takes phenomenal effort: N systems must be installed and maintained, and M applications must be implemented and tuned on each of them. As a result, corners are cut: experiments compare just two systems, focus on only simple, narrow use cases, or both. As a field, we must make it significantly easier to do good science, evaluate realistically complex applications, and compare a variety of state-of-the-art systems. We hypothesize that a middleware layer that provides “write once, run anywhere” capabilities would be a significant step towards solving this problem.
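To make the “write once, run anywhere” idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the RACO or MyriaMiddleware API: the plan classes (`Scan`, `Select`, `Project`), the two backends, and the `readings` table are all hypothetical names invented for this example. The point is only that one backend-neutral plan can be handed unchanged to systems that execute it in very different ways.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Dict[str, object]


# A tiny backend-neutral logical plan. These classes are hypothetical
# stand-ins for a shared algebra; they are not RACO's actual classes.
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Select:
    child: object
    column: str
    minimum: float  # keep rows where row[column] > minimum


@dataclass(frozen=True)
class Project:
    child: object
    columns: Tuple[str, ...]


class InMemoryBackend:
    """Hypothetical backend that evaluates a plan over lists of dicts."""

    def __init__(self, tables: Dict[str, List[Row]]):
        self.tables = tables

    def run(self, plan) -> List[Row]:
        if isinstance(plan, Scan):
            return list(self.tables[plan.table])
        if isinstance(plan, Select):
            rows = self.run(plan.child)
            return [r for r in rows if r[plan.column] > plan.minimum]
        if isinstance(plan, Project):
            rows = self.run(plan.child)
            return [{c: r[c] for c in plan.columns} for r in rows]
        raise TypeError(f"unsupported operator: {plan!r}")


class SqlTextBackend:
    """Hypothetical backend that translates the same plan into SQL text."""

    def run(self, plan) -> str:
        if isinstance(plan, Scan):
            return f"SELECT * FROM {plan.table}"
        if isinstance(plan, Select):
            inner = self.run(plan.child)
            return f"SELECT * FROM ({inner}) AS t WHERE {plan.column} > {plan.minimum}"
        if isinstance(plan, Project):
            inner = self.run(plan.child)
            cols = ", ".join(plan.columns)
            return f"SELECT {cols} FROM ({inner}) AS t"
        raise TypeError(f"unsupported operator: {plan!r}")


# The program is written once, as a plan...
plan = Project(Select(Scan("readings"), "temp", 20), ("site",))

# ...and handed unchanged to each registered backend.
tables = {"readings": [{"site": "a", "temp": 25}, {"site": "b", "temp": 15}]}
local_result = InMemoryBackend(tables).run(plan)
sql_text = SqlTextBackend().run(plan)
```

In this sketch the user never touches a backend-specific API; only the backend classes know how a plan is executed. A real middleware layer would also need per-system optimization rules and data movement between systems, which are out of scope here.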