Sequoia: Building a compiler/interpreter for an SQL-like data analytics language using Python
Data Analytics (DA) and Machine Learning (ML) in the big data era depend heavily on handling huge volumes of data efficiently and in a generic way, exploiting standardized protocols for interface and storage compatibility. At the same time, implementation-agnostic functionality is necessary to keep applications decoupled and the DA/ML codebase easy to maintain. In many cases, SQL aggregations and statistics extensions can provide this kind of abstraction layer, but usually at the cost of very lengthy and complex queries for functionality that can be implemented in just a few lines of code, e.g., in Python, or that cannot be implemented at all if it requires advanced statistics, machine learning algorithms or hardware acceleration. “Sequoia” is a starting codebase for such a generic abstraction layer between DA/ML and data management, providing an SQL-like scripting language with rich and “dense” functionality while exposing minimal implementation details to the application level. It provides a full compiler/interpreter developed in pure Python with lex/yacc functionality, implementing DA/ML “primitives” such as unified data source import and an in-memory database, automated data pre-processing (e.g., missing-value removal, error checking, noise removal, trend removal, normalization, rescaling), data resampling, advanced statistics, nth-order regression, etc. More advanced primitives can provide adaptive signal processing for time series, including Wiener filtering, Kalman filtering, RLS/LMS filtering, etc. Furthermore, it can be easily extended with application-specific functionality, e.g., implementing 2-D convolutions via TensorFlow with only one line of the custom Sequoia language. The library is currently under development and provides interpreter functionality; future versions will also provide pre-compiled intermediate forms, in the spirit of Just-In-Time compilation, for much faster execution in the Sequoia engine.
Sequoia can also serve as an educational tool for academic courses in compiler theory and advanced programming in Python.
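To make the tokenize/parse/execute pipeline of such an interpreter concrete, here is a minimal pure-Python sketch. It is not Sequoia code: the statement syntax (`NORMALIZE ... INTO ...`, `MEAN ...`), all function names, and the dictionary standing in for the in-memory database are hypothetical, and a hand-written tokenizer and parser replace the lex/yacc tooling for brevity.

```python
import re

# Hypothetical token set for a toy SQL-like language (NOT Sequoia's grammar).
TOKEN_SPEC = [
    ("IDENT", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"NORMALIZE", "INTO", "MEAN"}


def tokenize(text):
    """Lexing stage: split a statement into (kind, value) pairs."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {value!r}")
        if value.upper() in KEYWORDS:  # keywords are case-insensitive
            kind = value = value.upper()
        tokens.append((kind, value))
    return tokens


def parse(tokens):
    """Parsing stage: turn the token list into a small AST tuple."""
    def expect(index, kind):
        if index >= len(tokens) or tokens[index][0] != kind:
            raise SyntaxError(f"expected {kind} at token {index}")
        return tokens[index][1]

    head = expect(0, tokens[0][0]) if tokens else None
    if head == "NORMALIZE":
        source = expect(1, "IDENT")
        expect(2, "INTO")
        node, length = ("normalize", source, expect(3, "IDENT")), 4
    elif head == "MEAN":
        node, length = ("mean", expect(1, "IDENT")), 2
    else:
        raise SyntaxError("statement must start with NORMALIZE or MEAN")
    if len(tokens) != length:
        raise SyntaxError("unexpected trailing tokens")
    return node


def execute(node, tables):
    """Execution stage: apply one primitive to the in-memory tables."""
    # Both primitives drop missing values (None) before computing.
    values = [v for v in tables[node[1]] if v is not None]
    if node[0] == "normalize":
        low, high = min(values), max(values)
        span = (high - low) or 1.0  # avoid division by zero for constant data
        tables[node[2]] = [(v - low) / span for v in values]
        return tables[node[2]]
    return sum(values) / len(values)  # the only other node kind is "mean"


def run(statement, tables):
    """Tokenize, parse, and execute a single statement."""
    return execute(parse(tokenize(statement)), tables)


tables = {"temps": [10.0, None, 20.0, 30.0]}
print(run("NORMALIZE temps INTO scaled", tables))
print(run("MEAN scaled", tables))
```

Each statement hides the missing-value handling and rescaling behind a single line, which is the kind of "dense" primitive the abstract describes. A real implementation would generate the lexer and parser from a grammar rather than hand-coding them.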