Databricks: All in one?
Custom binary format, ingestion, transforming thousands of column datasets, data quality testing, pipeline monitoring, data sharing... Is there anything you cannot do with Databricks?
Written by — Juuso Halonen, Data Engineer
Databricks is a unified analytics platform that claims to "accelerate innovation by unifying data science, engineering, and business". In other words, it should handle all of your data needs. But is it actually that versatile and how does it work in practice?
Let’s take a step out of the tech bubble for a moment and dig into (pun intended) the world of mining and mineral exploration.
The global need to reduce CO2 emissions speeds up the electrification of transportation. That among other megatrends such as urbanization increases the demand for metals. A circular economy cannot yet fulfill the need and thus mining is needed.
Mineral deposits occur naturally in the Earth’s crust, some deeper than others. The ones near the surface, however, have mostly been discovered already. The problem is that mineral deposits deep in the ground can be extremely hard to identify without proper techniques and technologies.
The current way of analyzing and identifying mineral samples for new and existing mines is cumbersome, to say the least. Our customer Lumo Analytics is disrupting the field by utilizing IoT devices, cloud technologies, and big data analytics. After the diamond drilling of rock samples, Lumo Analytics analyzes them with a scanner. Our role in the process is to ensure that the results are retrieved in almost real-time.
Before, the samples had to be moved physically from A to B to C in order to get the results. With the new data-intensive solution, the results can be delivered from the ground to anywhere around the globe almost immediately. This can save a significant amount of both time and money.
Taking a sample, analyzing it, and spitting it out ought to be rather simple, right? There are a few catches, though. One is the file and its format, which come with quite a few requirements. The format has to be efficient to write (read efficiency matters less) and understandable by the IoT device. It also has to be able to hold close to a million rows of compressed binary data. And yes, the data has to be compressed in order to save on upload time and storage costs.
Our initial idea was to use Protocol Buffers, which comes with a max file size of 2 GB. However, our files are a lot larger than that, even with compression. So that was a bust. Since the requirements lean towards a write-optimized, row-based format, we eventually chose Apache Avro. It happens to be well supported outside Apache’s ecosystem too, for example in Databricks.
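To make the "efficient on write, row-based, compressed" requirement concrete, here is a minimal sketch of the general idea: rows are appended one after another and compressed a block at a time. This is not the Avro file format and not our production code. The record layout (a depth value plus a binary payload) and all the names are assumptions made up for the illustration, and zlib stands in for whatever codec is actually used.

```python
import struct
import zlib

# Hypothetical record layout: sample depth (float64) and payload length (uint32),
# followed by the raw payload bytes.
RECORD_HEADER = struct.Struct("<dI")
# Hypothetical block layout: number of records and compressed size in bytes.
BLOCK_HEADER = struct.Struct("<II")


def write_block(stream, records):
    """Append one compressed block of (depth, payload) rows to the stream.

    `records` must be a list. Returns (raw_size, compressed_size) in bytes.
    """
    raw = b"".join(
        RECORD_HEADER.pack(depth, len(payload)) + payload
        for depth, payload in records
    )
    compressed = zlib.compress(raw)
    stream.write(BLOCK_HEADER.pack(len(records), len(compressed)))
    stream.write(compressed)
    return len(raw), len(compressed)


def read_blocks(stream):
    """Yield (depth, payload) rows back, decompressing one block at a time."""
    while True:
        header = stream.read(BLOCK_HEADER.size)
        if not header:
            break
        count, size = BLOCK_HEADER.unpack(header)
        raw = zlib.decompress(stream.read(size))
        offset = 0
        for _ in range(count):
            depth, length = RECORD_HEADER.unpack_from(raw, offset)
            offset += RECORD_HEADER.size
            yield depth, raw[offset:offset + length]
            offset += length
```

The writer never has to reorganize data by column or revisit earlier blocks, which is what keeps writing cheap on a small device. The price is paid on read, where every block has to be decompressed and walked row by row.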
We paired Avro with the Apache Spark framework to enable big data processing in real-time. A system using Avro and Spark can achieve better throughput than a system without. In addition, Spark can run on a variety of different systems, making it possible to use it with virtual machines provisioned in the cloud environment. If you want to know more about the research behind these findings, check out the paper by Nothaft, F., et al. (2015).
Spark can, however, be difficult to manage “manually” on a larger scale. There are many aspects to take into consideration to run Spark smoothly: setup, configuration, optimization, utilization, monitoring, and shutdowns. Fortunately, there is Databricks: a platform whose services let anyone use machines that run Spark out of the box, with simple and user-friendly notebooks. Databricks is also deeply integrated with multiple cloud providers.
In this project, Databricks enters the stage the moment the IoT device sends a file to the cloud to be analyzed. In practice, everything that happens after that takes place there. The functionalities we use in Databricks to calculate the mineral contents include Delta Live Tables, Databricks Jobs, and Delta Sharing.
Once the Avro file has landed in the cloud, its contents need to be extracted, loaded, and transformed before the analysis can proceed. This pattern is known as ELT. A feature in Databricks called Delta Live Tables provides such functionality. Basically, it leverages Apache Spark to create different datasets from the Avro file and stores them as separate tables in the data lake, where they can later be used for complex calculations.
The datasets that come out of the Delta Live Tables pipeline are configured to be partitioned at least three levels deep. Partitioning makes it more efficient to read the data, since not all available data needs to be read, just the data in the desired partition. In addition, partitioning helps to avoid overlapping and concurrent updates in some tables.
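Why partitioning saves reading can be shown with a small sketch. It assumes a Hive-style `key=value` directory layout, which is how Spark commonly lays out partitioned data. The three partition columns and the file paths below are hypothetical and are not the ones used in the project.

```python
from pathlib import PurePosixPath

# Hypothetical three-level partition scheme; the column names are illustrative only.
PARTITION_COLUMNS = ("site", "drill_hole", "scan_date")


def partition_values(path):
    """Parse key=value directory segments from a Hive-style partitioned path."""
    values = {}
    for segment in PurePosixPath(path).parts:
        key, sep, value = segment.partition("=")
        if sep and key in PARTITION_COLUMNS:
            values[key] = value
    return values


def prune(paths, **wanted):
    """Keep only the files whose partition values match every requested filter."""
    return [
        path for path in paths
        if all(partition_values(path).get(k) == v for k, v in wanted.items())
    ]


# Made-up file listing for the illustration.
FILES = [
    "scans/site=north/drill_hole=dh01/scan_date=2023-01-10/part-0.parquet",
    "scans/site=north/drill_hole=dh02/scan_date=2023-01-10/part-0.parquet",
    "scans/site=south/drill_hole=dh07/scan_date=2023-01-11/part-0.parquet",
]
```

Calling `prune(FILES, site="north", drill_hole="dh02")` leaves a single file to read. The other two are skipped based on their paths alone, without opening them. The same separation is what lets writers that target different partitions stay out of each other's way.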
This is when the actual domain-specific analytics can begin. Analytics is done with the help of Databricks Jobs, which is a way to run code on a Databricks-provisioned machine on a schedule or by using triggers. The jobs are multi-task workflows with complex dependencies. The tasks include, for example:
- Creating a minimal dataset
- Calculating elemental and mineral compositions
- Updating processing status
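A multi-task workflow with dependencies is, at heart, a directed acyclic graph. The sketch below uses Python's standard `graphlib` module (not the Databricks Jobs API) to show how such a graph resolves into stages that can run one after another. The task names follow the list above, but the dependency edges between them are assumptions made for the illustration.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
# The edges are assumed for this example, not taken from the real job definition.
DEPENDENCIES = {
    "create_minimal_dataset": set(),
    "calculate_elemental_composition": {"create_minimal_dataset"},
    "calculate_mineral_composition": {"create_minimal_dataset"},
    "update_processing_status": {
        "calculate_elemental_composition",
        "calculate_mineral_composition",
    },
}


def execution_stages(dependencies):
    """Group tasks into stages; tasks within one stage do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages
```

With these assumed edges, `execution_stages(DEPENDENCIES)` yields three stages: the minimal dataset first, the two composition calculations side by side, and the status update last.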
On some occasions, the raw data is also used to do additional calculations. This requires the implementation of cross-platform data sharing. In this case, data sharing is done with the help of yet another Databricks functionality called Delta Sharing. It enables us to securely share data with any business partner without actually requiring them to be on the Databricks platform. All they need is a way to read tables with SQL, Python, or BI tools. Simple.
The variety and flexibility of Databricks play a key part in delivering elemental and mineral compositions at certain depths. As one of our colleagues put it, we can do almost everything on Databricks in this project: from SQL endpoints for applications, using Databricks SQL Serverless on the very same Delta tables, to monitoring the pipelines with dashboards and alerts. There’s simply no need to build any additional doohickeys.
WANNA KNOW MORE?
Thought so. Let us know and we'll keep you posted on what's to come in the data game. We have another piece about Databricks and this project coming up as well...