Migration of on-premises data to the cloud at scale using Databricks

To unlock the value of your organization’s data, you must efficiently extract it from on-premises databases. Many tools can do the job, but deciding which ones to use and how to use them can be tricky. It can also be tempting to bring in high-end tools, hoping they’ll do the job more efficiently. However, if we understand the underlying architecture of database connectivity, we can write the code to handle the work ourselves, saving significant licensing costs and the time needed to learn a new framework.

Connectivity between an application and a DBMS (database management system) is typically established via JDBC or ODBC.

Java™ Database Connectivity (JDBC) is JavaSoft’s specification of a standard application programming interface (API) that allows Java programs to access database management systems. The JDBC API consists of a set of interfaces and classes written in the Java programming language.[1]
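
As a quick illustration of the JDBC API itself (outside of Spark), the sketch below opens a connection and runs a query. It is only a minimal sketch: the connection string, credentials, and table are hypothetical placeholders, and the vendor’s JDBC driver must be on the classpath.

    import java.sql.DriverManager

    // Minimal JDBC sketch. The URL, credentials, and table are placeholders;
    // the vendor's JDBC driver must be available on the classpath.
    val url  = "jdbc:sqlserver://onprem-host:1433;databaseName=SalesDB"
    val conn = DriverManager.getConnection(url, "etl_user", "***")
    try {
      val stmt = conn.createStatement()
      val rs   = stmt.executeQuery("SELECT TOP 10 customer_id, customer_name FROM dbo.Customers")
      while (rs.next()) {
        println(s"${rs.getInt("customer_id")} -> ${rs.getString("customer_name")}")
      }
    } finally {
      conn.close() // always release the connection back to the source system
    }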

ODBC is the equivalent standard, developed by Microsoft and Simba Technologies, originally for use in Microsoft Windows environments. Vendors rarely try to reinvent the workings of these industry standards. Instead, what differs under the hood is whether connections can be run concurrently, and which tools are used to achieve that goal.

There are generally two main distributed query architectures: massively parallel processing (MPP) and MapReduce.

MPP database engine

A well-known MPP database engine is PrestoDB, developed at Meta (Facebook). MPP takes advantage of multiple machines, called worker nodes, each with independent processors and memory, which coordinate to divide and conquer an otherwise large workload or query. In addition, the processing is done in memory, so results can be returned very quickly.

[Figure: PrestoDB architecture]

MapReduce

MapReduce is an algorithm originally developed by Google to process large data sets. The idea is to split the work between a mapper and a reducer. The job of the mapper is to produce key-value pairs, whose keys are hashed to divide the data into smaller subsets for processing; the job of the reducer is to aggregate the processed data for each key.

[Figure: MapReduce diagram (source: https://www.researchgate.net/figure/MapReduce-diagram-10_fig1_325848553)]
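
To make the mapper/reducer split concrete, here is a single-machine Scala sketch of the classic word-count example (an illustration of the idea, not the distributed algorithm): the map step emits (word, 1) pairs, the pairs are grouped by key, and the reduce step sums the counts per key.

    // Single-machine sketch of the MapReduce idea, using word count as the example.
    val lines = Seq("the quick brown fox", "the lazy dog")

    // Map: emit one (key, value) pair per word.
    val mapped: Seq[(String, Int)] = lines.flatMap(_.split("\\s+")).map(word => (word, 1))

    // Shuffle: group the pairs by key (on a cluster this is the network shuffle,
    // where the key's hash decides which node processes it).
    val grouped: Map[String, Seq[Int]] = mapped.groupBy(_._1).map { case (k, kv) => (k, kv.map(_._2)) }

    // Reduce: aggregate the values for each key.
    val counts: Map[String, Int] = grouped.map { case (word, ones) => (word, ones.sum) }

    counts.foreach(println) // e.g. (the,2), (quick,1), (brown,1), ...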

Spark is an enhanced take on MapReduce, processing data in memory and spilling to disk when the data is too large, whereas Hadoop MapReduce relies on disk (HDFS) for its intermediate results. According to Databricks, Spark can be 100 times faster than Hadoop MapReduce.[2]

MPP vs. MapReduce

Since everything is processed in memory with MPP, managing an extremely large data set is not easy. That is why Meta, which developed the highly efficient PrestoDB MPP engine, decided to run Presto on Spark for more scalable processing.[3]

With a managed Spark cluster, Databricks is a clear winner for data processing. In the next section, we’ll look at the JDBC interfaces provided by Spark to optimize data loading performance when interacting with traditional DBMSs.

Data load types

There are two types of data loading: batch loading, where we load the whole table in one batch, and pipeline refresh, where we split the load into a sequence of smaller extracts because of the large volume of data in the source.

In general, domain or dimension tables are small enough to be loaded as a whole, which places little load on the source system; larger tables, however, must be loaded in segments. In these scenarios, we must address the following two points:

  1. Manage parallelism

It is important to understand how parallelism works when querying a DBMS from Databricks so that we can take full advantage of the performance boost.

  2. Find a partition column

Finding the upper and lower bounds of the business key we want to extract against helps limit the load on the source system. Divide and conquer is always faster and easier than trying to load everything in one go. Most of the time, if a date column is available, this means loading one month at a time; if the data volume is huge, it may be a week or even a day at a time.
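
As a sketch of this approach, the snippet below derives month-at-a-time WHERE clauses from the lower and upper bounds of a hypothetical order_date column; the column name and date range are assumptions. The resulting predicates can be fed to the partitioned JDBC read described in the Partitioning section below.

    import java.time.LocalDate

    // Hypothetical bounds discovered from the source system (e.g. MIN/MAX of order_date).
    val lowerBound = LocalDate.of(2021, 1, 1)
    val upperBound = LocalDate.of(2021, 12, 31)

    // Build one predicate per month so the source is queried one slice at a time.
    val monthlyPredicates: Array[String] =
      Iterator.iterate(lowerBound.withDayOfMonth(1))(_.plusMonths(1))
        .takeWhile(!_.isAfter(upperBound))
        .map { monthStart =>
          val nextMonth = monthStart.plusMonths(1)
          s"order_date >= '$monthStart' AND order_date < '$nextMonth'" // order_date is a placeholder
        }
        .toArray

    monthlyPredicates.foreach(println)
    // order_date >= '2021-01-01' AND order_date < '2021-02-01'
    // order_date >= '2021-02-01' AND order_date < '2021-03-01'
    // ...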

Push-down optimization

Another performance trick is push-down optimization: applying filters in the query itself so that the source DBMS can use its indexes or primary key for faster loading.
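
One way to achieve this with Spark’s JDBC reader is to pass a parenthesized subquery in place of a table name, so the WHERE clause and column list execute on the source DBMS, where its indexes can be used. The sketch below assumes a Databricks notebook where spark is the active SparkSession; the connection details, schema, and column names are placeholders.

    import java.util.Properties

    // Placeholder connection details for an on-premises SQL Server instance.
    val jdbcUrl = "jdbc:sqlserver://onprem-host:1433;databaseName=SalesDB"
    val connProps = new Properties()
    connProps.setProperty("user", "etl_user")
    connProps.setProperty("password", "***")

    // The subquery (including its WHERE clause) runs on the source DBMS,
    // so the filter can use the source index or primary key.
    val pushdownQuery =
      """(SELECT order_id, customer_id, order_date, amount
        |   FROM dbo.Orders
        |  WHERE order_date >= '2021-01-01') AS orders_2021""".stripMargin

    val ordersDf = spark.read.jdbc(jdbcUrl, pushdownQuery, connProps)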

Partitioning

There are two interfaces to manage partitioning in Spark’s Dataframe API:

  • jdbc(url: String, table: String, partitionColumn: String, lowerBound: Long, upperBound: Long, numPartitions: Int, …)

Available in Python and Scala, this interface takes an integer partition column with integer lower and upper bounds and divides the data into the specified number of equally sized partitions. It works well when the data is evenly distributed, but poorly when the data is skewed. The downside of this interface is that it does not filter the source data, so the full dataset is returned to a DataFrame for further processing (see the sketch after this list).

  • jdbc(url: String, table: String, predicates: Array[String], …)

Available only in Scala, this interface takes a predicate array, essentially an array of WHERE clauses. Spark constructs a query for each element of the array, pushes those queries down to the database, spawns one partition per element, and returns the combined result as a single DataFrame. This allows much greater flexibility in managing the data load (see the sketch below).
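
The sketch below shows both interfaces side by side. It assumes a Databricks notebook where spark is already in scope, a hypothetical dbo.Orders table with an integer order_id key and an order_date column, and placeholder connection details; the bounds and partition counts are illustrative only.

    import java.util.Properties

    val jdbcUrl = "jdbc:sqlserver://onprem-host:1433;databaseName=SalesDB" // placeholder
    val connProps = new Properties()
    connProps.setProperty("user", "etl_user")
    connProps.setProperty("password", "***")

    // 1) Bound-based interface: Spark splits the [lowerBound, upperBound] range of the
    //    integer column into numPartitions slices, issuing one query per partition.
    //    Note: the bounds only decide partition boundaries; the whole table is still read.
    val ordersByKey = spark.read.jdbc(
      jdbcUrl,
      "dbo.Orders",
      "order_id",   // integer partition column
      1L,           // lower bound
      10000000L,    // upper bound
      8,            // number of partitions (parallel connections to the source)
      connProps)

    // 2) Predicate-based interface (Scala only): one partition per WHERE clause.
    //    Each predicate is pushed down to the database, which also filters at the source.
    val predicates = Array(
      "order_date >= '2021-01-01' AND order_date < '2021-02-01'",
      "order_date >= '2021-02-01' AND order_date < '2021-03-01'")

    val ordersByMonth = spark.read.jdbc(jdbcUrl, "dbo.Orders", predicates, connProps)

The predicate array here could just as well be the monthly list generated in the partition-column section above, giving explicit control over how many parallel queries hit the source at once.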

Other Factors to Consider

While it’s tempting to use Python for all the work, the lower-level Scala API offers greater flexibility in controlling the size of each data load. Whichever interface we choose, we must always consider the impact on the source DBMS, including the need to:

  • Choose the right partition
  • Choose the right index
  • Hit the primary key, if possible
  • Limit the number of rows returned at a time
  • Limit the number of columns returned if the entire table is not needed
  • Ensure fast connectivity between Databricks and the source, for example, over Azure ExpressRoute

Conclusion

By understanding the underlying JDBC APIs provided by Spark, we can develop our own Databricks notebooks, control the flow, and take advantage of blazing speed when extracting data from the source DBMS. And even when an organization is required to use tools such as Azure Data Factory, we now better understand their limitations and why they only accept certain parameters, so we can carefully weigh the pros and cons when selecting one.

Finally, for a deeper technical dive into these APIs, please refer to the Databricks documentation:
https://docs.databricks.com/data/data-sources/sql-databases.html

__________________________________

1 https://www.ibm.com/docs/en/informix-servers/12.10?topic=started-what-is-jdbc

2 https://databricks.com/spark/about

3 https://prestodb.io/blog/2021/10/26/Scaling-with-Presto-on-Spark

Sean N. Ayres