Case Study 4: Transitioning from User Defined Functions to Scalable Spark-Based Solutions
Problem
Previously, one of our data aggregation workflows was implemented as a monolithic Python script wrapped in a Spark User-Defined Function (UDF). This UDF processed observations sequentially, applying multiple decision algorithms (several detection rules and utility calculations) within a single execution flow. While effective for smaller datasets, this approach created performance bottlenecks and scalability issues as our observation volumes grew. Because the UDF processed each observation individually and the entire analysis ran in memory, execution times were long and memory consumption was high, making the approach impractical for large-scale, real-time data processing.
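For context, the old pattern looked roughly like the sketch below. The schema, column names, and thresholds are illustrative assumptions rather than the actual production logic; the point is that every row passes through a single Python UDF that bundles detection, utility calculation, and the decision in one call, so each invocation crosses the JVM/Python boundary and rows are serialized and deserialized one at a time.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("udf-based-aggregation").getOrCreate()

# Hypothetical observation data; schema and values are illustrative only.
observations = spark.createDataFrame(
    [("device-1", 12.4), ("device-1", 97.0), ("device-2", 33.1)],
    ["source_id", "value"],
)

result_schema = StructType([
    StructField("flag", StringType()),
    StructField("utility", DoubleType()),
])

@F.udf(returnType=result_schema)
def evaluate_observation(value):
    # Detection and utility calculation bundled into one monolithic function;
    # every row is shipped to a Python worker and back individually.
    flag = "anomaly" if value > 90.0 else "normal"
    utility = value * 0.8 if flag == "normal" else 0.0
    return (flag, utility)

scored = observations.withColumn("decision", evaluate_observation("value"))
scored.select("source_id", "value", "decision.*").show()
```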
Challenges & Solutions
Our aim was to refactor the aggregation logic to leverage Spark’s native operations and distributed computing capabilities, replacing the UDF-based approach. The new design breaks the monolithic script into distinct processing stages and uses Spark’s built-in functions and window operations to manage and optimize each stage efficiently. This refactoring allows our logic to aggregate larger datasets and higher observation volumes in a structured, scalable manner.
- UDF overhead and memory consumption: The Spark UDFs caused significant overhead because Spark had to serialize and deserialize each row, leading to inefficient execution. Processing all observations in memory also created severe memory pressure. We addressed this by refactoring the code to use Spark DataFrames and window functions, eliminating the need for UDFs. The logic can then run in parallel across the cluster’s worker nodes, which reduces memory pressure and distributes the computation load efficiently, allowing the system to scale with larger datasets.
- Code modularity and scalability: Our previous code, tightly coupled and enclosed within the UDF, made it difficult to scale or modify individual parts without affecting the entire system. We decomposed the aggregation logic into distinct phases: detection, utility property calculation, and final decision logic. Each phase is implemented as a modular Spark function that can be optimized and scaled independently (see the sketch after this list). This modularity gives us the flexibility to run each phase on clusters optimized for the specific task, such as memory-optimized nodes for ingestion or CPU-optimized nodes for processing.
- Concurrency and task scheduling: The old, UDF-based approach had limited concurrency, resulting in inefficient resource usage and task scheduling. By using the DataFrame-native functions, we enabled parallel task execution. In addition, using Spark’s scheduling capabilities, tasks are triggered based on data availability, reducing idle time and improving the overall efficiency of the pipeline.
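The following is a minimal sketch of the refactored pattern, under the same illustrative assumptions as above (hypothetical column names, detection rule, and thresholds). It shows the three phases expressed as composable DataFrame transformations, with a window function taking the place of the row-wise UDF:

```python
from pyspark.sql import SparkSession, DataFrame, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("native-aggregation").getOrCreate()

# Hypothetical observation data; column names and thresholds are illustrative.
observations = spark.createDataFrame(
    [("device-1", "2024-01-01", 12.4),
     ("device-1", "2024-01-02", 97.0),
     ("device-2", "2024-01-01", 33.1)],
    ["source_id", "event_date", "value"],
)

def detect(df: DataFrame) -> DataFrame:
    # Detection phase: compare each observation against a per-source
    # running statistic computed with a window function instead of a UDF.
    w = (Window.partitionBy("source_id")
               .orderBy("event_date")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    return (df
            .withColumn("running_avg", F.avg("value").over(w))
            .withColumn("flag",
                        F.when(F.col("value") > 1.5 * F.col("running_avg"), "anomaly")
                         .otherwise("normal")))

def score_utility(df: DataFrame) -> DataFrame:
    # Utility property calculation as a separate, independently tunable stage.
    return df.withColumn(
        "utility",
        F.when(F.col("flag") == "normal", F.col("value") * 0.8).otherwise(F.lit(0.0)))

def decide(df: DataFrame) -> DataFrame:
    # Final decision logic: aggregate per source using native expressions only.
    return (df.groupBy("source_id")
              .agg(F.sum("utility").alias("total_utility"),
                   F.sum(F.when(F.col("flag") == "anomaly", 1).otherwise(0))
                    .alias("anomaly_count")))

# Stages compose through DataFrame.transform (Spark 3.0+), so Catalyst can
# optimize the whole plan and execute it in parallel across the executors.
result = observations.transform(detect).transform(score_utility).transform(decide)
result.show()
```

Because every expression is native to Spark, the optimizer sees the full plan end to end, which is what removes the per-row serialization overhead and the memory pressure of the UDF version while keeping each phase independently testable and scalable.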
Quantitative Impact
Our transition from a sequential, UDF-based approach to a modular, Spark-native solution improved performance well beyond our initial expectations. Below is a detailed comparison across the different data-volume experiments:
Table 6: Load test experiments
| Data volume | Metric             | Old logic | New logic | Improvement |
|-------------|--------------------|-----------|-----------|-------------|
| 10 million  | Total task runtime | 15m 7s    | 8m 33s    | 43.45%      |
|             | Logic runtime      | 7m 9s     | 57s       | 86.71%      |
|             | Databricks costs   | $1.43     | $0.76     | 46.85%      |
|             | AWS costs          | $1.08     | $0.21     | 80.56%      |
| 100 million | Total task runtime | 55m 46s   | 10m 31s   | 81.15%      |
|             | Logic runtime      | 29m 54s   | 1m 11s    | 96.04%      |
|             | Databricks costs   | $8.79     | $0.83     | 90.56%      |
|             | AWS costs          | $9.13     | $3.29     | 63.98%      |
| 1 billion   | Total task runtime | 7h 24m    | 14m 32s   | 96.73%      |
|             | Logic runtime      | 4h 50m    | 3m 10s    | 98.91%      |
|             | Databricks costs   | $119.41   | $1.43     | 98.80%      |
|             | AWS costs          | $101.69   | $4.38     | 95.69%      |
These results demonstrate the significant performance and cost improvements achieved through the refactored approach. The new modular, distributed design reduces the logic runtime for the 100-million and 1-billion observation experiments by roughly 96% and 99%, and lowers operational costs by as much as 90% and 98% respectively. The structured approach exceeded our expectations, delivering a scalable and efficient solution for processing large datasets. This fast and cost-effective behaviour supports our future growth in processing increasing data volumes.
On to Case Study 5: Reduction in DataFrame Operations.