Improving Performance with Adaptive Query Execution in Apache Spark 3.0

August 14, 2023

Apache Spark, the popular distributed computing framework, has taken a significant leap forward with the release of Apache Spark 3.0. Packed with new features and enhancements, Spark 3.0 introduces Adaptive Query Execution (AQE) along with several other advancements that enhance performance and usability. In this blog post, we will delve into the key features of Spark 3.0.
Adaptive Query Execution (AQE)
Adaptive Query Execution is a game-changer introduced in Spark 3.0. It addresses the limitations of traditional static execution plans by dynamically optimizing query execution based on runtime statistics. AQE leverages runtime feedback to make informed decisions and adjust the execution plan accordingly. This results in improved performance by adapting to the actual data and query conditions.
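As a minimal sketch, AQE can be switched on for an existing session (it is disabled by default in Spark 3.0); `spark` is assumed to be an already-created `SparkSession`:

```python
# Assumes an existing SparkSession named `spark`.
# AQE is disabled by default in Spark 3.0 and must be enabled explicitly.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# The two main runtime re-optimizations discussed below; both apply only
# once AQE itself is enabled.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```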
Join Optimizations
AQE can automatically convert a sort-merge join into a broadcast-hash join at runtime, when post-shuffle statistics show that one side of the join is small enough to broadcast, further simplifying tuning and improving performance.
Skewed joins can lead to an extreme imbalance of work and severely degrade performance. When AQE detects skew from the shuffle file statistics, it can split the skewed partitions into smaller ones and join them with the corresponding partitions from the other side. This optimization parallelizes skew processing and achieves better overall performance.
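The skew-join handling can be tuned with a couple of configuration properties (the values below are illustrative, not necessarily the defaults; see the Spark SQL tuning guide for the exact semantics):

```python
# Assumes an existing SparkSession named `spark` with AQE enabled.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# A partition is treated as skewed when it is both larger than
# `skewedPartitionFactor` times the median partition size and larger than
# `skewedPartitionThresholdInBytes`. Example values:
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```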
Join Hints
Join strategy selection is based on statistics and heuristics. When the optimizer is unable to make the best choice on its own, users can use join hints to influence it toward a better plan. This release extends the existing join hints (such as BROADCAST) with three new ones: SHUFFLE_MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL.
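The hints name the relation they apply to, using SQL comment syntax. A sketch with hypothetical tables `orders` and `customers` registered as temporary views:

```python
# Assumes an existing SparkSession named `spark` and that `orders` and
# `customers` (hypothetical names) are registered as temp views.
# The hint asks Spark to build the hash side from `customers` when a
# shuffle-hash join is applicable.
df = spark.sql("""
    SELECT /*+ SHUFFLE_HASH(customers) */ o.id, c.name
    FROM orders o JOIN customers c ON o.customer_id = c.id
""")
```

The same hints are available on the DataFrame API via `df.hint("shuffle_hash")` and friends.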
Dynamic Partition Pruning
Querying partitioned tables becomes more efficient in Spark 3.0 with Dynamic Partition Pruning. This optimization technique skips unnecessary partitions based on query predicates, significantly reducing the amount of data read from disk. This improvement enhances query performance and reduces resource utilization.
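A typical case is a star-schema join where a filter on a small dimension table is turned into a runtime filter on the fact table's partition column. A sketch with hypothetical tables (`sales` partitioned by `date_id`, `dates` a small dimension table):

```python
# Assumes an existing SparkSession named `spark`; table names are
# illustrative. Dynamic partition pruning is on by default in Spark 3.0.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Only the `sales` partitions whose date_id matches dates in 2023 are
# scanned; the rest are skipped at runtime.
df = spark.sql("""
    SELECT s.amount
    FROM sales s JOIN dates d ON s.date_id = d.date_id
    WHERE d.year = 2023
""")
```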
Shuffle Partition Optimization
A large number of small shuffle partitions can significantly slow application execution because of per-task scheduling overhead. Spark 3.0 simplifies, or even avoids, tuning the number of shuffle partitions: users can set a relatively large number of shuffle partitions up front, and AQE will coalesce adjacent small partitions into larger ones at runtime.
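The coalescing idea can be illustrated with a simplified, self-contained sketch. This is an illustration of the strategy only, not Spark's actual implementation; the target size plays a role similar to `spark.sql.adaptive.advisoryPartitionSizeInBytes`:

```python
def coalesce_partitions(sizes, target):
    """Greedily merge *adjacent* shuffle partitions until each merged
    group reaches roughly `target` bytes (simplified illustration)."""
    groups, current = [], 0
    for size in sizes:
        current += size
        if current >= target:
            groups.append(current)
            current = 0
    if current:  # leftover tail group
        groups.append(current)
    return groups

# Ten small 8 MB partitions coalesce into three larger ones at a 32 MB target.
print(coalesce_partitions([8] * 10, 32))  # → [32, 32, 16]
```

Note that only adjacent partitions are merged, which is what lets AQE do this cheaply without an extra shuffle.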
ANSI SQL Compliance
Spark 3.0 moves Spark SQL closer to the ANSI SQL standard. An optional ANSI mode enables standard-compliant behavior such as runtime overflow checking and stricter casts, and the release adds an ANSI store assignment policy and reserved keywords in the parser, making Spark's SQL interface more predictable and aligned with industry standards.
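The ANSI behavior is opt-in; a minimal sketch, assuming an existing `SparkSession` named `spark`:

```python
# Assumes an existing SparkSession named `spark`.
# Off by default in Spark 3.0. When enabled, operations such as integer
# overflow or invalid casts raise errors instead of silently returning
# NULL or wrapping around.
spark.conf.set("spark.sql.ansi.enabled", "true")
```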
Improved Structured Streaming
Structured Streaming in Spark 3.0 receives notable enhancements. Continuous processing mode, introduced experimentally in Spark 2.3, provides low-latency processing for near-real-time analytics, and stateful streaming aggregations together with the Apache Kafka integration remain core strengths. Spark 3.0 itself adds a dedicated Structured Streaming tab to the web UI for monitoring running queries and their statistics.
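A sketch of a low-latency pipeline using the continuous trigger; the Kafka server and topic names are placeholders, and the job assumes an existing `SparkSession` named `spark`:

```python
# Hypothetical Kafka source; bootstrap server and topic are placeholders.
# `trigger(continuous=...)` selects continuous processing with the given
# checkpoint interval instead of micro-batching.
query = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")
    .option("subscribe", "events")
    .load()
    .writeStream
    .format("console")
    .trigger(continuous="1 second")
    .start())
```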
Pandas UDFs
Pandas UDFs were initially introduced in Spark 2.3 for scaling user-defined functions in PySpark and integrating pandas APIs into PySpark applications. However, the existing interface had become difficult to understand as more UDF types were added. This release introduces a new pandas UDF interface that leverages Python type hints to address the proliferation of pandas UDF types.
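A minimal sketch of the type-hint-based style (function and column names are illustrative). The key point is that the Python type hints, rather than a separate `PandasUDFType` argument, tell Spark what kind of pandas UDF this is:

```python
import pandas as pd

# The vectorized logic a Series-to-Series pandas UDF runs: it receives and
# returns whole pandas Series rather than one row at a time.
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

# With PySpark available, the same function is declared as a pandas UDF
# purely through its type hints (no PandasUDFType needed in Spark 3.0):
#
#   from pyspark.sql.functions import pandas_udf
#
#   @pandas_udf("long")
#   def plus_one(s: pd.Series) -> pd.Series:
#       return s + 1
#
#   df.select(plus_one(df.value))

print(plus_one(pd.Series([1, 2, 3])).tolist())  # → [2, 3, 4]
```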
Scala 2.12 Support
Spark 3.0 drops support for Scala 2.11 and embraces Scala 2.12. This update allows users to leverage the latest features and improvements in Scala while utilizing the power of Spark. Scala 2.12 brings performance improvements and new language features that enhance the development experience.
Accelerator-Aware Scheduling
Spark 3.0 introduces Accelerator-Aware Scheduling, enabling Spark to leverage hardware accelerators such as GPUs for machine learning and data processing workloads. Jobs can request accelerator resources per executor and per task, unlocking significant performance gains in GPU-enabled environments.
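A sketch of requesting GPUs through the resource configuration properties; the discovery-script path is a placeholder that the cluster must provide, and the exact setup depends on the cluster manager:

```python
from pyspark.sql import SparkSession

# Hypothetical GPU-enabled session: one GPU per executor and per task.
# The discovery script (a cluster-specific placeholder here) tells Spark
# which GPU addresses each executor can see.
spark = (SparkSession.builder
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .config("spark.executor.resource.gpu.discoveryScript", "/path/to/getGpus.sh")
    .getOrCreate())
```

Inside a task, the assigned GPU addresses can then be read from `TaskContext.get().resources()["gpu"].addresses`.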
Conclusion
Apache Spark 3.0 marks a significant milestone in the evolution of the distributed computing framework. The introduction of Adaptive Query Execution revolutionizes query processing by dynamically optimizing execution plans, resulting in improved performance. Additionally, features like Pandas UDFs, Dynamic Partition Pruning, ANSI SQL compliance, and improved Structured Streaming further enhance the usability and versatility of Spark.
With Spark 3.0, data engineers and scientists can take advantage of the latest advancements, unlock higher performance, and derive valuable insights from large-scale datasets. Whether it’s accelerating data processing, performing complex data manipulations, or building real-time analytics pipelines, Spark 3.0 provides a robust platform for big data analytics.