
Exploring the Power of Apache Spark 3.0: Adaptive Query Execution and More


Apache Spark, the popular distributed computing framework, has taken a significant leap forward with the release of Apache Spark 3.0. Packed with new features and enhancements, Spark 3.0 introduces Adaptive Query Execution (AQE) along with several other advancements that enhance performance and usability. In this blog post, we will delve into the key features of Spark 3.0.

Adaptive Query Execution (AQE)

Adaptive Query Execution is a game-changer introduced in Spark 3.0. It addresses the limitations of traditional static execution plans by dynamically optimizing query execution based on runtime statistics. AQE leverages runtime feedback to make informed decisions and adjust the execution plan accordingly. This results in improved performance by adapting to the actual data and query conditions.
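As a minimal PySpark sketch, AQE is switched on through a single configuration property (it is off by default in Spark 3.0); the application name below is just an illustration:

from pyspark.sql import SparkSession

# Enable Adaptive Query Execution; plans are re-optimized mid-query
# using runtime shuffle statistics instead of static estimates alone.
spark = (
    SparkSession.builder
    .appName("aqe-demo")  # illustrative application name
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)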

Join Optimizations

AQE can automatically convert a sort-merge join into a broadcast-hash join at runtime when runtime statistics show that one side of the join is small enough to broadcast, further simplifying tuning and improving performance.

Skewed joins can lead to an extreme imbalance of work and severely degrade performance. When AQE detects skew from the shuffle file statistics, it can split the skewed partitions into smaller ones and join them with the corresponding partitions from the other side. This optimization parallelizes the processing of skewed data and achieves better overall performance.
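A hedged sketch of the relevant Spark 3.0 settings, assuming an existing SparkSession named spark and illustrative DataFrames orders and customers:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# A partition is treated as skewed when it is both larger than this factor
# times the median partition size and larger than the byte threshold below.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")

# A plain equi-join; if runtime statistics show one side is small enough,
# AQE may also rewrite the sort-merge join into a broadcast-hash join.
orders.join(customers, "customer_id")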

Join Hints

Join algorithm selection is based on statistics and heuristics. When Spark is unable to make the best choice on its own, users can use join hints to influence the optimizer to pick a better plan. Spark 3.0 extends the existing join hints (such as BROADCAST) with three new ones: SHUFFLE_MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL.
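For illustration, the new hints can be supplied either in SQL comments or through the DataFrame API; the table and column names below are hypothetical:

spark.sql("""
    SELECT /*+ SHUFFLE_HASH(c) */ o.order_id, c.name
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
""")

# The same hints are available on the DataFrame API.
orders.join(customers.hint("shuffle_merge"), "customer_id")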

Dynamic Partition Pruning

Querying partitioned tables becomes more efficient in Spark 3.0 with Dynamic Partition Pruning. This optimization technique skips unnecessary partitions based on query predicates, significantly reducing the amount of data read from disk. This improvement enhances query performance and reduces resource utilization.
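A brief sketch, assuming a fact table sales partitioned by sale_date and a small dimension table dates (both names are illustrative); the optimizer setting shown is enabled by default in Spark 3.0:

spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Only the partitions of 'sales' whose sale_date values survive the filter
# on the dimension table are read from disk, instead of scanning everything.
spark.sql("""
    SELECT s.amount, d.day_name
    FROM sales s
    JOIN dates d ON s.sale_date = d.sale_date
    WHERE d.day_name = 'Monday'
""")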

Shuffle Partition Optimization

A large number of small shuffle partitions can significantly slow down an application. Spark 3.0 simplifies, or even avoids, tuning the number of shuffle partitions. Users can set a relatively large number of shuffle partitions up front, and AQE will combine adjacent small partitions into larger ones at runtime.
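A minimal sketch using the standard Spark 3.0 settings, assuming an existing SparkSession named spark; the partition count is just an illustrative starting value:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Start with a deliberately generous number of shuffle partitions...
spark.conf.set("spark.sql.shuffle.partitions", "1000")
# ...and let AQE merge adjacent small partitions toward this advisory size.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")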

ANSI SQL Compliance

Spark 3.0 enhances its SQL capabilities by adding support for several ANSI SQL features. Users can now leverage improved subquery support, ANSI SQL window functions, and additional SQL functions, making Spark’s SQL interface more powerful and aligned with industry standards.

Improved Structured Streaming

Structured Streaming in Spark 3.0 receives notable enhancements. Continuous processing mode, introduced experimentally in Spark 2.3, provides low-latency processing for near real-time analytics. Stateful streaming aggregations and better integration with Apache Kafka further strengthen Spark's streaming capabilities, and Spark 3.0 also introduces a dedicated Structured Streaming tab in the web UI for monitoring streaming queries.
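A hedged sketch of a continuous-mode pipeline; the broker address and topic name are illustrative, and an existing SparkSession named spark is assumed:

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # illustrative broker
    .option("subscribe", "events")                      # illustrative topic
    .load()
)

# The continuous trigger requests ~1 second checkpoints instead of micro-batches.
query = (
    stream.writeStream
    .format("console")
    .trigger(continuous="1 second")
    .start()
)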

Pandas UDFs

Pandas UDFs were initially introduced in Spark 2.3 for scaling user-defined functions in PySpark and integrating pandas APIs into PySpark applications. However, the existing interface had become difficult to understand as more UDF types were added. This release introduces a new pandas UDF interface that leverages Python type hints to address the proliferation of pandas UDF types.
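A short example of the new style; the function name and the DataFrame df with a temp_c column are hypothetical:

import pandas as pd
from pyspark.sql.functions import pandas_udf

# New in Spark 3.0: the UDF variant is inferred from the Python type hints
# (pd.Series -> pd.Series) instead of an explicit PandasUDFType argument.
@pandas_udf("double")
def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    return c * 9 / 5 + 32

# Illustrative usage on a hypothetical DataFrame with a 'temp_c' column.
df.select(celsius_to_fahrenheit("temp_c").alias("temp_f"))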

Scala 2.12 Support

Spark 3.0 drops support for Scala 2.11 and embraces Scala 2.12. This update allows users to leverage the latest features and improvements in Scala while utilizing the power of Spark. Scala 2.12 brings performance improvements and new language features that enhance the development experience.

Accelerator-Aware Scheduling

Spark 3.0 introduces Accelerator-Aware Scheduling, enabling Spark to leverage hardware accelerators such as GPUs for machine learning and data processing workloads. Spark can now request, discover, and assign accelerator resources at the executor and task level, unlocking significant performance gains in GPU-enabled environments.
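A hedged sketch of the GPU resource properties, assuming a cluster manager that supports GPU scheduling; the discovery script path is a placeholder:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # One GPU per executor and per task; the script reports available GPUs.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
    .getOrCreate()
)

# Inside a task, the GPU addresses assigned by the scheduler can be inspected.
from pyspark import TaskContext
def assigned_gpus(_):
    return TaskContext.get().resources()["gpu"].addresses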

Conclusion

Apache Spark 3.0 marks a significant milestone in the evolution of the distributed computing framework. The introduction of Adaptive Query Execution revolutionizes query processing by dynamically optimizing execution plans, resulting in improved performance. Additionally, features like Pandas UDFs, Dynamic Partition Pruning, ANSI SQL compliance, and improved Structured Streaming further enhance the usability and versatility of Spark.

With Spark 3.0, data engineers and scientists can take advantage of the latest advancements, unlock higher performance, and derive valuable insights from large-scale datasets. Whether it’s accelerating data processing, performing complex data manipulations, or building real-time analytics pipelines, Spark 3.0 provides a robust platform for big data analytics.
