Hadoop Developer In Real World 3 node Hadoop cluster will be shut down permanently on June 30th 2024. Reason for the shutdown Hadoop Developer In Real World […]
Apache Spark is a powerful open-source distributed computing system used for big data processing. However, sometimes you may need to kill a running Spark application for […]
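On a YARN deployment, a running Spark application is typically stopped through the YARN CLI. A minimal sketch, assuming YARN is the resource manager; the application ID below is illustrative:

```shell
# List running applications to find the Spark app's ID
yarn application -list -appStates RUNNING

# Kill the application by its ID (ID below is illustrative)
yarn application -kill application_1234567890123_0042
```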
This is going to be a short post. Number of executors in YARN deployments spark.executor.instances controls the number of executors in YARN. By default, the number […]
Number of cores spark.executor.cores controls the number of cores available for the executors. By default, it is 1 core per executor in YARN and all available […]
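The two properties above can be set at submit time. A minimal sketch; the executor and core counts and the application file name are illustrative, not recommendations:

```shell
# Request 4 executors with 2 cores each on YARN (values are illustrative)
spark-submit \
  --master yarn \
  --conf spark.executor.instances=4 \
  --conf spark.executor.cores=2 \
  my_app.py
```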
Apache Spark, the popular distributed computing framework, has been widely adopted for processing large-scale data. With the release of Apache Spark 3.0, a groundbreaking feature called […]
Apache Spark, the popular distributed computing framework, has taken a significant leap forward with the release of Apache Spark 3.0. Packed with new features and enhancements, […]
Both spark.sql.shuffle.partitions and spark.default.parallelism control the number of tasks that get executed at runtime, thereby controlling the distribution and parallelism. This means both properties have […]
In modern software systems, data is often generated and consumed in real-time. To handle these data streams, various processing techniques have been developed, including stream processing […]
This is a pretty common requirement and here is the solution. Solution Let’s create a bucket named hirw-sample-aws-bucket first. [osboxes@wk1 ~]$ aws s3 mb s3://hirw-sample-aws-bucket Use […]
Simple problem with a simple solution. In this post we will see how to delete an index in Elasticsearch. Solution We have 3 indices in Elasticsearch […]
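Deleting an index in Elasticsearch comes down to a single DELETE request against the index name. A sketch assuming a local node on the default port; the index name is illustrative:

```shell
# Delete a single index (index name is illustrative)
curl -X DELETE "localhost:9200/my_index"

# Verify it is gone by listing the remaining indices
curl -X GET "localhost:9200/_cat/indices?v"
```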
In this post we will see how to recursively delete files/objects, folders and bucket from S3. Recursively deleting a folder in S3 rm --recursive followed by […]
Kafka uses Zookeeper to manage its internal state. So it is not possible to run Kafka without Zookeeper. Even if you don’t have access to Zookeeper […]
Here is our data. We have an employee DataFrame with 3 columns: name, project and cost_to_project. An employee can belong to multiple projects and for each […]
If your organization is working with lots of data you might be leveraging Spark for distributed computation. You could also potentially have some or all of your […]