A common problem with a pretty simple solution. Solution: when listing a directory, use the -R option to list its subdirectories recursively. If you […]
When you search for or look up a document, Elasticsearch by default returns all the fields in the document. $ curl -X GET "localhost:9200/account/_doc/954?pretty" { […]
The cache() and persist() functions are used to cache the intermediate results of an RDD, DataFrame, or Dataset. You can mark an RDD, DataFrame or Dataset to […]
HDFS divides files into blocks and stores the blocks on the local disks of the datanodes. The location varies from cluster to cluster based on the configuration in hdfs-site.xml […]
Here are three high-level situations in which using cache() or persist() is appropriate and helpful. Performing multiple actions on the same RDD: numbersRDD is used […]
Both INNER JOIN and LEFT SEMI JOIN return the matching records between two tables, with a subtle difference. Let’s consider 2 tables – employee and employee_department_mapping with […]
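As a rough illustration of that difference, here is a minimal plain-Python sketch rather than SQL or Spark code. The sample rows, the emp_id, name and dept columns, and both helper functions are hypothetical, since the excerpt does not show the tables' contents. It assumes the usual semantics: an inner join emits one row per matching pair with columns from both tables, while a left semi join emits each matching left row once, with left columns only.

```python
# Hypothetical sample rows; the column names are assumptions for illustration.
employee = [
    {"emp_id": 1, "name": "Ann"},
    {"emp_id": 2, "name": "Bob"},
    {"emp_id": 3, "name": "Cara"},
]
employee_department_mapping = [
    {"emp_id": 1, "dept": "Sales"},
    {"emp_id": 1, "dept": "Marketing"},
    {"emp_id": 2, "dept": "Engineering"},
]


def inner_join(left, right, key):
    """One output row per matching (left, right) pair, with columns from both sides."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def left_semi_join(left, right, key):
    """Each left row at most once, left columns only, if any right row matches."""
    right_keys = {r[key] for r in right}
    return [dict(l) for l in left if l[key] in right_keys]


inner_rows = inner_join(employee, employee_department_mapping, "emp_id")
semi_rows = left_semi_join(employee, employee_department_mapping, "emp_id")

print(len(inner_rows))  # Ann appears twice here, once per department
print(len(semi_rows))   # Ann appears once here, with no dept column
```

In this sketch the employee mapped to two departments is duplicated by the inner join but not by the left semi join, and the unmatched employee is dropped by both.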
Over time we have written several posts on Spark joins and join algorithms, explaining the internal workings of these algorithms. Here are all the posts […]
Ever wondered about the differences between RabbitMQ and Kafka, and why Kafka is needed when there is already RabbitMQ? Read through the post and you will […]
Shuffle-and-Replication does not mean a “true” shuffle, in which records with the same key are sent to the same partition. Instead the entire partition of the […]
EBS, S3 and Glacier are different storage options available in Amazon Web Services. They differ in cost, use case and purpose. Here is a super quick […]
Broadcast Nested Loop join works by broadcasting one of the entire datasets and performing a nested loop to join the data. So essentially every record from […]
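The idea can be sketched in plain Python as a conceptual model only; this is not Spark's implementation. The function name, the sample data, and the range-based join condition are all hypothetical. The sketch assumes an inner join in which the smaller dataset has already been copied in full to wherever each partition of the other dataset is processed.

```python
def broadcast_nested_loop_join(streamed_partitions, broadcast_rows, condition):
    """Inner-join sketch: compare every streamed row with every broadcast row."""
    joined = []
    for partition in streamed_partitions:      # each partition is processed independently
        for left in partition:                 # outer loop: rows of the non-broadcast side
            for right in broadcast_rows:       # inner loop: the full broadcast copy
                if condition(left, right):     # any predicate works, not just equality
                    joined.append((left, right))
    return joined


# Hypothetical data: orders split into two partitions, plus a small lookup table.
orders = [[("o1", 5), ("o2", 25)], [("o3", 60)]]
price_bands = [("low", 0, 20), ("mid", 20, 50), ("high", 50, 100)]

# A non-equi condition: the order amount falls inside the band's range.
matches = broadcast_nested_loop_join(
    orders, price_bands, lambda order, band: band[1] <= order[1] < band[2]
)
print(matches)
```

The cost grows with the product of the two dataset sizes, which is why the nested loop only makes sense when one side is small enough to broadcast.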
The answer to this question depends on the version of Hadoop you are using. Older versions of Hadoop (1.x.x): The first argument of copyFromLocal is restricted […]
Elasticsearch offers 2 different contexts for querying or filtering documents in an index: Query context and Filter context. Query context: In the […]
Take a look at the execution plan below. Currently, when you print the executed plan, you see that Spark is using Sort Merge Join. scala> dfJoined.queryExecution.executedPlan […]