How to find directories in HDFS which are older than N days?

Cleaning up older or obsolete files in HDFS is important. Even if you have a big cluster with a lot of space, without good cleanup scripts little things add up, and before you know it you will run out of space in your cluster.

HDFS does not have an out-of-the-box command to list all the directories that are older than N days, but you can write a simple script to do so.

Here is a small script to list directories older than 10 days.

Script
now=$(date +%s)
hadoop fs -ls -R | grep "^d" | while read f; do
  dir_date=`echo $f | awk '{print $6}'`
  difference=$(( ( $now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ $difference -gt 10 ]; then
    echo $f
  fi
done
The hadoop fs -ls -R command lists all the files and directories in HDFS. grep "^d" gets you only the directories, since a directory's permission string starts with d. Then, with while..do, we loop through each directory.

hadoop fs -ls -R | grep "^d" | while read f; do
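For reference, each line of hadoop fs -ls -R output looks like the one below (the path and date here are made up for illustration). The columns are permissions, replication, owner, group, size, date, time, and path, so the date is field 6 and the path is field 8:

drwxr-xr-x   - hdfs supergroup          0 2016-02-18 14:22 /user/hdfs/staging/2016-02-18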
awk '{print $6}' gets the date of the directory and saves it in dir_date.

dir_date=`echo $f | awk '{print $6}'`
The line below calculates the difference between the directory's date and the current date, and converts that difference into a number of days.

difference=$(( ( $now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
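To see the arithmetic: date +%s returns seconds since the Unix epoch, so the subtraction gives the directory's age in seconds, and dividing by 86400 (24 * 60 * 60) turns that into whole days via bash integer division. A quick worked example with illustrative timestamps:

# now:              date +%s                 -> 1457000000
# directory date:   date -d "2016-02-18" +%s -> 1455753600
# age in days:      (1457000000 - 1455753600) / 86400 = 14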
Finally, print the directory if the difference is more than 10 days.

if [ $difference -gt 10 ]; then
  echo $f
fi
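Once you trust the listing, a natural next step for a cleanup script is to delete the stale directories instead of just echoing them. Here is a minimal sketch, assuming the default ls output format where the path is the 8th column; /tmp/staging is a hypothetical path, so substitute your own, and run it with echo in front of the hadoop fs -rm line first to verify what would be removed:

now=$(date +%s)
hadoop fs -ls -R /tmp/staging | grep "^d" | while read f; do
  dir_date=`echo $f | awk '{print $6}'`
  difference=$(( ( $now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ $difference -gt 10 ]; then
    # Field 8 of the listing is the full HDFS path of the directory.
    dir_path=`echo $f | awk '{print $8}'`
    # -rm -r removes the directory recursively; deleted items go to the
    # HDFS trash by default (add -skipTrash to bypass it, once confident).
    hadoop fs -rm -r "$dir_path"
  fi
done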