Input For Page Ranking Using Hadoop - Big Data In Real World

Input For Page Ranking Using Hadoop


If you are new to Hadoop, you are probably tired of WordCount and want to get hands-on with some real use cases. Page ranking is an excellent use case for Hadoop, and working through it helps you understand the true power of Hadoop. Wikipedia exposes its pages in XML format, which can be used as input for a page-ranking MapReduce program.
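To make concrete what a page-ranking job needs from this XML, here is a minimal Python sketch that turns exported pages into (title, outgoing links) pairs, which is the link graph a ranking algorithm works on. It is an illustration under stated assumptions, not part of any Hadoop API: the function names are hypothetical, the SAMPLE document is made up for the example, and the code assumes the export nests title and text elements inside each page element. Links are found with a simple regular expression over the wiki markup, so links such as [[File:...]] are not filtered out.

```python
import re
import xml.etree.ElementTree as ET

# Captures the target of a wiki link such as [[Target]], [[Target|label]]
# or [[Target#Section]]; everything after "|" or "#" is ignored.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)")


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def pages_with_links(xml_text):
    """Yield (title, sorted outgoing link targets) for each page element."""
    root = ET.fromstring(xml_text)
    for page in root.iter():
        if local_name(page.tag) != "page":
            continue
        title, text = None, ""
        for node in page.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                # If a page has several revisions, the last one wins.
                text = node.text or ""
        links = sorted({m.strip() for m in LINK_RE.findall(text) if m.strip()})
        yield title, links


# Made-up sample in the general shape of an export file (assumption:
# the real namespace version may differ, so the code ignores it).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Apache Hadoop</title>
    <revision>
      <text>Hadoop is a project of the [[Apache Software Foundation]].
See also [[MapReduce|the MapReduce model]].</text>
    </revision>
  </page>
</mediawiki>"""

for title, links in pages_with_links(SAMPLE):
    print(title, "->", links)
```

In a MapReduce setting, the same per-page logic would typically sit in the map step, emitting each page title together with its outgoing links.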

Wikipedia makes XML dumps of its pages available for download, but the files are gigabytes in size, which makes them hard to work with for new learners who have only a few nodes in their Hadoop cluster. The instructions below explain how to export an XML file for just a few pages. The result is a much smaller file, which is great for testing and learning.

Get a List

Get a list of Wikipedia pages you would like to rank. For this example, we will use the following pages.

http://en.wikipedia.org/wiki/Apache_Software_Foundation
http://en.wikipedia.org/wiki/Apache_Hadoop

 

Use the page titles Apache_Software_Foundation and Apache_Hadoop for the extract step below.
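If your list is longer than two pages, copying titles by hand gets tedious. As a small sketch, the page title is simply the part of the URL after /wiki/, so it can be pulled out with Python's standard library. The helper name title_from_url is hypothetical and only for illustration.

```python
from urllib.parse import urlparse, unquote


def title_from_url(url):
    """Return the page title from a URL of the form .../wiki/<Title>."""
    path = urlparse(url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError(f"not a /wiki/ page URL: {url}")
    # unquote() decodes percent-escapes, e.g. %28 becomes "(".
    return unquote(path[len(prefix):])


urls = [
    "http://en.wikipedia.org/wiki/Apache_Software_Foundation",
    "http://en.wikipedia.org/wiki/Apache_Hadoop",
]
titles = [title_from_url(u) for u in urls]
print(titles)  # ['Apache_Software_Foundation', 'Apache_Hadoop']
```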

Extract

Go to the URL below.

http://en.wikipedia.org/w/index.php?title=Special:Export

Enter the page titles from the previous step in the space provided, one per line, and hit Export to download the XML file. Use this file as input for your page-ranking program.
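The same export can probably be scripted instead of done through the browser form. The sketch below is an assumption-laden illustration, not an official client: it assumes Special:Export accepts a POST with a `pages` field (titles separated by newlines) and a `curonly` field (current revision only), as described in the MediaWiki manual, and it uses the https form of the URL above. The function names, the User-Agent string, and the output file name are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# https variant of the Special:Export URL given above.
EXPORT_URL = "https://en.wikipedia.org/w/index.php?title=Special:Export"


def build_export_request(titles, current_only=True):
    """Build (but do not send) a POST request for the given page titles."""
    fields = {"pages": "\n".join(titles)}  # assumed field: one title per line
    if current_only:
        fields["curonly"] = "1"  # assumed field: latest revision only
    body = urlencode(fields).encode("utf-8")
    # A descriptive User-Agent is good practice; this value is made up.
    headers = {"User-Agent": "pagerank-input-example/0.1 (learning exercise)"}
    return Request(EXPORT_URL, data=body, headers=headers)


def download_export(titles, out_path):
    """Send the request and save the returned XML to out_path."""
    with urlopen(build_export_request(titles)) as resp:
        with open(out_path, "wb") as out:
            out.write(resp.read())


# Inspect the request without touching the network.
request = build_export_request(["Apache_Software_Foundation", "Apache_Hadoop"])
print(request.get_method(), request.data)
```

Calling download_export(["Apache_Software_Foundation", "Apache_Hadoop"], "wiki_export.xml") would then perform the actual download. Check the resulting file before feeding it to Hadoop, since the accepted form fields may differ from what is assumed here.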

Looking for more interesting data to analyze in Hadoop? Follow this post to stream data from Twitter.

[Image: Wikipedia Export]

Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real-time streaming platforms. We have seen a wide range of real-world big data problems and implemented some innovative and complex (or simple, depending on how you look at it) solutions.
