# What is Big Data? – A beginner’s tutorial

In this blog post, we are going to see the following:

1. What is Big Data?
2. Examples of Big Data
3. What are the problems that come with Big Data?
4. What Hadoop can offer in terms of solutions to the Big Data problems
5. Hadoop vs. Traditional Solutions

## What is Big Data?

If you asked me, “Tell me in one sentence, what is Big Data?”

I would say Big Data is an extremely huge volume of data.

To that answer, the natural follow-up question is – what is considered “huge”: 10 GB, 100 GB or 100 TB?

Well, there is no hard number that defines “huge” or Big Data. I know that is not the answer you are looking for. (I can see your disappointment :-).) There are two reasons why there is no straightforward answer –

First, what is considered big or huge today in terms of data size or volume need not be considered big a year from now. It is very much a moving target.

Second, it is all relative. What you and I consider to be “Big” may not be the case for companies like Google and Facebook.

Hence, for the above two reasons, it is very hard to put a number on what counts as Big Data. That said, if we were defining a Big Data problem in terms of pure volume alone, then in our opinion –

**100 GB** – Not a chance. We all have hard disks bigger than 100 GB. Clearly not Big Data.
**1 TB** – Still no, because a well-designed traditional database can handle 1 TB or even more without any issues.
**100 TB** – Likely. Some would claim 100 TB to be a Big Data problem and others might disagree. Again, it’s all relative.
**1000 TB** – A Big Data problem.

Also, volume is not the only factor that decides whether your data qualifies as Big Data. So what other factors should be considered? Perfect segue to our next question.

Let’s say we work at a startup and we recently launched a very successful email service where users can log in to send and receive emails. Our email service is so good (better than Gmail :-)) that in 3 months we have 100,000 active users signed up and using our service. Hypothetically, let’s say we are currently using a traditional database to store email messages and their attachments, and the current size of the database is 1 TB.

So, do we have a Big Data problem? The simple and straight answer is no. 1 TB or even more is a manageable size for a traditional database. The more important question is – at this growth rate, will we have a Big Data problem in the (near) future? To answer that, we need to consider 3 factors.

    \"volume-velocity-variety\"<\/a><\/p>\n

### Volume

The obvious factor. In 3 months our startup gained 100,000 active users and 1 TB of data. If we keep growing at the same rate, by the end of the year we will have 400,000 users and 4 TB of data; by the end of year 2, 8 TB. And what if we doubled or tripled our user base every 3 months? The bottom line is that we should not look at volume alone when we think about Big Data; we should also look at the rate at which our data grows. In other words, we should watch the velocity, or speed, of our data growth.
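
To make that arithmetic concrete, here is a minimal sketch using the hypothetical figures from our startup example (1 TB gained per quarter, plus an alternative doubling scenario):

```python
# Projecting data volume for our hypothetical email startup:
# 100,000 new users and roughly 1 TB of new data every quarter.

def project_volume_tb(start_tb, tb_per_quarter, quarters):
    """Projected volume in TB after a number of additional quarters."""
    return start_tb + tb_per_quarter * quarters

print(project_volume_tb(1, 1, 3))   # end of year 1 -> 4 TB
print(project_volume_tb(1, 1, 7))   # end of year 2 -> 8 TB

# What if the user base (and data) doubled every quarter instead?
def project_volume_doubling_tb(start_tb, quarters):
    return start_tb * 2 ** quarters

print(project_volume_doubling_tb(1, 7))  # end of year 2 -> 128 TB
```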

### Velocity

This is an important factor. Velocity tells you how fast your data is growing. If your data volume stays at 1 TB for a year, all you need is a good database; but if the growth rate is 1 TB every week, then you have to think about a scalable solution.
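
To see why velocity matters more than a snapshot of volume, compare those two growth rates against a threshold. The 100 TB cutoff below is an illustrative assumption, not a hard rule:

```python
# How long until the data outgrows a single well-tuned database?
# The 100 TB threshold is an illustrative assumption, not a hard rule.

def weeks_until(threshold_tb, current_tb, growth_tb_per_week):
    return (threshold_tb - current_tb) / growth_tb_per_week

# Growth of 1 TB per year (~0.02 TB/week): a plain database is fine for decades.
print(weeks_until(100, 1, 1 / 52))   # 5148 weeks, roughly 99 years

# Growth of 1 TB per week: you cross 100 TB in under two years.
print(weeks_until(100, 1, 1))        # 99 weeks
```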

Most of the time, volume and velocity are all you need to decide whether you have a Big Data problem or not.

### Variety

Variety adds one more dimension to your analysis. Data in traditional databases is highly structured, i.e. rows and columns. But take, for instance, our hypothetical startup mail service: it receives data in various formats – text for mail messages, images and videos as attachments. When data comes into your system in different formats and you have to process or analyze it in those formats, traditional database systems are sure to fail; combined with high volume and velocity, you for sure have a Big Data problem.
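
Here is a rough sketch of what variety looks like for our hypothetical mail service. Every record below is valid, yet no single fixed table schema fits all of them (the field names are made up for illustration):

```python
# Three incoming records for the same mail service, each a different shape.
# No single fixed rows-and-columns schema fits all of them.
messages = [
    {"type": "text",  "from": "a@example.com", "body": "Meeting at 10?"},
    {"type": "image", "from": "b@example.com", "filename": "photo.jpg",
     "bytes": 2_400_000, "width": 4000, "height": 3000},
    {"type": "video", "from": "c@example.com", "filename": "clip.mp4",
     "bytes": 95_000_000, "duration_sec": 42, "codec": "h264"},
]

# The fields present vary per record -- the hallmark of semi-structured data.
for m in messages:
    print(m["type"], sorted(m.keys()))
```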

This happens to Big Data consultants all the time: they are called in by clients with data storage or performance issues who hope that a Big Data solution like Hadoop is going to solve their problem, and most of the time those problems fail the volume and velocity tests.

Their volume will be in the high gigabytes to low terabytes, and their data growth has been relatively low for the past 6 months and looks the same for the foreseeable future. Hence their volume does not qualify as Big Data, and the low growth fails the velocity test as well. What such a client needs is to optimize the existing process, not a sophisticated Big Data solution.

Now you know what Big Data is. Given a scenario, if someone asks you whether their data problems can be solved by Big Data solutions, you know the factors to consider in making a sound decision – volume, velocity and variety.

So when we say Big Data, we are potentially talking about hundreds to thousands of terabytes. If you are new to the Big Data space, you are probably wondering: is there really a use case? The answer is absolutely yes, across all domains – science, government and the private sector.

## Examples of Big Data

    \"hadoop<\/a><\/p>\n

Let’s talk about science first. The Large Hadron Collider at CERN produces about 1 petabyte of data every second, mostly sensor data. The volume is so huge that they don’t even retain or store all the data they produce. (Source)

NASA gathers about 1.73 gigabytes of data every hour – weather readings, geolocation data and so on. (Source)

Let’s talk about government. The NSA is known for its controversial data collection programs, and guess how much the NSA’s data center in Utah can house in terms of volume? A yottabyte of data – that is, 1 trillion terabytes. Pretty massive, isn’t it? (Source)

In March of 2012, the Obama administration announced it would spend $200 million on Big Data initiatives. (Source)

Even though we cannot technically classify the next one under government, it’s an interesting use case, so we included it anyway. Obama’s second-term election campaign used big data analytics, which gave it a competitive edge in winning the election. (Source)

Next, let’s look at the private sector. With the advent of social media like Facebook, Twitter, LinkedIn, etc., there is no scarcity of data. eBay is known to have a 30 PB cluster, and Facebook another 30 PB. The numbers are probably much, much bigger now, since these stats are old. (Source)

It is not only social media companies but also the retail space. It is common for major retail websites to capture clickstream data. For example, let’s say you shop at amazon.com. Amazon is not only capturing data when you click checkout; every click on the website is tracked to deliver a personalized shopping experience. When Amazon shows you recommendations, big data analytics is at work behind the scenes.

You might be interested to know how Target could find out a woman is pregnant using data analytics.

## What are the problems that come with Big Data?

Now you should be convinced that Big Data exists, even if you did not believe it before.

*So we have Big Data, so what?*

Big Data comes with big problems. Let’s talk about a few problems you may run into when you deal with Big Data.

Since the datasets are huge, you need to find a way to store them as efficiently as possible – not just efficiency in terms of storage space, but also storing the dataset in a form that is suitable for computation. Another problem with big datasets is that you have to worry about data loss, whether from corruption in the data or from hardware failure, and you need to have proper recovery strategies in place.
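
One common recovery strategy is simply to keep multiple copies of everything; HDFS, for example, defaults to 3 replicas of every block. A quick sketch of what that does to your raw capacity requirement (the 25% working headroom is an assumption for illustration):

```python
# Raw disk needed when every block is stored 3 times (the HDFS default
# replication factor), plus some working headroom (25% is an assumption).

def raw_capacity_tb(logical_tb, replication=3, headroom=0.25):
    return logical_tb * replication * (1 + headroom)

print(raw_capacity_tb(100))   # 100 TB of data -> 375.0 TB of raw disk
```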

The main purpose of storing data is to analyze it, and how much time it takes to analyze your big data and produce an answer is the million dollar question. What good is storing the data if you cannot analyze or process it in a reasonable time? With big datasets, computation with reasonable execution times is a challenge.
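
A back-of-the-envelope calculation shows the problem. Assuming a single disk sustains roughly 100 MB/s on sequential reads (a typical spinning-disk figure; your hardware will differ), just reading the data once takes:

```python
# Time just to READ the data once, before any analysis happens.
# Assumes ~100 MB/s sequential reads from a single disk.

def hours_to_scan(volume_tb, mb_per_sec=100):
    mb = volume_tb * 1_000_000           # 1 TB ~= 1,000,000 MB
    return mb / mb_per_sec / 3600

print(hours_to_scan(1))      # 1 TB    -> ~2.8 hours
print(hours_to_scan(1000))   # 1000 TB -> ~2778 hours, about 116 days
```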

Finally, cost. You are going to need a lot of storage space, so the storage solution that you plan to use should be cost effective.

But what is the need for a new Big Data solution like Hadoop? Why are traditional database solutions like MySQL or Oracle not a good fit for storing and analyzing big datasets?

You will encounter scalability issues with a traditional RDBMS when the data volume moves up into the terabytes. You will be forced to denormalize and pre-aggregate the data for faster query execution, and as the data gets bigger you will be forced to keep changing the process – revising indexes, optimizing queries and so on.
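
As an illustration of the kind of pre-aggregation this forces on you, here is a hypothetical sketch that rolls raw click events up into daily counts, so queries read a tiny summary instead of every raw row:

```python
# Hypothetical pre-aggregation: roll raw click events up into daily
# counts so queries read a tiny summary instead of every raw row.
from collections import Counter

raw_events = [
    {"day": "2015-06-01", "user": "u1"},
    {"day": "2015-06-01", "user": "u2"},
    {"day": "2015-06-02", "user": "u1"},
]

daily_counts = Counter(e["day"] for e in raw_events)
print(daily_counts)   # Counter({'2015-06-01': 2, '2015-06-02': 1})
# Trade-off: faster reads, but per-event detail is gone and the summary
# must be kept in sync as new events arrive.
```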

If you have worked with databases before, you know that even when your database is running with ample hardware resources, fixing a performance issue still means changing the query itself or the way the data being accessed is stored; there is no working around it. You cannot simply add more hardware or more compute nodes and distribute the problem to bring computation times down. In other words, a database is not horizontally scalable.
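
Contrast that with a horizontally scalable system, where adding nodes divides the work. Extending the scan-time arithmetic from earlier (ideal, overhead-free scaling, so treat it as a best case):

```python
# Ideal horizontal scaling: N nodes each scan 1/N of the data in parallel.
# Real clusters pay coordination overhead, so treat this as a best case.

def hours_to_scan_cluster(volume_tb, nodes, mb_per_sec=100):
    mb = volume_tb * 1_000_000
    return mb / (mb_per_sec * nodes) / 3600

print(hours_to_scan_cluster(1000, 1))     # ~2778 hours on one machine
print(hours_to_scan_cluster(1000, 100))   # ~28 hours on a 100-node cluster
```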

The second problem with databases is that they are designed to process structured data. Imagine records that do not conform to a table-like structure: each record in your dataset differs in the number of columns, and there is no uniformity in column lengths and types between rows. In that case a database is not a good choice; when your data does not have a proper structure, a database will struggle to store and process it. Furthermore, a database is not a good choice when you have a variety of data – that is, data in several formats like text, images, etc.

The other challenge is cost. A good enterprise-grade database solution can be quite expensive for a relatively low volume of data. When you add the hardware costs and the platinum-grade storage costs, it quickly adds up.

So what is the solution? A good solution should –