“How Large Companies Manage and Manipulate Big Data Using Hadoop”
What is Big Data?
| Let’s understand what Big Data is
Simply put, this is a term that describes the large volumes of data that a business deals with on a daily basis.
According to Gartner, it can be defined as “high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
Big data analytics allows businesses to process this data. Properly processed, it provides valuable insights. These can help a business cut costs, increase customer interaction with a brand, or grow its operations. The terms are relatively new; however, the practice of gathering data on customers or business operations, and using it to improve and develop strategies, is not.
In other words, big data is generated in multi-terabyte quantities. It changes fast and comes in a variety of forms that are difficult to manage and process using an RDBMS or other traditional technologies. Big Data solutions provide the tools, methodologies, and technologies used to capture, store, search, and analyse the data in seconds, uncovering relationships and insights for innovation and competitive gain that were previously unavailable. In short, such data is so large, so complex, and growing so quickly over time that none of the traditional data management tools can store or process it efficiently.
Around 80% of the data generated today is unstructured and cannot be handled by traditional technologies. Earlier, the amount of data generated was not that high, and we simply kept archiving it because only historical analysis was needed. Today, data generation runs into petabytes, and it is no longer practical to archive the data repeatedly and retrieve it when needed: data scientists need to work with the data continuously for predictive analysis, not just the historical analysis that traditional tools supported.
As the saying goes, “an image is worth a thousand words.” Hence we have also provided a video tutorial for a better understanding of what Big Data is and why it is needed.
| Why Is Big Data Important?
Big data is growing in importance. Used alongside high-powered analytics, it allows a business to gain valuable insights. These insights can be used to develop and modernise, improve strategies, and refine processes. However, the amount of data you have isn’t as important as what you do with it. Data from any source, if properly analysed, can be used to cut costs, save time, optimise processes, develop new products, or make informed decisions. Rich in potential, the importance of this area is only going to increase in the coming years.
| Features Of ‘Big Data’
Big data can be defined by one or more of these three characteristics:
- A massive amount of data that is quickly growing
- Data that grows so quickly in volume it can’t be processed with conventional means
- The mining, storage, analysis, sharing and visualisation of data.
The term can also include data frameworks, as well as any tools or techniques employed to analyse the data.
Data can be structured, unstructured or semi-structured.
Structured data is highly organised. Coming in a fixed format, it is easy to process, store and access.
Semi-structured data is a data set that holds both structured and unstructured pieces.
This form of data hasn’t been classified in a database.
However, it still contains vital information that segregates elements in the data set.
Unstructured data lacks any set form or structure.
This makes accessing, processing and analysing it a time-consuming and difficult task.
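To make the distinction concrete, here is a small illustrative sketch in Java (the record and all its values are hypothetical) showing the same customer information in each of the three forms:

```java
// Illustrative only: the same customer record in three forms.
public class DataForms {
  public static void main(String[] args) {
    // Structured: a fixed-schema row, as it would sit in an RDBMS table:
    //   | id | name  | city   |
    //   | 17 | Alice | Mumbai |

    // Semi-structured: self-describing JSON; fields can vary per record,
    // but tags still segregate the elements of the data set.
    String semiStructured =
        "{\"id\": 17, \"name\": \"Alice\", \"notes\": [\"prefers email\"]}";

    // Unstructured: free text with no machine-readable schema at all.
    String unstructured =
        "Alice from Mumbai called today; she prefers email over phone.";

    System.out.println(semiStructured);
    System.out.println(unstructured);
  }
}
```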
| The Vs of Big Data
Big data is often characterized by a number of Vs:
- Velocity, the speed at which data is generated, collected and analyzed.
- Volume, the amount of data generated each second. Volume is often discussed in reference to sources such as social media, credit cards, phones and photographs.
- Value, this refers to the worth of the extracted data. Large amounts of data are useless unless used correctly.
- Variety, this describes the different types of data generated. This term is largely used in reference to unstructured data such as images or social media posts.
- Veracity, this refers to how trustworthy data is. If the data is not accurate or of poor quality, it is of little use.
- Validity, like veracity, this tells us how accurate the data is for its intended use.
- Volatility refers to the age of the data. As fresh data is generated every hour or even minute stored data can quickly become irrelevant or historic. Volatility also refers to how long data needs to be kept before it can be discarded or archived.
- Visualisation describes how challenging data can be to use. Limitations such as poor scalability or functionality can impact visualisation. Additionally, data sets can be large and vast, making them complicated to use or visualise in a meaningful way.
| What is ‘Hadoop’?
Hadoop was developed by Doug Cutting and Michael J. Cafarella. It is managed by the Apache Software Foundation and licensed under the Apache License 2.0. Hadoop is very useful for big businesses because it runs on cheap commodity servers, so storing and processing data costs less. Hadoop helps companies make better business decisions by providing a history of data and various records of the company, so by using this technology a company can improve its business. Hadoop does a great deal of processing over the data collected from the company to deduce results that can inform future decisions.
Apache Hadoop is an open source software framework used to develop data processing applications which are executed in a distributed computing environment.
Applications built using Hadoop run on large data sets distributed across clusters of commodity computers. Commodity computers are cheap and widely available, and are mainly useful for achieving greater computational power at low cost.
Similar to data residing in the local file system of a personal computer, in Hadoop data resides in a distributed file system called the Hadoop Distributed File System (HDFS). The processing model is based on the ‘Data Locality’ concept, wherein computational logic is sent to the cluster nodes (servers) containing the data. This computational logic is nothing but a compiled version of a program written in a high-level language such as Java. Such a program processes data stored in Hadoop HDFS.
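As an illustration, here is a minimal sketch of such a program, along the lines of the classic WordCount example from the official Hadoop MapReduce tutorial; the input and output HDFS paths are supplied as command-line arguments:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on the node holding the data block (data locality)
  // and emits (word, 1) for every word in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums all the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, this would typically be submitted to the cluster with something like `hadoop jar wordcount.jar WordCount /input /output`.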
| Hadoop Architecture
Hadoop has a Master-Slave Architecture for data storage and distributed data processing using MapReduce and HDFS methods.
NameNode: The NameNode represents every file and directory used in the namespace.
DataNode: A DataNode helps you manage the state of an HDFS node and allows you to interact with its blocks.
MasterNode: The master node allows you to conduct parallel processing of data using Hadoop MapReduce.
Slave node: The slave nodes are the additional machines in the Hadoop cluster that store data and carry out complex calculations. Every slave node comes with a Task Tracker and a DataNode, which synchronize the processes with the Job Tracker and the NameNode respectively. In Hadoop, the master and slave systems can be set up in the cloud or on premises.
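To see what this architecture looks like from the client side, here is a minimal sketch using the HDFS Java API. It assumes a reachable cluster configured via core-site.xml on the classpath, and the file path is hypothetical:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // reads core-site.xml
    FileSystem fs = FileSystem.get(conf);     // the client talks to the NameNode

    Path path = new Path("/tmp/hello.txt");   // hypothetical path

    // Write: the NameNode records the metadata; the actual blocks
    // are streamed to DataNodes.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the client asks the NameNode for block locations,
    // then reads the data directly from the DataNodes.
    try (FSDataInputStream in = fs.open(path)) {
      byte[] buf = new byte[64];
      int n = in.read(buf);
      System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
    }

    fs.close();
  }
}
```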
| Features Of ‘Hadoop’
• Suitable for Big Data Analysis
As Big Data tends to be distributed and unstructured in nature, Hadoop clusters are best suited for its analysis. Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This concept is called data locality, and it helps increase the efficiency of Hadoop-based applications.
• Scalability
Hadoop clusters can easily be scaled to any extent by adding additional cluster nodes, thus allowing for the growth of Big Data. Moreover, scaling does not require modifications to application logic.
• Fault Tolerance
The Hadoop ecosystem has a provision to replicate the input data onto other cluster nodes. That way, in the event of a cluster node failure, data processing can still proceed using the data stored on another cluster node.
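The replication factor is normally set cluster-wide (dfs.replication in hdfs-site.xml, 3 by default), but it can also be adjusted per file. As a minimal sketch, again assuming a reachable cluster and a hypothetical file path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Ask HDFS to keep 3 copies of this (hypothetical) file on
    // different DataNodes, so processing can continue if one node fails.
    fs.setReplication(new Path("/data/input/events.log"), (short) 3);

    fs.close();
  }
}
```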
| Network Topology In Hadoop
The topology (arrangement) of the network affects the performance of the Hadoop cluster as the cluster grows in size. In addition to performance, one also needs to care about high availability and failure handling. To achieve this, Hadoop cluster formation makes use of network topology.
Typically, network bandwidth is an important factor to consider when forming any network. However, as measuring bandwidth can be difficult, in Hadoop the network is represented as a tree, and the distance between nodes of this tree (the number of hops) is considered an important factor in forming the Hadoop cluster. Here, the distance between two nodes is equal to the sum of their distances to their closest common ancestor (a short sketch after the list below illustrates this rule).
A Hadoop cluster consists of data centers, racks, and the nodes that actually execute jobs: a data center consists of racks, and a rack consists of nodes. The network bandwidth available to processes varies depending on their location. The available bandwidth decreases as we move from:
- Processes on the same node
- Different nodes on the same rack
- Nodes on different racks of the same data center
- Nodes in different data centers
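To illustrate the distance rule, here is a small self-contained sketch. This is not Hadoop’s actual implementation (which lives in org.apache.hadoop.net.NetworkTopology); it simply applies the rule above to node locations written as /datacenter/rack/node paths:

```java
// Illustrative sketch of Hadoop-style tree distance between nodes.
public class TopologyDistance {

  // Distance = hops from each node up to their closest common ancestor.
  static int distance(String a, String b) {
    String[] pa = a.substring(1).split("/");
    String[] pb = b.substring(1).split("/");
    int common = 0;
    while (common < pa.length && common < pb.length
        && pa[common].equals(pb[common])) {
      common++;
    }
    return (pa.length - common) + (pb.length - common);
  }

  public static void main(String[] args) {
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
    System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center, different racks
    System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centers
  }
}
```

The growing distances match the list above: bandwidth is highest between processes on the same node and lowest between nodes in different data centers.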
| Here is a list of some of the large and small-scale companies using Hadoop:
Yahoo: Used for scaling tests.
Twitter: To store and process tweets and log files using LZO, a portable lossless data compression library written in ANSI C; it is fast and also helps free up the CPU for other tasks.
LinkedIn: LinkedIn’s data flows through Hadoop clusters. User activity, server metrics, images and transaction logs stored in HDFS are used by data analysts for business analytics, such as discovering people whom you may know.
JPMorgan: Analytics on the transactions of its customers.
Amazon: Data processing by analysing customer reviews and requirements.
Adobe: Social services to structured data storage.
eBay: With 300+ million users browsing more than 350 million products listed on its website, eBay has one of the largest Hadoop clusters in the industry, run predominantly on MapReduce jobs. eBay uses Hadoop for search optimisation and research.
Netflix: For decision making.
AOL: Targets machines and dual processors.
Alibaba: Analytics for its vertical search engine.
IBM: Client projects in finance, telecom and retail; machine learning.
Infosys: Client projects in finance, telecom and retail.
TCS: Client projects in finance, telecom and retail.
Spotify: Used for content generation and data aggregation.
| How does Facebook handle Big Data?
We all know that Facebook is the most popular social media platform in the world, with billions of users. Facebook offers many features: users can create a free account, and create and upload as many posts as they want.
Did you know that Facebook’s systems process 2.5 billion pieces of content and 500+ terabytes (1 terabyte = 1,000 gigabytes) of data each day? Facebook pulls in 2.7 billion Like actions and 300 million photos per day, and scans roughly 105 terabytes of data each half hour.
That is massive data! So how does Facebook manage such a huge volume of data?
“Facebook runs the world’s largest Hadoop cluster,” says Jay Parikh (Vice President of Infrastructure Engineering, Facebook).
Basically, Facebook runs the biggest Hadoop cluster, spanning more than 4,000 machines and storing hundreds of millions of gigabytes.
Hadoop provides a common infrastructure for Facebook with efficiency and reliability. From searching, log processing, recommendation systems, and data warehousing to video and image analysis, Hadoop is empowering this social networking platform in every way possible.
VP of Engineering Jay Parikh explained why this is so important to Facebook: “Big data really is about having insights and making an impact on your business. If you aren’t taking advantage of the data you’re collecting, then you just have a pile of data, you don’t have big data.” By processing data within minutes, Facebook can roll out new products, understand user reactions, and modify designs in near real time.
Another stat Facebook revealed was that over 100 petabytes of data are stored in a single Hadoop cluster, and Parikh noted, “We think we operate the single largest Hadoop system in the world.” In a hilarious moment, when asked “Is your Hadoop cluster bigger than Yahoo’s?”, Parikh proudly stated “Yes” with a wink.
While that sounds like a lot to smaller businesses, he noted that in a few months “No one will care you have 100 petabytes of data in your warehouse”. The speed of ingestion keeps on increasing, and “the world is getting hungrier and hungrier for data.”
And this data isn’t just helpful for Facebook. It passes on the benefits to its advertisers. Parikh explained, “We’re tracking how ads are doing across different dimensions of users across our site, based on gender, age, interests [so we can say] ‘actually this ad is doing better in California so we should show more of this ad in California to make it more successful.’”
Hadoop is the answer to the traditional storage and computing problem. Traditional database management systems are not scalable, are not fault-tolerant, and become very slow at fetching records as the data size increases.