Seminar On
BIG DATA MINING: A CHALLENGE AND HOW TO MANAGE IT
Submitted To:
Submitted By: Dinesh and Jitender
INTRODUCTION
Big Data is a term used for datasets that, due to their large size and complexity, cannot be managed with our current methodologies or data mining software tools. Big Data mining is the capability of extracting useful information from these large datasets or streams of data, something their volume, variability, and velocity previously made impossible. The Big Data challenge is becoming one of the most exciting opportunities for the coming years. This paper presents a broad overview of the topic and its current status, and shows the challenges and tools for managing heterogeneous information at the frontier of Big Data mining research.
WHAT IS BIG DATA?
‘Big Data’ is similar to ‘small data’, but bigger in size; handling bigger data requires different approaches: new techniques, tools, and architectures, with the aim of solving new problems, or old problems in a better way. Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.
WHAT IS BIG DATA: EXAMPLES
• Walmart handles more than 1 million customer transactions every hour.
• Facebook handles 40 billion photos from its user base.
THREE CHARACTERISTICS OF BIG DATA: THE 3 Vs
• Volume: data quantity
• Velocity: data speed
• Variety: data types
1ST CHARACTERISTIC OF BIG DATA: VOLUME
• A typical PC might have had 10 gigabytes of storage in 2000.
• Today, Facebook ingests 500 terabytes of new data every day.
• Smartphones, with the data they create and consume, and sensors embedded into everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
DATA VELOCITY (SPEED)
• High-frequency stock trading algorithms reflect market changes within microseconds.
• Machine-to-machine processes exchange data between billions of devices.
• Infrastructure and sensors generate massive log data in real time.
• Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
VARIETY (DATA TYPES: IMAGES, VIDEO, SOUND)
• Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.
• Traditional database systems were designed to address smaller volumes of structured data, fewer updates, and a predictable, consistent data structure.
• Big Data analysis includes these different types of data.
PROCESSING BIG DATA
• Integrating disparate data stores
– Mapping data to the programming framework
– Connecting and extracting data from storage
– Transforming data for processing
– Subdividing data in preparation for Hadoop MapReduce
• Employing Hadoop MapReduce (a word-count sketch of the job components follows this list)
– Creating the components of Hadoop MapReduce jobs
– Distributing data processing across server farms
– Executing Hadoop MapReduce jobs
– Monitoring the progress of job flows
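To make "creating the components of Hadoop MapReduce jobs" concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets the mapper and reducer be written as ordinary Python scripts that read stdin and write stdout. The script and HDFS path names are illustrative assumptions, not from the slides.

```python
#!/usr/bin/env python
# mapper.py -- emits one (word, 1) pair per word of input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sums the counts per word. Hadoop delivers the
# mapper output sorted by key, so equal words arrive adjacently.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Such a job would be submitted with the hadoop-streaming jar (its exact location varies by installation), roughly: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out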
THE STRUCTURE OF BIG DATA
• Structured: most traditional data sources
• Semi-structured: many sources of big data
• Unstructured: video data, audio data
WHY BIG DATA
• Growth of Big Data is driven by:
– Increase of storage capacities
– Increase of processing power
– Availability of data (different data types)
• Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been created in the last two years alone.
WHY BIG DATA
• Facebook generates 10 TB of data daily.
• Twitter generates 7 TB of data daily.
• IBM claims 90% of today’s stored data was generated in just the last two years.
HOW IS BIG DATA DIFFERENT?
1) Automatically generated by a machine (e.g., a sensor embedded in an engine)
2) Typically an entirely new source of data (e.g., use of the internet)
3) Not designed to be friendly (e.g., text streams)
4) May not have much value, so we need to focus on the important part
DATA GENERATION POINTS: EXAMPLES
Mobile devices, microphones, readers/scanners, science facilities, programs/software, social media, cameras.
BIG DATA ANALYTICS
• Examining large amounts of data
• Extracting appropriate information
• Identifying hidden patterns and unknown correlations
• Competitive advantage
• Better business decisions: strategic and operational
• Effective marketing, customer satisfaction, increased revenue
POTENTIAL VALUE OF BIG DATA
• $300 billion potential annual value to US health care.
• $600 billion potential annual consumer surplus from using personal location data.
• 60% potential increase in retailers’ operating margins.
INDIA – BIG DATA
• Gaining traction.
• Huge market opportunities for IT services (82.9% of revenues) and analytics firms (17.1%).
• Current market size is $200 million; expected to reach $1 billion by 2015.
• The opportunity for Indian service providers lies in offering services around Big Data implementation and analytics for global multinationals.
BENEFITS OF BIG DATA
• Real-time big data isn’t just a process for storing data in a data warehouse; it’s about the ability to make better decisions and take meaningful actions at the right time.
• Fast-forward to the present: technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it.
• Technologies such as MapReduce, Hive, and Impala enable you to run queries without changing the data structures underneath.
BENEFITS OF BIG DATA
Our newest research finds that organizations are using big data to target customer-centric outcomes, tap into internal data, and build a better information ecosystem. Big Data is already an important part of the $64 billion database and data analytics market. It offers commercial opportunities of a scale comparable to enterprise software in the late 1980s, the Internet boom of the 1990s, and the social media explosion of today.
WHAT IS “BIG DATA”?
“Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization” (Gartner, 2012). Complicated (intelligent) analysis may make small data “appear” to be “big”. Bottom line: any data that exceeds our current processing capability can be regarded as “big”.
WHAT IS DATA MINING?
• Discovery of useful, possibly unexpected, patterns in data
• Extraction of implicit, previously unknown, and potentially useful information from data
• Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
DATA MINING TASKS
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
CLASSIFICATION: DEFINITION
• Given a collection of records (the training set), where each record contains a set of attributes, one of which is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it (see the sketch below).
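A minimal sketch of this train/test workflow, assuming Python with scikit-learn (our choice of library, not the slides’); any classifier could stand in for the decision tree:

```python
# Build a classifier on a training set and measure its accuracy
# on a held-out test set, as described above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # attributes and class labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)    # hold out 30% as the test set

model = DecisionTreeClassifier().fit(X_train, y_train)   # build the model
predictions = model.predict(X_test)          # classify "unseen" records
print("Test-set accuracy:", accuracy_score(y_test, predictions))
```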
CLUSTERING
[Figure: customer clusters plotted along income, education, and age dimensions]
K-MEANS CLUSTERING
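The original slide is a figure, so here instead is a small illustrative K-means sketch, assuming Python with scikit-learn and synthetic customer data over the income, education, and age dimensions of the previous slide:

```python
# Partition synthetic customers into K clusters with K-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# hypothetical data: columns are income, years of education, age
customers = rng.normal(loc=[50_000, 14, 40],
                       scale=[15_000, 2, 12], size=(200, 3))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.cluster_centers_)   # one centroid per discovered segment
print(kmeans.labels_[:10])       # cluster assignments of first 10 customers
```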
ASSOCIATION RULE MINING
[Figure: market-basket sales records, a matrix of customer transaction IDs against products bought]
• Trend: products p5 and p8 are often bought together
• Trend: customer 12 likes product p9
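As a toy illustration of mining such trends, below is a frequent-pair counter over hypothetical market-basket data in plain Python; production systems would use Apriori or FP-Growth implementations instead:

```python
# Count how often each pair of products co-occurs in a basket and
# report the pairs meeting a minimum support threshold.
from itertools import combinations
from collections import Counter

transactions = [            # hypothetical baskets, one set per transaction
    {"p5", "p8", "p2"},
    {"p5", "p8"},
    {"p1", "p9"},
    {"p5", "p8", "p9"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 2
for pair, count in pair_counts.items():
    if count >= min_support:
        print(pair, "bought together in", count, "transactions")
```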
BIG VELOCITY
• Sensor tagging of everything of value sends velocity through the roof (e.g., car insurance).
• Smartphones as a mobile platform send velocity through the roof.
• The state of multi-player internet games must be recorded, which sends velocity through the roof.
BIG DATA STANDARDIZATION CHALLENGES (1)
• Big Data use cases, definitions, vocabulary, and reference architectures (e.g., system, data, platforms, online/offline)
• Specifications and standardization of metadata, including data provenance
• Application models (e.g., batch, streaming)
• Query languages, including non-relational queries over diverse data types (XML, RDF, JSON, multimedia) and Big Data operations (e.g., matrix operations)
• Domain-specific languages
• Semantics of eventual consistency
• Advanced network protocols for efficient data transfer
• General and domain-specific ontologies and taxonomies for describing data semantics, including interoperation between ontologies
Source: ISO
BIG DATA STANDARDIZATION CHALLENGES (2)
• Big Data security and privacy access controls
• Remote, distributed, and federated analytics (taking the analytics to the data), including data and processing resource discovery and data mining
• Data sharing and exchange
• Data storage, e.g., in-memory storage systems, distributed file systems, data warehouses
• Human consumption of the results of big data analysis (e.g., visualization)
• Interfaces between relational (SQL) and non-relational (NoSQL) data stores
• Big Data quality and veracity description and management
Source: ISO
TOOLS FOR MANAGING BIG DATA
Hadoop is an open-source framework from Apache that allows users to store and process big data in a distributed environment across clusters of computers using simple programming models. Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.
Challenges at Large Scale
Performing large-scale computation is difficult. Working with this volume of data requires distributing parts of the problem to multiple machines to handle in parallel. Whenever multiple machines are used in cooperation with one another, the probability of failures rises. In a single-machine environment, failure is not something program designers explicitly worry about very often: if the machine has crashed, there is no way for the program to recover anyway.
R
The R programming language is the preferred choice among data analysts and data scientists, and there is no doubt that it is the most preferred tool for statisticians, data scientists, data analysts, and data architects; however, it falls short when working with large datasets. One major drawback of R is that all objects are loaded into the main memory of a single machine. Petabyte-scale datasets cannot be loaded into RAM; this is where Hadoop, integrated with the R language, is an ideal solution. To adapt to the in-memory, single-machine limitation of R, data scientists have to limit their analysis to a sample of the large dataset.
R and Hadoop were not natural friends, but with the advent of packages like RHadoop, RHIVE, and RHIPE, the two seemingly different technologies complement each other for big data analytics and visualization.
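The sampling workaround described above can be illustrated with reservoir sampling, which draws a fixed-size uniform sample from data far too large to load into RAM. This is a conceptual Python sketch; in practice the analysis would stay in R or be pushed down to Hadoop via the packages just mentioned:

```python
# Reservoir sampling (Algorithm R): keep a uniform sample of k
# records from a stream of arbitrary length using O(k) memory.
import random

def reservoir_sample(stream, k, seed=0):
    rnd = random.Random(seed)
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            reservoir.append(record)       # fill the reservoir first
        else:
            j = rnd.randint(0, i)          # replace with probability k/(i+1)
            if j < k:
                reservoir[j] = record
    return reservoir

print(reservoir_sample(range(10_000_000), k=5))
```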
STORM
Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size. Enterprises combine it with other data access applications in Hadoop to prevent undesirable events or to optimize positive outcomes. Specific new business opportunities include real-time customer service management, data monetization, operational dashboards, and cyber security analytics and threat detection.
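Storm topologies are normally written in Java, so the following is only a conceptual Python sketch of the kind of per-window aggregation a Storm bolt might perform over a high-velocity event stream; the class name and window size are illustrative inventions, not the Storm API:

```python
# Rolling counts over the most recent N events of a stream,
# the sort of aggregation a real-time dashboard computes.
from collections import Counter, deque

class RollingCounter:
    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)   # oldest events fall off
        self.counts = Counter()

    def add(self, event):
        if len(self.window) == self.window.maxlen:
            self.counts[self.window[0]] -= 1      # expire the oldest event
        self.window.append(event)
        self.counts[event] += 1

counter = RollingCounter(window_size=3)
for event in ["login", "click", "click", "buy", "click"]:
    counter.add(event)
    print(dict(+counter.counts))                  # only positive counts
```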
APACHE MAHOUT
Apache Mahout is a powerful, scalable machine-learning library that runs on top of Hadoop MapReduce. We are living in an age where information is available in abundance. The information overload has scaled to such heights that sometimes it becomes difficult to manage our little mailboxes! Imagine the volume of data and records that popular websites (the likes of Facebook, Twitter, and YouTube) have to collect and manage on a daily basis. It is not uncommon even for lesser-known websites to receive huge amounts of information in bulk. Normally we fall back on data mining algorithms to analyze bulk data, identify trends, and draw conclusions. However, no data mining algorithm can be efficient enough to process very large datasets and provide outcomes quickly unless the computational tasks are run on multiple machines distributed over the cloud. We now have frameworks that allow us to break a computation task into multiple segments and run each segment on a different machine. Mahout is such a data mining framework, normally coupled with Hadoop infrastructure in the background to manage huge volumes of data.
Apache Mahout is an open source project that is primarily used for creating scalable machine learning algorithms. It implements popular machine learning techniques such as recommendation, classification, and clustering. Apache Mahout started as a sub-project of Apache Lucene in 2008 and became a top-level Apache project in 2010.
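As a toy illustration of the recommendation technique listed above, here is an item-based collaborative filtering sketch in plain Python with made-up ratings; Mahout provides scalable, distributed implementations of this idea:

```python
# Item-based similarity: items rated alike by the same users are
# similar, so items similar to a user's favorites get recommended.
import math

ratings = {                     # user -> {item: rating}, hypothetical
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5, "C": 2},
    "u3": {"B": 4, "C": 5},
}

def cosine(x, y):
    # similarity computed over the users who rated both items
    users = [u for u in ratings if x in ratings[u] and y in ratings[u]]
    if not users:
        return 0.0
    dot = sum(ratings[u][x] * ratings[u][y] for u in users)
    nx = math.sqrt(sum(ratings[u][x] ** 2 for u in users))
    ny = math.sqrt(sum(ratings[u][y] ** 2 for u in users))
    return dot / (nx * ny)

print("sim(A, B) =", round(cosine("A", "B"), 3))
print("sim(A, C) =", round(cosine("A", "C"), 3))
```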
APACHE S4
S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
BIG DATA MINING TOOLS
The Big Data phenomenon is intrinsically related to the open source software revolution. Large companies such as Facebook, Yahoo!, Twitter, and LinkedIn benefit from and contribute to open source projects. Big Data infrastructure deals with Hadoop and other related software such as:
Apache Hadoop: software for data-intensive distributed applications, based on the MapReduce programming model and a distributed file system called the Hadoop Distributed File System (HDFS). Hadoop allows writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes. A MapReduce job divides the input dataset into independent subsets that are processed by map tasks in parallel. This mapping step is then followed by a step of reduce tasks, which use the output of the maps to obtain the final result of the job (see the simulation sketch below).
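To make the map, shuffle, and reduce steps concrete, here is a single-process Python simulation of the flow just described; in real Hadoop, the map and reduce tasks run in parallel across the cluster and the shuffle moves data between nodes:

```python
# Simulate MapReduce word count: split the input, map each split,
# shuffle (group map output by key), then reduce each key group.
from collections import defaultdict

def map_fn(line):                  # map task: emit (word, 1) pairs
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):       # reduce task: aggregate per key
    return word, sum(counts)

splits = ["big data mining", "big data tools", "data streams"]

shuffled = defaultdict(list)
for split in splits:               # these maps run in parallel in Hadoop
    for key, value in map_fn(split):
        shuffled[key].append(value)

results = [reduce_fn(key, values) for key, values in shuffled.items()]
print(sorted(results))             # [('big', 2), ('data', 3), ...]
```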
Apache S4: platform for processing continuous data streams. S4 is designed specifically for managing data streams. S4 applications are designed by combining streams and processing elements in real time.
Storm: software for streaming data-intensive distributed applications, similar to S4, and developed by Nathan Marz at Twitter.
In Big Data Mining, there are many open source initiatives. The most popular are the following:
Apache Mahout: scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering, and frequent pattern mining.
R: open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.
MOA: stream data mining open source software to perform data mining in real time. It has implementations of classification, regression, clustering, frequent item set mining, and frequent graph mining. It started as a project of the Machine Learning group at the University of Waikato, New Zealand, famous for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android, and Storm. SAMOA is a new software project for distributed stream mining that will combine S4 and Storm with MOA.
Vowpal Wabbit: open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. Via parallel learning, it can exceed the throughput of any single machine's network interface when doing linear learning.
MORE SPECIFIC TO BIG GRAPH MINING, WE FOUND THE FOLLOWING OPEN SOURCE TOOLS:
Pegasus: big graph mining system built on top of MapReduce. It allows finding patterns and anomalies in massive real-world graphs.
GraphLab: high-level graph-parallel system built without using MapReduce. GraphLab computes over dependent records, which are stored as vertices in a large distributed data graph.
REFERENCES
[1] Apache Hadoop, http://hadoop.apache.org.
[2] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill, 2011.
[3] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed Stream Computing Platform. In ICDM Workshops, pages 170–177, 2010.
[4] Storm, http://storm-project.net.
[5] Apache Mahout, http://mahout.apache.org.
[6] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.
[7] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis, http://moa.cms.waikato.ac.nz/. Journal of Machine Learning Research (JMLR), 2010.
[8] D. Laney. 3-D Data Management: Controlling Data Volume, Velocity and Variety. META Group Research Note, February 6, 2001.
[9] U. Kang, D. H. Chau, and C. Faloutsos. PEGASUS: Mining Billion-Scale Graphs in the Cloud. 2012.
[10] J. Gantz and D. Reinsel. IDC: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. December 2012.
THANK YOU.