Data Science And Big Data Analytics
Instead of a traditional exam, for your "Final Exam" you will demonstrate your understanding of the importance of the final two phases of the Data Analytics Lifecycle: Communicating the Results and Operationalizing. You have studied various analytic methods and surveyed the available tools and techniques. But as a Data Scientist, you must also communicate what your team has accomplished.
For your Final “Exam”, you will create a Final Sponsor Presentation. You have been hired by Aetna to use big data to improve patient health. The case study you will use can be found at: https://gigaom.com/2012/11/20/how-aetna-is-using-big-data-to-improve-patient-health/
Your assignment is to take this case study and create a PowerPoint Presentation of the Final Sponsor Presentation, as described in Chapter 12 of your text. (Make sure to use the format presented in Chapter 12.)
You will submit a PowerPoint Presentation file (ppt or pptx) with at least the following slides: (You can submit more, but I'm mainly looking for these six slides)
Title (Your name + other info)
Situation and Project Goals
Executive Summary
Approach
Method Description
Recommendations
Note: There is no direct information in the case study that covers the specific analytics methods used. You will extrapolate from the information provided in the case study and offer your own interpretation of the method (or methods) the Data Science team used. Draw from what we have covered this semester and map what you learned to what that team probably did. There is no rubric for this assignment. I want you to be as creative as possible. Tell the story. Inspire your sponsors.
Advanced Analytics - Technology and Tools
Copyright © 2014 EMC Corporation. All rights reserved.
Advanced Analytics – Technology and Tools
During this lesson the following topics are covered:
• Invoke interfaces from the command line
• Use query languages (Hive and Pig) for data analytics problems using unstructured data
• Build and query an HBase database
• Suggest examples where HBase is most suitable
The Hadoop Ecosystem
Hadoop itself was developed as an open source implementation of Google, Inc.'s Google File System (GFS), MapReduce framework, and BigTable system. Additionally, Hadoop has spawned a number of associated projects, most of which depend on Hadoop's MapReduce capabilities and HDFS.
You will examine several of these in this lesson.
Query Languages for Hadoop
• Builds on core Hadoop (MapReduce and HDFS) to enhance the development and manipulation of Hadoop clusters
Pig --- Data flow language and execution environment
Hive (and HiveQL) --- Query language based on SQL for building MapReduce jobs
HBase --- Column oriented database built on HDFS supporting MapReduce and point queries
Depends on Zookeeper - a coordination service for building distributed applications
These interfaces build on core Hadoop to support different styles of interaction with the data stored in HDFS. All are layered over HDFS and MapReduce.
Pig is a data flow language and execution environment on top of Hadoop.
Hive (HiveQL) is an SQL-like query language for building MapReduce jobs.
HBase is a column-oriented database running over HDFS and supporting MapReduce and point queries.
HBase depends on Zookeeper, a coordination service for building distributed applications. We won't be discussing Zookeeper in this course, except to say that it exists.
Mahout is a project to build scalable machine learning systems by packaging machine learning algorithms as an executable library. You can specify HDFS input and output sources for the algorithms, and each algorithm can take different parameters.
Mahout is still in early development (0.7 is the latest release). More information is available at http://mahout.apache.org/
Levels of Abstraction
• As you move from Pig to Hive to HBase, you are increasingly moving away from the mechanics of Hadoop and creating an RDBMS view of the world:
Pig --- data flow language (most Hadoop visible; closest to the mechanics of Hadoop)
Hive --- SQL-based language
HBase --- queries against defined tables (least Hadoop visible; closest to a DBMS view)
In Pig and Hive, the presence of HDFS is very noticeable.
Pig, for example, directly supports most of the Hadoop file system commands.
Likewise, Hive can access data whether it’s local or stored in an HDFS.
In either case, data can usually be specified via an HDFS URI (hdfs://...). In the case of HBase, however, Hadoop is mostly hidden inside the HBase framework, and HBase provides data to the client via a programmatic interface (usually Java).
Via these interfaces, a Data Scientist can focus on manipulating large datasets without concerning themselves with the inner workings of Hadoop. Of course, a Data Scientist must be aware of the constraints associated with using Hadoop for data storage, but doesn't need to know the exact Hadoop command to check the file system.
What is Pig?
• Data flow language and execution environment for Hadoop
• Two main elements:
A data flow language (Pig Latin)
An execution environment with two modes: local (accesses the local file system) and MapReduce (when you're interested in the Hadoop environment)
• When NOT to use Pig:
If you only want to touch a small portion of the dataset (Pig eats it all)
If you do NOT want batch processing (Pig ONLY supports batch processing)
Pig is a data flow language and an execution environment to access the MapReduce functionality of Hadoop (as well as HDFS).
Pig consists of two main elements:
1. A data flow language called Pig Latin (ig-pay atin-lay) and
2. An execution environment, either as a standalone system or one using HDFS for data storage.
A word of caution is in order: If you only want to touch a small portion of a given dataset, then Pig is not for you, since it only knows how to read all the data presented to it. Pig only supports batch processing of data, so if you need an interactive environment, Pig isn’t for you.
Writing Pig Latin. Seriously.
• A Pig script is a series of operations (transformations) applied to an input to produce an output
May be helpful to think of a sequence of Unix commands: tr [A-Za-z] file1; sort -o file2 file1; uniq -c file2
• Supports examining data structures and subsets of data
• Can execute Pig programs as a script, via Grunt (an interactive shell), or from a Java program
• Invoked via the command line: pig
One can think of a data flow programming model as a series of transforms or filters applied to an input stream.
In Pig, each transform is defined as a new input data source (you'll see an example of this in the next slide). Descriptions of these transforms can be provided to Pig via a Pig Latin script, or interactively using Grunt, Pig's command-line interface.
Grunt also provides commands to query intermediate steps in the process. EXPLAIN shows the related MapReduce plan; DUMP lists out a dataset, and DESCRIBE describes the schema structure for that particular dataset.
Someone described Pig in this way: “You can process terabytes of data by issuing a half-dozen lines of Pig Latin from the console.” Not a bad return on investment. Just make sure they are the right half-dozen lines.
Deconstructing Pig
-- max_temp.pig -- Finds the max temperature by year
1 records = LOAD 'data/samples.txt' AS (year:chararray, temperature:int, quality:int);
2 filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
3 grouped_records = GROUP filtered_records BY year;
4 max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
5 DUMP max_temp;
This simple five-line Pig program computes the maximum temperature per year from a multi-year file of temperature observations.
Line 1 - the data is read into a local variable named "records." Note that the code implies that records contains all the data, although this is an abstraction.
Line 2 - the data set is filtered by removing missing values for temperature and by ensuring that quality takes on the values 0, 1, 4, 5, or 9.
Line 3 - the filtered records are grouped by year (effectively sorting by year).
Line 4 - choose the maximum temperature value for each group.
Line 5 - dumps the final values stored in the variable max_temp.
Observe that the argument to the LOAD function could be a local file (file:///) or an HDFS URI (hdfs:///…). In the default case, since we’re running HDFS, it will be found in HDFS.
The result of the LOAD command is a table that consists of a set of tuples (collections of variables). In line 3 we created a single record consisting of the year and a bag (an unordered collection of tuples) that contains a tuple for every observation made for that year. The year is actually repeated in each record, so the data looks like (1949, {(1949, 111, 1), (1949, 78, 1), …}). This is still a representation of the key/value pair that we saw earlier, but in this case the value is a structured data type (a bag consisting of multiple 3-tuples). Line 4 aggregates the data by the grouping variable, generating the group and the maximum temperature for that year.
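An earlier slide noted that Pig programs can also be run from a Java program. As a hedged sketch (the class name MaxTempEmbedded is illustrative, not from the course), the same max_temp flow can be embedded in Java using Pig's PigServer API:

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class MaxTempEmbedded {
    public static void main(String[] args) throws Exception {
        // ExecType.MAPREDUCE runs against the Hadoop cluster; ExecType.LOCAL uses the local file system.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Register the same statements used in max_temp.pig.
        pig.registerQuery("records = LOAD 'data/samples.txt' AS (year:chararray, temperature:int, quality:int);");
        pig.registerQuery("filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);");
        pig.registerQuery("grouped_records = GROUP filtered_records BY year;");
        pig.registerQuery("max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);");

        // Iterate over the result rather than DUMPing it to the console.
        Iterator<Tuple> it = pig.openIterator("max_temp");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}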
Pig Compared with SQL

Pig | SQL
A data flow language | A declarative programming language
Schema is optional; can be specified at run-time | Schema is required at data load time
Supports complex, nested data structures | Typically uses simple table structures
Does not support random reads or queries | Random reads and queries are supported
Pig acts as a front end to Hadoop; each command in Pig could theoretically become either a Map or a Reduce task (guess which ones are which in the preceding example).
Pig implements a data flow language where data flows through a series of transformations that alter the data in some way, and each step corresponds to a particular action. SQL, on the other hand, is a declarative programming language, and each SQL command represents a set of constraints that define the final output of the query.
For example, consider this SQL equivalent of our Pig example:
SELECT year, MAX(temperature) FROM data WHERE temperature <> 9999 AND quality IN (0, 1, 4, 5, 9) GROUP BY year
Unlike SQL, Pig doesn’t require a schema to be present; if one is, it can be declared at run-time. Pig can deal with arbitrarily nested data structures (bags within bags).
SQL supports random reads and queries; Pig reads the entire data set offered to it.
Hive and HiveQL
• Query language based on SQL for building MapReduce jobs
• All data stored in tables; the schema is managed by Hive
Schema can be applied to existing data in HDFS
The Hive system is aimed at the Data Scientist with strong SQL skills. Think of Hive as occupying a space between Pig and a DBMS (although that DBMS doesn't have to be a Relational DBMS [RDBMS]).
In Hive, all data is stored in tables. The schema for each table is managed by Hive itself. Tables can be populated via the Hive interface, or a Hive schema can be applied to existing data stored in HDFS.
Hive Shell and HiveQL
• Hive provides web, server, and shell interfaces for clients
The Hive shell is the default
Can run external host commands using the "!prog" syntax
Can access HDFS using the DFS command
• HiveQL is a partial implementation of SQL-92 (closer to MySQL)
Data in Hive can be in internal tables or "external" tables
Internal tables are managed by Hive
External tables are not (lazy create and load)
The hive program provides different functions depending on which commands are provided. The simplest invocation is simply "hive", which brings up the Hive shell. From there, you can enter Hive SQL commands interactively, or these commands can be combined into a single script file.
hive --service hwi starts the Hive web interface, whereby one can browse existing database schemas and create sessions for issuing database queries and Hive commands. The interface is available at http://<hostname>:9999/hwi.
hive --service hiveserver starts Hive as a server listening on port 10000 that provides Thrift and JDBC/ODBC interfaces to Hive databases.
Data for Hive can be stored in Hive's internal tables (managed tables) or can be retrieved from data in the filesystem (HDFS). An example of creating an external table is:
CREATE EXTERNAL TABLE my_ext_data (dummy STRING) LOCATION '/opt/externalTable';
LOAD DATA INPATH '/opt/externalTable' INTO TABLE my_ext_data;
The existence of this data isn't checked when these statements are executed, nor is the data loaded into Hive's datastore. Hence the notion of "lazy create and lazy load."
Temperature Example: Hive
Example Hive Code
1 CREATE TABLE records (year STRING, temperature INT, quality INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
2 LOAD DATA LOCAL INPATH 'data/samples.txt' OVERWRITE INTO TABLE records;
3 SELECT year, MAX(temperature) FROM records WHERE temperature != 9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9) GROUP BY year;
Let's go back to our example of calculating the maximum temperature for a given year from thousands and thousands of weather observations from hundreds of weather stations.
Line 1 defines your table and states that our input consists of tab-delimited fields.
In line 2, you encounter our old favorite LOAD DATA again with a slightly different syntax.
Line 3 looks like a standard SQL query that produces a relation consisting of a year and the max temperature for that year. The ROW FORMAT clause (in line 1) is a Hive-specific addition.
Hive maintains its own set of tables; these tables could exist on a local file system or in HDFS under /user/hive/XXXXX. A directory in the filesystem corresponds to a particular table.
Hive does not implement the full SQL-92 standard, and additionally provides certain clauses that don’t appear in standard SQL (“ROW FORMAT …” is one such example).
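An earlier slide mentioned that the hiveserver daemon exposes Thrift and JDBC/ODBC interfaces on port 10000. As a hedged sketch (the host name is a placeholder and the HiveServer1 driver class is an assumption about the deployment), the records table created above could also be queried from Java over JDBC:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        // HiveServer1 JDBC driver, shipped with the Hive distribution.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // "localhost" is a placeholder for whichever host runs the hiveserver daemon.
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // The same max-temperature query, issued against the records table defined above.
        ResultSet rs = stmt.executeQuery(
            "SELECT year, MAX(temperature) FROM records "
            + "WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9) GROUP BY year");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
        }

        rs.close();
        stmt.close();
        con.close();
    }
}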
Hive Compared with a Traditional Database

Hive | RDBMS
"Schema on Read" | "Schema on Write"
Incomplete SQL-92 (never a design goal) | Full SQL-92
No updates or transactions; indexes available in v0.7 | Updates, transactions, and indexes
In most traditional DBMS, the database description (or schema) is read and applied when the data is loaded. If the data doesn’t match the schema (specifically, the table into which it is read), then the load fails. This is often called “Schema on Write.”
Hive, on the other hand, doesn’t attempt to apply the schema until the data is actually read when someone issues a query. This results in fast loads as well as supporting multiple schemas for the same data (only defining as many variables as needed for your particular analysis). In addition, the actual format of the data may not be known because queries against the data haven’t been defined.
Updates and transactions aren't supplied with Hive. Indexes are available as of Hive 0.7. If concurrent access to tables is desired, then the application must roll its own. That said, the Hive project is working towards integration with the HBase project, which does provide row updates. See the Hive project page at http://hive.apache.org/ for further details.
Apache HBase - the Hadoop Database
• “Column oriented” database built over HDFS supporting MapReduce and point queries
• Depends on Zookeeper for consistency and Hadoop for distributed data.
• The Siteserver component provides several interfaces to Web clients (REST via HTTP, Thrift and Avro)
HBase represents a further layer of abstraction on Hadoop. HBase has been described as "a distributed column-oriented database [data storage system]" built on top of HDFS.
Note that HBase is described as managing structured data. Each record in the table can be described as a key (treated as a byte stream) and a set of variables, each of which may be versioned. It’s not a structure in the same sense as an RDBMS is structured.
HBase is a more complex system than what we have seen previously. HBase uses additional Apache Foundation open source frameworks: Zookeeper is used as a co-ordination system to maintain consistency, Hadoop for MapReduce and HDFS, and Oozie for workflow management. As a Data Scientist, you probably won’t be concerned overmuch with implementation, but it is useful to at least know the names of all the moving parts.
HBase can be run from the command line, but also supports REST (Representational State Transfer – think HTTP) and Thrift and Avro interfaces via the Siteserver daemon. Thrift and Avro both provide an interface to send and receive serialized data (objects where the data is “flattened” into a byte stream).
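As a hedged illustration of that programmatic (Java) interface, here is a minimal client sketch against a hypothetical "webtable" with a "contents" column family (both names are assumptions for illustration, echoing the BigTable web-table example on the next slide), using the HBase client API of that era:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // Reads cluster and Zookeeper settings from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable");

        // Row keys are plain byte streams; here a reversed URL is used as the key.
        byte[] rowKey = Bytes.toBytes("com.google.media");

        // Write one (versioned) cell, contents:html, for this row.
        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("contents"), Bytes.toBytes("html"), Bytes.toBytes("<html>...</html>"));
        table.put(put);

        // Point query: fetch the same cell back.
        Get get = new Get(rowKey);
        Result result = table.get(get);
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"))));

        table.close();
    }
}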
When to Choose HBase
• You need random, real-time read/write access to your big data
• You need sparse tables consisting of millions of rows and millions of columns where each column variable may be versioned
• Google’s BigTable: a “Web table”
HBase has been described like this: “[its] forte is when real-time read-write random-access to very large datasets is required.” HBase is an open source version of Google’s BigTable, and it’s instructive to read the definition of BigTable by the original authors:
BigTable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
Note that HBase is described as managing structured data. Each record in the table can be described as a key (treated as a byte stream) and a set of variables, each of which may be versioned. It's not a structure in the same sense as an RDBMS is structured, but it does have structure nonetheless. And because HBase is not constrained in the same way as an RDBMS is constrained, HBase designs can take advantage of the physical layout of the table on disk to increase performance.
It’s useful to recall that Google’s BigTable was designed to store information about Web URLs (Web documents). Fields in this table could be versioned (new versions of an HTML page, for example) and the table could be updated frequently as web-crawlers discovered new data. One design decision to speed access to URLs from the same site was to reverse the order of the URL: instead of media.google.com, the URL would be stored as com.google.media, ensuring that other Google URLs would be “reasonably” close.
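As a small, hedged illustration of that row-key design decision (the class and method names are hypothetical, not from the course), reversing the domain portion of a URL is a simple transformation that makes rows from the same site sort near each other:

public class RowKeys {
    // Turns "media.google.com" into "com.google.media" so related rows cluster on disk.
    static String reverseDomain(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseDomain("media.google.com"));  // prints com.google.media
    }
}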
HBase Compared with a Traditional Database

HBase | DBMS
No real indexes | Real indexes
Supports automatic partitioning | No automatic partitioning
Ideal for billions of rows and millions of columns | Challenged by sparse data
Supports real-time reads and writes | Supports real-time reads and writes
Although HBase may look like a traditional DBMS, it isn’t.
HBase is a “distributed, column-oriented data storage system that can scale tall (billions of rows), wide (billions of columns), and can be horizontally partitioned and replicated across thousands of commodity servers automatically.”
The HBase table schemas mirror physical storage for efficiency; an RDBMS schema doesn't (it is a logical description of the data and implies no specific physical structuring).
Most RDBMS systems require that data be consistent after each transaction (the ACID properties). NoSQL (Not Only SQL) systems like HBase don't suffer from these constraints and instead implement eventual consistency. This means that for some systems you cannot write a value into the database and immediately read it back. Strange, but true.
Another of HBase's strengths is its wide-open view of data: HBase will accept almost anything it can cram into an HBase table.
Which Interface Should You Choose?
Pig
• Replacement for MapReduce Java coding
• When a need exists to customize part of the processing phases (UDFs)
Hive
• Use when SQL skills are available
• Customize part of the processing via UDFs
HBase
• Use when random queries and partial processing are required, or when specific file layouts are needed
If you are Java-savvy, or you have scripting skills (shell plus Ruby/Python/Perl/Tcl/Tk), you might simply want to use the existing Hadoop framework for MapReduce tasks.
If your talent lies in other areas, such as database administration or functional programming, you might wish to choose a different method of accessing your data.
Pig provides an abstraction above that of MapReduce for HDFS, and makes the interface simpler to use. Both Pig and Hive support UDFs (User Defined Functions).
Hive provides an SQL-like interface to data that may be stored in HDFS, but Hive tables don't meet the definition of an RDBMS.
HBase, as the "Hadoop Database," leverages Hadoop/HDFS for data storage and the Zookeeper system for co-ordination. Since there is no fixed schema per se, attributes (columns) can be added to a dataset without requiring programs to change to address the extra data; attribute values may be versioned to record changes to a particular value. Bulk loading can be accomplished by having MapReduce write files in HBase's internal format directly into HDFS, populating the database an order of magnitude faster.
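For contrast with the five lines of Pig Latin and three Hive statements shown earlier, here is a hedged sketch of the same max-temperature logic written directly against the Hadoop MapReduce Java API (class names are illustrative; a Job driver to set input/output paths and submit the job is omitted for brevity):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

    // Map: parse a tab-delimited record (year, temperature, quality) and emit (year, temperature)
    // for readings that pass the same filter used in the Pig and Hive examples.
    public static class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            int temperature = Integer.parseInt(fields[1]);
            int quality = Integer.parseInt(fields[2]);
            if (temperature != 9999
                    && (quality == 0 || quality == 1 || quality == 4 || quality == 5 || quality == 9)) {
                context.write(new Text(fields[0]), new IntWritable(temperature));
            }
        }
    }

    // Reduce: keep the maximum temperature seen for each year.
    public static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new IntWritable(max));
        }
    }
}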
Mahout
• Scalable machine learning and data mining library for Hadoop
• Support for four use cases:
Recommendation mining
Classification
Clustering
Frequent itemset mining
• Requires Hadoop infrastructure and Java programming
Mahout is a set of machine learning algorithms that leverages Hadoop to provide both data storage and the MapReduce implementation.
The mahout command is itself a script that wraps the Hadoop command and executes a requested algorithm from the Mahout job jar file (jar files are Java ARchives, and are very similar to Linux tar files [tape archives]). Parameters are passed from the command line to the class instance.
Mahout mainly supports four use cases:
• Recommendation mining takes users' behavior and tries to find items users might like. An example of this is LinkedIn's "People You May Know" (PYMK).
• Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.
• Clustering takes documents and groups them into collections of topically related documents based on word occurrences.
• Frequent itemset mining takes a set of item groups (for example, terms in a query session, or shopping cart contents) and identifies which individual items usually appear together.
If you plan on using Mahout, remember that these distributions (Hadoop and Mahout) anticipate running on a *nix machine, although a Cygwin environment on Windows will work as well (or you can rewrite the command scripts in another language, say as a batch file on Windows). It goes without saying that a compatible, working version of Hadoop is required. Lastly, Mahout requires that you program in Java: no interface other than the command line is supported.
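As a hedged sketch of the recommendation-mining use case (the input file name and parameters are illustrative, and this uses Mahout's non-distributed "Taste" recommender API rather than the Hadoop-based implementations listed on the next slide), a user-based recommender can be wired up like this:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // ratings.csv is a hypothetical file of userID,itemID,preference lines.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Score user-to-user similarity and form a neighborhood of the 10 most similar users.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        // Recommend 3 items for user 1 based on what similar users liked.
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item);
        }
    }
}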
Algorithms Available in Mahout
• Recommenders
Non-distributed recommenders
Distributed item-based collaborative filtering
Collaborative filtering using a parallel matrix factorization
• Classification
Logistic Regression
Bayesian
Random Forests
Restricted Boltzmann Machines
Online Passive Aggressive
• Frequent Itemset Mining
Parallel FP-Growth
• Clustering
Canopy Clustering
K-Means Clustering
Fuzzy K-Means
Mean Shift Clustering
Hierarchical Clustering
Dirichlet Process Clustering
Latent Dirichlet Allocation
Spectral Clustering
Minhash Clustering
Mahout provides a number of different algorithms to support its use cases. Some of these algorithms are listed. We've only included those algorithms that are currently implemented (at the time of this version of the course) and don't have problem reports issued against them. Other algorithms are in various stages of development: check the website at http://mahout.apache.org/ for more details.
Other Hadoop Ecosystem Resources
Other tools from the Hadoop ecosystem:
• Log data collection: Scribe, Chukwa
• Workflow/coordination: Oozie, Azkaban
• Yet another BigTable implementation: Hypertable
• Other key-value distributed datastores: Cassandra, Voldemort
• Other tools:
Sqoop -- import or export data between HDFS and structured databases
Cascading -- an alternate API to Hadoop MapReduce
We have already been introduced to HBase, Hive, Pig, and Zookeeper. Other tools/systems based on Hadoop HDFS and MapReduce include:
Howl -- a mixture of Hive and OWL (another interface to HDFS data).
Oozie -- a workflow/coordination system to manage Apache Hadoop(TM) jobs.
Zookeeper -- a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Chukwa -- a Hadoop subproject devoted to large-scale log collection and analysis.
Cascading -- an alternative API to Hadoop MapReduce.
Scribe -- a server designed for the real-time streaming of log data.
Cassandra -- another database system, noted for its elasticity, durability, and availability.
Hypertable -- another BigTable implementation that runs over Hadoop.
Voldemort -- a distributed key-value storage system, used at LinkedIn.
Azkaban -- a workflow scheduler, developed at LinkedIn.
Sqoop -- a tool for importing and exporting data between Hadoop and structured databases.
Check Your Knowledge
1. How does Pig differ from a typical MapReduce process?
2. How does schema parsing differ in Hive from a traditional RDBMS?
3. With regards to file structure, how does HBase differ from a traditional RDBMS?
4. Which capabilities of Hadoop does Mahout use?
5. Which categories of use cases are supported by Mahout?
Please take a moment to answer these questions. Write your answers in the space below.
Advanced Analytics - Technology and Tools
Summary
During this lesson the following topics were covered:
• Query languages for Hadoop (Hive and Pig)
• HBase - a BigTable workalike using Hadoop
• Mahout - machine learning algorithms using Hadoop MapReduce and HDFS
• Other elements of the Hadoop Ecosystem
Lesson 2 covered:
Hive and Pig – Hadoop query languages
HBase – a BigTable workalike using Hadoop
Mahout – machine learning algorithms and Hadoop MapReduce