Data Science And Big Data Analytics

Instead of taking a traditional exam as your “Final Exam”, you will demonstrate your understanding of the importance of the final two phases of the Data Analytics Lifecycle: Communicating the Results and Operationalizing. You have studied various analytic methods and surveyed the available tools and techniques. But as a Data Scientist, you must also communicate what your team has accomplished.

For your Final “Exam”, you will create a Final Sponsor Presentation. You have been hired by Aetna to use big data to improve patient health. The case study you will use can be found at: https://gigaom.com/2012/11/20/how-aetna-is-using-big-data-to-improve-patient-health/

Your assignment is to take this case study and create a PowerPoint Presentation of the Final Sponsor Presentation, as described in Chapter 12 of your text. (Make sure to use the format presented in Chapter 12.)

You will submit a PowerPoint Presentation file (ppt or pptx) with at least the following slides: (You can submit more, but I'm mainly looking for these six slides)

Title (Your name + other info)
Situation and Project Goals
Executive Summary
Approach
Method Description
Recommendations
Note: There is no direct information in the case study that covers the specific analytics methods used. You will extrapolate from the information provided in the case study and offer your own interpretation of the method (or methods) the Data Science team used. Draw from what we have covered this semester and map what you learned to what that team probably did. There is no rubric for this assignment. I want you to be as creative as possible. Tell the story. Inspire your sponsors.

Advanced Analytics - Technology and Tools

Copyright © 2014 EMC Corporation. All rights reserved.

Advanced Analytics – Technology and Tools

During this lesson the following topics are covered:

• Invoke interfaces from the command line

• Use query languages (Hive and Pig) for data analytics problems using unstructured data

• Build and query an HBase database

• Suggest examples where HBase is most suitable

The Hadoop Ecosystem

Hadoop itself was developed as an open source implementation of Google, Inc.’s Google File System (GFS), MapReduce framework, and BigTable system. Additionally, Hadoop has spawned a number of associated projects, most of which build on Hadoop’s MapReduce capabilities and HDFS.

You will examine several of these in this lesson.

Query Languages for Hadoop

• Builds on core Hadoop (MapReduce and HDFS) to enhance the development and manipulation of Hadoop clusters

- Pig --- Data flow language and execution environment

- Hive (and HiveQL) --- Query language based on SQL for building MapReduce jobs

- HBase --- Column-oriented database built on HDFS supporting MapReduce and point queries
  - Depends on Zookeeper --- a coordination service for building distributed applications

These interfaces build on core Hadoop to support different styles of interaction with the data stored in HDFS. All are layered over HDFS and MapReduce.

Pig is a data flow language and execution environment on top of Hadoop.

Hive (HiveQL) is an SQL-like query language for building MapReduce jobs.

HBase is a column-oriented database running over HDFS and supporting MapReduce and point queries.

HBase depends on Zookeeper, a coordination service for building distributed applications. We won’t be discussing Zookeeper in this course except to say that it exists.

Mahout is a project to build scalable machine learning systems by packaging machine learning algorithms as an executable library. You can specify HDFS input and output sources for the algorithms, and each algorithm can take different parameters.

Mahout is still in early development (0.7 is the latest release). More information is available at http://mahout.apache.org/

Levels of Abstraction

• As you move from Pig to Hive to HBase, you are increasingly moving away from the mechanics of Hadoop and creating an RDBMS view of the world

[Diagram: a ladder of abstraction. Pig (data flow language) sits nearest the mechanics of Hadoop, with more Hadoop visible; Hive (SQL-based language) sits in the middle; HBase (queries against defined tables) sits nearest the DBMS view, with less Hadoop visible.]

In Pig and Hive, the presence of HDFS is very noticeable.

Pig, for example, directly supports most of the Hadoop file system commands.

Likewise, Hive can access data whether it’s local or stored in an HDFS.

In either case, data can usually be specified via an HDFS URL (hdfs://<path>). In the case of HBase, however, Hadoop is mostly hidden in the HBase framework, and HBase provides data to the client via a programmatic interface (usually Java).

Via these interfaces, a Data Scientist can focus on manipulating large datasets without concerning themselves with the inner workings of Hadoop. Of course, a Data Scientist must be aware of the constraints associated with using Hadoop for data storage, but doesn’t need to know the exact Hadoop command to check the file system.
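For reference, a minimal sketch of the kind of file system interaction hidden behind these interfaces (the paths and file names are hypothetical):

# A minimal sketch, assuming a running HDFS; paths and file names are hypothetical.
# Copy a local sample file into HDFS and confirm it arrived.
hadoop fs -mkdir /user/analyst/data
hadoop fs -put samples.txt /user/analyst/data/
hadoop fs -ls /user/analyst/data

# The same data can then be referenced by an HDFS URI, e.g.
#   hdfs://namenode/user/analyst/data/samples.txt
# or by a local URI (file:///...) when running outside the cluster.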

What is Pig?

• Data flow language and execution environment for Hadoop

Two Main Elements

• A data flow language (Pig Latin)

• An execution environment with two modes:
  - Local --- accesses a local file system
  - MapReduce --- when you’re interested in the Hadoop environment

• When NOT to use Pig
  - If you only want to touch a small portion of the dataset (Pig eats it all)
  - If you do NOT want to use batch processing
  - Pig ONLY supports batch processing

Pig is a data flow language and an execution environment to access the MapReduce functionality of Hadoop (as well as HDFS).

Pig consists of two main elements:

1. A data flow language called Pig Latin (ig-pay atin-lay) and

2. An execution environment, either as a standalone system or one using HDFS for data storage.

A word of caution is in order: If you only want to touch a small portion of a given dataset, then Pig is not for you, since it only knows how to read all the data presented to it. Pig only supports batch processing of data, so if you need an interactive environment, Pig isn’t for you.
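A minimal sketch of the two execution modes from the command line (the script name and data locations are hypothetical):

# Local mode: read and write the local file system; useful for developing
# and debugging a script against a small sample.
pig -x local max_temp.pig

# MapReduce mode (the default): the same script runs as MapReduce jobs
# against data stored in HDFS.
pig max_temp.pig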

Writing Pig Latin. Seriously.

• A Pig script is a series of operations (transformations) applied to an input to produce an output

  - May be helpful to think of a Unix pipe command: tr [A-Za-z] file1; sort -o file2 file1; uniq -c file2

• Supports examining data structures and subsets of data

• Can execute Pig programs as a script
  - Via Grunt --- an interactive shell --- or from a Java program
  - Via the command line: pig

One can think of a data flow programming model as a series of transforms or filters applied to an input stream.

In Pig, each transform is defined as a new input data source (you’ll see an example of this in the next slide). Descriptions of these transforms can be provided to Pig via a Pig Latin script, or interactively using Grunt, Pig’s command-line interface.

Grunt also provides commands to query intermediate steps in the process. EXPLAIN shows the related MapReduce plan; DUMP lists out a dataset, and DESCRIBE describes the schema structure for that particular dataset.

Someone described Pig in this way: “You can process terabytes of data by issuing a half-dozen lines of Pig Latin from the console.” Not a bad return on investment. Just make sure they are the right half-dozen lines.
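A minimal sketch of an interactive Grunt session (reusing the sample file from this lesson); the diagnostic commands above are issued from inside the shell:

# Launch Grunt in local mode:
pig -x local

# Inside Grunt, define a relation and inspect it with the diagnostics:
#   records = LOAD 'data/samples.txt'
#             AS (year:chararray, temperature:int, quality:int);
#   DESCRIBE records;   -- print the schema of the relation
#   DUMP records;       -- materialize the relation and print its tuples
#   EXPLAIN records;    -- show the MapReduce plan Pig would generate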

Deconstructing Pig

-- max_temp.pig -- Finds the max temperature by year

1 records = LOAD 'data/samples.txt'
        AS (year:chararray, temperature:int, quality:int);

2 filtered_records = FILTER records BY temperature != 9999 AND
        (quality == 0 OR quality == 1 OR quality == 4 OR
         quality == 5 OR quality == 9);

3 grouped_records = GROUP filtered_records BY year;

4 max_temp = FOREACH grouped_records GENERATE group,
        MAX(filtered_records.temperature);

5 DUMP max_temp;

This simple five-line Pig program computes the maximum temperature per year from a multi-year, multi-observation file of temperatures.

Line 1 - the data is read into a local variable named “records.” Note that the code implies that records contains all the data, although this is an abstraction.

Line 2 - the data set is filtered by removing missing values for temperature and by ensuring that quality takes on the values 0, 1, 4, 5, or 9.

Line 3 - the filtered records are grouped by year (effectively sorting by year).

Line 4 - choose the maximum temperature value for each group.

Line 5 - dumps the final values stored in the variable max_temp.

Observe that the argument to the LOAD function could be a local file (file:///) or an HDFS URI (hdfs:///…). In the default case, since we’re running HDFS, it will be found in HDFS.

The result of the LOAD command is a table that consists of a set of tuples (collections of variables). In line 3 we create a single record per year consisting of the year and a bag (an unordered collection of tuples) that contains a tuple for every observation made in that year. The year is actually repeated in each record, so the data looks like <1949, {(1949, 111, 1), (1949, 78, 1) …}>. This is still a representation of the key/value pair that we saw earlier, but in this case the value is a structured data type (a bag of 3-tuples). Line 4 aggregates the data by the grouping variable, producing the maximum temperature for each year.
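A minimal sketch of running this script end to end (file locations are hypothetical; the illustrative output reuses the 1949 observations quoted above):

# Local mode against a small sample:
pig -x local max_temp.pig

# MapReduce mode, after copying the observations into HDFS:
hadoop fs -put data/samples.txt data/samples.txt
pig max_temp.pig

# With the sample records shown above, the DUMP on line 5 prints one tuple
# per year, e.g.:
#   (1949,111)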

Pig Comparison with SQL

Pig                                          SQL
A data flow language                         A declarative programming language
Schema is optional; can be specified         Schema is required at data load time
at run-time
Supports complex, nested data structures     Typically uses simple table structures
Does not support random reads or queries     Random reads and queries are supported

Pig acts as a front end to Hadoop; each command in Pig could theoretically become either a Map or a Reduce task (guess which ones are which in the preceding example).

Pig implements a data flow language where data flows through a series of transformations that alter the data in some way, and each step corresponds to a particular action. SQL, on the other hand, is a declarative programming language, and each SQL command represents a set of constraints that define the final output of the query.

For example, consider this SQL equivalent of our Pig example:

SELECT year, MAX(temperature)
FROM data
WHERE temperature <> 9999
  AND quality IN (0, 1, 4, 5, 9)
GROUP BY year;

Unlike SQL, Pig doesn’t require a schema to be present; if one is, it can be declared at run-time. Pig can deal with arbitrarily nested data structures (bags within bags).

SQL supports random reads and queries; Pig reads the entire data set offered to it.

Hive and HiveQL

• Query language based on SQL for building MapReduce jobs

• All data stored in tables; schema is managed by Hive
  - Schema can be applied to existing data in HDFS

The Hive system is aimed at the Data Scientist with strong SQL skills. Think of Hive as occupying a space between Pig and a DBMS (although that DBMS doesn’t have to be a Relational DBMS [RDBMS]).

In Hive, all data is stored in tables. The schema for each table is managed by Hive itself. Tables can be populated via the Hive interface, or a Hive schema can be applied to existing data stored in HDFS.

Hive Shell and HiveQL

• Hive
  - Provides web, server and shell interfaces for clients
  - Hive shell is the default
  - Can run external host commands using the “!prog” command
  - Can access HDFS using the DFS command

• HiveQL
  - Partial implementation of SQL-92 (closer to MySQL)
  - Data in Hive can be in internal tables or “external” tables
  - Internal tables are managed by Hive
  - External tables are not (lazy create and load)

The hive program provides different functions depending on which commands are provided. The simplest invocation is simply “hive”, which brings up the Hive shell. From there, you can enter Hive SQL commands interactively, or these commands can be combined into a single script file.

hive --service hwi starts the Hive web interface, whereby one can browse existing database schemas and create sessions for issuing database queries and Hive commands. The interface is available at http://<hostname>:9999/hwi.

hive --service hiveserver starts Hive as a server listening on port 10000 that provides Thrift and JDBC/ODBC interfaces to Hive databases.

Data for Hive can be stored in Hive’s internal tables (managed tables) or can be retrieved from data in the filesystem (HDFS). An example of creating an external table is:

CREATE EXTERNAL TABLE my_ext_data (dummy STRING) LOCATION '/opt/externalTable';

LOAD DATA INPATH '/opt/externalTable' INTO TABLE my_ext_data;

The existence of this data isn’t checked when these statements are executed, nor is the data loaded into Hive’s datastore. Hence the notion of “lazy create and lazy load.”
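A minimal sketch of these invocation styles from the command line (the script and table names are hypothetical):

# Interactive shell (the default). Inside it, "!" runs a host command and
# "dfs" runs an HDFS command, e.g.:  !date;   dfs -ls /user/hive/;
hive

# Run a single statement, or a script file, non-interactively:
hive -e "SHOW TABLES;"
hive -f create_ext_table.hql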

Temperature Example: Hive

Example Hive Code

1 CREATE TABLE records (year STRING, temperature INT, quality INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

2 LOAD DATA LOCAL INPATH 'data/samples.txt' OVERWRITE INTO TABLE records;

3 SELECT year, MAX(temperature)
      FROM records
      WHERE temperature != 9999
        AND (quality == 0 OR quality == 1 OR quality == 4 OR
             quality == 5 OR quality == 9)
      GROUP BY year;

Let’s go back to our example of calculating the maximum temperature for a given year from thousands and thousands of weather observations from hundreds of weather stations.

Line 1 defines your table and states that our input consists of tab-delimited fields.

In line 2, you encounter our old favorite LOAD DATA again with a slightly different syntax.

Line 3 looks like a standard SQL query that produces a relation consisting of a year and the max temperature for that year. The ROW FORMAT clause is a Hive-specific addition.

Hive maintains its own set of tables; these tables could exist on a local file system or in HDFS as /usr/hive/XXXXX. A directory in the filesystem corresponds to a particular table.

Hive does not implement the full SQL-92 standard, and additionally provides certain clauses that don’t appear in standard SQL (“ROW FORMAT …” is one such example).
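A minimal sketch of running the three statements above non-interactively (the script name is hypothetical, and the warehouse location varies by configuration):

# Save statements 1-3 as max_temp.hql, then run:
hive -f max_temp.hql

# The managed table's backing directory can then be inspected in HDFS:
hadoop fs -ls /user/hive/warehouse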

Hive Comparison with a Traditional Database

Hive                                         Database
“Schema on Read”                             “Schema on Write”
Incomplete SQL-92 (never a design goal)      Full SQL-92
No updates or transactions; indexes          Updates, transactions and indexes
available in v0.7

In most traditional DBMS, the database description (or schema) is read and applied when the data is loaded. If the data doesn’t match the schema (specifically, the table into which it is read), then the load fails. This is often called “Schema on Write.”

Hive, on the other hand, doesn’t attempt to apply the schema until the data is actually read when someone issues a query. This results in fast loads as well as support for multiple schemas over the same data (defining only as many variables as needed for your particular analysis). In addition, the actual format of the data may not be known at load time because queries against the data haven’t been defined yet.

Updates and transactions aren’t supplied with Hive. Indexes are available as of Hive 0.8. If concurrent access to tables is desired, then the application must roll its own. That said, the Hive project is working towards integration with the HBase project, which does provide row updates. See the Hive project page for further details.

Apache HBase - the Hadoop Database

• “Column oriented” database built over HDFS supporting MapReduce and point queries

• Depends on Zookeeper for consistency and Hadoop for distributed data.

• The Siteserver component provides several interfaces to Web clients (REST via HTTP, Thrift and Avro)

HBase represents a further layer of abstraction on Hadoop. HBase has been described as “a distributed column-oriented database [data storage system]” built on top of HDFS.

Note that HBase is described as managing structured data. Each record in the table can be described as a key (treated as a byte stream) and a set of variables, each of which may be versioned. It’s not a structure in the same sense as an RDBMS is structured.

HBase is a more complex system than what we have seen previously. HBase uses additional Apache Foundation open source frameworks: Zookeeper is used as a co-ordination system to maintain consistency, Hadoop for MapReduce and HDFS, and Oozie for workflow management. As a Data Scientist, you probably won’t be concerned overmuch with implementation, but it is useful to at least know the names of all the moving parts.

HBase can be run from the command line, but also supports REST (Representational State Transfer – think HTTP) and Thrift and Avro interfaces via the Siteserver daemon. Thrift and Avro both provide an interface to send and receive serialized data (objects where the data is “flattened” into a byte stream).
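A minimal sketch of the command-line route mentioned above, using the HBase shell (the table, column family and values are hypothetical):

# Pipe a short session into the HBase shell.
hbase shell <<'EOF'
create 'testtable', 'cf'
put 'testtable', 'row1', 'cf:col1', 'value1'
get 'testtable', 'row1'
scan 'testtable'
EOF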

When to Choose HBase

• You need random, real-time read/write access to your big data

• You need sparse tables consisting of millions of rows and millions of columns where each column variable may be versioned

• Google’s BigTable: a “Web table”

HBase has been described like this: “[its] forte is when real-time read-write random-access to very large datasets is required.” HBase is an open source version of Google’s BigTable, and it’s instructive to read the definition of BigTable by the original authors:

BigTable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

Note that HBase is described as managing structured data. Each record in the table can be described as a key (treated as a byte stream) and a set of variables, each of which may be versioned. It’s not a structure in the same sense as an RDBMS is structured, but it does have structure nonetheless. And because HBase is not constrained in the same way as an RDBMS system is constrained, HBase designs can take advantage of the physical layout of the table on disk to increase performance.

It’s useful to recall that Google’s BigTable was designed to store information about Web URLs (Web documents). Fields in this table could be versioned (new versions of an HTML page, for example) and the table could be updated frequently as web-crawlers discovered new data. One design decision to speed access to URLs from the same site was to reverse the order of the URL: instead of media.google.com, the URL would be stored as com.google.media, ensuring that other Google URLs would be “reasonably” close.
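A hedged sketch of what such a “webtable” design might look like in the HBase shell (the table name, column family, and versioning settings are illustrative, not the actual BigTable schema):

hbase shell <<'EOF'
# Keep up to 3 versions of each cell, so re-crawled pages retain history.
create 'webtable', {NAME => 'contents', VERSIONS => 3}

# Row keys use reversed domain order so pages from the same site sort together.
put 'webtable', 'com.google.media/index.html', 'contents:html', '<html>v1</html>'
put 'webtable', 'com.google.media/index.html', 'contents:html', '<html>v2</html>'

# Read back both stored versions of the cell.
get 'webtable', 'com.google.media/index.html', {COLUMN => 'contents:html', VERSIONS => 3}
EOF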

HBase Comparison with a Traditional Database

HBase                                        DBMS
No real indexes                              Real indexes
Supports automatic partitioning              No automatic partitioning
Ideal for billions of rows and               Challenged by sparse data
millions of columns
Supports real-time reads and writes          Supports real-time reads and writes

Although HBase may look like a traditional DBMS, it isn’t.

HBase is a “distributed, column-oriented data storage system that can scale tall (billions of rows), wide (billions of columns), and can be horizontally partitioned and replicated across thousands of commodity servers automatically.”

The HBase table schemas mirror physical storage for efficiency; an RDBMS schema doesn’t (it is a logical description of the data and implies no specific physical structuring).

Most RDBMS systems require that data must be consistent after each transaction (ACID properties). NoSQL (Not Only SQL) systems like HBase don’t suffer from these constraints, and implement eventual consistency. This means that for some systems you cannot write a value into the database and immediately read it back in. Strange, but true.

Another of HBase’s strengths is its wide-open view of data – HBase will accept almost anything it can cram into an HBase table.

Which Interface Should You Choose?

Pig

• Replacement for MapReduce Java coding

• When a need exists to customize part of the processing phases (UDFs)

Hive

• Use when SQL skills are available

• Customize part of the processing via UDFs

HBase

• Use when random queries and partial processing are required, or when specific file layouts are needed

If you are Java-savvy, or you have scripting skills (shell plus Ruby/Python/Perl/Tcl/Tk), you might simply want to use the existing Hadoop framework for MapReduce tasks.

If your talent lies in other areas, such as database administration or functional programming, you might wish to choose a different method of accessing your data.

Pig provides an abstraction above that of MapReduce for HDFS, and makes the interface simpler to use. Both Pig and Hive support UDFs (User Defined Functions).

Hive provides an SQL-like interface to data that may be stored in HDFS, but Hive tables don’t meet the definition of an RDBMS.

HBase, as the “Hadoop Database,” leverages Hadoop/HDFS for data storage and the Zookeeper system for co-ordination. Since there is no fixed schema per se, attributes (columns) can be added to a dataset without requiring programs to change to address the extra data; attribute values may be versioned to record changes to a particular value. Bulk loading can be accomplished by having MapReduce write files in HBase’s internal format directly into HDFS, with an order-of-magnitude increase in the speed of populating the database.

Mahout

• Scalable machine learning and data mining library for Hadoop

• Support for four use cases
  - Recommendation mining
  - Classification
  - Clustering
  - Frequent itemset

• Requires Hadoop infrastructure and Java programming

Mahout is a set of machine learning algorithms that leverages Hadoop to provide both data storage and the MapReduce implementation.

The mahout command is itself a script that wraps the Hadoop command and executes a requested algorithm from the Mahout job jar file (jar files are Java ARchives, and are very similar to Linux tar files [tape archives]). Parameters are passed from the command line to the class instance.

Mahout mainly supports four use cases:

• Recommendation mining takes users' behavior and tries to find items users might like. An example of this is LinkedIn’s “People You Might Know” (PYMK).

• Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.

• Clustering takes documents and groups them into collections of topically related documents based on word occurrences.

• Frequent itemset mining takes a set of item groups (for example, terms in a query session, shopping cart content) and identifies which individual items usually appear together.

If you plan on using Mahout, remember that these distributions (Hadoop and Mahout) anticipate running on a *nix machine, although a Cygwin environment on Windows will work as well (or rewriting the command scripts in another language, say as a batch file on Windows). It goes without saying that a compatible working version of Hadoop is required. Lastly, Mahout requires that you program in Java: no other interface outside of the command line is supported.
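As a hedged illustration of driving one of these algorithms from the command line (the k-means driver is shown; option names follow the 0.x command-line drivers and may differ between releases, and all HDFS paths are hypothetical):

#   --input       vectorized input data in HDFS
#   --output      where cluster results are written
#   --clusters    initial (seed) centroids
#   --clustering  assign each point to a cluster after the final iteration
mahout kmeans \
  --input /user/analyst/vectors \
  --output /user/analyst/kmeans-out \
  --clusters /user/analyst/seed-centroids \
  --numClusters 10 \
  --maxIter 20 \
  --clustering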

Algorithms Available in Mahout

• Recommenders
  - Non-distributed recommenders
  - Distributed item-based collaborative filtering
  - Collaborative filtering using a parallel matrix factorization

• Classification
  - Logistic Regression
  - Bayesian
  - Random Forests
  - Restricted Boltzmann Machines
  - Online Passive Aggressive

• Frequent Itemset
  - Parallel FP Growth Mining

• Clustering
  - Canopy Clustering
  - K-Means Clustering
  - Fuzzy K-Means
  - Mean Shift Clustering
  - Hierarchical Clustering
  - Dirichlet Process Clustering
  - Latent Dirichlet Allocation
  - Spectral Clustering
  - Minhash Clustering

Mahout provides a number of different algorithms to support its use cases. Some of these algorithms are listed above. We’ve only included those algorithms that are currently implemented (at the time of this version of the course) and don’t have problem reports issued against them. Other algorithms are in various stages of development: check the website for more details.

Other Hadoop Ecosystem Resources

Other tools from the Hadoop Ecosystem

• Log data collection
  - Scribe, Chukwa

• Workflow/coordination
  - Oozie, Azkaban

• Yet another BigTable implementation
  - Hypertable

• Other key-value distributed datastores
  - Cassandra, Voldemort

• Other tools
  - Sqoop --- import or export data between HDFS and structured databases
  - Cascading --- an alternate API to Hadoop MapReduce

We have already been introduced to HBase, Hive, Pig, and Zookeeper. Other tools/systems based on Hadoop HDFS and MapReduce include:

Howl --- a mixture of Hive and OWL (another interface to HDFS data).

Oozie --- a workflow/coordination system to manage Apache Hadoop jobs.

Zookeeper --- a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.

Chukwa --- a Hadoop subproject devoted to large-scale log collection and analysis.

Cascading --- an alternative API to Hadoop MapReduce.

Scribe --- a server designed for the real-time streaming of log data.

Cassandra --- another database system, noted for its elasticity, durability and availability.

Hypertable --- another BigTable implementation that runs over Hadoop.

Voldemort --- a distributed key-value storage system, used at LinkedIn.

Azkaban --- a workflow scheduler, developed at LinkedIn.

Sqoop --- a tool for importing and exporting data between Hadoop and structured databases.

Check Your Knowledge

1. How does Pig differ from a typical MapReduce process?

2. How does schema parsing differ in Hive from a traditional RDBMS?

3. With regards to file structure, how does HBase differ from a traditional RDBMS?

4. Which capabilities of Hadoop does Mahout use?

5. Which categories of use cases are supported by Mahout?

Please take a moment to answer these questions. Write your answers in the space below.

Summary

Advanced Analytics - Technology and Tools

During this lesson the following topics were covered:

• Query languages for Hadoop (Hive and Pig)

• HBase – a BigTable workalike using Hadoop

• Mahout – machine learning algorithms using Hadoop MapReduce and HDFS

• Other elements of the Hadoop Ecosystem

Lesson 2 covered:

Hive and Pig – Hadoop query languages

HBase – a BigTable workalike using Hadoop

Mahout – machine learning algorithms and Hadoop MapReduce
