Cross industry standard process for data mining pdf

06/12/2021 Client: muhammad11 Deadline: 2 Day

DATA SCIENCE AND BIG DATA ANALYTICS

CHAPTER 2: DATA ANALYTICS LIFECYCLE

DATA ANALYTICS LIFECYCLE

• Data science projects differ from BI projects

• More exploratory in nature

• Critical to have a project process

• Participants should be thorough and rigorous

• Break large projects into smaller pieces

• Spend time to plan and scope the work

• Documenting adds rigor and credibility

DATA ANALYTICS LIFECYCLE

• Data Analytics Lifecycle Overview

• Phase 1: Discovery

• Phase 2: Data Preparation

• Phase 3: Model Planning

• Phase 4: Model Building

• Phase 5: Communicate Results

• Phase 6: Operationalize

• Case Study: GINA

2.1 DATA ANALYTICS LIFECYCLE OVERVIEW

• The Data Analytics Lifecycle is designed for Big Data problems and data science projects

• The lifecycle has six phases, and project work can occur in several phases simultaneously

• The cycle is iterative, to portray a real project

• Work can return to earlier phases as new information is uncovered

2.1.1 KEY ROLES FOR A SUCCESSFUL ANALYTICS PROJECT

• Business User – understands the domain area

• Project Sponsor – provides requirements

• Project Manager – ensures meeting objectives

• Business Intelligence Analyst – provides business domain expertise based on deep understanding of the data

• Database Administrator (DBA) – creates DB environment

• Data Engineer – provides technical skills, assists data management and extraction, supports analytic sandbox

• Data Scientist – provides analytic techniques and modeling

2.1.2 BACKGROUND AND OVERVIEW OF DATA ANALYTICS LIFECYCLE

• Data Analytics Lifecycle defines the analytics process and best practices from discovery to project completion

• The Lifecycle employs aspects of

• Scientific method (https://en.wikipedia.org/wiki/Scientific_method)

• Cross Industry Standard Process for Data Mining (CRISP-DM), a process model for data mining (https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining)

• Davenport’s DELTA framework (http://www.informationweek.com/software/information-management/analytics-at-work-qanda-with-tom-davenport/d/d-id/1085869?)

• Hubbard’s Applied Information Economics (AIE) approach (https://en.wikipedia.org/wiki/Applied_information_economics)

• MAD Skills: New Analysis Practices for Big Data by Cohen et al. (https://pafnuty.wordpress.com/2013/03/15/reading-log-mad-skills-new-analysis-practices-for-big-data-cohen/)

OVERVIEW OF DATA ANALYTICS LIFECYCLE

2.2 PHASE 1: DISCOVERY

1. Learning the Business Domain

2. Resources

3. Framing the Problem

4. Identifying Key Stakeholders

5. Interviewing the Analytics Sponsor

6. Developing Initial Hypotheses

7. Identifying Potential Data Sources

2.3 PHASE 2: DATA PREPARATION

• Includes steps to explore, preprocess, and condition data

• Create robust environment – analytics sandbox

• Data preparation tends to be the most labor-intensive step in the analytics lifecycle

• Often at least 50% of the data science project’s time

• The data preparation phase is generally the most iterative and the one that teams tend to underestimate most often

2.3.1 PREPARING THE ANALYTIC SANDBOX

• Create the analytic sandbox (also called workspace)

• Allows team to explore data without interfering with live production data

• Sandbox collects all kinds of data (expansive approach)

• The sandbox allows organizations to undertake ambitious projects beyond traditional data analysis and BI to perform advanced predictive analytics

• Although the concept of an analytics sandbox is relatively new, it has become accepted by data science teams and IT groups

2.3.2 PERFORMING ETLT (EXTRACT, TRANSFORM, LOAD, TRANSFORM)

• In ETL, users extract the data, transform it, and then load it

• In the sandbox the process is often ELT – loading early preserves the raw data, which can be useful to examine

• Example – in credit card fraud detection, outliers can represent high-risk transactions that might be inadvertently filtered out or transformed before being loaded into the database

• Hadoop (Chapter 10) is often used here
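The ELT ordering described above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the in-memory SQLite database stands in for the sandbox, and the table names, sample transactions, and the 1000.0 threshold are all hypothetical. The point is that the raw table is loaded untouched and the outlier is flagged in a derived table rather than filtered out.

```python
import sqlite3

# Hypothetical raw card transactions: (transaction_id, amount)
RAW_TRANSACTIONS = [
    (1, 12.50), (2, 48.00), (3, 7.25), (4, 9800.00), (5, 31.10),
]


def load_then_transform(rows, threshold):
    """ELT sketch: load raw rows first, then transform inside the sandbox."""
    conn = sqlite3.connect(":memory:")  # stand-in for the analytic sandbox
    # Load: the raw table keeps every record, including extreme values.
    conn.execute("CREATE TABLE raw_txn (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO raw_txn VALUES (?, ?)", rows)
    # Transform: flag large amounts in a derived table instead of dropping them.
    conn.execute(
        "CREATE TABLE txn_flagged (id INTEGER, amount REAL, is_outlier INTEGER)"
    )
    conn.execute(
        "INSERT INTO txn_flagged SELECT id, amount, amount > ? FROM raw_txn",
        (threshold,),
    )
    return conn


conn = load_then_transform(RAW_TRANSACTIONS, threshold=1000.0)
flagged = conn.execute(
    "SELECT id FROM txn_flagged WHERE is_outlier = 1"
).fetchall()
```

Because `raw_txn` is never modified, the team can revisit the threshold later without re-extracting the data.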

2.3.3 LEARNING ABOUT THE DATA

• Becoming familiar with the data is critical

• This activity accomplishes several goals:

• Determines the data available to the team early in the project

• Highlights gaps – identifies data not currently available

• Identifies data outside the organization that might be useful

2.3.3 LEARNING ABOUT THE DATA – SAMPLE DATASET INVENTORY

2.3.4 DATA CONDITIONING

• Data conditioning includes cleaning data, normalizing datasets, and performing transformations

• Often viewed as a preprocessing step prior to data analysis, it might be performed by data owner, IT department, DBA, etc.

• Best to have data scientists involved

• Data science teams prefer too much data to too little

• Additional questions and considerations

• What are the data sources? Target fields?

• How clean is the data?

• How consistent are the contents and files? Missing or inconsistent values?

• Assess the consistency of the data types – numeric, alphanumeric?

• Review the contents to ensure the data makes sense

• Look for evidence of systematic error
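Several of the questions above (missing values, consistency of data types) can be answered with a small profiling pass. The sketch below is one possible approach using only the standard library; the `profile_field` helper, the `age` field, and the sample records are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical records as they might arrive from a CSV extract.
SAMPLE_RECORDS = [
    {"age": "34"},
    {"age": ""},      # missing value
    {"age": "n/a"},   # alphanumeric value in a field expected to be numeric
    {"age": "51"},
]


def profile_field(records, field):
    """Count missing values and classify the rest as numeric or alphanumeric."""
    missing = 0
    kinds = Counter()
    for record in records:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing += 1
            continue
        try:
            float(value)
            kinds["numeric"] += 1
        except (TypeError, ValueError):
            kinds["alphanumeric"] += 1
    return {"missing": missing, "kinds": dict(kinds)}


report = profile_field(SAMPLE_RECORDS, "age")
```

A field that reports both numeric and alphanumeric values is a candidate for closer review before modeling.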

2.3.5 SURVEY AND VISUALIZE

• Leverage data visualization tools to gain an overview of the data

• Shneiderman’s mantra:

• “Overview first, zoom and filter, then details-on-demand”

• This enables the user to find areas of interest, zoom and filter to find more detailed information about a particular area, then find the detailed data in that area

2.3.5 SURVEY AND VISUALIZE – GUIDELINES AND CONSIDERATIONS

• Review data to ensure calculations are consistent

• Does the data distribution stay consistent?

• Assess the granularity of the data, the range of values, and the level of aggregation of the data

• Does the data represent the population of interest?

• Check time-related variables – daily, weekly, monthly? Is this good enough?

• Is the data standardized/normalized? Scales consistent?

• For geospatial datasets, are state/country abbreviations consistent?
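The "overview first, zoom and filter, then details" sequence and the abbreviation check can be tried without a plotting tool. The following is a rough sketch on hypothetical sales rows; the helper names and data are assumptions for illustration, and a real project would more likely use a visualization tool as the slides suggest.

```python
import statistics

# Hypothetical sales rows with inconsistent state labels.
SALES = [
    {"state": "CA", "amount": 120.0},
    {"state": "ca", "amount": 80.0},
    {"state": "NY", "amount": 200.0},
    {"state": "Calif.", "amount": 40.0},
]


def overview(rows, field):
    """Overview first: count, range, and mean of a numeric field."""
    values = [row[field] for row in rows]
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }


def zoom_and_filter(rows, predicate):
    """Zoom and filter: keep only the rows of interest (the details on demand)."""
    return [row for row in rows if predicate(row)]


def distinct_labels(rows, field):
    """List distinct labels so inconsistent abbreviations stand out."""
    return sorted({row[field] for row in rows})


summary = overview(SALES, "amount")
large_sales = zoom_and_filter(SALES, lambda row: row["amount"] >= 100.0)
labels = distinct_labels(SALES, "state")
```

Here `labels` exposes three spellings for the same state, the kind of inconsistency the last guideline asks about.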

2.3.6 COMMON TOOLS FOR DATA PREPARATION

• Hadoop can perform parallel ingest and analysis

• Alpine Miner provides a graphical user interface for creating analytic workflows

• OpenRefine (formerly Google Refine) is a free, open source tool for working with messy data

• Similar to OpenRefine, Data Wrangler is an interactive tool for data cleansing and transformation

2.4 PHASE 3: MODEL PLANNING

• Activities to consider

• Assess the structure of the data – this dictates the tools and analytic techniques for the next phase

• Ensure the analytic techniques enable the team to meet the business objectives and accept or reject the working hypotheses

• Determine if the situation warrants a single model or a series of techniques as part of a larger analytic workflow

• Research and understand how other analysts have approached this or a similar kind of problem

2.4 PHASE 3: MODEL PLANNING IN INDUSTRY VERTICALS

• Example of other analysts approaching a similar problem

2.4.1 DATA EXPLORATION AND VARIABLE SELECTION

• Explore the data to understand the relationships among the variables to inform selection of the variables and methods

• A common way to do this is to use data visualization tools

• Often, stakeholders and subject matter experts may have ideas

• For example, some hypothesis that led to the project

• Aim for capturing the most essential predictors and variables

• This often requires iterations and testing to identify key variables

• If the team plans to run regression analysis, identify the candidate predictors and outcome variables of the model
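One simple way to start exploring relationships between candidate predictors and an outcome is to rank the predictors by the strength of their linear correlation with it. The sketch below uses Pearson correlation as one possible screening step; the slides do not prescribe it, and the variable names and data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_predictors(candidates, outcome):
    """Sort candidate predictors by absolute correlation with the outcome."""
    scores = {name: pearson(values, outcome) for name, values in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical outcome and candidate predictors.
OUTCOME = [10, 20, 30, 40, 50]
CANDIDATES = {
    "ad_spend": [1, 2, 3, 4, 5],
    "store_id": [3, 1, 4, 1, 5],
}

ranking = rank_predictors(CANDIDATES, OUTCOME)
```

Correlation only captures linear relationships, so a low score is a prompt for further exploration rather than a reason to discard a variable outright.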

2.4.2 MODEL SELECTION

• The main goal is to choose an analytical technique, or several candidates, based on the end goal of the project

• We observe events in the real world and attempt to construct models that emulate this behavior with a set of rules and conditions

• A model is simply an abstraction from reality

• Determine whether to use techniques best suited for structured data, unstructured data, or a hybrid approach

• Teams often create initial models using statistical software packages such as R, SAS, or Matlab

• These packages may have limitations when applied to very large datasets

• The team moves to the model building phase once it has a good idea about the type of model to try

2.4.3 COMMON TOOLS FOR THE MODEL PLANNING PHASE

• R has a complete set of modeling capabilities

• R contains about 5000 packages for data analysis and graphical presentation

• SQL Analysis Services can perform in-database analytics of common data mining functions, involved aggregations, and basic predictive models

• SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data connections

2.5 PHASE 4: MODEL BUILDING

• Execute the models defined in Phase 3

• Develop datasets for training, testing, and production

• Develop the analytic model on the training data and evaluate it on the test data

• Questions to consider

• Does the model appear valid and accurate on the test data?

• Does the model output/behavior make sense to the domain experts?

• Do the parameter values make sense in the context of the domain?

• Is the model sufficiently accurate to meet the goal?

• Does the model avoid intolerable mistakes? (see Chapters 3 and 7)

• Are more data or inputs needed?

• Will the kind of model chosen support the runtime environment?

• Is a different form of the model required to address the business problem?
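The train-then-test workflow above can be sketched with a deliberately small example. This is an illustration under stated assumptions: the data are synthetic, the model is a one-variable least-squares line chosen only for brevity, and the helper names are hypothetical.

```python
import random


def train_test_split(rows, test_fraction, seed):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_line(rows):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in rows)
        / sum((x - mean_x) ** 2 for x, _ in rows)
    )
    return slope, mean_y - slope * mean_x


def rmse(model, rows):
    """Root mean squared error of the fitted line on the given rows."""
    slope, intercept = model
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in rows]
    return (sum(squared) / len(squared)) ** 0.5


# Synthetic, noise-free data following y = 2x + 1 (an assumption for the demo).
ROWS = [(x, 2 * x + 1) for x in range(20)]

train, test = train_test_split(ROWS, test_fraction=0.25, seed=0)
model = fit_line(train)
test_error = rmse(model, test)
```

Evaluating on rows the model never saw is what allows the team to answer whether the model appears valid and accurate on the test data.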

2.5.1 COMMON TOOLS FOR THE MODEL BUILDING PHASE

• Commercial Tools

• SAS Enterprise Miner – built for enterprise-level computing and analytics

• SPSS Modeler (IBM) – provides enterprise-level computing and analytics

• Matlab – high-level language for data analytics, algorithms, data exploration

• Alpine Miner – provides GUI frontend for backend analytics tools

• STATISTICA and MATHEMATICA – popular data mining and analytics tools

• Free or Open Source Tools

• R and PL/R - PL/R is a procedural language for PostgreSQL with R

• Octave – language for computational modeling

• WEKA – data mining software package with analytic workbench

• Python – language providing toolkits for machine learning and analysis

• SQL – in-database implementations provide an alternative tool (see Chap 11)

2.6 PHASE 5: COMMUNICATE RESULTS

• Determine if the team succeeded or failed in its objectives

• Assess if the results are statistically significant and valid

• If so, identify aspects of the results that present salient findings

• Identify surprising results and those in line with the hypotheses

• Communicate and document the key findings and major insights derived from the analysis

• This is the most visible portion of the process to the outside stakeholders and sponsors

2.7 PHASE 6: OPERATIONALIZE

• In this last phase, the team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way

• Risk is managed effectively by undertaking small scope, pilot deployment before a wide-scale rollout

• During the pilot project, the team may need to execute the algorithm more efficiently in the database rather than with in-memory tools like R, especially with larger datasets

• To test the model in a live setting, consider running the model in a production environment for a discrete set of products or a single line of business

• Monitor model accuracy and retrain the model if necessary
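Monitoring accuracy after deployment can be as simple as tracking a rolling window of recent predictions against observed outcomes. The class below is a hypothetical sketch; the window size and threshold are placeholder values that a team would choose for its own setting.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions (hypothetical helper)."""

    def __init__(self, window, threshold):
        self.outcomes = deque(maxlen=window)  # True where prediction was correct
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Only decide once a full window of observations is available.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold


monitor = AccuracyMonitor(window=4, threshold=0.75)
for predicted, actual in [(1, 1), (1, 1), (0, 1), (0, 1)]:
    monitor.record(predicted, actual)
retrain = monitor.needs_retraining()
```

Waiting for a full window avoids triggering a retrain on the first few observations of a pilot.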

2.7 PHASE 6: OPERATIONALIZE – KEY OUTPUTS FROM SUCCESSFUL ANALYTICS PROJECT

• Business user – tries to determine business benefits and implications

• Project sponsor – wants business impact, risks, ROI

• Project manager – needs to determine if project completed on time, within budget, goals met

• Business intelligence analyst – needs to know if reports and dashboards will be impacted and need to change

• Data engineer and DBA – must share code and documentation

• Data scientist – must share code and explain model to peers, managers, stakeholders

2.7 PHASE 6: OPERATIONALIZE – FOUR MAIN DELIVERABLES

• Although the seven roles represent many interests, the interests overlap and can be met with four main deliverables

1. Presentation for project sponsors – high-level takeaways for executive level stakeholders

2. Presentation for analysts – describes business process changes and reporting changes, includes details and technical graphs

3. Code for technical people

4. Technical specifications of implementing the code

2.8 CASE STUDY: GLOBAL INNOVATION NETWORK AND ANALYSIS (GINA)

• In 2012 EMC’s new director wanted to improve the company’s engagement of employees across the global centers of excellence (GCE) to drive innovation, research, and university partnerships

• This project was created to accomplish the following

• Store formal and informal data

• Track research from global technologists

• Mine the data for patterns and insights to improve the team’s operations and strategy

2.8.1 PHASE 1: DISCOVERY

• Team members and roles

• Business user, project sponsor, project manager – Vice President from Office of CTO

• BI analyst – person from IT

• Data engineer and DBA – people from IT

• Data scientist – distinguished engineer

• The data fell into two categories

• Five years of idea submissions from internal innovation contests

• Minutes and notes representing innovation and research activity from around the world

• Hypotheses grouped into two categories

• Descriptive analytics of what is happening to spark further creativity, collaboration, and asset generation

• Predictive analytics to advise executive management of where it should be investing in the future

2.8.2 PHASE 2: DATA PREPARATION

• Set up an analytics sandbox

• Discovered that certain data needed conditioning and normalization and that missing datasets were critical

• Team recognized that poor quality data could impact subsequent steps

• They discovered that many names were misspelled and that some entries had extra spaces

• These seemingly small problems had to be addressed
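The two problems named above, extra spaces and misspelled names, lend themselves to a small cleaning step. The sketch below is not the GINA team's method (the slides do not describe one); it is one possible approach using the standard library's `difflib`, with fictional names and an assumed similarity cutoff of 0.8.

```python
import difflib

# Fictional reference list of correctly spelled names.
KNOWN_NAMES = ["Jane Doe", "John Smith"]


def clean_name(raw):
    """Collapse runs of whitespace and strip leading or trailing spaces."""
    return " ".join(raw.split())


def canonicalize(raw, known_names, cutoff=0.8):
    """Map a possibly misspelled name to its closest known name, if any.

    Names with no sufficiently close match are returned cleaned but unchanged.
    """
    name = clean_name(raw)
    matches = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else name


fixed = canonicalize("  Jane   Deo ", KNOWN_NAMES)
```

Fuzzy matching can merge distinct people with similar names, so matches near the cutoff are worth a manual check.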

2.8.3 PHASE 3: MODEL PLANNING

• The study included the following considerations

• Identify the right milestones to achieve the goals

• Trace how people move ideas from each milestone toward the goal

• Track ideas that die and others that reach the goal

• Compare times and outcomes using a few different methods

2.8.4 PHASE 4: MODEL BUILDING

• Several analytic methods were employed

• NLP on textual descriptions

• Social network analysis using R and RStudio

• Developed social graphs and visualizations
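To give a feel for what a social graph measures, the sketch below computes degree centrality from a list of collaboration pairs. The case study used R and RStudio; this Python version is only an illustration, and the edge list and names are fictional.

```python
from collections import defaultdict

# Fictional collaboration pairs (for example, co-submitters of an idea).
EDGES = [
    ("ana", "ben"),
    ("ana", "chi"),
    ("ana", "dev"),
    ("ben", "chi"),
]


def degree_centrality(edges):
    """Fraction of the other nodes each node is directly connected to.

    Assumes an undirected graph with at least two distinct nodes.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    others = len(neighbors) - 1
    return {node: len(adjacent) / others for node, adjacent in neighbors.items()}


centrality = degree_centrality(EDGES)
most_connected = max(centrality, key=centrality.get)
```

A node with high centrality, such as `ana` here, is a candidate influencer in the sense used by the social graphs in this phase.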

2.8.4 PHASE 4: MODEL BUILDING – SOCIAL GRAPH OF DATA SUBMITTERS AND FINALISTS

2.8.4 PHASE 4: MODEL BUILDING – SOCIAL GRAPH OF TOP INNOVATION INFLUENCERS

2.8.5 PHASE 5: COMMUNICATE RESULTS

• Study was successful in identifying hidden innovators

• Found high density of innovators in Cork, Ireland

• The CTO office launched longitudinal studies

2.8.6 PHASE 6: OPERATIONALIZE

• Deployment was not really discussed

• Key findings

• Need more data in future

• Some data were sensitive

• A parallel initiative needs to be created to improve basic BI activities

• A mechanism is needed to continually reevaluate the model after deployment

SUMMARY

• The Data Analytics Lifecycle is an approach to managing and executing analytic projects

• Lifecycle has six phases

• Bulk of the time usually spent on preparation – phases 1 and 2

• Seven roles needed for a data science team

• Review the exercises

FOCUS OF COURSE

• Focus on quantitative disciplines – e.g., math, statistics, machine learning

• Provide overview of Big Data analytics

• In-depth study of several key algorithms
