Data Analytics Lifecycle
Upon completion of this lesson, you should be able to:
• Apply the Data Analytics Lifecycle to a case study scenario
• Frame a business problem as an analytics problem
• Identify the four main deliverables in an analytics project
Module 2: Data Analytics Lifecycle
Copyright © 2014 EMC Corporation. All Rights Reserved.
How to Approach Your Analytics Problems
• How do you currently approach your analytics problems?
• Do you follow a methodology or some kind of framework?
• How do you plan for an analytic project?
Your Thoughts?
Value of Using the Data Analytics Lifecycle
• Focus your time
• Ensure rigor and completeness
• Enable better transition to members of the cross-functional analytic teams
    • Repeatable
    • Scale to additional analysts
    • Support validity of findings
“A journey of a thousand miles begins with a single step” (Lao Tzu)
Need for a Process to Guide Data Science Projects
1. Well-defined processes can help guide any analytic project
2. Focus of Data Analytics Lifecycle is on Data Science projects, not business intelligence
3. Data Science projects tend to require a more consultative approach, and differ in a few ways
    • More due diligence in Discovery phase
    • More projects which lack shape or structure
    • Less predictable data
Key Roles for a Successful Analytic Project
Role | Description
Business User | Someone who benefits from the end results and can consult and advise the project team on the value of the end results and how these will be operationalized
Project Sponsor | Person responsible for the genesis of the project, providing the impetus for the project and the core business problem; generally provides the funding and will gauge the degree of value from the final outputs of the working team
Project Manager | Ensures key milestones and objectives are met on time and at expected quality
Business Intelligence Analyst | Business domain expertise with deep understanding of the data, KPIs, key metrics and business intelligence from a reporting perspective
Data Engineer | Deep technical skills to assist with tuning SQL queries for data management and extraction, and to support data ingest to the analytic sandbox
Database Administrator (DBA) | Provisions and configures the database environment to support the analytical needs of the working team
Data Scientist | Provides subject matter expertise for analytical techniques and data modeling, applying valid analytical techniques to given business problems and ensuring overall analytical objectives are met
Data Analytics Lifecycle
[Figure: the six phases of the Data Analytics Lifecycle, with the question asked before moving on from each of the first four phases]
1. Discovery: Do I have enough information to draft an analytic plan and share for peer review?
2. Data Prep: Do I have enough good quality data to start building the model?
3. Model Planning: Do I have a good idea about the type of model to try? Can I refine the analytic plan?
4. Model Building: Is the model robust enough? Have we failed for sure?
5. Communicate Results
6. Operationalize
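The lifecycle can be read as a gated sequence of phases. The sketch below is one way to express that reading; the `Phase` enum, the `GATE_QUESTIONS` mapping, and `next_phase` are illustrative names, and the rule that a failed gate keeps the team in the current phase (or returns it to a chosen earlier one) is an assumption rather than something the slide specifies.

```python
from enum import IntEnum


class Phase(IntEnum):
    DISCOVERY = 1
    DATA_PREP = 2
    MODEL_PLANNING = 3
    MODEL_BUILDING = 4
    COMMUNICATE_RESULTS = 5
    OPERATIONALIZE = 6


# Questions taken from the lifecycle figure; only the first four phases have one.
GATE_QUESTIONS = {
    Phase.DISCOVERY: "Do I have enough information to draft an analytic plan "
                     "and share for peer review?",
    Phase.DATA_PREP: "Do I have enough good quality data to start building the model?",
    Phase.MODEL_PLANNING: "Do I have a good idea about the type of model to try? "
                          "Can I refine the analytic plan?",
    Phase.MODEL_BUILDING: "Is the model robust enough? Have we failed for sure?",
}


def next_phase(current, gate_passed, fallback=None):
    """Advance one phase when the gate is passed.

    Assumption: when the gate is not passed, the team stays in the current
    phase unless an earlier phase is given as `fallback`.
    """
    if gate_passed:
        return Phase(min(current + 1, Phase.OPERATIONALIZE))
    return current if fallback is None else fallback


if __name__ == "__main__":
    phase = Phase.DISCOVERY
    print(GATE_QUESTIONS[phase])
    print(next_phase(phase, gate_passed=True).name)
```

Passing `fallback=Phase.DISCOVERY` from a later phase models the case where poor data or a failed model sends the team back to reframe the problem.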
Data Analytics Lifecycle Phase 1: Discovery
• Learn the Business Domain
    • Determine amount of domain knowledge needed to orient you to the data and interpret results downstream
    • Determine the general analytic problem type (such as clustering, classification)
    • If you don’t know, then conduct initial research to learn about the domain area you’ll be analyzing
• Learn from the past
    • Have there been previous attempts in the organization to solve this problem?
    • If so, why did they fail? Why are we trying again? How have things changed?
Data Analytics Lifecycle Phase 1: Discovery
• Resources
    • Assess available technology
    • Available data: is it sufficient to meet your needs?
    • People for the working team
    • Assess scope of time for the project in calendar time and person-hours
    • Do you have sufficient resources to attempt the project? If not, can you get more?
Data Analytics Lifecycle Phase 1: Discovery
• Frame the problem… Framing is the process of stating the analytics problem to be solved
    • State the analytics problem, why it is important, and to whom
    • Identify key stakeholders and their interests in the project
    • Clearly articulate the current situation and pain points
    • Objectives – identify what needs to be achieved in business terms and what needs to be done to meet the needs
    • What is the goal? What are the criteria for success? What’s “good enough”?
    • What is the failure criterion (when do we just stop trying or settle for what we have)?
    • Identify the success criteria, key risks, and stakeholders (such as RACI)
Tips for Interviewing the Analytics Sponsor
• Even if you are “given” an analytic problem, you should work with clients to clarify and frame the problem
    • You’re typically handed solutions; you need to identify the problem and their desired outcome
Sponsor Interview Tips
• Prepare for the interview – draft your questions, review with colleague, team
• Use open-ended questions, don’t ask leading questions
• Probe for details, follow up
• Don’t fill every silence – give them time to think
• Let them express their ideas, don’t put words in their mouth, let them share their feelings
• Ask clarifying questions, ask why – is that correct? Am I on target? Is there anything else?
• Use active listening – repeat it back to make sure you heard it correctly
• Don’t express your opinions
• Be mindful of your body language and theirs – use eye contact, be attentive
• Minimize distractions
• Document what you heard and review it back with the sponsor
Tips for Interviewing the Analytics Sponsor
Interview Questions
• What is the business problem you’re trying to solve?
• What is your desired outcome?
• Will the focus and scope of the problem change if the following dimensions change:
    • Time – analyzing 1 year or 10 years’ worth of data?
    • People – how would this project change this?
    • Risk – conservative to aggressive
    • Resources – none to unlimited (tools, tech, …)
    • Size and attributes of Data
• What data sources do you have?
• What industry issues may impact the analysis?
• What timelines are you up against?
• Who could provide insight into the project? Consulted?
• Who has final say on the project?
Data Analytics Lifecycle Phase 1: Discovery
• Formulate Initial Hypotheses
    • IH, H1, H2, H3, … Hn
    • Gather and assess hypotheses from stakeholders and domain experts
    • Preliminary data exploration to inform discussions with stakeholders during the hypothesis forming stage
• Identify Data Sources – Begin Learning the Data
    • Aggregate sources for previewing the data and provide high-level understanding
    • Review the raw data
    • Determine the structures and tools needed
    • Scope the kind of data needed for this kind of problem
Using a Sample Case Study to Track the Phases in the Data Analytics Lifecycle
Mini Case Study: Churn Prediction for Yoyodyne Bank
Situation Synopsis
• A retail bank, Yoyodyne Bank, wants to improve the Net Present Value (NPV) and retention rate of customers
• They want to establish an effective marketing campaign targeting customers to reduce the churn rate by at least five percent
• The bank wants to determine whether those customers are worth retaining. In addition, the bank also wants to analyze reasons for customer attrition and what they can do to keep them
• The bank wants to build a data warehouse to support Marketing and other related customer care groups
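The five percent churn target can be read two ways, and the synopsis does not say which is meant. The sketch below, with hypothetical customer counts and illustrative function names, shows how the target churn rate differs under a relative reading and a percentage-point reading. Settling this during Discovery is part of defining the success criteria.

```python
def churn_rate(customers_at_start, customers_lost):
    """Fraction of the starting customer base lost during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def relative_target(baseline, reduction=0.05):
    """Target if 'five percent' means 5% of the current churn rate."""
    return baseline * (1.0 - reduction)


def absolute_target(baseline, points=0.05):
    """Target if 'five percent' means five percentage points."""
    return max(baseline - points, 0.0)


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 customers at the start, 2,000 lost.
    baseline = churn_rate(10_000, 2_000)
    print(f"baseline churn rate: {baseline:.3f}")
    print(f"relative reading:    {relative_target(baseline):.3f}")
    print(f"absolute reading:    {absolute_target(baseline):.3f}")
```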
How to Frame an Analytics Problem
Mini Case Study: Churn Prediction for Yoyodyne Bank
Sample Business Problems
• How can we improve on x?
• What’s happening real-time? Trends?
• How can we use analytics to differentiate ourselves?
• How can we use analytics to innovate?
• How can we stay ahead of our biggest competitor?
Qualifiers
Will the focus and scope of the problem change if the following dimensions change:
• Time
• People – how would x change this?
• Risk – conservative/aggressive
• Resources – none/unlimited
• Size of Data?
Analytical Approach
• Define an analytical approach, including key terms, metrics, and data needed.
Yoyodyne Bank
Business Problem
• How can we improve Net Present Value (NPV) and retention rate of the customers?
Qualifiers
• Time: Trailing 5 months
• People: Working team and business users from the Bank
• Risk: the project will fail if we cannot determine valid predictors of churn
• Resources: EDW, analytic sandbox, OLTP system
• Data: Use 24 months for the training set, then analyze 5 months of historical data for those customers who churned
Analytical Approach
• How do we identify churn/no churn for a customer?
• Pilot study followed by a full-scale analytical model
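The Yoyodyne data qualifier implies two time windows. As a sketch, assuming the 24-month training window immediately precedes the trailing 5-month analysis window (the slide does not state how the windows are aligned), the split could look like this. The record layout, the `as_of` cutoff, and the function names are hypothetical.

```python
from datetime import date


def months_before(d, months):
    """First day of the month that lies `months` months before d's month."""
    index = d.year * 12 + (d.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def split_windows(records, as_of, train_months=24, analysis_months=5):
    """Split records into a training window and a trailing analysis window.

    `as_of` is expected to be the first day of a month; each record is a
    dict with a "date" key. Windows are half-open: [start, end).
    """
    analysis_start = months_before(as_of, analysis_months)
    train_start = months_before(analysis_start, train_months)
    train = [r for r in records if train_start <= r["date"] < analysis_start]
    analysis = [r for r in records if analysis_start <= r["date"] < as_of]
    return train, analysis


if __name__ == "__main__":
    # Hypothetical transaction records.
    records = [
        {"customer": "a", "date": date(2012, 3, 15)},
        {"customer": "b", "date": date(2014, 2, 1)},
    ]
    train, analysis = split_windows(records, as_of=date(2014, 6, 1))
    print(len(train), len(analysis))
```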
Data Analytics Lifecycle Phase 2: Data Preparation
• Prepare Analytic Sandbox
    • Work space for the analytic team
    • 10x+ vs. EDW
• Perform ELT
    • Determine needed transformations
    • Assess data quality and structuring
    • Derive statistically useful measures
    • Determine and establish data connections for raw data
    • Execute Big ELT and/or Big ETL
• Useful Tools for this phase:
    • For Data Transformation & Cleansing: SQL, Hadoop, MapReduce, Alpine Miner
Data Analytics Lifecycle Phase 2: Data Preparation
• Familiarize yourself with the data thoroughly
    • List your data sources
    • What’s needed vs. what’s available
• Data Conditioning
    • Clean and normalize data
    • Discern what you keep vs. what you discard
• Survey & Visualize
    • Overview, zoom & filter, details-on-demand
    • Descriptive Statistics
    • Data Quality
• Useful Tools for this phase:
    • Descriptive Statistics on candidate variables for diagnostics & quality
    • Visualization: R (base package, ggplot and lattice), GnuPlot, Ggobi/Rggobi, Spotfire, Tableau
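A minimal sketch of the descriptive statistics and normalization steps listed above, using only the Python standard library. Min-max scaling is one common normalization choice, picked here for illustration; the slide does not name a specific method, and the sample values are made up.

```python
import statistics


def describe(values):
    """Summary statistics for one candidate variable; None marks a missing value.

    Raises statistics.StatisticsError if every value is missing.
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present) if len(present) > 1 else 0.0,
        "min": min(present),
        "max": max(present),
    }


def min_max_normalize(values):
    """Rescale present values to [0, 1]; missing values stay None."""
    present = [v for v in values if v is not None]
    low, high = min(present), max(present)
    span = high - low
    return [
        None if v is None else (0.0 if span == 0 else (v - low) / span)
        for v in values
    ]


if __name__ == "__main__":
    # Hypothetical monthly transaction counts for four customers.
    monthly_transactions = [10, 20, None, 40]
    print(describe(monthly_transactions))
    print(min_max_normalize(monthly_transactions))
```

The `missing` count is a quick data quality signal: a variable with many missing values is a candidate for the "what you discard" pile.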
Data Analytics Lifecycle Phase 3: Model Planning
• Determine Methods
    • Select methods based on hypotheses, data structure and volume
    • Ensure techniques and approach will meet business objectives
• Techniques & Workflow
    • Candidate tests and sequence
    • Identify and document modeling assumptions
• Useful Tools for this phase: R/PostgreSQL, SQL Analytics, Alpine Miner, SAS/ACCESS, SPSS/ODBC
Data Analytics Lifecycle Phase 3: Model Planning
• Data Exploration
• Variable Selection
    • Inputs from stakeholders and domain experts
    • Capture essence of the predictors, leverage a technique for dimensionality reduction
    • Iterative testing to confirm the most significant variables
• Model Selection
    • Conversion to SQL or database language for best performance
    • Choose technique based on the end goal
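One simple way to start the iterative variable screening described above is to rank candidate variables by the absolute Pearson correlation with the churn flag. This is a stand-in chosen for brevity, not the course's prescribed technique, and the variable names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_variables(columns, target):
    """Return (name, |correlation|) pairs, strongest association first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical candidate variables for four customers; churned is 0 or 1.
    churned = [0, 0, 1, 1]
    candidates = {
        "complaints": [0, 1, 3, 4],
        "branch_visits": [2, 1, 2, 1],
        "account_type_code": [5, 5, 5, 5],
    }
    for name, score in rank_variables(candidates, churned):
        print(f"{name}: {score:.3f}")
```

A ranking like this only captures linear, one-variable-at-a-time association, so it complements rather than replaces input from stakeholders and domain experts.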
Sample Research: Churn Prediction in Other Verticals
Mini Case Study: Churn Prediction for Yoyodyne Bank
Market Sector | Analytic Techniques/Methods Used
Wireless Telecom | DMEL method (data mining by evolutionary learning)
Retail Business | Logistic regression, ARD (automatic relevance determination), decision tree
Daily Grocery | MLR (multiple linear regression), ARD, and decision tree
Wireless Telecom | Neural network, decision tree, hierarchical neurofuzzy systems, rule evolver
Retail Banking | Multiple regression
Wireless Telecom | Logistic regression, neural network, decision tree
• After conducting research on churn prediction, you have identified many methods for analyzing customer churn across multiple verticals (those in bold are taught in this course)
• At this point, a Data Scientist would assess the methods and select the best model for the situation
Data Analytics Lifecycle Phase 4: Model Building
• Develop data sets for testing, training, and production purposes
    • Need to ensure that the model data is sufficiently robust for the model and analytical techniques
    • Smaller test sets for validating approach, training set for initial experiments
• Get the best environment you can for building models and workflows… fast hardware, parallel processing
• Useful Tools for this phase: R, PL/R, SQL, Alpine Miner, SAS Enterprise Miner
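The sketch below illustrates holding out a smaller test set and fitting a model on the training set. Logistic regression is used because it appears among the churn techniques surveyed for other verticals; the gradient-descent fit, the toy data, and all names are illustrative assumptions rather than the course's implementation.

```python
import math
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out a smaller test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, weights, bias):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fit_logistic(rows, epochs=200, learning_rate=0.5):
    """Batch gradient descent on the mean log-loss; rows are (features, label)."""
    n_features = len(rows[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in rows:
            error = predict_proba(x, weights, bias) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


def accuracy(rows, weights, bias):
    hits = sum((predict_proba(x, weights, bias) >= 0.5) == bool(y) for x, y in rows)
    return hits / len(rows)


# Hypothetical, already standardized feature (for example, complaint count
# relative to the average); label 1 means the customer churned.
data = [([-1.0 - 0.05 * k], 0) for k in range(20)]
data += [([1.0 + 0.05 * k], 1) for k in range(20)]
train, test = train_test_split(data)
weights, bias = fit_logistic(train)

if __name__ == "__main__":
    print("test accuracy:", accuracy(test, weights, bias))
```

The toy data is cleanly separable, so it says nothing about real churn data; the point is the workflow of fitting on one set and validating on another.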
Data Analytics Lifecycle Phase 5: Communicate Results
Did we succeed? Did we fail?
• Interpret the results
• Compare to the initial hypotheses (IHs) from Phase 1
• Identify key findings
• Quantify business value
• Summarize findings, depending on audience
Mini Case Study: Churn Prediction for Yoyodyne Bank
For the Yoyodyne case study, what would be some possible results and key findings?
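Quantifying business value for the case study might combine retention with NPV. The sketch below uses the standard discounted cash flow formula, assuming cash flows arrive at the end of each period; the cash flows, discount rate, customer count, and campaign cost are all hypothetical, as are the function names.

```python
def npv(cash_flows, rate):
    """Net present value of end-of-period cash flows at a per-period discount rate."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows, start=1))


def campaign_value(customers_retained, cash_flows_per_customer, rate, campaign_cost):
    """NPV of the retained customers' future cash flows minus the campaign cost."""
    return customers_retained * npv(cash_flows_per_customer, rate) - campaign_cost


# Hypothetical inputs: 500 customers retained, each worth 100 per year for
# two years, discounted at 10% per year, against a campaign cost of 20,000.
value = campaign_value(500, [100.0, 100.0], 0.10, 20_000.0)

if __name__ == "__main__":
    print(f"estimated campaign value: {value:,.2f}")
```

A figure like this gives the sponsor a single number to weigh against the success criteria set in Discovery.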
Data Analytics Lifecycle Phase 6: Operationalize
• Run a pilot
• Assess the benefits
• Provide final deliverables
• Implement the model in the production environment
• Define process to update, retrain, and retire the model, as needed
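The update, retrain, and retire process can be made explicit as a rule. This sketch assumes model quality is tracked as accuracy on recent data; the thresholds and the `model_action` name are hypothetical placeholders that a team would set for itself.

```python
def model_action(baseline_accuracy, recent_accuracy,
                 retrain_drop=0.05, retire_floor=0.60):
    """Decide what to do with a production model.

    Hypothetical policy: retire below an absolute floor, retrain when
    accuracy has dropped by more than `retrain_drop` from the baseline,
    otherwise keep the model as it is.
    """
    if recent_accuracy < retire_floor:
        return "retire"
    if baseline_accuracy - recent_accuracy > retrain_drop:
        return "retrain"
    return "keep"


if __name__ == "__main__":
    # Hypothetical monitoring readings against a baseline accuracy of 0.85.
    for recent in (0.84, 0.75, 0.55):
        print(recent, model_action(0.85, recent))
```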
Analytic Plan
Components of Analytic Plan | Retail Banking: Yoyodyne Bank
Phase 1: Discovery Business Problem Framed | How do we identify churn/no churn for a customer?
Initial Hypotheses | Transaction volum