1.Saskatoon SAS user group
Efficiency and data mining?
2.Agenda
Background
Case Study
4.Uses a variety of tools
Data Scientist
Business Analyst
Heavy Excel user
IT Management
Executive
Consistent answers
Predictive Analytics…Data science…Statistics…Machine Learning…Data mining
It means different things to different people?
Show me the easy button
Show me the power
How do we manage this?
So what?
Tries to avoid next migraine
5.The Data Mining Process
This process is your friend. Use it. Iterate. Fail fast.
SEMMA
Process
CRISP-DM
Methodology
CRISP-DM is a good methodology.
SEMMA (Sample, Explore, Modify, Model, Assess) is the process built into Enterprise Miner. It aligns well with CRISP-DM.
6.Building a predictive model
3 Approaches
Rapid Predictive Modeler (RPM)
Preconfigured Enterprise Miner workflow in Enterprise Guide
Easy
Quick
Good models
Auditable and reusable
Enterprise Miner
Visual workflows
Powerful
Medium difficulty
Great models
Auditable and reusable
Programming
Difficult to learn
Some Data Scientists prefer this
Not suitable for the business analyst
7.The Data Mining Process
How to add efficiency
Use visualization early in the process
Don’t be afraid to build models. Start with RPM.
Fail fast
Understand the problem
Understand the data
8.Agenda
Background
Case Study
9.The Data Mining Process
Case study
We have a problem!
Use actionable, in-memory, big-data, cloud, machine-learning, analytics to fix it
You mean use predictive modeling to find the trucks that are going to blow up?
Last time it was altitude related
10.
40 000 vehicles – Fleet is ageing
Trucks are equipped with Telematics
The data scientist is on vacation
Dataset = 1.5 GB (2M rows)! My spreadsheet won’t open it...
Business Analyst
Data Scientist
11.Case study
What I am going to show you
Demo 1
Visual exploration of timeline
Cluster analysis
Use visualization early in the process to formulate a strategy
12.Case study
What I am going to show you
Demo 2
Feature engineering
2-minute model: Rapid Predictive Modeler
Enterprise model: Enterprise Miner
Don’t be afraid to model
13.Case study
What I am going to show you
Demo 3
Create score code
Geospatial representation of scored data
This is how we derive value from the model
14.Sample & Explore Data
Missing data is a landmine. Identify and remediate.
Visualize - Reconstruct a timeline
Explore before subsetting or filtering
Demo 1
Visual exploration of timeline
Cluster Analysis
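The "identify" half of the missing-data advice can be sketched outside the SAS tools in a few lines. This is only an illustration: the column names and rows below are made up, and `missing_share` is a hypothetical helper, not part of any SAS product.

```python
from collections import Counter

# Hypothetical telematics rows; None marks a missing reading.
rows = [
    {"truck_id": "T1", "altitude": 520.0, "oil_temp": 96.0},
    {"truck_id": "T2", "altitude": None, "oil_temp": 101.5},
    {"truck_id": "T3", "altitude": 1480.0, "oil_temp": None},
    {"truck_id": "T4", "altitude": None, "oil_temp": 99.0},
]


def missing_share(rows):
    """Return the fraction of missing (None) values for each column."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                missing[column] += 1
    # Assumes every row has the same columns as the first one.
    return {column: missing[column] / len(rows) for column in rows[0]}


report = missing_share(rows)
for column, share in report.items():
    print(f"{column}: {share:.0%} missing")
```

Columns with a high share are the ones to remediate (impute, drop, or trace back to the source) before any subsetting or modeling.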
17.Sample & Explore Data
Now that I understand the data, I have a plan
Sample only Alternator faults
Focus on recent data.
Using all the history may pollute my model
Cluster Analysis in Visual Analytics
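The sampling plan above (alternator faults only, recent data only) amounts to a simple filter. The sketch below assumes a hypothetical event layout with `component` and `event_date` fields and an arbitrary 365-day window; the actual telematics schema and cut-off are not given in the slides.

```python
from datetime import date, timedelta


def select_training_rows(events, component, as_of, window_days):
    """Keep only events for one component that fall inside the recent window."""
    cutoff = as_of - timedelta(days=window_days)
    return [
        event
        for event in events
        if event["component"] == component and event["event_date"] >= cutoff
    ]


# Hypothetical fault events.
events = [
    {"truck_id": "T1", "component": "alternator", "event_date": date(2016, 9, 1)},
    {"truck_id": "T2", "component": "alternator", "event_date": date(2014, 3, 15)},
    {"truck_id": "T3", "component": "brakes", "event_date": date(2016, 8, 20)},
]

recent = select_training_rows(
    events, "alternator", as_of=date(2016, 10, 1), window_days=365
)
print([event["truck_id"] for event in recent])
```

Here the old alternator fault and the brake fault both drop out, so only the recent alternator event remains for modeling.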
18.Modify, Model, Assess
Use Rapid Predictive Modeler to fail fast
Look at the variable importance chart
Engineer features into the data
Mitigate the risk of overfitting – (holdouts, model selection criteria)
Demo 2
Feature engineering
RPM Advanced
EM Model
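A holdout is the simplest of the overfitting safeguards mentioned above: set part of the data aside, fit on the rest, and judge the model only on the part it never saw. The sketch below is a generic illustration, not what RPM or Enterprise Miner does internally; the 30% fraction and the seed are arbitrary assumptions.

```python
import random


def holdout_split(rows, holdout_fraction=0.3, seed=42):
    """Shuffle the rows reproducibly and return (training, holdout) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


train, holdout = holdout_split(range(10))
print(len(train), "training rows,", len(holdout), "holdout rows")
```

A model that scores much better on `train` than on `holdout` is a candidate for overfitting, which is a cue to simplify it or revisit the engineered features.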
19.Modify Data
Engineered Features
Binning into deciles
Altitude
Engine hours
Years in service
Odometer mileage
Oil temp
Water temp
Computed variables
RPM
Days since service origin
Water temp * Oil temp
Binning into quartiles
Speed
RPM
Water temp * Oil temp
Days since service origin
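The engineered features listed above can be illustrated in plain Python. This is a sketch under assumptions: the field names (`service_origin`, `reading_date`, `water_temp`, `oil_temp`) and the sample values are hypothetical, and the rank-based binning shown here is one common way to form equal-count bins, not necessarily the method the SAS tools use.

```python
from datetime import date


def quantile_bins(values, n_bins):
    """Assign each value a bin from 1..n_bins by rank, giving equal-count bins.

    Ties are broken by position, so equal values may land in adjacent bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values) + 1
    return bins


def add_computed(row):
    """Return a copy of the row with the two computed variables added."""
    enriched = dict(row)
    enriched["days_since_service_origin"] = (
        row["reading_date"] - row["service_origin"]
    ).days
    enriched["water_x_oil_temp"] = row["water_temp"] * row["oil_temp"]
    return enriched


# Hypothetical readings.
altitude = [120, 1500, 640, 90, 2200, 310, 980, 45, 1750, 410]
speed = [0, 88, 54, 102, 67, 35, 95, 12]

altitude_decile = quantile_bins(altitude, 10)
speed_quartile = quantile_bins(speed, 4)

reading = add_computed({
    "service_origin": date(2010, 1, 1),
    "reading_date": date(2010, 1, 31),
    "water_temp": 90.0,
    "oil_temp": 100.0,
})
print(altitude_decile, speed_quartile, reading["days_since_service_origin"])
```

Binning trades precision for robustness: a model sees "top decile for altitude" rather than a raw reading, which can make thresholds such as the altitude-related failures easier to pick up.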
20.Modify, Model, Assess
We improve the model by iterating
21.Pre-release version of SAS Visual Data Mining and Machine Learning
22.Deploy
How will the model output be used by someone who knows nothing about data science?
Score code is useful. A model is not.
Visualize the output
Demo 3
Create score code
Geospatial representation of scored data
23.Deploy
Out of a truck fleet of 2000+
72 have fault codes on alternators
12 are prioritized for maintenance based on the prediction
This is where they are
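Going from flagged trucks to a short maintenance list is a score-then-rank step. The sketch below stands in for real score code: the logistic form, the intercept, the weights, and the input fields are all hypothetical. Actual score code would be exported from the modeling tool rather than written by hand.

```python
import math

# Hypothetical coefficients, for illustration only.
INTERCEPT = -4.0
WEIGHTS = {"altitude_decile": 0.25, "years_in_service": 0.30}


def score(row):
    """Return a predicted failure probability between 0 and 1."""
    z = INTERCEPT + sum(weight * row[name] for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def prioritize(trucks, top_n):
    """Return the top_n trucks with the highest predicted probability."""
    return sorted(trucks, key=score, reverse=True)[:top_n]


# Hypothetical trucks that already show an alternator fault code.
trucks = [
    {"truck_id": "T1", "altitude_decile": 2, "years_in_service": 3},
    {"truck_id": "T2", "altitude_decile": 9, "years_in_service": 12},
    {"truck_id": "T3", "altitude_decile": 6, "years_in_service": 8},
]

shortlist = [truck["truck_id"] for truck in prioritize(trucks, top_n=2)]
print(shortlist)
```

The same pattern scales to the case in the slides: score the trucks with alternator fault codes, keep the highest-risk ones, and plot that shortlist on a map.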
24.The Data Mining Process
How to add efficiency
Use visualization early in the process
Don’t be afraid to build models. It is easy: start with RPM.
Fail fast