The Session
Rosaria Silipo, Phil Winters (KNIME)
2016 KNIME.com AG. All Rights Reserved.
Past KNIME Summits: Merging Techniques, Data and MUSIC!
Analytics, Machine Learning, Data Science, Data Mining, Predictive Analytics (Big Data)
"Any sufficiently advanced technology is indistinguishable from magic." Arthur C. Clarke, 1973
- Trend 1: More Magicians!
- Trend 2: Power to the People!
"Data Scientist: The Sexiest Job of the 21st Century" (Harvard Business Review)
DataHookup Pair_ship
Guided Analytics
Power to the People
Rosaria Silipo, Phil Winters, Christian Albrecht (KNIME)
Agenda
- Power to the People: 4 approaches
- Guided Analytics: The User Perspective
- Guided Analytics: The Platform
- Summary, Thoughts and Next Actions
Power to the People: 4 approaches
- Generic Black Box Machine Learning
- Citizen Data Scientists
- Analytic Cheat Sheets
- Guided Analytics

Critical Capabilities                      Citizen Data Scientists
Data Access                                10%
Data Preparation and Exploration           22%
Advanced Modelling                          5%
Visual Composition Framework (VCF)         22%
Automation                                  1%
Delivery, Integration & Deployment          1%
Platform and Project Management             1%
Performance and Scalability                 1%
User Experience                            22%
Collaboration                               1%
Leverage and Productivity                  14%
Total                                     100%
Guided Analytics: Automate Understand
The Business Issue: Product Upsell by a Campaign Manager
- Lawyer's Insurance: a successful product
- Content marketing is key: right message, right person
  - Young men: insurance for those things that happen (car, rent, purchase); discount sensitive!
  - Family-age women: protection for your family and children; not discount sensitive!
  - Older adults: complaints, purchase protection, contracts; not discount sensitive
- A field in the Campaign Management system is needed to indicate whether a customer is likely to buy Lawyer's Insurance
- High-likelihood individuals should be targeted with an offer
- Taking into account that each target group should be created around those demographics!
The Data: Classic Marketing Data!
- Demographics
- Information about previous product purchases, including whether the target product has been purchased or not
- Information about channel activity with the organization
- Some social media data
- Information about the value to the company
Goals and Requirements
- CRM data => upselling of a Lawyer Insurance
- Calculate propensity to buy a Lawyer Insurance product
- Cluster customers into demographic groups
- Interactive analytics process
Upload Data -> Check Data Quality -> Cleaning & Preprocessing -> Clustering -> Refine Clustering -> Classification -> Scoring
The Analytics Process
Upload Data -> Check Data Quality -> Cleaning & Preprocessing -> Clustering -> Refine Clustering -> Classification -> Scoring
- Data quality and cleaning: X-validation error, ratio std dev/mean, missing values, outliers, low variance, zero skewness, high correlation
- Clustering: 3 clusters on demographic features (gender, income, age)
- Refine clustering: explore clusters; if necessary, split one existing cluster into 3 sub-clusters
- Classification: dedicated classifier (linear regression) for each cluster and sub-cluster
- Scoring: evaluate overall accuracy
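A few of the data quality criteria named on this slide (missing values, ratio std dev/mean, low variance, high correlation) can be illustrated outside KNIME. The following is a minimal sketch in plain Python, not the workflow shown in the talk: the function names, the sample table and all cut-off values (0.4, 0.01, 0.95) are assumptions chosen for illustration.

```python
import math
import statistics


def column_quality(values):
    """Quality metrics for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    missing_ratio = 1.0 - len(present) / len(values) if values else 0.0
    mean = statistics.fmean(present) if present else math.nan
    std = statistics.pstdev(present) if present else math.nan
    # Ratio std dev / mean; left undefined (NaN) when the mean is zero.
    ratio = std / abs(mean) if present and mean != 0 else math.nan
    return {"missing_ratio": missing_ratio, "std": std, "std_over_mean": ratio}


def pearson(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        return math.nan
    mx = statistics.fmean([x for x, _ in pairs])
    my = statistics.fmean([y for _, y in pairs])
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return math.nan  # constant column: correlation is undefined
    return sxy / math.sqrt(sxx * syy)


def flag_columns(table, max_missing=0.4, min_ratio=0.01, max_corr=0.95):
    """Map column name -> list of reasons it may deserve a closer look.

    The three cut-offs are illustrative assumptions, not values from the talk.
    """
    flags = {name: [] for name in table}
    for name, values in table.items():
        q = column_quality(values)
        if q["missing_ratio"] > max_missing:
            flags[name].append("too many missing values")
        if q["std"] == 0 or q["std_over_mean"] < min_ratio:
            flags[name].append("low variance")
    names = list(table)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(table[a], table[b])
            if not math.isnan(r) and abs(r) > max_corr:
                flags[b].append(f"highly correlated with {a}")
    return flags


# Made-up example table (column name -> values).
table = {
    "age": [25, 40, 60, 35],
    "age_months": [300, 480, 720, 420],
    "constant": [1, 1, 1, 1],
    "sparse": [None, None, None, 5],
}
flags = flag_columns(table)
```

In an interactive setting the flags would be shown to the user, who then decides which columns to drop; the sketch only reports reasons and removes nothing.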
0.996
Generic Black Box Analytics
Threshold set to 0.5
Summary: the Analytics Process
Summary: Overall Workflow
1. Upload file and check data quality
2. Interactive pre-processing
3. Clustering and cluster refinement
4. Linear regression and threshold-based decision
Accuracy evaluation
Loop till you are satisfied with total accuracy value
1. Upload File and Check Data Quality
1. Upload the RIGHT Data File!
1. File Upload Wrapped Node
1. File Correct? Wrapped Node (HTML)
1. Wrapped Node Description
1. Data Set Quality
2. Interactive Pre-processing
2. Column Cleaning by Missing Values
2. Outlier Removal
2. Column Cleaning by
2. Column Cleaning by Sorting Views on a Grid through JSON
3. Clustering and Cluster Refinement
K-Means: 3 clusters on age, income, gender
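To make the clustering step concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on age, income and gender in plain Python. It is not the KNIME node used in the talk. The customer rows are made up, gender is encoded as 0/1, features are min-max scaled so that income does not dominate the distance, and the starting centroids are chosen by hand; all of these are assumptions for illustration.

```python
import math


def min_max_scale(rows):
    """Scale every column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def kmeans(points, seeds, iterations=100):
    """Lloyd's algorithm starting from the given seed centroids."""
    centroids = [list(s) for s in seeds]
    k = len(centroids)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Made-up customers: (age, income, gender encoded as 0/1).
customers = [
    (22, 21000, 1), (25, 24000, 1),
    (38, 52000, 0), (41, 55000, 0),
    (67, 30000, 0), (70, 33000, 0),
]
scaled = min_max_scale(customers)
# Hand-picked seeds: one customer from each apparent group.
labels, centroids = kmeans(scaled, seeds=[scaled[0], scaled[2], scaled[4]])
```

With a binary feature such as gender in the distance, cluster membership can be sensitive to the encoding and scaling, which is one reason the talk adds an interactive refinement step after the initial clustering.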
3. Wrapped Node Viz Clusters
3. Summary Statistics (No interactivity!)
4. Linear Regression and Threshold based Decision
- Linear regression model on each cluster and sub-cluster
- prediction > threshold => 1
- prediction <= threshold => 0
- Default threshold = 0.5
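The decision rule on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the KNIME workflow itself: each cluster gets a one-feature ordinary least squares line, the feature and the sample rows are made up, and the function names are hypothetical. Only the rule "prediction > threshold => 1, otherwise 0" and the default threshold of 0.5 come from the slide.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx


def fit_per_cluster(rows):
    """rows: (cluster_id, feature, target) tuples with target 0 or 1."""
    grouped = defaultdict(lambda: ([], []))
    for cluster, x, y in rows:
        grouped[cluster][0].append(x)
        grouped[cluster][1].append(y)
    return {c: fit_line(xs, ys) for c, (xs, ys) in grouped.items()}


def decide(models, cluster, x, threshold=0.5):
    """prediction > threshold => 1, prediction <= threshold => 0."""
    slope, intercept = models[cluster]
    prediction = slope * x + intercept
    return 1 if prediction > threshold else 0


def accuracy(models, rows, threshold=0.5):
    """Share of rows whose thresholded prediction equals the target."""
    hits = sum(decide(models, c, x, threshold) == y for c, x, y in rows)
    return hits / len(rows)


# Made-up rows: (cluster, some numeric feature, bought the product 0/1).
rows = [
    ("young", 0, 0), ("young", 1, 0), ("young", 3, 1), ("young", 4, 1),
    ("older", 1, 1), ("older", 2, 1), ("older", 5, 0), ("older", 6, 0),
]
models = fit_per_cluster(rows)
overall = accuracy(models, rows)  # evaluated on the training rows only
```

Because the accuracy here is computed on the same rows used for fitting, it is optimistic; a held-out set or cross-validation would give a fairer number.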
4. Correct vs. Wrong Visualization
4. Save or Loop?
4. Linear Regression and Threshold based Decision: Would it not be nice to have threshold selection and visual inspection of correct vs. wrong results in the same frame?
4. Automatic Adjustment of Threshold through Scatter Plot Visualization
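The talk adjusts the threshold interactively through a scatter plot. As a non-interactive stand-in, the sketch below tabulates correct and wrong decisions for a list of candidate thresholds and picks the one with the most correct decisions. This is a swapped-in technique, not the mechanism shown in the slides; the predictions, targets and candidate thresholds are made up.

```python
def sweep_thresholds(predictions, targets, thresholds):
    """Return (threshold, correct, wrong) for each candidate threshold."""
    results = []
    for t in thresholds:
        decisions = [1 if p > t else 0 for p in predictions]
        correct = sum(d == y for d, y in zip(decisions, targets))
        results.append((t, correct, len(targets) - correct))
    return results


def best_threshold(predictions, targets, thresholds):
    """Threshold with the most correct decisions (first one wins on ties)."""
    table = sweep_thresholds(predictions, targets, thresholds)
    return max(table, key=lambda row: row[1])[0]


# Made-up regression outputs and true labels.
predictions = [0.1, 0.35, 0.4, 0.45, 0.7, 0.9]
targets = [0, 0, 1, 1, 1, 1]
candidates = [0.25, 0.375, 0.5, 0.625]

table = sweep_thresholds(predictions, targets, candidates)
chosen = best_threshold(predictions, targets, candidates)
```

The `table` rows carry the same information a user would read off a correct-vs-wrong view for each threshold setting.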
5. Audit Report
What we did... and could have done: Data Audit
- Missings? How handled? Too many missings?
- Strange minimum or maximum values?
- Strange mean values or large differences between mean and median?
- Large skew or excessive kurtosis (for algorithms assuming normal distribution)?
- Gaps in distribution, bi-modal or multi-modal?
- Values in categorical variables that don't match valid values
- High-cardinality categorical variables (possibly needing binning or other treatment)
- Categorical variables with a large percentage of a single value
- Unusually strong relationships with the target variable?
- High correlation (possibly indicating redundancy)?
- Report on the data audit and the entire sequence of actions to produce the result
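Several of these audit questions reduce to simple per-column statistics. The following is a minimal sketch in plain Python, assuming numeric columns are lists with None for missing values and categorical columns are lists of labels; the function names and sample data are hypothetical and not part of the KNIME workflow.

```python
import statistics
from collections import Counter


def audit_numeric(values):
    """Missing count, range, mean vs. median, and skewness of one column."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    std = statistics.pstdev(present)
    # Population skewness: third standardized moment (0 for symmetric data).
    third = sum((v - mean) ** 3 for v in present) / len(present)
    return {
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean,
        "median": statistics.median(present),
        "skewness": third / std ** 3 if std else 0.0,
    }


def audit_categorical(values, valid=None):
    """Cardinality, dominant value share, and values outside a valid set."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top_value, top_count = counts.most_common(1)[0]
    invalid = sorted(set(present) - set(valid)) if valid is not None else []
    return {
        "missing": len(values) - len(present),
        "cardinality": len(counts),
        "top_value": top_value,
        "top_share": top_count / len(present),
        "invalid": invalid,
    }


# Made-up columns: one extreme income value, one invalid gender code.
income_report = audit_numeric([20, 22, 25, 24, 21, 200, None])
gender_report = audit_categorical(["F", "M", "F", "X", "F", None],
                                  valid={"F", "M"})
```

In the income example the mean (52) sits far above the median (23) and the skewness is strongly positive, which is the kind of signal the audit list asks about.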
What we did... and could have done: CRM Artificially Generated Data Set
Iris workflow from the EXAMPLES Server to generate:
- Existing first names and last names
- Existing streets and cities
- Income and age with binomial distribution (???)
- Gaussian random gender assignment
- PLZ (postal code) for certain groups of (age, income)
- Shopping basket: 5 insurance products assigned depending on income and age
- Target as 0/1 if customer bought lawyer insurance
- Lawyer assigned following purchase of lawyer insurance
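For readers who want to experiment without the KNIME workflow, a small part of this recipe can be imitated in plain Python. The sketch below is not the generator used for the talk: the distribution parameters, the purchase rule and the field names are all assumptions, and names, addresses, postal codes and the shopping basket are left out.

```python
import random


def binomial(rng, n, p):
    """Number of successes in n independent trials with success chance p."""
    return sum(rng.random() < p for _ in range(n))


def generate_customers(count, seed=0):
    """Generate made-up CRM rows; every parameter here is an assumption."""
    rng = random.Random(seed)
    customers = []
    for customer_id in range(count):
        age = 18 + binomial(rng, 60, 0.4)               # between 18 and 78
        income = 15000 + 1000 * binomial(rng, 80, 0.3)  # 15000 to 95000
        gender = "F" if rng.gauss(0, 1) > 0 else "M"    # sign of a Gaussian draw
        # Hypothetical rule: purchase chance rises with age and income.
        p_buy = 0.1 + 0.005 * (age - 18) + 0.000005 * (income - 15000)
        target = 1 if rng.random() < p_buy else 0
        customers.append({
            "id": customer_id,
            "age": age,
            "income": income,
            "gender": gender,
            "target": target,
        })
    return customers


customers = generate_customers(100, seed=42)
```

Fixing the seed makes the data set reproducible, which helps when the same file is reused across the upload, cleaning and modelling steps.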
What we did... and could have done: Predictive Modelling
- Using multiple models / smarter decision criteria / ensembles
- Clustering
- Time series
- Recommendation
What worked well
- The guided packaging around a functional area
- The number of functions we could quickly make
- The mixing/matching to guide through the analytics
- Generating the data!
- Auditing
Guided Analytics: This was just a first step!
And now:
- Wrapped workflows for standard tasks?
- Feature reduction, creation, etc.?
- Automated decisioning about methods?
- Data testing environment?
- Sharing, discussing, developing best practices
Everyone at KNIME would love to discuss your ideas!
Material, white paper, etc.
A white paper on the first steps:
- The approach
- The workflow
- The data generation
- The auditing
Guided Analytics: The User Perspective
Power to the People
Rosaria Silipo, Phil Winters, Christian Albrecht (KNIME)