Insight 7: Beware of Being Fooled with Model Performance
Mr. Abbott is a seasoned instructor, having taught a wide range of data mining tutorials and seminars for a decade to audiences of up to 400, including DAMA, KDD, AAAI, and IEEE conferences. He is the instructor of well-regarded data mining courses, explaining concepts in language readily understood by a wide range of audiences, including analytics novices, data analysts, statisticians, and business professionals. Mr. Abbott also has taught applied data mining courses for major software vendors, including Clementine (SPSS), Affinium Model (Unica Corporation), Model 1 (Group1 Software), and hands-on courses using S-Plus and Insightful Miner (Insightful Corporation), and CART (Salford Systems).
Topic: Data Evaluation
Sub-Topic: Model Performance
Date Posted: June 2005
Interpreting model performance is a minefield. If one wants model performance to be as good as possible, it is critical to define exactly what "good" means. How does one measure "goodness"? The easiest way to communicate performance is with a single-valued score, such as percent correct classification or R-squared. However, it is precisely this reduction of a model's complex behavior to a single number that can fool the analyst. A simple example follows.
Let's assume that a non-profit organization wants a model that predicts the propensity of individuals to send donations, and that this model achieves better than 80% classification accuracy, even on a test set. Furthermore, assume that "Recent Donation Amount" (X1) and "Average Donation Amount" (X2) are two of the top predictors in the model. The figure at the left shows what a Support Vector Machine model did with this data. Despite the good accuracy, there is something disturbing about the model that isn't clear unless one sees a picture: the model isn't finding ranges of average and recent donation amounts that are associated with donors, but rather it is finding islands of donors. The second model (on the right) was built with corrective measures that smooth the model, and it is much more pleasing. It says (roughly) that when someone donates between about $10 and $50 on average (X2), they are more likely to respond. It is smooth and has no pockets of isolated donation amounts, which makes this model much more believable, even though some accuracy was lost in the process.
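The "islands versus ranges" effect can be reproduced in miniature. The sketch below is not the Support Vector Machine from the figures: it swaps in a simple kernel-weighted vote on a single variable (X2), and the donation amounts, the `bandwidth` parameter, and all function names are hypothetical, chosen only for illustration. A narrow bandwidth stands in for the unsmoothed model and a wide bandwidth for the smoothed one.

```python
import math

# Hypothetical, hand-made training data (NOT from the article):
# average donation amounts (X2, in dollars) for donors and non-donors.
DONORS = [12, 18, 25, 30, 38, 45, 70]
NON_DONORS = [5, 8, 22, 34, 55, 60, 65, 75, 80, 90]


def score(x, bandwidth):
    """Kernel-weighted vote at x; a positive score means 'donor' is predicted."""
    def kernel(center):
        return math.exp(-((x - center) / bandwidth) ** 2)
    return sum(kernel(d) for d in DONORS) - sum(kernel(n) for n in NON_DONORS)


def training_accuracy(bandwidth):
    """Fraction of the training points that the vote classifies correctly."""
    correct = sum(score(d, bandwidth) > 0 for d in DONORS)
    correct += sum(score(n, bandwidth) <= 0 for n in NON_DONORS)
    return correct / (len(DONORS) + len(NON_DONORS))


def donor_regions(bandwidth):
    """Count separate X2 intervals predicted as donors, scanning $0-$100 in $0.50 steps."""
    regions = 0
    previous = False
    for i in range(201):
        current = score(i * 0.5, bandwidth) > 0
        if current and not previous:
            regions += 1  # entered a new block of 'donor' predictions
        previous = current
    return regions


for label, bandwidth in [("narrow", 1.5), ("wide", 12.0)]:
    print(label,
          "accuracy:", round(training_accuracy(bandwidth), 2),
          "donor regions:", donor_regions(bandwidth))
```

With the narrow bandwidth the vote fits every training point but splits X2 into several disconnected donor intervals. With the wide bandwidth it gives up some accuracy and predicts donors over a single contiguous range. Note that the accuracy here is measured on the toy training points, not on a held-out test set as in the example above. The accuracy score alone would favor the first model; only the count of regions (or a picture) reveals the difference.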