Insight 2: Beware of Outliers in Computing Correlations
Mr. Abbott is a seasoned instructor, having taught a wide range of data mining tutorials and seminars for a decade to audiences of up to 400, including DAMA, KDD, AAAI, and IEEE conferences. He is the instructor of well-regarded data mining courses, explaining concepts in language readily understood by a wide range of audiences, including analytics novices, data analysts, statisticians, and business professionals. Mr. Abbott also has taught applied data mining courses for major software vendors, including Clementine (SPSS), Affinium Model (Unica Corporation), Model 1 (Group1 Software), and hands-on courses using S-Plus and Insightful Miner (Insightful Corporation), and CART (Salford Systems).
Topic: Data Preparation
Sub-Topic: Outliers
Date Posted: January 2005
Outliers can distort summary statistics in a variety of ways. One such way is in computing correlations: it doesn’t take many outliers to change a correlation coefficient significantly. Take, for example, 1,000 random samples of a variable X, uniformly distributed over the range [0,1]. A second variable, Y, is a combination of X and a second uniform random sample, constructed so that X and Y have a correlation coefficient of 0.99. Now suppose two outliers are introduced among the 1,000 data points. Their X values are unchanged, but their Y values are magnified by a factor of 10, which places them well away from the trend of the original data points. The left figure below shows the strong correlation of X and Y, and the figure to the right shows the same data with two outliers.
Summary statistics for the four variables are shown in Table 1 below. The correlations (Table 2), however, drop sharply, from 0.99 to 0.49.
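The experiment can be approximated with a short simulation. This is a minimal sketch under stated assumptions, not the original analysis: the noise weight, the random seed, and the choice of which two points become outliers are all illustrative, so the correlation with outliers will not match the 0.49 reported above exactly. The function name `simulate` is hypothetical.

```python
import numpy as np


def simulate(n=1000, target_r=0.99, seed=0):
    """Return (r_clean, r_with_outliers) for a simulated X, Y pair."""
    rng = np.random.default_rng(seed)

    # X and an independent noise term, both uniform on [0, 1].
    x = rng.uniform(0.0, 1.0, n)
    u = rng.uniform(0.0, 1.0, n)

    # Y = X + a*U. Since X and U have equal variance, corr(X, Y) is
    # 1 / sqrt(1 + a^2), so this weight gives roughly target_r.
    a = np.sqrt(1.0 / target_r**2 - 1.0)
    y = x + a * u

    # Assumption: the two outliers are the points whose X is closest
    # to 0.5. Their Y values are magnified by a factor of 10, which
    # moves them well off the trend line. X is left unchanged.
    idx = np.argsort(np.abs(x - 0.5))[:2]
    y_out = y.copy()
    y_out[idx] *= 10.0

    r_clean = np.corrcoef(x, y)[0, 1]
    r_with_outliers = np.corrcoef(x, y_out)[0, 1]
    return r_clean, r_with_outliers


r_clean, r_with_outliers = simulate()
print(f"correlation without outliers: {r_clean:.2f}")
print(f"correlation with 2 outliers:  {r_with_outliers:.2f}")
```

Changing which two points are magnified changes how far the coefficient falls, which is itself a reminder of how sensitive the Pearson correlation is to a handful of extreme values.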
Therefore, test your data for outliers that could be influencing summary statistics. More on how to do that in a future issue of Abbott Insights™.