Abbott Analytics: Data Mining Consulting
About Us

Mr. Abbott is a seasoned instructor, having taught a wide range of data mining tutorials and seminars for a decade to audiences of up to 400, including DAMA, KDD, AAAI, and IEEE conferences. He is the instructor of well-regarded data mining courses, explaining concepts in language readily understood by a wide range of audiences, including analytics novices, data analysts, statisticians, and business professionals. Mr. Abbott also has taught applied data mining courses for major software vendors, including Clementine (SPSS), Affinium Model (Unica Corporation), Model 1 (Group1 Software), and hands-on courses using S-Plus and Insightful Miner (Insightful Corporation), and CART (Salford Systems).

Abbott Insights™, Data Mining Advice

Insight 5: Beware of Automatic Handling of Categorical Variables

Topic: Data Understanding and Data Preparation
Sub-Topic: Feature Selection and Creation
Date Posted: April 2005


Categorical variables are often difficult to use because many data mining algorithms require that input (independent) variables be continuous. Fortunately, many data mining software tools handle this problem for you by converting a single categorical variable with "N" values into "N" new dummy variables, one for each value. For example, if you have a field "State" with 50 text labels, the tools will automatically create 50 new variables with values 0 or 1. If a record has the value "MA" in the variable State, the new dummy column representing "MA" will have value "1", and all 49 other state dummy columns will have value "0". Because of this, analysts don't have to convert all their text and categorical variables to numeric variables prior to modeling.
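The conversion described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the way any particular tool implements it; the function name `make_dummies` and the three-record "State" column are hypothetical.

```python
def make_dummies(values):
    """Expand one categorical column into one 0/1 column per distinct value."""
    categories = sorted(set(values))
    return {
        category: [1 if v == category else 0 for v in values]
        for category in categories
    }


# Hypothetical "State" column with three records (illustrative data only).
state = ["MA", "CA", "MA"]
dummies = make_dummies(state)

# One new column per distinct state; each record has a 1 in exactly one column.
for category, column in dummies.items():
    print(category, column)
```

With 50 distinct states, the same call would produce 50 columns from the single "State" field.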

However, the automatic handling of categorical variables can cause problems that are hidden from you. Instead of having one input variable in your model (as it appears when you select input variables), you could have hundreds! This can affect decision trees (which are biased toward variables with more categories) and neural network sensitivities (which are often biased toward categorical variables with large numbers of categories). In other words, there is a hidden bias toward variables with larger numbers of categories that could skew your interpretation of the models.
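To see how far the real input count can drift from the number of variables you selected, a rough tally like the following may help. It assumes that every non-numeric column is expanded into one dummy per distinct value, which is a simplification; the function name `effective_input_count` and the sample table are hypothetical.

```python
def effective_input_count(table):
    """Count model inputs after dummy expansion.

    Assumption: numeric columns count as one input each, and every other
    column is expanded into one dummy variable per distinct value.
    """
    total = 0
    for values in table.values():
        if all(isinstance(v, (int, float)) for v in values):
            total += 1
        else:
            total += len(set(values))
    return total


# Hypothetical data: two selected variables, one numeric and one categorical.
table = {
    "age": [34, 51, 27, 45],
    "state": ["MA", "CA", "NY", "TX"],
}
selected = len(table)
expanded = effective_input_count(table)
print(selected, "selected variables become", expanded, "model inputs")
```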

What should one do? First, be aware of these variables. During the data understanding stage of your data mining project, identify variables with large numbers of categories. This will at least alert you to the possibility of bias in your models or sensitivities. Second, if there are more than a dozen or two categories, consider binning those variables by combining dummy variables with smaller counts into larger groups, or dropping them altogether. More on identifying the significance of categorical variable values in an upcoming Abbott Insights.
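Both steps can be sketched in plain Python. The threshold of 12 categories, the minimum count, the "OTHER" label, and the function names `high_cardinality` and `bin_rare` are all assumptions chosen for illustration; the right cutoffs depend on your data.

```python
from collections import Counter


def high_cardinality(table, max_categories=12):
    """Return {column: distinct count} for text columns above the limit."""
    report = {}
    for name, values in table.items():
        if any(isinstance(v, str) for v in values):
            distinct = len(set(values))
            if distinct > max_categories:
                report[name] = distinct
    return report


def bin_rare(values, min_count, other_label="OTHER"):
    """Combine categories seen fewer than min_count times into one group."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical "State" column: two common values and three rare ones.
state = ["MA"] * 5 + ["CA"] * 4 + ["VT", "WY", "RI"]

# Step 1: flag the variable (a low limit is used here to suit the tiny sample).
flagged = high_cardinality({"state": state}, max_categories=4)

# Step 2: fold the rare states into a single group.
binned = bin_rare(state, min_count=3)
print(flagged, sorted(set(binned)))
```

After binning, the five original categories collapse to three, so the tool would create three dummy variables instead of five.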
