Let's look at the top five mistakes people make when preparing data for analysis.
1. Including ID Fields as Predictors
Because many IDs look like continuous integers (and older IDs are typically smaller), they can slip into the model as predictor variables. Exclude them as early in the process as possible to avoid confusion while building your model.
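As a minimal sketch of excluding ID fields up front, the snippet below strips identifier columns from a list of records before any modeling happens. The field names (`donor_id`, `age`, `gifts_last_year`) and the helper `drop_id_fields` are hypothetical and only illustrate the idea.

```python
# Hypothetical records; "donor_id" is an identifier, not a real predictor.
records = [
    {"donor_id": 1001, "age": 34, "gifts_last_year": 2},
    {"donor_id": 1002, "age": 51, "gifts_last_year": 0},
]

# Assumption: identifier columns are listed explicitly by name.
ID_FIELDS = {"donor_id"}


def drop_id_fields(rows, id_fields):
    """Return copies of the rows without the identifier columns."""
    return [
        {name: value for name, value in row.items() if name not in id_fields}
        for row in rows
    ]


predictors = drop_id_fields(records, ID_FIELDS)
print(predictors)
```

Listing the ID fields explicitly, rather than guessing from column names, keeps the exclusion deliberate and easy to review.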
2. Using Anachronistic Variables
Make sure that no predictor variable contains information about the outcome. Because models are built on historical data, some of the variables you can access when building the model may not have been available at the point in time the model is meant to reflect. No predictor variable should be a proxy for your dependent variable (e.g., “made a gift” = donor, “deposited” = enrolled).
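One way to make this check concrete, assuming you can record when each variable's value becomes known, is to compare that date against the moment the model is supposed to make its prediction. The variable names, dates, and the helper `usable_predictors` below are hypothetical.

```python
from datetime import date

# Hypothetical metadata: the date on which each variable's value becomes known.
variable_available_on = {
    "age": date(2023, 1, 1),
    "campus_visit": date(2023, 3, 15),
    "deposited": date(2023, 6, 1),  # only known after the outcome is decided
}

# Assumed point in time at which the model would make its prediction.
prediction_date = date(2023, 4, 1)


def usable_predictors(available_on, cutoff):
    """Keep only variables whose values were known on or before the cutoff."""
    return sorted(name for name, known in available_on.items() if known <= cutoff)


usable = usable_predictors(variable_available_on, prediction_date)
print(usable)  # ['age', 'campus_visit']
```

Here `deposited` is dropped because it is recorded after the prediction date, so it could only act as a proxy for the outcome.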
3. Allowing Duplicate Records
Avoid including duplicates in a model file. Including two records for one person gives that person twice as much weight in the model. To make sure that each person’s influence counts equally, include only one record per person or action being modeled. It never hurts to dedupe your model file before you start building a predictive model.
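A deduplication pass can be sketched as follows, assuming each record carries a key that identifies the person (here a hypothetical `person_id`) and that keeping the first record seen is acceptable for your data.

```python
# Hypothetical model file with one person appearing twice.
rows = [
    {"person_id": 1, "gift": 50},
    {"person_id": 2, "gift": 20},
    {"person_id": 1, "gift": 50},
]


def dedupe(records, key):
    """Keep the first record for each distinct key value, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


deduped = dedupe(rows, "person_id")
print(len(rows), "->", len(deduped))  # 3 -> 2
```

If the duplicates differ in their other fields, deciding which record to keep is a judgment call that this sketch does not make for you.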
4. Modeling on Too Small a Population
Double-check your population size. The “right” population size depends on too many factors for me to feel comfortable throwing out an exact number that will apply in all cases, but a litmus test is to make sure your modeling dataset is “filled out”, meaning that there are plenty of data points in each category of each variable. For example, if your dataset consists of 100 records, a variable like “gender” will probably be a lot more filled out than a variable like “state” (assuming the dataset is evenly distributed), because the same 100 records are spread across far more categories. So, you’ll have to adjust the “right” number of records accordingly. Generally, the larger your population size is, the more robust your model will be.
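The “filled out” test can be approximated by counting records per category and flagging the thin ones. The sketch below builds a hypothetical, evenly distributed dataset of 100 records; the threshold of 30 records per category is an arbitrary assumption for illustration, not a recommended value.

```python
from collections import Counter

# Hypothetical dataset: 100 records spread evenly over 2 genders and 50 states.
rows = [
    {"gender": ("F", "M")[i % 2], "state": f"S{i % 50:02d}"}
    for i in range(100)
]


def sparse_categories(records, column, min_count):
    """Return the categories of a column that have fewer than min_count records."""
    counts = Counter(record[column] for record in records)
    return {value: n for value, n in counts.items() if n < min_count}


MIN_COUNT = 30  # assumed threshold, chosen only for this example

sparse_gender = sparse_categories(rows, "gender", MIN_COUNT)
sparse_state = sparse_categories(rows, "state", MIN_COUNT)
print(len(sparse_gender), "sparse gender categories")  # 0
print(len(sparse_state), "sparse state categories")    # 50
```

With 100 records, each gender has 50 records while each state has only 2, which is the gap the example in the text describes.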
5. Not Accounting for Outliers and/or Missing Values
Be sure to account for any outliers and/or missing values. Large gaps or extreme values in individual variables can add up when you’re combining those variables to build a predictive model. Checking the minimum and maximum values for each variable can be a quick way to spot any records that fall outside the usual range.
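The min/max check can be sketched as a small per-variable summary that also counts missing values. The records and the helper `summarize` are hypothetical, and missing values are assumed to be stored as `None`.

```python
# Hypothetical records; one age is missing and one looks like a data-entry error.
rows = [
    {"age": 34},
    {"age": None},
    {"age": 51},
    {"age": 430},
]


def summarize(records, column):
    """Report the missing count and the min/max of the values that are present."""
    values = [record.get(column) for record in records]
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


summary = summarize(rows, "age")
print(summary)  # {'missing': 1, 'min': 34, 'max': 430}
```

A maximum age of 430 stands out immediately; whether to correct, cap, or drop such a record depends on the data and is left open here.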