Data Preparation Mistakes

Let's look at the top five mistakes people make when preparing data for analysis.

1. Including ID Fields as Predictors

Because many IDs look like continuous integers (and older IDs are typically smaller), they can slip into the model as predictive variables. Exclude them as early in the process as possible to avoid confusion while building your model.
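As a rough illustration, the sketch below flags fields whose values are all distinct integers, which is a common signature of an ID column, and strips them before modeling. The records, the field names, and the helper find_id_like_fields are hypothetical, and the heuristic only surfaces candidates for review: a genuine numeric predictor can also happen to be unique in a small file.

```python
# Hypothetical records; the field names are made up for illustration.
records = [
    {"donor_id": 1001, "age": 34, "gifts_last_year": 2},
    {"donor_id": 1002, "age": 51, "gifts_last_year": 0},
    {"donor_id": 1003, "age": 34, "gifts_last_year": 2},
    {"donor_id": 1004, "age": 47, "gifts_last_year": 0},
]


def find_id_like_fields(rows):
    """Return fields whose values are all distinct integers."""
    id_like = []
    for field in rows[0]:
        values = [row[field] for row in rows]
        # bool is a subclass of int in Python, so exclude it explicitly.
        all_ints = all(
            isinstance(v, int) and not isinstance(v, bool) for v in values
        )
        if all_ints and len(set(values)) == len(values):
            id_like.append(field)
    return id_like


id_fields = find_id_like_fields(records)

# Keep every field except the flagged ID candidates.
predictors = [
    {k: v for k, v in row.items() if k not in id_fields} for row in records
]
```

The ID values are still worth keeping in a separate lookup so that scores can be joined back to people later; they just should not sit among the predictors.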

2. Using Anachronistic Variables

Make sure that no predictor variable contains information about the outcome. Because models are built on historical data, some of the variables you can access when building the model may not have been available at the point in time the model is meant to reflect. No predictor variable should be a proxy for your dependent variable (e.g., “made a gift” = donor, “deposited” = enrolled).
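One quick way to catch the most blatant proxies is to look for fields that map one-to-one onto the outcome. The sketch below does that for a handful of hypothetical enrollment records; the field names and the helper find_perfect_proxies are assumptions made for illustration. It will not catch subtler leakage, such as a variable that is merely recorded after the outcome, so it complements a review of when each field becomes known rather than replacing it.

```python
# Hypothetical records: "deposited" is only known once a student has
# effectively enrolled, so it leaks the outcome.
applicants = [
    {"deposited": True, "campus_visit": True, "enrolled": True},
    {"deposited": False, "campus_visit": True, "enrolled": False},
    {"deposited": True, "campus_visit": False, "enrolled": True},
    {"deposited": False, "campus_visit": False, "enrolled": False},
]


def find_perfect_proxies(rows, target):
    """Return fields whose values correspond one-to-one with the target."""
    proxies = []
    for field in rows[0]:
        if field == target:
            continue
        pairs = {(row[field], row[target]) for row in rows}
        field_values = {f for f, _ in pairs}
        target_values = {t for _, t in pairs}
        # One-to-one: every field value pairs with exactly one target
        # value, and every target value with exactly one field value.
        if len(pairs) == len(field_values) == len(target_values):
            proxies.append(field)
    return proxies


leaky_fields = find_perfect_proxies(applicants, "enrolled")
```

Here only "deposited" is flagged, because "campus_visit" occurs with both outcomes.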

3. Allowing Duplicate Records

Avoid including duplicates in a model file. Including two records for one person gives that person twice as much influence on the model. To make sure that each person’s influence counts equally, include only one record per person or action being modeled. It never hurts to dedupe your model file before you start building a predictive model.
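A minimal sketch of deduplication, assuming each record carries a key that identifies the person or action being modeled (the person_id field and the dedupe helper are hypothetical). It keeps the first record seen for each key; which record to keep in practice, such as the most recent or the most complete, is a separate decision.

```python
# Hypothetical model file with one person appearing twice.
model_file = [
    {"person_id": "A17", "age": 34},
    {"person_id": "B22", "age": 51},
    {"person_id": "A17", "age": 34},
    {"person_id": "C03", "age": 47},
]


def dedupe(rows, key):
    """Keep the first record for each distinct value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


deduped = dedupe(model_file, "person_id")
duplicates_removed = len(model_file) - len(deduped)
```

Comparing the record count before and after is a cheap sanity check that the file really holds one row per person.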

4. Modeling on Too Small a Population

Double-check your population size. The "right" population size depends on too many factors for me to feel comfortable giving an exact number that applies in all cases, but a litmus test is to make sure your modeling dataset is "filled out", meaning that there are plenty of data points in each category of each variable. For example, if your dataset consists of 100 records, a variable like "gender" will probably be a lot more filled out than a variable like "state" (assuming the dataset is evenly distributed). So you'll have to adjust the "right" number of records accordingly. Generally, the larger your population, the more robust your model will be.
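The gender versus state example can be made concrete by counting records per category. The sketch below builds 100 evenly distributed hypothetical records with 2 gender values and 50 placeholder state codes, then lists the categories that fall below a minimum count. The threshold of 30 and the helper sparse_categories are assumptions chosen for illustration, not a recommended cutoff.

```python
from collections import Counter

# 100 hypothetical records, evenly distributed:
# 2 gender values (50 records each) and 50 state codes (2 records each).
population = [
    {"gender": ("F", "M")[i % 2], "state": f"S{i % 50:02d}"}
    for i in range(100)
]


def sparse_categories(rows, field, min_count):
    """Return the categories of `field` with fewer than `min_count` records."""
    counts = Counter(row[field] for row in rows)
    return {value: n for value, n in counts.items() if n < min_count}


MIN_COUNT = 30  # arbitrary threshold, for illustration only

sparse_gender = sparse_categories(population, "gender", MIN_COUNT)
sparse_state = sparse_categories(population, "state", MIN_COUNT)
```

With these records no gender category is sparse, while all 50 state categories are, each holding only 2 records. That is the sense in which the same 100 records can be enough for one variable and far too few for another.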

5. Not Accounting for Outliers and/or Missing Values

Be sure to account for any outliers and/or missing values. Large gaps or extreme values in individual variables can add up when you’re combining those variables to build a predictive model. Checking the minimum and maximum values for each variable can be a quick way to spot any records that fall outside the usual range.
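The minimum and maximum check, along with a count of missing values, can be sketched as follows. The records and the summarize helper are hypothetical, and a missing value is assumed to be stored as None.

```python
# Hypothetical records: one missing age and one implausible age.
donors = [
    {"age": 34, "gift_total": 120.0},
    {"age": None, "gift_total": 75.0},
    {"age": 47, "gift_total": 0.0},
    {"age": 430, "gift_total": 50.0},
]


def summarize(rows, field):
    """Return the missing-value count, minimum, and maximum for `field`."""
    values = [row.get(field) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


age_summary = summarize(donors, "age")
gift_summary = summarize(donors, "gift_total")
```

A maximum age of 430 stands out immediately as a likely data-entry error. Whether to correct, cap, or drop such a record, and how to fill or exclude the missing value, depends on the model and is not decided by this check.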


Copyright © 2019 Data Scientist. All rights reserved.