Types of Regression

Which type of regression should you choose and use? Does it depend on the context? It is no secret that there are many kinds of regression, and it is important to know their drawbacks and how widely each one applies. Let's look at some of them.

  • Linear regression: The oldest type of regression, designed more than 200 years ago; by design, computations on small data sets could easily be carried out by hand. It can be used for interpolation but is poorly suited to predictive analytics. It has many drawbacks when applied to modern data, e.g. sensitivity to both outliers and cross-correlations (in both the variable and observation domains), and it is prone to over-fitting. Piecewise-linear regression is often a better solution, in particular for time series.
  • Logistic regression: Used extensively in clinical trials, scoring, and fraud detection, when the response is binary (chance of succeeding or failing, e.g. for a newly tested drug or a credit card transaction). It suffers from the same drawbacks as linear regression (not robust, model-dependent), and computing the regression coefficients requires an iterative algorithm that can be numerically unstable. It can be well approximated by linear regression after transforming the response (logit transform). Related models (such as Poisson or Cox regression) have been designed for non-binary responses: categorical data (classification), ordered integer responses (age groups), and even continuous responses (regression trees).
  • Ridge regression: A more robust version of linear regression that puts constraints on the regression coefficients to make them more natural, less prone to over-fitting, and easier to interpret.
  • Lasso regression: Similar to ridge regression, but automatically performs variable reduction (its penalty allows regression coefficients to be exactly zero).
  • Ecological regression: Consists of performing one regression per stratum, when your data is segmented into several rather large core strata, groups, or bins. Beware of the curse of big data in this context: if you perform millions of regressions, some will be totally wrong, and the best ones will be overshadowed by noisy ones with great but artificial goodness-of-fit. This is a big concern if you try to identify extreme events and causal relationships (global warming, rare diseases, or extreme flood modeling).
  • Logic regression: Used when all variables are binary, typically in scoring algorithms. It is a specialized, more robust form of logistic regression in which all variables have been binned into binary variables (useful for fraud detection, where each variable is a 0/1 rule).
  • Bayesian regression: See the Wikipedia entry. It is a kind of penalized likelihood estimator, and thus somewhat similar to ridge regression: more flexible and stable than traditional linear regression. It assumes that you have some prior knowledge about the regression coefficients and the error term, which relaxes the assumption that the error must have a normal distribution (the errors must still be independent across observations). In practice, however, the prior knowledge is translated into artificial (conjugate) priors, which is a weakness of this technique.
  • Quantile regression: Used in connection with extreme events; see Common Errors in Statistics, page 238, for details.
  • LAD (least absolute deviations) regression: Similar to linear regression, but minimizes the sum of absolute residuals (L1 norm) rather than the sum of squared residuals (L2 norm). More robust to outliers.
  • Jackknife regression: A newer type of regression, also used as a general clustering and data reduction technique. It is designed to address the drawbacks of traditional regression. It provides an approximate, yet very accurate and robust, solution to regression problems, and works well with "independent" variables that are correlated and/or non-normal (for instance, data distributed according to a mixture model with several modes). Ideal for black-box predictive algorithms. It approximates linear regression quite well, but it is much more robust, and it works when the assumptions of traditional regression (uncorrelated variables, normal data, homoscedasticity) are violated.
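To make the robustness contrast between linear (least-squares) and LAD regression in the list above concrete, here is a minimal pure-Python sketch. The data set and the function names `ols_line` and `lad_line` are made up for illustration and are not from the original article. The LAD fit uses a brute-force search, relying on the fact that with a single predictor some optimal LAD line passes through two of the data points; this is fine for a toy example but not how production solvers work.

```python
from itertools import combinations

def ols_line(xs, ys):
    """Least-squares fit of y = a + b*x (minimizes squared residuals)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def lad_line(xs, ys):
    """Least-absolute-deviations fit of y = a + b*x (minimizes |residuals|).

    Brute force: try the line through every pair of points with distinct x
    and keep the one with the smallest sum of absolute residuals.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # vertical line, not a valid candidate
        b = (y2 - y1) / (x2 - x1)
        a = y1 - b * x1
        loss = sum(abs(y - (a + b * x)) for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best[1], best[2]

# Illustrative data: nine points exactly on y = 1 + 2x, plus one gross outlier.
xs = list(range(10))
ys = [1 + 2 * x for x in xs]
ys[-1] = 100  # the value on the line would be 19

a_ols, b_ols = ols_line(xs, ys)
a_lad, b_lad = lad_line(xs, ys)
print("OLS intercept, slope:", a_ols, b_ols)
print("LAD intercept, slope:", a_lad, b_lad)
```

On this toy data the single outlier drags the least-squares slope far above 2, while the LAD fit recovers the line followed by the other nine points.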

Note: Jackknife regression has nothing to do with Bradley Efron’s jackknife, bootstrap, and other resampling techniques published in 1982; indeed, it does not involve resampling at all.
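Returning to the ridge and lasso entries in the list: the sketch below compares ordinary least squares, ridge, and lasso on two strongly correlated predictors, using pure Python. Everything here is an illustrative assumption: the tiny data set, the function names, the no-intercept two-predictor model, and the penalty value. Ridge is solved in closed form for the objective ||y - Xb||^2 + lam*||b||^2, and lasso by cyclic coordinate descent for 0.5*||y - Xb||^2 + lam*||b||_1, so the two penalty values are not on the same scale.

```python
def gram(x1, x2, y):
    """Entries of X'X and X'y for a two-predictor model without intercept."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    t1 = sum(a * v for a, v in zip(x1, y))
    t2 = sum(b * v for b, v in zip(x2, y))
    return s11, s12, s22, t1, t2

def ridge(x1, x2, y, lam):
    """Solve (X'X + lam*I) b = X'y by Cramer's rule; lam = 0 gives OLS."""
    s11, s12, s22, t1, t2 = gram(x1, x2, y)
    a11, a22 = s11 + lam, s22 + lam
    det = a11 * a22 - s12 * s12
    return (t1 * a22 - s12 * t2) / det, (a11 * t2 - s12 * t1) / det

def soft_threshold(r, lam):
    """Shrink r toward zero by lam; values within [-lam, lam] become 0."""
    if r > lam:
        return r - lam
    if r < -lam:
        return r + lam
    return 0.0

def lasso(x1, x2, y, lam, sweeps=200):
    """Cyclic coordinate descent for 0.5*||y - Xb||^2 + lam*||b||_1."""
    s11, s12, s22, t1, t2 = gram(x1, x2, y)
    b1 = b2 = 0.0
    for _ in range(sweeps):
        # Each update fits one coefficient to the residual left by the other.
        b1 = soft_threshold(t1 - s12 * b2, lam) / s11
        b2 = soft_threshold(t2 - s12 * b1, lam) / s22
    return b1, b2

# Illustrative data: x2 is almost a copy of x1, and y is roughly 3 * x1.
x1 = [-2, -1, 0, 1, 2]
x2 = [-1.9, -1.2, 0.1, 0.9, 2.1]
y = [-6.1, -2.8, 0.2, 2.9, 5.8]

b_ols = ridge(x1, x2, y, lam=0.0)
b_ridge = ridge(x1, x2, y, lam=1.0)
b_lasso = lasso(x1, x2, y, lam=1.0)
print("OLS:  ", b_ols)
print("Ridge:", b_ridge)
print("Lasso:", b_lasso)
```

On this data, least squares returns roughly (3.40, -0.44), with a negative weight on the near-duplicate predictor; ridge spreads the weight over both predictors, roughly (1.46, 1.33); and lasso keeps only the first predictor, roughly (2.85, 0), which is the variable reduction mentioned in the lasso entry.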

Copyright © 2016-2021 Data Scientist. All rights reserved.