1. How do you assess the statistical significance of an insight?
- Is this insight just observed by chance or is it a real insight?
Statistical significance can be assessed using hypothesis testing:
— Stating a null hypothesis, which is usually the opposite of what we wish to test (classifiers A and B perform equivalently; treatment A is equal to treatment B)
— Then, we choose a suitable statistical test and the test statistic used to reject the null hypothesis
— Also, we choose a critical region for the statistic to lie in that is extreme enough for the null hypothesis to be rejected (p-value)
— We calculate the observed test statistic from the data and check whether it lies in the critical region
— One sample Z test
— Two-sample Z test
— One sample t-test
— Paired t-test
— Two sample pooled equal variances t-test
— Two sample unpooled unequal variances t-test and unequal sample sizes (Welch’s t-test)
— Chi-squared test for variances
— Chi-squared test for goodness of fit
— ANOVA (for instance: are two regression models equal? F-test)
— Regression F-test (i.e.: is at least one of the predictors useful in predicting the response?)
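As a quick illustration of the mechanics, here is a minimal sketch of Welch's t-test in Python (the fold accuracies for classifiers A and B are made-up numbers, and the critical-value comparison stands in for a full p-value computation):

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic and degrees of freedom for two samples
    with (possibly) unequal variances and sample sizes."""
    va, vb = variance(a), variance(b)   # sample variances (n-1 denominator)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Illustrative data: accuracies of classifiers A and B over 8 CV folds
a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85]
b = [0.76, 0.74, 0.77, 0.75, 0.73, 0.78, 0.74, 0.76]
t, df = welch_t(a, b)
# Reject H0 ("A and B perform equivalently") if |t| exceeds the critical
# value for the chosen level (roughly 2.2 for alpha=0.05 at df around 12)
print(round(t, 2), round(df, 1))
```

With these illustrative samples the statistic lands far beyond the critical value, so the null hypothesis of equivalent performance would be rejected.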
2. Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?
- In long-tailed distributions, a high-frequency population is followed by a low-frequency population which gradually tails off asymptotically
- Rule of thumb: the majority of occurrences (more than half, and where the Pareto principle applies, 80%) are accounted for by the first 20% of items in the distribution
- Collectively, the least frequently occurring 80% of items still represent a substantial proportion of the total population
- Zipf’s law, Pareto distribution, power laws
1) Natural language
— Given a corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table
— The most frequent word will occur twice as often as the second most frequent, three times as often as the third most frequent…
— “The” accounts for 7% of all word occurrences (70000 over 1 million)
— “of” accounts for 3.5%, followed by “and”…
— Only 135 vocabulary items are needed to account for half the English corpus!
2) Allocation of wealth among individuals: the larger portion of the wealth of any society is controlled by a smaller percentage of the people
3) File size distribution of Internet Traffic
Additional: Hard disk error rates, values of oil reserves in a field (a few large fields, many small ones), sizes of sand particles, sizes of meteorites
Importance in classification and regression problems:
— Skewed distribution
— Which metrics to use? Accuracy paradox (classification), F-score, AUC
— Issue when using models that make assumptions on the linearity (linear regression): need to apply a monotone transformation on the data (logarithm, square root, sigmoid function…)
— Issue when sampling: your data becomes even more unbalanced! Use stratified sampling instead of random sampling, SMOTE (“Synthetic Minority Over-sampling Technique”, NV Chawla) or an anomaly detection approach
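The accuracy paradox mentioned above can be shown in a few lines; the 95/5 class split below is made up for illustration:

```python
# Imbalanced labels: 95 negatives, 5 positives (a long-tailed class distribution)
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100          # a "classifier" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy)  # 0.95 -- looks great
print(recall)    # 0.0  -- the model never finds the minority class
```

This is why F-score or AUC, not raw accuracy, are the metrics of choice on skewed data.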
3. What is the Central Limit Theorem? Explain it. Why is it important?
The CLT states that the arithmetic mean of a sufficiently large number of independent, identically distributed random variables will be approximately normally distributed, regardless of the underlying distribution. i.e.: the sampling distribution of the sample mean is approximately normal.
— Used in hypothesis testing
— Used for confidence intervals
— Random variables must be iid: independent and identically distributed
— Finite variance
4. What is statistical power?
- Sensitivity of a binary hypothesis test
- Probability that the test correctly rejects the null hypothesis H0 when the alternative H1 is true
- Ability of a test to detect an effect, if the effect actually exists
- Power = P(reject H0 | H1 is true)
- As power increases, chances of Type II error (false negative) decrease
- Used in the design of experiments to calculate the minimum sample size required so that one can reasonably detect an effect. i.e.: “how many times do I need to flip a coin to conclude it is biased?”
- Used to compare tests. Example: between a parametric and a non-parametric test of the same hypothesis
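The coin-flip question above can be answered with a standard normal-approximation power calculation; this Python sketch assumes a true bias of p=0.6, a two-sided 5% level and 80% power (the z quantiles 1.96 and 0.8416 are standard table values):

```python
import math

def flips_needed(p1=0.6, alpha_z=1.96, power_z=0.8416):
    """Approximate number of coin flips needed to detect a bias of p1
    against a fair coin (p0=0.5), via the normal approximation.
    alpha_z: z for a two-sided 5% level; power_z: z for 80% power."""
    p0 = 0.5
    num = alpha_z * math.sqrt(p0 * (1 - p0)) + power_z * math.sqrt(p1 * (1 - p1))
    return math.ceil((num / (p1 - p0)) ** 2)

print(flips_needed())  # 194 flips for a 0.6-biased coin
```

Detecting a smaller bias (say p=0.55) would require roughly four times as many flips, since the required n scales with the inverse square of the effect size.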
5. Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?
- Selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved
— Sampling bias: systematic error due to a non-random sample of a population causing some members to be less likely to be included than others
— Time interval: a trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all the variables have similar means
— Data: “cherry picking”, when specific subsets of the data are chosen to support a conclusion (citing examples of plane crashes as evidence that airline flight is unsafe, while ignoring the far more common flights that complete safely)
— Studies: performing experiments and reporting only the most favorable results
— Can lead to inaccurate or even erroneous conclusions
— Statistical methods can generally not overcome it
Why can missing data handling make it worse?
— Example: individuals who know or suspect that they are HIV positive are less likely to participate in HIV surveys
— Missing data handling will amplify this effect, as imputation is based mostly on the HIV-negative respondents
— Prevalence estimates will be inaccurate
6. Provide a simple example of how an experimental design can help answer a question about behavior. How does experimental data contrast with observational data?
- You are researching the effect of music-listening on studying efficiency
- You might divide your subjects into two groups: one would listen to music and the other (control group) wouldn’t listen to anything!
- You give them a test
- Then, you compare grades between the two groups
Differences between observational and experimental data:
— Observational data: measures the characteristics of a population by studying individuals in a sample, but doesn’t attempt to manipulate or influence the variables of interest
— Experimental data: applies a treatment to individuals and attempts to isolate the effects of the treatment on a response variable
Observational data: find 100 women age 30, of which 50 have been smoking a pack a day for 10 years while the other 50 have been smoke-free for 10 years. Measure lung capacity for each of the 100 women. Analyze, interpret and draw conclusions from the data.
Experimental data: find 100 women age 20 who don’t currently smoke. Randomly assign 50 of the 100 women to the smoking treatment and the other 50 to the no smoking treatment. Those in the smoking group smoke a pack a day for 10 years while those in the control group remain smoke free for 10 years. Measure lung capacity for each of the 100 women.
Analyze, interpret and draw conclusions from data.
7. Is mean imputation of missing data acceptable practice? Why or why not?
- Bad practice in general
- If just estimating means: mean imputation preserves the mean of the observed data
- Leads to an underestimate of the standard deviation
- Distorts relationships between variables by “pulling” estimates of the correlation toward zero
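Both the preserved mean and the shrinking standard deviation are easy to demonstrate on a toy sample (the numbers below are illustrative):

```python
from statistics import mean, stdev

observed = [2.0, 4.0, 6.0]                 # the values we actually have
imputed = observed + [mean(observed)] * 2  # two "missing" values filled with the mean 4.0

print(mean(observed), mean(imputed))    # mean is preserved: 4.0 vs 4.0
print(stdev(observed), stdev(imputed))  # spread is understated: 2.0 vs ~1.41
```

Every imputed point sits exactly on the mean, so it adds nothing to the sum of squared deviations while inflating the sample size, which is why the standard deviation (and any correlation involving this variable) is pulled downward.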
8. What is an outlier? Explain how you might screen for outliers and what would you do if you found them in your dataset. Also, explain what an inlier is and how you might screen for them and what would you do if you found them in your dataset
— An observation point that is distant from other observations
— Can occur by chance in any distribution
— Often, they indicate measurement error or a heavy-tailed distribution
— Measurement error: discard them or use robust statistics
— Heavy-tailed distribution: high skewness, can’t use tools assuming a normal distribution
— Three-sigma rule (normally distributed data): 1 in 22 observations will differ from the mean by twice the standard deviation or more
— Three-sigma rule: 1 in 370 observations will differ from the mean by three times the standard deviation or more
Three-sigma rules example: in a sample of 1000 observations, the presence of up to 5 observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected, being less than twice the expected number and hence within 1 standard deviation of the expected number (Poisson distribution).
If the nature of the distribution is known a priori, it is possible to see if the number of outliers deviate significantly from what can be expected. For a given cutoff (samples fall beyond the cutoff with probability p), the number of outliers can be approximated with a Poisson distribution with lambda=pn. Example: if one takes a normal distribution with a cutoff 3 standard deviations from the mean, p=0.3% and thus we can approximate the number of samples whose deviation exceed 3 sigmas by a Poisson with lambda=3
— No rigid mathematical method
— Subjective exercise: be careful
— QQ plots (sample quantiles Vs theoretical quantiles)
— Depends on the cause
— Retention: when the underlying model is confidently known
— Regression problems: only exclude points which exhibit a large degree of influence on the estimated coefficients (Cook’s distance)
— Observation lying within the general distribution of other observed values
— Doesn’t perturb the results but is non-conforming and unusual
— Simple example: observation recorded in the wrong unit (°F instead of °C)
— Mahalanobis distance
— Used to calculate the distance between two random vectors
— Difference with Euclidean distance: accounts for correlations
— Discard them
9. How do you handle missing data? What imputation techniques do you recommend?
- If data is missing completely at random: deletion introduces no bias, but decreases the power of the analysis by decreasing the effective sample size
- Recommended: kNN imputation, Gaussian mixture imputation
10. You have data on the durations of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test, even graphically, whether your expectations are borne out?
1. Exploratory data analysis
- Histogram of durations
- Histogram of durations per service type, per day of week, per hours of day (durations can be systematically longer from 10am to 1pm for instance), per employee…
2. Distribution: lognormal?
3. Test graphically with QQ plot: sample quantiles of log(durations) Vs normal quantiles
11. Explain likely differences between administrative datasets and datasets gathered from experimental studies. What are likely problems encountered with administrative data? How do experimental methods help alleviate these problems? What problem do they bring?
— Large coverage of population
— Captures individuals who may not respond to surveys
— Regularly updated, allow consistent time-series to be built-up
— Restricted to data collected for administrative purposes (limited to administrative definitions. For instance: incomes of a married couple, not individuals, which can be more useful)
— Lack of researcher control over content
— Missing or erroneous entries
— Quality issues (addresses may not be updated or a postal code is provided only)
— Data privacy issues
— Underdeveloped theories and methods (sampling methods…)
12. You are compiling a report for user content uploaded every month and notice a spike in uploads in October. In particular, a spike in picture uploads. What might you think is the cause of this, and how would you test it?
- Halloween pictures?
- Look at uploads in countries that don’t observe Halloween as a sort of counter-factual analysis
- Compare mean uploads in October with mean uploads in September: hypothesis testing
13. You’re about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it’s raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that “Yes” it is raining. What is the probability that it’s actually raining in Seattle?
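The question gives no prior probability of rain, so any numeric answer rests on an assumption; the sketch below uses an illustrative 50% prior, which yields the commonly quoted 8/9:

```python
def p_rain_given_all_yes(prior=0.5, p_truth=2/3):
    """P(raining | all 3 friends say yes), by Bayes' rule.
    If it rains, a 'yes' is the truth; if it doesn't, a 'yes' is a lie.
    prior: assumed P(rain) -- NOT given in the question."""
    p_yes_if_rain = p_truth ** 3        # all three tell the truth
    p_yes_if_dry = (1 - p_truth) ** 3   # all three lie
    return (p_yes_if_rain * prior /
            (p_yes_if_rain * prior + p_yes_if_dry * (1 - prior)))

print(p_rain_given_all_yes())  # 8/9 with a 50% prior
```

With a more realistic Seattle-specific prior the number changes, but the structure of the calculation does not.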
14. One box has 12 black and 12 red cards; a second box has 24 black and 24 red. If you draw 2 cards at random from one of the two boxes, which box has the higher probability of giving two cards of the same color? Can you tell intuitively why the 2nd box has a higher probability?
First selection: P(black) = P(red) = 1/2 for both boxes. Then the second card must match the first: 11/23 ≈ 0.478 for box 1 and 23/47 ≈ 0.489 for box 2; compare them. The 2nd box is higher because removing one card shifts the color proportions less in the larger deck.
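The two probabilities can be verified by brute-force enumeration in Python:

```python
from itertools import combinations

def p_same_color(black, red):
    """Probability that two cards drawn without replacement share a color,
    by enumerating every unordered pair of positions in the deck."""
    deck = ["B"] * black + ["R"] * red
    pairs = list(combinations(range(len(deck)), 2))
    same = sum(deck[i] == deck[j] for i, j in pairs)
    return same / len(pairs)

print(p_same_color(12, 12))  # box 1: 11/23, about 0.478
print(p_same_color(24, 24))  # box 2: 23/47, about 0.489
```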
15. What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?
Lift:
It’s a measure of the performance of a targeting model (or a rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random-choice targeting model. Lift is simply: target response/average response.
Suppose a population has an average response rate of 5% (mailing for instance). A certain model (or rule) has identified a segment with a response rate of 20%, then lift=20/5=4
Typically, the modeler seeks to divide the population into quantiles and rank the quantiles by lift. They can then consider each quantile and, by weighing the predicted response rate against the cost, decide whether to market to that quantile.
“if we use the probability scores on customers, we can get 60% of the total responders we’d get mailing randomly by only mailing the top 30% of the scored customers”.
KPI:
— Key performance indicator
— A type of performance measurement
— Examples: 0 defects, 10/10 customer satisfaction
— Relies upon a good understanding of what is important to the organization
Marketing & Sales:
— New customers acquisition
— Customer attrition
— Revenue (turnover) generated by segments of the customer population
— Often done with a data management platform
— Mean time between failure
— Mean time to repair
Robustness:
— Statistics with good performance even if the underlying distribution is not normal
— Statistics that are not affected by outliers
— A learning algorithm that can reduce the chance of fitting noise is called robust
— Median is a robust measure of central tendency, while mean is not
— Median absolute deviation is also more robust than the standard deviation
Model fitting:
— How well a statistical model fits a set of observations
— Examples: AIC, R², Kolmogorov-Smirnov test, chi-squared, deviance (generalized linear models)
Design of experiments:
The design of any task that aims to describe or explain the variation of information under conditions that are hypothesized to reflect the variation. In its simplest form, an experiment aims at predicting the outcome by changing the preconditions, the predictors.
— Selection of the suitable predictors and outcomes
— Delivery of the experiment under statistically optimal conditions
— Blocking: an experiment may be conducted with the same equipment to avoid any unwanted variations in the input
— Replication: performing the same combination run more than once, in order to get an estimate for the amount of random error that could be part of the process
— Interaction: when an experiment has 3 or more variables, the situation in which the interaction of two variables on a third is not additive
80/20 rule:
— Pareto principle
— 80% of the effects come from 20% of the causes
— 80% of your sales come from 20% of your clients
— 80% of a company’s complaints come from 20% of its customers
16. Define: quality assurance, six sigma.
— A way of preventing mistakes or defects in manufacturing products or when delivering services to customers
— In a machine learning context: anomaly detection
— Set of techniques and tools for process improvement
— 99.99966% of products are defect-free (3.4 defects per million)
— Six standard deviations from the process mean
17. Give examples of data that does not have a Gaussian distribution, nor log-normal.
- Allocation of wealth among individuals
- Values of oil reserves among oil fields (many small ones, a small number of large ones)
18. What is root cause analysis? How to identify a cause vs. a correlation? Give examples
Root cause analysis:
— Method of problem solving used for identifying the root causes or faults of a problem
— A factor is considered a root cause if removal of it prevents the final undesirable event from recurring
Identify a cause vs. a correlation:
— Correlation: statistical measure that describes the size and direction of a relationship between two or more variables. A correlation between two variables doesn’t imply that the change in one variable is the cause of the change in the values of the other variable
— Causation: indicates that one event is the result of the occurrence of the other event; there is a causal relationship between the two events
— The difference between the two types of relationship is easy to state, but establishing an actual cause and effect is difficult
Example: sleeping with one’s shoes on is strongly correlated with waking up with a headache. Correlation-implies-causation fallacy: therefore, sleeping with one’s shoes on causes headaches.
More plausible explanation: both are caused by a third factor: going to bed drunk.
Identify a cause Vs a correlation: use of a controlled study
— In medical research, one group may receive a placebo (control) while the other receives a treatment. If the two groups have noticeably different outcomes, the different experiences may have caused the different outcomes
19. Give an example where the median is a better measure than the mean
When the data is skewed or contains outliers: for instance incomes, where a few very large values pull the mean upward while the median still reflects a typical value
20. Given two fair dice, what is the probability of getting scores that sum to 4? To 8?
- Total: 36 combinations
- Of these, 3 involve a score of 4: (1,3), (3,1), (2,2)
- So: 3/36=1/12
- Considering a score of 8: (2,6), (3,5), (4,4), (6,2), (5,3)
- So: 5/36
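The enumeration is easy to check in Python:

```python
from itertools import product

# All 36 ordered outcomes of rolling two fair dice
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
print(sums.count(4), "/", len(sums))  # 3 / 36
print(sums.count(8), "/", len(sums))  # 5 / 36
```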
21. What is the Law of Large Numbers?
- A theorem that describes the result of performing the same experiment a large number of times
- Forms the basis of frequency-style thinking
- It says that the sample mean, the sample variance and the sample standard deviation converge to what they are trying to estimate
- Example: roll a die; the expected value is 3.5. Over a large number of rolls, the average converges to 3.5
22. How do you calculate needed sample size?
Estimate a population mean:
— n = (z·s/ME)², where:
— ME is the desired margin of error
— z is the z score (or t score) corresponding to the desired confidence level
— s is the (estimated) standard deviation
Example: we would like to start a study to estimate the average internet usage of households in one week for our business plan. How many households must we randomly select to be 95% sure that the sample mean is within 1 minute of the true population mean? A previous survey of household usage has shown a standard deviation of 6.95 minutes.
Z score corresponding to a 95% interval: 1.96 (97.5th percentile, α/2=0.025), so n = (1.96 × 6.95/1)² ≈ 186 households
Estimate a population proportion: n = z²·p(1−p)/ME²
Example: a professor at Harvard wants to determine the proportion of students who support gay marriage. She asks “how large a sample do I need?”
She wants a margin of error of less than 2.5% and has found a previous survey indicating a proportion of 30%, so n = 1.96² × 0.3 × 0.7/0.025² ≈ 1291 students
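Both sample-size formulas can be wrapped in small helpers (the function names are mine, not a standard API):

```python
import math

def n_for_mean(z, s, me):
    """Sample size to estimate a mean within margin of error `me`,
    given z quantile and estimated standard deviation s."""
    return math.ceil((z * s / me) ** 2)

def n_for_proportion(z, p, me):
    """Sample size to estimate a proportion within margin of error `me`,
    given z quantile and a prior estimate p of the proportion."""
    return math.ceil(z ** 2 * p * (1 - p) / me ** 2)

print(n_for_mean(1.96, 6.95, 1))            # 186 -- internet-usage example
print(n_for_proportion(1.96, 0.30, 0.025))  # 1291 -- gay-marriage survey example
```

Note that `math.ceil` always rounds up: a fractional household can't be sampled, and rounding down would miss the target margin of error.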
23. When you sample, what bias are you inflicting?
Selection bias:
— An online survey about computer use is likely to attract people more interested in technology than is typical
Undercoverage bias:
— Sampling too few observations from a segment of the population
Survivorship bias:
— Observations at the end of the study are a non-random set of those present at the beginning of the investigation
— In finance and economics: the tendency for failed companies to be excluded from performance studies because they no longer exist
24. How do you control for biases?
- Choose a representative sample, preferably by a random method
- Choose an adequate size of sample
- Identify all confounding factors if possible
- Identify sources of bias and include them as additional predictors in statistical analyses
- Use randomization: by randomly recruiting or assigning subjects in a study, all our experimental groups have an equal chance of being influenced by the same bias
— Randomization: in randomized control trials, research participants are assigned by chance, rather than by choice to either the experimental group or the control group.
— Random sampling: obtaining data that is representative of the population of interest
25. What are confounding variables?
- Extraneous variable in a statistical model that correlates directly or inversely with both the dependent and the independent variable
- A spurious relationship is a perceived relationship between an independent variable and a dependent variable that has been estimated incorrectly
- The estimate fails to account for the confounding factor
- See Question 18 about root cause analysis
26. What is A/B testing?
- Two-sample hypothesis testing
- Randomized experiments with two variants: A and B
- A: control; B: variation
- User-experience design: identify changes to web pages that increase clicks on a banner
- Current website: control; null hypothesis
- New version: variation; alternative hypothesis
27. An HIV test has a sensitivity of 99.7% and a specificity of 98.5%. A subject from a population of prevalence 0.1% receives a positive test result. What is the precision of the test (i.e. the probability that the subject is HIV positive)?
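The answer follows from Bayes' rule; a short Python check (the `precision` helper is just for illustration):

```python
def precision(sens, spec, prev):
    """Positive predictive value via Bayes' rule."""
    tp = sens * prev               # P(positive test AND infected)
    fp = (1 - spec) * (1 - prev)   # P(positive test AND not infected)
    return tp / (tp + fp)

print(round(precision(0.997, 0.985, 0.001), 4))  # ~0.0624
```

Despite the excellent sensitivity and specificity, only about 6% of positives are true positives, because the 0.1% prevalence means false positives from the huge uninfected group dominate.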
28. Infection rates at a hospital above 1 infection per 100 person-days at risk are considered high. A hospital had 10 infections over the last 1787 person-days at risk. Give the p-value of the correct one-sided test of whether the hospital is below the standard.
One-sided test; assume the infection count follows a Poisson distribution
H0: lambda=0.01 per person-day; H1: lambda<0.01 (the hospital is below the standard). Under H0 the expected count over 1787 person-days is 17.87, and the p-value is P(X ≤ 10)
##  0.03237153
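The reported p-value (the `##` line looks like R output, presumably from `ppois`) can be double-checked in Python by summing the Poisson pmf directly:

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), via the direct sum of the pmf."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i)
               for i in range(k + 1))

# H0: rate = 0.01 per person-day, so the expected count is 0.01 * 1787
p_value = poisson_cdf(10, 0.01 * 1787)
print(p_value)  # ~0.0324, matching the reported value
```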
29. You roll a biased coin (p(head)=0.8) five times. What’s the probability of getting three or more heads?
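The question gives no worked answer; a direct binomial computation in Python:

```python
from math import comb

# P(3 or more heads in 5 flips of a coin with P(head) = 0.8)
p = sum(comb(5, k) * 0.8 ** k * 0.2 ** (5 - k) for k in range(3, 6))
print(round(p, 5))  # 0.94208
```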
30. A random variable X is normal with mean 1020 and standard deviation 50. Calculate P(X>1200)
X ~ N(1020, 50). Our new quantile: z = (1200 − 1020)/50 = 3.6
##  0.0001591086
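The tail probability can be checked with the complementary error function from the Python standard library:

```python
import math

def normal_sf(z):
    """P(Z > z) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

print(normal_sf(3.6))  # ~0.000159, matching the reported value
```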
31. Consider the number of people that show up at a bus station is Poisson with mean 2.5/h. What is the probability that at most three people show up in a four hour period?
- λ = 2.5 per hour × 4 hours = 10; compute P(X ≤ 3)
##  0.01033605
32. You are running for office and your pollster polled hundred people. 56 of them claimed they will vote for you. Can you relax?
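No, you can't relax: a quick normal-approximation confidence interval for the proportion shows the interval contains 50%:

```python
import math

n, votes = 100, 56
p_hat = votes / n
se = math.sqrt(p_hat * (1 - p_hat) / n)       # standard error of the proportion
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(round(low, 3), round(high, 3))  # ~0.463 to ~0.657 -- contains 0.5
```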
33. Geiger counter records 100 radioactive decays in 5 minutes. Find an approximate 95% interval for the number of decays per hour.
- Start by finding a 95% interval for radioactive decay in a 5 minutes period
- The estimated standard deviation is √100 = 10
- So the interval is λ̂ ± 1.96 × 10 = 100 ± 19.6 = [80.4, 119.6] decays per 5 minutes
- Per hour, multiply both endpoints by 12: [964.8, 1435.2]
34. The homicide rate in Scotland fell last year to 99 from 115 the year before. Is this reported change really noteworthy?
- Consider the homicides as independent; a Poisson distribution can be a reasonable model
- The 95% interval for the true homicide rate is 115 ± 2×√115 ≈ 115 ± 21.4 = [93.6, 136.4], which contains 99
- It’s not reasonable to conclude that there has been a reduction in the true rate
35. Consider influenza epidemics for two parent heterosexual families. Suppose that the probability is 17% that at least one of the parents has contracted the disease. The probability that the father has contracted influenza is 12% while the probability that both the mother and father have contracted the disease is 6%. What is the probability that the mother has contracted influenza?
- P(“Mother or Father”) = P(“Mother”) + P(“Father”) − P(“Mother and Father”)
- Hence: P(“Mother”) = 0.17 + 0.06 − 0.12 = 0.11
36. Suppose that diastolic blood pressures (DBPs) for men aged 35-44 are normally distributed with a mean of 80 (mm Hg) and a standard deviation of 10. About what is the probability that a random 35-44 year old has a DBP less than 70?
- 70 is one standard deviation below the mean; about 68% of values lie within one standard deviation, so (100 − 68)/2 = 32/2 = 16% lie below 70, i.e. probability ≈ 16%
37. In a population of interest, a sample of 9 men yielded a sample average brain volume of 1,100cc and a standard deviation of 30cc. What is a 95% Student’s T confidence interval for the mean brain volume in this new population?
- Standard error of the mean: 30/√9=10
- Relevant t quantile: 97.5th percentile with 8 degrees of freedom, t ≈ 2.306; interval: 1100 ± 2.306 × 30/√9
##  1076.94 1123.06
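The interval can be reproduced in Python; 2.306 is the standard table value for the 97.5th percentile of the t distribution with 8 degrees of freedom:

```python
import math

xbar, s, n = 1100, 30, 9
t_975_8 = 2.306                # 97.5th percentile of t with n-1 = 8 df
se = s / math.sqrt(n)          # standard error of the mean: 30/3 = 10
low, high = xbar - t_975_8 * se, xbar + t_975_8 * se
print(round(low, 2), round(high, 2))  # ~1076.94 1123.06
```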
38. A diet pill is given to 9 subjects over six weeks. The average difference in weight (follow up — baseline) is -2 pounds. What would the standard deviation of the difference in weight have to be for the upper endpoint of the 95% T confidence interval to touch 0?
- The upper endpoint of the interval is −2 + t × s/√9 with t = t(0.975, df=8) ≈ 2.306; setting it to 0 gives s = 2 × 3/2.306
##  2.601903
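Solving the CI endpoint equation for the standard deviation (again using the table value t ≈ 2.306 for 8 df):

```python
import math

t_975_8 = 2.306       # 97.5th percentile of t with 8 df
n, diff = 9, -2
# Upper endpoint of the 95% CI touches 0:  diff + t * s / sqrt(n) = 0
s = -diff * math.sqrt(n) / t_975_8
print(round(s, 3))  # ~2.602
```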
39. In a study of emergency room waiting times, investigators consider a new and the standard triage systems. To test the systems, administrators selected 20 nights and randomly assigned the new triage system to be used on 10 nights and the standard system on the remaining 10 nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 3 hours with a variance of 0.60 while the average MWT for the old system was 5 hours with a variance of 0.68. Consider the 95% confidence interval estimate for the differences of the mean MWT associated with the new system. Assume a constant variance. What is the interval? Subtract in this order (New System — Old System).
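A sketch of the pooled-variance calculation, using the stated equal-variance assumption; 2.101 is the table value for the 97.5th percentile of t with 18 degrees of freedom:

```python
import math

n1 = n2 = 10
mean_new, var_new = 3, 0.60
mean_old, var_old = 5, 0.68
# Pooled variance: weighted average of the two sample variances
sp2 = ((n1 - 1) * var_new + (n2 - 1) * var_old) / (n1 + n2 - 2)
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
t_975_18 = 2.101              # 97.5th percentile of t with 18 df
diff = mean_new - mean_old    # new minus old, as the question asks
low, high = diff - t_975_18 * se, diff + t_975_18 * se
print(round(low, 2), round(high, 2))  # roughly -2.75 to -1.25
```

The whole interval is negative, consistent with the new system reducing the median waiting time.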
40. To further test the hospital triage system, administrators selected 200 nights and randomly assigned a new triage system to be used on 100 nights and a standard system on the remaining 100 nights. They calculated the nightly median waiting time (MWT) to see a physician. The average MWT for the new system was 4 hours with a standard deviation of 0.5 hours while the average MWT for the old system was 6 hours with a standard deviation of 2 hours. Consider the hypothesis of a decrease in the mean MWT associated with the new treatment. What does the 95% independent group confidence interval with unequal variances suggest vis a vis this hypothesis? (Because there’s so many observations per group, just use the Z quantile instead of the T.)
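With 100 observations per group, the Z-based unequal-variance interval can be sketched as follows; since it lies entirely below zero, it supports a decrease in mean MWT with the new system:

```python
import math

n1 = n2 = 100
mean_new, sd_new = 4, 0.5
mean_old, sd_old = 6, 2
# Unequal-variance (unpooled) standard error of the difference in means
se = math.sqrt(sd_new ** 2 / n1 + sd_old ** 2 / n2)
diff = mean_new - mean_old
low, high = diff - 1.96 * se, diff + 1.96 * se
print(round(low, 2), round(high, 2))  # roughly -2.4 to -1.6, entirely below 0
```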