Questions & Answers: Process & Miscellaneous

1. How to optimize algorithms? (parallel processing and/or faster algorithms). Provide examples for both

“Premature optimization is the root of all evil”; Donald Knuth

Parallel processing: for instance in R with a single machine.
— doParallel and foreach packages
— doParallel: parallel backend, registers a chosen number of the machine’s cores
— foreach: distributes the loop iterations across those cores
— using Hadoop on a single node
— using Hadoop on multi-node

Faster algorithm:
— In computer science: Pareto principle; 90% of the execution time is spent executing 10% of the code
— Data structure: affect performance
— Caching: avoid unnecessary work
— Improve source code level
For instance: on early C compilers, while(1) was slower than for(;;) for an unconditional loop, because while(1) evaluated the condition and then performed a conditional jump that tested whether it was true, whereas for(;;) compiled to an unconditional jump.
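
The caching point above can be illustrated with a minimal Python sketch. The Fibonacci functions and the `calls` counter are hypothetical names chosen for the example; `functools.lru_cache` is the standard-library memoization decorator.

```python
from functools import lru_cache

# Counts how many times each version is actually executed.
calls = {"plain": 0, "cached": 0}

def fib_plain(n):
    """Naive recursion: recomputes the same sub-results again and again."""
    calls["plain"] += 1
    return n if n < 2 else fib_plain(n - 1) + fib_plain(n - 2)

@lru_cache(maxsize=None)
def fib_cached(n):
    """Same recursion, but each distinct argument is computed only once."""
    calls["cached"] += 1
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)

plain_result = fib_plain(20)
cached_result = fib_cached(20)
print(plain_result, cached_result, calls)
```

Both versions return the same value, but the cached one runs its body once per distinct argument (21 times here) instead of thousands of times.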

2. Examples of NoSQL architecture

Key-value: all of the data consists of an indexed key and a value. Examples: DynamoDB, Cassandra (often also classed as a wide-column store)
Column-based: designed for storing data tables as sections of columns rather than as rows. Examples: HBase, SAP HANA
Document database: maps a key to a document that contains structured information. The key is used to retrieve the document. Examples: MongoDB, CouchDB
Graph database: designed for data whose relations are well represented as a graph, with interconnected elements and an undetermined number of relations between them. Examples: Polyglot, Neo4j

3. Provide examples of machine-to-machine communications

Telemedicine
— Heart patients wear specialized monitors that gather information about the state of the heart
— The collected data is sent to an implanted electronic device, which delivers electric shocks to the patient to correct abnormal rhythms

Product restocking
— Vending machines are capable of messaging the distributor whenever an item is running out of stock

4. Compare R and Python

R
— Focuses on better, user friendly data analysis, statistics and graphical models
— The closer you are to statistics, data science and research, the more you might prefer R
— Statistical models can be written with only a few lines in R
— The same piece of functionality can be written in several ways in R
— Mainly used for standalone computing or analysis on individual servers
— Large number of packages, for anything!

Python
— Used by programmers that want to delve into data science
— The closer you are working in an engineering environment, the more you might prefer Python
— Coding and debugging are easier, mainly because of the clean syntax
— There is usually one obvious, idiomatic way to write a given piece of functionality in Python
— When data analysis needs to be implemented with web apps
— Good tool to implement algorithms for production use

5. Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?

Hash tables:
— Average case O(1) lookup time
— Lookup time doesn’t depend on size

Even in terms of memory:
— O(n) memory
— Space scales linearly with number of elements
— Lots of dictionaries won’t take up significantly less space than a larger one
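
A minimal sketch of the comparison, assuming the 100 small tables are selected by hashing the key (the names `big_table`, `small_tables` and `N_SHARDS` are made up for the example). Both lookups are average-case O(1); the sharded version only adds one extra hash-and-index step.

```python
N_SHARDS = 100  # assumed number of small tables

items = {f"key{i}": i for i in range(10_000)}

# One big hash table.
big_table = dict(items)

# 100 small hash tables; the key's hash decides which table holds it.
# String hashes are randomized between Python runs but stable within a run.
small_tables = [{} for _ in range(N_SHARDS)]
for key, value in items.items():
    small_tables[hash(key) % N_SHARDS][key] = value

def lookup_big(key):
    return big_table[key]

def lookup_sharded(key):
    # Extra step: pick the shard first, then do the usual O(1) lookup.
    return small_tables[hash(key) % N_SHARDS][key]

print(lookup_big("key42"), lookup_sharded("key42"))
```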

In-database analytics:
— Integration of data analytics in data warehousing functionality
— Much faster, and corporate information is more secure because it doesn’t leave the enterprise data warehouse

Good for real-time analytics: fraud detection, credit scoring, transaction processing, pricing and margin analysis, behavioral ad targeting and recommendation engines

6. What is a star schema? What are lookup tables?

The star schema is a traditional database schema with a central fact table: the “observations”, with database keys for joining with satellite tables and with several fields encoded as IDs. Satellite tables map IDs to physical names or descriptions and can be joined to the central fact table using the ID fields. These tables are known as lookup tables and are particularly useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve multiple layers of summarization (summary tables, from granular to less granular) to retrieve information faster.
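
A small sketch of a star schema using Python’s built-in sqlite3 module. The table names, column names and rows are invented for illustration: one fact table holds the observations and IDs, and two satellite tables map the IDs to names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Satellite (lookup) tables: map IDs to human-readable names.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    -- Central fact table: one row per observation, fields encoded as IDs.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'coffee'), (2, 'tea');
    INSERT INTO dim_store   VALUES (10, 'Paris'), (20, 'Tokyo');
    INSERT INTO fact_sales  VALUES (1, 10, 3.5), (1, 20, 4.0), (2, 10, 2.5);
""")

# Join the fact table to its satellites through the ID fields.
rows = conn.execute("""
    SELECT p.product_name, s.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_store   AS s ON s.store_id   = f.store_id
    GROUP BY p.product_name, s.city
    ORDER BY p.product_name, s.city
""").fetchall()
print(rows)
```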

Lookup tables:
— An array that replaces a runtime computation with a simpler array indexing operation
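
As an illustration of this second meaning, here is a minimal sketch of a precomputed lookup table. Counting bits is just an example chosen here; `BIT_COUNTS` and `popcount32` are hypothetical names.

```python
# Precompute the number of set bits for every possible byte value (0..255).
BIT_COUNTS = [bin(byte).count("1") for byte in range(256)]

def popcount32(value):
    """Count set bits in a 32-bit unsigned integer with four table lookups."""
    return (BIT_COUNTS[value & 0xFF]
            + BIT_COUNTS[(value >> 8) & 0xFF]
            + BIT_COUNTS[(value >> 16) & 0xFF]
            + BIT_COUNTS[(value >> 24) & 0xFF])

print(popcount32(0b1011))  # three bits are set
```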

7. What is the life cycle of a data science project?

Data acquisition

Acquiring data from both internal and external sources, including social media or web scraping. In a steady state, data extraction routines should be in place, and new sources, once identified, would be acquired following the established processes

Data preparation

Also called data wrangling: cleaning the data and shaping it into a suitable form for later analyses. Involves exploratory data analysis and feature extraction.

Hypothesis & modelling

As in data mining, but using all the data rather than samples: machine learning techniques are applied to the full data set. A key sub-step is model selection. This involves preparing a training set for the candidate models, plus validation and test sets for comparing model performance, selecting the best-performing model, gauging model accuracy and preventing overfitting
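
The training/validation/test split mentioned above can be sketched in plain Python. The 60/20/20 proportions, the fixed seed and the function name `split_dataset` are assumptions made for the example, not a recommendation from the source.

```python
import random

def split_dataset(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the rows once, then cut them into train / validation / test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test

train, valid, test = split_dataset(range(100))
print(len(train), len(valid), len(test))
```

Candidate models are fitted on the training set, compared on the validation set, and the selected model is scored once on the test set.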

Evaluation & interpretation

Steps 2 to 4 (data preparation through evaluation) are repeated as many times as needed: as the understanding of the data and the business becomes clearer and the results of initial models and hypotheses are evaluated, further tweaks are made. These iterations may sometimes include step 5 (deployment) and be performed in a pre-production environment.

Deployment

Operations

Regular maintenance and operations. Includes performance tests that measure model performance and can raise an alert when performance degrades past an acceptable threshold

Optimization

Can be triggered by failing performance, by the need to add new data sources and retrain the model, or by the deployment of a new version of an improved model

Note: with increasing maturity and well-defined project goals, pre-defined performance criteria can help evaluate the feasibility of the data science project early in the life cycle. This early comparison helps the team refine hypotheses, change approaches, or discard the project if it is not viable.

8. How to efficiently scrape web data, or collect tons of tweets?

Python example
Requesting and fetching the webpage into the code: httplib2 module
Parsing the content and getting the necessary info: BeautifulSoup from bs4 package
Twitter API: the Python wrapper for performing API requests. It handles all the OAuth and API queries in a single Python interface
MongoDB as the database
PyMongo: the Python wrapper for interacting with the MongoDB database
Cron jobs: a time-based scheduler that runs scripts at specific intervals; spacing out the requests helps stay within the API rate limit and avoid the “rate limit exceeded” error
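
A minimal sketch of the parsing step. To stay runnable without extra installs, it uses the standard-library html.parser as a stand-in for BeautifulSoup and parses an in-memory string instead of fetching a page with httplib2. `LinkCollector` and `SAMPLE_PAGE` are hypothetical names.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Stand-in for a fetched page body.
SAMPLE_PAGE = (
    '<html><body><a href="/page1">One</a>'
    '<p>text</p><a href="/page2">Two</a></body></html>'
)

collector = LinkCollector()
collector.feed(SAMPLE_PAGE)
print(collector.links)
```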

9. How to clean data?

1. First: detect anomalies and contradictions

Common issues:

Tidy data violations (Hadley Wickham paper):
— column headers are values, not variable names, e.g. <15-25, >26-45…
— multiple variables are stored in one column, e.g. m1534 (males aged 15-34)
— variables are stored in both rows and columns, e.g. tmax, tmin in the same column
— multiple types of observational units are stored in the same table, e.g. song dataset and rank dataset in the same table
— a single observational unit is stored in multiple tables (can be combined)
Data-Type constraints: values in a particular column must be of a particular type: integer, numeric, factor, boolean
Range constraints: number or dates fall within a certain range. They have minimum/maximum permissible values
Mandatory constraints: certain columns can’t be empty
Unique constraints: a field must be unique across a dataset: each person must have a unique Social Security number
Set-membership constraints: the values for a column must come from a set of discrete values or codes: a gender must be female, male
Regular expression patterns: for example, phone number may be required to have the pattern: (999)999-9999
Misspellings
Missing values
Outliers
Cross-field validation: certain conditions that involve multiple fields must hold. For instance, in laboratory medicine, the percentages of the different white blood cell types must sum to 100. In a hospital database, a patient’s date of discharge can’t be earlier than the admission date
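
Several of these checks can be sketched as a single validation function. The record layout, field names and the 0-120 age range are assumptions made for the example.

```python
import re
from datetime import date

PHONE_PATTERN = re.compile(r"^\(\d{3}\)\d{3}-\d{4}$")  # (999)999-9999
ALLOWED_GENDERS = {"female", "male"}

def find_violations(record):
    """Return a list of human-readable constraint violations for one record."""
    problems = []
    if not record.get("name"):
        problems.append("mandatory: name is empty")
    if not isinstance(record.get("age"), int):
        problems.append("data type: age is not an integer")
    elif not 0 <= record["age"] <= 120:
        problems.append("range: age outside 0-120")
    if record.get("gender") not in ALLOWED_GENDERS:
        problems.append("set membership: unknown gender code")
    if not PHONE_PATTERN.match(record.get("phone", "")):
        problems.append("pattern: phone is not (999)999-9999")
    if record["discharged"] < record["admitted"]:
        problems.append("cross-field: discharge before admission")
    return problems

good = {"name": "Ann", "age": 34, "gender": "female", "phone": "(555)123-4567",
        "admitted": date(2020, 5, 1), "discharged": date(2020, 5, 3)}
bad = {"name": "", "age": 140, "gender": "x", "phone": "555-1234",
       "admitted": date(2020, 5, 2), "discharged": date(2020, 5, 1)}

print(find_violations(good))
print(find_violations(bad))
```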

2. Clean the data using:

Regular expressions: misspellings, regular expression patterns
KNN-impute and other missing values imputing methods
Coercing: data-type constraints
Melting: tidy data issues
Date/time parsing
Removing observations
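
A few of these cleaning steps, sketched in plain Python. The helper names, the 10-digit phone assumption and the two accepted date formats are illustrative choices; values that cannot be repaired are returned as None so they can be imputed or removed later.

```python
import re
from datetime import datetime

def clean_phone(raw):
    """Regular expression: keep digits only, then re-emit as (999)999-9999."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        return None
    return f"({digits[:3]}){digits[3:6]}-{digits[6:]}"

def coerce_int(raw):
    """Coercing: turn a raw cell into an integer, or None if it is not one."""
    try:
        return int(str(raw).strip())
    except ValueError:
        return None

def parse_date(raw, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Date parsing: try each accepted format in turn."""
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None

print(clean_phone("555.123.4567"), coerce_int(" 42 "), parse_date("31/12/2020"))
```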

10. How frequently must an algorithm be updated?

You want to update an algorithm when:
— You want the model to evolve as data streams through infrastructure
— The underlying data source is changing
— Example: a retail store model that remains accurate as the business grows
— Dealing with non-stationarity

Some options:
— Incremental algorithms: the model is updated every time it sees a new training example
Note: simple, and you always have an up-to-date model, but you can’t weight different pieces of data differently.
Sometimes mandatory: when data must be discarded once seen (privacy)
— Periodic re-training in “batch” mode: simply buffer the relevant data and update the model every-so-often
Note: more decisions and more complex implementations
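
The two options can be contrasted on the simplest possible “model”, a mean. This is a toy sketch: `RunningMean` is a hypothetical class, and a real model would update its parameters rather than a single number.

```python
class RunningMean:
    """Incremental update: each example adjusts the estimate, then can be discarded."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

stream = [2.0, 4.0, 6.0, 8.0]

# Incremental: the model is current after every example.
model = RunningMean()
for value in stream:
    model.update(value)

# Batch: buffer the data and recompute from scratch every so often.
batch_mean = sum(stream) / len(stream)

print(model.mean, batch_mean)
```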

How frequently?
— Is the sacrifice worth it?
— Data horizon: how quickly do you need the most recent training example to be part of your model?
— Data obsolescence: how long does it take before data is irrelevant to the model? Are some older instances more relevant than the newer ones?

Economics: generally, newer instances are more relevant than older ones. However, because of seasonality, data from the same month or quarter of the previous year can be more relevant than data from other periods of the current year. In a recession, data from previous recessions can be more relevant than newer data from a different phase of the economic cycle.

11. What is POC (proof of concept)?

A realization of a certain method to demonstrate its feasibility
In engineering: a rough prototype of a new idea is often constructed as a proof of concept

12. Explain Tufte’s concept of “chart junk”

All visual elements in charts and graphs that are not necessary to comprehend the information represented, or that distract the viewer from this information

Examples of unnecessary elements include:
— Unnecessary text
— Heavy or dark grid lines
— Ornamented chart axes
— Pictures
— Background
— Unnecessary dimensions
— Elements depicted out of scale to one another
— 3-D simulations in line or bar charts

13. How would you come up with a solution to identify plagiarism?

Vector space model approach
Represent documents (the suspect and original ones) as vectors of terms
Terms: n-grams, with n from 1 up to as large as is practical (longer n-grams detect copied passages)
Measure the similarity between both documents
Similarity measures: cosine similarity, Jaro-Winkler, Jaccard
Declare plagiarism when the similarity exceeds a chosen threshold
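
The approach can be sketched with word n-grams and two of the similarity measures. The trigram size, the 0.5 threshold and the function names are assumptions for illustration; a real threshold would be tuned on known cases.

```python
from collections import Counter
from math import sqrt

def ngrams(text, n):
    """Word n-grams of a text, lower-cased."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def jaccard_similarity(a, b):
    """Shared terms divided by all terms, ignoring counts."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0

def looks_plagiarised(suspect, original, n=3, threshold=0.5):
    a, b = Counter(ngrams(suspect, n)), Counter(ngrams(original, n))
    return cosine_similarity(a, b) >= threshold

original = "the quick brown fox jumps over the lazy dog"
copied = "The quick brown fox jumps over the lazy dog"
unrelated = "an entirely different sentence about statistics and data"

print(looks_plagiarised(copied, original), looks_plagiarised(unrelated, original))
```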

14. How to detect individual paid accounts shared by multiple users?

Check geographical region: a login from Paris on Friday morning and a login from Tokyo on Friday evening
Bandwidth consumption: if a user goes over some high limit
Counter of live sessions: 100 sessions per day (about 4 per hour, around the clock) is more than one person is likely to generate
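
The geographical check can be sketched as an “impossible travel” test: compute the great-circle distance between two logins and flag the pair if the implied speed is too high. The 900 km/h limit and the login tuple layout (latitude, longitude, hour) are assumptions for the example.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 900.0  # assumed limit, roughly airliner cruising speed

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def is_impossible_travel(login_a, login_b):
    """Each login is (latitude, longitude, hour); flag implausibly fast moves."""
    (lat1, lon1, hour1), (lat2, lon2, hour2) = login_a, login_b
    hours = abs(hour2 - hour1)
    distance = haversine_km(lat1, lon1, lat2, lon2)
    if hours == 0:
        return distance > 0
    return distance / hours > MAX_SPEED_KMH

paris_morning = (48.8566, 2.3522, 9.0)
tokyo_evening = (35.6762, 139.6503, 17.0)
paris_evening = (48.8566, 2.3522, 17.0)

print(is_impossible_travel(paris_morning, tokyo_evening))
print(is_impossible_travel(paris_morning, paris_evening))
```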

15. Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy? Depends on the context?

“Premature optimization is the root of all evil”
At the beginning: quick-and-dirty model is better
Optimization later

Other answer:
— Depends on the context
— Is error acceptable? Fraud detection, quality assurance

16. What is your definition of big data?

Big data is high volume, high velocity and/or high variety information assets that require new forms of processing
— Volume: big data doesn’t sample, just observes and tracks what happens
— Velocity: big data is often available in real-time
— Variety: big data comes from texts, images, audio, video…

Difference big data/business intelligence:

— Business intelligence uses descriptive statistics with data with high density information to measure things, detect trends etc.
— Big data uses inductive statistics (statistical inference) and concepts from non-linear system identification to infer laws (regression, classification, clustering) from large data sets with low density information to reveal relationships and dependencies or to perform prediction of outcomes or behaviors

17. Explain the difference between “long” and “wide” format data. Why would you use one or the other?

Long: one column containing the values and another column listing the context of the value
Fam_id year fam_inc
Wide: each different variable in a separate column
Fam_id fam_inc96 fam_inc97 fam_inc98

Long Vs Wide:
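— Which format makes data manipulations (summarize, filter) easier depends on the tool: tidy-data tools in R generally expect the long format, while row-wise comparisons across years are simpler in the wide format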
— Program requirements
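
Converting between the two formats, sketched in plain Python with the column names from the example above. The income figures are made up, and `wide_to_long` / `long_to_wide` are hypothetical helpers.

```python
# Wide format: one row per family, one column per year (made-up values).
wide_rows = [
    {"fam_id": 1, "fam_inc96": 40000, "fam_inc97": 40500, "fam_inc98": 41000},
    {"fam_id": 2, "fam_inc96": 45000, "fam_inc97": 45400, "fam_inc98": 45800},
]

def wide_to_long(rows, id_col, prefix):
    """Turn every prefixed column into its own (id, year, value) row."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col.startswith(prefix):
                year = int(col[len(prefix):])
                long_rows.append({id_col: row[id_col], "year": year, prefix: value})
    return long_rows

def long_to_wide(rows, id_col, key_col, value_col):
    """Collect the rows of each id back into a single wide row."""
    by_id = {}
    for row in rows:
        wide = by_id.setdefault(row[id_col], {id_col: row[id_col]})
        wide[f"{value_col}{row[key_col]}"] = row[value_col]
    return list(by_id.values())

long_rows = wide_to_long(wide_rows, "fam_id", "fam_inc")
print(long_rows[0])
```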

18. Do you know a few “rules of thumb” used in statistics or computer science? Or in business analytics?

Pareto rule:
— 80% of the effects come from 20% of the causes
— 80% of the sales come from 20% of the customers

Computer science: “simple and inexpensive beats complicated and expensive” — Rod Elder

Finance, rule of 72:
— Estimate the time needed for a money investment to double
— $100 invested at 9% per year: 72/9 = 8 years to double
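
A quick check of the rule against the exact doubling time under annual compounding, ln(2) / ln(1 + r). The function names are made up for the example.

```python
from math import log

def doubling_time_rule_of_72(rate_percent):
    """Rule-of-thumb estimate, in years."""
    return 72 / rate_percent

def doubling_time_exact(rate_percent):
    """Exact doubling time with annual compounding, in years."""
    return log(2) / log(1 + rate_percent / 100)

print(doubling_time_rule_of_72(9), round(doubling_time_exact(9), 2))
```

At 9% the exact figure is about 8.04 years, so the estimate of 8 is very close.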

Rule of three (Economics):
— There are always three major competitors in a free market within one industry

19. Name a few famous APIs (for instance Google Search)

Google API (Google Analytics, Picasa), Twitter API (interact with Twitter functions), GitHub API, LinkedIn API (users data)…

20. Give examples of bad and good visualizations

Bad visualization:
— Pie charts: difficult to make comparisons between items when area is used, especially when there are lots of items
— Color choice for classes: abundant use of red, orange and blue. Readers can think that the colors could mean good (blue) versus bad (orange and red) whereas these are just associated with a specific segment
— 3D charts: can distort perception and therefore skew data
— Using dashed or dotted lines in a line chart: they can be distracting; solid lines are easier to read

Good visualization:
— Heat map with a single color: some colors stand out more than others, giving more weight to that data. A single color with varying shades shows the intensity better
— Adding a trend line (regression line) to a scatter plot helps the reader see trends
