1. How to optimize algorithms? (parallel processing and/or faster algorithms). Provide examples for both
“Premature optimization is the root of all evil” — Donald Knuth
Parallel processing: for instance in R with a single machine.
— the doParallel and foreach packages
— doParallel: registers a parallel backend and selects the number of cores to use
— foreach: assigns a task to each core
— using Hadoop on a single node
— using Hadoop on multi-node
— In computer science, the Pareto principle applies: 90% of the execution time is spent executing 10% of the code
— Data structures: the choice of data structure affects performance
— Caching: avoids unnecessary recomputation
— Source-code-level improvements
For instance, on early C compilers, while(something) was slower than for(;;): while evaluated “something” and then made a conditional jump to test whether it was true, whereas for used an unconditional jump.
2. Examples of NoSQL architecture
- Key-value: every record consists of an indexed key and a value. Examples: Cassandra, DynamoDB
- Column-based: designed to store data tables as sections of columns rather than as rows. Examples: HBase, SAP HANA
- Document database: maps a key to a document containing structured information; the key is used to retrieve the document. Examples: MongoDB, CouchDB
- Graph database: designed for data whose relations are well represented as a graph, with interconnected elements and an undetermined number of relations between them. Example: Neo4j
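The access patterns differ; a toy sketch with plain Python dicts (not the real Cassandra or MongoDB client APIs) shows the contrast between an opaque value and a queryable document:

```python
import json

# Key-value style: the value is an opaque blob retrieved by key
kv_store = {"user:42": json.dumps({"name": "Ada", "plan": "pro"})}
blob = kv_store["user:42"]            # the store knows nothing about the contents

# Document style: the key retrieves a structured, queryable document
doc_store = {"user:42": {"name": "Ada", "plan": "pro", "tags": ["admin"]}}
name = doc_store["user:42"]["name"]   # fields inside the document are addressable
print(name)  # Ada
```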
3. Provide examples of machine-to-machine communications
— Heart patients wear a specialized monitor which gathers information about the state of their heart
— The collected data is sent to an implanted electronic device which delivers electric shocks to the patient to correct abnormal rhythms
— Vending machines can message the distributor whenever an item is running out of stock
4. Compare R and Python
R:
— Focuses on user-friendly data analysis, statistics and graphical models
— The closer you are to statistics, data science and research, the more you might prefer R
— Statistical models can be written in only a few lines
— The same piece of functionality can be written in several ways
— Mainly used for standalone computing or analysis on individual servers
— Large number of packages, for anything!
Python:
— Used by programmers who want to delve into data science
— The closer you work to an engineering environment, the more you might prefer Python
— Coding and debugging are easier, mainly because of the clean syntax
— Any piece of functionality is typically written the same way
— Preferred when data analysis needs to be integrated with web apps
— A good tool for implementing algorithms for production use
5. Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?
In terms of access speed the two are essentially equivalent:
— Average-case O(1) lookup time
— Lookup time doesn't depend on table size
Even in terms of memory:
— O(n) memory
— Space scales linearly with the number of elements
— Many small dictionaries won't take up significantly less space than one large one
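Both points can be sanity-checked with a short sketch (the element count and keys are arbitrary):

```python
import sys

n = 100_000
big = {i: i for i in range(n)}              # one big table
small = [dict() for _ in range(100)]        # 100 small tables, same entries
for i in range(n):
    small[i % 100][i] = i                   # route each key to one small table

# O(1) lookup either way; the small tables just need an extra routing step
key = 54_321
assert big[key] == small[key % 100][key] == key

# Shallow sizes only (sys.getsizeof ignores the stored objects), but enough
# to see that 100 small tables don't save memory over one big one
print(sys.getsizeof(big), sum(sys.getsizeof(d) for d in small))
```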
For in-database analytics:
— Integration of data analytics into the data warehousing functionality
— Much faster, and corporate information is more secure: it never leaves the enterprise data warehouse
— Good for real-time analytics: fraud detection, credit scoring, transaction processing, pricing and margin analysis, behavioral ad targeting and recommendation engines
6. What is star schema? Lookup tables?
The star schema is a traditional database schema with a central fact table (the “observations”), which has database “keys” for joining with satellite tables and several fields encoded as IDs. Satellite tables map IDs to physical names or descriptions and can be joined to the central fact table on the ID fields; these tables are known as lookup tables, and they are particularly useful in real-time applications because they save a lot of memory. Sometimes star schemas involve multiple layers of summarization (summary tables, from granular to less granular) to retrieve information faster.
— Lookup table (in programming): an array that replaces a runtime computation with a simpler array-indexing operation
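A minimal sketch of the star-schema idea, with hypothetical table and field names:

```python
# Lookup (satellite) tables: map compact IDs to names/descriptions
products = {1: "laptop", 2: "phone"}
stores = {10: "Paris", 11: "Tokyo"}

# Central fact table: observations carry only IDs, saving memory
sales = [
    {"product_id": 1, "store_id": 10, "amount": 1200},
    {"product_id": 2, "store_id": 11, "amount": 800},
]

# "Join" the fact table to the lookup tables on the ID fields
report = [(products[s["product_id"]], stores[s["store_id"]], s["amount"])
          for s in sales]
print(report)  # [('laptop', 'Paris', 1200), ('phone', 'Tokyo', 800)]
```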
7. What is the life cycle of a data science project ?
1. Data acquisition: acquiring data from both internal and external sources, including social media and web scraping. In a steady state, data extraction routines should be in place, and new sources, once identified, are acquired following the established processes
2. Data preparation: also called data wrangling; cleaning the data and shaping it into a form suitable for later analyses. Involves exploratory data analysis and feature extraction
3. Hypothesis & modelling: like in data mining, but with all the data rather than samples. Applying machine-learning techniques to all the data. A key sub-step is model selection: preparing a training set for the candidate models, plus validation and test sets for comparing model performance, selecting the best-performing model, gauging model accuracy and preventing overfitting
4. Evaluation & interpretation: steps 2 to 4 are repeated as many times as needed; as the understanding of the data and the business becomes clearer and results from initial models and hypotheses are evaluated, further tweaks are performed. These iterations may sometimes include step 5 and be performed in pre-production
5. Deployment
6. Operations: regular maintenance and operations. Includes performance tests to measure model performance, which can raise an alert when performance falls below an acceptable threshold
7. Optimization: can be triggered by failing performance, by the need to add new data sources and retrain the model, or by the need to deploy an improved model version
Note: with increasing maturity and well-defined project goals, pre-defined performance criteria can help evaluate the feasibility of a data science project early in its life cycle. This early comparison helps the team refine the hypotheses, discard the project if it is non-viable, or change approaches.
8. How to efficiently scrape web data, or collect tons of tweets?
- Python example
- Requesting and fetching the webpage into the code: httplib2 module
- Parsing the content and getting the necessary info: BeautifulSoup from bs4 package
- Twitter API: a Python wrapper for performing API requests; it handles all the OAuth and API queries in a single Python interface
- MongoDB as the database
- PyMongo: the Python wrapper for interacting with the MongoDB database
- Cron jobs: a time-based scheduler to run scripts at specific intervals; spacing requests out this way avoids the “rate limit exceeded” error
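For the parsing step, BeautifulSoup is the convenient choice; as a dependency-free illustration, the same link extraction can be sketched with the standard library's html.parser (the page content here is made up):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags (what bs4's find_all('a') gives)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

page = '<html><body><a href="/about">About</a> <a href="/blog">Blog</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/about', '/blog']
```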
9. How to clean data?
1. First: detect anomalies and contradictions
- Tidy-data problems (Hadley Wickham's paper):
— column headers are values, not variable names, e.g. <15-25, >26-45…
— multiple variables are stored in one column, e.g. m1534 (males aged 15-34)
— variables are stored in both rows and columns, e.g. tmax and tmin in the same column
— multiple types of observational units are stored in the same table, e.g. a song dataset and a rank dataset in the same table
— a single observational unit is stored in multiple tables (which can be combined)
- Data-type constraints: values in a particular column must be of a particular type: integer, numeric, factor, boolean
- Range constraints: numbers or dates must fall within a certain range, with minimum/maximum permissible values
- Mandatory constraints: certain columns can't be empty
- Unique constraints: a field must be unique across the dataset: a given person must have a unique Social Security number
- Set-membership constraints: the values of a column must come from a set of discrete values or codes: gender must be female or male
- Regular-expression patterns: for example, phone numbers may be required to match the pattern (999)999-9999
- Missing values
- Cross-field validation: certain conditions involving multiple fields must hold. For instance, in laboratory medicine the different white blood cell percentages must sum to 100; in a hospital database, a patient's date of discharge can't be earlier than the admission date
2. Clean the data using:
- Regular expressions: misspellings, regular expression patterns
- KNN-impute and other missing values imputing methods
- Coercing: data-type constraints
- Melting: tidy data issues
- Date/time parsing
- Removing observations
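The constraint checks above can be expressed directly in code; a small sketch with a made-up hospital record:

```python
import re

record = {"age": 34, "gender": "male", "phone": "(555)123-4567",
          "admitted": "2024-01-10", "discharged": "2024-01-15"}

errors = []
if not 0 <= record["age"] <= 120:                               # range constraint
    errors.append("age out of range")
if record["gender"] not in {"female", "male"}:                  # set membership
    errors.append("unknown gender code")
if not re.fullmatch(r"\(\d{3}\)\d{3}-\d{4}", record["phone"]):  # regex pattern
    errors.append("bad phone format")
if record["discharged"] < record["admitted"]:                   # cross-field check
    errors.append("discharge before admission")                 # ISO dates sort lexically

print(errors)  # [] -> the record passes every check
```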
10. How frequently must an algorithm be updated?
You want to update an algorithm when:
— You want the model to evolve as data streams through infrastructure
— The underlying data source is changing
— Example: a retail store model that remains accurate as the business grows
— Dealing with non-stationarity
— Incremental algorithms: the model is updated every time it sees a new training example
Note: simple, and you always have an up-to-date model, but you can't weight old and new data differently
Sometimes mandatory: when data must be discarded once seen (privacy)
— Periodic re-training in “batch” mode: simply buffer the relevant data and update the model every so often
Note: more decisions and a more complex implementation
— Is the sacrifice worth it?
— Data horizon: how quickly do you need the most recent training example to be part of your model?
— Data obsolescence: how long does it take before data is irrelevant to the model? Are some older instances more relevant than the newer ones?
Economics: generally, newer instances are more relevant than older ones. However, data from the same month, quarter or season of a previous year can be more relevant than data from other periods of the current year (seasonality). In a recession, data from previous recessions can be more relevant than newer data from a different economic cycle.
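The incremental-versus-batch distinction is easy to see on a toy model; here a running mean is updated one example at a time (the data are made up):

```python
class IncrementalMean:
    """Toy incremental 'model': updated every time it sees a new example."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n   # running-mean update

stream = [10, 12, 11, 13]
model = IncrementalMean()
for x in stream:
    model.update(x)             # always up to date, nothing buffered

# Batch alternative: buffer the data and retrain every so often
batch_mean = sum(stream) / len(stream)
print(model.mean, batch_mean)   # 11.5 11.5
```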
11. What is POC (proof of concept)?
- A realization of a certain method to demonstrate its feasibility
- In engineering: a rough prototype of a new idea is often constructed as a proof of concept
12. Explain Tufte’s concept of “chart junk”
All visual elements in charts and graphs that are not necessary to comprehend the information represented, or that distract the viewer from this information
Examples of unnecessary elements include:
— Unnecessary text
— Heavy or dark grid lines
— Ornamented chart axes
— Unnecessary dimensions
— Elements depicted out of scale to one another
— 3-D simulations in line or bar charts
13. How would you come up with a solution to identify plagiarism?
- Vector space model approach
- Represent documents (the suspect and original ones) as vectors of terms
- Terms: n-grams, with n from 1 up to as large as feasible (to detect passage-level plagiarism)
- Measure the similarity between the two documents
- Similarity measures: cosine similarity, Jaro-Winkler, Jaccard
- Declare plagiarism above a certain threshold
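A minimal version of the pipeline, using word trigrams and cosine similarity (the documents and threshold are illustrative; in practice the threshold would be tuned on labelled examples):

```python
import math
from collections import Counter

def ngrams(text, n=3):
    """Represent a document as a bag of word n-grams (a term vector)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

original = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox leaps over the lazy dog"
score = cosine(ngrams(original), ngrams(suspect))

THRESHOLD = 0.5
print(round(score, 3), score > THRESHOLD)  # 0.571 True
```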
14. How to detect individual paid accounts shared by multiple users?
- Check the geographical region: e.g. a Friday-morning login from Paris and a Friday-evening login from Tokyo
- Bandwidth consumption: flag users who exceed some high limit
- Counter of live sessions: 100 sessions per day (about 4 per hour) seems like more than one person can generate
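The session-count heuristic, as a sketch (the users, counts and threshold are invented):

```python
from collections import Counter

# (user, day) pairs from a hypothetical session log
session_log = [("alice", "2024-03-01")] * 120 + [("bob", "2024-03-01")] * 8

daily_sessions = Counter(session_log)
MAX_HUMAN_SESSIONS = 100   # ~4 per hour, per the rule of thumb above

flagged = {user for (user, day), count in daily_sessions.items()
           if count > MAX_HUMAN_SESSIONS}
print(flagged)  # {'alice'}
```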
15. Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy? Depends on the context?
- “Premature optimization is the root of all evil”
- At the beginning, a quick-and-dirty model is better
- Optimization comes later
— It depends on the context:
— Is the remaining error acceptable? In fraud detection or quality assurance it may not be
16. What is your definition of big data?
Big data is high volume, high velocity and/or high variety information assets that require new forms of processing
— Volume: big data doesn’t sample, just observes and tracks what happens
— Velocity: big data is often available in real-time
— Variety: big data comes from texts, images, audio, video…
Difference big data/business intelligence:
— Business intelligence uses descriptive statistics on data with high information density to measure things, detect trends, etc.
— Big data uses inductive statistics (statistical inference) and concepts from non-linear system identification to infer laws (regression, classification, clustering) from large datasets with low information density, to reveal relationships and dependencies or to predict outcomes and behaviors
17. Explain the difference between “long” and “wide” format data. Why would you use one or the other?
- Long: one column containing the values and another column listing the context of the value, e.g. columns fam_id, year, fam_inc
- Wide: each different variable in a separate column, e.g. columns fam_id, fam_inc96, fam_inc97, fam_inc98
Long vs wide:
— Data manipulations such as summarizing and filtering are much easier when the data is in the wide format
— Program requirements: some tools expect one format or the other
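Melting wide data into long format (what R's reshape2/tidyr or pandas `melt` does) can be sketched in plain Python, using the hypothetical family-income columns above:

```python
# Wide: one row per family, one income column per year
wide = [{"fam_id": 1, "fam_inc96": 40000, "fam_inc97": 42000, "fam_inc98": 45000},
        {"fam_id": 2, "fam_inc96": 38000, "fam_inc97": 39000, "fam_inc98": 41000}]

# Long: one (fam_id, year, fam_inc) row per value
long_rows = [
    {"fam_id": row["fam_id"], "year": 1900 + int(col[-2:]), "fam_inc": row[col]}
    for row in wide
    for col in ("fam_inc96", "fam_inc97", "fam_inc98")
]
print(long_rows[0])  # {'fam_id': 1, 'year': 1996, 'fam_inc': 40000}
```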
18. Do you know a few “rules of thumb” used in statistical or computer science? Or in business analytics?
Pareto principle (80/20 rule):
— 80% of the effects come from 20% of the causes
— 80% of the sales come from 20% of the customers
Computer science: “simple and inexpensive beats complicated and expensive” — Rod Elder
Finance, rule of 72:
— Estimate the time needed for a money investment to double
— $100 at a rate of 9% doubles in about 72/9 = 8 years
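The approximation can be checked against the exact doubling time:

```python
import math

rate = 9                                        # percent per year
rule_of_72 = 72 / rate                          # quick estimate: 8 years
exact = math.log(2) / math.log(1 + rate / 100)  # exact doubling time
print(rule_of_72, round(exact, 2))              # 8.0 8.04
```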
Rule of three (Economics):
— There are always three major competitors in a free market within one industry
19. Name a few famous API’s (for instance GoogleSearch)
Google API (Google Analytics, Picasa), Twitter API (interact with Twitter functions), GitHub API, LinkedIn API (users data)…
20. Give examples of bad and good visualizations
— Pie charts: difficult to make comparisons between items when area is used, especially when there are many items
— Color choice for classes: abundant use of red, orange and blue. Readers may think the colors mean good (blue) versus bad (orange and red) when they are just associated with specific segments
— 3D charts: can distort perception and therefore skew the data
— Using a solid line in a line chart is better: dashed and dotted lines can be distracting
— Heat maps: with many colors, some stand out more than others, giving more weight to that data; a single color with varying shades shows intensity better
— Adding a trend (regression) line to a scatter plot helps the reader see trends