Life Cycle of Data Science Projects

1. Identify the problem

  • Identify metrics used to measure success over baseline (doing nothing)
  • Identify type of problem: prototyping, proof of concept, root cause analysis, predictive analytics, prescriptive analytics, machine-to-machine implementation
  • Identify key people within your organization and outside
  • Get specifications, requirements, priorities, budgets
  • How accurate does the solution need to be?
  • Do we need all the data?
  • Build internally versus use a vendor solution
  • Vendor comparison, benchmarking
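
As a minimal sketch of measuring success over a "doing nothing" baseline, the example below compares a model's accuracy with that of always predicting the most common label. The function names, the toy labels, and the choice of accuracy as the metric are assumptions made for illustration; a real project would use whatever metric the stakeholders agree on.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label ("doing nothing")."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def lift_over_baseline(model_score, baseline_score):
    """Relative improvement of the model over the baseline."""
    return (model_score - baseline_score) / baseline_score


# Hypothetical toy data: 7 negatives and 3 positives.
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predictions = [0, 0, 0, 1, 0, 1, 0, 1, 1, 0]

baseline = majority_baseline_accuracy(labels)
model = accuracy(labels, predictions)
lift = lift_over_baseline(model, baseline)
print(f"baseline={baseline:.2f} model={model:.2f} lift={lift:.1%}")
```

A model that does not clearly beat this kind of trivial baseline is usually a sign that the success metric or the problem definition needs another look.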

2. Identify available data sources

  • Extract (or obtain) and check sample data (use sound sampling techniques); discuss fields to make sure you understand the data
  • Perform EDA (exploratory analysis, data dictionary)
  • Assess quality of data, and value available in data
  • Identify data glitches, find workarounds
  • Are data quality and field population consistent over time?
  • Are some fields a blend of different things? (Example: a keyword field that sometimes holds the user query and sometimes the advertiser keyword, with no way to tell them apart except through statistical analysis or by talking to business people.)
  • How can data quality be improved moving forward?
  • Do I need to create mini summary tables or a database?
  • Which tools do I need (R, Excel, Tableau, Python, Perl, SAS and so on)?
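
One way to check whether fields are populated consistently over time is to compute a fill rate per field and per period. The sketch below is only an illustration: the `fill_rate_by_period` helper, the `month`, `keyword`, and `country` field names, and the sample rows are all hypothetical, and it treats `None` and the empty string as "not populated".

```python
from collections import defaultdict


def fill_rate_by_period(rows, period_key, fields):
    """Return {period: {field: fraction of rows where the field is populated}}."""
    totals = defaultdict(int)
    filled = defaultdict(lambda: defaultdict(int))
    for row in rows:
        period = row[period_key]
        totals[period] += 1
        for field in fields:
            value = row.get(field)
            # Assumption: None and "" both mean the field is missing.
            if value is not None and value != "":
                filled[period][field] += 1
    return {
        period: {field: filled[period][field] / totals[period] for field in fields}
        for period in sorted(totals)
    }


# Hypothetical sample extracted from a larger table.
rows = [
    {"month": "2021-01", "keyword": "shoes", "country": "US"},
    {"month": "2021-01", "keyword": "boots", "country": "US"},
    {"month": "2021-02", "keyword": "", "country": "US"},
    {"month": "2021-02", "keyword": "hats", "country": None},
]

rates = fill_rate_by_period(rows, "month", ["keyword", "country"])
for period, per_field in rates.items():
    print(period, per_field)
```

A sudden drop in a field's fill rate from one period to the next is a typical data glitch worth raising with the people who own the data.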

3. Identify if additional data sources are needed

  • What fields should be captured?
  • How granular?
  • How much historical data?
  • Do we need real-time data?
  • How to store or access the data (NoSQL? MapReduce?)
  • Do we need experimental design?

4. Statistical Analyses

  • Use imputation methods as needed
  • Detect / remove outliers
  • Select variables (variable reduction)
  • Is the data censored (hidden data, as in survival analysis or time-to-crime statistics)?
  • Cross-correlation analysis
  • Model selection (as needed, favor simple models)
  • Sensitivity analysis
  • Cross-validation, model fitting
  • Measure accuracy, provide confidence intervals
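
Several of the steps above can be sketched with the Python standard library alone. The example below assumes median imputation, Tukey's IQR rule for outliers, contiguous k-fold splits, a deliberately simple "predict the training mean" model, and a percentile bootstrap for the confidence interval. These are illustrative choices, not the only reasonable ones, and every function name and data value is hypothetical.

```python
import random
import statistics


def impute_median(values):
    """Replace missing values (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validated_mae(values, k=3):
    """Per-fold mean absolute error of a predict-the-training-mean model."""
    errors = []
    for train, test in k_fold_indices(len(values), k):
        prediction = statistics.fmean([values[i] for i in train])
        errors.append(statistics.fmean([abs(values[i] - prediction) for i in test]))
    return errors


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements with two missing values and one glitch (95.0).
raw = [12.0, 15.0, None, 14.0, 13.0, 95.0, 16.0, None, 14.0, 15.0]
clean = impute_median(raw)
outliers = iqr_outliers(clean)
kept = [v for v in clean if v not in outliers]
fold_errors = cross_validated_mae(kept, k=3)
ci_low, ci_high = bootstrap_ci(kept)
print("outliers:", outliers)
print("per-fold MAE:", fold_errors)
print("95% CI for the mean:", (ci_low, ci_high))
```

In line with the advice to favor simple models, the mean predictor here serves only as a reference point: a more complex model earns its place by showing a lower cross-validated error than this.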

5. Implementation, development

  • FSSRR: Fast, simple, scalable, robust, re-usable
  • How frequently do I need to update lookup tables, white lists, data uploads, and so on?
  • Debugging
  • Need to create an API to communicate with other apps?

6. Communicate results

  • Need to integrate results in dashboard? Need to create an email alert system?
  • Decide on dashboard architecture, with business people
  • Visualization
  • Discuss potential improvements (with cost estimates)
  • Provide training
  • Comment the code, write a technical report, and explain how your solution should be used, how its parameters should be fine-tuned, and how its results should be interpreted

7. Maintenance

  • Test the model or implementation; stress tests
  • Regular updates
  • Final outsourcing to engineering and business people in your company, once the solution is stable
  • Help move solution to new platform or vendor

