Data Science Tools for People Who Can't Code

Coding is in fact an important part of data science, but you can still get by without it by using suitable supporting tools (though it is better to know how to code). So, here is a list of such tools:

1. RapidMiner

RapidMiner (RM) was originally started in 2006 as open-source stand-alone software named Rapid-I. Over the years, the company renamed it RapidMiner and raised ~35Mn USD in funding. Older versions (below v6) are open-source, but the latest versions come with a 14-day trial period and require a license after that.

RM covers the entire life cycle of predictive modeling, from data preparation to model building and finally validation and deployment. The GUI is based on a block-diagram approach, very similar to Matlab Simulink. There are predefined blocks which act as plug-and-play devices. You just have to connect them in the right manner, and a large variety of algorithms can be run without a single line of code. On top of this, custom R and Python scripts can be integrated into the system.

Their current product offerings include the following:

  1. RapidMiner Studio: Stand-alone software which can be used for data preparation, visualization and statistical modeling
  2. RapidMiner Server: An enterprise-grade environment with central repositories which allow easy teamwork, project management and model deployment
  3. RapidMiner Radoop: Implements big-data analytics capabilities centered around Hadoop
  4. RapidMiner Cloud: A cloud-based repository which allows easy sharing of information among various devices

RM is currently being used in various industries including automotive, banking, insurance, life sciences, manufacturing, oil and gas, retail, telecommunication and utilities.

2. DataRobot

DataRobot (DR) is a highly automated machine learning platform built by some of the all-time best Kagglers, including Jeremy Achin, Thomas DeGodoy and Owen Zhang. Their platform claims to have obviated the need for data scientists. This is evident from a phrase on their website: “Data science requires math and stats aptitude, programming skills, and business knowledge. With DataRobot, you bring the business knowledge and data, and our cutting-edge automation takes care of the rest.”

DR claims the following benefits:

  • Model Optimization
    • The platform automatically detects the best data pre-processing and feature engineering by employing text mining, variable type detection, encoding, imputation, scaling, transformation, etc.
    • Hyper-parameters are automatically chosen depending on the error metric and the validation set score
  • Parallel Processing
    • Computation is divided over thousands of multi-core servers
    • Uses distributed algorithms to scale to large data sets
  • Deployment
    • Easy deployment facilities with just a few clicks (no need to write any new code)
  • For Software Engineers
    • Python SDK and APIs are available for quick integration of models into tools and software.
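
To make the hyper-parameter bullet concrete, here is a minimal Python sketch of choosing a hyper-parameter by its validation-set score. It is not DataRobot's implementation or API. The toy k-nearest-neighbour model, the data, and the names knn_predict, mse and pick_best_k are all assumptions made for illustration.

```python
def knn_predict(train, x, k):
    """Predict y for x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mse(train, validation, k):
    """Mean squared error of the k-NN model on the validation set."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in validation]
    return sum(errors) / len(errors)


def pick_best_k(train, validation, candidates):
    """Return the candidate k with the lowest validation error."""
    return min(candidates, key=lambda k: mse(train, validation, k))


# Toy data: y is roughly 2 * x (hypothetical values).
train = [(0, 0.1), (1, 2.0), (2, 3.9), (3, 6.1), (4, 8.0), (5, 9.9)]
validation = [(1.5, 3.0), (3.5, 7.0)]

best_k = pick_best_k(train, validation, candidates=[1, 2, 6])
print("best k:", best_k)
```

The error metric (mean squared error here) and the held-out validation set together decide which setting wins. An automated platform would repeat the same loop over far more models and settings.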

With funding of ~60Mn USD and more than 100 employees, DR looks in good shape for the future.

3. BigML

BigML is another platform with ~Mn USD in funding. It provides a good GUI which takes the user through six steps, as follows:

  • Sources: use various sources of information
  • Datasets: use the defined sources to create a dataset
  • Models: make predictive models
  • Predictions: generate predictions based on the model
  • Ensembles: create ensembles of various models
  • Evaluation: verify models against validation sets

These steps will not always run in this order; in practice you iterate between them. The BigML platform provides nice visualization of results and has algorithms for solving classification, regression, clustering, anomaly detection and association discovery problems. You can get a feel for how their interface works on their YouTube channel.

 

4. Google Cloud Prediction API

The Google Cloud Prediction API offers RESTful APIs for building machine learning models for Android applications. This platform is specifically for mobile applications based on Android OS. Some of the use cases include:

  • Recommendation Engine: Given a user’s past viewing habits, predict what other movies or products a user might like.
  • Spam Detection: Categorize emails as spam or non-spam.
  • Sentiment Analysis: Analyze posted comments about your product to determine whether they have a positive or negative tone.
  • Purchase Prediction: Guess how much a user might spend on a given day, given their spending history.

Though the API can be used by any system, there are also specific Google API client libraries built for better performance and security. These exist for various programming languages: Python, Go, Java, JavaScript, .NET, Node.js, Obj-C, PHP and Ruby.

5. Paxata

Paxata is one of the few organizations which focus on data cleaning and preparation, NOT the machine learning or statistical modeling part. It is an MS Excel-like application that is easy to use, with visual guidance making it easy to bring together data, find and fix dirty or missing data, and share and re-use data projects across teams. Like others mentioned here, Paxata eliminates coding or scripting, thus overcoming the technical barriers involved in handling data.

The Paxata platform follows this process:

  1. Add Data: use a wide range of sources to acquire data
  2. Explore: perform data exploration using powerful visuals allowing the user to easily identify gaps in data
  3. Clean+Change: perform data cleaning using steps like imputation, normalization of similar values using NLP, detecting duplicates
  4. Shape: make pivots on data, perform grouping and aggregation
  5. Share+Govern: allows sharing and collaborating across teams with strong authentication and authorization in place
  6. Combine: a proprietary technology called SmartFusion allows combining data frames with 1 click as it automatically detects the best combination possible; multiple data sets can be combined into a single AnswerSet
  7. BI Tools: allows easy visualization of the final AnswerSet in commonly used BI tools; also allows easy iterations between data preprocessing and visualization
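
For readers curious about what the Clean+Change step automates, here is a rough Python sketch of imputation, normalization of similar values, and duplicate detection. It is not Paxata's technology. The records, the median-fill rule, and the simple case-and-whitespace normalization (standing in for the NLP-based matching mentioned above) are assumptions made for illustration.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
rows = [
    {"city": "New York", "sales": 120.0},
    {"city": "new york ", "sales": None},
    {"city": "Boston", "sales": 80.0},
    {"city": "Boston", "sales": 80.0},  # exact duplicate
]


def normalize_text(value):
    """Collapse case and whitespace so similar spellings compare equal."""
    return " ".join(value.split()).title()


def clean(records):
    """Normalize text, impute missing sales with the median, drop duplicates."""
    known = [r["sales"] for r in records if r["sales"] is not None]
    fill = median(known)
    seen, result = set(), []
    for r in records:
        row = {
            "city": normalize_text(r["city"]),
            "sales": fill if r["sales"] is None else r["sales"],
        }
        key = (row["city"], row["sales"])
        if key not in seen:  # keep only the first copy of each row
            seen.add(key)
            result.append(row)
    return result


cleaned = clean(rows)
for row in cleaned:
    print(row)
```

A GUI tool hides these decisions behind clicks, but the same choices still have to be made: how to fill gaps, which spellings count as the same value, and what counts as a duplicate.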

With funding of ~25Mn USD, Paxata has established itself in the financial services, consumer goods and networking domains. It might be a good tool to use if your work requires extensive data cleaning.

6. Trifacta

Trifacta is another startup focussed on data preparation. It has 2 product offerings:

  • Wrangler – a free stand-alone software
  • Wrangler Enterprise – licensed professional version

Trifacta offers a very intuitive GUI for performing data cleaning. It takes data as input and provides a summary with various statistics by column. Also, for each column it automatically recommends some transformations which can be selected using a single click. Various transformations can be performed on the data using some pre-defined functions which can be called easily in the interface.

The Trifacta platform uses the following steps of data preparation:

  1. Discovering: this involves getting a first look at the data and distributions to get a quick sense of what you have
  2. Structuring: this involves assigning proper shape and variable types to the data and resolving anomalies
  3. Cleaning: this step includes processes like imputation, text standardization, etc. which are required to make the data model-ready
  4. Enriching: this step helps in improving the quality of analysis that can be done by either adding data from more sources or performing some feature engineering on existing data
  5. Validating: this step performs final sense checks on the data
  6. Publishing: finally the data is exported for further use
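
The per-column summary and the one-click suggestions described above can be pictured with a small Python sketch. It is not Trifacta's engine or API. The functions profile_column and suggest, the two suggestion rules, and the sample table are hypothetical and only illustrate the idea.

```python
def profile_column(values):
    """Summarize one column: size, missing count, distinct count, inferred type."""
    present = [v for v in values if v not in ("", None)]

    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    numeric = bool(present) and all(is_number(v) for v in present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "type": "number" if numeric else "text",
    }


def suggest(profile):
    """Return simple, rule-based transformation suggestions for a column."""
    suggestions = []
    if profile["missing"]:
        suggestions.append("fill missing values")
    if profile["type"] == "number":
        suggestions.append("convert to numeric")
    return suggestions


# Hypothetical table stored column-wise as raw strings.
table = {
    "age": ["34", "", "29", "41"],
    "city": ["Pune", "Delhi", "Pune", "Goa"],
}

for name, values in table.items():
    profile = profile_column(values)
    print(name, profile, suggest(profile))
```

This corresponds roughly to the Discovering and Structuring steps: look at each column first, then decide on types and fixes.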

With ~75Mn USD in funding, Trifacta is currently being used in the financial, life sciences and telecommunication industries.

7. Narrative Science

Narrative Science is based on a unique idea: it generates automated reports from data. It works like a data storytelling tool which uses advanced natural language processing to create reports. The result is something similar to a consulting report.

Some of the features of this platform include:

  • incorporates specific statistics and past data of the organization
  • makes use of the benchmarks, drivers and trends of the specific domain
  • helps generate personalized reports targeted to a specific audience

With ~30Mn USD in funding, Narrative Science is currently being used in financial, insurance, government and e-commerce domains. Some of its customers include American Century Investments, PayScale, MasterCard, Forbes, Deloitte, etc.

Having discussed some startups in this domain, let's move on to some of the academic initiatives which are trying to automate some aspects of data science. These have the potential to turn into successful enterprises in the future.

8. MLBase

MLBase is an open-source project developed by the AMP (Algorithms Machines People) Lab at the University of California, Berkeley. The core idea is to provide an easy solution for applying machine learning to large-scale problems.

It has 3 offerings:

  1. MLlib: It works as the core distributed ML library in Apache Spark. It was originally developed as part of the MLBase project, but now the Spark community supports it
  2. MLI: An experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions.
  3. ML Optimizer: This layer aims to automate the task of ML pipeline construction. The optimizer solves a search problem over feature extractors and ML algorithms included in MLI and MLlib.
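
The search problem behind the ML Optimizer can be illustrated with a toy Python sketch: try every combination of feature extractor and learning algorithm, then keep the one with the lowest validation error. It does not use MLI or MLlib. The extractors, the learners, the exhaustive search strategy, and the data are assumptions made for illustration.

```python
from itertools import product

# Hypothetical feature extractors: map a raw number to a feature value.
extractors = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
}


def fit_mean(features, targets):
    """Baseline learner: always predict the mean target."""
    mean = sum(targets) / len(targets)
    return lambda f: mean


def fit_line(features, targets):
    """Least-squares line through (feature, target) pairs."""
    n = len(features)
    mean_f = sum(features) / n
    mean_t = sum(targets) / n
    var = sum((f - mean_f) ** 2 for f in features)
    cov = sum((f - mean_f) * (t - mean_t) for f, t in zip(features, targets))
    slope = cov / var
    return lambda f: mean_t + slope * (f - mean_f)


learners = {"mean": fit_mean, "line": fit_line}


def search(train, validation):
    """Try every extractor/learner pair; return the best (error, extractor, learner)."""
    results = []
    for (e_name, extract), (l_name, fit) in product(extractors.items(), learners.items()):
        model = fit([extract(x) for x, _ in train], [y for _, y in train])
        error = sum((model(extract(x)) - y) ** 2 for x, y in validation) / len(validation)
        results.append((error, e_name, l_name))
    return min(results)


# Toy data where y = x squared (hypothetical values).
train = [(1, 1), (2, 4), (3, 9), (4, 16)]
validation = [(5, 25), (6, 36)]

best_error, best_extractor, best_learner = search(train, validation)
print(best_extractor, best_learner, best_error)
```

Exhaustive search is only feasible for a handful of options. A real optimizer has to search a much larger space more cleverly, which is the hard part this kind of project tries to automate.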

The project is still under active development, and we should hear more about it in the near future.

9. WEKA

Weka is data mining software written in Java, developed by the Machine Learning Group at the University of Waikato, New Zealand. It is a GUI-based tool which is very good for beginners in data science, and the best part is that it is open-source. You can learn about it using the MOOC offered by the University of Waikato here. You can learn more about it in this article.

Though Weka is currently used more in the academic community, it might be the stepping stone to something big in the future.

10. Automatic Statistician

Automatic Statistician is not a product per se but a research organization which is creating a data exploration and analysis tool. It can take in various kinds of data and use natural language processing to generate a detailed report. It is being developed by researchers who have worked at Cambridge and MIT and who also won Google’s Focused Research Award with a prize of $750,000. Though it is still under development and very little information is available about the project, it looks like it is being backed by Google. You can find some information here.

 

More Tools

  • MarketSwitch – This tool is more focussed on optimization rather than predictive analytics
  • algorithms.io – This tool works in the domain of IoT (Internet of Things) and performs analytics on connected devices
  • wise.io – This tool is focussed on customer handling and ticket system analytics
  • Predixion – This is another tool which works on data collected from connected devices
  • Logical Glue – Another GUI based machine learning platform which works from raw data to deployment
  • Pure Predictive – This tool uses a patented Artificial Intelligence system which obviates the part of data preparation and model tuning; it uses AI to combine 1000s of models into what they call “supermodels”
  • DataRPM – Another tool for making predictive models using a GUI and no coding requirements
  • ForecastThis – Another proprietary technology focussed on machine learning using a GUI
  • FeatureLab – It allows easy predictive modeling and deployment using GUI

Source

Copyright © 2016-2021 Data Scientist. All rights reserved.