As the saying goes, clean data exists only in textbooks. Most available data is unstructured, has many missing elements, and may contain duplicates and other "junk" values. Cleaning is therefore one of the most important, fundamental, and often most labor-intensive stages of data analysis. Fortunately, there are useful tools that help with data cleaning. Let's look at some of them.
Drake is a simple-to-use, extensible, text-based data workflow tool that organizes command execution around data and its dependencies. Data processing steps are defined along with their inputs and outputs and Drake automatically resolves their dependencies and calculates:
- which commands to execute (based on file timestamps)
- in what order to execute the commands (based on dependencies)
Drake is similar to GNU Make, but designed especially for data workflow management. It has HDFS support, allows multiple inputs and outputs, and includes a host of features designed to help you bring sanity to your otherwise chaotic data processing workflows.
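To make the two calculations concrete, here is a minimal Python sketch of Make-style planning. It is not Drake's implementation or its workflow syntax: the `plan` function, the `steps` dictionary layout, and the `mtimes` mapping are assumptions made for illustration, and timestamps are passed in as plain numbers so the example runs without touching the file system.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan(steps, mtimes):
    """Return the names of the steps to run, in dependency order.

    steps:  dict name -> (list of input files, list of output files)  [hypothetical layout]
    mtimes: dict file -> modification time; a missing key means the file does not exist
    """
    # Map each file to the step that produces it.
    producer = {out: name for name, (_, outs) in steps.items() for out in outs}

    # A step depends on every step that produces one of its inputs.
    graph = {
        name: {producer[i] for i in ins if i in producer}
        for name, (ins, _) in steps.items()
    }

    to_run = []
    rebuilt = set()  # outputs that an earlier step in the plan will regenerate
    for name in TopologicalSorter(graph).static_order():
        ins, outs = steps[name]
        missing = any(o not in mtimes for o in outs)
        upstream_rebuilt = any(i in rebuilt for i in ins)
        newest_in = max((mtimes[i] for i in ins if i in mtimes), default=float("-inf"))
        oldest_out = min((mtimes[o] for o in outs if o in mtimes), default=float("inf"))
        # Run the step if an output is missing, an input is about to change,
        # or any input is newer than the oldest output.
        if missing or upstream_rebuilt or newest_in > oldest_out:
            to_run.append(name)
            rebuilt.update(outs)
    return to_run


# Steps are declared out of order on purpose; the planner sorts them.
steps = {
    "report": (["clean.csv"], ["report.txt"]),
    "clean": (["raw.csv"], ["clean.csv"]),
}
print(plan(steps, {"raw.csv": 10, "clean.csv": 5, "report.txt": 20}))
```

In this run `raw.csv` is newer than `clean.csv`, so the `clean` step is stale, and `report` must follow because its input is about to be regenerated.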
OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data.
Wrangler is an interactive tool for data cleaning and transformation. Spend less time formatting and more time analyzing your data. Wrangler allows interactive transformation of messy, real-world data into the data tables analysis tools expect. Export data for use in Excel, R, Tableau, Protovis, …
The heart of DataCleaner is a strong data profiling engine for discovering and analyzing the quality of your data. Find the patterns, missing values, character sets and other characteristics of your data values. Profiling is an essential activity of any Data Quality, Master Data Management or Data Governance program. If you don’t know what you’re up against, you have poor chances of fixing it.
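The kind of column profiling described here can be sketched in a few lines of Python. This is only an illustration of the idea, not DataCleaner's engine or API: the `profile` and `pattern_of` helpers and the shape notation (`a` for a letter, `9` for a digit) are assumptions chosen for the example.

```python
import re
from collections import Counter


def pattern_of(value):
    """Reduce a string to its shape: letters become 'a', digits become '9'."""
    shape = re.sub(r"[^\W\d_]", "a", value)  # any Unicode letter
    return re.sub(r"\d", "9", shape)


def profile(values):
    """Summarize one column: missing values, value patterns, non-ASCII characters."""
    present = [v for v in values if v is not None and v.strip() != ""]
    patterns = Counter(pattern_of(v) for v in present)
    non_ascii = sorted({ch for v in present for ch in v if ord(ch) > 127})
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "patterns": patterns.most_common(),  # most frequent shape first
        "non_ascii_chars": non_ascii,
    }


column = ["AB-123", "CD-456", "", None, "ef 78", "Zo\u00eb-001"]
for key, value in profile(column).items():
    print(key, value)
```

A dominant pattern with a handful of outliers is usually the first hint of where a column needs cleaning.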
Data quality is an important contributor to the overall success of a project or campaign. Inaccurate data leads to wrong assumptions and flawed analysis, and consequently to the failure of the project or campaign. Duplicate data can also cause all sorts of hassles, such as slow loading and accidental deletion. A good data cleaning tool tackles these problems and cleans your database of duplicate data, bad entries and incorrect information.
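As a rough illustration of duplicate removal, the Python sketch below treats two records as duplicates when their key fields match after normalizing case, whitespace, and Unicode form. The `deduplicate` and `normalize` helpers and the sample records are hypothetical; real tools typically apply richer matching rules than this.

```python
import unicodedata


def normalize(text):
    """Canonicalize a value: Unicode NFKC form, case-folded, whitespace collapsed."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.casefold().split())


def deduplicate(records, key_fields):
    """Keep the first record for each normalized key; drop later duplicates."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(normalize(str(rec.get(f, ""))) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


records = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "  ann   LEE ", "email": "ANN@example.com"},  # same person, messy entry
    {"name": "Bob Ray", "email": "bob@example.com"},
]
for rec in deduplicate(records, ["name", "email"]):
    print(rec)
```

Keeping the first occurrence is a design choice made for simplicity; a production pipeline might instead merge the duplicates or keep the most complete record.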
Talend Enterprise Data Quality is a subscription-based, open source solution that costs just a fraction of what comparable proprietary data cleansing solutions cost. Based primarily on the number of developer seats, Talend’s subscription pricing is more transparent and predictable than proprietary licensing, and includes professional technical support from Talend.
The SQL Power DQguru helps you cleanse your data, validate and correct addresses, identify and remove duplicates, and build cross-references between source and target tables. This provides business users with complete and accurate data, and a single 360-degree view of all business entities, such as customer, product, representative, employee, supplier or business unit.
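Building a cross-reference between a source and a target table comes down to matching records that refer to the same entity even when they are spelled differently. The sketch below uses the standard library's `difflib.SequenceMatcher` for approximate string matching. It does not reflect how DQguru works internally: the `cross_reference` function, the 0.85 threshold, and the sample tables are assumptions for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cross_reference(source, target, threshold=0.85):
    """Map each source id to the best-matching target id, or None if no match is close enough.

    source, target: dict id -> name
    threshold: minimum similarity to accept a match (assumed value, tune per data set)
    """
    xref = {}
    for sid, sname in source.items():
        best_id, best_score = None, 0.0
        for tid, tname in target.items():
            score = similarity(sname, tname)
            if score > best_score:
                best_id, best_score = tid, score
        xref[sid] = best_id if best_score >= threshold else None
    return xref


source = {"s1": "Acme Corporation", "s2": "Globex Inc"}
target = {"t1": "ACME Corporation", "t2": "Initech LLC"}
print(cross_reference(source, target))
```

This compares every source row with every target row, which is fine for small tables; larger data sets usually need a blocking step that limits comparisons to plausible candidates.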
Data Cleansing Suite is from Data Ladder, a software development and service company dedicated to helping you "Get the Most Out of Your Data" through data matching, data cleansing, profiling, and enrichment. It is easy-to-use, affordable data cleansing and deduplication software.