Текст майнинг: инструменты

Интеллектуальный анализ текстов (англ. text mining) — направление в искусственном интеллекте, целью которого является получение информации из коллекций текстовых документов, основываясь на применении эффективных в практическом плане методов машинного обучения и обработки естественного языка. Название «интеллектуальный анализ текстов» перекликается с понятием «интеллектуальный анализ данных» (ИАД, англ. data mining), что выражает схожесть их целей, подходов к переработке информации и сфер применения; разница проявляется лишь в конечных методах, а также в том, что ИАД имеет дело с хранилищами и базами данных, а не электронными библиотеками и корпусами текстов.

Ключевыми группами задач text mining являются: категоризация текстов, извлечение информации и информационный поиск, обработка изменений в коллекциях текстов, а также разработка средств представления информации для пользователя.

Категоризация документов заключается в отнесении документов из коллекции к одной или нескольким группам (классам, кластерам) схожих между собой текстов (например, по теме или стилю). Категоризация может происходить при участии человека, так и без него. В первом случае, называемом классификацией документов, система ИАТ должна отнести тексты к уже определённым (удобным для него) классам. В терминах машинного обучения для этого необходимо произвести обучение с учителем, для чего пользователь должен предоставить системе ИАТ как множество классов, так и образцы документов, принадлежащих этим классам.

Второй случай категоризации называется кластеризацией документов. При этом система ИАТ должна сама определить множество кластеров, по которым могут быть распределены тексты, — в машинном обучении соответствующая задача называется обучением без учителя. В этом случае пользователь должен сообщить системе ИАТ количество кластеров, на которое ему хотелось бы разбить обрабатываемую коллекцию (подразумевается, что в алгоритм программы уже заложена процедура выбора признаков).

А вот список инструментов text mining (среди них есть и бесплатные):

QDA Miner Lite
QDA Miner qualitative data analysis tool may be used to analyze interview or focus group transcripts, legal documents, journal articles, speeches, even entire books, as well as drawings, photographs, paintings, and other types of visual documents.

KH Coder
KH Coder is a free software for quantitative content analysis or text mining. It is also utilized for computational linguistics. You can analyze Japanese, English, French, German, Italian, Portuguese and Spanish text with KH Coder. Also, Chinese (simplified, UTF-8), Korean and Russian (UTF-8) language data can be analyzed with the latest Alpha.

TAMS Analyzer
TAMS stands for Text Analysis Markup System. It is a convention for identifying themes in texts (web pages, interviews, field notes). It was designed for use in ethnographic and discourse research.

Carrot2 is an Open Source Search Results Clustering Engine. It can automatically organize small collections of documents (search results but not only) into thematic categories.

The Coding Analysis Toolkit (CAT) is a free, open source, cloud computing platform built on Microsoft’s ASP.NET technology. CAT enables efficient, transparent, reliable, valid and scalable Web-based collaborative manual text categorization and analysis.

Open source software capable of solving almost any text processing problem.

tm (text mining infrastructure in R)
The tm package offers functionality for managing text documents, abstracts the process of document manipulation and eases the usage of heterogeneous text formats in R.

Gensim (free Python library)
Gensim is a FREE Python library for: Scalable statistical semantics; Analyze plain-text documents for semantic structure; Retrieve semantically similar documents.

Natural Language Toolkit
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

The Text Extension for RapidMiner adds all operators necessary for statistical text analysis. You can load texts from many different data sources, transform them by a huge set of different filtering techniques, and finally analyze your text data.The Text Extensions supports several text formats including plain text, HTML, or PDF as well as other data sources. It provides standard filters for tokenization, stemming, stopword filtering, or n-gram generation to provide everything necessary for preparing and analyzing texts.

Unstructured Information Management Architecture (UIMA)
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.

OpenNLP (Apache library)
The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.

The KNIME Textprocessing feature enables to read, process, mine and visualize textual data in a convenient way. It provides functionality from natural language processing (NLP), text mining, information retrieval.

Orange-Textable (аддон)
Open source data visualization and data analysis for novice and expert. Interactive workflows with a large toolbox.

LPU (which stands for Learning from Positive and Unlabeled data) is a text learning or classification system that learns from a set of positive documents and a set of unlabeled documents (without labeled negative documents). This type of learning is different from classic text learning/classification, in which both positive and negative training documents are required.

Pattern (модуль для Python)
Pattern is a web mining module for the Python programming language. It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and canvas visualization.

LingPipe is tool kit for processing text using computational linguistics. LingPipe is used to do tasks like: Find the names of people, organizations or locations in news; Automatically classify Twitter search results into categories; Suggest correct spellings of queries.

S-EM is a text learning or classification system that learns from a set of positive and unlabeled examples (no negative examples). It is based on a «spy» technique, naive Bayes and EM algorithm.

LibShortText is an open source tool for short-text classification and analysis. It can handle the classification of, for example, titles, questions, sentences, and short messages.

VisualText is an ideal tool for quickly developing accurate and fast information extraction, natural language processing, and text analysis systems for the most complex needs.

Twinword provides text analysis APIs that can understand and associate words in the same way as humans do. Features include context and topic extraction ,online consumer sentiment analysis for brands and products and personalized and targeted e-commerce/advertising platforms.

Apache Stanbol
Apache Stanbol’s intended use is to extend traditional content management systems with semantic services. Other feasible use cases include: direct usage from web applications (e.g. for tag extraction/suggestion; or text completion in search fields), ‘smart’ content workflows or email routing based on extracted entities, topics, etc.

Datumbox API
The Datumbox API offers a large number of off-the-shelf Classifiers and Natural Language Processing services which can be used in a broad spectrum of applications including: Sentiment Analysis, Topic Classification, Language Detection, Subjectivity Analysis, Spam Detection, Reading Assessment, Keyword and Text Extraction and more.

Aika is a text-mining algorithm. It combines various ideas from the field of machine learning such as artificial neural networks, frequent pattern mining and grammar induction.

Data Scientist # 1

Машинное обучение, большие данные, наука о данных, анализ данных, цифровой маркетинг, искусственный интеллект, нейронные сети, глубокое обучение, data science, data scientist, machine learning, artificial intelligence, big data, deep learning

Данные — новый актив!

Эффективно управлять можно только тем, что можно измерить.
Copyright © 2016-2021 Data Scientist. Все права защищены.