You have gathered gigabytes or terabytes of unstructured text, for instance by scraping the Internet, or from emails sent by your employees or users, or tweets, or millions of products that you want to categorize (only the product name and description are available, sometimes with typos). Now you want to make sense of it and extract value, possibly by designing a search engine so that your customers can easily find your products. The core algorithm that you need is an automated cataloguer, also called an indexer. I am going to explain in layman’s terms how it works. First, let’s assume that the data consists of
- pages or articles (a web page or the body of an email, etc.),
- subject lines (or page titles),
- and authors (for a web page or an email).
Typically, these “pages” are stored in large repositories containing millions or billions of (sometimes compressed) text files spread across a number of folders and sub-folders, or multiple servers. Sometimes a time stamp is attached to each document, and it can be leveraged to increase the accuracy of the indexer.
Even if you only have pages (no user information, no titles), it will work. If you have pages and authors, you can classify the pages separately, then the authors separately (or in parallel), then blend the results to maximize accuracy. The same indexation algorithm (sometimes called a tagging algorithm) is used in both cases. Classifying billions of documents may seem computationally infeasible, because the running time of traditional clustering algorithms grows much faster than linearly with the size of your repository. This algorithm is different: it runs very fast and is easy to implement using a distributed architecture.
The indexer algorithm creates a taxonomy of your pages (or products, articles, documents, etc.). Each page is assigned a category and sub-category.
Indexation algorithm
- Step 1: Create a data dictionary (that is, a frequency table) of all one-token and two-token keywords found in all pages (both in the title and in the body of the article). This assumes that you crawled all your articles to extract all the text.
- Step 2: Filter / clean results. Ignore keywords with fewer than 5 occurrences. For each multi-token keyword, check all its n-grams, that is, all orderings of its tokens (for example, data science and science data), and eliminate the n-grams with low frequency.
- Step 3: Look at the top 300 entries, called seed keywords. After reviewing them, manually pick 10-20 categories and assign each seed keyword to one of these categories. For instance, the top category data plumbing will have the following seed keywords: data engineer, data architect, data warehouse, Hadoop, Spark, data lakes, IoT and many more. Don’t forget to have a top category called Unknown.
- Step 4: Based on keywords found in the title and body of an article, assign the article in question to the top category that has the biggest overlap with the article, in terms of seed keywords. Note that keywords found in the title might be assigned a higher weight than those found in the body. Likewise, a different weight can be attached to each seed keyword, in each top category.
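The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the function names, the tokenization rule (lowercase, split on non-alphanumeric characters), the default title weight of 2.0 and the "machine learning" category are all hypothetical choices made for the example. The manual review of Step 3 is represented by a hand-written seeds table, and the n-gram check of Step 2 is left out for brevity.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keywords(text):
    """Return all one-token and two-token keywords found in a text."""
    tokens = tokenize(text)
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

def build_dictionary(pages, min_count=5):
    """Steps 1-2: frequency table over titles and bodies, minus rare keywords."""
    counts = Counter()
    for title, body in pages:
        counts.update(keywords(title))
        counts.update(keywords(body))
    return Counter({kw: n for kw, n in counts.items() if n >= min_count})

def classify(title, body, seeds, title_weight=2.0):
    """Step 4: pick the category whose seed keywords overlap the page the most.

    seeds maps a category name to a {seed keyword: weight} table.
    A seed found in the title counts title_weight times its weight
    (an assumed value); a seed found only in the body counts once.
    """
    title_kws = set(keywords(title))
    body_kws = set(keywords(body))
    best_category, best_score = "Unknown", 0.0
    for category, seed_weights in seeds.items():
        score = 0.0
        for kw, weight in seed_weights.items():
            if kw in title_kws:
                score += title_weight * weight
            elif kw in body_kws:
                score += weight
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Step 3 is manual: the table below stands in for the result of that review.
seeds = {
    "data plumbing": {"data engineer": 1.0, "data warehouse": 1.0,
                      "hadoop": 1.0, "spark": 1.0},
    "machine learning": {"neural network": 1.0, "regression": 1.0},  # hypothetical
}

category = classify("Hiring a data engineer",
                    "We run Hadoop and Spark on our data warehouse.",
                    seeds)
print(category)
```

Because each page is scored independently against a small seeds table, the classification step is a single pass over the repository and can be split across machines, and the frequency tables of Step 1 can be built per machine and then summed.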
Potential improvements
The following changes can improve the accuracy of the indexer.
- Add 3-token keywords to your dictionary, not just 1- and 2-token ones. A 3-token keyword has 3! (factorial 3) = 6 n-grams, one for each ordering of its tokens. Usually, only one or two of these 6 n-grams will show up in the articles, for any keyword.
- Use stop words to clean your data. Examples: it, where, how, why, for and so on. Be careful though: IT Job cannot be reduced to Job by filtering out the token IT. You can also replace plurals with singulars, and normalize the keywords.
- Some one-token words don’t make sense. Do not break San Francisco into San and Francisco. Use a table of keywords that should not be split.
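Here is one possible way to sketch these improvements in Python. Everything named here is an assumption made for illustration: the content of the PROTECTED table, the choice to glue protected phrases together with an underscore, and the very crude singular() rule, which a real project would likely replace with a proper stemmer or lemmatizer.

```python
import re
from itertools import permutations

STOP_WORDS = {"it", "where", "how", "why", "for"}   # examples from the text
PROTECTED = {"it job", "san francisco"}             # hypothetical no-split table

def singular(token):
    """Crude plural removal: strip a final 's' (assumption, not a real stemmer)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def normalize(text, protected=PROTECTED, stop_words=STOP_WORDS):
    """Return cleaned tokens: protected phrases kept whole, stop words removed."""
    lowered = text.lower()
    # Glue protected phrases into single tokens before removing stop words,
    # so that "IT job" survives even though "it" is a stop word.
    for phrase in sorted(protected, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        lowered = re.sub(pattern, phrase.replace(" ", "_"), lowered)
    tokens = re.findall(r"[a-z0-9_]+", lowered)
    return [singular(t) for t in tokens if t not in stop_words]

def orderings(keyword):
    """All orderings of a keyword's tokens (6 of them for a 3-token keyword)."""
    return [" ".join(p) for p in permutations(keyword.split())]

def frequent_orderings(keyword, counts, min_count=5):
    """Keep only the orderings that occur at least min_count times in counts."""
    return {o: counts[o] for o in orderings(keyword)
            if counts.get(o, 0) >= min_count}

tokens = normalize("Why search for an IT job in San Francisco")
print(tokens)
print(len(orderings("big data analytics")))
```

Gluing the protected phrases first is a design choice: it lets the stop-word filter stay a simple set lookup while still keeping IT job and San Francisco intact.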
Even without improvements, the methodology will work well, because you focus on top keywords in terms of frequency. For instance, in Best San Francisco Hotels, the keywords Best San and Francisco Hotels won’t show up at the top, and if they do, you can remove them, as you manually review the top 3,000 entries (a process that takes 30 minutes).