Article categorization has been around for a while, and it improved enormously as Machine Learning techniques matured. However, we will not always need something that complex to categorize an article. Here I will show you how to categorize an article using NLTK on Python 2.7.x with a few dozen lines of code.
We will assume that we receive the article without HTML tags and properly encoded. The main reason to use Python is that it is the de facto language for natural language processing, and the libraries that exist in .NET are not maintained anymore. The steps to solve this problem are the following:
Create a couple of classes to hold the needed data
We will create two classes: the first one called Category and the second one called Data.
class Category:
    def __init__(self, name, keywords):
        self.name = name
        self.keywords = keywords


class Data:
    def __init__(self, article):
        self.categories = None
        self.article = article
The first class holds the name of the category and the keywords used to identify it; the second one holds the article text plus a categories attribute, initialized to None, that will be set later.
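As a quick sanity check, the two classes can be used like this (the class definitions are repeated so the snippet runs on its own, and the article text is just a placeholder):

```python
class Category:
    def __init__(self, name, keywords):
        self.name = name
        self.keywords = keywords


class Data:
    def __init__(self, article):
        self.categories = None  # filled in later by the analyzer
        self.article = article


data = Data("Some article content about soccer and sports")
category = Category("Sports", ["soccer", "game", "sports"])
print(category.name)    # Sports
print(data.categories)  # None, until the analyzer assigns it
```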
Create a util class to process the data
This class has a method that receives the Data instance and an array of Category instances. With this information we will use NLTK to find the categories in the simplest way possible. This is quite a long class, so it is important to read all the comments for further explanation.
'''
Analyzer class
'''
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

nltk.download('stopwords')
nltk.download('punkt')


class Analyzer:
    def __init__(self, data, categories):
        self.data = data
        self.categories = categories
        # Distribution threshold. How many times a word must appear
        # in the distribution to be considered part of the category
        self.threshold_distribution = 1
        # Category threshold. How many matching words are needed
        # for the article to be considered part of the category
        self.threshold_category_words = 3

    def analize_content_with_nltk(self):
        # Tokenizer which removes extra white spaces and weird characters
        tokenizer = RegexpTokenizer(r'\w+')
        # Stop word dictionary to remove common words like prepositions
        stop_words = set(stopwords.words('english'))
        # Stemmer to reduce the words to their stem
        stemmer = PorterStemmer()
        print "processing..."
        # First we tokenize the article
        article_tokens = tokenizer.tokenize(self.data.article)
        # Remove the stop words from the article
        article_tokens_no_stop_words = [w for w in article_tokens
                                        if w not in stop_words]
        print 'article tokens: ' + str(len(article_tokens))
        print 'article tokens no stop words: ' + str(len(article_tokens_no_stop_words))
        # Stem the article tokens
        article_tokens_stemming = []
        for word in article_tokens_no_stop_words:
            article_tokens_stemming.append(stemmer.stem(word.lower()))
        # Create a frequency distribution for the article's tokens
        article_dist = nltk.FreqDist(article_tokens_stemming)
        # Iterate over the categories array
        article_categories = []
        for category in self.categories:
            # Tokenize the category keywords
            keywords_tokens = list(set(tokenizer.tokenize(' '.join(category.keywords))))
            # If there are no tokens continue with the next category
            if len(keywords_tokens) == 0:
                continue
            print 'category: ' + category.name
            print 'keywords tokens: ' + str(len(keywords_tokens))
            # Stem the category keywords
            keywords_tokens_stemming = []
            for keyword in keywords_tokens:
                keywords_tokens_stemming.append(stemmer.stem(keyword.lower()))
            # Using the first threshold, check whether each keyword appears
            # in the article distribution. If it does, save it to the
            # keywords found array
            keywords_found = []
            for keyword in keywords_tokens_stemming:
                if article_dist[keyword] >= self.threshold_distribution:
                    keywords_found.append(keyword)
            print "keywords found: " + str(keywords_found)
            # If the number of keywords found in the article distribution
            # reaches the second threshold, we assume the article belongs
            # to this category. We then repeat the process for the next category
            if len(keywords_found) >= self.threshold_category_words:
                article_categories.append(category.name)
        print article_categories
        self.data.categories = article_categories
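If you want to see the two-threshold idea in isolation, without the NLTK dependencies, here is a minimal sketch that uses collections.Counter as a stand-in for nltk.FreqDist (the article text, keywords, and threshold values below are made up for illustration and skip the stop-word and stemming steps):

```python
from collections import Counter

article = "soccer soccer game sports news"
keywords = ["soccer", "game", "sports", "entertainment"]

threshold_distribution = 1    # minimum occurrences per keyword
threshold_category_words = 3  # minimum matching keywords per category

# Counter plays the role of nltk.FreqDist here: it maps token -> count
# and returns 0 for tokens it has never seen
dist = Counter(article.lower().split())

# First threshold: keep the keywords that appear often enough
found = [k for k in keywords if dist[k] >= threshold_distribution]
print(found)  # ['soccer', 'game', 'sports']

# Second threshold: enough keywords matched, so the category applies
print(len(found) >= threshold_category_words)  # True
```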
We used plain NLP to solve this task instead of Machine Learning. This of course makes categorization faster, but less accurate. I would definitely go with a Machine Learning approach if you have the resources and time to implement it.
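To give a taste of what the Machine Learning route looks like, here is a minimal bag-of-words Naive Bayes classifier in pure Python. The training articles and labels are made up for illustration; a real setup would train NLTK's own NaiveBayesClassifier (or scikit-learn) on a proper labeled corpus:

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Crude stand-in for the NLTK tokenizer used above
    return [w.lower() for w in text.split()]


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of docs
        self.vocabulary = set()

    def train(self, documents):
        # documents: list of (text, label) pairs
        for text, label in documents:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, float('-inf')
        for label in self.label_counts:
            # Log prior for the label
            score = math.log(float(self.label_counts[label]) / total_docs)
            total_words = sum(self.word_counts[label].values())
            # Add-one smoothed log likelihood for each token
            for word in tokenize(text):
                count = self.word_counts[label][word] + 1
                score += math.log(float(count) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch only
training = [
    ("the team won the soccer game", "Sports"),
    ("a great game of sports and soccer", "Sports"),
    ("the movie won an oscar at the cinema", "Movies"),
    ("imdb rates the movies at the cinema", "Movies"),
]
classifier = NaiveBayes()
classifier.train(training)
print(classifier.classify("soccer game tonight"))           # Sports
print(classifier.classify("the oscar movie at the cinema")) # Movies
```

Unlike the keyword thresholds, this approach weighs every word in the article by how often it appeared in each category's training data, which is why it generalizes better as the corpus grows.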
Return the results
Finally, we will create our instances and call the method to get the results.
#!/usr/bin/env python
from category import Category
from data import Data
from analyzer import Analyzer

if __name__ == '__main__':
    data = Data("Some article content over here about soccer and sports")
    category1 = Category("Sports", ["soccer", "game", "sports", "entertainment"])
    category2 = Category("Movies", ["cinema", "oscars", "movies", "imdb"])
    # Create the analyzer class
    analyzer = Analyzer(data, [category1, category2])
    # Call the method
    analyzer.analize_content_with_nltk()
    # Show result
    print analyzer.data.categories
This shows you how to implement simple article categorization using NLTK and Python. I hope it helps someone, and as always, keep learning!