NLTK for Simple Article Categorization
Article categorization has been around for a while, and it improved dramatically once Machine Learning techniques matured. However, we will not always need something that complex to categorize an article. Here I will show you how we can categorize an article using NLTK in Python 2.7.x with a few dozen lines of code.

We will assume that we receive the article without HTML tags and properly encoded. The main reason to use Python is that it is the de facto language for natural language processing; the equivalent .NET libraries are no longer maintained. The steps to solve this problem are the following:
Create a couple of classes to save needed data
We will create two classes: the first one called Category and the second one called Data.
class Category:
    def __init__(self, name, keywords):
        self.name = name
        self.keywords = keywords

class Data:
    def __init__(self, article):
        self.categories = None
        self.article = article
The first class stores the name of the category and the keywords that identify it. The second one stores the article text and initializes categories to None; the analyzer will fill it in later.
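As a quick sketch of how these plain data holders behave (using toy values similar to the final script below):

sports = Category("Sports", ["soccer", "game", "sports"])
data = Data("Some article content over here about soccer and sports")
print sports.name        # Sports
print data.categories    # None, until the analyzer assigns the categories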
Create a util class to process the data
This class will have a method which receives a Data instance and an array of Category instances. With this information we will use NLTK to get the categories in the simplest way possible. This is quite a long class, so it is important to read all the comments for further explanation.
'''
Analyzer class
'''
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

nltk.download('stopwords')

class Analyzer:
    def __init__(self, data, categories):
        self.data = data
        self.categories = categories
        # Distribution threshold. How many times a keyword
        # must appear in the distribution to count as found
        self.threshold_distribution = 1
        # Category threshold. Minimum number of keywords that
        # must be found to assign the article to the category
        self.threshold_category_words = 3

    def analyze_content_with_nltk(self):
        # Tokenizer which keeps only word characters, dropping
        # extra whitespace and punctuation
        tokenizer = RegexpTokenizer(r'\w+')
        # Stop word set to remove common words like prepositions
        stop_words = set(stopwords.words('english'))
        # Stemmer to reduce the words to their stem
        stemmer = PorterStemmer()
        print "processing..."
        # First we tokenize the article
        article_tokens = tokenizer.tokenize(self.data.article)
        # Remove the stop words from the article (lowercasing first,
        # since the stop word list is all lowercase)
        article_tokens_no_stop_words = [w for w in article_tokens if w.lower() not in stop_words]
        print 'article tokens: ' + str(len(article_tokens))
        print 'article tokens no stop words: ' + str(len(article_tokens_no_stop_words))
        # Stem the article tokens
        article_tokens_stemming = []
        for word in article_tokens_no_stop_words:
            article_tokens_stemming.append(stemmer.stem(word.lower()))
        # Create a frequency distribution for the article's tokens
        article_dist = nltk.FreqDist(article_tokens_stemming)
        # Iterate over the categories array
        article_categories = []
        for category in self.categories:
            # Tokenize the category keywords, removing duplicates
            keywords_tokens = list(set(tokenizer.tokenize(' '.join(category.keywords))))
            # If there are no tokens continue with the next category
            if len(keywords_tokens) == 0:
                continue
            print 'category: ' + category.name
            print 'keywords tokens: ' + str(len(keywords_tokens))
            # Stem the category keywords
            keywords_tokens_stemming = []
            for keyword in keywords_tokens:
                keywords_tokens_stemming.append(stemmer.stem(keyword.lower()))
            # Using the first threshold, check whether each keyword
            # appears in the article distribution. If it does, save it
            # to the keywords found array
            keywords_found = []
            for keyword in keywords_tokens_stemming:
                if article_dist[keyword] >= self.threshold_distribution:
                    keywords_found.append(keyword)
            print "keywords found: " + str(keywords_found)
            # If the number of keywords found in the article distribution
            # meets the second threshold, we assume the article belongs to
            # this category. We then repeat the process for the next category
            if len(keywords_found) >= self.threshold_category_words:
                article_categories.append(category.name)
        print article_categories
        self.data.categories = article_categories
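To make the pipeline concrete, here is a small sketch of what each stage produces; the sentence is only an illustrative example, and it assumes the stopwords corpus has already been downloaded as above.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

tokens = tokenizer.tokenize("The players played a great game of soccer")
# ['The', 'players', 'played', 'a', 'great', 'game', 'of', 'soccer']
no_stop = [w for w in tokens if w.lower() not in stop_words]
# ['players', 'played', 'great', 'game', 'soccer']
stems = [stemmer.stem(w.lower()) for w in no_stop]
# ['player', 'play', 'great', 'game', 'soccer']
dist = nltk.FreqDist(stems)
print dist['soccer']   # 1
print dist['cinema']   # 0 -- FreqDist returns 0 for unseen tokens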
We used plain NLP to solve this task instead of Machine Learning. This of course makes it faster to set up and run, but less accurate. I would definitely go with a Machine Learning approach if you have the resources and time to implement it.
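For reference, here is a minimal sketch of what that could look like with NLTK's built-in NaiveBayesClassifier. The tiny training set is purely illustrative; a real classifier would need many labeled articles.

import nltk

def features(text):
    # Bag-of-words features: each lowercased token maps to True
    return {word: True for word in text.lower().split()}

# Purely illustrative training data; use real labeled articles in practice
train_set = [
    (features("the team won the soccer game"), "Sports"),
    (features("a great match and an exciting final score"), "Sports"),
    (features("the movie premiered at the cinema last night"), "Movies"),
    (features("the actor won an oscar for the film"), "Movies"),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print classifier.classify(features("an exciting game of soccer"))  # Sports, with this toy data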
Return the results
Finally, we will instantiate our classes and call the method to get the results.
#!/usr/bin/env python
from category import Category
from data import Data
from analyzer import Analyzer

if __name__ == '__main__':
    # Entry point when run via the Python interpreter
    data = Data("Some article content over here about a soccer game and sports")
    category1 = Category("Sports", ["soccer", "game", "sports", "entertainment"])
    category2 = Category("Movies", ["cinema", "oscars", "movies", "imdb"])
    # Create the analyzer class
    analyzer = Analyzer(data, [category1, category2])
    # Call the method
    analyzer.analyze_content_with_nltk()
    # Show result
    print analyzer.data.categories
This shows you how to implement simple article categorization using NLTK and Python. Running the script above prints ['Sports'], since three of the stemmed Sports keywords appear in the article's distribution. I hope it helps someone, and as always, keep learning!