We love designing and developing websites, but what really drives us is solving problems and cultivating strong relationships with our clients
Finding keywords using Python
By : lakshman
Update: keywords2.txt is Pride and Prejudice from Project Gutenberg. Attached for convenience.
Finding keywords in a given snippet of text has many uses. From classifying web pages to fighting spam mails, keyword recognition is the first step in many algorithms.
We here show the naive Bayesian filter to find keywords, which was popularised by Paul Graham to discover spam mails.
Steps to find keywords.
- Have a large corpus of text against which we will compare.
- Find the relative frequency of words in corpus. Eg if your corpus is "the green way is very green way green". Relative frequency is.. [ (the, 1/8), (green, 3/8), (way, 2/8), (is, 1/8), (very, 1/8), ]
- Get the relative frequency of words in search string.
- Divide frequency of search string by corpus, taking care of cases like when a word is not in corpus.
- Sort by answer in 4, the top n words are keyword.
Why does this work?
The relative frequency in corpus string is a measure of how frequently the word occurs generally. Dividing search frequency word by relative corpus frequency gets a measure of how frequently the word occurs relative to normal. Sorting on this gives us the keywords.
The code
The code is longer than it needs to be make every step as clear as possible
def find_keyword(test_string = 'Hacker news is a good site while Techcrunch not so much'):
key_file = open('keywords2.txt')
data = key_file.read()
words = data.split()
word_freq = {}
for word in words:
if word in word_freq:
word_freq[word]+=1
else:
word_freq[word] = 1
word_prob_dict = {}
size_corpus = len(words)
for word in word_freq:
word_prob_dict[word] = float(word_freq[word])/size_corpus
prob_list = []
for word, prob in word_prob_dict.items():
prob_list.append(prob)
non_exist_prob = min(prob_list)/2
words = test_string.split()
test_word_freq = {}
for word in words:
if word in test_word_freq:
test_word_freq[word]+=1
else:
test_word_freq[word] = 1
test_words_ba = {}
for word, freq in test_word_freq.items():
if word in word_prob_dict:
test_words_ba[word] = freq/word_prob_dict[word]
else:
test_words_ba[word] = freq/non_exist_prob
test_word_ba_list = []
for word, ba in test_words_ba.items():
test_word_ba_list.append((word, ba))
def sort_func(a, b):
if a[1] > b[1]:
return -1
elif a[1] < b[1]:
return 1
return 0
test_word_ba_list.sort(sort_func)
return test_word_ba_list[:2]
Output
Here is some sample output.
In [59]: ff.find_keyword
Out[59]: <function find_keyword at 0x92b7e9c>
In [60]: ff.find_keyword()
Out[60]: [('Hacker', 249160.0), ('Techcrunch', 249160.0)]
In [61]: ff.find_keyword('')
Out[61]: []
In [62]: ff.find_keyword('Java is an island and a programming language')
Out[62]: [('Java', 249160.0), ('island', 249160.0)]
In [63]: ff.find_keyword('Java is an island and a programming language. Python is a snake and a programming')
Out[63]: [('programming', 498320.0), ('Java', 249160.0)]
In [64]: ff.find_keyword('Java is an island and a programming language')
Out[64]: [('Java', 249160.0), ('island', 249160.0)]
In [65]: ff.find_keyword('Java is an island and a programming programming language')
Out[65]: [('programming', 498320.0), ('Java', 249160.0)]
Need to build a web app which does more than read from a database? Contact us at sales@uswaretech.com, to discuss further.
Comments
Reactions
Finding keywords using Python http://klck.me/A7
This comment was originally posted on Twitter
- How to use pep8.py to write better Django code
- Screencast: Django Tutorial Part 1
- How and why to use pyflakes to write better Python
- Getting started with South for Django DB migrations
- A brief overview of Vagrant
- Writing jQuery plugins using Coffeescript
- Behind the Scenes: Request to Response
- Using SQLite Database with Android
- Haml for Django developers
- Coffeescript for Python programmers
- rails
- django
- linkroundup
- django opinion
- opinion
- business
- API
- appengine
- python
- satire
- startup
- Uncategorized
- marketing
- personal
- rambling
- search
- interviews
- seo-interviews
- 5startupideas
- ideas
- seo
- tips
- forms
- paypal
- utilities
- datetime
- web2.0
- Amazon
- algorithms
- presentations
- products
- pinax
- satchmo
- ecommerce
- microsoft
- yahoo
- book
- tutorial
- models
- aggreagtion
- meta
- India
- apps
- about
- CSS
- Design
- wordpress
- test slug
- vim
- urls
- reviews
- javascript
- xmpp
- emacs
- Typography
- Grid Theory
- Color Theory
- iphone
- android
- titanium
- mobile applications
- CSS3
- Browser Compatibility
- mobile
- jobs
- lamson
- django setup
- files
- upload
- jsTree
- hierarchical view
- web page
- Treeview
- coffeescript
- request
- response
- South
- django south
- django migration
- --fake
- screencasts
- February 2012
- January 2012
- December 2011
- October 2011
- September 2011
- July 2011
- June 2011
- April 2011
- February 2011
- January 2011
- December 2010
- November 2010
- October 2010
- September 2010
- June 2010
- April 2010
- March 2010
- January 2010
- December 2009
- November 2009
- October 2009
- September 2009
- August 2009
- July 2009
- June 2009
- April 2009
- March 2009
- February 2009
- November 2008
- October 2008
- June 2008
- May 2008
- April 2008
You could do this in just a few lines of code by using a more Pythonic, functional style instead of this very imperative style; see, e.g., http://digitalhistoryhacks.blogspot.com/2006/08/easy-pieces-in-python-word-frequencies.html
The sort_func is totally unnecessary; just do test_word_ba_list.sort(key = operator.itemgetter(1))
Also, you probably want to remove stop words (a, and, the, etc.).
Finally, raw term frequency isn't the best measure to identify the most important keywords; instead, for example, you might use the term's frequency in the document divided by its frequency in English overall (or across a larger corpus of documents).