We love designing and developing websites, but what really drives us is solving problems and cultivating strong relationships with our clients
Generating pseudo random text with Markov chains using Python
By : shabda
First the definition from Wolfram
A Markov chain is collection of random variables {X_t} (where the index t runs through 0, 1, ...) having the property that, given the present, the future is conditionally independent of the past.
Wikipedia is a little clearer
...Markov chain is a stochastic process with markov property ... [Which means] state changes are probabilistic, and future state depend on current state only.
Markov chains have various uses, but now let's see how it can be used to generate gibberish, which might look legit.
The algorithm is,
- Have a text which will serve as the corpus from which we choose the next transitions.
- Start with two consecutive words from the text. The last two words constitute the present state.
- Generating next word is the markov transition. To generate the next word, look in the corpus, and find which words are present after the given two words. Choose one of them randomly.
- Repeat 2, until text of required size is generated.
The code for this is
To see a sample output, we take the text of My man jeeves by Wodehouse from Project Gutenberg, and see a sample output.
In [1]: file_ = open('/home/shabda/jeeves.txt')
In [2]: import markovgen
In [3]: markov = markovgen.Markov(file_)
In [4]: markov.generate_markov_text()
Out[4]: 'Can you put a few years of your twin-brother Alfred,
who was apt to rally round a bit. I should strongly advocate
the blue with milk'
[The files you need to run this are jeeves.txt and markovgen.py]
How is this a markov algorithm?
- The last two words are the current state.
- Next word depends on last two words only, or on present state only.
- The next word is randomly chosen from a statistical model of the corpus.
Here is some sample text.
"The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
This gives us a corpus like,
{('The', 'quick'): ['brown'],
('brown', 'fox'): ['jumps', 'who', 'who'],
('fox', 'jumps'): ['over'],
('fox', 'who'): ['is', 'is'],
('is', 'slow'): ['jumps'],
('jumps', 'over'): ['the', 'the'],
('over', 'the'): ['brown', 'brown'],
('quick', 'brown'): ['fox'],
('slow', 'jumps'): ['over'],
('the', 'brown'): ['fox', 'fox'],
('who', 'is'): ['slow', 'dead.']}
Now if we start with "brown fox", the next word can be "jumps" or "who". If we choose "jumps", then the current state is "fox jumps" and next word is over, and so on.
Hints
- The larger text we choose, the more choices are there at each transition, and a better looking text is generated.
- The state can be set to depend on one word, two words, or any number of words. As the number of words in each state increases, the generated text becomes less random.
- Don't strip the punctuations etc. They lead to a more representative corpus, and a better random text.
Resources
Did you know that markov chains have many other uses in finance, statistics and mathematics? In fact Google's page rank algorithm is essentially a markov chain of the web! We publish articles on Algorithms, Python, Django and related technologies every week. Why not Subscribe to us or follow us on twitter?
Comments
You don't seem to use the relative frequenccy of next words to weight the random choice?
- Paddy.
Paddy: The words are added to the list the number of times they appear. For example, a key might be
('milk', 'honey'):['story', 'book', 'book'],
........
When you do a random.choice, book has twice the probability of story, hence it is frequency weighted.
Markov chain has been the collection of random variables ,it is also used to generate gibberish ,actually markov chain depends on the present and future.It has the only one who can make a possible word of future.
do you remember a program from waaaaay back called "text mangler"? it used markov chains (i guess), and it was almost creepy how the sentences came out. it was hours of pointless and amazing fun... i haven't been able to get the same results with all the other newer generators... thanks for this - i'm going to get all texty with this tonight!!!!!
Generating pseudo random text with Markov chains using Python — The Usware Blog - Django Web Development...
...
For fun, here's a Markov-chain-based word generator to make words that have a realistic structure, based on existing words.
Usage is pretty much the same.
Real Good description and clear examples.
Big help in understanding.
Thanks
Reactions
NLTK has a pretty good text generation library. It generates text based on any corpus you give it.
This comment was originally posted on Reddit
Not ground-breaking by any means, but a nice, concrete application of Markov chains.
This comment was originally posted on Hacker News
This is relevant, if you want to do something fancier:http://megahal.alioth.debian.org/How.html
BTW, has anything better than Megahal turned up in the meantime, that is also open source?
This comment was originally posted on Hacker News
This is relevant, if you wan to do something fancier:http://megahal.alioth.debian.org/How.html
BTW, has anything better than Megahal turned up in the meantime, that is also open source?
This comment was originally posted on Hacker News
Markov chains can be hilarious.
This comment was originally posted on FriendFeed
This is a big reason why speech recognition works. You have a list of probabilities of sequence of words occuring(language model) and it decides on the next word based on the last few previous words.So if you have the word Super, odds are the next word is Man and not Microphone.
This comment was originally posted on Hacker News
In "The Practice of Programming", Brian Kernighan and Rob Pike use the Markov Chain Algorithm to illustrate a number of lessons about software design and implementation. I used it as my "Hello world" program when learning a new programming language.In Python, the implementation is surprisingly succinct:
import sys import random import collections MAXGEN = 10000 NONWORD = "" suffixchain = collections.defaultdict(list) w1 = w2 = NONWORD for line in sys.stdin: for word in line.split(): suffixchain[(w1, w2)].append(word) w1, w2 = w2, word suffixchain[(w1, w2)].append(NONWORD) w1 = w2 = NONWORD for i in range(MAXGEN): suffixes = suffixchain[(w1, w2)] next = random.choice(suffixes) if next is NONWORD: break print next, w1, w2 = w2, nextTo use it, just run: $ python markov.py < english_corpus.txt
This comment was originally posted on Hacker News
This should be called how to use a library in Python. Pretty weak IMO.
This comment was originally posted on Reddit
Here’s one I wrote that actually shows how they’re coded: http://justinbozonier.posterous.com/markov-chains-in-python
This comment was originally posted on Reddit
I once was given homework by the information theory lecture to implement such a random text generator (k-th order markov chain, k configurable). As I just started with python I thought it would be fun to do that in a one liner. It happened, though it was a really long line and a total lambda mess.
This comment was originally posted on Reddit
- How to use pep8.py to write better Django code
- Screencast: Django Tutorial Part 1
- How and why to use pyflakes to write better Python
- Getting started with South for Django DB migrations
- A brief overview of Vagrant
- Writing jQuery plugins using Coffeescript
- Behind the Scenes: Request to Response
- Using SQLite Database with Android
- Haml for Django developers
- Coffeescript for Python programmers
- rails
- django
- linkroundup
- django opinion
- opinion
- business
- API
- appengine
- python
- satire
- startup
- Uncategorized
- marketing
- personal
- rambling
- search
- interviews
- seo-interviews
- 5startupideas
- ideas
- seo
- tips
- forms
- paypal
- utilities
- datetime
- web2.0
- Amazon
- algorithms
- presentations
- products
- pinax
- satchmo
- ecommerce
- microsoft
- yahoo
- book
- tutorial
- models
- aggreagtion
- meta
- India
- apps
- about
- CSS
- Design
- wordpress
- test slug
- vim
- urls
- reviews
- javascript
- xmpp
- emacs
- Typography
- Grid Theory
- Color Theory
- iphone
- android
- titanium
- mobile applications
- CSS3
- Browser Compatibility
- mobile
- jobs
- lamson
- django setup
- files
- upload
- jsTree
- hierarchical view
- web page
- Treeview
- coffeescript
- request
- response
- South
- django south
- django migration
- --fake
- screencasts
- February 2012
- January 2012
- December 2011
- October 2011
- September 2011
- July 2011
- June 2011
- April 2011
- February 2011
- January 2011
- December 2010
- November 2010
- October 2010
- September 2010
- June 2010
- April 2010
- March 2010
- January 2010
- December 2009
- November 2009
- October 2009
- September 2009
- August 2009
- July 2009
- June 2009
- April 2009
- March 2009
- February 2009
- November 2008
- October 2008
- June 2008
- May 2008
- April 2008
links for 2009-06-17...
What is a browser?
OMFG (Ich sag ja immer: geht in die Fußgängerzone und FRAGT mal )
(tags: browser *****)
Thoughts on Opera Unite | FactoryCity
Opera Uni...