Sunday, February 20, 2011

How It All Began

The whole idea for writing computer code to generate haiku struck me one day while I was at work.

Ecologists use many statistical programs, and a popular one is R, which is more than a statistical program because it is also a programming language.  I've been learning to use this program and language recently to work with ecological data and analyses.  A few weeks ago, I began a project where I wrote code in R to parse text.  We had species descriptions for over 9,000 plant species in South Africa, and I started to learn how to use regular expressions to extract usable bits of data from the paragraph descriptions (for example, the height of the plants or the months when they flower).
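
As a rough illustration of that kind of text parsing (this is not the code from the actual project), here is a minimal sketch in base R. The description string and both patterns are made up for the example; real species descriptions would almost certainly need more forgiving patterns.

```r
# Hypothetical description; the wording of the real data set is not shown in this post.
desc <- "Erect shrub to 1.5 m. Leaves ovate. Flowers white, Aug.-Oct."

# Height: a number (optionally with decimals) followed by a unit, cm or m.
height <- regmatches(desc, regexpr("[0-9]+(\\.[0-9]+)? ?(cm|m)\\b", desc, perl = TRUE))

# Flowering months: every three-letter month abbreviation found in the text.
# month.abb is R's built-in vector "Jan", "Feb", ..., "Dec".
month_pat <- paste(month.abb, collapse = "|")
months <- regmatches(desc, gregexpr(month_pat, desc))[[1]]

print(height)
print(months)
```

regexpr() finds the first match and gregexpr() finds all of them; regmatches() then pulls the matched text out of the string.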

While I was learning how to manipulate text in this way, my mind wandered to imagining other possibilities for this type of code.  And that's how the idea was born.

Honestly, I may never have followed up on the idea if I hadn't shared it with my friend Jared, poet and editor of the online magazine The Jivin' Ladybug.  Jared got really, really excited, which got me even more excited, which meant that I went home after talking with him and stayed up late the next two nights, working on code.

In order to get syllable counts for words, I downloaded a file of 185,000 words with hyphenation marks from the Moby Project, which is a public-domain lexicon.  I was very grateful to find something I could freely use to let me easily calculate syllables.
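
The idea of counting syllables from hyphenation marks can be sketched like this. The three entries below are made up in the style of the word list (they are not copied from the file), and "\u00a5" is simply a portable way of writing the ¥ character that marks the syllable breaks.

```r
# Made-up entries in the style of the hyphenated list; "\u00a5" is the ¥ mark.
entries <- c("hin\u00a5der", "glaze", "small\u00a5hold\u00a5ing")

# Syllables = number of hyphenation marks + 1.
# gregexpr() returns -1 when nothing matches, so only positive positions are counted.
marks <- gregexpr("\u00a5", entries, fixed = TRUE)
syllables <- sapply(marks, function(m) sum(m > 0)) + 1

# The plain word is the entry with the marks removed.
words <- gsub("\u00a5", "", entries, fixed = TRUE)

print(data.frame(word = words, syllables = syllables))
```

Counting only the positive match positions means a word with no marks comes out as one syllable without any special handling.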

And that was all I needed in order to generate random-word haiku.

The code I wrote starts out by selecting a random word and then checks the syllable count.  If adding that word to the first line won't make the syllable count for the line go over 5, it adds the word to the line.  Then it does it again, until there are 5 syllables in the first line.  Then it does the same thing for the second line (only with 7 syllables instead of 5), and the third line.
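
The same procedure can be written once as a small function and reused for all three lines. This is only a sketch under assumptions: the five-word lexicon and the name build_line are invented for the example, and the full program further down works on the whole word list instead.

```r
# Tiny stand-in lexicon (hypothetical); the real program draws from the full word list.
lexicon <- data.frame(
  word = c("glaze", "hinder", "smallholding", "urushiol", "jabbing"),
  syll = c(1, 2, 3, 4, 2),
  stringsAsFactors = FALSE
)

# Keep drawing random words; accept a word only if it still fits in the line.
build_line <- function(lexicon, target) {
  total <- 0
  chosen <- character(0)
  while (total != target) {
    i <- sample(nrow(lexicon), 1)
    if (total + lexicon$syll[i] <= target) {
      total <- total + lexicon$syll[i]
      chosen <- c(chosen, lexicon$word[i])
    }
  }
  paste(chosen, collapse = " ")
}

# A haiku is three lines of 5, 7, and 5 syllables.
haiku <- sapply(c(5, 7, 5), function(n) build_line(lexicon, n))
cat(haiku, sep = "\n")
```

The loop can always finish here because the lexicon contains a one-syllable word; a word list with no way of filling the last gap would keep it running forever.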

Here is another random-word haiku, the second one that the program wrote:

urushiol glaze
smallholding jabbing Pliny
hypostyle hinder

For those of you who are familiar with R or other programming languages, the code used to generate random-word haiku is after the cut.  For the rest of you, all that junk after 'Read More' is just the stuff that tells the computer to do what I described a couple of paragraphs above.


Code in R to Generate Random Haiku

########### initial processing of word list  #############

mhyph=read.table("mhyph.txt", sep="\t") #This is the Moby Hyphenator downloaded from the Moby Project

## formatting issues
mhyph=as.matrix(mhyph)
mhyph=strsplit(mhyph, '\\n')
mhyph=matrix(unlist(mhyph))
mhyph=as.vector(unique(mhyph))

## get syllable count: number of hyphenation marks (¥, space, or hyphen) plus 1
## gregexpr returns -1 when there is no mark, so count only positive match positions;
## this also gives 1-syllable words a count of 1 without a separate loop
count=gregexpr("[¥ -]", mhyph)
count=sapply(count, function(m) sum(m>0))+1
mhyph=as.matrix(as.data.frame(cbind(mhyph, count)))

## get words
mhyph=as.data.frame(mhyph)
mhyph$word=gsub("¥", "", mhyph$mhyph)

## get rid of acronyms
mhyph=subset(mhyph, grepl("^[A-Z]+$", mhyph)==FALSE)

mhyph=as.matrix(mhyph)

########### random haiku generation ##############

line1=0
line2=0
line3=0

haiku1=""
haiku2=""
haiku3=""

while (line1!=5) {
x=sample(nrow(mhyph), size=1) #random row; nrow() avoids hard-coding the length of the word list
xsyll=as.numeric(mhyph[x, 2])
xword=mhyph[x, 3]
if(line1+xsyll<=5){
line1=line1+xsyll
haiku1=paste(haiku1, xword, sep=" ")
}
}

while (line2!=7) {
x=sample(nrow(mhyph), size=1) #random row; nrow() avoids hard-coding the length of the word list
xsyll=as.numeric(mhyph[x, 2])
xword=mhyph[x, 3]
if(line2+xsyll<=7){
line2=line2+xsyll
haiku2=paste(haiku2, xword, sep=" ")
}
}

while (line3!=5) {
x=sample(nrow(mhyph), size=1) #random row; nrow() avoids hard-coding the length of the word list
xsyll=as.numeric(mhyph[x, 2])
xword=mhyph[x, 3]
if(line3+xsyll<=5){
line3=line3+xsyll
haiku3=paste(haiku3, xword, sep=" ")
}
}

print(haiku1)
print(haiku2)
print(haiku3)

2 comments:

  1. I think your next step is going to have to be adding part-of-speech tagging so you can have it choose words that make more sense. I don't know about R but there are a few libraries for that sort of thing in Java.

    You could dig around Sourceforge.org; it's a website for open source projects, and you may find something useful there.

    I don't want to dampen your enthusiasm, but this has been done before a few times. http://www.randomhaiku.com/ comes to mind. I think the majority of them were written by Comp Sci people who, once they got poems with proper grammar and subject-verb agreement, probably left it at that. I don't think they were as interested in exploring the questions about meaning as you are. In that respect I think you have a shot at contributing something new. Especially if you can get your program to spit out something a little deeper than say:

    following my book,
    Japan pondered my elbows
    behind my luggage
    (from the link I posted above)

    Automatic text generation is an interesting problem, and Comp Sci people seem to love doing a half-baked job at it. Some guys at MIT wrote a Comp Sci paper generator and even got a randomly generated paper accepted to SCI 2005, which isn't really a legit conference but still. You can check it out at: http://pdos.csail.mit.edu/scigen/

    Anyway good luck with it.

  2. Thanks for the comments, the resources, and the well-wishes!

    You're right, it's definitely not a unique idea, but I'm hoping that I can poke at it a little farther (or at least a little differently) than my predecessors. (If it weren't easy enough for someone to have done it already, I wouldn't know enough about coding yet to be able to do it myself!)

    I have a lot of tricks up my sleeve to take this farther than sticking random words together. Some I have done already, but in this blog I want to spend a little time focusing on each new step of the program before moving on to the next. I plan to continue to give the program freedom with syntax, though, for several reasons which I'll probably devote an entry to at some point. Even so, I will definitely poke around Sourceforge to see if any additional resources exist. There are a few particular things I have in mind to look for...

    (I actually really like your random haiku. My mind reads meaning into it, giving me a clear image of walking through a Tokyo airport while painfully aware of one's foreignness...)
