Tuesday, March 24, 2009

Portfolio 6

Portfolio 6 involves running hierarchical clustering on movie data from imdb.com. The Python code I used comes from the chapter 3 code of PCI (Programming Collective Intelligence). I took the .txt file containing all the movie data, pasted it into Excel, and made some changes to the data. After that, I pasted it back into Notepad and saved it as movies.txt.

I chose to cluster the movies in the .txt file by the rating they received. I may have made a mistake somewhere, because the dendrogram came out in alphabetical order. Even so, I understand the concept behind hierarchical clustering: start with every item in its own group, merge the two most similar groups, and repeat until all the data is in a single cluster. This is how I got my dendrogram.

I first ran the command:

rows,columns,data=clusters.readfile(r'C:\Python26\Lib\movies.txt')

Followed by:

clust=clusters.hcluster(data)

This command took quite a while to run; I had to wait around 30 minutes for it to finish.

Hierarchical clustering takes the data and builds a hierarchy of groups. It repeatedly merges the two most similar groups until only one group remains.
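To make the merging loop concrete, here is a minimal sketch of the idea in plain Python. It is not the PCI code: the function names, the Euclidean distance measure, the size-weighted averaging of merged groups, and the five sample ratings are all assumptions I made up for illustration.

```python
def distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def hcluster_sketch(vectors):
    """Merge the two closest groups until only one is left.

    Returns the merges in the order they happened, as
    (members_of_first_group, members_of_second_group, distance).
    """
    # Every item starts out in its own group.
    groups = [{"members": [i], "center": list(v)} for i, v in enumerate(vectors)]
    merges = []
    while len(groups) > 1:
        # Compare every pair of groups and remember the closest pair.
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = distance(groups[i]["center"], groups[j]["center"])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        a, b = groups[i], groups[j]
        na, nb = len(a["members"]), len(b["members"])
        # The merged group's center is the size-weighted average of both centers.
        center = [(x * na + y * nb) / float(na + nb)
                  for x, y in zip(a["center"], b["center"])]
        merges.append((sorted(a["members"]), sorted(b["members"]), d))
        # Delete the higher index first so the lower index stays valid.
        del groups[j]
        del groups[i]
        groups.append({"members": a["members"] + b["members"], "center": center})
    return merges


# Hypothetical ratings for five movies (made-up numbers, not from movies.txt).
ratings = [[8.1], [8.3], [4.0], [4.4], [6.5]]
merges = hcluster_sketch(ratings)
for first, second, d in merges:
    print(first, second, round(d, 2))
```

Because every pass compares every pair of remaining groups, the amount of work grows very quickly with the number of rows, which fits with the long wait I had on the full movie file.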

I then printed out the resulting dendrogram:

clusters.printclust(clust,labels=rows)
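As a rough picture of what printing a cluster tree as text involves, here is a small sketch. It assumes the tree is stored as nested tuples, with a dash marking each merge and indentation showing depth; the function name, the tuple layout, and the movie names are hypothetical and not taken from the PCI code.

```python
def format_tree(node, depth=0):
    """Return the lines of an indented text dendrogram.

    A node is either a label (a string) or a (left, right) tuple
    standing for a merge of two smaller groups.
    """
    indent = "  " * depth
    if isinstance(node, tuple):
        # A merge point: print a dash, then both children one level deeper.
        lines = [indent + "-"]
        for child in node:
            lines.extend(format_tree(child, depth + 1))
        return lines
    # A leaf: print the label itself.
    return [indent + str(node)]


# Hypothetical tree: Movie A and Movie B merged first, then joined by Movie C.
tree = (("Movie A", "Movie B"), "Movie C")
lines = format_tree(tree)
print("\n".join(lines))
```

Items that were merged early end up deeply indented next to each other, so similar movies should appear side by side in the printout.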

This is a small section of the dendrogram:



This is clearly incorrect.

I decided to try again using different fields from the .txt file, and found that some fields work better than others. I ran the same Python commands as before, but this time on a file called movies4.txt. This file contained the movie titles, the ratings they received, and a field called r1 (I'm not sure what it means). This time I got a more convincing-looking dendrogram.

Here is a piece of that dendrogram:


