Tuesday, April 7, 2009

Portfolio 7

The first step in working through chapter 4 of Programming Collective Intelligence (PCI) was to get a small set of pages to index. The author has provided such a set at http://kiwitobes.com/wiki. I fetched one of those pages with the following code:

>>> import urllib2
>>> c=urllib2.urlopen('http://kiwitobes.com/wiki/Programming_language.html')
>>> contents=c.read()
>>> print contents[0:50]


The actual crawler code uses the Beautiful Soup API. BeautifulSoup.py was easy to download from the Beautiful Soup website. I put BeautifulSoup.py in my working directory and ran:

>>> from BeautifulSoup import BeautifulSoup
>>> from urllib import urlopen
>>> soup=BeautifulSoup(urlopen('http://google.com'))
>>> soup.head.title
<title>Google</title>
>>> links=soup('a')
>>> len(links)
16
>>> links[0]
Images
>>> links[0].contents[0]
u'Images'

This, of course, just tests that BeautifulSoup.py works correctly.

Back to the crawling... I tried to find a website that would work for the crawler, but I could not find a single .html site that worked. Here is an example of what I tried:

>>> import searchengine
>>> pagelist=['http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm']
>>> crawler=searchengine.crawler('')
>>> crawler.crawl(pagelist)
Indexing http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm
Could not parse page http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm

I'm not sure what I am doing wrong. I tried a simple HTML website and it still would not crawl.
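
One way to narrow this down might be to print the underlying error instead of only the "Could not parse page" message. The sketch below assumes that message comes from a broad except clause that discards the real exception, which is a guess about searchengine.py and not something I have verified. The names index_page, parse_and_index, and broken_step are hypothetical stand-ins, and the code is written for Python 3 rather than the Python 2 used in the sessions above.

```python
import traceback

def index_page(url, parse_and_index):
    """Run one indexing step and report the real error instead of hiding it.

    parse_and_index is a hypothetical callable standing in for whatever
    the crawler does with a page; it is not part of searchengine.py.
    """
    try:
        parse_and_index(url)
        return None
    except Exception:
        # Capture the full traceback so the actual cause is visible.
        detail = traceback.format_exc()
        print("Could not parse page %s" % url)
        print(detail)
        return detail

def broken_step(url):
    # Stand-in failure used only to demonstrate the reporting.
    raise RuntimeError("simulated failure inside the indexing step")

report = index_page("http://example.com/page.htm", broken_step)
```

If the traceback points at something other than the HTML itself, the problem is probably not the choice of website.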

The important thing to take away here is that the crawler is supposed to go through the website and index what it finds. Once the information is indexed and stored, it can be queried, and we can discover interesting things about that website. Content-based ranking scores the pages against a query using criteria such as word frequency, document location, and word distance.
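
Since I could not get the crawler to build a real index, here is a minimal sketch of the three content-based metrics on a made-up set of pages. The page names, their text, and all function names are my own assumptions for illustration, not the code from PCI, and the sketch uses Python 3 rather than the Python 2 used in the sessions above.

```python
from itertools import product

# Hypothetical toy "index": page name -> list of words in page order.
pages = {
    "a.html": "python is a programming language and python is popular".split(),
    "b.html": "a functional programming language differs from python".split(),
    "c.html": "cooking with a python is not recommended".split(),
}

def positions(words, term):
    """Return every position at which term occurs in words."""
    return [i for i, w in enumerate(words) if w == term]

def word_frequency(words, query):
    """Total number of occurrences of the query words (higher is better)."""
    return sum(len(positions(words, q)) for q in query)

def document_location(words, query):
    """Sum of the first position of each query word (lower is better)."""
    return sum(positions(words, q)[0] for q in query)

def word_distance(words, query):
    """Smallest total gap between consecutive query words (lower is better)."""
    combos = product(*(positions(words, q) for q in query))
    return min(sum(abs(c[i] - c[i - 1]) for i in range(1, len(c))) for c in combos)

def score_pages(pages, query):
    """Score only the pages that contain every query word."""
    results = {}
    for name, words in pages.items():
        if all(q in words for q in query):
            results[name] = {
                "frequency": word_frequency(words, query),
                "location": document_location(words, query),
                "distance": word_distance(words, query),
            }
    return results

scores = score_pages(pages, ["python", "language"])
for name in sorted(scores):
    print(name, scores[name])
```

Note that frequency rewards large values while location and distance reward small ones, so the raw numbers cannot be compared directly.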

Once these scores are obtained, it is common to normalize them so that they all fall on the same scale (for example, 0 to 1) and are easy to read and compare.
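
As a rough sketch of what normalization could look like, the function below rescales a dictionary of raw scores into the range 0 to 1 and lets the caller say whether small raw values are the good ones. It is written in the spirit of the idea described above; the function name, the parameter small_is_better, and the sample numbers are my own assumptions, and the code is Python 3.

```python
def normalize_scores(scores, small_is_better=False):
    """Rescale raw scores so they fall between 0 and 1, higher being better."""
    if not scores:
        return {}
    tiny = 0.00001  # guards against division by zero
    if small_is_better:
        # The smallest raw value maps to 1; larger raw values shrink toward 0.
        best = max(tiny, min(scores.values()))
        return {name: best / max(tiny, raw) for name, raw in scores.items()}
    # The largest raw value maps to 1; smaller raw values scale down.
    best = max(scores.values())
    if best == 0:
        best = tiny
    return {name: float(raw) / best for name, raw in scores.items()}

# Hypothetical raw scores for two pages.
frequency = normalize_scores({"a.html": 3, "b.html": 2})
distance = normalize_scores({"a.html": 2, "b.html": 3}, small_is_better=True)
print(frequency)
print(distance)
```

After this step, scores from different metrics can be weighted and added together, because they all share the same scale and direction.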
