>>> recommendations.getRecommendations(recommendations.critics, 'Toby')
[(3.3477895267131013, 'The Night Listener'), (2.8325499182641614, 'Lady in the Water'), (2.5309807037655649, 'Just My Luck')]
Here I got something different from the book. The movies are correct, but the scores are different:
>>> recommendations.getRecommendations(recommendations.critics, 'Toby', similarity=recommendations.sim_distance)
[(3.3235294117647061, 'The Night Listener'), (2.875, 'Lady in the Water'), (2.2272727272727271, 'Just My Luck')]
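As a cross-check, here is a minimal sketch of how a single score such as the one for 'Lady in the Water' is formed: each critic's rating is weighted by that critic's similarity to Toby, and the total is divided by the sum of the similarities. It assumes the Euclidean similarity is 1/(1 + sum of squared differences) and uses the ratings as I understand them from the book's critics dictionary. The names `ratings` and `weighted_score` are illustrative and are not part of recommendations.py.

```python
# Ratings as I understand the book's critics data (only the parts needed here).
ratings = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                  'Superman Returns': 3.5, 'You, Me, and Dupree': 2.5},
    'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
                     'Superman Returns': 5.0, 'You, Me, and Dupree': 3.5},
    'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
                         'Superman Returns': 3.5},
    'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                     'Superman Returns': 3.0, 'You, Me, and Dupree': 2.0},
    'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                      'Superman Returns': 5.0, 'You, Me, and Dupree': 3.5},
    'Toby': {'Snakes on a Plane': 4.5, 'Superman Returns': 4.0,
             'You, Me, and Dupree': 1.0},
}

def sim_distance(prefs, p1, p2):
    # Assumed form: 1 / (1 + sum of squared rating differences on shared items)
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    if not shared:
        return 0
    sum_of_squares = sum((prefs[p1][i] - prefs[p2][i]) ** 2 for i in shared)
    return 1.0 / (1 + sum_of_squares)

def weighted_score(prefs, person, item, similarity=sim_distance):
    # Similarity-weighted average of the other people's ratings for one item
    total = 0.0
    sim_sum = 0.0
    for other in prefs:
        if other == person or item not in prefs[other]:
            continue
        sim = similarity(prefs, person, other)
        if sim <= 0:
            continue
        total += prefs[other][item] * sim
        sim_sum += sim
    if sim_sum == 0:
        return None
    return total / sim_sum

score = weighted_score(ratings, 'Toby', 'Lady in the Water')
print(score)
```

Under these assumptions the score comes out at about 2.756 rather than the 2.875 shown above, which hints that the sim_distance used for the run above may not be computing exactly this formula.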
>>> movies=recommendations.transformPrefs(recommendations.critics)
>>> recommendations.topMatches(movies, 'Superman Returns')
[(0.65795169495976946, 'You, Me, and Dupree'), (0.48795003647426888, 'Lady in the Water'), (0.11180339887498941, 'Snakes on a Plane'), (-0.17984719479905439, 'The Night Listener'), (-0.46625240412015717, 'Just My Luck')]
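For readers who have not seen it, transformPrefs swaps people and items so that the same similarity functions can compare movies instead of critics. A minimal sketch of that idea follows; `transform_prefs` and the two-person `sample` are illustrative stand-ins, not the code from recommendations.py.

```python
def transform_prefs(prefs):
    # Turn {person: {item: rating}} into {item: {person: rating}}
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})
            result[item][person] = prefs[person][item]
    return result

# Tiny illustrative sample, not the full critics dictionary
sample = {
    'Toby': {'Snakes on a Plane': 4.5, 'Superman Returns': 4.0},
    'Lisa Rose': {'Superman Returns': 3.5},
}
flipped = transform_prefs(sample)
print(flipped['Superman Returns'])
```

After the flip, each movie maps to the people who rated it, which is why topMatches(movies, 'Superman Returns') returns movies rather than critics.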
For the next command, the book's output also includes (4.0, 'Michael Phillips'). I'm not sure whether I made a mistake or the book is incorrect.
>>> recommendations.getRecommendations(movies,'Just My Luck')
[(3.0, 'Jack Matthews')]
I had some small issues downloading the pydelicious API, and I also had to download feedparser. Even though this is a small part of the assignment, it was interesting and I really learned something about libraries. I got all the APIs through Google Code, which is an extremely cool site.
For this next part, it is important to note that the data is pulled in real time, so the answers I come up with will not match those in the book.
"user" is a random user that was pulled from the newly created dataset:
>>> fillItems(delusers)
>>> import random
>>> user=delusers.keys()[random.randint(0,len(delusers)-1)]
>>> user
u'mgraz'
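The indexing trick above relies on Python 2, where keys() returns a list. A small sketch of an alternative that should work on both Python 2 and Python 3 is to convert the keys to a list and use random.choice. The `delusers` dictionary here is a made-up stand-in for the one built by fillItems.

```python
import random

# Hypothetical stand-in for the delusers dictionary built by fillItems
delusers = {u'mgraz': {}, u'another_user': {}, u'third_user': {}}

# list(...) makes the keys indexable on Python 3 as well
user = random.choice(list(delusers))
print(user)
```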
When we try to get recommendations from the dataset:
>>> recommendations.getRecommendations(delusers,user)[0:10]
[(0.11886051080550095, u'http://www.ai-blog.net/archives/000158.html'), (0.091519318925998655, u'http://www.youtube.com/watch?v=zmRTGRbrATs&fms=18'), (0.091519318925998655, u'http://www.youtube.com/watch?v=ETN1px7i4KY'), (0.091519318925998655, u'http://www.windowsmoviemakers.net/PapaJohn/61/WinDV.aspx'), (0.091519318925998655, u'http://www.welcometohr.com/'), (0.091519318925998655, u'http://www.wearitwithpride.com/'), (0.091519318925998655, u'http://www.typographyserved.com/Gallery/Senior-Thesis-V--confess-online-confessions/109511'), (0.091519318925998655, u'http://www.theonion.com/content/node/28784?Revisit'), (0.091519318925998655, u'http://www.tgdaily.com/content/view/40785/140/'), (0.091519318925998655, u'http://www.spotify.com/en/')]
Here I got values that are extremely different from the book's. I checked all my code and it seems to be right. Maybe the output is not supposed to match the book exactly?
>>> itemsim=recommendations.calculateSimilarItems(recommendations.critics)
>>> itemsim
{'Lady in the Water': [(6.0, 'Superman Returns'), (4.0, 'Snakes on a Plane'), (2.0, 'The Night Listener'), (1.5, 'Just My Luck'), (0.0, 'You, Me, and Dupree')], 'Snakes on a Plane': [(8.0, 'You, Me, and Dupree'), (4.0, 'Lady in the Water'), (2.0, 'Superman Returns'), (1.0, 'The Night Listener'), (0, 'Just My Luck')], 'You, Me, and Dupree': [(9.5, 'Superman Returns'), (8.0, 'Snakes on a Plane'), (2.5, 'The Night Listener'), (0.0, 'Lady in the Water'), (0, 'Just My Luck')], 'Just My Luck': [(6.5, 'Superman Returns'), (5.0, 'The Night Listener'), (5.0, 'Snakes on a Plane'), (1.0, 'You, Me, and Dupree'), (0, 'Lady in the Water')], 'Superman Returns': [(9.5, 'You, Me, and Dupree'), (6.0, 'Lady in the Water'), (3.5, 'The Night Listener'), (2.0, 'Snakes on a Plane'), (0, 'Just My Luck')], 'The Night Listener': [(3.5, 'Superman Returns'), (2.5, 'You, Me, and Dupree'), (2.0, 'Lady in the Water'), (1.0, 'Snakes on a Plane'), (0, 'Just My Luck')]}
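One quick sanity check, offered as a sketch: if the similarity is 1/(1 + sum of squared differences), as I understand the book's sim_distance, every score must fall between 0 and 1, and a score of 0 should only appear when two items share no raters. The helper `out_of_range` is hypothetical, and `sample` copies a few entries from the output above.

```python
def out_of_range(itemsim, low=0.0, high=1.0):
    # Collect (item, other, score) triples whose score is outside [low, high]
    bad = []
    for item, matches in itemsim.items():
        for score, other in matches:
            if score < low or score > high:
                bad.append((item, other, score))
    return bad

# A few entries copied from the itemsim output above
sample = {
    'Lady in the Water': [(6.0, 'Superman Returns'),
                          (0.0, 'You, Me, and Dupree')],
    'The Night Listener': [(3.5, 'Superman Returns'),
                           (0, 'Just My Luck')],
}
bad = out_of_range(sample)
print(sorted(bad))
```

Scores such as 6.0 and 3.5 are flagged, which suggests the similarity function in use is returning something other than a value on the 0 to 1 scale.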
Here I had some difficulties:
>>> recommendations.getRecommendedItems(recommendations.critics,itemsim,'Toby')
* I did this section in two sittings. When I finished it during the second sitting, I kept getting a divide-by-zero error, and I could not find out why.
This part took a couple of days because of some syntax errors on my part.
>>> prefs=recommendations.loadMovieLens()
>>> prefs['87']
prefs['87'] returns a whole lot of movies, dates, and ratings. It was very satisfying once I fixed my errors.
>>> recommendations.getRecommendations(prefs, '87')[0:30]
This command also returns a whole lot of movies, dates, and ratings, this time based on the user-based recommendations.
>>> itemsim=recommendations.calculateSimilarItems(prefs,n=50)
This worked fine, but when I tried to run getRecommendedItems using prefs and itemsim, I got:
>>> recommendations.getRecommendedItems(prefs,itemsim,'87')[0:30]
Traceback (most recent call last):
  File "
  File "C:\Python26\lib\recommendations.py", line 152, in getRecommendedItems
    rankings=[(score/totalSim[item],item) for item,score in scores.items()]
ZeroDivisionError: float division
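The traceback points at the division by totalSim[item], so the error appears whenever the similarities collected for an item add up to zero. A minimal sketch of that last ranking step with a guard is shown below. The function name `rank_items` and the items 'A' and 'B' are made up for illustration, and skipping such items is one possible choice rather than the only fix.

```python
def rank_items(scores, total_sim):
    # Divide each weighted score by its similarity total,
    # skipping items whose total similarity is zero
    rankings = []
    for item, score in scores.items():
        if total_sim[item] == 0:
            continue
        rankings.append((score / total_sim[item], item))
    # Highest predicted rating first
    rankings.sort()
    rankings.reverse()
    return rankings

# Made-up totals: item 'B' has no similarity mass at all
scores = {'A': 6.0, 'B': 0.0}
total_sim = {'A': 2.0, 'B': 0.0}
print(rank_items(scores, total_sim))
```

The guard only hides the symptom. If many items end up with a zero total, the similarity function itself is worth checking.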
Exercise 1 in PCI:
I found the formula using Google, and I asked for help on the actual code.
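def tanimotoSim(prefs,p1,p2):
    # Collect the items both people have rated
    si={}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item]=1
    # No shared items means no similarity
    if len(si)==0:
        return 0
    # Tanimoto coefficient over the shared ratings: A.B / (|A|^2 + |B|^2 - A.B)
    dot=sum([prefs[p1][item]*prefs[p2][item] for item in si])
    sq1=sum([pow(prefs[p1][item],2) for item in si])
    sq2=sum([pow(prefs[p2][item],2) for item in si])
    denom=sq1+sq2-dot
    if denom==0:
        return 0
    # For non-negative ratings this is already a similarity between 0 and 1
    return float(dot)/denom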
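The Tanimoto score is often stated for sets, which suits data where an item is either present or absent, such as whether a user bookmarked a link. In that form it is the size of the intersection divided by the size of the union. A minimal sketch follows; `tanimoto_sets` and the example.com links are hypothetical.

```python
def tanimoto_sets(a, b):
    # |A intersect B| / |A union B|; 0.0 when both sets are empty
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return float(len(a & b)) / len(union)

# Made-up bookmark lists for two users
links1 = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']
links2 = ['http://example.com/b', 'http://example.com/c', 'http://example.com/d']
print(tanimoto_sets(links1, links2))
```

Here two of the four distinct links are shared, so the score is 0.5.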
Part 2 WEKA:
Weka was very easy to download. I went through section 10.1 of the book with the weather dataset. So far, Weka seems pretty straightforward.
I copied and pasted the dataset from the website into Notepad and saved it as a .arff file. I ran J48 on it in Weka. I'm assuming that everything that was produced is correct because it gave me no problems (dataset provided by Dr. Zacharski). Here is a snippet of the output:
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 235 77.5578 %
Incorrectly Classified Instances 68 22.4422 %
Kappa statistic 0.5443
Mean absolute error 0.1044
Root mean squared error 0.2725
Relative absolute error 52.0476 %
Root relative squared error 86.5075 %
Total Number of Instances 303
According to the output, J48 correctly classified 77.5578% of the instances and incorrectly classified only 22.4422%. In my inexperienced opinion, this is a fairly accurate method, especially with a root mean squared error of just 0.2725.
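The two percentages can be recomputed from the instance counts in the summary as a quick arithmetic check. This sketch only uses the counts printed by Weka above.

```python
# Counts taken from the Weka summary above
correct = 235
incorrect = 68
total = correct + incorrect

# Percentages of correctly and incorrectly classified instances
accuracy = 100.0 * correct / total
error_rate = 100.0 * incorrect / total
print("%.4f %.4f" % (accuracy, error_rate))
```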