# update_database.py
import time
from ast import literal_eval

import numpy as np
import pandas as pd
import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer


def update_database():
    md = pd.read_csv('data/movies_metadata.csv')

    # In[26]:
    # Measuring time for performance improvements
    start = time.time()
    # Parse the stringified list of genre dicts and keep only the genre names
    md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(
        lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
    end = time.time()
    print("md Creation: " + str(end - start))

    # ## Content Based Recommender
    #
    # The recommender we built in the previous section suffers from some severe limitations. For one, it gives the same recommendations to everyone, regardless of the user's personal taste. If a person who loves romantic movies (and hates action) were to look at our Top 15 Chart, they probably wouldn't like most of the movies. If they were to go one step further and look at our charts by genre, they still wouldn't be getting the best recommendations.
    #
    # For instance, consider a person who loves *Dilwale Dulhania Le Jayenge*, *My Name is Khan* and *Kabhi Khushi Kabhi Gham*. One inference we can draw is that the person loves the actor Shahrukh Khan and the director Karan Johar. Even if they were to access the romance chart, they wouldn't find these as the top recommendations.
    #
    # To personalise our recommendations more, I am going to build an engine that computes similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked. Since we will be using movie metadata (or content) to build this engine, this is also known as **Content Based Filtering.**
    #
    # I will build two Content Based Recommenders based on:
    # * Movie Overviews and Taglines
    # * Movie Cast, Crew, Keywords and Genre
    #
    # Also, as mentioned in the introduction, I will be using a subset of all the movies available to us because of the limited computing power available to me.

    # In[27]:
    # Measuring time for performance improvements
    start = time.time()
    links_small = pd.read_csv('data/links_small.csv')
    # Keep only the non-null TMDB ids, as an integer Series
    links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
    end = time.time()
    print("links_small Creation: " + str(end - start))

    # In[28]:
    # Check the EDA Notebook for how and why I got these indices.
    md = md.drop([19730, 29503, 35587])

    # In[29]:
    md['id'] = md['id'].astype('int')

    # In[30]:
    # Measuring time for performance improvements
    start = time.time()
    # .copy() so the column assignments below modify an independent frame
    # instead of a slice of md (avoids pandas' SettingWithCopyWarning)
    smd = md[md['id'].isin(links_small)].copy()

    # We have **9099** movies available in our small movies metadata dataset, which is about 5 times smaller than our original dataset of 45000 movies.

    # ### Movie Description Based Recommender
    #
    # Let us first try to build a recommender using movie descriptions and taglines. We do not have a quantitative metric to judge our machine's performance, so this will have to be done qualitatively.

    # In[31]:
    smd['tagline'] = smd['tagline'].fillna('')
    smd['description'] = smd['overview'] + smd['tagline']
    smd['description'] = smd['description'].fillna('')
    end = time.time()
    print("smd Creation and Modification: " + str(end - start))

    # In[32]:
    # min_df=1 keeps every term, like the original min_df=0, which recent
    # scikit-learn releases may reject as an invalid integer value
    tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
    tfidf_matrix = tf.fit_transform(smd['description'].values.astype('U'))
    print(type(tfidf_matrix))
    scipy.sparse.save_npz('data/tfidf_matrix.npz', tfidf_matrix)

    # Writing smd to file for future use
    smd.to_csv("data/smd.txt")