dw Recommendation Task02: Collaborative Filtering


$$
sim_{uv} = \cos(u,v) = \frac{u \cdot v}{|u| \cdot |v|}
$$
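A minimal NumPy sketch of this formula; the two rating vectors are illustrative inputs, not part of the original walkthrough:

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Two users' ratings over the same four items
print(cosine_sim([5, 3, 4, 4], [3, 1, 2, 3]))   # roughly 0.975: the raw ratings look very similar
```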

Pearson correlation coefficient (which mean-centers each user's ratings before taking the cosine):

$$
sim(u,v) = \frac{\sum_{i \in I}(r_{ui} - \bar r_u)(r_{vi} - \bar r_v)}{\sqrt{\sum_{i \in I}(r_{ui} - \bar r_u)^2}\sqrt{\sum_{i \in I}(r_{vi} - \bar r_v)^2}}
$$
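A sketch of the Pearson version on the same pair of vectors (they happen to be Alice's and user 1's ratings of items A–D from the table used later). Mean-centering removes each user's rating bias, so the similarity drops from about 0.975 to about 0.85, and the manual result matches NumPy's built-in `corrcoef`:

```python
import numpy as np

def pearson_sim(u, v):
    """Pearson correlation: cosine similarity of the mean-centered rating vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    u, v = u - u.mean(), v - v.mean()
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(pearson_sim([5, 3, 4, 4], [3, 1, 2, 3]))          # ~0.85 after removing each user's rating bias
print(np.corrcoef([5, 3, 4, 4], [3, 1, 2, 3])[0, 1])    # same value from NumPy's built-in
```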

UserCF

In the first row above, the similarities between Alice and user 1, user 2, user 3, and user 4 are 0.85, 0.7, 0, and -0.79. So with n = 2, the two users most similar to Alice are user 1 (similarity 0.85) and user 2 (similarity 0.7).
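Plugging these two neighbours into the standard UserCF prediction rule (the target user's mean rating plus a similarity-weighted average of the neighbours' mean-centered ratings of item E) gives a quick hand estimate. The means below come from the rating table defined in the code that follows, and the similarities are the rounded values above, so the code's output may differ slightly:

$$
P_{Alice,E} = \bar{R}_{Alice} + \frac{0.85 \times (3 - 2.4) + 0.7 \times (5 - 3.8)}{0.85 + 0.7} = 4 + \frac{0.51 + 0.84}{1.55} \approx 4.87
$$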

```python
import numpy as np
import pandas as pd

# Define the dataset, i.e. the rating table above. A dict of dicts is used because
# real rating data is extremely sparse; a dense table like this small example is rare.
def loadData():
    items = {'A': {1: 5, 2: 3, 3: 4, 4: 3, 5: 1},
             'B': {1: 3, 2: 1, 3: 3, 4: 3, 5: 5},
             'C': {1: 4, 2: 2, 3: 4, 4: 1, 5: 5},
             'D': {1: 4, 2: 3, 3: 3, 4: 5, 5: 2},
             'E': {2: 3, 3: 5, 4: 4, 5: 1}}
    users = {1: {'A': 5, 'B': 3, 'C': 4, 'D': 4},
             2: {'A': 3, 'B': 1, 'C': 2, 'D': 3, 'E': 3},
             3: {'A': 4, 'B': 3, 'C': 4, 'D': 3, 'E': 5},
             4: {'A': 3, 'B': 3, 'C': 1, 'D': 5, 'E': 4},
             5: {'A': 1, 'B': 5, 'C': 5, 'D': 2, 'E': 1}}
    return items, users

items, users = loadData()
item_df = pd.DataFrame(items).T
user_df = pd.DataFrame(users).T
```
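For orientation, printing `user_df` shows the target of both methods: the only missing cell is Alice's (user 1's) rating of item E, which pandas fills with NaN. The printout below is a sketch of the expected output:

```python
print(user_df)
#      A    B    C    D    E
# 1  5.0  3.0  4.0  4.0  NaN   <- the cell UserCF and ItemCF will predict
# 2  3.0  1.0  2.0  3.0  3.0
# 3  4.0  3.0  4.0  3.0  5.0
# 4  3.0  3.0  1.0  5.0  4.0
# 5  1.0  5.0  5.0  2.0  1.0
```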

"""计算用户相似性矩阵""" similarity_matrix = pd.DataFrame(np.zeros((len(users), len(users))), index=[1, 2, 3, 4, 5], columns=[1, 2, 3, 4, 5]) # 遍历每条用户-物品评分数据 for userID in users: for otheruserId in users: vec_user = [] vec_otheruser = [] if userID != otheruserId: for itemId in items: # 遍历物品-用户评分数据 itemRatings = items[itemId] # 这也是个字典 每条数据为所有用户对当前物品的评分 if userID in itemRatings and otheruserId in itemRatings: # 说明两个用户都对该物品评过分 vec_user.append(itemRatings[userID]) vec_otheruser.append(itemRatings[otheruserId]) # 这里可以获得相似性矩阵(共现矩阵) similarity_matrix[userID][otheruserId] = np.corrcoef(np.array(vec_user), np.array(vec_otheruser))[0][1] #similarity_matrix[userID][otheruserId] = cosine_similarity(np.array(vec_user), np.array(vec_otheruser))[0][1] """计算前n个相似的用户""" n = 2 similarity_users = similarity_matrix[1].sort_values(ascending=False)[:n].index.tolist() # [2, 3] 也就是用户1和用户2 """计算最终得分""" base_score = np.mean(np.array([value for value in users[1].values()])) weighted_scores = 0. corr_values_sum = 0. for user in similarity_users: # [2, 3] corr_value = similarity_matrix[1][user] # 两个用户之间的相似性 mean_user_score = np.mean(np.array([value for value in users[user].values()])) # 每个用户的打分平均值 weighted_scores += corr_value * (users[user]['E']-mean_user_score) # 加权分数 corr_values_sum += corr_value final_scores = base_score + weighted_scores / corr_values_sum print('用户Alice对物品5的打分: ', final_scores) user_df.loc[1]['E'] = final_scores user_df


ItemCF

$$
P_{Alice,\,item\,5} = \bar{R}_{item\,5} + \frac{\sum_{k=1}^{2}\left(S_{item\,5,\,item\,k}\left(R_{Alice,\,item\,k} - \bar{R}_{item\,k}\right)\right)}{\sum_{k=1}^{2} S_{item\,k,\,item\,5}} = \frac{13}{4} + \frac{0.97 \times (5 - 3.2) + 0.58 \times (4 - 3.4)}{0.97 + 0.58} = 4.6
$$

```python
"""Compute the item similarity matrix"""
similarity_matrix = pd.DataFrame(np.ones((len(items), len(items))),
                                 index=['A', 'B', 'C', 'D', 'E'], columns=['A', 'B', 'C', 'D', 'E'])

# Iterate over every pair of items
for itemId in items:
    for otheritemId in items:
        vec_item = []        # ratings of the current item by users who rated both
        vec_otheritem = []   # ratings of the other item by the same users
        # userRatingPairCount = 0   # number of users who rated both items
        if itemId != otheritemId:                 # different items
            for userId in users:                  # iterate over the user-item ratings
                userRatings = users[userId]       # dict: this user's ratings of every item
                if itemId in userRatings and otheritemId in userRatings:   # the user rated both items
                    # userRatingPairCount += 1
                    vec_item.append(userRatings[itemId])
                    vec_otheritem.append(userRatings[otheritemId])
            # Fill in the similarity (co-occurrence) matrix
            similarity_matrix.loc[itemId, otheritemId] = np.corrcoef(np.array(vec_item), np.array(vec_otheritem))[0][1]
            # similarity_matrix.loc[itemId, otheritemId] = cosine_similarity(np.array(vec_item), np.array(vec_otheritem))[0][1]
```
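As a sanity check on the 0.97 and 0.58 used in the hand calculation, both similarities can be read out of the computed matrix, or equivalently out of pandas' built-in pairwise correlation. This snippet is an illustration, not part of the original code:

```python
# The hand calculation uses S(E, A) ≈ 0.97 and S(E, D) ≈ 0.58
print(similarity_matrix.loc['A', 'E'], similarity_matrix.loc['D', 'E'])

# Equivalent one-liner: pairwise Pearson correlation over the user ratings
print(item_df.T.corr().loc[['A', 'D'], 'E'])
```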


"""得到与物品5相似的前n个物品""" n = 2 similarity_items = similarity_matrix['E'].sort_values(ascending=False)[:n].index.tolist() # ['A', 'D'] """计算最终得分""" base_score = np.mean(np.array([value for value in items['E'].values()])) weighted_scores = 0. corr_values_sum = 0. for item in similarity_items: # ['A', 'D'] corr_value = similarity_matrix['E'][item] # 两个物品之间的相似性 mean_item_score = np.mean(np.array([value for value in items[item].values()])) # 每个物品的打分平均值 weighted_scores += corr_value * (users[1][item]-mean_item_score) # 加权分数 corr_values_sum += corr_value final_scores = base_score + weighted_scores / corr_values_sum print('用户Alice对物品5的打分: ', final_scores) user_df.loc[1]['E'] = final_scores user_df

