這篇文章主要介紹了SQL如何快速實(shí)現(xiàn)UCF,具有一定借鑒價(jià)值,感興趣的朋友可以參考下,希望大家閱讀完這篇文章之后大有收獲,下面讓小編帶著大家一起了解一下。
網(wǎng)站建設(shè)公司,為您提供網(wǎng)站建設(shè),網(wǎng)站制作,網(wǎng)頁(yè)設(shè)計(jì)及定制網(wǎng)站建設(shè)服務(wù),專注于成都定制網(wǎng)頁(yè)設(shè)計(jì),高端網(wǎng)頁(yè)制作,對(duì)白烏魚等多個(gè)行業(yè)擁有豐富的網(wǎng)站建設(shè)經(jīng)驗(yàn)的網(wǎng)站建設(shè)公司。專業(yè)網(wǎng)站設(shè)計(jì),網(wǎng)站優(yōu)化推廣哪家好,專業(yè)網(wǎng)站推廣優(yōu)化,H5建站,響應(yīng)式網(wǎng)站。
select uid1,uid2,sim from ( select uid1 ,uid2 ,cnt12 / sqrt(cnt1*cnt2) sim ,row_number() over(partition by uid1 order by cnt12 / sqrt(cnt1*cnt2) desc) sim_rn from ( select a.uid uid1 ,b.uid uid2 ,count(a.iid) cnt12 from tb_behavior a join tb_behavior b on a.iid = b.iid where a.uid <> b.uid group by a.uid,b.uid ) a12 join (select uid,count(iid) cnt1 from tb_behavior group by uid) a1 on a12.uid1 = a1.uid join (select uid,count(iid) cnt2 from tb_behavior group by uid) a2 on a12.uid1 = a2.uid ) tb_neighbour where sim > 0.1 and sim_rn <= 30
讀者實(shí)現(xiàn)的話只需要把上面的tb_behavior表替換成自己業(yè)務(wù)的用戶行為即可;iid,uid分別對(duì)應(yīng)物品id和用戶id;
根據(jù)共現(xiàn)相似度,即共同喜好的物品個(gè)數(shù)比上各自喜好物品總數(shù)乘積取平方;最后截?cái)嘤脩糇钕嗨频那?0個(gè)鄰居作為推薦的依據(jù)。
上面構(gòu)造了鄰居表,下面就是根據(jù)鄰居的喜好為用戶推薦了,具體sql如下:
select uid1,iid from ( select uid1 ,iid ,max(sim) score ,row_number() over(partition by uid1 order by max(sim) desc) user_rn from tb_neighbour a12 join (select uid,iid from tb_behavior) a2 on a12.uid2 = a2.uid join (select uid,collect_set(iid) iids1 from tb_behavior group by uid) a1 on a12.uid1 = a1.uid where not array_contaions(iids1,a2.iid) group by uid1,iid ) tb_rec where user_rn <= 500
這里說(shuō)明下包括上面的top30鄰居和用戶top500的最大推薦列表都是工程優(yōu)化,截?cái)喙?jié)約些存儲(chǔ);具體讀者可以根據(jù)自己業(yè)務(wù)需要進(jìn)行設(shè)置;
然后大概說(shuō)下各個(gè)表的含義:a1表是用戶已消費(fèi)過(guò)的物品,a2表是用戶每個(gè)鄰居喜好的物品;那么也就是說(shuō)從鄰居喜好的物品中過(guò)濾掉已經(jīng)消費(fèi)的
物品整體根據(jù)共現(xiàn)相似度進(jìn)行排序。
但思路很簡(jiǎn)單、實(shí)際作者開發(fā)中總會(huì)遇到各種各樣的問(wèn)題,下面就撿幾個(gè)主要的和大家一起討論下:
1.join引起的數(shù)據(jù)傾斜問(wèn)題:tb_neighbour表很大,往往熱點(diǎn)物品會(huì)占據(jù)80%的曝光和消費(fèi)記錄,如何解決?
2.增量更新問(wèn)題:上面的框架,tb_behavior表每次都是全量計(jì)算,是否能改造成增量更新鄰居表和推薦結(jié)果,并減少計(jì)算時(shí)間呢?
先思考問(wèn)題1,既然我們目的是求相似鄰居,物品join只是為了關(guān)聯(lián)上一組用戶對(duì),那自然的想法是可以根據(jù)feed做近似采樣、相似度精度也幾乎無(wú)損失。
下面我試著實(shí)現(xiàn)下這種思路:
with tb_behavior_sample as ( select uid,iid from ( select uid ,iid ,row_number() over(partition by iid order by rand()) feed_rn from tb_behavior ) bh where feed_rn <= 50000 ) select uid1,uid2,sim from ( select uid1 ,uid2 ,cnt12 / sqrt(cnt1*cnt2) sim ,row_number() over(partition by uid1 order by cnt12 / sqrt(cnt1*cnt2) desc) sim_rn from ( select a.uid uid1 ,b.uid uid2 ,count(a.iid) cnt12 from tb_behavior_sample a join tb_behavior_sample b on a.iid = b.iid where a.uid <> b.uid group by a.uid,b.uid ) a12 join (select uid,count(iid) cnt1 from tb_behavior group by uid) a1 on a12.uid1 = a1.uid join (select uid,count(iid) cnt2 from tb_behavior group by uid) a2 on a12.uid1 = a2.uid ) tb_neighbour where sim > 0.1 and sim_rn <= 30
這里用了hive的with as語(yǔ)法,讀者可自行查閱,篇幅有限,就不展開了;feed_rn就是隨機(jī)采樣了50000條,實(shí)際操作時(shí)讀者可以先統(tǒng)計(jì)下item的分布、大概找到一個(gè)閾值;
比如取top10的item的出現(xiàn)次數(shù)作為閾值;那計(jì)算相似度時(shí)分子最多減小10,分母不變。這對(duì)大多數(shù)情況精度應(yīng)該足夠了,而且因?yàn)楸苊饬藬?shù)據(jù)傾斜,大大降低了計(jì)算時(shí)間。
問(wèn)題2是一個(gè)工程問(wèn)題,lambda架構(gòu)能使初始結(jié)果效果不錯(cuò),可直接上線灰度了;在此基礎(chǔ)上再加小時(shí)或者天增量;kappa架構(gòu)相對(duì)就比較繁瑣、需要一開始就設(shè)計(jì)增量流程。
精度方面也需要一定的累積;不過(guò)如何選擇,讀者可以根據(jù)自己的數(shù)據(jù)量和熟悉程度自行選擇;作者這里僅以kappa架構(gòu)說(shuō)明。
重新review上面sql,我們發(fā)現(xiàn)我們僅需要記錄下cnt12,cnt1,cnt2,iids1這些計(jì)算關(guān)鍵即可,其中iids2是用戶鄰居喜好的物品數(shù)組;數(shù)值類型可累加更新、
數(shù)組類型合并起來(lái)比較麻煩,一種解決方案是注冊(cè)UDF;這里采取另一種這種的方案:把iids1合并成字符串,過(guò)濾的時(shí)候再分割為字符串?dāng)?shù)組。
with tb_behavior_sample_incr as ( select uid,iid from ( select uid ,iid ,row_number() over(partition by iid order by rand()) feed_rn from tb_behavior_incr ) bh where feed_rn <= 50000 ) insert overwrite table tb_neighbour select uid1,uid2,sim from ( select uid1 ,uid2 ,sum(cnt12) / sqrt(sum(cnt1)*sum(cnt2)) sim ,row_number() over(partition by uid1 order by sum(cnt12) / sqrt(sum(cnt1)*sum(cnt2)) desc) sim_rn from ( select uid1,uid2,cnt12,cnt1,cnt2 from tb_neighbour union all select a.uid uid1 ,b.uid uid2 ,count(a.iid) cnt12 ,cnt1 ,cnt2 from tb_behavior_sample_incr a join tb_behavior_sample_incr b on a.iid = b.iid where a.uid <> b.uid group by a.uid,b.uid ) a12 join (select uid,count(iid) cnt1 from tb_behavior_incr group by uid) a1 on a12.uid1 = a1.uid join (select uid,count(iid) cnt2 from tb_behavior_incr group by uid) a2 on a12.uid1 = a2.uid group by uid1,uid2 ) tb_neighbour where sim > 0.1 and sim_rn <= 30
其中tb_behavior_sample_incr,tb_behavior_incr是相應(yīng)tb_behavior_sample,tb_behavior的增量表;使用union all和group by聚合相同用戶對(duì)的結(jié)果
kappa架構(gòu)初次計(jì)算即是增量,不斷累積每次增量的結(jié)果更新tb_neighbour;相當(dāng)于lambda初始全量計(jì)算的一種回放,直至追到最新的時(shí)間分區(qū)。
insert overwrite table tb_user_consume select uid,substring_index(concat_ws(",",collect_list(iids1)),",",10000) iids1 from ( select uid,concat_ws(",",collect_set(cast(iid as string))) iids1 from tb_behavior_incr union all select uid,iids1 from tb_user_consume ) a group by uid select uid1,iid from ( select uid1 ,iid ,max(sim) score ,row_number() over(partition by uid1 order by max(sim) desc) user_rn from tb_neighbour a12 join (select uid,cast(iid as string) iid from tb_behavior_incr) a2 on a12.uid2 = a2.uid join (select uid,split(iids1,",") iids1 from tb_user_consume) a1 on a12.uid1 = a1.uid where not array_contaions(iids1,a2.iid) group by uid1,iid ) tb_rec where user_rn <= 500
使用tb_user_consume緩存用戶最近消費(fèi)的前10000條記錄,將用戶鄰居最新喜好物品推薦給用戶。
感謝你能夠認(rèn)真閱讀完這篇文章,希望小編分享的“SQL如何快速實(shí)現(xiàn)UCF”這篇文章對(duì)大家有幫助,同時(shí)也希望大家多多支持創(chuàng)新互聯(lián),關(guān)注創(chuàng)新互聯(lián)行業(yè)資訊頻道,更多相關(guān)知識(shí)等著你來(lái)學(xué)習(xí)!