Before studying any individual Spark topic, make sure you understand Spark itself correctly; see: Understanding Spark Correctly
This article describes the Java API for Spark key-value RDDs in detail.
I. Ways to create a key-value RDD
1. JavaSparkContext.parallelizePairs
// sc is an existing JavaSparkContext; Tuple2 is scala.Tuple2
JavaPairRDD<String, Integer> javaPairRDD =
        sc.parallelizePairs(Arrays.asList(new Tuple2<>("test", 3), new Tuple2<>("kkk", 3)));
// Result: [(test,3), (kkk,3)]
System.out.println("javaPairRDD = " + javaPairRDD.collect());
2. Using keyBy
public class User implements Serializable {
    private String userId;
    private Integer amount;

    public User(String userId, Integer amount) {
        this.userId = userId;
        this.amount = amount;
    }

    // Getter used by the keyBy function below
    public String getUserId() {
        return userId;
    }

    @Override
    public String toString() {
        return "User{" + "userId='" + userId + '\'' + ", amount=" + amount + '}';
    }
}

JavaRDD<User> userJavaRDD = sc.parallelize(Arrays.asList(new User("u1", 20)));
// keyBy derives the key from each element and keeps the element itself as the value
JavaPairRDD<String, User> userJavaPairRDD = userJavaRDD.keyBy(new Function<User, String>() {
    @Override
    public String call(User user) throws Exception {
        return user.getUserId();
    }
});
// Result: [(u1,User{userId='u1', amount=20})]
System.out.println("userJavaPairRDD = " + userJavaPairRDD.collect());
3. Using zip
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 1, 2, 3, 5, 8, 13));
// Zipping two RDDs is another way to create a key-value RDD
JavaPairRDD<Integer, Integer> zipPairRDD = rdd.zip(rdd);
// Result: [(1,1), (1,1), (2,2), (3,3), (5,5), (8,8), (13,13)]
System.out.println("zipPairRDD = " + zipPairRDD.collect());
4. Using groupBy
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 1, 2, 3, 5, 8, 13));
Function<Integer, Boolean> isEven = new Function<Integer, Boolean>() {
    @Override
    public Boolean call(Integer x) throws Exception {
        return x % 2 == 0;
    }
};
// Group the even and odd numbers, producing a key-value RDD
JavaPairRDD<Boolean, Iterable<Integer>> oddsAndEvens = rdd.groupBy(isEven);
// Result: [(false,[1, 1, 3, 5, 13]), (true,[2, 8])]
System.out.println("oddsAndEvens = " + oddsAndEvens.collect());
// Result: 1 (this value depends on the default parallelism of sc)
System.out.println("oddsAndEvens.partitions.size = " + oddsAndEvens.partitions().size());

// The second argument sets the number of partitions of the grouped RDD
oddsAndEvens = rdd.groupBy(isEven, 2);
// Result: [(false,[1, 1, 3, 5, 13]), (true,[2, 8])]
System.out.println("oddsAndEvens = " + oddsAndEvens.collect());
// Result: 2
System.out.println("oddsAndEvens.partitions.size = " + oddsAndEvens.partitions().size());
II. combineByKey
JavaPairRDD<String, Integer> javaPairRDD = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("coffee", 1), new Tuple2<>("coffee", 2),
        new Tuple2<>("panda", 3), new Tuple2<>("coffee", 9)), 2);

// Applied to a value when its key is seen for the first time within a partition
Function<Integer, Tuple2<Integer, Integer>> createCombiner =
        new Function<Integer, Tuple2<Integer, Integer>>() {
    @Override
    public Tuple2<Integer, Integer> call(Integer value) throws Exception {
        return new Tuple2<>(value, 1);
    }
};
// Applied to a value when createCombiner has already been applied to its key in this partition
Function2<Tuple2<Integer, Integer>, Integer, Tuple2<Integer, Integer>> mergeValue =
        new Function2<Tuple2<Integer, Integer>, Integer, Tuple2<Integer, Integer>>() {
    @Override
    public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> acc, Integer value) throws Exception {
        return new Tuple2<>(acc._1() + value, acc._2() + 1);
    }
};
// Applied when the results of different partitions have to be aggregated
Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>> mergeCombiners =
        new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
    @Override
    public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> acc1, Tuple2<Integer, Integer> acc2) throws Exception {
        return new Tuple2<>(acc1._1() + acc2._1(), acc1._2() + acc2._2());
    }
};

JavaPairRDD<String, Tuple2<Integer, Integer>> combineByKeyRDD =
        javaPairRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
// Result: [(coffee,(12,3)), (panda,(3,1))], i.e. (sum, count) per key
System.out.println("combineByKeyRDD = " + combineByKeyRDD.collect());
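The data flow of combineByKey is as follows: within each partition, createCombiner is applied to the first value seen for a key and mergeValue to every later value of that key; mergeCombiners then merges the per-partition results that belong to the same key.
For a detailed explanation of how combineByKey works, see: Spark Core RDD API: Principles Explained in Detail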
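The combineByKey data flow can also be traced without a cluster. The following is a minimal plain-Java sketch, not Spark code: the class `CombineByKeySketch`, its helper methods, and the use of `int[]{sum, count}` in place of `Tuple2` are all hypothetical names chosen for illustration, and the split of the four sample pairs into two partitions is an assumption about how the data is distributed.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Function;

class CombineByKeySketch {

    // Hypothetical helper (not a Spark API): builds one key-value pair.
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Hypothetical helper (not a Spark API): simulates combineByKey over explicit partitions.
    static <K, V, C> Map<K, C> combineByKey(
            List<List<Map.Entry<K, V>>> partitions,
            Function<V, C> createCombiner,
            BiFunction<C, V, C> mergeValue,
            BinaryOperator<C> mergeCombiners) {
        // Step 1: combine the values of each key inside every partition independently.
        List<Map<K, C>> perPartition = new ArrayList<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            Map<K, C> combined = new LinkedHashMap<>();
            for (Map.Entry<K, V> pair : partition) {
                K key = pair.getKey();
                if (combined.containsKey(key)) {
                    // Key already seen in this partition: mergeValue
                    combined.put(key, mergeValue.apply(combined.get(key), pair.getValue()));
                } else {
                    // Key seen for the first time in this partition: createCombiner
                    combined.put(key, createCombiner.apply(pair.getValue()));
                }
            }
            perPartition.add(combined);
        }
        // Step 2: merge the per-partition combiners of each key with mergeCombiners.
        Map<K, C> result = new LinkedHashMap<>();
        for (Map<K, C> combined : perPartition) {
            combined.forEach((key, combiner) -> result.merge(key, combiner, mergeCombiners));
        }
        return result;
    }

    // Computes {sum, count} per key for the sample data.
    static Map<String, int[]> sumAndCount() {
        // Assumed partition layout: two pairs in each of the two partitions.
        List<List<Map.Entry<String, Integer>>> partitions = Arrays.asList(
                Arrays.asList(kv("coffee", 1), kv("coffee", 2)),
                Arrays.asList(kv("panda", 3), kv("coffee", 9)));
        return combineByKey(
                partitions,
                value -> new int[] {value, 1},
                (acc, value) -> new int[] {acc[0] + value, acc[1] + 1},
                (a, b) -> new int[] {a[0] + b[0], a[1] + b[1]});
    }

    public static void main(String[] args) {
        // Prints: coffee -> (12,3) and panda -> (3,1)
        sumAndCount().forEach((key, c) ->
                System.out.println(key + " -> (" + c[0] + "," + c[1] + ")"));
    }
}
```

In this sketch the key coffee goes through createCombiner twice, once in each partition, and its two partial results (3,2) and (9,1) are joined by mergeCombiners into (12,3).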
III. aggregateByKey
// Reuses javaPairRDD, mergeValue and mergeCombiners from the combineByKey example
JavaPairRDD<String, Tuple2<Integer, Integer>> aggregateByKeyRDD =
        javaPairRDD.aggregateByKey(new Tuple2<>(0, 0), mergeValue, mergeCombiners);
// Result: [(coffee,(12,3)), (panda,(3,1))]
System.out.println("aggregateByKeyRDD = " + aggregateByKeyRDD.collect());

// aggregateByKey is implemented on top of combineByKey;
// the aggregateByKey call above is equivalent to the combineByKey call below
Function<Integer, Tuple2<Integer, Integer>> createCombinerAggregateByKey =
        new Function<Integer, Tuple2<Integer, Integer>>() {
    @Override
    public Tuple2<Integer, Integer> call(Integer value) throws Exception {
        // Merge the first value of a key into the zero value
        return mergeValue.call(new Tuple2<>(0, 0), value);
    }
};
// Result: [(coffee,(12,3)), (panda,(3,1))]
System.out.println(javaPairRDD.combineByKey(createCombinerAggregateByKey, mergeValue, mergeCombiners).collect());
IV. reduceByKey
JavaPairRDD<String, Integer> reduceByKeyRDD = javaPairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer value1, Integer value2) throws Exception {
        return value1 + value2;
    }
});
// Result: [(coffee,12), (panda,3)]
System.out.println("reduceByKeyRDD = " + reduceByKeyRDD.collect());

// reduceByKey is also implemented on top of combineByKey;
// the reduceByKey call above is equivalent to the combineByKey call below
Function<Integer, Integer> createCombinerReduce = new Function<Integer, Integer>() {
    @Override
    public Integer call(Integer integer) throws Exception {
        // The first value of a key is used as the combiner unchanged
        return integer;
    }
};
Function2<Integer, Integer, Integer> mergeValueReduce = new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer integer, Integer integer2) throws Exception {
        return integer + integer2;
    }
};
// Result: [(coffee,12), (panda,3)]
System.out.println(javaPairRDD.combineByKey(createCombinerReduce, mergeValueReduce, mergeValueReduce).collect());
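On a single machine, the same pattern (keep the first value of a key unchanged, then fold every later value in with the reduce function) is what `java.util.Map.merge` does. The sketch below is plain Java rather than Spark code; `ReduceByKeySketch` and its `reduceByKey` helper are hypothetical names used only to illustrate the semantics, and partitioning is ignored entirely.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

class ReduceByKeySketch {

    // Hypothetical helper (not a Spark API): single-machine analogue of reduceByKey.
    static Map<String, Integer> reduceByKey(String[] keys, int[] values, BinaryOperator<Integer> func) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            // First value of a key: stored as is (the identity createCombiner).
            // Later values of that key: combined with func (the mergeValue step).
            result.merge(keys[i], values[i], func);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] keys = {"coffee", "coffee", "panda", "coffee"};
        int[] values = {1, 2, 3, 9};
        // Prints: {coffee=12, panda=3}
        System.out.println(reduceByKey(keys, values, Integer::sum));
    }
}
```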
V. foldByKey
JavaPairRDD<String, Integer> foldByKeyRDD = javaPairRDD.foldByKey(0, new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer integer, Integer integer2) throws Exception {
        return integer + integer2;
    }
});
// Result: [(coffee,12), (panda,3)]
System.out.println("foldByKeyRDD = " + foldByKeyRDD.collect());

// foldByKey is also implemented on top of combineByKey;
// the foldByKey call above is equivalent to the combineByKey call below
Function2<Integer, Integer, Integer> mergeValueFold = new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer integer, Integer integer2) throws Exception {
        return integer + integer2;
    }
};
Function<Integer, Integer> createCombinerFold = new Function<Integer, Integer>() {
    @Override
    public Integer call(Integer integer) throws Exception {
        // Merge the first value of a key into the zero value (0 here)
        return mergeValueFold.call(0, integer);
    }
};
// Result: [(coffee,12), (panda,3)]
System.out.println(javaPairRDD.combineByKey(createCombinerFold, mergeValueFold, mergeValueFold).collect());
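Because the zero value enters through createCombiner, it is applied once for every key in every partition that contains that key, so it should be neutral for the fold function (0 for addition). The following plain-Java sketch illustrates this under assumptions: `FoldByKeySketch`, `samplePartitions` and the simulated `foldByKey` are hypothetical names rather than Spark APIs, and the two-partition layout of the sample data is assumed. In a real job, how often the zero value is applied depends on how the data happens to be partitioned.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

class FoldByKeySketch {

    // Hypothetical helper (not a Spark API): builds one key-value pair.
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Assumed layout of the sample data: two pairs in each of the two partitions.
    static List<List<Map.Entry<String, Integer>>> samplePartitions() {
        return Arrays.asList(
                Arrays.asList(kv("coffee", 1), kv("coffee", 2)),
                Arrays.asList(kv("panda", 3), kv("coffee", 9)));
    }

    // Hypothetical helper (not a Spark API): simulates foldByKey over explicit partitions.
    static Map<String, Integer> foldByKey(
            List<List<Map.Entry<String, Integer>>> partitions,
            int zeroValue,
            BinaryOperator<Integer> func) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (List<Map.Entry<String, Integer>> partition : partitions) {
            Map<String, Integer> local = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> pair : partition) {
                // The fold for a key starts from zeroValue once per partition.
                int acc = local.getOrDefault(pair.getKey(), zeroValue);
                local.put(pair.getKey(), func.apply(acc, pair.getValue()));
            }
            // Merge this partition's partial results into the overall result.
            local.forEach((key, value) -> result.merge(key, value, func));
        }
        return result;
    }

    public static void main(String[] args) {
        // Neutral zero value. Prints: {coffee=12, panda=3}
        System.out.println(foldByKey(samplePartitions(), 0, Integer::sum));
        // Non-neutral zero value: added twice for coffee (two partitions), once for panda.
        // Prints: {coffee=32, panda=13}
        System.out.println(foldByKey(samplePartitions(), 10, Integer::sum));
    }
}
```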
VI. groupByKey
JavaPairRDD<String, Iterable<Integer>> groupByKeyRDD = javaPairRDD.groupByKey();
// Result: [(coffee,[1, 2, 9]), (panda,[3])]
System.out.println("groupByKeyRDD = " + groupByKeyRDD.collect());

// groupByKey is also implemented on top of combineByKey;
// the groupByKey call above is equivalent to the combineByKey call below
Function<Integer, List<Integer>> createCombinerGroup = new Function<Integer, List<Integer>>() {
    @Override
    public List<Integer> call(Integer integer) throws Exception {
        // Start a new list with the first value of the key
        List<Integer> list = new ArrayList<>();
        list.add(integer);
        return list;
    }
};
Function2<List<Integer>, Integer, List<Integer>> mergeValueGroup =
        new Function2<List<Integer>, Integer, List<Integer>>() {
    @Override
    public List<Integer> call(List<Integer> integers, Integer integer) throws Exception {
        integers.add(integer);
        return integers;
    }
};
Function2<List<Integer>, List<Integer>, List<Integer>> mergeCombinersGroup =
        new Function2<List<Integer>, List<Integer>, List<Integer>>() {
    @Override
    public List<Integer> call(List<Integer> integers, List<Integer> integers2) throws Exception {
        integers.addAll(integers2);
        return integers;
    }
};
// Result: [(coffee,[1, 2, 9]), (panda,[3])]
System.out.println(javaPairRDD.combineByKey(createCombinerGroup, mergeValueGroup, mergeCombinersGroup).collect());
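The inner workings of these APIs are hard to convey fully in a written document. For a deeper and more thorough understanding of how they work, see: Spark Core RDD API: Principles Explained in Detail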