An Analysis of How Spark SQL Join Works

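1. Overview of the Join problem

There are six Join types: inner, leftouter, rightouter, fullouter, leftsemi, and leftanti. For the standalone (single-machine) version of a Join, the problem can be stated as follows: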

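Let IterA and IterB be two Iterators. Rows from the two Iterators are merged according to rule A, and the merged Rows are then filtered according to rule B.
For an inner join, for example, merge rule A is: for each record in IterA, generate a key, use that key to fetch the matching records from a Map built over IterB, and merge them. Rule B can be any filter condition, such as a comparison between any field of IterA and any field of IterB.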

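When we use the keys of IterA to look up matches in IterB one by one, we call IterA the streamedIter and IterB the BuildIter or hashedIter. In other words, we stream through every record in streamedIter and look up its matching records in hashedIter.

This lookup is the Build process. Each Build operation produces a JoinRow(A, B). If JoinRow(A) comes from streamedIter and JoinRow(B) comes from BuildIter, the process is BuildRight; if JoinRow(B) comes from streamedIter and JoinRow(A) comes from BuildIter, it is BuildLeft.

That is a bit of a mouthful! So why distinguish BuildLeft from BuildRight? For leftouter, rightouter, leftsemi, and leftanti the Build type is fixed: left* joins are BuildRight and right* joins are BuildLeft. For inner, however, both BuildLeft and BuildRight are valid, and the choice can make a big difference in performance: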

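BuildIter is also called hashedIter because it has to be built into an in-memory hash table to speed up matching. If BuildIter and streamedIter differ greatly in size, building the hash table from the smaller one clearly uses far less memory!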
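The build-side choice for an inner join can be sketched as a simple size comparison. This is only an illustration under stated assumptions: `BuildSideSketch`, `chooseBuildSide`, and the byte-size inputs are hypothetical names, not Spark's API, and breaking ties in favor of BuildRight is an arbitrary choice.

```scala
// Hypothetical sketch: pick the side to hash for an inner join by estimated size.
object BuildSideSketch {
  sealed trait BuildSide
  case object BuildLeft extends BuildSide
  case object BuildRight extends BuildSide

  // The returned side is the one loaded into the in-memory hash table.
  // Ties go to BuildRight (an arbitrary choice for this sketch).
  def chooseBuildSide(leftSizeInBytes: Long, rightSizeInBytes: Long): BuildSide =
    if (rightSizeInBytes <= leftSizeInBytes) BuildRight else BuildLeft
}
```

For example, `BuildSideSketch.chooseBuildSide(1000L, 10L)` picks BuildRight, so only the 10-byte side is hashed.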

To summarize, a Join is made up of the following parts:

trait Join {
  val joinType: JoinType // the Join type
  val streamedPlan: SparkPlan // produces streamedIter
  val buildPlan: SparkPlan // produces hashedIter

  val buildSide: BuildSide // BuildLeft or BuildRight
  val buildKeys: Seq[Expression] // expressions that compute the join key from a build-side (hashedIter) row
  val streamedKeys: Seq[Expression] // expressions that compute the join key from a streamedIter row

  val condition: Option[Expression] // filter applied to each joinRow
}

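Note: for fullouter, IterA and IterB each act as both streamedIter and hashedIter. First run a leftouter with IterA = streamedIter and IterB = hashedIter, then run a leftouter with IterB = streamedIter and IterA = hashedIter, and finally merge the two results, keeping each matched pair only once.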
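The two-pass idea in the note can be sketched in plain Scala. This is a minimal sketch under assumptions, not Spark's implementation: rows are modeled as `(key, value)` pairs, there is no extra join condition, and `FullOuterSketch` and `fullOuterJoin` are hypothetical names. To avoid emitting matched pairs twice, the second pass keeps only the right rows that found no partner on the left.

```scala
// Hypothetical sketch of a full outer join built from two hash-based passes.
object FullOuterSketch {
  def fullOuterJoin[K, A, B](
      left: Seq[(K, A)],
      right: Seq[(K, B)]): Seq[(Option[A], Option[B])] = {
    val rightByKey = right.groupBy(_._1) // hash table over the right side
    val leftKeys = left.map(_._1).toSet  // key set of the left side

    // Pass 1: left is streamed, right is hashed (a left outer join).
    val pass1: Seq[(Option[A], Option[B])] = left.flatMap { case (k, a) =>
      rightByKey.get(k) match {
        case Some(rows) => rows.map { case (_, b) => (Option(a), Option(b)) }
        case None       => Seq((Option(a), Option.empty[B]))
      }
    }

    // Pass 2: right is streamed; emit only the rows with no left match,
    // because the matched pairs were already produced by pass 1.
    val pass2: Seq[(Option[A], Option[B])] = right.collect {
      case (k, b) if !leftKeys.contains(k) => (Option.empty[A], Option(b))
    }

    pass1 ++ pass2
  }
}
```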

1.1 Implementations of the Join types

1.1.1 InnerJoin

For each srow in streamIter, look up the matching entries in hashedIter.

If there are matches, build one JoinRow per match; otherwise return empty.

streamIter.flatMap{ srow =>
    val joinRow = new JoinedRow
    joinRow.withLeft(srow)
    // probe the hash table with the join key computed from the streamed row
    val matches = hashedIter.get(streamedKeys(srow))
    if (matches != null) {
        matches.map(joinRow.withRight(_)).filter(condition)
    } else {
        Seq.empty
    }
}
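The pseudocode above can be turned into a small runnable sketch. It makes several assumptions that are not in the original: rows are plain `(key, value)` pairs instead of Spark rows, the build side is hashed with `groupBy`, and `InnerHashJoinSketch`, `innerJoin`, and the `condition` function are hypothetical names.

```scala
// Hypothetical sketch of a single-machine inner hash join.
object InnerHashJoinSketch {
  def innerJoin[K, A, B](streamed: Iterator[(K, A)], build: Iterator[(K, B)])(
      condition: (A, B) => Boolean): Iterator[(A, B)] = {
    // Build phase: load the whole build side into an in-memory map keyed by join key.
    val hashed: Map[K, List[B]] =
      build.toList.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }

    // Probe phase: stream the other side and emit one pair per surviving match.
    streamed.flatMap { case (k, a) =>
      hashed.getOrElse(k, Nil).collect { case b if condition(a, b) => (a, b) }
    }
  }
}
```

A call such as `InnerHashJoinSketch.innerJoin(Iterator(1 -> "a"), Iterator(1 -> "x", 1 -> "y"))((_, _) => true)` yields one joined pair per match on key 1.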

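1.1.2 LeftOuterJoin

leftIter is always the streamIter and RightIter is always the hashedIter; this cannot be changed.
For each srow in streamIter, look up the matching entries in hashedIter.

If there are matches that pass the condition, build one JoinRow per match; otherwise return a JoinRow whose Build side is Null.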

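val nullRow = new NullRow()
streamIter.flatMap{ srow =>
    val joinRow = new JoinedRow
    joinRow.withLeft(srow)
    val matches = hashedIter.get(streamedKeys(srow))
    // keep only the matches that pass the condition
    val joined = if (matches != null) matches.map(joinRow.withRight(_)).filter(condition) else Seq.empty
    // no surviving match: pad the Build (right) side with nulls
    if (joined.nonEmpty) joined else Seq(joinRow.withRight(nullRow))
}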
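As a runnable illustration of the left outer case, here is a minimal sketch under the same kind of assumptions as before: rows are `(key, value)` pairs, a missing Build side is modeled as `None` rather than a null row, and `LeftOuterHashJoinSketch` and `leftOuterJoin` are hypothetical names. Note that a streamed row whose matches all fail the condition is still emitted once with an empty Build side.

```scala
// Hypothetical sketch of a single-machine left outer hash join.
object LeftOuterHashJoinSketch {
  def leftOuterJoin[K, A, B](streamed: Iterator[(K, A)], build: Iterator[(K, B)])(
      condition: (A, B) => Boolean): Iterator[(A, Option[B])] = {
    // Build phase: hash the right side by join key.
    val hashed = build.toList.groupBy(_._1)

    streamed.flatMap { case (k, a) =>
      // Matches that share the key and also pass the join condition.
      val joined = hashed.getOrElse(k, Nil).collect {
        case (_, b) if condition(a, b) => (a, Option(b))
      }
      // Every streamed row appears at least once in the output.
      if (joined.nonEmpty) joined else List((a, Option.empty[B]))
    }
  }
}
```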

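1.1.3 RightOuterJoin

RightIter is always the streamIter and LeftIter is always the hashedIter; this cannot be changed.
For each srow in streamIter, look up the matching entries in hashedIter.

If there are matches that pass the condition, build one JoinRow per match; otherwise return a JoinRow whose Build side is Null.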

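val nullRow = new NullRow()
streamIter.flatMap{ srow =>
    val joinRow = new JoinedRow
    joinRow.withRight(srow) // note the difference from the left outer join
    val matches = hashedIter.get(streamedKeys(srow))
    // keep only the matches that pass the condition
    val joined = if (matches != null) matches.map(joinRow.withLeft(_)).filter(condition) else Seq.empty
    // no surviving match: pad the Build (left) side with nulls
    if (joined.nonEmpty) joined else Seq(joinRow.withLeft(nullRow))
}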

1.1.4 LeftSemi

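leftIter is always the streamIter and RightIter is always the hashedIter; this cannot be changed.
For each srow in streamIter, look up the matching entries in hashedIter.
If there is a match, return srow; otherwise return empty.

It returns srow itself rather than a JoinRow.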

streamIter.filter{ srow =>
    val matches = hashedIter.get(streamedKeys(srow))
    if(matches == null) {
        false // no match found
    } else{
        if(condition.isEmpty == false) { // evaluate the condition on a hypothetical joinRow
                val joinRow = new JoinedRow
                joinRow.withLeft(srow)
                // the candidate match belongs on the right side of the joinRow
                matches.map(joinRow.withRight(_)).exists(condition)
        } else {
            true
        }
    }
}

Logically, LeftSemi is an IN test.
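The IN-like behavior can be seen in a small runnable sketch. It assumes `(key, value)` rows and uses the hypothetical names `LeftSemiSketch` and `leftSemiJoin`; it is an illustration, not Spark's code. Each streamed row is returned at most once, no matter how many Build rows it matches.

```scala
// Hypothetical sketch of a single-machine left semi hash join.
object LeftSemiSketch {
  def leftSemiJoin[K, A, B](streamed: Iterator[(K, A)], build: Iterator[(K, B)])(
      condition: (A, B) => Boolean): Iterator[(K, A)] = {
    // Build phase: hash the right side by join key.
    val hashed = build.toList.groupBy(_._1)

    // Keep a streamed row if at least one match passes the condition.
    streamed.filter { case (k, a) =>
      hashed.getOrElse(k, Nil).exists { case (_, b) => condition(a, b) }
    }
  }
}
```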

1.1.5 LeftAnti

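leftIter is always the streamIter and RightIter is always the hashedIter; this cannot be changed.
For each srow in streamIter, look up the matching entries in hashedIter.
Its matching logic is essentially the opposite of LeftSemi, so it is roughly a NOT IN test (setting NULL handling aside).
If there is no match, return srow; otherwise return empty.

It returns srow itself rather than a JoinRow.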

streamIter.filter{ srow =>
    val matches = hashedIter.get(streamedKeys(srow))
    if(matches == null) {
        true // no match found
    } else{
        if(condition.isEmpty == false) { // evaluate the condition on a hypothetical joinRow
                val joinRow = new JoinedRow
                joinRow.withLeft(srow)
                // keep srow only if no candidate on the right side passes the condition
                !matches.map(joinRow.withRight(_)).exists(condition)
        } else {
            false
        }
    }
}
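A runnable counterpart for the anti join, again only a sketch: rows are `(key, value)` pairs, NULL semantics are ignored, and `LeftAntiSketch` and `leftAntiJoin` are hypothetical names.

```scala
// Hypothetical sketch of a single-machine left anti hash join.
object LeftAntiSketch {
  def leftAntiJoin[K, A, B](streamed: Iterator[(K, A)], build: Iterator[(K, B)])(
      condition: (A, B) => Boolean): Iterator[(K, A)] = {
    // Build phase: hash the right side by join key.
    val hashed = build.toList.groupBy(_._1)

    // Keep a streamed row only if no match passes the condition.
    streamed.filter { case (k, a) =>
      !hashed.getOrElse(k, Nil).exists { case (_, b) => condition(a, b) }
    }
  }
}
```

For the same inputs and condition, the rows kept here are exactly the streamed rows that a left semi join would drop.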

1.2 HashJoin and SortJoin

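The Join described above has to build BuildIter into an in-memory hashedIter to speed up matching, so we call it a HashJoin. Building a hash table, however, takes a lot of memory.
So here is the problem: what if the Iter is too large to build a hash table from? In a distributed Join the work happens in the Shuffle stage, and if the keys of a dataset are skewed, a single BuildIter can easily exceed the available memory. The hash table then cannot be built and the HashJoin fails. What can we do?

In a HashJoin, hashedIter is built from BuildIter to speed up matching. Besides building a hash table, sorting both streamedIter and BuildIter also speeds up matching; this is the SortJoin discussed here.

Doesn't sorting need memory too? It does, but sorting uses much less memory than building a hash table, and when memory runs short a sort can Spill part of the data to disk. The hash table lives entirely in memory, so running out of memory fails the whole Shuffle.

The following uses the SortJoin implementation of InnerJoin as an example to show how it differs from HashJoin:

Both streamIter and BuildIter must be sorted.

For each srow in streamIter, search BuildIter sequentially. Because both sides are sorted, the lookup is cheap.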

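var buildIndex = 0 // must be a var: it advances as BuildIter is consumed
streamIter.flatMap{ srow =>
    val joinRow = new JoinedRow
    joinRow.withLeft(srow)
    // sequential search, starting from buildIndex
    val matches = BuildIter.search(streamedKeys(srow), buildIndex)
    if (matches != null) {
        // skip past the matched group; this assumes streamed keys are unique,
        // otherwise the index must stay put until the streamed key changes
        buildIndex += matches.length
        matches.map(joinRow.withRight(_)).filter(condition)
    } else {
        Seq.empty
    }
}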
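For a runnable version of the merge step, here is a minimal sketch that also handles duplicate keys on both sides. It is an illustration under assumptions, not Spark's SortMergeJoinExec: both inputs are in-memory `Vector`s of `(key, value)` pairs that are already sorted by key, there is no extra join condition, and `SortMergeJoinSketch` and `sortMergeInnerJoin` are hypothetical names.

```scala
// Hypothetical sketch of a sort-merge inner join over pre-sorted inputs.
object SortMergeJoinSketch {
  // Both inputs must already be sorted by key in ascending order.
  def sortMergeInnerJoin[K, A, B](left: Vector[(K, A)], right: Vector[(K, B)])(
      implicit ord: Ordering[K]): Vector[(A, B)] = {
    val out = Vector.newBuilder[(A, B)]
    var i = 0 // cursor into the streamed (left) side
    var j = 0 // cursor into the build (right) side
    while (i < left.length && j < right.length) {
      val c = ord.compare(left(i)._1, right(j)._1)
      if (c < 0) i += 1      // left key is smaller: advance left
      else if (c > 0) j += 1 // right key is smaller: advance right
      else {
        val key = left(i)._1
        // Find the end of the group of right rows that share this key.
        var jEnd = j
        while (jEnd < right.length && ord.equiv(right(jEnd)._1, key)) jEnd += 1
        // Pair every left row with this key against the whole right group.
        while (i < left.length && ord.equiv(left(i)._1, key)) {
          var r = j
          while (r < jEnd) { out += ((left(i)._2, right(r)._2)); r += 1 }
          i += 1
        }
        j = jEnd // the right group is fully consumed
      }
    }
    out.result()
  }
}
```

Each cursor only moves forward, so apart from re-reading a group of equal keys, every row is visited once.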

With a HashJoin, FullOuterJoin is expensive because hash tables must be built in both directions. With a SortJoin, its cost is about the same as the other Join types, so FullOuter is implemented with SortJoin by default.

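2. Join implementation in Spark

Spark provides a distributed implementation of Join, but the Join itself still runs on a single machine. How so? To join two datasets in a distributed way, Spark first performs an Exchange on both of them, that is, a ShuffleMap operation that sends rows with the same key to the same partition. Then, during ShuffleFetch, it uses the single-machine HashJoin/SortJoin algorithm to join the two matching partitions.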
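The "shuffle by key, then join each partition locally" idea can be simulated on one machine. The following is a sketch under assumptions: partitions are modeled as groups in a `Map`, keys are placed by a non-negative `hashCode` modulo, and `ShuffleJoinSketch`, `partitionOf`, and `shuffledJoin` are hypothetical names rather than Spark APIs.

```scala
// Hypothetical single-machine simulation of a shuffled hash join.
object ShuffleJoinSketch {
  // Assign a key to a partition with a non-negative hash modulo.
  def partitionOf[K](key: K, numPartitions: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  def shuffledJoin[K, A, B](
      left: Seq[(K, A)],
      right: Seq[(K, B)],
      numPartitions: Int): Seq[(K, A, B)] = {
    // "Exchange": rows with the same key land in the same partition on both sides.
    val leftParts = left.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    val rightParts = right.groupBy { case (k, _) => partitionOf(k, numPartitions) }

    // Local join: each pair of co-located partitions is joined independently.
    (0 until numPartitions).flatMap { p =>
      val hashed = rightParts.getOrElse(p, Seq.empty).groupBy(_._1)
      leftParts.getOrElse(p, Seq.empty).flatMap { case (k, a) =>
        hashed.getOrElse(k, Seq.empty).map { case (_, b) => (k, a, b) }
      }
    }
  }
}
```

Because equal keys always map to the same partition index, no match can be lost by joining the partitions separately.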

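In addition, if the entire Build-side dataset (not just a single iter) is small, it can be Broadcast, which saves the cost of the Shuffle.

Spark therefore supports three Join algorithms: ShuffledHashJoinExec, SortMergeJoinExec, and BroadcastHashJoinExec. How does it choose among them?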

- If the build dataset is marked Broadcastable, or its size is at most spark.sql.autoBroadcastJoinThreshold (10 MB by default), BroadcastHashJoinExec is preferred.
- If the join keys can be sorted and spark.sql.join.preferSortMergeJoin is true, SortMergeJoinExec is preferred.
- If the join keys cannot be sorted, ShuffledHashJoinExec is the only option.
- If the Join supports both BuildRight and BuildLeft, the smaller side is preferred for hashing.

All of this logic lives in the JoinSelection strategy in org.apache.spark.sql.execution.SparkStrategies. PS: Spark also supports Joins without joining keys, but those are outside the scope of this discussion.

BroadcastHashJoinExec

val p = spark.read.parquet("/Users/p.parquet")
val p1 = spark.read.parquet("/Users/p1.parquet")
p.joinWith(p1, p("to_module") === p1("to_module"),"inner")

Because p and p1 are both small, BroadcastHashJoinExec is chosen by default.
== Physical Plan ==
BroadcastHashJoin [_1#269.to_module], [_2#270.to_module], Inner, BuildRight
    :- Project p
    :- Project p1

SortMergeJoinExec

val p = spark.read.parquet("/Users/p.parquet")
val p1 = spark.read.parquet("/Users/p1.parquet")
p.joinWith(p1, p("to_module") === p1("to_module"),"fullouter")

A fullouter Join supports neither Broadcast nor ShuffledHashJoinExec, so SortMergeJoinExec is used.

== Physical Plan ==
SortMergeJoin [_1#273.to_module], [_2#274.to_module], FullOuter
    :- Project p
    :- Project p1

ShuffledHashJoinExec is rarely chosen in practice because its conditions are quite strict:

// First, the build side must not be broadcastable!
private def canBroadcast(plan: LogicalPlan): Boolean = {
  plan.statistics.isBroadcastable ||
    plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold // 10 MB by default
}
// Second, spark.sql.join.preferSortMergeJoin must be set to false
// Then, the build side must fit in memory!
private def canBuildLocalHashMap(plan: LogicalPlan): Boolean = {
  plan.statistics.sizeInBytes < conf.autoBroadcastJoinThreshold * conf.numShufflePartitions
}
// Finally, the build side must be at least 3x smaller than the stream side; otherwise sorting performs better.
private def muchSmaller(a: LogicalPlan, b: LogicalPlan): Boolean = {
  a.statistics.sizeInBytes * 3 <= b.statistics.sizeInBytes
}
// Alternatively, ShuffledHashJoinExec is chosen when RowOrdering.isOrderable(leftKeys) == false
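Putting these checks together, the selection order can be sketched as a pure function over size estimates. This is a simplified model, not Spark's JoinSelection code: it ignores the join type and the choice between BuildLeft and BuildRight, and `JoinSelectionSketch`, `Conf`, and `select` are hypothetical names. The defaults in `Conf` mirror Spark's documented defaults (10 MB broadcast threshold, 200 shuffle partitions, sort-merge preferred).

```scala
// Hypothetical sketch of the join strategy selection order.
object JoinSelectionSketch {
  sealed trait Strategy
  case object BroadcastHashJoin extends Strategy
  case object ShuffledHashJoin extends Strategy
  case object SortMergeJoin extends Strategy

  // Stand-in for the planner configuration (names chosen for this sketch).
  final case class Conf(
      autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024,
      numShufflePartitions: Int = 200,
      preferSortMergeJoin: Boolean = true)

  def canBroadcast(sizeInBytes: Long, conf: Conf): Boolean =
    sizeInBytes <= conf.autoBroadcastJoinThreshold

  def canBuildLocalHashMap(sizeInBytes: Long, conf: Conf): Boolean =
    sizeInBytes < conf.autoBroadcastJoinThreshold * conf.numShufflePartitions

  def muchSmaller(a: Long, b: Long): Boolean = a * 3 <= b

  // buildSize / streamSize are estimated sizes in bytes of the two sides.
  def select(buildSize: Long, streamSize: Long, keysOrderable: Boolean, conf: Conf): Strategy =
    if (canBroadcast(buildSize, conf)) BroadcastHashJoin
    else if (!keysOrderable) ShuffledHashJoin // sorting is impossible
    else if (!conf.preferSortMergeJoin &&
             canBuildLocalHashMap(buildSize, conf) &&
             muchSmaller(buildSize, streamSize)) ShuffledHashJoin
    else SortMergeJoin
}
```

With the defaults, a 100 MB build side against a 1 GB stream side still resolves to SortMergeJoin; it becomes ShuffledHashJoin only once `preferSortMergeJoin` is set to false.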
