What Are the MapReduce Design Patterns?



1 Summarization Patterns

1.1 Numerical Summarizations

This one is effectively built in, because it is the basic MapReduce model itself. It corresponds to aggregates such as COUNT and MAX in SQL, and WordCount is also an implementation of it.
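To make the map, group, and reduce steps concrete, here is a minimal in-memory sketch in plain Java. It is not Hadoop code: the class `NumericalSummarization`, the `Stats` type, and the `{key, value}` record layout are all assumptions made for illustration, and a `TreeMap` stands in for the shuffle.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of a numerical summarization; not Hadoop API code.
class NumericalSummarization {

    // Hypothetical value type: count, min and max for one key.
    static final class Stats {
        long count;
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;

        void add(int value) {
            count++;
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
    }

    // "map" emits (key, value); the TreeMap plays the role of the shuffle
    // that groups by key, and Stats.add is the per-key "reduce".
    static Map<String, Stats> summarize(List<String[]> records) {
        Map<String, Stats> byKey = new TreeMap<>();
        for (String[] record : records) {
            String key = record[0];
            int value = Integer.parseInt(record[1]);
            byKey.computeIfAbsent(key, k -> new Stats()).add(value);
        }
        return byKey;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] {"alice", "3"},
                new String[] {"bob", "7"},
                new String[] {"alice", "9"});
        summarize(records).forEach((user, s) -> System.out.println(
                user + " count=" + s.count + " min=" + s.min + " max=" + s.max));
    }
}
```

Because count, min, and max are all associative, the same `Stats.add` logic could in principle also run as a combiner before the shuffle.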

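1.2 Inverted Index Summarizations

The name sounds mysterious, but to me this hardly counts as a pattern. It is more of an application, and it involves nothing special in the MapReduce design. At its core it is index processing over list of (V3), where V3 is a reference id. The expected result of this pattern is:
url -> list of id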
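A small sketch of that url -> list of id shape, again in plain Java rather than Hadoop. The class name `InvertedIndexSketch` and the `{id, url}` row layout are assumptions for the example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of an inverted index; not Hadoop API code.
class InvertedIndexSketch {

    // Each input row is {id, url}: the record `id` references `url`.
    // The result maps url -> list of ids.
    static Map<String, List<String>> build(List<String[]> rows) {
        Map<String, List<String>> index = new TreeMap<>();
        for (String[] row : rows) {
            String id = row[0];
            String url = row[1];
            // "map" would emit (url, id); grouping by url is the shuffle,
            // and collecting the ids into one list is the "reduce".
            index.computeIfAbsent(url, k -> new ArrayList<>()).add(id);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"1", "http://example.com/a"},
                new String[] {"2", "http://example.com/b"},
                new String[] {"3", "http://example.com/a"});
        System.out.println(build(rows));
    }
}
```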

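1.3 Counting with Counters

Counters are fast, simple, and easy to use. The cost is memory on the TaskTracker and, more importantly, on the JobTracker. In the 1.0 era the advice was therefore to use only tens of counters, and in any case fewer than 100. In 2.0 job tracking became per job, so I expect more counters can be used. Even so, counters are best suited to algorithms that compute totals, such as counting.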
context.getCounter(STATE_COUNTER_GROUP, UNKNOWN_COUNTER).increment(1);

2 Filtering Patterns

2.1 Filtering

This one is also effectively built in: in MapReduce, if the Mapper does not write a record, that record has been filtered out. It corresponds to WHERE in SQL.

map(key, record):
    if we want to keep record then
        emit key,value

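2.2 Bloom Filtering

For a long time I did not know why it is called a Bloom filter. I only found out after reading the wiki, so I am pasting it here for everyone:
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate.
The principle is explained in this article:

http://blog.csdn.net/jiaomeng/article/details/1495500
If I had to put it in one sentence: build a bitmap from the contents of the set using several hash functions. If all of a word's hash positions are set in the bitmap, the word may or may not be in the set. If any of its hash positions is not set, the word is definitely not in the set. This filter is quite interesting and is widely used in HBase. Next I need to try it out in practice.

Note: I still need to write a program to play with this.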
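As a starting point for that experiment, here is a minimal Bloom filter sketch on top of `java.util.BitSet`. The class `BloomFilterSketch` is hypothetical, and deriving the k bit positions from two base hashes (double hashing) is my own simplification, not something the article above or HBase prescribes.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch; sizes and hashing scheme are illustrative only.
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(item, i));
        }
    }

    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(item, i))) {
                return false; // at least one bit is clear: definitely absent
            }
        }
        return true; // all bits set: present, or a false positive
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("hbase");
        System.out.println(filter.mightContain("hbase")); // always true once added
        System.out.println(filter.mightContain("pig"));   // usually false, may be a false positive
    }
}
```

In a filtering job the filter would typically be built once from the reference set and then consulted in the mapper, which only emits records for which `mightContain` returns true.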

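2.3 Top Ten (Top N)

This is a typical top-N computation, similar to TOP or LIMIT in SQL, and usually a top operation under some condition.
Implementation: I like pseudocode, because it is clear at a glance: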

class mapper:
    setup():
        initialize top ten sorted list

    map(key, record):
        insert record into top ten sorted list
        if length of array is greater-than 10 then
            truncate list to a length of 10

    cleanup():
        for record in top sorted ten list:
            emit null,record

class reducer:
    setup():
        initialize top ten sorted list

    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit record
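The pseudocode above can be simulated in plain Java. This is only a sketch under my own assumptions: records are bare integers, a `PriorityQueue` replaces the sorted list, and the hypothetical `TopNSketch.topN` is used both as the per-split "mapper" and as the single "reducer".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// In-memory sketch of the top-N pattern; not Hadoop API code.
class TopNSketch {

    // Keeps the n largest values seen so far. The min-heap's head is the
    // smallest survivor, so it is the one evicted when the heap overflows.
    static List<Integer> topN(Iterable<Integer> values, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int v : values) {
            heap.add(v);
            if (heap.size() > n) {
                heap.poll();
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        // Each "mapper" emits the local top 3 of its own split.
        List<Integer> split1 = topN(Arrays.asList(5, 42, 17, 8), 3);  // [42, 17, 8]
        List<Integer> split2 = topN(Arrays.asList(99, 1, 23, 64), 3); // [99, 64, 23]
        // The single "reducer" merges the local lists and truncates again.
        List<Integer> merged = new ArrayList<>(split1);
        merged.addAll(split2);
        System.out.println(topN(merged, 3)); // [99, 64, 42]
    }
}
```

The point of the two-stage design is that each mapper ships at most N records, so the single reducer only ever sees N times the number of mappers.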

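2.4 Distinct

This pattern is also simple: it relies on the reduce phase of MapReduce, where identical keys end up in the same group. The structure makes it clear at a glance: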

map(key, record):
    emit record,null

reduce(key, records):
    emit key
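The same two steps as a plain Java sketch, with a `TreeMap` standing in for the shuffle. The class `DistinctSketch` is a name made up for this example and is not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the distinct pattern; not Hadoop API code.
class DistinctSketch {

    static List<String> distinct(List<String> records) {
        // map: emit (record, null). The TreeMap stands in for the shuffle,
        // which collapses identical keys into a single group.
        Map<String, Object> groups = new TreeMap<>();
        for (String record : records) {
            groups.put(record, null);
        }
        // reduce: emit each key once and ignore the values.
        return new ArrayList<>(groups.keySet());
    }

    public static void main(String[] args) {
        System.out.println(distinct(Arrays.asList("b", "a", "b", "c", "a")));
    }
}
```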

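3 Data Organization Patterns

3.1 Structured to Hierarchical

Algorithmically this is a join; at the application level it has the effect of denormalization. The key to the program is MultipleInputs, which lets you plug in several Mappers, so that multiple structured inputs, or inputs in any format, can be brought together on the reducer side and then aggregated into a hierarchical format.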
MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, PostMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, CommentMapper.class);
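There is a small trick in the Map output: each value gets a prefix according to its category, so that the Reduce phase knows where the data came from and can more easily build the Cartesian product or tell the records apart. Technically this saves you from writing your own Writable, and in principle you can define a format that carries more information. Of course, if you have special sorting or grouping requirements, you still need to write a custom Writable.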
outkey.set(post.getAttribute("ParentId"));
outvalue.set("A" + value.toString());
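On the reduce side, the prefix is what lets one key's values be sorted back into parent and children. A rough sketch in plain Java follows; the class `HierarchySketch`, the prefixes "P" (post) and "C" (comment), and the XML-like output are all assumptions for illustration, and no XML escaping is attempted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// In-memory sketch of the reduce step; not Hadoop API code.
class HierarchySketch {

    // Receives every tagged value that shares one key (the post id) and
    // nests the comments under the post. The "P"/"C" prefixes are assumed.
    static String reduce(List<String> taggedValues) {
        String post = null;
        List<String> comments = new ArrayList<>();
        for (String value : taggedValues) {
            char tag = value.charAt(0);      // the prefix added by the mapper
            String body = value.substring(1);
            if (tag == 'P') {
                post = body;
            } else if (tag == 'C') {
                comments.add(body);
            }
        }
        if (post == null) {
            return null; // comments without a post: nothing to nest them under
        }
        // Illustration only: attribute values are not escaped here.
        StringBuilder out = new StringBuilder("<post text=\"" + post + "\">");
        for (String c : comments) {
            out.append("<comment text=\"").append(c).append("\"/>");
        }
        return out.append("</post>").toString();
    }

    public static void main(String[] args) {
        System.out.println(reduce(Arrays.asList("Cnice", "Phello", "Cthanks")));
    }
}
```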

3.2 Partitioning

Here is another built-in one: write your own partitioner to route records to specific Reducers.
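The core of a custom partitioner is a single function from key to partition index. The sketch below shows only that function in plain Java and does not extend Hadoop's `Partitioner` class; partitioning by year, the `minYear` field, and the clamping of unexpected years are all assumptions chosen for the example.

```java
// Sketch of the routing logic behind a custom partitioner; not Hadoop API code.
class YearPartitionerSketch {
    private final int minYear; // hypothetical: earliest year expected in the data

    YearPartitionerSketch(int minYear) {
        this.minYear = minYear;
    }

    // Same contract as a partitioner: return an index in [0, numPartitions).
    int getPartition(int year, int numPartitions) {
        int offset = year - minYear;
        if (offset < 0) {
            return 0; // clamp years before minYear into the first partition
        }
        // Clamp years beyond the expected range into the last partition.
        return Math.min(offset, numPartitions - 1);
    }

    public static void main(String[] args) {
        YearPartitionerSketch p = new YearPartitionerSketch(2008);
        System.out.println(p.getPartition(2010, 4)); // 2
    }
}
```

With one reducer per expected year, each output file then holds exactly one year's records.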

3.3 Binning

This one is interesting. It is similar to partitioning, but it is map-side only, which makes it more efficient. The results may need a further merge, however.
The SPLIT operation in Pig implements this pattern.
The implementation uses MultipleOutputs.addNamedOutput().

// Configure the MultipleOutputs by adding an output called "bins"
// with the proper output format and mapper key/value pairs
MultipleOutputs.addNamedOutput(job, "bins", TextOutputFormat.class,
        Text.class, NullWritable.class);

// Enable the counters for the job.
// If there are a significant number of different named outputs, this
// should be disabled
MultipleOutputs.setCountersEnabled(job, true);

// Map-only job
job.setNumReduceTasks(0);

3.4 Total Order Sorting

This was already described in detail in the Hadoop part, so it is skipped here.

3.5 Shuffling

The essence of this pattern is generating a random key.
outkey.set(rndm.nextInt());
context.write(outkey, outvalue);
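Why a random key shuffles the data is easiest to see when the sort-by-key step is spelled out. A small sketch in plain Java, where sorting the list stands in for the framework's sort phase; `ShuffleSketch` and the idea of passing in the `Random` instance are my own choices for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;

// In-memory sketch of the shuffling pattern; not Hadoop API code.
class ShuffleSketch {

    static List<String> shuffle(List<String> records, Random rndm) {
        // map: attach a random integer key to every record.
        List<Map.Entry<Integer, String>> keyed = new ArrayList<>();
        for (String record : records) {
            keyed.add(new SimpleEntry<>(rndm.nextInt(), record));
        }
        // The framework sorts by key; random keys give a random order.
        keyed.sort(Map.Entry.comparingByKey());
        // reduce: drop the key and emit only the record.
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> e : keyed) {
            out.add(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shuffle(Arrays.asList("a", "b", "c", "d"), new Random()));
    }
}
```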

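4 Join Patterns

4.1 Reduce Side Join

This one is fairly simple and was already covered under Structured to Hierarchical.

4.2 Replicated Join (Map-Side Join)

A map-side join is more efficient, but the smaller data set has to be placed in the DistributedCache. The other data set is then fed into the map, and the join is completed inside the map.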
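In other words, every mapper holds the whole small data set in a hash map and probes it once per input record. A plain Java sketch of that idea follows; the class `ReplicatedJoinSketch`, the user and comment fields, and the tab-separated output are assumptions, and the constructor merely plays the role that reading from the DistributedCache in `setup()` would play in a real job.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of a replicated (map-side) join; not Hadoop API code.
class ReplicatedJoinSketch {
    private final Map<String, String> small = new HashMap<>();

    // Plays the role of setup(): load the small data set into memory once.
    ReplicatedJoinSketch(Map<String, String> smallDataSet) {
        small.putAll(smallDataSet);
    }

    // Plays the role of map(): join one record of the large data set.
    // Returns null when there is no match (inner join semantics).
    String map(String userId, String comment) {
        String userInfo = small.get(userId);
        if (userInfo == null) {
            return null;
        }
        return userId + "\t" + userInfo + "\t" + comment;
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("u1", "alice");
        ReplicatedJoinSketch join = new ReplicatedJoinSketch(users);
        System.out.println(join.map("u1", "hi"));
    }
}
```

No reducer is needed, which is where the efficiency comes from, but the small data set has to fit in each mapper's memory.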

4.3 Composite Join

This pattern is also a map-side join, and it can join two large data sets. It has one restriction, though: the data sets must be organized in the same way. So what does "organized in the same way" mean?
- An inner or full outer join is desired.
- All the data sets are sufficiently large.
- All data sets can be read with the foreign key as the input key to the mapper.
- All data sets have the same number of partitions.
- Each partition is sorted by foreign key, and all the foreign keys reside in the associated partition of each data set. That is, partition X of data sets A and B contain the same foreign keys and these foreign keys are present only in partition X. For a visualization of this partitioning and sorting key, refer to Figure 5-3.
- The data sets do not change often (if they have to be prepared).

// The composite input format join expression will set how the records
// are going to be read in, and in what input format.
conf.set("mapred.join.expr", CompositeInputFormat.compose(joinType,
        KeyValueTextInputFormat.class, userPath, commentPath));

4.4 Cartesian Product

This requires a custom InputFormat, so that the two data sets can be combined at the record level. Sample omitted.

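5 MetaPatterns

5.1 Job Chaining

There are several ways to do this: the chain can be written in the driver, or the jobs can be invoked from a bash script. Hadoop also provides JobControl, which offers useful features such as tracking failed jobs.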

5.2 Chain Folding

The ChainMapper and ChainReducer approach: M+ R M* (one or more mappers, then one reducer, then zero or more mappers).

5.3 Job Merging

Job merging means merging, at the code level, the mappers and reducers of two jobs that read the same data, which saves I/O and parsing time.

6 Input and Output Patterns

6.1 Customizing Input and Output in Hadoop

InputFormat
  getSplits
  createRecordReader
InputSplit
  getLength()
  getLocations()
RecordReader
  initialize
  getCurrentKey and getCurrentValue
  nextKeyValue
  getProgress
  close
OutputFormat
  checkOutputSpecs
  getRecordWriter
  getOutputCommitter
RecordWriter
  write
  close

6.2 Generating Data

Key point: construct fake InputSplits. Unlike a FileSplit, which is based on blocks, these have no real input behind them, so the only option is to trick Hadoop.

Article source: http://weahome.cn/article/gccdci.html
