After one NodeManager exited, the Application Master produced a large volume of log entries like the ones below, and only after a very long time did it finally exit successfully.
2016-06-24 09:32:35,596 INFO [ContainerLauncher #3] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:35,596 INFO [ContainerLauncher #9] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:35,597 INFO [ContainerLauncher #7] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,455 INFO [ContainerLauncher #8] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #5] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #6] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #2] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #0] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,596 INFO [ContainerLauncher #4] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,597 INFO [ContainerLauncher #3] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,597 INFO [ContainerLauncher #9] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
.......
2016-06-24 12:57:52,328 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:57:53,339 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:04,357 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:05,367 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:06,378 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:07,392 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:08,399 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:09,408 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:10,417 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:11,425 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:12,434 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
1) After the NodeManager on dchadoop206 exited (because it was restarted), the Application Master kept trying to connect to the containers that had been running on that NodeManager. Obviously those containers could no longer be reached.
2) Only after a very long time, roughly 3-4 hours, was the connection failure finally raised as an exception and the Application Master able to finish normally.
2. Problem Analysis
The problem comes down to Hadoop's RPC retry mechanism. First, look at the following two configuration parameters:
# Maximum total time the client keeps trying to connect to a NodeManager. This is not a per-attempt
# timeout: once connections have failed for this long in total, the operation is considered failed.
yarn.client.nodemanager-connect.max-wait-ms = 15*60*1000 (900000 ms)

# Interval between two successive connection attempts to the NodeManager.
yarn.client.nodemanager-connect.retry-interval-ms = 10*1000 (10000 ms)
According to these definitions, the ApplicationMaster should give up connecting to the containers on the dead NodeManager after 15 minutes. What we observed, however, is that the Application Master waited roughly 30 minutes before cancelling a try-connect. The reason lies in Hadoop's RPC mechanism: the ApplicationMaster first uses the two parameters above to build a RetryUpToMaximumCountWithFixedSleep reconnect policy, whose maximum retry count is computed as follows:
MaximumCount = yarn.client.nodemanager-connect.max-wait-ms / yarn.client.nodemanager-connect.retry-interval-ms = 900000 ms / 10000 ms = 90 attempts
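To make the computation concrete, here is a minimal Java sketch of how the outer policy can be derived from the two parameters using Hadoop's public retry API. This is illustrative code, not the literal YARN source: the class and method names NmConnectRetryPolicySketch and buildOuterPolicy are hypothetical, while Configuration and RetryPolicies.retryUpToMaximumCountWithFixedSleep are real Hadoop classes (hadoop-common must be on the classpath).

import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class NmConnectRetryPolicySketch {
    // Builds the "outer" retry policy from the two YARN client parameters,
    // mirroring the MaximumCount computation described above.
    public static RetryPolicy buildOuterPolicy(Configuration conf) {
        long maxWaitMs = conf.getLong(
                "yarn.client.nodemanager-connect.max-wait-ms", 15 * 60 * 1000L);
        long retryIntervalMs = conf.getLong(
                "yarn.client.nodemanager-connect.retry-interval-ms", 10 * 1000L);
        // With the defaults: 900000 / 10000 = 90 attempts, fixed 10 s sleep between them.
        int maxRetries = (int) (maxWaitMs / retryIntervalMs);
        return RetryPolicies.retryUpToMaximumCountWithFixedSleep(
                maxRetries, retryIntervalMs, TimeUnit.MILLISECONDS);
    }
}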
In addition, inside each individual RPC call the IPC Client applies its own retry policy, which is what produces log lines like this one:
2016-06-24 09:32:36,455 INFO [ContainerLauncher #8] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
This inner policy is controlled by two RPC parameters: ipc.client.connect.max.retries=10 and ipc.client.connect.retry.interval=1000 ms.
So the total time before the ApplicationMaster finally gives up a try-connect is: 90 × (10 × 1 s + 10 s) = 1800 s, i.e. about 30 minutes. Each of the 90 outer attempts first exhausts the inner policy's 10 retries at 1 s each and then sleeps 10 s before the next outer attempt.
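As a quick sanity check of the 1800 s figure, the tiny program below (illustrative Java, using only the default values quoted in this article) reproduces the arithmetic: 90 outer attempts, each consisting of 10 inner IPC retries at 1 s plus a 10 s outer sleep.

public class RetryTimeEstimate {
    public static void main(String[] args) {
        int outerAttempts = (15 * 60 * 1000) / (10 * 1000); // max-wait-ms / retry-interval-ms = 90
        int innerRetries = 10;        // ipc.client.connect.max.retries
        int innerSleepSec = 1;        // ipc.client.connect.retry.interval = 1000 ms
        int outerSleepSec = 10;       // yarn.client.nodemanager-connect.retry-interval-ms = 10 s
        int totalSec = outerAttempts * (innerRetries * innerSleepSec + outerSleepSec);
        System.out.println("Total wait: " + totalSec + " s"); // 1800 s, i.e. about 30 minutes
    }
}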
3. Solution
1) On the client machines from which MapReduce jobs, Hive SQL, or HiveServer2 queries are submitted, modify the following parameter in yarn-site.xml (see the snippet below).
2) Alternatively, set the parameter with -D on the hadoop command line (see the note below).
yarn.client.nodemanager-connect.max-wait-ms = 180000
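For option 1), the corresponding yarn-site.xml entry on the client machine would look like the snippet below; this is just the standard Hadoop property format with the value used in this article.

<property>
  <name>yarn.client.nodemanager-connect.max-wait-ms</name>
  <value>180000</value>
</property>

For option 2), the same value can be passed as -Dyarn.client.nodemanager-connect.max-wait-ms=180000 on the hadoop command line; note that this only takes effect when the job driver parses generic options (for example via ToolRunner).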
With this setting the outer policy allows 180000 / 10000 = 18 attempts, so the total wait drops to 18 × (10 s + 10 s) = 360 s, i.e. 6 minutes.
This change does not require restarting any YARN component; it is purely a client-side setting.