After one NodeManager exited, the Application Master produced a large volume of log entries like the ones below, and only after a very long time did it finally exit successfully.
2016-06-24 09:32:35,596 INFO [ContainerLauncher #3] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:35,596 INFO [ContainerLauncher #9] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:35,597 INFO [ContainerLauncher #7] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,455 INFO [ContainerLauncher #8] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #5] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #1] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #6] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #2] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,539 INFO [ContainerLauncher #0] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,596 INFO [ContainerLauncher #4] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,597 INFO [ContainerLauncher #3] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 09:32:36,597 INFO [ContainerLauncher #9] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
.......
2016-06-24 12:57:52,328 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:57:53,339 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:04,357 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:05,367 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:06,378 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:07,392 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:08,399 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:09,408 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:10,417 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:11,425 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-24 12:58:12,434 INFO [Thread-1835] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
1) After the NodeManager on dchadoop206 exited (because it was restarted), the Application Master kept trying to connect to the containers that had been running on that NodeManager. Obviously those containers could no longer be reached.
2) Only after a very long time, roughly 3-4 hours, was the connection failure finally raised as an exception and the Application Master able to finish normally.
2. Problem Analysis
The problem comes down to Hadoop's RPC retry mechanism. First, look at the following two configuration parameters:
# Maximum total time the client keeps trying to connect to a NodeManager. This is not a per-attempt
# timeout: once connections have failed for this long in total, the operation is considered failed.
yarn.client.nodemanager-connect.max-wait-ms = 15*60*1000 (900000 ms)

# Interval between two successive connection attempts to the NodeManager.
yarn.client.nodemanager-connect.retry-interval-ms = 10*1000 (10000 ms)
According to these definitions, the ApplicationMaster should give up connecting to the containers on the dead NodeManager after 15 minutes. What we observed, however, is that the Application Master waited roughly 30 minutes before cancelling a try-connect. The reason lies in Hadoop's RPC mechanism: the ApplicationMaster first uses the two parameters above to build a RetryUpToMaximumCountWithFixedSleep reconnect policy, whose maximum retry count is computed as follows:
MaximumCount = yarn.client.nodemanager-connect.max-wait-ms / yarn.client.nodemanager-connect.retry-interval-ms = 900000 ms / 10000 ms = 90 attempts
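To make the computation concrete, here is a minimal Java sketch of how the outer policy can be derived from the two parameters using Hadoop's public retry API. This is illustrative code, not the literal YARN source: the class and method names NmConnectRetryPolicySketch and buildOuterPolicy are hypothetical, while Configuration and RetryPolicies.retryUpToMaximumCountWithFixedSleep are real Hadoop classes (hadoop-common must be on the classpath).

import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class NmConnectRetryPolicySketch {
    // Builds the "outer" retry policy from the two YARN client parameters,
    // mirroring the MaximumCount computation described above.
    public static RetryPolicy buildOuterPolicy(Configuration conf) {
        long maxWaitMs = conf.getLong(
                "yarn.client.nodemanager-connect.max-wait-ms", 15 * 60 * 1000L);
        long retryIntervalMs = conf.getLong(
                "yarn.client.nodemanager-connect.retry-interval-ms", 10 * 1000L);
        // With the defaults: 900000 / 10000 = 90 attempts, fixed 10 s sleep between them.
        int maxRetries = (int) (maxWaitMs / retryIntervalMs);
        return RetryPolicies.retryUpToMaximumCountWithFixedSleep(
                maxRetries, retryIntervalMs, TimeUnit.MILLISECONDS);
    }
}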
In addition, inside each individual RPC call the IPC Client applies its own retry policy, which is what produces log lines like this one:
2016-06-24 09:32:36,455 INFO [ContainerLauncher #8] org.apache.hadoop.ipc.Client: Retrying connect to server: dchadoop206/192.168.1.199:32951. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
This inner policy is controlled by two RPC parameters: ipc.client.connect.max.retries=10 and ipc.client.connect.retry.interval=1000 ms.
So the total time before the ApplicationMaster finally gives up a try-connect is: 90 × (10 × 1 s + 10 s) = 1800 s, i.e. about 30 minutes. Each of the 90 outer attempts first exhausts the inner policy's 10 retries at 1 s each and then sleeps 10 s before the next outer attempt.
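As a quick sanity check of the 1800 s figure, the tiny program below (illustrative Java, using only the default values quoted in this article) reproduces the arithmetic: 90 outer attempts, each consisting of 10 inner IPC retries at 1 s plus a 10 s outer sleep.

public class RetryTimeEstimate {
    public static void main(String[] args) {
        int outerAttempts = (15 * 60 * 1000) / (10 * 1000); // max-wait-ms / retry-interval-ms = 90
        int innerRetries = 10;        // ipc.client.connect.max.retries
        int innerSleepSec = 1;        // ipc.client.connect.retry.interval = 1000 ms
        int outerSleepSec = 10;       // yarn.client.nodemanager-connect.retry-interval-ms = 10 s
        int totalSec = outerAttempts * (innerRetries * innerSleepSec + outerSleepSec);
        System.out.println("Total wait: " + totalSec + " s"); // 1800 s, i.e. about 30 minutes
    }
}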
3. Solution
1) On the client machines from which MapReduce jobs, Hive SQL, or HiveServer2 queries are submitted, modify the following parameter in yarn-site.xml (see the snippet below).
2) Alternatively, set the parameter with -D on the hadoop command line (see the note below).
yarn.client.nodemanager-connect.max-wait-ms = 180000
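For option 1), the corresponding yarn-site.xml entry on the client machine would look like the snippet below; this is just the standard Hadoop property format with the value used in this article.

<property>
  <name>yarn.client.nodemanager-connect.max-wait-ms</name>
  <value>180000</value>
</property>

For option 2), the same value can be passed as -Dyarn.client.nodemanager-connect.max-wait-ms=180000 on the hadoop command line; note that this only takes effect when the job driver parses generic options (for example via ToolRunner).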
With this setting the outer policy allows 180000 / 10000 = 18 attempts, so the total wait drops to 18 × (10 s + 10 s) = 360 s, i.e. 6 minutes.
This change does not require restarting any YARN component; it is purely a client-side setting.