This article explains how to deal with restarts of a Flink job that are caused by ZooKeeper (zk).
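While running a job with Flink on k8s recently, I found that it was frequently restarted at one particular moment. A scheduled task loads a cache once a day; the cache holds a large amount of data and takes roughly 60-90s to load. This scheduled task often caused k8s to restart the job, which made it extremely unstable, so I tried various tuning steps.
My first suspicion was that loading the cache made the sender and receiver of some operator unreachable to each other. The default heartbeat timeout is 50s, so I changed these parameters: heartbeat.timeout: 180000, heartbeat.interval: 20000.
The jobmanager and taskmanager communicate over akka, so I also set akka.ask.timeout: 240s.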
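The reasoning behind these values can be sketched as a simple comparison: a pause is only tolerated if it is shorter than the timeout guarding it. The class and method names below are hypothetical, and the sketch only compares durations; it does not model how Flink actually schedules or processes heartbeats.

```java
import java.time.Duration;

// Hypothetical helper: compares a worst-case pause against a configured timeout.
final class TimeoutHeadroom {

    // Returns true when a pause of length `stall` is strictly shorter than `timeout`.
    static boolean survives(Duration stall, Duration timeout) {
        return stall.compareTo(timeout) < 0;
    }

    public static void main(String[] args) {
        Duration worstCaseLoad = Duration.ofSeconds(90);       // upper end of the 60-90s cache load
        Duration defaultHeartbeat = Duration.ofSeconds(50);    // default heartbeat timeout named in the article
        Duration tunedHeartbeat = Duration.ofMillis(180000);   // heartbeat.timeout: 180000
        Duration tunedAkkaAsk = Duration.ofSeconds(240);       // akka.ask.timeout: 240s

        System.out.println("default heartbeat ok: " + survives(worstCaseLoad, defaultHeartbeat));
        System.out.println("tuned heartbeat ok:   " + survives(worstCaseLoad, tunedHeartbeat));
        System.out.println("tuned akka ask ok:    " + survives(worstCaseLoad, tunedAkkaAsk));
    }
}
```

With the article's numbers, only the default 50s timeout is shorter than the 90s worst case; both tuned values leave headroom.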
After these changes, an exception still occasionally appeared while the cache was loading. An excerpt of the log follows:
2020-10-16 17:05:05,939 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 29068ms for sessionid 0x30135fa8005449f
2020-10-16 17:05:05,948 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 29068ms for sessionid 0x30135fa8005449f, closing socket connection and attempting reconnect
2020-10-16 17:05:07,609 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2020-10-16 17:05:07,611 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2020-10-16 17:05:07,612 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - JobManager for job 1bb3b7bdcfbc39cf760064ed9736ea80 with leader id bed26e07640e5e79197e468c85354534 lost leadership.
2020-10-16 17:05:07,613 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2020-10-16 17:05:07,614 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job 1bb3b7bdcfbc39cf760064ed9736ea80.
2020-10-16 17:05:07,615 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Source: Custom Source -> Flat Map -> Timestamps/Watermarks (15/15) (052a84a37a0647ab485baa54f149b762).
2020-10-16 17:05:07,615 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Flat Map -> Timestamps/Watermarks (15/15) (052a84a37a0647ab485baa54f149b762) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: JobManager responsible for 1bb3b7bdcfbc39cf760064ed9736ea80 lost the leadership.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1274)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:155)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1698)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Job leader for job id 1bb3b7bdcfbc39cf760064ed9736ea80 lost leadership.
    ... 22 more
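To spot this pattern in other logs, a small scanner can look for the two telltale messages: the ZooKeeper client's session timeout and Curator's SUSPENDED state change. The following is a rough sketch that assumes the log wording matches the excerpt above; the class and method names are made up for illustration.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log helper, assuming the message wording seen in the excerpt above.
final class ZkTimeoutScan {

    // Captures how long the client heard nothing from the ZooKeeper server.
    private static final Pattern SILENCE =
            Pattern.compile("have not heard from server in (\\d+)ms for sessionid (0x[0-9a-fA-F]+)");

    // Returns the silence duration in milliseconds, or empty if the line is unrelated.
    static OptionalLong silenceMillis(String logLine) {
        Matcher m = SILENCE.matcher(logLine);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
    }

    // True when the line is Curator's report of a SUSPENDED connection.
    static boolean isSuspended(String logLine) {
        return logLine.contains("State change: SUSPENDED");
    }

    public static void main(String[] args) {
        String[] lines = {
            "2020-10-16 17:05:05,939 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn"
                + " - Client session timed out, have not heard from server in 29068ms for sessionid 0x30135fa8005449f",
            "2020-10-16 17:05:07,609 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state"
                + ".ConnectionStateManager - State change: SUSPENDED"
        };
        for (String line : lines) {
            silenceMillis(line).ifPresent(ms -> System.out.println("ZooKeeper silent for " + ms + " ms"));
            if (isSuspended(line)) {
                System.out.println("Curator reported SUSPENDED");
            }
        }
    }
}
```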
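Further investigation showed that the problem is related to zk. When zk switches leader or hits a network blip, a SUSPENDED state is triggered. This state leads to the "lost the leadership" error, and once that error occurs, k8s restarts the program outright, even though zk is in fact still reachable.
More digging showed that others had run into this long ago and had even patched the code; the Flink project simply never merged the change. I will skip the details of the investigation; the useful links are below.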
https://www.cnblogs.com/029zz010buct/p/10946244.html
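The useful suggestion in this post is to upgrade the curator package. Flink uses 2.12.0, and I have not tried the upgrade for now. The SessionConnectionStateErrorPolicy it mentions only exists in the 4.x versions, so part of the code would probably still have to be recompiled.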
https://github.com/apache/flink/pull/9066
https://issues.apache.org/jira/browse/FLINK-10052
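This is someone else's solution, and it is the one I used: do not treat the SUSPENDED state as lost leadership. To do that, modify the handleStateChange method of LeaderLatch so that the relevant cases read: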
    case RECONNECTED:
    {
        try
        {
            if ( !hasLeadership.get() )
            {
                reset();
            }
        }
        catch ( Exception e )
        {
            ThreadUtils.checkInterrupted(e);
            log.error("Could not reset leader latch", e);
            setLeadership(false);
        }
        break;
    }
    case LOST:
    {
        setLeadership(false);
        break;
    }
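The effect of the patch can be illustrated with a small stand-alone model. This is only a sketch, not Curator's code: it assumes the unpatched method drops leadership on SUSPENDED and always resets on RECONNECTED, as the article describes, and it models reset() as simply giving up leadership without simulating the re-election. All names are local to the example.

```java
// Simplified model of the leader latch; not the real Curator class.
final class LeaderLatchSketch {

    enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private final boolean patched;
    private boolean hasLeadership = true;   // start out as the elected leader
    private int leadershipLosses = 0;       // each loss is what Flink reports as "lost leadership"

    LeaderLatchSketch(boolean patched) {
        this.patched = patched;
    }

    private void setLeadership(boolean newValue) {
        if (hasLeadership && !newValue) {
            leadershipLosses++;
        }
        hasLeadership = newValue;
    }

    // Assumption: a reset gives up leadership and rejoins the election (election not modeled).
    private void reset() {
        setLeadership(false);
    }

    void handleStateChange(ConnectionState newState) {
        switch (newState) {
            case RECONNECTED:
                // Patched: only reset if leadership was actually given up in the meantime.
                if (!patched || !hasLeadership) {
                    reset();
                }
                break;
            case SUSPENDED:
                // Patched: a suspended connection is no longer treated as lost leadership.
                if (!patched) {
                    setLeadership(false);
                }
                break;
            case LOST:
                setLeadership(false);
                break;
            default:
                break;
        }
    }

    // Feeds a sequence of connection events to a fresh latch and counts leadership losses.
    static int lossesAfter(boolean patched, ConnectionState... events) {
        LeaderLatchSketch latch = new LeaderLatchSketch(patched);
        for (ConnectionState event : events) {
            latch.handleStateChange(event);
        }
        return latch.leadershipLosses;
    }

    public static void main(String[] args) {
        ConnectionState[] blip = { ConnectionState.SUSPENDED, ConnectionState.RECONNECTED };
        System.out.println("unpatched losses: " + lossesAfter(false, blip));
        System.out.println("patched losses:   " + lossesAfter(true, blip));
    }
}
```

In this model a short blip (SUSPENDED followed by RECONNECTED) costs the unpatched latch its leadership once, while the patched latch keeps it; a real session loss (LOST) still revokes leadership in both variants.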
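Having located this code, it was natural to trace it to the flink-shaded-hadoop-2-uber-xxx.jar package. Flink 1.10 still supports this Hadoop package; from 1.11 on it is no longer actively supported, and anyone who needs it has to download it themselves. Because this jar is deliberately added when we build our image, I targeted it and recompiled it. The build process, briefly:
Download the source of this version from https://github.com/apache/curator/tree/apache-curator-2.12.0, modify src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java under curator-recipes as shown above, and build the package as version 2.12.0.
Download the flink-shaded release-10.0 source from https://github.com/apache/flink-shaded/tree/release-10.0 and edit the pom file of flink-shaded-hadoop-2-parent: add an exclusion that removes the curator-recipes dependency, then add the self-compiled curator-recipes as a dependency. Without the exclusion, the version pulled in by default is 2.7.1; this part of the code has probably not been touched for years, so the version has stayed at 2.7.1.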
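<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
    <exclusions>
        <!-- ... several exclusions omitted ... -->
        <exclusion>
            <groupId>org.apache.curator</groupId>
            <artifactId>curator-recipes</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.curator</groupId>
    <artifactId>curator-recipes</artifactId>
    <version>2.12.0</version>
</dependency>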
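Because we use the 2.8.3-10.0 build while the source is set to 2.4.1, the Hadoop version in the pom also has to be changed to 2.8.3.
Following the readme.md in the root directory, run mvn package -Dshade-sources in the flink-shaded-release-10.0/flink-shaded-hadoop-2-parent directory. Once packaging finishes, decompile the result with a tool to confirm that the SUSPENDED code is indeed gone, then rebuild the image and run the job.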
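Before decompiling, it can help to confirm which LeaderLatch classes actually ended up in the rebuilt jar, for example to make sure only one copy is bundled. The sketch below lists matching entries using the JDK's zip support; the class name and the command-line usage are hypothetical, and it does not inspect the bytecode itself, so the decompile step is still needed to check the SUSPENDED case.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: lists the entries of a jar that belong to a given class.
final class JarClassFinder {

    // True for ".../LeaderLatch.class" and nested classes such as ".../LeaderLatch$1.class".
    static boolean isClassEntry(String entryName, String simpleName) {
        String file = entryName.substring(entryName.lastIndexOf('/') + 1);
        return file.equals(simpleName + ".class")
                || (file.startsWith(simpleName + "$") && file.endsWith(".class"));
    }

    // Returns the names of all entries in the jar that match the given simple class name.
    static List<String> find(String jarPath, String simpleName) {
        List<String> hits = new ArrayList<>();
        try (ZipFile jar = new ZipFile(jarPath)) {
            Enumeration<? extends ZipEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (isClassEntry(name, simpleName)) {
                    hits.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java JarClassFinder <path-to-rebuilt-uber-jar>");
            return;
        }
        for (String name : find(args[0], "LeaderLatch")) {
            System.out.println(name);
        }
    }
}
```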