Recently our company's Hadoop cluster was not shut down before a power outage. Data was lost and the namenode was corrupted and would not start, so I tried to recover it.
Method 1: use hadoop namenode -importCheckpoint
1. Delete the name directory:
[hadoop@node1 hdfs]$ rm -rf name
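Since rm -rf cannot be undone, a more cautious variant of this step is to rename the damaged directory instead of deleting it. The following is only a sketch: the helper name set_aside and the ".broken" suffix are my own choices, not part of Hadoop. After the rename the name directory no longer exists, which is the same state that rm -rf leaves behind.

```shell
#!/bin/sh
# Hypothetical helper (not a Hadoop tool): rename a damaged metadata
# directory with a timestamp suffix instead of deleting it.
set_aside() {
    dir=$1
    stamp=$(date +%Y%m%d%H%M%S)
    # Print the new location so it can be recorded or inspected later.
    mv "$dir" "${dir}.broken.${stamp}" && echo "${dir}.broken.${stamp}"
}

# Example (run from the directory that contains "name"):
#   set_aside name
```

The renamed copy can be removed once the recovered namenode has been verified.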
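2. Shut down the cluster and copy the namesecondary directory from the secondarynamenode to the namenode. It goes next to dfs.name.dir, not inside it: the import in step 3 reads the checkpoint from /app/user/hdfs/namesecondary (the fs.checkpoint.dir location) and writes into the empty dfs.name.dir:
[hadoop@node2 hdfs]$ scp -r namesecondary node1:/app/user/hdfs/
fsimage    100%  157   0.2KB/s   00:00
fstime     100%    8   0.0KB/s   00:00
fsimage    100% 2410   2.4KB/s   00:00
VERSION    100%  101   0.1KB/s   00:00
edits      100%    4   0.0KB/s   00:00
fstime     100%    8   0.0KB/s   00:00
fsimage    100% 2410   2.4KB/s   00:00
VERSION    100%  101   0.1KB/s   00:00
edits      100%    4   0.0KB/s   00:00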
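Before importing, it may be worth confirming that the copied checkpoint is complete. The sketch below checks that the current subdirectory holds the four files that appear in the listings in this post (fsimage, edits, fstime, VERSION) and that none of them is empty. The function name check_checkpoint is hypothetical, and the check says nothing about whether the image itself is intact or how old it is.

```shell
#!/bin/sh
# Hypothetical helper: verify that a checkpoint directory contains the
# expected metadata files and that none of them is empty.
check_checkpoint() {
    cur="$1/current"
    status=0
    for f in fsimage edits fstime VERSION; do
        if [ ! -s "$cur/$f" ]; then
            echo "missing or empty: $cur/$f" >&2
            status=1
        fi
    done
    return $status
}

# Example:
#   check_checkpoint /app/user/hdfs/namesecondary && ls -l /app/user/hdfs/namesecondary/current
```

Looking at the modification times in the ls -l output also shows how stale the checkpoint is before you rely on it.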
3. On the namenode, run hadoop namenode -importCheckpoint:
[hadoop@node1 hdfs]$ hadoop namenode -importCheckpoint
13/11/14 07:24:20 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = node1/192.168.1.151
STARTUP_MSG:   args = [-importCheckpoint]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
13/11/14 07:24:20 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
13/11/14 07:24:20 INFO namenode.NameNode: Namenode up at: node1.com/192.168.1.151:9000
13/11/14 07:24:20 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
13/11/14 07:24:20 INFO metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
13/11/14 07:24:21 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
13/11/14 07:24:21 INFO namenode.FSNamesystem: supergroup=supergroup
13/11/14 07:24:21 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/11/14 07:24:21 INFO metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
13/11/14 07:24:21 INFO namenode.FSNamesystem: Registered FSNamesystemStatusMBean
13/11/14 07:24:21 INFO common.Storage: Storage directory /app/user/hdfs/name is not formatted.
13/11/14 07:24:21 INFO common.Storage: Formatting ...
13/11/14 07:24:21 INFO common.Storage: Number of files = 26
13/11/14 07:24:21 INFO common.Storage: Number of files under construction = 0
13/11/14 07:24:21 INFO common.Storage: Image file of size 2410 loaded in 0 seconds.
13/11/14 07:24:21 INFO common.Storage: Edits file /app/user/hdfs/namesecondary/current/edits of size 4 edits # 0 loaded in 0 seconds.
13/11/14 07:24:21 INFO common.Storage: Image file of size 2410 saved in 0 seconds.
13/11/14 07:24:21 INFO common.Storage: Image file of size 2410 saved in 0 seconds.
13/11/14 07:24:21 INFO namenode.FSNamesystem: Number of transactions: 0 Total time for transactions(ms): 0Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0
13/11/14 07:24:21 INFO namenode.FSNamesystem: Finished loading FSImage in 252 msecs
13/11/14 07:24:21 INFO hdfs.StateChange: STATE* Safe mode ON. The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
13/11/14 07:24:21 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
13/11/14 07:24:21 INFO http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50070
13/11/14 07:24:21 INFO http.HttpServer: listener.getLocalPort() returned 50070 webServer.getConnectors()[0].getLocalPort() returned 50070
13/11/14 07:24:21 INFO http.HttpServer: Jetty bound to port 50070
13/11/14 07:24:21 INFO mortbay.log: jetty-6.1.14
13/11/14 07:24:21 INFO mortbay.log: Started SelectChannelConnector@node1.com:50070
13/11/14 07:24:21 INFO namenode.NameNode: Web-server up at: node1.com:50070
13/11/14 07:24:21 INFO ipc.Server: IPC Server Responder: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server listener on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 0 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 1 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 2 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 3 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 4 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 5 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 6 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 9 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 7 on 9000: starting
13/11/14 07:24:21 INFO ipc.Server: IPC Server handler 8 on 9000: starting
13/11/14 07:37:05 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at node1/192.168.1.151
************************************************************/
[hadoop@node1 current]$ start-all.sh
starting namenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-namenode-node1.out
192.168.1.152: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node2.out
192.168.1.153: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node3.out
192.168.1.152: starting secondarynamenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-node2.out
starting jobtracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-jobtracker-node1.out
192.168.1.152: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node2.out
192.168.1.153: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node3.out
[hadoop@node1 current]$ jps
1027 JobTracker
1121 Jps
879 NameNode
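The jps check at the end of the transcript can be scripted so that a missing daemon is reported explicitly rather than spotted by eye. This is a minimal sketch: the function name missing_daemons is hypothetical, and it only parses jps output (a PID followed by a class name on each line), so it shows that a process exists, not that HDFS is healthy.

```shell
#!/bin/sh
# Hypothetical helper: read jps output on stdin and print every expected
# daemon name (given as arguments) that does not appear in it.
missing_daemons() {
    out=$(cat)
    for d in "$@"; do
        # jps prints "<pid> <name>"; match the name column exactly.
        printf '%s\n' "$out" |
            awk -v d="$d" '$2 == d { found = 1 } END { exit !found }' ||
            echo "$d"
    done
}

# Example on the namenode host:
#   jps | missing_daemons NameNode JobTracker
```

Empty output means every listed daemon was found.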
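Summary:
Note: the recovered namenode loses everything written between the secondarynamenode's most recent checkpoint and the failure, so the value of fs.checkpoint.period is a real tradeoff: a shorter period loses less data but checkpoints more often. Also back up the contents of the secondarynamenode regularly, because the secondarynamenode is itself a single point of failure.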
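One way to take such a backup is a timestamped tar archive of the checkpoint directory, run from cron. The sketch below makes several assumptions: the function name backup_checkpoint, the backup destination, and the schedule are all mine, and the path /app/user/hdfs/namesecondary on the secondarynamenode is inferred from the transcripts in this post. An archive taken while a checkpoint is being written could be inconsistent, so keep more than one generation.

```shell
#!/bin/sh
# Hypothetical helper: archive a checkpoint directory into a timestamped
# .tar.gz under a backup directory and print the archive path.
backup_checkpoint() {
    src=$1    # e.g. /app/user/hdfs/namesecondary (assumed location)
    dest=$2   # backup directory, ideally on another machine or disk
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$dest" || return 1
    archive="$dest/$(basename "$src")-$stamp.tar.gz"
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" &&
        echo "$archive"
}

# Example:
#   backup_checkpoint /app/user/hdfs/namesecondary /backup/hdfs
```

Copying the resulting archive off the secondarynamenode host is what actually removes the single point of failure.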
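Additional note: if you recover the namenode onto a new node, keep the following in mind.
1. The new node's Linux environment, directory structure, environment variables and other settings must be identical to the original namenode's, including every configuration file under the conf directory.
2. The new namenode's hostname must match the original namenode's. If the host has been renamed, update the hosts file on every datanode and on the secondarynamenode, and reconfigure the following settings:
fs.default.name in core-site.xml
dfs.http.address in hdfs-site.xml (on the secondarynamenode node)
mapred.job.tracker in mapred-site.xml (if the jobtracker runs on the same machine as the namenode, which is usually the case).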
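To see what these three properties are currently set to on a node, something like the following sketch can help. The function name get_prop is hypothetical, and it is a plain sed filter rather than a real XML parser: it assumes the usual layout where the name and value elements of a property each sit on one line, and it treats the dots in the property name as regular-expression wildcards.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> of a named property from a
# Hadoop *-site.xml file (simple text match, not a full XML parser).
get_prop() {
    file=$1
    prop=$2
    # Take the lines from the matching <name> to the end of that
    # <property>, then extract the first <value> found there.
    sed -n "/<name>$prop<\/name>/,/<\/property>/p" "$file" |
        sed -n 's:.*<value>\(.*\)</value>.*:\1:p' |
        head -n 1
}

# Example (the conf path is an assumption about this installation):
#   get_prop /app/hadoop/conf/core-site.xml fs.default.name
```

Running it on each node makes it easy to spot a host that still points at the old namenode name.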
There is also a second method:
Use the namespaceID
1. Shut down the cluster and format the namenode:
[hadoop@node1 name]$ stop-all.sh
stopping jobtracker
192.168.1.152: stopping tasktracker
192.168.1.153: stopping tasktracker
no namenode to stop
192.168.1.152: stopping datanode
192.168.1.153: stopping datanode
192.168.1.152: stopping secondarynamenode
[hadoop@node1 name]$ hadoop namenode -format
13/11/14 06:21:37 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = node1/192.168.1.151
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
Re-format filesystem in /app/user/hdfs/name ? (Y or N) Y
13/11/14 06:21:39 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
13/11/14 06:21:39 INFO namenode.FSNamesystem: supergroup=supergroup
13/11/14 06:21:39 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/11/14 06:21:39 INFO common.Storage: Image file of size 96 saved in 0 seconds.
13/11/14 06:21:39 INFO common.Storage: Storage directory /app/user/hdfs/name has been successfully formatted.
13/11/14 06:21:39 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at node1/192.168.1.151
************************************************************/
2. Read the namespaceID that was in use before the format from any datanode, and change the namenode's namespaceID to match it:
Datanode VERSION file:
#Thu Nov … CST …
namespaceID=…
storageID=DS…
cTime=…
storageType=…
layoutVersion=…

---- change the namenode's namespaceID ----

Namenode VERSION file:
#Thu Nov … CST …
namespaceID=…
cTime=…
storageType=…
layoutVersion=…
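The edit in this step can be sketched as two small shell functions, since VERSION is a plain key=value file. The function names are hypothetical. In the usage example, the namenode path comes from the transcripts in this post, while the datanode path is only a placeholder, because the datanode's data directory is not shown here.

```shell
#!/bin/sh
# Hypothetical helpers for reading and rewriting the namespaceID line
# of a Hadoop storage VERSION file (plain key=value format).

# Print the namespaceID stored in the given VERSION file.
get_namespace_id() {
    sed -n 's/^namespaceID=//p' "$1"
}

# Replace the namespaceID in a VERSION file, keeping a .bak copy of
# the original next to it.
set_namespace_id() {
    file=$1
    id=$2
    cp "$file" "$file.bak" &&
        sed "s/^namespaceID=.*/namespaceID=$id/" "$file.bak" > "$file"
}

# Example, after copying a datanode's VERSION file to the namenode
# (the /tmp path is a placeholder):
#   id=$(get_namespace_id /tmp/datanode-VERSION)
#   set_namespace_id /app/user/hdfs/name/current/VERSION "$id"
```

Only the namespaceID line changes. The other fields in the namenode's VERSION file stay as the format step wrote them.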
3. Delete the newly formatted namenode's fsimage file:
[hadoop@node1 current]$ ll
total 16
-rw-rw-r-- 1 hadoop hadoop   4 Nov 14 06:21 edits
-rw-rw-r-- 1 hadoop hadoop  96 Nov 14 06:21 fsimage
-rw-rw-r-- 1 hadoop hadoop   8 Nov 14 06:21 fstime
-rw-rw-r-- 1 hadoop hadoop 101 Nov 14 06:22 VERSION
[hadoop@node1 current]$ rm fsimage
4. Copy fsimage from the secondarynamenode into the namenode's current directory:
[hadoop@node2 current]$ ll
total 16
-rw-rw-r-- 1 hadoop hadoop    4 Nov 14 05:38 edits
-rw-rw-r-- 1 hadoop hadoop 2410 Nov 14 05:38 fsimage
-rw-rw-r-- 1 hadoop hadoop    8 Nov 14 05:38 fstime
-rw-rw-r-- 1 hadoop hadoop  101 Nov 14 05:38 VERSION
[hadoop@node2 current]$ scp fsimage node1:/app/user/hdfs/name/current
The authenticity of host 'node1 (192.168.1.151)' can't be established.
RSA key fingerprint is ca:9a:7e:19:ee:a1:35:44:7e:9d:d4:09:5c:fc:c5:0a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node1,192.168.1.151' (RSA) to the list of known hosts.
fsimage    100% 2410   2.4KB/s   00:00
5. Restart the cluster:
[hadoop@node1 current]$ start-all.sh
starting namenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-namenode-node1.out
192.168.1.152: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node2.out
192.168.1.153: starting datanode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-datanode-node3.out
192.168.1.152: starting secondarynamenode, logging to /app/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-node2.out
starting jobtracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-jobtracker-node1.out
192.168.1.152: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node2.out
192.168.1.153: starting tasktracker, logging to /app/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node3.out
[hadoop@node1 current]$ jps
32486 Jps
32419 JobTracker
32271 NameNode
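In the second method, steps 1 and 2 format the namenode, and steps 3, 4 and 5 restore the earlier data from the backup.
When I did my own recovery I carried out steps 3, 4 and 5 as well, but the backup turned out to date from last September and was itself corrupted, so in the end all the data was gone. The lesson: back up both the namenode and the secondarynamenode by hand on a regular schedule, because the system's own backup is a single copy and very unreliable. I spent several days on this and still could not recover the data. Consider it a lesson learned the hard way.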