hadoop | A roundup of everyday Hadoop operations issues

Checking for missing blocks
When blocks go missing, the NameNode web UI on port 9870 already shows which blocks are missing.
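The missing and corrupt block counts shown on the 9870 page can also be polled from the NameNode's JMX servlet, which is handy for alerting. The sketch below makes a few assumptions:

- The `FSNamesystem` bean exposes `BlocksTotal`, `MissingBlocks` and `CorruptBlocks`. Verify the exact attribute names on your version's `/jmx` page.
- The host name `namenode-host` is hypothetical.
- The fetch does no Kerberos/SPNEGO authentication.

```python
import json
import urllib.request

# Hypothetical address of the active NameNode's web UI; replace with your own.
NAMENODE_HTTP = "http://namenode-host:9870"
JMX_QUERY = "/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
FIELDS = ("BlocksTotal", "MissingBlocks", "CorruptBlocks")


def block_health(jmx_json: str) -> dict:
    """Pick the block-health counters out of a NameNode JMX JSON response."""
    beans = json.loads(jmx_json).get("beans", [])
    if not beans:
        raise ValueError("FSNamesystem bean not found in JMX response")
    return {field: beans[0].get(field) for field in FIELDS}


def fetch_block_health(base_url: str = NAMENODE_HTTP) -> dict:
    """Fetch the FSNamesystem bean over plain HTTP (no Kerberos/SPNEGO support)."""
    with urllib.request.urlopen(base_url + JMX_QUERY, timeout=10) as resp:
        return block_health(resp.read().decode("utf-8"))
```

A cron job could call `fetch_block_health()` and raise an alert whenever `MissingBlocks` or `CorruptBlocks` is non-zero.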

You can also check from the command line:

hdfs fsck / -list-corruptfileblocks
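To act on the fsck report in a script, the output can be grouped by file. This is a minimal sketch that assumes each corrupt block is printed as `blk_<id>`, whitespace, then the HDFS path. The header and summary lines are ignored. `parse_corrupt_blocks` and `list_corrupt_blocks` are illustrative names, not Hadoop APIs.

```python
import re
import subprocess
from collections import defaultdict

# Assumed shape of a block line in the fsck report: "blk_<id><whitespace>/hdfs/path".
BLOCK_LINE = re.compile(r"^(blk_\d+)\s+(/\S.*)$")


def parse_corrupt_blocks(report: str) -> dict:
    """Group corrupt block IDs by the HDFS file they belong to."""
    by_file = defaultdict(list)
    for line in report.splitlines():
        match = BLOCK_LINE.match(line.strip())
        if match:
            block_id, path = match.groups()
            by_file[path].append(block_id)
    return dict(by_file)


def list_corrupt_blocks(path: str = "/") -> dict:
    """Run the fsck command from the text and parse its report."""
    result = subprocess.run(
        ["hdfs", "fsck", path, "-list-corruptfileblocks"],
        capture_output=True,
        text=True,
        check=False,  # inspect the report itself rather than relying on the exit status
    )
    return parse_corrupt_blocks(result.stdout)
```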


Deleting corrupt files
hdfs fsck <path> -delete

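fsck first checks whether each file is corrupt. Files reported as CORRUPT are deleted, and files diagnosed as HEALTHY are left untouched.

NameNode service went down
The log shows the following: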
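Before running `-delete`, it is safer to target exactly the files reported as corrupt and to review the commands first. The helper below is a hypothetical dry-run wrapper: it prints by default and only executes when `dry_run=False`. It relies on the behaviour described above, namely that fsck deletes only files it reports as CORRUPT.

```python
import shlex
import subprocess


def build_delete_commands(corrupt_paths):
    """One `hdfs fsck <path> -delete` command per corrupt file, deduplicated and sorted."""
    return [["hdfs", "fsck", p, "-delete"] for p in sorted(set(corrupt_paths))]


def delete_corrupt_files(corrupt_paths, dry_run=True):
    """Print every command; execute them only when dry_run is False."""
    for cmd in build_delete_commands(corrupt_paths):
        print(("DRY RUN: " if dry_run else "RUN: ") + shlex.join(cmd))
        if not dry_run:
            # fsck deletes a file only if it is reported CORRUPT; HEALTHY files are kept.
            subprocess.run(cmd, check=False)
```

Feeding it the paths from the `-list-corruptfileblocks` report keeps the blast radius to the files that are already known to be damaged.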
src=/flink/job/sjzt/checkpoints/5a4be983e14b9538d344b7fb9584cded/chk-1435 dst=null perm=hadoop:hadoop:rwxr-xr-x proto=rpc
2022-08-22 12:21:19,021 INFO[IPC Server handler 7 on default port 8020] FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7990)) - allowed=true ugi=hadoop (auth:SIMPLE) ip=/172.20.192.23 cmd=mkdirs src=/flink/job/sjzt/checkpoints/4aafc4e3232f8ffa015aa42500c3ac7f/chk-24012 dst=null perm=hadoop:hadoop:rwxr-xr-x proto=rpc
2022-08-22 12:21:19,847 INFO[Logger channel (from single-thread executor) to cdh192-57/172.20.192.57:8485] ipc.Client (Client.java:handleConnectionFailure(958)) - Retrying connect to server: cdh192-57/172.20.192.57:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2022-08-22 12:21:20,648 INFO[FSEditLogAsync] client.QuorumJournalManager (QuorumCall.java:waitFor(188)) - Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [172.20.192.56:8485]
2022-08-22 12:21:20,795 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:20,795 INFO[IPC Server handler 2 on default port 8020] FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7990)) - allowed=true ugi=hadoop (auth:SIMPLE) ip=/172.20.192.22 cmd=mkdirs src=/flink/job/sjzt/checkpoints/d16a0d39cd2693e12af0d2abbdf7b2fb/chk-26898 dst=null perm=hadoop:hadoop:rwxr-xr-x proto=rpc
2022-08-22 12:21:20,848 INFO[Logger channel (from single-thread executor) to cdh192-57/172.20.192.57:8485] ipc.Client (Client.java:handleConnectionFailure(958)) - Retrying connect to server: cdh192-57/172.20.192.57:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2022-08-22 12:21:21,395 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:21,649 INFO[FSEditLogAsync] client.QuorumJournalManager (QuorumCall.java:waitFor(188)) - Waited 7002 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [172.20.192.56:8485]
2022-08-22 12:21:21,848 INFO[Logger channel (from single-thread executor) to cdh192-57/172.20.192.57:8485] ipc.Client (Client.java:handleConnectionFailure(958)) - Retrying connect to server: cdh192-57/172.20.192.57:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2022-08-22 12:21:22,110 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:22,110 INFO[IPC Server handler 0 on default port 8020] FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7990)) - allowed=true ugi=hadoop (auth:SIMPLE) ip=/172.20.192.35 cmd=mkdirs src=/flink/job/sjzt/checkpoints/74eb2abea93b52545b7e6ba10a962df1/chk-14970 dst=null perm=hadoop:hadoop:rwxr-xr-x proto=rpc
2022-08-22 12:21:22,587 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:22,650 INFO[FSEditLogAsync] client.QuorumJournalManager (QuorumCall.java:waitFor(188)) - Waited 8003 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [172.20.192.56:8485]
2022-08-22 12:21:22,848 INFO[Logger channel (from single-thread executor) to cdh192-57/172.20.192.57:8485] ipc.Client (Client.java:handleConnectionFailure(958)) - Retrying connect to server: cdh192-57/172.20.192.57:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2022-08-22 12:21:23,075 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:23,075 INFO[IPC Server handler 9 on default port 8020] FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7990)) - allowed=true ugi=hadoop (auth:SIMPLE) ip=/172.20.192.51 cmd=create src=/user/hive/warehouse/xy_ods.db/ods_bigquery_contract_new/pk_year=2022/pk_month=2022-08/pk_day=2022-08-22/bigquery_contract_new.1661142083054.tmp dst=null perm=hadoop:hadoop:rw-r--r-- proto=rpc
2022-08-22 12:21:23,651 INFO[FSEditLogAsync] client.QuorumJournalManager (QuorumCall.java:waitFor(188)) - Waited 9004 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [172.20.192.56:8485]
2022-08-22 12:21:23,850 INFO[Logger channel (from single-thread executor) to cdh192-57/172.20.192.57:8485] ipc.Client (Client.java:handleConnectionFailure(958)) - Retrying connect to server: cdh192-57/172.20.192.57:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2022-08-22 12:21:24,203 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:24,652 INFO[FSEditLogAsync] client.QuorumJournalManager (QuorumCall.java:waitFor(188)) - Waited 10005 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [172.20.192.56:8485]
2022-08-22 12:21:24,850 INFO[Logger channel (from single-thread executor) to cdh192-57/172.20.192.57:8485] ipc.Client (Client.java:handleConnectionFailure(958)) - Retrying connect to server: cdh192-57/172.20.192.57:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2022-08-22 12:21:25,068 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:25,166 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:25,167 INFO[IPC Server handler 7 on default port 8020] FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7990)) - allowed=true ugi=hadoop (auth:SIMPLE) ip=/172.20.192.21 cmd=delete src=/user/hive/warehouse/iceberg_ods.db/ods_nft_listing/metadata/19b9ce1f689833f8b96925446296d8e8-00000-46952-370054-00001.avro dst=null perm=null proto=rpc
2022-08-22 12:21:25,652 INFO[FSEditLogAsync] client.QuorumJournalManager (QuorumCall.java:waitFor(188)) - Waited 11006 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [172.20.192.56:8485]
2022-08-22 12:21:25,765 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:25,765 INFO[IPC Server handler 4 on default port 8020] FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7990)) - allowed=true ugi=hadoop (auth:SIMPLE) ip=/172.20.192.34 cmd=mkdirs src=/flink/job/sjzt/checkpoints/353ca0c605b10433daf2e88ce9a9feb1/chk-601 dst=null perm=hadoop:hadoop:rwxr-xr-x proto=rpc
2022-08-22 12:21:25,850 INFO[Logger channel (from single-thread executor) to cdh192-57/172.20.192.57:8485] ipc.Client (Client.java:handleConnectionFailure(958)) - Retrying connect to server: cdh192-57/172.20.192.57:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2022-08-22 12:21:26,654 INFO[FSEditLogAsync] client.QuorumJournalManager (QuorumCall.java:waitFor(188)) - Waited 12007 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [172.20.192.56:8485]
2022-08-22 12:21:26,851 INFO[Logger channel (from single-thread executor) to cdh192-57/172.20.192.57:8485] ipc.Client (Client.java:handleConnectionFailure(958)) - Retrying connect to server: cdh192-57/172.20.192.57:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2022-08-22 12:21:27,655 INFO[FSEditLogAsync] client.QuorumJournalManager (QuorumCall.java:waitFor(188)) - Waited 13008 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [172.20.192.56:8485]. Exceptions so far: [172.20.192.57:8485: Journal disabled until next roll]
2022-08-22 12:21:28,186 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:28,186 INFO[IPC Server handler 8 on default port 8020] FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7990)) - allowed=true ugi=hadoop (auth:SIMPLE) ip=/172.20.192.35 cmd=mkdirs src=/flink/job/sjzt/checkpoints/8e31c6b93b5ff61a0c811066fb666dab/chk-1494 dst=null perm=hadoop:hadoop:rwxr-xr-x proto=rpc
2022-08-22 12:21:28,656 WARN[FSEditLogAsync] client.QuorumJournalManager (QuorumCall.java:waitFor(186)) - Waited 14009 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [172.20.192.56:8485]. Exceptions so far: [172.20.192.57:8485: Journal disabled until next roll]
2022-08-22 12:21:29,259 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:29,259 INFO[IPC Server handler 3 on default port 8020] FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7990)) - allowed=true ugi=hadoop (auth:SIMPLE) ip=/172.20.192.35 cmd=delete src=/flink/job/sjzt/checkpoints/19b9ce1f689833f8b96925446296d8e8/chk-370055 dst=null perm=null proto=rpc
2022-08-22 12:21:29,649 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:29,649 INFO[IPC Server handler 5 on default port 8020] FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7990)) - allowed=true ugi=hadoop (auth:SIMPLE) ip=/172.20.192.51 cmd=mkdirs src=/flink/job/sjzt/checkpoints/fe154b52ce78ac3c9568314c424cc0eb/chk-44218 dst=null perm=hadoop:hadoop:rwxr-xr-x proto=rpc
2022-08-22 12:21:29,657 WARN[FSEditLogAsync] client.QuorumJournalManager (QuorumCall.java:waitFor(186)) - Waited 15010 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [172.20.192.56:8485]. Exceptions so far: [172.20.192.57:8485: Journal disabled until next roll]
2022-08-22 12:21:29,789 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:29,789 INFO[IPC Server handler 7 on default port 8020] FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7990)) - allowed=true ugi=hadoop (auth:SIMPLE) ip=/172.20.192.60 cmd=mkdirs src=/flink/job/sjzt/checkpoints/b99c1b359b92a91573d25107708f62ef/chk-1460 dst=null perm=hadoop:hadoop:rwxr-xr-x proto=rpc
2022-08-22 12:21:30,658 WARN[FSEditLogAsync] client.QuorumJournalManager (QuorumCall.java:waitFor(186)) - Waited 16011 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [172.20.192.56:8485]. Exceptions so far: [172.20.192.57:8485: Journal disabled until next roll]
2022-08-22 12:21:30,951 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:31,659 WARN[FSEditLogAsync] client.QuorumJournalManager (QuorumCall.java:waitFor(186)) - Waited 17012 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [172.20.192.56:8485]. Exceptions so far: [172.20.192.57:8485: Journal disabled until next roll]
2022-08-22 12:21:32,660 WARN[FSEditLogAsync] client.QuorumJournalManager (QuorumCall.java:waitFor(186)) - Waited 18013 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [172.20.192.56:8485]. Exceptions so far: [172.20.192.57:8485: Journal disabled until next roll]
2022-08-22 12:21:32,927 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:33,087 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:33,661 WARN[FSEditLogAsync] client.QuorumJournalManager (QuorumCall.java:waitFor(186)) - Waited 19014 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [172.20.192.56:8485]. Exceptions so far: [172.20.192.57:8485: Journal disabled until next roll]
2022-08-22 12:21:34,229 INFO[Socket Reader #1 for port 8020] authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(138)) - Authorization successful for hadoop (auth:SIMPLE) for protocol=interface org.apache.hadoop.hdfs.protocol.ClientProtocol
2022-08-22 12:21:34,229 INFO[IPC Server handler 7 on default port 8020] FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7990)) - allowed=true ugi=hadoop (auth:SIMPLE) ip=/172.20.192.59 cmd=mkdirs src=/flink/job/sjzt/checkpoints/fa0629a9682efc2a685d5f29b665a5fc/chk-157 dst=null perm=hadoop:hadoop:rwxr-xr-x proto=rpc
2022-08-22 12:21:34,647 FATAL [FSEditLogAsync] namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [172.20.192.56:8485, 172.20.192.57:8485, 172.20.192.58:8485], stream=QuorumOutputStream starting at txid 1010629910))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:138)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:113)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:115)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:109)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:525)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:385)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:55)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:521)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:713)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.run(FSEditLogAsync.java:243)
    at java.lang.Thread.run(Thread.java:748)
2022-08-22 12:21:34,647 WARN[FSEditLogAsync] client.QuorumJournalManager (QuorumOutputStream.java:abort(73)) - Aborting QuorumOutputStream starting at txid 1010629910
2022-08-22 12:21:34,652 INFO[FSEditLogAsync] util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with status 1: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [172.20.192.56:8485, 172.20.192.57:8485, 172.20.192.58:8485], stream=QuorumOutputStream starting at txid 1010629910))
2022-08-22 12:21:34,656 INFO[shutdown-hook-0] namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at cdh192-56/172.20.192.56
************************************************************/

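As the log shows, once the write to the QJM (Quorum Journal Manager) exceeded the 20 s limit, the NameNode shut itself down: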
client.QuorumJournalManager (QuorumCall.java:waitFor(186)) - Waited 19014 ms (timeout=20000 ms)
The NameNode went down because a full GC ran for too long, which broke its communication with the JournalNodes.
Reference article:
https://blog.csdn.net/weixin_39445556/article/details/104712157
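The log lines are easier to read with the quorum arithmetic spelled out:

- There are three JournalNodes (172.20.192.56, .57 and .58).
- An edit batch is committed only once a majority, 2 of 3, acknowledges it.
- Here only 172.20.192.56 ever succeeded. 172.20.192.57 could not be reached, and .58 never answered.
- So the write could not commit, and the NameNode aborted once the 20000 ms timeout expired.

The following is a simplified model of that decision, not QJM's actual code. The function names are illustrative.

```python
def quorum_size(num_journal_nodes: int) -> int:
    """Majority of JournalNodes that must acknowledge a write: floor(n/2) + 1."""
    return num_journal_nodes // 2 + 1


def write_outcome(journal_nodes, acked, elapsed_ms, timeout_ms=20000):
    """Simplified model of one sendEdits call.

    Returns "committed" once a majority has acked, "waiting" while the
    timeout (dfs.qjournal.write-txns.timeout.ms) has not expired, and
    "fatal" afterwards, which is when the NameNode logs FATAL and exits.
    """
    acked_members = set(acked) & set(journal_nodes)
    if len(acked_members) >= quorum_size(len(set(journal_nodes))):
        return "committed"
    return "waiting" if elapsed_ms < timeout_ms else "fatal"


JNS = ["172.20.192.56:8485", "172.20.192.57:8485", "172.20.192.58:8485"]
# State at 12:21:33,661 in the log: only .56 has acked after 19014 ms.
print(write_outcome(JNS, ["172.20.192.56:8485"], 19014))  # waiting
print(write_outcome(JNS, ["172.20.192.56:8485"], 20000))  # fatal
print(write_outcome(JNS, ["172.20.192.56:8485"], 20000, timeout_ms=120000))  # waiting
```

The last call shows why raising the timeout buys time: the same single acknowledgement is still "waiting" rather than fatal, giving a slow JournalNode (or a NameNode recovering from a pause) a chance to respond.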

Solution:
1) Increase the QJM write timeout from the default 20 s to 2 minutes.
Add the following property to hdfs-site.xml:
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>120000</value>
</property>
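After editing hdfs-site.xml, it is worth confirming the value that is actually in effect. `hdfs getconf -confKey dfs.qjournal.write-txns.timeout.ms` prints it from the local configuration. The sketch below does the same check offline against a copy of the file, falling back to the 20000 ms default mentioned above. The function names are illustrative.

```python
import xml.etree.ElementTree as ET

DEFAULT_QJM_WRITE_TIMEOUT_MS = 20000  # default stated in the text


def get_property(config_xml: str, name: str):
    """Return the <value> of a named <property> in a Hadoop *-site.xml, or None."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def qjm_write_timeout_ms(config_xml: str) -> int:
    """Effective QJM write timeout: the configured value, else the default."""
    value = get_property(config_xml, "dfs.qjournal.write-txns.timeout.ms")
    return int(value) if value else DEFAULT_QJM_WRITE_TIMEOUT_MS
```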

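2) Increase the NameNode heap from the default to 80 GB.
Since we adopted Flink + Iceberg, the number of small files has exploded, and the block count has shot up to more than 30 million.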

Edit hadoop-env.sh (around line 52):
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Xms80G -Xmx80G -Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Xms10G -Xmx10G -Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"
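For a rough sense of whether a given heap fits 30+ million blocks, a commonly cited rule of thumb is about 1 GB of NameNode heap per million blocks. The sketch below applies that rule with a hypothetical 1.5x headroom factor for GC and growth. Both numbers are assumptions to tune, not Hadoop settings, and real usage also depends on the number of files and directories.

```python
def estimate_namenode_heap_gib(total_blocks: int,
                               gib_per_million_blocks: float = 1.0,
                               headroom: float = 1.5) -> float:
    """Rough NameNode heap estimate from the block count.

    gib_per_million_blocks: rule-of-thumb assumption (~1 GB per million blocks).
    headroom: hypothetical safety factor for GC and growth, not a Hadoop setting.
    """
    if total_blocks < 0:
        raise ValueError("total_blocks must be non-negative")
    return total_blocks / 1_000_000 * gib_per_million_blocks * headroom


def heap_flags(heap_gib: float) -> str:
    """Render equal -Xms/-Xmx flags in the style of the hadoop-env.sh snippet."""
    size = max(1, round(heap_gib))
    return f"-Xms{size}G -Xmx{size}G"


print(heap_flags(estimate_namenode_heap_gib(30_000_000)))  # -Xms45G -Xmx45G
```

Under these assumptions, 30 million blocks come to roughly 45 GB, so the 80 GB chosen above leaves room for the small-file count to keep growing.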

DataNode service occasionally crashes
The error is as follows:
2022-08-22 18:32:06,108 INFO[Async disk worker #1384 for volume /dfs/data5] impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(333)) - Deleted BP-1555553207-10.0.50.200-1625229209582 blk_1201618387_127897013 URI file:/dfs/data5/current/BP-1555553207-10.0.50.200-1625229209582/current/finalized/subdir31/subdir29/blk_1201618387
2022-08-22 18:32:06,108 WARN[BP-1555553207-10.0.50.200-1625229209582 heartbeating to cdh192-56/172.20.192.56:8020] datanode.DataNode (BPServiceActor.java:run(855)) - Unexpected exception in block pool Block pool BP-1555553207-10.0.50.200-1625229209582 (Datanode Uuid 7a182b3f-caa2-4f13-8f87-6781fe3d9e46) service to cdh192-56/172.20.192.56:8020
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:717)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2115)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:734)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:881)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:676)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:847)
    at java.lang.Thread.run(Thread.java:748)
2022-08-22 18:32:06,108 WARN[BP-1555553207-10.0.50.200-1625229209582 heartbeating to cdh192-56/172.20.192.56:8020] datanode.DataNode (BPServiceActor.java:run(858)) - Ending block pool service for: Block pool BP-1555553207-10.0.50.200-1625229209582 (Datanode Uuid 7a182b3f-caa2-4f13-8f87-6781fe3d9e46) service to cdh192-56/172.20.192.56:8020
2022-08-22 18:32:06,128 INFO[PacketResponder: BP-1555553207-10.0.50.200-1625229209582:blk_1201617661_127896282, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.20.192.60:9866]] DataNode.clienttrace (BlockReceiver.java:finalizeBlock(1533)) - src: /172.20.192.51:45422, dest: /172.20.192.59:9866, bytes: 7783726, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_1109828295_43, offset: 0, srvID: 7a182b3f-caa2-4f13-8f87-6781fe3d9e46, blockid: BP-1555553207-10.0.50.200-1625229209582:blk_1201617661_127896282, duration(ns): 126218643103
2022-08-22 18:32:10,372 INFO[DataXceiver for client DFSClient_NONMAPREDUCE_-585719459_66 at /172.20.192.51:46504 [Receiving block BP-1555553207-10.0.50.200-1625229209582:blk_1201618421_127897047]] sasl.SaslDataTransferClient (SaslDataTransferClient.java:checkTrustAndSend(239)) - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2022-08-22 18:32:10,373 ERROR [DataXceiver for client DFSClient_NONMAPREDUCE_-585719459_66 at /172.20.192.51:46504 [Receiving block BP-1555553207-10.0.50.200-1625229209582:blk_1201618421_127897047]] datanode.DataNode (DataXceiver.java:run(324)) - cdh192-59:9866:DataXceiver error processing WRITE_BLOCK operation src: /172.20.192.51:46504 dst: /172.20.192.59:9866
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:717)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:968)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:908)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:292)
    at java.lang.Thread.run(Thread.java:748)
2022-08-22 18:32:11,329 INFO[DataXceiver for client DFSClient_NONMAPREDUCE_821955818_84 at /172.20.192.37:40696 [Receiving block BP-1555553207-10.0.50.200-1625229209582:blk_1201618425_127897052]] datanode.DataNode (DataXceiver.java:writeBlock(747)) - Receiving BP-1555553207-10.0.50.200-1625229209582:blk_1201618425_127897052 src: /172.20.192.37:40696 dest: /172.20.192.59:9866
2022-08-22 18:32:11,329 INFO[DataXceiver for client DFSClient_NONMAPREDUCE_821955818_84 at /172.20.192.37:40696 [Receiving block BP-1555553207-10.0.50.200-1625229209582:blk_1201618425_127897052]] sasl.SaslDataTransferClient (SaslDataTransferClient.java:checkTrustAndSend(239)) - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false

Likewise, increase the DataNode heap to 10 GB. The change is the HADOOP_DATANODE_OPTS line in the hadoop-env.sh snippet above.
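`java.lang.OutOfMemoryError: unable to create new native thread` is thrown when the JVM cannot start another operating-system thread. The cause is often a per-user process/thread limit (`ulimit -u`) or native memory outside the Java heap, rather than the heap itself. So alongside the heap change, it can help to compare the DataNode's thread count with its limit.

The sketch below is Linux-only and reads `/proc/<pid>/status` and `/proc/<pid>/limits`. The DataNode PID can be found with `jps`, and the function names are illustrative.

```python
def parse_thread_count(status_text: str) -> int:
    """Return the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no 'Threads:' line found")


def parse_max_processes(limits_text: str):
    """Soft 'Max processes' limit from /proc/<pid>/limits; None means unlimited.

    On Linux this limit counts the threads of *all* processes of the same user,
    so other services running as that user (e.g. 'hadoop') share it.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max processes"):
            soft = line.split()[2]
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max processes' line found")


def thread_headroom(pid: int):
    """Return (threads used by pid, soft thread/process limit) on Linux."""
    with open(f"/proc/{pid}/status") as fh:
        threads = parse_thread_count(fh.read())
    with open(f"/proc/{pid}/limits") as fh:
        limit = parse_max_processes(fh.read())
    return threads, limit
```

If the thread count of the DataNode (plus other processes of the same user) sits close to the limit, raising `nproc` in the OS limits is likely to matter more than a larger heap.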
