Oracle案例(ORA-00600: internal error code, arguments: [4187])

本案例客户来自某省电信,alert日志大量的ORA-00600[4187]报错,已经影响到业务正常运行。
Fri Nov 19 16:07:09 2021
Errors in file /u01/ora

cle/app/oracle/diag/rdbms/lcfa/LCFA1/trace/LCFA1_smon_5811.trc(incident=184182): ORA-00600: internal error code, arguments: [ 4187], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ] Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Block recovery from logseq 54671, block 287204 to scn 17162371413499 Recovery of Online Redo Log: Thread 1 Group 5 Seq 54671 Reading mem 0 Mem# 0: +DATA/lcfa/onlinelog/group_5.287.904064243 Block recovery stopped at EOT rba 54671.287388.16 Block recovery completed at rba 54671.287388.16, scn 3995.3977065979 Non-fatal internal error happenned while SMON was doing flushing of monitored table stats. SMON encountered 1 out of maximum 100 non-fatal internal errors. Fri Nov 19 16:07:10 2021 Sweep [ inc][ 184182]: completed Fri Nov 19 16:07:10 2021 Errors in file /u01/oracle/app/oracle/diag/rdbms/lcfa/LCFA1/trace/LCFA1_ora_1734.trc(incident=190317): ORA-00600: ě2?′?′ú?, 2?y: [ 4187], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ] Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Fri Nov 19 16:09:04 2021 Block recovery from logseq 54671, block 287204 to scn 17162371413499 Recovery of Online Redo Log: Thread 1 Group 5 Seq 54671 Reading mem 0 Mem# 0: +DATA/lcfa/onlinelog/group_5.287.904064243 Block recovery completed at rba 54671.287388.16, scn 3995.3977065982 Fri Nov 19 16:10:30 2021 Errors in file /u01/oracle/app/oracle/diag/rdbms/lcfa/LCFA1/trace/LCFA1_ora_6392.trc(incident=184485): ORA-00600: ě2?′?′ú?, 2?y: [ 4187], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ], [ ] Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details.

可以看到ORA-00600[4187]并伴随着blockrecover,通常ORA-00600[4XXX]错误都来自于undo相关,并且都会触发BRR。SMON已经遇到1次内部错误,如果smon遇到100次内部错误则会重启实例,由参数_smon_internal_errlimit控制。
SQL> @sp smon_inter -- show parameter by sp -- show hidden parameter by sp old3: where x.indx=y.indx and ksppinm like '_%& p%' new3: where x.indx=y.indx and ksppinm like '_%smon_inter%' NAMEVALUEDESC ---------------------------------------- ---------- ------------------------------------------------------------------------------------------ _smon_internal_errlimit100limit of SMON internal errorsORA-00600 4187在Doc ID 19700135.8上有比较清楚的说明:Description ORA-600 [4187] can occur for undo segments where wrap# is close to the max value of 0xffffffff (KSQNMAXVAL).This normally affects databases with high transaction rate that have existed for a relatively long time.

大致意思是长期的高TPS的环境,当在新的事务绑定到某个undo段某个slot上,将递增wrap#,但是递增后的wrap#超过最大值 KSQNMAXVAL(0xffffffff),就会抛出ORA-00600[4187]错误。
继续查看trace文件查找报异常的undo段头:
TRN CTL:: seq: 0xd14f chd: 0x0009 ctl: 0x0004 inc: 0x00000000 nfb: 0x0000 mgc: 0xb000 xts: 0x0068 flg: 0x0001 opt: 2147483646 (0x7ffffffe) uba: 0x00c06434.d14f.2c scn: 0x0f9b.ed0bc826 Version: 0x01 FREE BLOCK POOL:: uba: 0x00000000.d14f.2b ext: 0x2spc: 0x1dc uba: 0x00000000.d14f.2a ext: 0x2spc: 0x70e uba: 0x00000000.d14b.02 ext: 0x1e spc: 0x1f02 uba: 0x00000000.ce3f.02 ext: 0x12 spc: 0x14da uba: 0x00000000.3226.02 ext: 0x32 spc: 0x14ae TRN TBL:: indexstate cflagswrap#uelscndbaparent-xidnubstmt_numcmt ------------------------------------------------------------------------------------------------ 0x0090x000xfffffa0c0x000c0x0f9b.ed0bc8f10x00c064340x0000.000.000000000x000000010x000000001637305928 0x0190x000xfffff4ab0x00160x0f9b.ed0bc84e0x00c064340x0000.000.000000000x000000010x000000001637305927 0x0290x000xfffff3aa0x00080x0f9b.ed0bc9340x00c064340x0000.000.000000000x000000010x000000001637305928 0x0390x000xfffff8d90x000e0x0f9b.ed0bc9850x00c064340x0000.000.000000000x000000010x000000001637305928 0x0490x000xfffffce80xffff0x0f9b.ed0bc9e70x00c064340x0000.000.000000000x000000010x000000001637305928 0x0590x000xfffff6270x001a0x0f9b.ed0bc8330x00c064340x0000.000.000000000x000000010x000000001637305927 0x0690x000xfffff4e60x00040x0f9b.ed0bc9cb0x00c064340x0000.000.000000000x000000010x000000001637305928 0x0790x000xffffece50x000b0x0f9b.ed0bc85c0x00c064340x0000.000.000000000x000000010x000000001637305927 0x0890x000xfffff7240x00210x0f9b.ed0bc93a0x00c064340x0000.000.000000000x000000010x000000001637305928 0x0990x000xfffffff30x00150x0f9b.ed0bc8280x00c064340x0000.000.000000000x000000010x000000001637305927 0x0a90x000xfffffaf20x00180x0f9b.ed0bc90c0x00c064340x0000.000.000000000x000000010x000000001637305928 0x0b90x000xfffff6710x00100x0f9b.ed0bc8670x00c064340x0000.000.000000000x000000010x000000001637305927 0x0c90x000xfffffec00x001e0x0f9b.ed0bc9000x00c064340x0000.000.000000000x000000010x000000001637305928 0x0d90x000xfffff8bf0x00200x0f9b.ed0bc8890x00c064340x0000.000.000000000x000000010x000000001637305928 0x0e90x000xfffff4ce0x00130x0f9b.ed0bc9ab0x00c064340x0000.000.000000000x000000010x000000001637305928 0x0f90x000xfffff64d0x000d0x0f9b.ed0bc8750x00c064340x0000.000.000000000x000000010x000000001637305927 0x1090x000xfffff5ec0x000f0x0f9b.ed0bc86b0x00c064340x0000.000.000000000x000000010x000000001637305927 0x1190x000xfffffccb0x001c0x0f9b.ed0bc9500x00c064340x0000.000.000000000x000000010x000000001637305928 0x1290x000xfffff55a0x001f0x0f9b.ed0bc9760x00c064340x0000.000.000000000x000000010x000000001637305928 0x1390x000xfffff6590x00140x0f9b.ed0bc9b10x00c064340x0000.000.000000000x000000010x000000001637305928 0x1490x000xffffefb80x00060x0f9b.ed0bc9c20x00c064340x0000.000.000000000x000000010x000000001637305928 0x1590x000xffffed270x00050x0f9b.ed0bc82e0x00c064340x0000.000.000000000x000000010x000000001637305927 0x1690x000xfffffd660x00070x0f9b.ed0bc8540x00c064340x0000.000.000000000x000000010x000000001637305927 0x1790x000xfffffdd50x00000x0f9b.ed0bc8e60x00c064340x0000.000.000000000x000000010x000000001637305928 0x1890x000xfffff1f40x001d0x0f9b.ed0bc9170x00c064340x0000.000.000000000x000000010x000000001637305928 0x1990x000xfffff3030x00020x0f9b.ed0bc9270x00c064340x0000.000.000000000x000000010x000000001637305928 0x1a90x000xfffff5920x00010x0f9b.ed0bc83b0x00c064340x0000.000.000000000x000000010x000000001637305927 0x1b90x000xfffff9f10x00170x0f9b.ed0bc8df0x00c064340x0000.000.000000000x000000010x000000001637305928 0x1c90x000xffffeee00x00120x0f9b.ed0bc95b0x00c064340x0000.000.000000000x000000010x000000001637305928 0x1d90x000xfffff23f0x00190x0f9b.ed0bc91e0x00c064340x0000.000.000000000x000000010x000000001637305928 0x1e90x000xfffff67e0x000a0x0f9b.ed0bc9080x00c064340x0000.000.000000000x000000010x000000001637305928 0x1f90x000xfffff1ad0x00030x0f9b.ed0bc9820x00c064340x0000.000.000000000x000000010x000000001637305928 0x2090x000xfffffb0c0x001b0x0f9b.ed0bc8ba0x00c064340x0000.000.000000000x000000010x000000001637305928 0x2190x000xfffff1eb0x00110x0f9b.ed0bc9430x00c064340x0000.000.000000000x000000010x000000001637305928

异常的undo段头的dump可以看到所有slot的wrap#都非常高,ktuxc中的chd为0009,说明下一次事务将使用slot 9的事务槽,而slot 9的wrap#为 0xfffffff3已经非常接近 KSQNMAXVAL ,但是我们知道每次wrap#重用只会加1,并不会超过 KSQNMAXVAL ,那么为什么会报出ORA-00600[4187]呢?
原因在于重用slot时wrap#+1的算法已经过时了,当前采用的是在执行ktubnd函数为事务绑定undo段时,会调用kjqghd去计算出一个重用slot递增值delta,这个delta也是有限制的,必须小于16(由KTU_MAX_KSQN_DELTA定义),所以就可能会出现 0xfffffff3 +delta的值超过KSQNMAXVAL 。
知道了错误原因,解决方法其实很简单,就是删除异常的undo段或者重建undo表空间,如果删除不掉undo段,比如还有其他活动事务,那么可以用_corrupted_rollback_segments屏蔽掉该undo段。mos也提供了脚本去检查哪些undo段面临这样的问题。
select b.segment_name, b.tablespace_name ,a.ktuxeusn "Undo Segment Number" ,a.ktuxeslt "Slot" ,a.ktuxesqn "Wrap#" fromx$ktuxe a, dba_rollback_segs b where a.ktuxesqn > -429496730 and a.ktuxesqn < 0 and a.ktuxeusn = b.segment_id;

这里还有一点需要思考的是,为什么会出现wrap#如此大?仅仅是高TPS吗?我们知道事务绑定undo段的原则是尽可能的将活动事务平均的各个undo段上,具体算法为:
在当前undo tablespace中的online undo segment中寻找事务表中没有活动事务的undo segment;
如果没有找到则尝试在当前undo tablespace online那些处于offline状态的undo segment;
如果没有找到则尝试在当前undo tablespace创建undo segment并online;
如果无法创建则会寻找最近最少使用的undo segment。
有一种非常大的可能性就是可以online的undo段太少,经过查看该实例undo表空间大小为1.5g,且不可自动扩展,这才导致了undo事务表的各个slot的wrap#如此之高。
所以针对该case的补充建议是根据高峰期TPS,合理设置undo表空间大小以及_rollback_segment_count。
【Oracle案例(ORA-00600: internal error code, arguments: [4187])】墨天轮原文链接:https://www.modb.pro/db/17494...(复制链接至浏览器或点击文末阅读原文查看)
关于作者
李翔宇,云和恩墨西区交付技术顾问,长期服务移动运营商行业客户,熟悉Oracle性能优化,故障诊断,特殊恢复。

    推荐阅读