真实的国产乱ⅩXXX66竹夫人,五月香六月婷婷激情综合,亚洲日本VA一区二区三区,亚洲精品一区二区三区麻豆

成都創(chuàng)新互聯(lián)網(wǎng)站制作重慶分公司

小機(jī)上運(yùn)行ORACLE需要注意的進(jìn)程調(diào)度BUG

前 言

創(chuàng)新互聯(lián)是創(chuàng)新、創(chuàng)意、研發(fā)型一體的綜合型網(wǎng)站建設(shè)公司,自成立以來公司不斷探索創(chuàng)新,始終堅(jiān)持為客戶提供滿意周到的服務(wù),在本地打下了良好的口碑,在過去的10年時間我們累計(jì)服務(wù)了上千家以及全國政企客戶,如石涼亭等企業(yè)單位,完善的項(xiàng)目管理流程,嚴(yán)格把控項(xiàng)目進(jìn)度與質(zhì)量監(jiān)控加上過硬的技術(shù)實(shí)力獲得客戶的一致夸獎。

小y這個名字,是筆者臨時想的一個筆名,其實(shí)沒有什么特殊的含義,就暫且用他來代表我們這些為各個數(shù)據(jù)中心奉獻(xiàn)自己青春的一群默默無聞的IT人吧!

小y今天要和大家分享的是一個疑難雜癥的分析過程。如果大家有耐心讀完這個案例,一定會或多或少有些收獲,也就沒浪費(fèi)小y的一片苦心。

具體來說是一個應(yīng)用間歇性局部掛起案例的分析過程,報(bào)告中將對Oracle數(shù)據(jù)庫穩(wěn)定運(yùn)行的共性風(fēng)險(xiǎn)和隱患作出提醒。

1問題描述

據(jù)客戶反映,應(yīng)用會間歇性出現(xiàn)異常,包括insert單條記錄在內(nèi)的操作長時間無法完成,按照客戶的說法,數(shù)據(jù)庫內(nèi)可能有“死鎖”現(xiàn)象,希望能夠找到問題發(fā)生的根因,提出解決方案,以避免問題再次發(fā)生。

2015年12月23日,問題再次發(fā)生,客戶再次聯(lián)系到小y,小y通過遠(yuǎn)程方式進(jìn)行了信息收集和故障診斷,最終定位了問題的根本原因。

環(huán)境介紹:

操作系統(tǒng) HPUX IA64 B.11.31

數(shù)據(jù)庫 ORACLE 10.2.0.5,單實(shí)例

2分析過程

>>> 2.1 異常時刻數(shù)據(jù)庫出現(xiàn)異常等待

小機(jī)上運(yùn)行ORACLE需要注意的進(jìn)程調(diào)度BUG

可以看到:

有2個會話在等待行鎖(拿不到事務(wù)鎖,需要一直等待),另外有一個會話在等待“undo segment extension”。


>>>> 2.2 梳理異常等待之間的關(guān)系

1)分析行鎖等待的阻塞者

小機(jī)上運(yùn)行ORACLE需要注意的進(jìn)程調(diào)度BUG

可以看到:

SID 285/290的兩個會話都是被SID=315的會話阻塞了,他們在等待行鎖,時間已經(jīng)超過60000秒。

2)查看阻塞者SID=315在做什么

小機(jī)上運(yùn)行ORACLE需要注意的進(jìn)程調(diào)度BUG

可以看到:

SID=315的會話阻塞了其他兩個會話,他本身也處于一個資源的等待上,在等待“undo segment extension”,已經(jīng)等了70384秒了!該等待事件沒有阻塞者。

3)SID=315在執(zhí)行的SQL語句

INSERT INTO TABLE_NAME(COL1,COL2,COL3,COL4,COL5,COL6,COL7)VALUES(:1,:2,:3,:4,:5,:6,:7)


>>>> 2.3 什么是“undo segment extension”等待事件

“undo segment extension”即等待回滾段擴(kuò)展完成。當(dāng)執(zhí)行增刪改等操作時,數(shù)據(jù)庫需要回滾段來存儲前鏡像,當(dāng)回滾段空間不足時,則需要擴(kuò)展。

具體來說,undo segment的擴(kuò)展或者回收(extend / shrink)都是前臺進(jìn)程通過通知SMON后臺進(jìn)程來完成的。


發(fā)出下列命令查看undo的使用情況,UNDO表有空有1個文件27750M,其中目前活動的值有8M,未過undo retention的1892M,過了undo retention的有40M未發(fā)現(xiàn)異常。


小機(jī)上運(yùn)行ORACLE需要注意的進(jìn)程調(diào)度BUG

>>>> 2.4 收集數(shù)據(jù)庫hanganalyze和systemstate信息

發(fā)出下列命令異常時候的hanganalyze和systemstate信息

SQL> oradebug setmypid

Statement processed.

SQL> oradebug hanganalyze 3

Hang Analysis in /oracle/admin/xxdb/udump/xxdb_ora_14136.trc

SQL>

SQL> oradebug dump systemstate 266

Statement processed.

SQL> oradebug tracefile_name

/oracle/admin/xxdb/udump/xxdb_ora_14136.trc

>>>> 2.5 問題在收集信息后自動解決

在發(fā)出上述命令收集相關(guān)信息后,再次檢查,發(fā)現(xiàn)數(shù)據(jù)庫異常等待居然已經(jīng)自動解決掉了。

從原理和經(jīng)驗(yàn)分析,這是因?yàn)閛radebug 在收集systemstate dump時,其中會調(diào)用dbx等OS命令去做進(jìn)程堆棧的打印,此時會將進(jìn)行wakeup(喚醒)。

如下所示


小機(jī)上運(yùn)行ORACLE需要注意的進(jìn)程調(diào)度BUG

>>>> 2.6 獲得SID:315會話的活動會話歷史

小機(jī)上運(yùn)行ORACLE需要注意的進(jìn)程調(diào)度BUG

可以看到,確實(shí)一直在等待undo segment extension

>>>> 2.7 分析systemstate dump定位問題根因

其中SID=315的會話在等待”undo segment extension”,該會話對應(yīng)PROCESS 19 SSD當(dāng)中的信息如下:

PROCESS 19:

----------------------------------------

SO: c00000003949b948, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00

(process) Oracle pid=19, calls cur/top: c0000000397209b0/c0000000397209b0, flag: (0) -

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 121

last post received-location: kcbzww

last process to post me: c000000039496148 1 22

last post sent: 0 0 121

last post sent-location: kcbzww

last process posted by me: c000000039496148 1 22

(latch info) wait_event=0 bits=0

Process Group: DEFAULT, pseudo proc: c000000039529928

O/S info: user: oracle, term: UNKNOWN, ospid: 11880

OSD pid info: Unix process pid: 11880, p_w_picpath: oracle@ap-machine-

*** 2015-12-22 10:34:53.431

Short stack dump:

ksdxfstk()+48<-ksdxcb()+1200<-sspuser()+368<-<-_pw_wait()+48<-pw_wait()+128<-sskgpwwait()+384<-skgpwwait()+208<-ksliwat()+1728<-kslwaitns_timed()+112<-kskthbwt()+400<-kslwait()+640<-ktugur()+4416<-ktuchg()+1280<-ktbchg2()+704<-kdiins0()+267536<-kdiinsp()+320<-kauxsin()+2960<-insidx()+1744<-insrow()+1440<-insdrv()+960<-inscovexe()+1408<-insExecStmtExecIniEngine()+176<-insexe()+1040<-opiexe()+13776<-kpoal8()+3808<-opiodr()+2144<-ttcpip()+1680<-opitsk()+2368<-opiino()+1664<-opiodr()+2144<-opidrv()+1248<-sou2o()+240<-opimai_real()+496<-main()+240<-main_opd_entry()+80

----------------------------------------

SO: c0000000396d80d8, type: 4, owner: c00000003949b948, flag: INIT/-/-/0x00

(session) sid: 315 trans: c0000000355b2a28, creator: c00000003949b948, flag: (100041) USR/- BSY/-/-/-/-/-

DID: 0001-0013-00000027, short-term DID: 0000-0000-00000000

txn branch: 0000000000000000

oct: 2, prv: 0, sql: c0000000384bad50, psql: c0000000384bad50, user: 33/XXDB

service name: xxdb

O/S info: user: , term: , ospid: 1234, machine: rvwapp2-1

program:

waiting for 'db file sequential read' wait_time=0, seconds since wait started=0

file#=c, block#=2f359, blocks=1

blocking sess=0x0000000000000000 seq=42271

Dumping Session Wait History

for 'undo segment extension' count=1 wait_time=2 min 45 sec

segment#=6, =0, =0

for 'buffer busy waits' count=1 wait_time=0.000009 sec

file#=2, block#=59, class#=1b

for 'SQL*Net message from client' count=1 wait_time=0.000087 sec

driver id=28444553, #bytes=1, =0

for 'SQL*Net message to client' count=1 wait_time=0.000001 sec

driver id=28444553, #bytes=1, =0

for 'SQL*Net message from client' count=1 wait_time=0.000086 sec

driver id=28444553, #bytes=1, =0

for 'SQL*Net message to client' count=1 wait_time=0.000001 sec

driver id=28444553, #bytes=1, =0

for 'SQL*Net message from client' count=1 wait_time=0.000086 sec

driver id=28444553, #bytes=1, =0

for 'SQL*Net message to client' count=1 wait_time=0.000001 sec

driver id=28444553, #bytes=1, =0

for 'SQL*Net message from client' count=1 wait_time=0.000087 sec

driver id=28444553, #bytes=1, =0

for 'SQL*Net message to client' count=1 wait_time=0.000001 sec

driver id=28444553, #bytes=1, =0

Sampled Session History of session 315 serial 4544

---------------------------------------------------

The sampled session history is constructed by sampling

the target session every 1 second. The sampling process

captures at each sample if the session is in a non-idle wait,

an idle wait, or not in a wait. If the session is in a

non-idle wait then one interval is shown for all the samples

the session was in the same non-idle wait. If the

session is in an idle wait or not in a wait for

consecutive samples then one interval is shown for all

the consecutive samples. Though we display these consecutive

samples in a single interval the session may NOT be continuously

idle or not in a wait (the sampling process does not know).

The history is displayed in reverse chronological order.

sample interval: 1 sec, max history 120 sec

---------------------------------------------------

[120 samples, 10:32:52 - 10:34:53]

waited for 'undo segment extension', seq_num: 42270

p1: 'segment#'=0x6

p2: ''=0x0

p3: ''=0x0

time_waited: >= 120 sec (still in wait)

---------------------------------------------------

Sampled Session History Summary:

longest_non_idle_wait: 'undo segment extension'

[120 samples, 10:32:52 - 10:34:53]

time_waited: >= 120 sec (still in wait)

可以看到:

進(jìn)程在被ORADEBUG間接喚醒后,不再等待undo segment extension,而是做” db file sequential read”。

從前面的分析得知,“undo segment extension”即等待回滾段擴(kuò)展完成。當(dāng)執(zhí)行增刪改等操作時,數(shù)據(jù)庫需要回滾段來存儲前鏡像,當(dāng)回滾段空間不足時,則需要擴(kuò)展。具體來說,undo segment的擴(kuò)展或者回收(extend / shrink)都是前臺進(jìn)程通過通知SMON后臺進(jìn)程來完成的。CallStack堆棧的調(diào)用中,最后停在“pw_wait”的調(diào)用上,即說明在等待SMON返回?cái)U(kuò)展UNDO SEGMENT成功與否的消息。

因此,我們需要查看SMON進(jìn)程的狀態(tài)。

檢查SMON的信息如下:

PROCESS 8:

----------------------------------------

SO: c000000039496148, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00

(process) Oracle pid=8, calls cur/top: c00000003971e868/c00000003971e868, flag: (16) SYSTEM

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 121

last post received-location: kcbzww

last process to post me: c0000000394a0948 211 0

last post sent: 0 0 24

last post sent-location: ksasnd

last process posted by me: c000000039495148 1 6

(latch info) wait_event=0 bits=0

Process Group: DEFAULT, pseudo proc: c000000039529928

O/S info: user: oracle, term: UNKNOWN, ospid: 10072

OSD pid info: Unix process pid: 10072, p_w_picpath: oracle@ap-machine- (SMON)

Short stack dump:

ksdxfstk()+48<-ksdxcb()+1200<-sspuser()+368<-<-_pw_wait()+48<-pw_wait()+352<-sskgpwwait()+384<-skgpwwait()+208<-ksliwat()+1728<-kslwaitns_timed()+112<-kskthbwt()+400<-kslwait()+640<-ktmmon()+1168<-ktmSmonMain()+64<-ksbrdp()+2368<-opirip()+1184<-opidrv()+1184<-sou2o()+240<-opimai_real()+336<-main()+240<-main_opd_entry()+80

----------------------------------------

----------------------------------------

SO: c0000000396ead68, type: 4, owner: c000000039496148, flag: INIT/-/-/0x00

(session) sid: 329 trans: 0000000000000000, creator: c000000039496148, flag: (100051) USR/- BSY/-/-/-/-/-

DID: 0001-0008-00000002, short-term DID: 0000-0000-00000000

txn branch: 0000000000000000

oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS

service name: SYS$BACKGROUND

waiting for 'smon timer' wait_time=0, seconds since wait started=3109

sleep time=12c, failed=0, =0

blocking sess=0x0000000000000000 seq=7382

Dumping Session Wait History

for 'smon timer' count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for 'smon timer' count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for 'smon timer' count=1 wait_time=4 min 11 sec

sleep time=12c, failed=0, =0

for 'smon timer' count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for 'smon timer' count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for 'smon timer' count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for 'smon timer' count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for 'smon timer' count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for 'smon timer' count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

for 'smon timer' count=1 wait_time=4 min 53 sec

sleep time=12c, failed=0, =0

Sampled Session History of session 329 serial 1


可以看到,SMON進(jìn)程在等“SMON TIMER”,即空閑等待。


這說明SMON沒有因?yàn)樽枞诋惓5牡却?,?dǎo)致無法騰出時間來處理前臺進(jìn)程發(fā)過來的undo segment擴(kuò)展請求。

>>>> 2.8 定位問題的根本原因

綜上所有現(xiàn)象:

1) 前臺進(jìn)程SID 315向smon發(fā)送一條消息,請求對回滾段進(jìn)行擴(kuò)展,并在“undo segment extension“事件上等待,進(jìn)程callStack停在pw_wait,即post/wait,等待消息的返回;

2) SMON進(jìn)程在等待”SMON TIMER”,即空閑等待。此時SMON可能已經(jīng)完成了UNDO SEGMENT的擴(kuò)展請求并返回了前臺進(jìn)程;也可能沒有收到前臺進(jìn)程發(fā)送過來的請求;


3) ORADEBUG對SID 315收集信息并間接喚醒了該前臺進(jìn)程后,前臺進(jìn)行可以繼續(xù)往下工作,這說明,步驟2)的消息已經(jīng)返回給前臺進(jìn)程,只是由于操作系統(tǒng)調(diào)度進(jìn)制的問題,未能即使將前臺進(jìn)程調(diào)度到CPU上,拿到UNDO擴(kuò)展成功的消息。


結(jié)上所述,我們可以判定:造成該故障的根本原因是操作系統(tǒng)調(diào)度機(jī)制的問題。

環(huán)境介紹:

操作系統(tǒng) HPUX IA64 B.11.31

數(shù)據(jù)庫 ORACLE 10.2.0.5,單實(shí)例


因此,我們在ORACLE官方網(wǎng)站BUG中以“Hpux pw_wait“做關(guān)鍵字索HPUX已知的調(diào)度缺陷,可知命中操作系統(tǒng)缺陷。

The problem is a defect in the pw_wait() O/S system call.
On HP-UX 11.31, this is caused by a problem with the HP-UX scheduler patch PHKL_37456:

PHKL_37456 scheduler cumulative patch

即HPUX11.31上,當(dāng)安裝了PHKL_37456這個調(diào)度補(bǔ)丁后,操作系統(tǒng)調(diào)用pw_wait存在缺陷,導(dǎo)致進(jìn)程無法從post/wait中被喚醒,解決方案是Solution: on 11.31 install PHKL_38397 or later equivalent.


以下是官網(wǎng)原文


Processes Hang Waiting on 'cursor: pin S wait on X' (and other Wait Events) on HP-UX 11.23 and 11.31 Itanium Systems (Doc ID 580273.1)

In this Document

Symptoms

Changes

Cause

Solution

APPLIES TO:

Oracle Database - Enterprise Edition - Version 9.2.0.1 and later
HP-UX Itanium
***Checked for relevance on 23-JUL-2013*** 

SYMPTOMS

Process hangs indefinitely waiting on one of the following wait events:

  • cursor: pin S wait on X (10g wait)

  • kksfbc child completion

  • SGA: allocation forcing component growth

There is no blocker process.
The variety in wait events is explained by the fact that the underlying issue is with the pw_wait() HP-UX system call.
Therefore, the above list is not exhaustive, use the pstack output below to match the problem in all cases.
The problem is commonly seen on Parallel Execution Slave processes but can affect any Oracle process.
A command to generate a systemstate, processstate, or errorstack dump e.g. via ALTER SESSION or oradebug command frees the hanging process.

An example of the top of the call stack of an affected process, obtained using pstack:

0: c0000000003e2ff0 : pw_wait() + 0x30 (/usr/lib/hpux64/libc.so.1)
1: 4000000002e98a20 : pw_wait() + 0x80 
(/oracle/DwhProd/orabin/oradb/product/10.2.0/bin/oracle)
2: 4000000002e492f0 : sskgpwwait() + 0x170 (/oracle/DwhProd/orabin/oradb/pr
oduct/10.2.0/bin/oracle)
3: 4000000002e3af80 : skgpwwait() + 0xa0 (/oracle/DwhP
rod/orabin/oradb/product/10.2.0/bin/oracle)
4: 4000000002d6ffd0 : ksliwat() + 0x670 
(/oracle/DwhProd/orabin/oradb/product/10.2.0/bin/oracle)
5: 4000000002d4a1b0 : kslwaitns_timed() + 0x70 
(/oracle/DwhProd/orabin/oradb/product/10.2.0/bin/oracle)
6: 4000000002d4b9b0 : kskthbwt() + 0x190 (/oracle/DwhProd/orabin/oradb/p
roduct/10.2.0/bin/oracle)
7: 4000000002e35220 : kslwait() + 0xe0 (/oracle/DwhPr
od/orabin/oradb/product/10.2.0/bin/oracle)
...

First few calls of the call stacks will differ depending on the particular wait event.

Output of the tusc system call tracer tool will be similar to the following:

( Attached to process 6144 ("ora_j032_SATAVA02") [64-bit] )
15:30:27 [6144]{631499} #1 pw_wait(0x9ffffffffffe1700) ... [sleeping]
User time: 0.00
Interrupt time: 0.00
Elapsed time: 66.3

CHANGES

The following HP Patch Bundle for 11.31 may have been recently applied
Mar 2008: FEATURE11i,B.11.31.0803.318b.

Note: it has been brought to our attention that the problem may also occur on HP-UX 11.23 systems.

CAUSE

The problem is a defect in the pw_wait() O/S system call.
On HP-UX 11.31, this is caused by a problem with the HP-UX scheduler patch PHKL_37456:

PHKL_37456 scheduler cumulative patch

Additional info may be sought in HP Change Request QXCR1000811756.

Please note that the wait events shown above 'cursor: pin S wait on X' and 'kksfbc child completion' can occur for a number of other legitimate reasons, so presence of these waits alone does not indicate that the problem described in this article has been encountered. You will need to verify in detail the other symptoms described here when investigating the issue you are facing.

The problem is also reported to occur on HP-UX 11.23. We have no information at this time on specific PHKL patch levels are susceptible to it, only that it is fixed starting with PHKL_37809.

SOLUTION

A number of options are available to work around or resolve this problem:

  • Workaround: oradebug setospid+unlimit to free the hanging process:

1. identify the Unix process id of the hanging process

2. use oradebug on it as follows (the unlimit command is sufficient to stop the hang)

SQL> oradebug setospid
SQL> oradebug unlimit

3. alternatively, if there are many processes hanging on 'cursor: pin S wait on X', or time is short, a systemstate dump (at minimum level 1 to minimize the amount of trace that will be written) will be the quickest way to go over all hanging processes and wake them up:

SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump systemstate 1

  • Solution: on 11.31 install PHKL_38397 or later equivalent.

  • Solution: on 11.23 install PHKL_37809 or later equivalent.

Note: the patches whose names begin with PHKL_ are HP-UX patches. Please check with HP-UX support on the latest applicable patch numbers as such patches are regularly updated and superceded.


3原因總結(jié)和建議

>>>> 3.1 原因總結(jié):應(yīng)用程序中的下列INSERT語句

INSERT INTO TABLE_NAME(COL1,COL2,COL3,COL4,COL5,COL6,COL7)VALUES(:1,:2,:3,:4,:5,:6,:7)

之所以執(zhí)行不下去,不是因?yàn)樗梨i,而是因?yàn)樵摃捈碨ID=315的會話,需要執(zhí)行DML,因此需要UNDO SEGMENT即回滾段來存儲前鏡像,但發(fā)現(xiàn)回滾段空間不足,需要通知SMON后臺進(jìn)程來完成擴(kuò)展的請求,但是長時間沒有獲得回滾段擴(kuò)展成功與否的返回消息。前臺進(jìn)程和SMON之間通過POST/WAIT進(jìn)制來通訊。

由于操作系統(tǒng)HPUX調(diào)度的缺陷,具體來說是pw_wait系統(tǒng)調(diào)用(post/wait)存在缺陷,當(dāng)SMON進(jìn)程完成回滾段擴(kuò)展后,消息返回時,前臺進(jìn)程SID=315并沒有能被及時調(diào)度到CPU上繼續(xù)處理后續(xù)工作,即體現(xiàn)出來就是等待在“undo segment extension“事件上等待,當(dāng)我們?nèi)藶槭褂肙RADEBUG對SID 315收集信息并間接喚醒了該前臺進(jìn)程后,前臺進(jìn)行可以繼續(xù)往下工作,這更加印證了SMON已經(jīng)將消息已經(jīng)返回給前臺進(jìn)程,只是由于操作系統(tǒng)調(diào)度進(jìn)制的問題,未能即使將前臺進(jìn)程調(diào)度到CPU上,拿到UNDO擴(kuò)展成功的消息

造成該故障的根本原因是操作系統(tǒng)調(diào)度機(jī)制的問題,該問題命中操作系統(tǒng)HPUX上的已知缺陷。

The problem is a defect in the pw_wait() O/S system call.
On HP-UX 11.31, this is caused by a problem with the HP-UX scheduler patch PHKL_37456:

PHKL_37456 scheduler cumulative patch

即HPUX11.31上,當(dāng)安裝了PHKL_37456這個調(diào)度補(bǔ)丁后,操作系統(tǒng)調(diào)用pw_wait存在缺陷,導(dǎo)致進(jìn)程無法從post/wait中被喚醒,解決方案是Solution: on 11.31 install PHKL_38397 or later equivalent.


>>>> 3.2 建議

Solution: on 11.31 install PHKL_38397 or later equivalent.

請系統(tǒng)管理員為操作系統(tǒng)安裝PHKL_38397補(bǔ)丁。


當(dāng)前名稱:小機(jī)上運(yùn)行ORACLE需要注意的進(jìn)程調(diào)度BUG
瀏覽路徑:http://weahome.cn/article/iipedc.html

其他資訊

在線咨詢

微信咨詢

電話咨詢

028-86922220(工作日)

18980820575(7×24)

提交需求

返回頂部