This section briefly introduces the PostgreSQL background process checkpointer, focusing on the implementation logic of the CreateCheckPoint function.
CheckPoint
Structure of the CheckPoint XLOG record body.
/*
 * Body of CheckPoint XLOG records.  This is declared here because we keep
 * a copy of the latest one in pg_control for possible disaster recovery.
 * Changing this struct requires a PG_CONTROL_VERSION bump.
 */
typedef struct CheckPoint
{
    XLogRecPtr    redo;             /* next RecPtr available when we began to
                                     * create CheckPoint (i.e. REDO start point) */
    TimeLineID    ThisTimeLineID;   /* current TLI */
    TimeLineID    PrevTimeLineID;   /* previous TLI, if this record begins a new
                                     * timeline (equals ThisTimeLineID otherwise) */
    bool          fullPageWrites;   /* current full_page_writes */
    uint32        nextXidEpoch;     /* higher-order bits of nextXid */
    TransactionId nextXid;          /* next free XID */
    Oid           nextOid;          /* next free OID */
    MultiXactId   nextMulti;        /* next free MultiXactId */
    MultiXactOffset nextMultiOffset;    /* next free MultiXact offset */
    TransactionId oldestXid;        /* cluster-wide minimum datfrozenxid */
    Oid           oldestXidDB;      /* database with minimum datfrozenxid */
    MultiXactId   oldestMulti;      /* cluster-wide minimum datminmxid */
    Oid           oldestMultiDB;    /* database with minimum datminmxid */
    pg_time_t     time;             /* time stamp of checkpoint */
    TransactionId oldestCommitTsXid;    /* oldest Xid with valid commit timestamp */
    TransactionId newestCommitTsXid;    /* newest Xid with valid commit timestamp */

    /*
     * Oldest XID still running. This is only needed to initialize hot standby
     * mode from an online checkpoint, so we only bother calculating this for
     * online checkpoints and only when wal_level is replica. Otherwise it's
     * set to InvalidTransactionId.
     */
    TransactionId oldestActiveXid;
} CheckPoint;
/* XLOG info values for XLOG rmgr */
#define XLOG_CHECKPOINT_SHUTDOWN 0x00
#define XLOG_CHECKPOINT_ONLINE 0x10
#define XLOG_NOOP 0x20
#define XLOG_NEXTOID 0x30
#define XLOG_SWITCH 0x40
#define XLOG_BACKUP_END 0x50
#define XLOG_PARAMETER_CHANGE 0x60
#define XLOG_RESTORE_POINT 0x70
#define XLOG_FPW_CHANGE 0x80
#define XLOG_END_OF_RECOVERY 0x90
#define XLOG_FPI_FOR_HINT 0xA0
#define XLOG_FPI 0xB0
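As a rough illustration of how these info values are used, the sketch below maps the high four bits of a record's info byte to a few of the names defined above. It assumes that the low four bits of the byte are reserved for generic WAL flags, so only the high nibble selects the record type; the mask name SKETCH_TYPE_MASK and the helper xlog_info_name are hypothetical and are not PostgreSQL symbols.

```c
#include <stdint.h>

/* Values copied from the listing above. */
#define XLOG_CHECKPOINT_SHUTDOWN 0x00
#define XLOG_CHECKPOINT_ONLINE   0x10
#define XLOG_NOOP                0x20
#define XLOG_NEXTOID             0x30
#define XLOG_SWITCH              0x40

/* Assumption: only the high 4 bits of the info byte select the record type. */
#define SKETCH_TYPE_MASK 0xF0

/* Hypothetical helper: return a printable name for an XLOG rmgr info byte. */
static const char *
xlog_info_name(uint8_t info)
{
    switch (info & SKETCH_TYPE_MASK)
    {
        case XLOG_CHECKPOINT_SHUTDOWN:
            return "XLOG_CHECKPOINT_SHUTDOWN";
        case XLOG_CHECKPOINT_ONLINE:
            return "XLOG_CHECKPOINT_ONLINE";
        case XLOG_NOOP:
            return "XLOG_NOOP";
        case XLOG_NEXTOID:
            return "XLOG_NEXTOID";
        case XLOG_SWITCH:
            return "XLOG_SWITCH";
        default:
            return "other";     /* remaining types omitted in this sketch */
    }
}
```

CreateCheckPoint only ever emits the first two of these types: XLOG_CHECKPOINT_SHUTDOWN for shutdown and end-of-recovery checkpoints, and XLOG_CHECKPOINT_ONLINE otherwise.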
CheckpointerShmem
Shared-memory structure used for communication between the checkpointer process and the backends.
/*----------
* Shared memory area for communication between checkpointer and backends
*
* The ckpt counters allow backends to watch for completion of a checkpoint
* request they send. Here's how it works:
* * At start of a checkpoint, checkpointer reads (and clears) the request
* flags and increments ckpt_started, while holding ckpt_lck.
* * On completion of a checkpoint, checkpointer sets ckpt_done to
* equal ckpt_started.
* * On failure of a checkpoint, checkpointer increments ckpt_failed
* and sets ckpt_done to equal ckpt_started.
*
* The algorithm for backends is:
* 1. Record current values of ckpt_failed and ckpt_started, and
* set request flags, while holding ckpt_lck.
* 2. Send signal to request checkpoint.
* 3. Sleep until ckpt_started changes. Now you know a checkpoint has
* begun since you started this algorithm (although *not* that it was
* specifically initiated by your signal), and that it is using your flags.
* 4. Record new value of ckpt_started.
* 5. Sleep until ckpt_done >= saved value of ckpt_started. (Use modulo
* arithmetic here in case counters wrap around.) Now you know a
* checkpoint has started and completed, but not whether it was
* successful.
* 6. If ckpt_failed is different from the originally saved value,
* assume request failed; otherwise it was definitely successful.
*
* ckpt_flags holds the OR of the checkpoint request flags sent by all
* requesting backends since the last checkpoint start. The flags are
* chosen so that OR'ing is the correct way to combine multiple requests.
*
* num_backend_writes is used to count the number of buffer writes performed
* by user backend processes. This counter should be wide enough that it
* can't overflow during a single processing cycle. num_backend_fsync
* counts the subset of those writes that also had to do their own fsync,
* because the checkpointer failed to absorb their request.
*
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
*
* Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
* the requests fields are protected by CheckpointerCommLock.
*
*----------
*/
typedef struct
{
    RelFileNode rnode;      /* tablespace/database/relation identifiers */
    ForkNumber  forknum;    /* fork number */
    BlockNumber segno;      /* see md.c for special values */
    /* might add a real request-type field later; not needed yet */
} CheckpointerRequest;

typedef struct
{
    pid_t       checkpointer_pid;   /* PID (0 if not started) */
    slock_t     ckpt_lck;       /* protects all the ckpt_* fields */
    int         ckpt_started;   /* advances when checkpoint starts */
    int         ckpt_done;      /* advances when checkpoint done */
    int         ckpt_failed;    /* advances when checkpoint fails */
    int         ckpt_flags;     /* checkpoint flags, as defined in xlog.h */
    uint32      num_backend_writes; /* counts user backend buffer writes */
    uint32      num_backend_fsync;  /* counts user backend fsync calls */
    int         num_requests;   /* current # of requests */
    int         max_requests;   /* allocated array size */
    CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
} CheckpointerShmemStruct;

/* file-local pointer to the shared CheckpointerShmemStruct */
static CheckpointerShmemStruct *CheckpointerShmem;
VirtualTransactionId
Top-level transactions are identified by VirtualTransactionIDs.
/*
* Top-level transactions are identified by VirtualTransactionIDs comprising
* the BackendId of the backend running the xact, plus a locally-assigned
* LocalTransactionId. These are guaranteed unique over the short term,
* but will be reused after a database restart; hence they should never
* be stored on disk.
*
* Note that struct VirtualTransactionId can not be assumed to be atomically
* assignable as a whole. However, type LocalTransactionId is assumed to
* be atomically assignable, and the backend ID doesn't change often enough
* to be a problem, so we can fetch or assign the two fields separately.
* We deliberately refrain from using the struct within PGPROC, to prevent
* coding errors from trying to use struct assignment with it; instead use
* GET_VXID_FROM_PGPROC().
*/
typedef struct
{
BackendId backendId; /* determined at backend startup */
LocalTransactionId localTransactionId; /* backend-local transaction id */
} VirtualTransactionId;
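The note about atomic assignment can be made concrete with a small sketch that copies and compares a VXID one field at a time instead of using struct assignment. The typedefs below are simplified stand-ins (an assumption of this sketch), and vxid_copy and vxid_equals are hypothetical helper names, not PostgreSQL functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: simplified stand-ins for the PostgreSQL typedefs. */
typedef int      BackendId;
typedef uint32_t LocalTransactionId;

typedef struct
{
    BackendId          backendId;           /* determined at backend startup */
    LocalTransactionId localTransactionId;  /* backend-local transaction id */
} VirtualTransactionId;

/* Copy the two fields separately rather than assigning the struct as a whole. */
static void
vxid_copy(VirtualTransactionId *dst, const VirtualTransactionId *src)
{
    dst->backendId = src->backendId;
    dst->localTransactionId = src->localTransactionId;
}

/* Two VXIDs are equal only if both fields match. */
static bool
vxid_equals(const VirtualTransactionId *a, const VirtualTransactionId *b)
{
    return a->backendId == b->backendId &&
           a->localTransactionId == b->localTransactionId;
}
```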
The CreateCheckPoint function performs a checkpoint, either during shutdown or on the fly.
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*
* flags is a bitwise OR of the following:
* CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
* CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
* CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
* ignoring checkpoint_completion_target parameter.
* CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
* since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
* CHECKPOINT_END_OF_RECOVERY).
* CHECKPOINT_FLUSH_ALL: also flush buffers of unlogged tables.
*
* Note: flags contains other bits, of interest here only for logging purposes.
* In particular note that this routine is synchronous and does not pay
* attention to CHECKPOINT_WAIT.
*
* If !shutdown then we are writing an online checkpoint. This is a very special
* kind of operation and WAL record because the checkpoint action occurs over
* a period of time yet logically occurs at just a single LSN. The logical
* position of the WAL record (redo ptr) is the same or earlier than the
* physical position. When we replay WAL we locate the checkpoint via its
* physical position then read the redo ptr and actually start replay at the
* earlier logical position. Note that we don't write *anything* to WAL at
* the logical position, so that location could be any other kind of WAL record.
* All of this mechanism allows us to continue working while we checkpoint.
* As a result, timing of actions is critical here and be careful to note that
* this function will likely take minutes to execute on a busy system.
*/
void
CreateCheckPoint(int flags)
{
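bool shutdown;	/* is this a shutdown (or end-of-recovery) checkpoint? */
CheckPoint checkPoint;	/* body of the checkpoint record being built */
XLogRecPtr recptr;	/* end position of the inserted checkpoint record */
XLogSegNo _logSegNo;	/* WAL segment number (uint64) */
XLogCtlInsert *Insert = &XLogCtl->Insert;	/* shared WAL insertion state */
uint32 freespace;	/* free space left in the current WAL page */
XLogRecPtr PriorRedoPtr;	/* REDO pointer of the previous checkpoint */
XLogRecPtr curInsert;	/* current insert position */
XLogRecPtr last_important_lsn;	/* LSN of the last important record */
VirtualTransactionId *vxids;	/* virtual transaction IDs delaying the checkpoint */
int nvxids;	/* number of entries in vxids */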
/*
* An end-of-recovery checkpoint is really a shutdown checkpoint, just
* issued at a different time.
*/
if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
shutdown = true;
else
shutdown = false;
/* sanity check */
if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
elog(ERROR, "can't create a checkpoint during recovery");
/*
* Initialize InitXLogInsert working areas before entering the critical
* section. Normally, this is done by the first call to
* RecoveryInProgress() or LocalSetXLogInsertAllowed(), but when creating
* an end-of-recovery checkpoint, the LocalSetXLogInsertAllowed call is
* done below in a critical section, and InitXLogInsert cannot be called
* in a critical section.
*/
InitXLogInsert();
/*
* Acquire CheckpointLock to ensure only one checkpoint happens at a time.
* (This is just pro forma, since in the present system structure there is
* only one process that is allowed to issue checkpoints at any given
* time.)
*/
LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
/*
* Prepare to accumulate statistics.
*
* Note: because it is possible for log_checkpoints to change while a
* checkpoint proceeds, we always accumulate stats, even if
* log_checkpoints is currently off.
*/
MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
/*
* Use a critical section to force system panic if we have trouble.
*/
START_CRIT_SECTION();
if (shutdown)
{
/* shutdown checkpoint: record the DB_SHUTDOWNING state in the control file (pg_control) */
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->state = DB_SHUTDOWNING;
ControlFile->time = (pg_time_t) time(NULL);
UpdateControlFile();
LWLockRelease(ControlFileLock);
}
/*
* Let smgr prepare for checkpoint; this has to happen before we determine
* the REDO pointer. Note that smgr must not do anything that'd have to
* be undone if we decide no checkpoint is needed.
*/
smgrpreckpt();
/* Begin filling in the checkpoint WAL record */
MemSet(&checkPoint, 0, sizeof(checkPoint));
checkPoint.time = (pg_time_t) time(NULL);	/* checkpoint timestamp */
/*
* For Hot Standby, derive the oldestActiveXid before we fix the redo
* pointer. This allows us to begin accumulating changes to assemble our
* starting snapshot of locks and transactions.
*/
if (!shutdown && XLogStandbyInfoActive())
checkPoint.oldestActiveXid = GetOldestActiveTransactionId();
else
checkPoint.oldestActiveXid = InvalidTransactionId;
/*
* Get location of last important record before acquiring insert locks (as
* GetLastImportantRecPtr() also locks WAL locks).
*/
last_important_lsn = GetLastImportantRecPtr();
/*
* We must block concurrent insertions while examining insert state to
* determine the checkpoint REDO pointer.
*/
WALInsertLockAcquireExclusive();
curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
/*
* If this isn't a shutdown or forced checkpoint, and if there has been no
* WAL activity requiring a checkpoint, skip it. The idea here is to
* avoid inserting duplicate checkpoints when the system is idle.
*/
if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
CHECKPOINT_FORCE)) == 0)
{
if (last_important_lsn == ControlFile->checkPoint)
{
WALInsertLockRelease();
LWLockRelease(CheckpointLock);
END_CRIT_SECTION();
ereport(DEBUG1,
(errmsg("checkpoint skipped because system is idle")));
return;
}
}
/*
* An end-of-recovery checkpoint is created before anyone is allowed to
* write WAL. To allow us to write the checkpoint record, temporarily
* enable XLogInsertAllowed. (This also ensures ThisTimeLineID is
* initialized, which we need here and in AdvanceXLInsertBuffer.)
*/
if (flags & CHECKPOINT_END_OF_RECOVERY)
LocalSetXLogInsertAllowed();
checkPoint.ThisTimeLineID = ThisTimeLineID;
if (flags & CHECKPOINT_END_OF_RECOVERY)
checkPoint.PrevTimeLineID = XLogCtl->PrevTimeLineID;
else
checkPoint.PrevTimeLineID = ThisTimeLineID;
checkPoint.fullPageWrites = Insert->fullPageWrites;
/*
* Compute new REDO record ptr = location of next XLOG record.
*
* NB: this is NOT necessarily where the checkpoint record itself will be,
* since other backends may insert more XLOG records while we're off doing
* the buffer flush work. Those XLOG records are logically after the
* checkpoint, even though physically before it. Got that?
*/
freespace = INSERT_FREESPACE(curInsert);	/* free space left in the current page */
if (freespace == 0)
{
    /* the page is full, so the next record starts after the next page's header */
    if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
        curInsert += SizeOfXLogLongPHD;	/* first page of a WAL segment file: long header */
    else
        curInsert += SizeOfXLogShortPHD;	/* any other page: short header */
}
checkPoint.redo = curInsert;
/*
* Here we update the shared RedoRecPtr for future XLogInsert calls; this
* must be done while holding all the insertion locks.
*
* Note: if we fail to complete the checkpoint, RedoRecPtr will be left
* pointing past where it really needs to point. This is okay; the only
* consequence is that XLogInsert might back up whole buffers that it
* didn't really need to. We can't postpone advancing RedoRecPtr because
* XLogInserts that happen while we are dumping buffers must assume that
* their buffer changes are not included in the checkpoint.
*/
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
/*
* Now we can release the WAL insertion locks, allowing other xacts to
* proceed while we are flushing disk buffers.
*/
WALInsertLockRelease();
/* Update the info_lck-protected copy of RedoRecPtr as well */
SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->RedoRecPtr = checkPoint.redo;
SpinLockRelease(&XLogCtl->info_lck);
/*
* If enabled, log checkpoint start. We postpone this until now so as not
* to log anything if we decided to skip the checkpoint.
*/
if (log_checkpoints)
LogCheckpointStart(flags, false);
TRACE_POSTGRESQL_CHECKPOINT_START(flags);
/*
* Get the other info we need for the checkpoint record.
*
* We don't need to save oldestClogXid in the checkpoint, it only matters
* for the short period in which clog is being truncated, and if we crash
* during that we'll redo the clog truncation and fix up oldestClogXid
* there.
*/
LWLockAcquire(XidGenLock, LW_SHARED);
checkPoint.nextXid = ShmemVariableCache->nextXid;
checkPoint.oldestXid = ShmemVariableCache->oldestXid;
checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
LWLockRelease(XidGenLock);
LWLockAcquire(CommitTsLock, LW_SHARED);
checkPoint.oldestCommitTsXid = ShmemVariableCache->oldestCommitTsXid;
checkPoint.newestCommitTsXid = ShmemVariableCache->newestCommitTsXid;
LWLockRelease(CommitTsLock);
/* Increase XID epoch if we've wrapped around since last checkpoint */
checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
checkPoint.nextXidEpoch++;
LWLockAcquire(OidGenLock, LW_SHARED);
checkPoint.nextOid = ShmemVariableCache->nextOid;
if (!shutdown)
checkPoint.nextOid += ShmemVariableCache->oidCount;
LWLockRelease(OidGenLock);
MultiXactGetCheckptMulti(shutdown,
&checkPoint.nextMulti,
&checkPoint.nextMultiOffset,
&checkPoint.oldestMulti,
&checkPoint.oldestMultiDB);
/*
* Having constructed the checkpoint record, ensure all shmem disk buffers
* and commit-log buffers are flushed to disk.
*
* This I/O could fail for various reasons. If so, we will fail to
* complete the checkpoint, but there is no reason to force a system
* panic. Accordingly, exit critical section while doing it.
*/
END_CRIT_SECTION();
/*
* In some cases there are groups of actions that must all occur on one
* side or the other of a checkpoint record. Before flushing the
* checkpoint record we must explicitly wait for any backend currently
* performing those groups of actions.
*
* One example is end of transaction, so we must wait for any transactions
* that are currently in commit critical sections. If an xact inserted
* its commit record into XLOG just before the REDO point, then a crash
* restart from the REDO point would not replay that record, which means
* that our flushing had better include the xact's update of pg_xact. So
* we wait till he's out of his commit critical section before proceeding.
* See notes in RecordTransactionCommit().
*
* Because we've already released the insertion locks, this test is a bit
* fuzzy: it is possible that we will wait for xacts we didn't really need
* to wait for. But the delay should be short and it seems better to make
* checkpoint take a bit longer than to hold off insertions longer than
* necessary. (In fact, the whole reason we have this issue is that xact.c
* does commit record XLOG insertion and clog update as two separate steps
* protected by different locks, but again that seems best on grounds of
* minimizing lock contention.)
*
* A transaction that has not yet set delayChkpt when we look cannot be at
* risk, since he's not inserted his commit record yet; and one that's
* already cleared it is not at risk either, since he's done fixing clog
* and we will correctly flush the update below. So we cannot miss any
* xacts we need to wait for.
*/
vxids = GetVirtualXIDsDelayingChkpt(&nvxids);	/* VXIDs of xacts currently delaying the checkpoint */
if (nvxids > 0)
{
    do
    {
        pg_usleep(10000L);	/* wait for 10 msec */
    } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
}
pfree(vxids);
/* flush all data in shared memory to disk, and fsync */
CheckPointGuts(checkPoint.redo, flags);
/*
* Take a snapshot of running transactions and write this to WAL. This
* allows us to reconstruct the state of running transactions during
* archive recovery, if required. Skip, if this info disabled.
*
* If we are shutting down, or Startup process is completing crash
* recovery we don't need to write running xact data.
*/
if (!shutdown && XLogStandbyInfoActive())
LogStandbySnapshot();
START_CRIT_SECTION();	/* re-enter the critical section */
/*
 * Now insert the checkpoint record into XLOG.
 */
XLogBeginInsert();	/* start assembling the record */
XLogRegisterData((char *) (&checkPoint), sizeof(checkPoint));	/* register the record body */
recptr = XLogInsert(RM_XLOG_ID,
                    shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
                    XLOG_CHECKPOINT_ONLINE);	/* insert the record into WAL */
XLogFlush(recptr);	/* flush WAL up to the end of the record */
/*
* We mustn't write any new WAL after a shutdown checkpoint, or it will be
* overwritten at next startup. No-one should even try, this just allows
* sanity-checking. In the case of an end-of-recovery checkpoint, we want
* to just temporarily disable writing until the system has exited
* recovery.
*/
if (shutdown)
{
if (flags & CHECKPOINT_END_OF_RECOVERY)
LocalXLogInsertAllowed = -1; /* return to "check" state */
else
LocalXLogInsertAllowed = 0; /* never again write WAL */
}
/*
* We now have ProcLastRecPtr = start of actual checkpoint record, recptr
* = end of actual checkpoint record.
*/
if (shutdown && checkPoint.redo != ProcLastRecPtr)
ereport(PANIC,
(errmsg("concurrent write-ahead log activity while database system is shutting down")));
/*
* Remember the prior checkpoint's redo ptr for
* UpdateCheckPointDistanceEstimate()
*/
PriorRedoPtr = ControlFile->checkPointCopy.redo;
/*
* Update the control file.
*/
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
if (shutdown)
ControlFile->state = DB_SHUTDOWNED;
ControlFile->checkPoint = ProcLastRecPtr;
ControlFile->checkPointCopy = checkPoint;
ControlFile->time = (pg_time_t) time(NULL);
/* crash recovery should always recover to the end of WAL */
ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
ControlFile->minRecoveryPointTLI = 0;
/*
* Persist unloggedLSN value. It's reset on crash recovery, so this goes
* unused on non-shutdown checkpoints, but seems useful to store it always
* for debugging purposes.
*/
SpinLockAcquire(&XLogCtl->ulsn_lck);
ControlFile->unloggedLSN = XLogCtl->unloggedLSN;
SpinLockRelease(&XLogCtl->ulsn_lck);
UpdateControlFile();
LWLockRelease(ControlFileLock);
/* Update shared-memory copy of checkpoint XID/epoch */
SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
XLogCtl->ckptXid = checkPoint.nextXid;
SpinLockRelease(&XLogCtl->info_lck);
/*
* We are now done with critical updates; no need for system panic if we
* have trouble while fooling with old log segments.
*/
END_CRIT_SECTION();
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
smgrpostckpt();
/*
* Update the average distance between checkpoints if the prior checkpoint
* exists.
*/
if (PriorRedoPtr != InvalidXLogRecPtr)
UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
/*
* Delete old log files, those no longer needed for last checkpoint to
* prevent the disk holding the xlog from growing full.
*/
XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
KeepLogSeg(recptr, &_logSegNo);
_logSegNo--;
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
/*
* Make more log segments if needed. (Do this after recycling old log
* segments, since that may supply some of the needed files.)
*/
if (!shutdown)
PreallocXlogFiles(recptr);
/*
* Truncate pg_subtrans if possible. We can throw away all data before
* the oldest XMIN of any running transaction. No future transaction will
* attempt to reference any pg_subtrans entry older than that (see Asserts
* in subtrans.c). During recovery, though, we mustn't do this because
* StartupSUBTRANS hasn't been called yet.
*/
if (!RecoveryInProgress())
TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
/* Real work is done, but log and update stats before releasing lock. */
LogCheckpointEnd(false);
TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
NBuffers,
CheckpointStats.ckpt_segs_added,
CheckpointStats.ckpt_segs_removed,
CheckpointStats.ckpt_segs_recycled);
LWLockRelease(CheckpointLock);
}
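The REDO pointer computation in CreateCheckPoint (the freespace check before checkPoint.redo is assigned) can be sketched in isolation as follows. This is only an illustration: the page size and the two page-header sizes are assumed values for a default build and are not given in this article, and sketch_redo_ptr is a hypothetical name.

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Assumed values for a default build; verify them against your own headers. */
#define SKETCH_XLOG_BLCKSZ    8192u
#define SKETCH_SHORT_PHD_SIZE 24u
#define SKETCH_LONG_PHD_SIZE  40u

/*
 * Hypothetical helper: given the current insert position, return the
 * position where the next WAL record would start.  If the position sits
 * exactly on a page boundary there is no free space left on the page, so
 * the next record begins after the header of the following page.
 */
static XLogRecPtr
sketch_redo_ptr(XLogRecPtr cur_insert, uint64_t wal_segment_size)
{
    if (cur_insert % SKETCH_XLOG_BLCKSZ == 0)
    {
        if (cur_insert % wal_segment_size == 0)
            cur_insert += SKETCH_LONG_PHD_SIZE;     /* first page of a segment */
        else
            cur_insert += SKETCH_SHORT_PHD_SIZE;    /* any other page */
    }
    return cur_insert;
}
```

With a 16 MB segment size, a position in the middle of a page is returned unchanged, a position on an ordinary page boundary advances by the short header size, and a position on a segment boundary advances by the long header size.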
/*
* Flush all data in shared memory to disk, and fsync
*
* This is the common code shared between regular checkpoints and
* recovery restartpoints.
*/
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
CheckPointCLOG();
CheckPointCommitTs();
CheckPointSUBTRANS();
CheckPointMultiXact();
CheckPointPredicate();
CheckPointRelationMap();
CheckPointReplicationSlots();
CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointBuffers(flags); /* performs all required fsyncs */
CheckPointReplicationOrigin();
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
}
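Returning to CreateCheckPoint for a moment, the step that increases nextXidEpoch when nextXid has wrapped around since the previous checkpoint can be sketched like this. The helper names are hypothetical, and combining epoch and XID into one 64-bit value is shown only to illustrate why the epoch is described as the higher-order bits of nextXid.

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Hypothetical helper: if the current nextXid is numerically smaller than
 * the one saved by the previous checkpoint, the 32-bit counter must have
 * wrapped around, so the epoch advances by one.
 */
static uint32_t
sketch_next_epoch(uint32_t prev_epoch,
                  TransactionId prev_next_xid,
                  TransactionId cur_next_xid)
{
    return (cur_next_xid < prev_next_xid) ? prev_epoch + 1 : prev_epoch;
}

/* Combine epoch (high 32 bits) and XID (low 32 bits) into one 64-bit value. */
static uint64_t
sketch_full_xid(uint32_t epoch, TransactionId xid)
{
    return ((uint64_t) epoch << 32) | xid;
}
```

Note that this detection relies on at least one checkpoint happening per wraparound cycle, since only a single decrease of nextXid can be observed between two checkpoints.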
Update some data, then run a checkpoint.
testdb=# update t_wal_ckpt set c2 = 'C2_'||substr(c2,4,40);
UPDATE 1
testdb=# checkpoint;
Start gdb, configure signal handling, set a breakpoint, and enter CreateCheckPoint.
(gdb) handle SIGINT print nostop pass
SIGINT is used by the debugger.
Are you sure you want to change it? (y or n) y
Signal Stop Print Pass to program Description
SIGINT No Yes Yes Interrupt
(gdb)
(gdb) b CreateCheckPoint
Breakpoint 1 at 0x55b4fb: file xlog.c, line 8668.
(gdb) c
Continuing.
Program received signal SIGINT, Interrupt.
Breakpoint 1, CreateCheckPoint (flags=44) at xlog.c:8668
8668 XLogCtlInsert *Insert = &XLogCtl->Insert;
(gdb)
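The value flags=44 reported at the breakpoint can be decoded with a small helper. The bit values below are not shown in this article; they are assumed from xlog.h of the PostgreSQL release being debugged, so check them against your own source tree. The helper describe_ckpt_flags is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Assumed bit values; verify against xlog.h in your source tree. */
#define CHECKPOINT_IS_SHUTDOWN     0x0001
#define CHECKPOINT_END_OF_RECOVERY 0x0002
#define CHECKPOINT_IMMEDIATE       0x0004
#define CHECKPOINT_FORCE           0x0008
#define CHECKPOINT_FLUSH_ALL       0x0010
#define CHECKPOINT_WAIT            0x0020

/*
 * Hypothetical helper: write the names of the flags that are set into buf,
 * separated by '|'.  buflen must be at least 1.
 */
static void
describe_ckpt_flags(int flags, char *buf, size_t buflen)
{
    static const struct
    {
        int         bit;
        const char *name;
    } names[] = {
        {CHECKPOINT_IS_SHUTDOWN, "IS_SHUTDOWN"},
        {CHECKPOINT_END_OF_RECOVERY, "END_OF_RECOVERY"},
        {CHECKPOINT_IMMEDIATE, "IMMEDIATE"},
        {CHECKPOINT_FORCE, "FORCE"},
        {CHECKPOINT_FLUSH_ALL, "FLUSH_ALL"},
        {CHECKPOINT_WAIT, "WAIT"},
    };
    size_t i;

    buf[0] = '\0';
    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
    {
        if (flags & names[i].bit)
        {
            if (buf[0] != '\0')
                strncat(buf, "|", buflen - strlen(buf) - 1);
            strncat(buf, names[i].name, buflen - strlen(buf) - 1);
        }
    }
}
```

Under these assumed values, 44 = 0x04 + 0x08 + 0x20, which is the combination IMMEDIATE, FORCE and WAIT; that is consistent with a checkpoint requested manually through the CHECKPOINT command.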
Get the XLOG insertion state (XLogCtl->Insert).
8668 XLogCtlInsert *Insert = &XLogCtl->Insert;
(gdb) n
8680 if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
(gdb) p XLogCtl
$1 = (XLogCtlData *) 0x7fadf8f6fa80
(gdb) p *XLogCtl
$2 = {Insert = {insertpos_lck = 0 '\000', CurrBytePos = 5505269968, PrevBytePos = 5505269928,
pad = '\000' , RedoRecPtr = 5521450856, forcePageWrites = false, fullPageWrites = true,
exclusiveBackupState = EXCLUSIVE_BACKUP_NONE, nonExclusiveBackups = 0, lastBackupStart = 0,
WALInsertLocks = 0x7fadf8f74100}, LogwrtRqst = {Write = 5521451392, Flush = 5521451392}, RedoRecPtr = 5521450856,
ckptXidEpoch = 0, ckptXid = 2307, asyncXactLSN = 5521363848, replicationSlotMinLSN = 0, lastRemovedSegNo = 0,
unloggedLSN = 1, ulsn_lck = 0 '\000', lastSegSwitchTime = 1546915130, lastSegSwitchLSN = 5521363360, LogwrtResult = {
Write = 5521451392, Flush = 5521451392}, InitializedUpTo = 5538226176, pages = 0x7fadf8f76000 "\230\320\006",
xlblocks = 0x7fadf8f70088, XLogCacheBlck = 2047, ThisTimeLineID = 1, PrevTimeLineID = 1,
archiveCleanupCommand = '\000' , SharedRecoveryInProgress = false, SharedHotStandbyActive = false,
WalWriterSleeping = true, recoveryWakeupLatch = {is_set = 0, is_shared = true, owner_pid = 0}, lastCheckPointRecPtr = 0,
lastCheckPointEndPtr = 0, lastCheckPoint = {redo = 0, ThisTimeLineID = 0, PrevTimeLineID = 0, fullPageWrites = false,
nextXidEpoch = 0, nextXid = 0, nextOid = 0, nextMulti = 0, nextMultiOffset = 0, oldestXid = 0, oldestXidDB = 0,
oldestMulti = 0, oldestMultiDB = 0, time = 0, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0},
lastReplayedEndRecPtr = 0, lastReplayedTLI = 0, replayEndRecPtr = 0, replayEndTLI = 0, recoveryLastXTime = 0,
currentChunkStartTime = 0, recoveryPause = false, lastFpwDisableRecPtr = 0, info_lck = 0 '\000'}
(gdb) p *Insert
$4 = {insertpos_lck = 0 '\000', CurrBytePos = 5505269968, PrevBytePos = 5505269928, pad = '\000' ,
RedoRecPtr = 5521450856, forcePageWrites = false, fullPageWrites = true, exclusiveBackupState = EXCLUSIVE_BACKUP_NONE,
nonExclusiveBackups = 0, lastBackupStart = 0, WALInsertLocks = 0x7fadf8f74100}
(gdb)
RedoRecPtr = 5521450856 is the REDO point; it matches the value stored in the pg_control file.
[xdb@localhost ~]$ echo "obase=16;ibase=10;5521450856"|bc
1491AA768
[xdb@localhost ~]$ pg_controldata|grep REDO
Latest checkpoint's REDO location: 1/491AA768
Latest checkpoint's REDO WAL file: 000000010000000100000049
[xdb@localhost ~]$
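The bc conversion above can also be reproduced in code. The following is a minimal sketch of how a 64-bit LSN maps to the "hi/lo" form printed by pg_controldata and to a WAL segment file name, assuming the 16 MB segment size shown in this trace. format_lsn and wal_file_name are hypothetical helper names, not PostgreSQL functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Print an LSN as its high and low 32 bits in hex, separated by '/'. */
static void
format_lsn(uint64_t lsn, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
             (uint32_t) (lsn >> 32), (uint32_t) lsn);
}

/* Build a segment file name: timeline ID, then the segment number
 * split into a "log id" part and a "segment within log" part. */
static void
wal_file_name(uint32_t tli, uint64_t lsn, uint64_t seg_size,
              char *buf, size_t buflen)
{
    uint64_t segs_per_xlogid;
    uint64_t segno;

    assert(seg_size > 0);
    segs_per_xlogid = UINT64_C(0x100000000) / seg_size;
    segno = lsn / seg_size;
    snprintf(buf, buflen, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
             tli,
             (uint32_t) (segno / segs_per_xlogid),
             (uint32_t) (segno % segs_per_xlogid));
}
```

With lsn = 5521450856, timeline 1 and a 16777216-byte segment, these helpers give 1/491AA768 and 000000010000000100000049, the same values pg_controldata reports.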
Before entering the critical section, call InitXLogInsert() to initialize the XLOG insert working area.
Acquire CheckpointLock so that only one checkpoint can run at any given time.
(gdb) n
8683 shutdown = false;
(gdb)
8686 if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
(gdb)
8697 InitXLogInsert();
(gdb)
8705 LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
(gdb)
8714 MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
(gdb)
8715 CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
(gdb)
Enter the critical section and let smgr (the storage manager) prepare for the checkpoint.
8720 START_CRIT_SECTION();
(gdb)
(gdb)
8722 if (shutdown)
(gdb)
8736 smgrpreckpt();
(gdb)
8739 MemSet(&checkPoint, 0, sizeof(checkPoint));
(gdb)
Start filling in the checkpoint XLOG record.
(gdb)
8740 checkPoint.time = (pg_time_t) time(NULL);
(gdb) p checkPoint
$5 = {redo = 0, ThisTimeLineID = 0, PrevTimeLineID = 0, fullPageWrites = false, nextXidEpoch = 0, nextXid = 0, nextOid = 0,
nextMulti = 0, nextMultiOffset = 0, oldestXid = 0, oldestXidDB = 0, oldestMulti = 0, oldestMultiDB = 0, time = 0,
oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}
(gdb) n
8747 if (!shutdown && XLogStandbyInfoActive())
(gdb)
8750 checkPoint.oldestActiveXid = InvalidTransactionId;
Before acquiring the insertion locks, get the position of the last important XLOG record.
(gdb)
8756 last_important_lsn = GetLastImportantRecPtr();
(gdb)
8762 WALInsertLockAcquireExclusive();
(gdb)
(gdb) p last_important_lsn
$6 = 5521451352 --> 0x1491AA958
Concurrent insertions must be blocked while the insert state is examined to determine the checkpoint's REDO pointer.
(gdb) n
8763 curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
(gdb)
8770 if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
(gdb) p curInsert
$7 = 5521451392 --> 0x1491AA980
(gdb)
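Note that Insert->CurrBytePos (5505269968) and curInsert (5521451392) differ: the byte position counts only usable bytes, excluding page headers. The sketch below re-implements that conversion to make the arithmetic visible. It is not PostgreSQL's code; the header sizes (24-byte short header, 40-byte long header on the first page of each segment) are assumptions for this build, and byte_pos_to_rec_ptr is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants: 8 kB WAL pages, 16 MB segments,
 * 24-byte short and 40-byte long page headers. */
#define XLOG_BLCKSZ       8192u
#define WAL_SEGMENT_SIZE  16777216u
#define SHORT_PHD         24u
#define LONG_PHD          40u

#define USABLE_IN_PAGE    (XLOG_BLCKSZ - SHORT_PHD)
#define USABLE_IN_SEGMENT \
    ((uint64_t) (WAL_SEGMENT_SIZE / XLOG_BLCKSZ) * USABLE_IN_PAGE \
     - (LONG_PHD - SHORT_PHD))

/* Convert a "usable byte position" into an LSN by adding back headers. */
static uint64_t
byte_pos_to_rec_ptr(uint64_t bytepos)
{
    uint64_t fullsegs = bytepos / USABLE_IN_SEGMENT;
    uint64_t bytesleft = bytepos % USABLE_IN_SEGMENT;
    uint64_t seg_offset;

    if (bytesleft < XLOG_BLCKSZ - LONG_PHD)
    {
        /* Lands on the first page of the segment, after the long header. */
        seg_offset = bytesleft + LONG_PHD;
    }
    else
    {
        uint64_t fullpages;

        /* Skip the first page, then add one short header per later page. */
        bytesleft -= XLOG_BLCKSZ - LONG_PHD;
        fullpages = bytesleft / USABLE_IN_PAGE;
        bytesleft %= USABLE_IN_PAGE;
        seg_offset = XLOG_BLCKSZ + fullpages * XLOG_BLCKSZ
                     + bytesleft + SHORT_PHD;
    }
    return fullsegs * WAL_SEGMENT_SIZE + seg_offset;
}
```

Under these assumptions, CurrBytePos = 5505269968 maps to 5521451392, the curInsert value printed above, and PrevBytePos = 5505269928 maps to 5521451352, which in this trace equals last_important_lsn.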
Continue filling in the checkpoint XLOG record.
(gdb) n
8790 if (flags & CHECKPOINT_END_OF_RECOVERY)
(gdb)
8793 checkPoint.ThisTimeLineID = ThisTimeLineID;
(gdb)
8794 if (flags & CHECKPOINT_END_OF_RECOVERY)
(gdb)
8797 checkPoint.PrevTimeLineID = ThisTimeLineID;
(gdb) p ThisTimeLineID
$8 = 1
(gdb) n
8799 checkPoint.fullPageWrites = Insert->fullPageWrites;
(gdb)
8809 freespace = INSERT_FREESPACE(curInsert);
(gdb)
8810 if (freespace == 0)
(gdb) p freespace
$9 = 5760
(gdb) n
8817 checkPoint.redo = curInsert;
(gdb)
8830 RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
(gdb)
(gdb) p checkPoint
$10 = {redo = 5521451392, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 0,
nextOid = 0, nextMulti = 0, nextMultiOffset = 0, oldestXid = 0, oldestXidDB = 0, oldestMulti = 0, oldestMultiDB = 0,
time = 1546933255, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}
(gdb)
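The freespace value of 5760 printed above is simply the number of bytes left on the current WAL page. A minimal sketch of that calculation follows, assuming the 8192-byte WAL page size of this build (xlog_blcksz = 8192); insert_freespace is a hypothetical stand-in for the INSERT_FREESPACE macro, not its actual definition.

```c
#include <assert.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192u   /* assumed WAL page size */

/* Bytes remaining on the WAL page containing endptr;
 * 0 when endptr sits exactly on a page boundary. */
static uint32_t
insert_freespace(uint64_t endptr)
{
    uint32_t used = (uint32_t) (endptr % XLOG_BLCKSZ);

    return used == 0 ? 0 : XLOG_BLCKSZ - used;
}
```

For curInsert = 5521451392 the offset within the page is 2432, so 8192 - 2432 = 5760 bytes remain. Because the result is nonzero, the freespace == 0 branch is skipped and checkPoint.redo is set directly to curInsert.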
Update the shared RedoRecPtr for future XLogInsert calls; this must be done while all the insertion locks are held.
(gdb) n
8836 WALInsertLockRelease();
(gdb)
8839 SpinLockAcquire(&XLogCtl->info_lck);
(gdb)
8840 XLogCtl->RedoRecPtr = checkPoint.redo;
(gdb)
8841 SpinLockRelease(&XLogCtl->info_lck);
(gdb)
8847 if (log_checkpoints)
(gdb)
(gdb) p XLogCtl->RedoRecPtr
$11 = 5521451392
Collect the remaining information needed to assemble the checkpoint record.
(gdb) n
8850 TRACE_POSTGRESQL_CHECKPOINT_START(flags);
(gdb)
8860 LWLockAcquire(XidGenLock, LW_SHARED);
(gdb)
8861 checkPoint.nextXid = ShmemVariableCache->nextXid;
(gdb)
8862 checkPoint.oldestXid = ShmemVariableCache->oldestXid;
(gdb)
8863 checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
(gdb)
8864 LWLockRelease(XidGenLock);
(gdb)
8866 LWLockAcquire(CommitTsLock, LW_SHARED);
(gdb)
8867 checkPoint.oldestCommitTsXid = ShmemVariableCache->oldestCommitTsXid;
(gdb)
8868 checkPoint.newestCommitTsXid = ShmemVariableCache->newestCommitTsXid;
(gdb)
8869 LWLockRelease(CommitTsLock);
(gdb)
8872 checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
(gdb) n
8873 if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
(gdb)
8876 LWLockAcquire(OidGenLock, LW_SHARED);
(gdb)
8877 checkPoint.nextOid = ShmemVariableCache->nextOid;
(gdb) p checkPoint
$13 = {redo = 5521451392, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2308,
nextOid = 0, nextMulti = 0, nextMultiOffset = 0, oldestXid = 561, oldestXidDB = 16400, oldestMulti = 0,
oldestMultiDB = 0, time = 1546933255, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}
(gdb) n
8878 if (!shutdown)
(gdb)
8879 checkPoint.nextOid += ShmemVariableCache->oidCount;
(gdb)
8880 LWLockRelease(OidGenLock);
(gdb) p *ShmemVariableCache
$14 = {nextOid = 42575, oidCount = 8189, nextXid = 2308, oldestXid = 561, xidVacLimit = 200000561,
xidWarnLimit = 2136484208, xidStopLimit = 2146484208, xidWrapLimit = 2147484208, oldestXidDB = 16400,
oldestCommitTsXid = 0, newestCommitTsXid = 0, latestCompletedXid = 2307, oldestClogXid = 561}
(gdb) n
8882 MultiXactGetCheckptMulti(shutdown,
(gdb)
Look at the checkPoint struct again.
(gdb) p checkPoint
$15 = {redo = 5521451392, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2308,
nextOid = 50764, nextMulti = 1, nextMultiOffset = 0, oldestXid = 561, oldestXidDB = 16400, oldestMulti = 1,
oldestMultiDB = 16402, time = 1546933255, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}
(gdb)
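Two of the fields filled in above are derived rather than copied: nextXidEpoch and nextOid. The sketch below models both rules as read from the trace. The function names are hypothetical, and the epoch increment itself is not visible in the stepped lines, so treat that part as an assumption about what follows the comparison at line 8873.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* If nextXid is numerically below the previous checkpoint's nextXid,
 * the 32-bit XID counter is assumed to have wrapped: bump the epoch. */
static uint32_t
next_xid_epoch(uint32_t prev_epoch, uint32_t prev_next_xid, uint32_t next_xid)
{
    return next_xid < prev_next_xid ? prev_epoch + 1 : prev_epoch;
}

/* Outside of shutdown, the checkpoint records nextOid plus the OIDs
 * still available in the current pre-allocated batch (oidCount). */
static uint32_t
checkpoint_next_oid(uint32_t next_oid, uint32_t oid_count, bool shutdown)
{
    return shutdown ? next_oid : next_oid + oid_count;
}
```

With the ShmemVariableCache values shown above (nextOid = 42575, oidCount = 8189), the second rule gives 42575 + 8189 = 50764, which is the nextOid in the final checkPoint dump.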
End the critical section.
(gdb)
8896 END_CRIT_SECTION();
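Get the virtual transaction IDs that are delaying the checkpoint. There are none here (nvxids = 0), so the contents of *vxids are not meaningful: 2139062143 is 0x7F7F7F7F, which looks like a memory fill pattern rather than a real ID.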
(gdb) n
8927 vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
(gdb)
8928 if (nvxids > 0)
(gdb) p vxids
$16 = (VirtualTransactionId *) 0x2f4eb20
(gdb) p *vxids
$17 = {backendId = 2139062143, localTransactionId = 2139062143}
(gdb) p nvxids
$18 = 0
(gdb)
(gdb) n
8935 pfree(vxids);
(gdb)
Flush the data held in shared memory to disk and fsync it.
(gdb)
8937 CheckPointGuts(checkPoint.redo, flags);
(gdb) p flags
$19 = 44
(gdb) n
8947 if (!shutdown && XLogStandbyInfoActive())
(gdb)
Enter the critical section.
(gdb) n
8950 START_CRIT_SECTION();
(gdb)
The checkpoint record can now be inserted into the XLOG.
(gdb)
8955 XLogBeginInsert();
(gdb) n
8956 XLogRegisterData((char *) (&checkPoint), sizeof(checkPoint));
(gdb)
8957 recptr = XLogInsert(RM_XLOG_ID,
(gdb)
8961 XLogFlush(recptr);
(gdb)
8970 if (shutdown)
(gdb)
Update the control file (pg_control). First, save the previous checkpoint's REDO pointer for UpdateCheckPointDistanceEstimate().
(gdb)
8982 if (shutdown && checkPoint.redo != ProcLastRecPtr)
(gdb)
8990 PriorRedoPtr = ControlFile->checkPointCopy.redo;
(gdb)
8995 LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
(gdb) p ControlFile->checkPointCopy.redo
$20 = 5521450856
(gdb) n
8996 if (shutdown)
(gdb)
8998 ControlFile->checkPoint = ProcLastRecPtr;
(gdb)
8999 ControlFile->checkPointCopy = checkPoint;
(gdb)
9000 ControlFile->time = (pg_time_t) time(NULL);
(gdb)
9002 ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
(gdb)
9003 ControlFile->minRecoveryPointTLI = 0;
(gdb)
9010 SpinLockAcquire(&XLogCtl->ulsn_lck);
(gdb)
9011 ControlFile->unloggedLSN = XLogCtl->unloggedLSN;
(gdb)
9012 SpinLockRelease(&XLogCtl->ulsn_lck);
(gdb)
9014 UpdateControlFile();
(gdb)
9015 LWLockRelease(ControlFileLock);
(gdb)
9018 SpinLockAcquire(&XLogCtl->info_lck);
(gdb) p *ControlFile
$21 = {system_identifier = 6624362124887945794, pg_control_version = 1100, catalog_version_no = 201809051,
state = DB_IN_PRODUCTION, time = 1546934255, checkPoint = 5521451392, checkPointCopy = {redo = 5521451392,
ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2308, nextOid = 50764,
nextMulti = 1, nextMultiOffset = 0, oldestXid = 561, oldestXidDB = 16400, oldestMulti = 1, oldestMultiDB = 16402,
time = 1546933255, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}, unloggedLSN = 1,
minRecoveryPoint = 0, minRecoveryPointTLI = 0, backupStartPoint = 0, backupEndPoint = 0, backupEndRequired = false,
wal_level = 0, wal_log_hints = false, MaxConnections = 100, max_worker_processes = 8, max_prepared_xacts = 0,
max_locks_per_xact = 64, track_commit_timestamp = false, maxAlign = 8, floatFormat = 1234567, blcksz = 8192,
relseg_size = 131072, xlog_blcksz = 8192, xlog_seg_size = 16777216, nameDataLen = 64, indexMaxKeys = 32,
toast_max_chunk_size = 1996, loblksize = 2048, float4ByVal = true, float8ByVal = true, data_checksum_version = 0,
mock_authentication_nonce = "\220\277\067Vg\003\205\232U{\177 h\216\271D\266\063[\\=6\365S\tA\353\361?w\301",
crc = 930305687}
(gdb)
Update the shared-memory copy of the checkpoint XID/epoch, leave the critical section, and let smgr do its post-checkpoint cleanup (for example, deleting old files).
(gdb) n
9019 XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
(gdb)
9020 XLogCtl->ckptXid = checkPoint.nextXid;
(gdb)
9021 SpinLockRelease(&XLogCtl->info_lck);
(gdb)
9027 END_CRIT_SECTION();
(gdb)
9032 smgrpostckpt();
(gdb)
Remove old log files that are no longer needed as of the latest checkpoint, so that the disk holding the xlog does not fill up.
(gdb) n
9038 if (PriorRedoPtr != InvalidXLogRecPtr)
(gdb) p PriorRedoPtr
$23 = 5521450856
(gdb) n
9039 UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
(gdb)
9045 XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
(gdb)
9046 KeepLogSeg(recptr, &_logSegNo);
(gdb) p RedoRecPtr
$24 = 5521451392
(gdb) p _logSegNo
$25 = 329
(gdb) p wal_segment_size
$26 = 16777216
(gdb) n
9047 _logSegNo--;
(gdb)
9048 RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
(gdb)
9054 if (!shutdown)
(gdb) p recptr
$27 = 5521451504
(gdb)
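The numbers in this step can be checked by hand. The sketch below computes the WAL distance passed to UpdateCheckPointDistanceEstimate and the segment number that XLByteToSeg yields, assuming the 16 MB wal_segment_size printed above. Both function names are hypothetical helpers, and the sketch ignores any adjustment KeepLogSeg might make to _logSegNo.

```c
#include <assert.h>
#include <stdint.h>

/* WAL written between two checkpoints' REDO pointers, in bytes. */
static uint64_t
checkpoint_distance(uint64_t redo, uint64_t prior_redo)
{
    assert(redo >= prior_redo);
    return redo - prior_redo;
}

/* Number of the segment that contains lsn. */
static uint64_t
lsn_to_segno(uint64_t lsn, uint64_t seg_size)
{
    assert(seg_size > 0);
    return lsn / seg_size;
}
```

Here the distance is 5521451392 - 5521450856 = 536 bytes, and the REDO pointer falls in segment 329 (0x149), matching the _logSegNo printed by gdb. After _logSegNo-- the cutoff passed to RemoveOldXlogFiles is 328, provided KeepLogSeg left the value unchanged.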
Carry out the remaining cleanup work.
(gdb) n
9055 PreallocXlogFiles(recptr);
(gdb)
9064 if (!RecoveryInProgress())
(gdb)
9065 TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
(gdb)
9068 LogCheckpointEnd(false);
(gdb)
9070 TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
(gdb)
9076 LWLockRelease(CheckpointLock);
(gdb)
9077 }
(gdb)
The call completes and control returns to the caller.
(gdb)
CheckpointerMain () at checkpointer.c:488
488 ckpt_performed = true;
(gdb)
DONE!
checkpointer.c