PG 库停库超时异常案例
文章目录
- 现象
- 官方文档
- 停库底层流程:
- 恢复脚本优化思路
- 总结
现象
- 停库超时
<2024-11-29 12:50:43.022 UTC 87472 192.167.60.1(54862) PostgreSQL JDBC Driver postgres stk>FATAL: terminating connection due to administrator command
<2024-11-29 12:50:43.022 UTC 108724 127.0.0.1(58913) PostgreSQL JDBC Driver postgres vehiclerecognize>FATAL: terminating connection due to administrator command
...
2024-11-29 12:50:43.022 UTC 75724 127.0.0.1(55169) PostgreSQL JDBC Driver postgres test>FATAL: terminating connection due to administrator command
<2024-11-29 12:50:43.072 UTC 69504 >FATAL: terminating autovacuum process due to administrator command
<2024-11-29 12:50:43.449 UTC 88328 127.0.0.1(60033) [unknown] postgres test>FATAL: the database system is shutting down
<2024-11-29 12:50:43.494 UTC 90688 127.0.0.1(60034) [unknown] postgres test>FATAL: the database system is shutting down
<2024-11-29 12:50:43.727 UTC 109536 127.0.0.1(60035) [unknown] postgres test>FATAL: the database system is shutting down
...
<2024-11-29 12:51:04.833 UTC 92964 127.0.0.1(60291) [unknown] [unknown] [unknown]>LOG: wrong key in cancel request for process 0
<2024-11-29 12:51:04.900 UTC 106280 127.0.0.1(60292) [unknown] postgres test>FATAL: the database system is shutting down
<2024-11-29 12:51:05.156 UTC 100548 127.0.0.1(60293) [unknown] postgres test>FATAL: the database system is shutting down
<2024-11-29 12:51:05.366 UTC 84108 127.0.0.1(60294) [unknown] [unknown] [unknown]>LOG: PID 99480 in cancel request did not match any process
...
<2024-11-29 12:56:43.887 UTC 78360 127.0.0.1(64920) [unknown] [unknown] [unknown]>LOG: wrong key in cancel request for process 0
<2024-11-29 12:56:43.931 UTC 109516 127.0.0.1(64921) [unknown] postgres test>FATAL: the database system is shutting down
<2024-11-29 12:56:43.933 UTC 88528 >LOG: shutting down
<2024-11-29 12:56:44.175 UTC 108644 127.0.0.1(64922) [unknown] [unknown] [unknown]>LOG: wrong key in cancel request for process 0
<2024-11-29 12:56:44.219 UTC 100096 127.0.0.1(64923) [unknown] postgres test>FATAL: the database system is shutting down
<2024-11-29 12:56:44.269 UTC 103412 127.0.0.1(64958) [unknown] postgres test>FATAL: the database system is shutting down
<2024-11-29 12:56:44.304 UTC 99748 127.0.0.1(64959) [unknown] postgres test>FATAL: the database system is shutting down
<2024-11-29 12:56:44.467 UTC 84828 127.0.0.1(64960) [unknown] [unknown] [unknown]>LOG: wrong key in cancel request for process 0
<2024-11-29 12:56:44.501 UTC 98716 127.0.0.1(64961) [unknown] postgres uap>FATAL: the database system is shutting down
<2024-11-29 12:56:44.577 UTC 92348 127.0.0.1(64962) [unknown] postgres test>FATAL: the database system is shutting down
- 任务管理器查看 postgresql-x64-15 服务是 stopped
- pg_ctl status -D %PGDATA% 是 STOPPED
- postgres.exe 相关进程是不存在的 (说明 pub_postgresql 服务与 postgres 进程都是停止了的, 只是由于 我们数据库恢复脚本中停库超时达到 100秒, 则认为是超时了, 从而不继续运行了)
可以看到, 停库消耗了 6 分钟
初步怀疑是 停库过程中的 checkpoint 等导致的写操作, 被亚信杀毒软件影响到了, 导致写的过慢超时了
官方文档
参考官方文档 pg_ctl stop 如果不通过 -t 指定超时限制, 那么默认超时时间为 60 秒, 不管是 pg_ctl start 还是 pg_ctl stop 都是默认超时限制为 60 秒.
pg_ctl stop [-D DATADIR] [-m SHUTDOWN-MODE] [-W] [-t SECS] [-s]-t seconds
--timeout=seconds
Specifies the maximum number of seconds to wait when waiting for an operation to complete (see option `-w`). Defaults to the value of the `PGCTLTIMEOUT` environment variable or, if not set, to 60 seconds.
停库底层流程:
当您执行 net stop postgresql-x64-15
或 pg_ctl stop -m fast
时,PostgreSQL 按以下顺序停止:
- 断开客户端连接:
主进程停止监听端口(5432),不再接受新的客户端连接。在这个阶段,您会发现 5432 端口已关闭。
- 回滚未完成事务:
PostgreSQL 将回滚所有未提交的事务,以保证一致性。这个过程可能耗时,尤其是有长事务或大量未写入的数据时。 - 终止后台子进程:
所有后台子进程(如autovacuum
、background workers
、replication workers
等)需要逐一终止。这可能因为某些进程被阻塞或超时而变慢。 - 写入检查点:
PostgreSQL 会将内存中的数据写入磁盘,更新数据文件的最新状态。如果数据库很大,或者有大量未完成的 WAL 日志,需要写入的内容会更多。 - 删除
postmaster.pid
文件:
最后,当所有工作完成后,PostgreSQL 会删除postmaster.pid
文件,表明服务已完全停止。
恢复脚本优化思路
所以本测试环境,应该是处于停止的中间状态, 但是由于默认停库超时是 2h(注册服务时设置的), 而当前环境实际停库时间是 6 分钟, 恢复脚本中设置的超时时间是 100 秒, 所以提前就终止恢复流程了
因此, 我们可以从如下几方面来优化该脚本:
-
- 延长超时限制与服务注册时的限制保持一致;
-
- 在停库之前显示执行两次 checkpoint 操作.
总结
对于停库或者数据库重启操作
比较保险的做法是显示执行 checkpoint (这将有助于我们使得关闭时检查点非常轻量化,减少停机时间) ,
并且是连续执行两次 checkpoint (如果第一个 CHECKPOINT 很重,可能会花费时间,而在此期间由于持续写入,新的脏缓冲区会积累——我们通过第二个 CHECKPOINT 来减轻这一问题,使关闭检查点保持轻量且快速)
停止 PostgreSQL 包括三种模式 参考:
1. "Smart" mode disallows new connections, then waits for all existing clients to disconnect. If the server is in hot standby, recovery and streaming replication will be terminated once all clients have disconnected.
2. "Fast" mode (the default) does not wait for clients to disconnect. All active transactions are rolled back and clients are forcibly disconnected, then the server is shut down.
3. "Immediate" mode will abort all server processes immediately, without a clean shutdown. This choice will lead to a crash-recovery cycle during the next server start.
参考: https://postgres-howto.cn/#/./docs/2?id=%e6%80%bb%e7%bb%93