1500字范文 > 金仓数据库KingbaseES V8R3集群删除test库主备切换测试案例

金仓数据库KingbaseES V8R3集群删除test库主备切换测试案例

时间：2022-08-09 20:14:47

案例说明

在KingbaseES R3集群中，kingbasecluster进程会通过test库访问，连接后台数据库服务测试；如果删除test数据库，导致后台数据库服务访问失败，在集群主备切换时，无法访问后台数据库服务，导致切换失败。修改集群HAmodule.conf配置文件相关参数后，可以解决集群test库被删除导致主备切换失败问题。

测试数据库版本

prod=# select version();

version -------------------------------------------------------------------------------------------------------------------------

Kingbase V008R003C002B0270 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 0704 (Red Hat 4.1.2-46), 64-bit

(1 row)

一、查看集群访问test库配置

[kingbase@node1 etc]$ cat HAmodule.conf |grep -i test#database instance built-in database.example:KB_DATANAME=“TEST”

KB_DATANAME=“TEST”

二、查看kingbase_monitor.sh访问test库信息

=可以从kingbase_monitor.sh start的启动过程，看到对test库的访问=

[kingbase@node1 bin]$ sh -x kingbase_monitor.sh restart > ~/kmon.txt

[kingbase@node1 ~]$ cat kmon.txt |grep -i test

param=‘KB_DATANAME=“TEST”’paramValue=‘“TEST”’‘[’ -z ‘“TEST”’ ‘]’eval ‘KB_DATANAME=“TEST”’

++ KB_DATANAME=TEST

++ /home/kingbase/cluster/kha/db/bin/ksql ‘host=192.168.7.248 port=54321 user=SUPERMANAGER_V8ADMIN password=KINGBASEADMIN dbname=TEST connect_timeout=10’ -Aqtc ‘select count(*)=1 from sys_stat_replication;’

二、集群删除test库测试（主库）

查看database cluster

[kingbase@node3 bin]$ ./ksql -U system -W 123456 prod

ksql (V008R003C002B0270)

Type “help” for help.

prod=# \l

List of databases

-----------±-------±---------±------------±------------±-------------------

(7 rows)

主库判断

prod=# select sys_is_in_recovery();
sys_is_in_recovery

(1 row)

查看流复制状态

prod=# select * from sys_stat_replication;

-------±---------±--------±-----------------±--------------±----------------±------------±------------------------------±-------------±-

25795 | 10 | system | node243 | 192.168.7.243 | | 45418 | -03-01 12:49:12.263710+08 | | s

treaming | 0/E0001B0 | 0/E0001B0 | 0/E0001B0 | 0/E000178 | 0 | async

(1 row)

主库删除test库：

prod=# drop database test;

DROP DATABASE

prod=# \l

List of databases

-----------±-------±---------±------------±------------±-------------------

(6 rows)

备库查看：

[kingbase@node3 bin]$ ./ksql -U system -W 123456 prod

ksql (V008R003C002B0270)

Type “help” for help.

prod=# \l

List of databases

-----------±-------±---------±------------±------------±-------------------

(6 rows)

三、主备failover切换测试

1）关闭主库数据库服务

[kingbase@node1 bin]$ ./sys_ctl stop -D …/data

waiting for server to shut down… done

server stopped

2）查看日志

备库cluster.log：(从日志可以获知，备库访问主库后台数据库服务失败，已经发起了failover的切换）

…-03-01 12:41:16: pid 14431: LOG: health checking retry count 1-03-01 12:41:16: pid 14431: LOG: failed to connect to kingbase server on “192.168.7.248:54321”, getsockopt() detected error "Connection refused"-03-01 12:41:16: pid 14431: ERROR: failed to make persistent db connection-03-01 12:41:16: pid 14431: DETAIL: connection to host:“192.168.7.248:54321” failed-03-01 12:41:18: pid 14644: LOG: watchdog checking if kingbasecluster is alive using heartbeat-03-01 12:41:18: pid 14644: DETAIL: the last heartbeat from “192.168.7.248:9999” received 0 seconds ago

…

-03-01 12:41:22: pid 14473: LOG: received the failover command lock request from local kingbasecluster on IPC interface-03-01 12:41:22: pid 14473: LOG: local kingbasecluster node “192.168.7.243:9999 Linux node3” is requesting to become a lock holder for failover ID: 69-03-01 12:41:22: pid 14473: LOG: local kingbasecluster node “192.168.7.243:9999 Linux node3” is the lock holder-03-01 12:41:22: pid 14431: LOG: starting degeneration. shutdown host 192.168.7.248(54321)-03-01 12:41:22: pid 14431: LOG: Restart all children-03-01 12:41:22: pid 14431: LOG: execute command: /home/kingbase/cluster/kha/kingbasecluster/bin/failover_stream.sh 192.168.7.243 1 1 192.168.7.248 192.168.7.248 0 0 /home/kingbase/cluster/kha/db/data

…

3）查看备库数据库服务（切换失败）

=备库数据库服务仍然启动recovery，还是在备库状态=

[kingbase@node3 log]$ ps -ef |grep kingbase

kingbase 13764 1 0 12:31 ? 00:00:01 /home/kingbase/cluster/kha/db/bin/kingbase -D /home/kingbase/cluster/kha/db/data

kingbase 13781 13764 0 12:31 ? 00:00:00 kingbase: logger process

kingbase 13782 13764 0 12:31 ? 00:00:00 kingbase: startup process recovering 00000001000000000000000F

kingbase 13786 13764 0 12:31 ? 00:00:00 kingbase: checkpointer process

kingbase 13787 13764 0 12:31 ? 00:00:00 kingbase: writer process

kingbase 13788 13764 0 12:31 ? 00:00:00 kingbase: stats collector process

…

查看备库failover.log：

=== 有以下日志信息获知，在切换过程中需要访问test库，而test库被删除，导致访问失败，主备切换不成功===

------------------03-01 12:41:22 failover beging---------------------------------------

----failover-stats is %H = hostname of the new master node [192.168.7.243], %P = old primary node id [1], %d = node id[1], %h = host name [192.168.7.248], %O = old primary host[192.168.7.248] %m = new master node id [0], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/kha/db/data].

----ping trust ip

ping trust ip 192.168.7.1 success

----determine whether the faulty db is master or standby

master down, let 192.168.7.243 become new primary…

-03-01 12:41:24 del old primary VIP on 192.168.7.248

ssh connect host:192.168.7.248 success, will stop old primary db and del the vipstop the old primary db

DEL VIP NOW AT -03-01 12:58:53 ON enp0s3execute: [/sbin/ip addr del 192.168.7.245/24 dev enp0s3]

Oprate del ip cmd end.-03-01 12:41:24 add VIP on 192.168.7.243sys_ctl: PID file “/home/kingbase/cluster/kha/db/data/kingbase.pid” does not existIs server running?

ADD VIP NOW AT -03-01 12:41:25 ON enp0s3execute: [/sbin/ip addr add 192.168.7.245/24 dev enp0s3 label enp0s3:2]execute: /home/kingbase/cluster/kha/db/bin/arping -U 192.168.7.245 -I enp0s3 -w 1

ARPING 192.168.7.245 from 192.168.7.245 enp0s3

Sent 1 probes (1 broadcast(s))

Received 0 response(s)-03-01 12:41:26 promote begin…let 192.168.7.243 become master

check db if is alive

ksql “port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10” -c “select 33333;” ksql: FATAL: database “TEST” does not exist

kingbase is down,retry check db is if alive,retry times:[1/3]

before promote query detail[] , try again!

ksql “port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10” -c “select 33333;” ksql: FATAL: database “TEST” does not exist

kingbase is down,retry check db is if alive,retry times:[2/3]

before promote query detail[] , try again!

ksql “port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10” -c “select 33333;” ksql: FATAL: database “TEST” does not exist

kingbase is down,retry check db is if alive,retry times:[3/3]

before promote query detail[] , try again!

kingbase is down,after retry 3 times ,cannot do promote, will exit

execute kingbase_promote.sh failed ,will exit script with error"ssh -o StrictHostKeyChecking=no -l kingbase -T 192.168.7.243 “/home/kingbase/cluster/kha/db/bin/kingbase_promote.sh /home/kingbase/cluster/kha/db/bin SUPERMANAGER_V8ADMIN TEST 54321 /home/kingbase/cluster/kha/db/data 3 3 2>&1"” execute failed, error num=[66]

四、修改HAmodule.conf配置文件参数（所有节点）

=将访问test库更改为template2库=

[kingbase@node1 etc]$ pwd

/home/kingbase/cluster/kha/kingbasecluster/etc

[kingbase@node1 etc]$ cat HAmodule.conf |grep -i temp

KB_DATANAME=“TEMPLATE2”

[kingbase@node1 etc]$ cd …/…/db/etc/

[kingbase@node1 etc]$ cat HAmodule.conf |grep -i temp

KB_DATANAME=“TEMPLATE2”

五、重新启动集群
1）重启集群
[kingbase@node1 bin]$ ./kingbase_monitor.sh restart
------------------------------------------------------------------------03-01 13:18:35 KingbaseES automation beging…-03-01 13:18:35 stop kingbasecluster [192.168.7.243] …
remove status file /home/kingbase/cluster/kha/run/kingbasecluster/kingbasecluster_status
DEL VIP NOW AT -03-01 13:01:11 ON enp0s3
No VIP on my dev, nothing to do.-03-01 13:18:41 Done…-03-01 13:18:41 stop kingbasecluster [192.168.7.248] …
remove status file /home/kingbase/cluster/kha/run/kingbasecluster/kingbasecluster_status
DEL VIP NOW AT -03-01 13:18:46 ON enp0s3
No VIP on my dev, nothing to do.-03-01 13:18:47 Done…-03-01 13:18:47 stop kingbase [192.168.7.243] …set /home/kingbase/cluster/kha/db/data down now…-03-01 13:18:50 Done…-03-01 13:18:51 Del kingbase VIP [192.168.7.245/24] …
DEL VIP NOW AT -03-01 13:01:22 ON enp0s3execute: [/sbin/ip addr del 192.168.7.245/24 dev enp0s3]
Oprate del ip cmd end.-03-01 13:18:51 Done…-03-01 13:18:51 stop kingbase [192.168.7.248] …set /home/kingbase/cluster/kha/db/data down now…-03-01 13:18:56 Done…-03-01 13:18:57 Del kingbase VIP [192.168.7.245/24] …
DEL VIP NOW AT -03-01 13:18:57 ON enp0s3
No VIP on my dev, nothing to do.-03-01 13:18:57 Done…
…
all stop…
ping trust ip 192.168.7.1 success ping times :[3], success times:[2]
ping trust ip 192.168.7.1 success ping times :[3], success times:[2]
start crontab kingbase position : [3]
Redirecting to /bin/systemctl restart crond.service
start crontab kingbase position : [2]
Redirecting to /bin/systemctl restart crond.service
ADD VIP NOW AT -03-01 13:19:10 ON enp0s3execute: [/sbin/ip addr add 192.168.7.245/24 dev enp0s3 label enp0s3:2]execute: /home/kingbase/cluster/kha/db/bin/arping -U 192.168.7.245 -I enp0s3 -w 1
ARPING 192.168.7.245 from 192.168.7.245 enp0s3
Sent 1 probes (1 broadcast(s))
Received 0 response(s)
ping vip 192.168.7.245 success ping times :[3], success times:[2]
ping vip 192.168.7.245 success ping times :[3], success times:[2]
wait kingbase recovery 5 sec…
start crontab kingbasecluster line number: [6]
Redirecting to /bin/systemctl restart crond.service
start crontab kingbasecluster line number: [3]
Redirecting to /bin/systemctl restart crond.service
…
all started…
…
now we check again

| ip | program| [status]
[ 192.168.7.243]| [kingbasecluster]| [active]
[ 192.168.7.248]| [kingbasecluster]| [active]
[ 192.168.7.243]| [kingbase]| [active]
[ 192.168.7.248]| [kingbase]| [active]

2）查看集群节点状态

[kingbase@node1 bin]$ ./ksql -U SYSTEM -W 123456 prod -p 9999

ksql (V008R003C002B0270)

Type “help” for help.

prod=# show pool_nodes;

0 | 192.168.7.243 | 54321 | up | 0.500000 | standby | 0 | true | 0

1 | 192.168.7.248 | 54321 | up | 0.500000 | primary | 0 | false | 0

(2 rows)

prod=# select * from sys_stat_replication;

state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state ------±---------±--------±-----------------±--------------±----------------±------------±------------------------------±-------------±----------±--------------±---------------±---------------±----------------±--------------±-----------

5539 | 10 | system | node243 | 192.168.7.243 | | 47872 | -03-01 13:19:09.653137+08 | | st

reaming | 0/100000D0 | 0/100000D0 | 0/100000D0 | 0/100000D0 | 0 | async

(1 row)

[kingbase@node1 bin]$ ./ksql -U SYSTEM -W 123456 prod

ksql (V008R003C002B0270)

Type “help” for help.

prod=# select * from sys_stat_replication;

5539 | 10 | system | node243 | 192.168.7.243 | | 47872 | -03-01 13:19:09.653137+08 | | st

reaming | 0/100000D0 | 0/100000D0 | 0/100000D0 | 0/100000D0 | 0 | async

(1 row)

六、再次执行主备切换测试

1）停止主库数据库服务

[kingbase@node1 bin]$ ./sys_ctl stop -D …/data

waiting for server to shut down… done

server stopped

2、查看备库数据库服务进程（切换成功）

[kingbase@node3 bin]$ ps -ef|grep kingbase

kingbase 22541 1 0 13:01 ? 00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /home/kingbase/cluster/kha/db/datakingbase 22552 22541 0 13:01 ? 00:00:00 kingbase: logger process kingbase 22563 22541 0 13:01 ? 00:00:00 kingbase: checkpointer process kingbase 22564 22541 0 13:01 ? 00:00:00 kingbase: writer process kingbase 22565 22541 0 13:01 ? 00:00:00 kingbase: stats collector process root 23208 1 0 13:01 ? 00:00:00 ./kingbasecluster -nroot 23255 23208 0 13:01 ? 00:00:00 kingbasecluster: watchdogroot 23421 23208 0 13:02 ? 00:00:00 kingbasecluster: lifecheckroot 23423 23421 0 13:02 ? 00:00:00 kingbasecluster: heartbeat receiverroot 23424 23421 0 13:02 ? 00:00:00 kingbasecluster: heartbeat senderkingbase 24571 22541 0 13:05 ? 00:00:00 kingbase: wal writer process kingbase 24572 22541 0 13:05 ? 00:00:00 kingbase: autovacuum launcher process kingbase 24573 22541 0 13:05 ? 00:00:00 kingbase: archiver process last was 00000002.historykingbase 24574 22541 0 13:05 ? 00:00:00 kingbase: bgworker: syslogical supervisor

查看failover.log日志：

[kingbase@node3 log]$ tail -100 failover.log

------------------03-01 13:05:12 failover beging-------------------------------------------failover-stats is %H = hostname of the new master node [192.168.7.243], %P = old primary node id [1], %d = node id[1], %h = host name [192.168.7.248], %O = old primary host[192.168.7.248] %m = new master node id [0], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/kha/db/data].----ping trust ip

ping trust ip 192.168.7.1 success ----determine whether the faulty db is master or standby

master down, let 192.168.7.243 become new primary…

-03-01 13:05:14 del old primary VIP on 192.168.7.248

ssh connect host:192.168.7.248 success, will stop old primary db and del the vip

stop the old primary db

sys_ctl: PID file “/home/kingbase/cluster/kha/db/data/kingbase.pid” does not existIs server running?

DEL VIP NOW AT -03-01 13:22:43 ON enp0s3execute: [/sbin/ip addr del 192.168.7.245/24 dev enp0s3]

Oprate del ip cmd end.-03-01 13:05:14 add VIP on 192.168.7.243ADD VIP NOW AT -03-01 13:05:15 ON enp0s3execute: [/sbin/ip addr add 192.168.7.245/24 dev enp0s3 label enp0s3:2]execute: /home/kingbase/cluster/kha/db/bin/arping -U 192.168.7.245 -I enp0s3 -w 1

ARPING 192.168.7.245 from 192.168.7.245 enp0s3

Sent 1 probes (1 broadcast(s))

Received 0 response(s)-03-01 13:05:16 promote begin…let 192.168.7.243 become mastercheck db if is alive

ksql “port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEMPLATE2 connect_timeout=10” -c “select 33333;” -03-01 13:05:17 kingbase is ok , to prepare execute promoteexecute promote

server promotingcheck db if is alive after promote

ksql “port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEMPLATE2 connect_timeout=10” -c “select 33333;” -03-01 13:05:17 after execute promote , kingbase status is ok.

after execute promote, kingbase is ok.-03-01 13:05:17 sync to asyncALTER SYSTEM

sys_reload_conf -----------------

(1 row)

-03-01 13:05:17 make checkpointcheck the db to see if it is alive

ksql “port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEMPLATE2 connect_timeout=10” -c "select 33333;"-03-01 13:05:18 kingbase is ok , to prepare execute checkpointexecute checkpoint

CHECKPOINTcheck the db to see if it is alive after execute checkpoint

ksql “port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEMPLATE2 connect_timeout=10” -c “select 33333;” -03-01 13:05:18 after execute checkpoint, kingbase is ok.

after execute checkpoint, kingbase is ok.

七、将原主库恢复为备库

1）在新主库上创建replication slots。

2）在原主库下创建recovery.conf文件后，sys_ctl手工启动数据库服务。

3）检查主备流复制状态。

4）重新启动集群测试。

八、总结

对于KingbaseES R3集群的test库，多用于kingbasecluster和后台数据库服务的健康检查访问，请不要轻易删除。

九、附件

案例：在没有在原主库data目录下创建recovery.conf文件，启动主库数据库服务。后创建了recovery.conf,再启动原主库以备库加入流复制失败，因为timeline与新主库不一致。采用sys_rewind工具重新将原主库加入集群。

1）查看原主库sys_log日志

=如下所示，备库数据库服务启动失败，因为timeline与新主库不一致=

[kingbase@node1 sys_log]$ tail -100f kingbase--03-01_132943.logLOG: database system was shut down in recovery at -03-01 13:29:42 CSTLOG: entering standby modeFATAL: requested timeline 2 is not a child of this server’s historyDETAIL: Latest checkpoint is at 0/12000028 on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 0/11000098.LOG: startup process (PID 10918) exited with exit code 1LOG: aborting startup due to startup process failureLOG: database system is shut down

2）在原主库执行sys_rewind加入集群

[kingbase@node1 bin]$ ./sys_rewind -D /home/kingbase/cluster/kha/db/data --source-server=‘host=192.168.7.243 port=54321 user=system dbname=PROD’ -P -n

connected to server

datadir_source = /home/kingbase/cluster/kha/db/data

rewinding from last common checkpoint at 0/10000028 on timeline 1

find last common checkpoint start time from -03-01 13:36:24.675116 CST to -03-01 13:36:24.702727 CST, in “0.027611” seconds.

reading source file list

reading target file list

reading WAL in target

need to copy 298 MB (total source directory size is 363 MB)

Rewind datadir file from sourceGet archive xlog list from source

Rewind archive log from source

59462/305222 kB (19%) copied

creating backup label and updating control file

syncing target data directory

rewind start wal location 0/10000028 (file 000000010000000000000010), end wal location 0/11070290 (file 000000020000000000000011). time from -03-01 13:36:25.675116 CST to -03-01 13:36:25.379386 CST, in “0.704270” seconds.

Done!

3）启动备库数据库服务

[kingbase@node1 bin]$ ./sys_ctl start -D …/data

server starting

[kingbase@node1 bin]$ LOG: redirecting log output to logging collector process

HINT: Future log output will appear in directory “/data/kingbase/cluster/r3/data/sys_log”.

[kingbase@node1 bin]$ ps -ef|grep kingbase

…

kingbase 11983 13899 0 13:31 pts/1 00:00:00 tail -100f kingbase--03-01_132943.log

kingbase 13611 1 0 13:36 pts/0 00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D …/data

kingbase 13614 13611 0 13:36 ? 00:00:00 kingbase: logger process

kingbase 13615 13611 0 13:36 ? 00:00:00 kingbase: startup process recovering 000000020000000000000011