Red Hat OpenStack ships with Pacemaker built in to manage the status of a handful of Docker containers, and this also affects how MariaDB runs on OpenStack. When MariaDB fails on Red Hat OpenStack, you will usually see something like this:

[[email protected] etc]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller1 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition with quorum
Last updated: Thu May  7 21:55:43 2020
Last change: Thu May  7 21:51:15 2020 by hacluster via crmd on controller2

12 nodes configured
36 resources configured

Online: [ controller1 controller2 controller3 ]
GuestOnline: [ [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] ]

Full list of resources:

 Docker container set: rabbitmq-bundle [10.240.x.x:8787/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started controller1
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller3
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started controller2
 Docker container set: galera-bundle [10.240.x.x:8787/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        FAILED Master controller1 (blocked)
   galera-bundle-1      (ocf::heartbeat:galera):        FAILED Master controller3 (blocked)
   galera-bundle-2      (ocf::heartbeat:galera):        Master controller2
 Docker container set: redis-bundle [10.240.x.x:8787/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master controller1
   redis-bundle-1       (ocf::heartbeat:redis): Slave controller3
   redis-bundle-2       (ocf::heartbeat:redis): Slave controller2
 ip-10.240.x.109      (ocf::heartbeat:IPaddr2):       Started controller1
 ip-10.240.x.102      (ocf::heartbeat:IPaddr2):       Started controller1
 ip-10.240.x.142      (ocf::heartbeat:IPaddr2):       Started controller1
 ip-10.240.x.141      (ocf::heartbeat:IPaddr2):       Started controller1
 ip-192.168.x.20      (ocf::heartbeat:IPaddr2):       Started controller1
 Docker container set: haproxy-bundle [10.240.x.101:8787/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started controller3
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Started controller1
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Started controller2
 Docker container: openstack-cinder-volume [10.240.x.101:8787/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Started controller1

Failed Actions:
* galera_promote_0 on galera-bundle-1 'unknown error' (1): call=17101, status=complete, exitreason='MySQL server failed to start (pid=51848) (rc=0), please check your installation',
    last-rc-change='Thu May  7 21:54:51 2020', queued=0ms, exec=24120ms
* galera_promote_0 on galera-bundle-0 'unknown error' (1): call=102476, status=complete, exitreason='MySQL server failed to start (pid=308944) (rc=0), please check your installation',
    last-rc-change='Thu May  7 21:54:22 2020', queued=0ms, exec=26672ms

The output shows two failed Galera containers, and sometimes the containers may not even exist on the host at all. In this case, issuing pcs resource cleanup will refresh and reload all broken containers across the entire OpenStack controller cluster.

[[email protected] etc]# pcs resource cleanup
Cleaned up all resources on all nodes
Waiting for 2 replies from the CRMd.. OK
[[email protected] etc]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller1 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition with quorum
Last updated: Thu May  7 21:56:55 2020
Last change: Thu May  7 21:56:53 2020 by hacluster via crmd on controller3

12 nodes configured
36 resources configured

Online: [ controller1 controller2 controller3 ]
GuestOnline: [ [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] ]

Full list of resources:

 Docker container set: rabbitmq-bundle [10.240.x.101:8787/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started controller1
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller3
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started controller2
 Docker container set: galera-bundle [10.240.x.101:8787/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Stopped controller1
   galera-bundle-1      (ocf::heartbeat:galera):        Stopped controller3
   galera-bundle-2      (ocf::heartbeat:galera):        Master controller2
 Docker container set: redis-bundle [10.240.x.101:8787/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Master controller1
   redis-bundle-1       (ocf::heartbeat:redis): Slave controller3
   redis-bundle-2       (ocf::heartbeat:redis): Slave controller2
 ip-10.240.172.109      (ocf::heartbeat:IPaddr2):       Started controller1
 ip-10.240.173.102      (ocf::heartbeat:IPaddr2):       Started controller1
 ip-10.240.173.142      (ocf::heartbeat:IPaddr2):       Started controller1
 ip-10.240.173.141      (ocf::heartbeat:IPaddr2):       Started controller1
 ip-192.168.151.20      (ocf::heartbeat:IPaddr2):       Started controller1
 Docker container set: haproxy-bundle [10.240.x.101:8787/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started controller3
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Started controller1
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Started controller2
 Docker container: openstack-cinder-volume [10.240.x.101:8787/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Started controller1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
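
The cleanup above acts on every resource. pcs resource cleanup also accepts a resource name, so if you only want to clear the failed Galera resources rather than everything, something like this should work (assuming the bundle is named galera-bundle, as in the status output):

pcs resource cleanup galera-bundle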

To fix the out-of-sync MariaDB cluster, we need to find out which node holds the latest data and recover it first. To find it, simply run this command on each controller:

[[email protected] heat-admin]#   cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    72067798-6786-11e9-80e8-a7605084ce9e
seqno:   140381117
cert_index:

[[email protected] etc]# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    72067798-6786-11e9-80e8-a7605084ce9e
seqno:   -1
cert_index:

Here seqno is the key value we care about: in a failed cluster, -1 means the node was not shut down cleanly and cannot be trusted, while the host holding the highest positive value has the most recent data and is the node we should recover first. So in this example we should fix controller2.
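
To compare all three nodes without hopping between shells, a small loop works too; this sketch assumes the standard TripleO heat-admin user and the controller hostnames used above:

for h in controller1 controller2 controller3; do
    echo "== $h =="
    ssh heat-admin@$h sudo cat /var/lib/mysql/grastate.dat
done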

Now let's log in to controller2 and set wsrep_cluster_address="gcomm://" in its MySQL config file (an empty gcomm:// address tells Galera to bootstrap a new cluster from this node), then restart the container. Once it is back online, we can restart the MariaDB containers on each of the other nodes so they rejoin. After making sure all containers come up successfully, we change controller2's config file back to its original value.
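
As a rough sketch of that sequence (not an exact recipe), assuming the RHOSP 13 layout where the Galera settings live in /var/lib/config-data/puppet-generated/mysql/etc/my.cnf.d/galera.cnf and the containers follow the galera-bundle-docker-N naming from the pcs status output above (verify both with docker ps, since paths and names vary by deployment):

# on controller2 (highest seqno): back up the config, then point Galera at an
# empty gcomm:// so it bootstraps a new cluster from this node
cp /var/lib/config-data/puppet-generated/mysql/etc/my.cnf.d/galera.cnf /root/galera.cnf.bak
sed -i 's|^wsrep_cluster_address.*|wsrep_cluster_address = "gcomm://"|' \
    /var/lib/config-data/puppet-generated/mysql/etc/my.cnf.d/galera.cnf
docker restart galera-bundle-docker-2    # galera-bundle-2 runs on controller2 above

# on controller1 and controller3, once controller2 shows as Master in pcs status
docker restart galera-bundle-docker-0    # controller1
docker restart galera-bundle-docker-1    # controller3

# back on controller2: restore the original config so the node stops bootstrapping
cp /root/galera.cnf.bak /var/lib/config-data/puppet-generated/mysql/etc/my.cnf.d/galera.cnf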

The final status should look like this:

MariaDB [(none)]>  show status like 'wsrep_incoming_addresses';
+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name            | Value                                                                                                                                        |
+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| wsrep_incoming_addresses | controller3.internalapi.devstack.tdlab.ca:3306,controller1.internalapi.devstack.tdlab.ca:3306,controller2.internalapi.devstack.tdlab.ca:3306 |
+--------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

MariaDB [(none)]> show status like 'wsrep_last_committed';
+----------------------+-----------+
| Variable_name        | Value     |
+----------------------+-----------+
| wsrep_last_committed | 140385350 |
+----------------------+-----------+
1 row in set (0.01 sec)

MariaDB [(none)]> show status like 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.00 sec)

and all hosts now share the same grastate.dat content (seqno showing -1 here is expected: it is only written out on a clean shutdown, so it stays at -1 while mysqld is running):

[[email protected] keystone]# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    72067798-6786-11e9-80e8-a7605084ce9e
seqno:   -1
cert_index:
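
For reference, the MariaDB prompt used in the checks above can be reached from any controller by exec-ing into the local Galera container, assuming the container names shown earlier and that root can log in over the local socket (typical on these deployments):

docker exec -it galera-bundle-docker-2 mysql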