Description
We hit two related bugs with async read replicas managed by the Kubernetes operator:
- An existing read replica that is already present in InnoDB Cluster metadata can be classified as JOINABLE instead of REJOINABLE, so the operator calls Cluster.addReplicaInstance() and gets:
  MYSQLSH 51305: Target instance already part of this InnoDB Cluster
- After the replica is manually recovered with Cluster.rejoinInstance(), the operator still writes pod annotation mysql.oracle.com/membership-info as status=OFFLINE and flips readiness gate mysql.oracle.com/ready=False, even though Cluster.status() reports the read replica ONLINE.
This leaves the read replica pod Running but not Ready, so Services without publishNotReadyAddresses lose their endpoints.
Operator / server versions
- Helm chart: mysql-operator 2.2.8
- Operator image: container-registry.oracle.com/mysql/community-operator:9.7.0-2.2.8
- MySQL server image: container-registry.oracle.com/mysql/community-server:9.6.0
- InnoDBCluster API: mysql.oracle.com/v2
Cluster shape
- 1 group-member primary
- 1 read replica
- router.instances = 0
What we observed
- Cluster.status({extended:1}) from the primary shows the read replica under defaultReplicaSet.topology.<primary>.readReplicas with:
  - before manual recovery: status: OFFLINE, instanceErrors: ["WARNING: Read Replica's replication channel is stopped. Use Cluster.rejoinInstance() to restore it."]
  - after manual recovery: status: ONLINE
- While the replica was OFFLINE, the operator repeatedly logged:
  Setting up '...-rr-0...:3306' as a Read Replica of Cluster '...'
  ERROR: The instance '...-rr-0...:3306' is already part of this Cluster. A new Read-Replica must be created on a standalone instance.
  MYSQLSH 51305: Target instance already part of this InnoDB Cluster
- After manual Cluster.rejoinInstance(), replication recovered (Replica_IO_Running=Yes, Replica_SQL_Running=Yes, Seconds_Behind_Source=0) and Cluster.status() reported the read replica ONLINE, but the operator still kept the pod readiness gate false until we manually patched pod status.
Suspected root cause
There seem to be two read-replica-specific assumptions in the controller:
- diagnose_cluster_candidate() checks membership with cluster.status()["defaultReplicaSet"]["topology"].keys(). This only covers GR members. Read replicas live under topology[*]["readReplicas"], so an existing OFFLINE read replica can be treated as not-a-member and become JOINABLE instead of REJOINABLE.
- probe_member_status() uses shellutils.query_membership_info(), which only queries performance_schema.replication_group_members. Async read replicas are not GR members, so this returns no row and falls back to status="OFFLINE". That value is then written into:
  - pod annotation mysql.oracle.com/membership-info
  - readiness gate mysql.oracle.com/ready=False
In trunk, these paths still appear unchanged in:
- mysqloperator/controller/diagnose.py
- mysqloperator/controller/innodbcluster/cluster_controller.py
- mysqloperator/controller/shellutils.py
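To illustrate the first point, here is a minimal sketch of a membership check that also looks under readReplicas. It is not the operator's code: the function names, the sample hostnames, and the simplified two-way JOINABLE/REJOINABLE decision are all hypothetical, and it assumes the topology and readReplicas keys are the instance endpoints, as in the Cluster.status() output described above.

```python
from typing import Any, Dict, Set

# Hypothetical, trimmed-down Cluster.status() result; hostnames are made up.
SAMPLE_STATUS: Dict[str, Any] = {
    "defaultReplicaSet": {
        "topology": {
            "mycluster-0.example:3306": {
                "status": "ONLINE",
                "readReplicas": {
                    "mycluster-rr-0.example:3306": {"status": "OFFLINE"},
                },
            },
        },
    },
}


def known_instance_keys(status: Dict[str, Any]) -> Set[str]:
    """Collect keys of GR members and of the read replicas nested under them."""
    topology = status.get("defaultReplicaSet", {}).get("topology", {})
    keys: Set[str] = set()
    for member_key, member in topology.items():
        keys.add(member_key)
        # readReplicas may be absent for members without replicas.
        keys.update((member.get("readReplicas") or {}).keys())
    return keys


def classify_candidate(status: Dict[str, Any], endpoint: str) -> str:
    """Simplified decision: anything already in the topology is REJOINABLE."""
    return "REJOINABLE" if endpoint in known_instance_keys(status) else "JOINABLE"
```

With the sample above, checking only topology.keys() would miss mycluster-rr-0.example:3306, while known_instance_keys() includes it, so the OFFLINE read replica would be routed to the rejoin path rather than addReplicaInstance().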
Minimal reproduction
- Deploy an InnoDBCluster with:
  - instances: 1
  - one readReplicas entry with instances: 1
  - router.instances: 0
- Stop the async read replica channel on the read replica:
  STOP REPLICA FOR CHANNEL 'read_replica_replication';
- Trigger operator reconciliation for the read replica pod (for example, recreate the pod or let the pod create handler run).
- Observe the operator tries addReplicaInstance() and gets MYSQLSH 51305.
- Manually run Cluster.rejoinInstance('<rr-endpoint>').
- Observe Cluster.status() shows the replica ONLINE, but the pod annotation/readiness stays OFFLINE / NotReady.
Expected behavior
- Existing read replicas that are already present in cluster metadata should be classified as REJOINABLE, not JOINABLE.
- After a successful rejoin, the operator should update read replica membership/readiness using a read-replica-aware status source, and the pod should become Ready.
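One possible read-replica-aware status source is the readReplicas entry in the Cluster.status() result itself. The sketch below is only an illustration of that idea: the function names and sample hostnames are hypothetical, the dict shape is assumed from the output described above, and treating ONLINE as "ready" is an assumption rather than the operator's actual readiness rule.

```python
from typing import Any, Dict, Optional


def read_replica_status(status: Dict[str, Any], endpoint: str) -> Optional[str]:
    """Return the reported status of a read replica, or None if it is not listed."""
    topology = status.get("defaultReplicaSet", {}).get("topology", {})
    for member in topology.values():
        replica = (member.get("readReplicas") or {}).get(endpoint)
        if replica is not None:
            return replica.get("status")
    return None


def read_replica_is_ready(status: Dict[str, Any], endpoint: str) -> bool:
    """Assumed rule for illustration: only an ONLINE read replica counts as ready."""
    return read_replica_status(status, endpoint) == "ONLINE"


# Hypothetical Cluster.status() result after a successful rejoinInstance().
AFTER_REJOIN: Dict[str, Any] = {
    "defaultReplicaSet": {
        "topology": {
            "mycluster-0.example:3306": {
                "status": "ONLINE",
                "readReplicas": {
                    "mycluster-rr-0.example:3306": {"status": "ONLINE"},
                },
            },
        },
    },
}
```

Unlike a lookup in performance_schema.replication_group_members, this does not fall back to OFFLINE merely because the instance is not a GR member; an unlisted endpoint yields None, which the caller can handle separately.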