Read replica reconcile misclassifies existing replicas and keeps rejoined replicas NotReady #47

@hhhizzz

Description

We hit two related bugs with async read replicas managed by the Kubernetes operator:

  1. An existing read replica that is already present in InnoDB Cluster metadata can be classified as JOINABLE instead of REJOINABLE, so the operator calls Cluster.addReplicaInstance() and gets:
    MYSQLSH 51305: Target instance already part of this InnoDB Cluster

  2. After the replica is manually recovered with Cluster.rejoinInstance(), the operator still writes pod annotation mysql.oracle.com/membership-info as status=OFFLINE and flips readiness gate mysql.oracle.com/ready=False, even though Cluster.status() reports the read replica ONLINE.

This leaves the read replica pod Running but not Ready, so Services without publishNotReadyAddresses lose their endpoints.

Operator / server versions

  • Helm chart: mysql-operator 2.2.8
  • Operator image: container-registry.oracle.com/mysql/community-operator:9.7.0-2.2.8
  • MySQL server image: container-registry.oracle.com/mysql/community-server:9.6.0
  • InnoDBCluster API: mysql.oracle.com/v2

Cluster shape

  • 1 group-member primary
  • 1 read replica
  • router.instances = 0

What we observed

  • Cluster.status({extended:1}) from the primary shows the read replica under defaultReplicaSet.topology.<primary>.readReplicas with:
    • before manual recovery: status: OFFLINE, instanceErrors: ["WARNING: Read Replica's replication channel is stopped. Use Cluster.rejoinInstance() to restore it."]
    • after manual recovery: status: ONLINE
  • While the replica was OFFLINE, the operator repeatedly logged:
    Setting up '...-rr-0...:3306' as a Read Replica of Cluster '...'
    ERROR: The instance '...-rr-0...:3306' is already part of this Cluster. A new Read-Replica must be created on a standalone instance.
    MYSQLSH 51305: Target instance already part of this InnoDB Cluster
  • After manual Cluster.rejoinInstance(), replication recovered (Replica_IO_Running=Yes, Replica_SQL_Running=Yes, Seconds_Behind_Source=0) and Cluster.status() reported the read replica ONLINE, but the operator still kept the pod readiness gate false until we manually patched pod status.

Suspected root cause

Two controller code paths appear to assume that every cluster instance is a Group Replication (GR) member, which does not hold for async read replicas:

  1. diagnose_cluster_candidate() checks membership with cluster.status()["defaultReplicaSet"]["topology"].keys().
    This only covers GR members. Read replicas live under topology[*]["readReplicas"], so an existing OFFLINE read replica can be treated as not-a-member and become JOINABLE instead of REJOINABLE.

  2. probe_member_status() uses shellutils.query_membership_info(), which only queries performance_schema.replication_group_members.
    Async read replicas are not GR members, so this returns no row and falls back to status="OFFLINE". That value is then written into:

    • pod annotation mysql.oracle.com/membership-info
    • readiness gate mysql.oracle.com/ready=False

In trunk, these paths still appear unchanged in:

  • mysqloperator/controller/diagnose.py
  • mysqloperator/controller/innodbcluster/cluster_controller.py
  • mysqloperator/controller/shellutils.py
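
As a rough illustration of the first point, the membership check could also look inside each member's readReplicas entry. This is only a sketch, not the operator's code: known_cluster_addresses() and classify_candidate() are hypothetical helpers, the real diagnose_cluster_candidate() distinguishes more states than the two shown, and the sketch assumes readReplicas is a dict keyed by the same endpoint label the operator uses for the pod.

```python
from typing import Any, Dict, Set


def known_cluster_addresses(status: Dict[str, Any]) -> Set[str]:
    """Collect GR member keys plus the read replica keys nested under them.

    `status` is assumed to be shaped like the dict returned by Cluster.status().
    """
    topology = status.get("defaultReplicaSet", {}).get("topology", {})
    addresses = set(topology.keys())  # GR members only
    for member in topology.values():
        # Assumption: readReplicas is a dict keyed by endpoint label.
        addresses.update(member.get("readReplicas", {}).keys())
    return addresses


def classify_candidate(status: Dict[str, Any], address: str) -> str:
    """Simplified two-way classification (hypothetical helper)."""
    if address in known_cluster_addresses(status):
        return "REJOINABLE"
    return "JOINABLE"


# Sample data with placeholder endpoints, mirroring the cluster shape above.
sample = {
    "defaultReplicaSet": {
        "topology": {
            "primary-0:3306": {
                "status": "ONLINE",
                "readReplicas": {"rr-0:3306": {"status": "OFFLINE"}},
            }
        }
    }
}
print(classify_candidate(sample, "rr-0:3306"))
```

With a keys()-only check, rr-0:3306 in this sample would not be found and would fall through to JOINABLE, which matches the addReplicaInstance() failure reported above.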

Minimal reproduction

  1. Deploy an InnoDBCluster with:
    • instances: 1
    • one readReplicas entry with instances: 1
    • router.instances: 0
  2. Stop the async read replica channel on the read replica:
    STOP REPLICA FOR CHANNEL 'read_replica_replication';
  3. Trigger operator reconciliation for the read replica pod (for example, recreate the pod or let the pod create handler run).
  4. Observe that the operator calls addReplicaInstance() and fails with MYSQLSH 51305.
  5. Manually run Cluster.rejoinInstance('<rr-endpoint>').
  6. Observe that Cluster.status() shows the replica ONLINE, but the pod annotation still says status=OFFLINE and the pod stays NotReady.

Expected behavior

  • Existing read replicas that are already present in cluster metadata should be classified as REJOINABLE, not JOINABLE.
  • After a successful rejoin, the operator should update read replica membership/readiness using a read-replica-aware status source, and the pod should become Ready.
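
One possible read-replica-aware status source is the Cluster.status() output itself, rather than performance_schema.replication_group_members. The sketch below is an assumption about how that lookup might work, not the operator's implementation: read_replica_status() and read_replica_is_ready() are hypothetical names, and readReplicas is again assumed to be a dict keyed by endpoint label.

```python
from typing import Any, Dict, Optional


def read_replica_status(status: Dict[str, Any], address: str) -> Optional[str]:
    """Return the read replica's status from a Cluster.status()-shaped dict.

    Returns None when the address is not listed under any member's
    readReplicas, so callers can tell "unknown" apart from "OFFLINE".
    """
    topology = status.get("defaultReplicaSet", {}).get("topology", {})
    for member in topology.values():
        replica = member.get("readReplicas", {}).get(address)
        if replica is not None:
            return replica.get("status")
    return None


def read_replica_is_ready(status: Dict[str, Any], address: str) -> bool:
    """Hypothetical readiness rule: Ready only when reported ONLINE."""
    return read_replica_status(status, address) == "ONLINE"
```

Returning None for an unlisted address, instead of defaulting to "OFFLINE", keeps a missing row from being written into the annotation as if it were a real state.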
