Author: Yiwen Chen (Committer of TiDB Operator, Software Engineer at PingCAP)
Transcreator: Ran Huang; Editor: Tom Dewan
In my last article, I introduced how we design and implement tidb-controller-manager, described the controllers' internal logic, and explained how each controller receives and handles changes. This time, I'll describe how we implement the component controllers.
The TidbCluster
controller manages the lifecycles of major components in TiDB. I'll take TidbCluster
as an example to introduce the design of the component control loop. You'll learn how the control loop events are orchestrated during TiDB cluster lifecycle management and how these events manage resources.
This article is only a general introduction to the processes and definitions. The specific application of each component will be covered in the next article. Now, let's get started.
In the section on the controller's internal logic, I mentioned the updateTidbCluster
function of the TidbCluster
controller, which is located in pkg/controller/tidbcluster/tidb_cluster_control.go
. As the entry point of TiDB component lifecycle management, this function calls a sequence of lifecycle management functions:
c.reclaimPolicyManager.Sync(tc)
c.orphanPodsCleaner.Clean(tc)
c.discoveryManager.Reconcile(tc)
c.ticdcMemberManager.Sync(tc)
c.tiflashMemberManager.Sync(tc)
c.pdMemberManager.Sync(tc)
c.tikvMemberManager.Sync(tc)
c.pumpMemberManager.Sync(tc)
c.tidbMemberManager.Sync(tc)
c.metaManager.Sync(tc)
c.pvcCleaner.Clean(tc)
c.pvcResizer.Resize(tc)
c.tidbClusterStatusManager.Sync(tc)
These functions are divided into two categories: the Sync functions of TiDB's major components, such as PD, TiKV, TiDB, TiFlash, TiCDC, and Pump; and the lifecycle management operations on other resources related to the TidbCluster object, such as the PV reclaim policy, orphan Pods, the Discovery service, PVCs, and cluster status.
The control loop code for TiDB's major components is located in the directory pkg/manager/member
and in files that end with _member_manager.go
, such as pd_member_manager.go
. These files reference other files that implement scaling and upgrade features, such as _scaler.go
and _upgrader.go
.
From the _member_manager.go
files of components, we can identify the generic implementation:
// Sync Service
if err := m.syncServiceForTidbCluster(tc); err != nil {
    return err
}

// Sync Headless Service
if err := m.syncHeadlessServiceForTidbCluster(tc); err != nil {
    return err
}

// Sync StatefulSet
return syncStatefulSetForTidbCluster(tc)
func syncStatefulSetForTidbCluster(tc *v1alpha1.TidbCluster) error {
    if err := m.syncTidbClusterStatus(tc, oldSet); err != nil {
        klog.Errorf("failed to sync TidbCluster: [%s/%s]'s status, error: %v", ns, tcName, err)
    }

    if tc.Spec.Paused {
        klog.V(4).Infof("tidb cluster %s/%s is paused, skip syncing for statefulset", tc.GetNamespace(), tc.GetName())
        return nil
    }

    cm, err := m.syncConfigMap(tc, oldSet)
    if err != nil {
        return err
    }

    newSet, err := getNewSetForTidbCluster(tc, cm)
    if err != nil {
        return err
    }

    if err := m.scaler.Scale(tc, oldSet, newSet); err != nil {
        return err
    }

    if err := m.failover.Failover(tc); err != nil {
        return err
    }

    if err := m.upgrader.Upgrade(tc, oldSet, newSet); err != nil {
        return err
    }

    return UpdateStatefulSet(m.deps.StatefulSetControl, tc, newSet, oldSet)
}
The above code performs two major tasks: it syncs the component status to the TidbCluster object and checks whether the Paused field of TidbCluster stops the synchronization; it then generates a new StatefulSet according to the TidbCluster spec and performs the related operations on the new StatefulSet, such as rolling update, scaling out, scaling in, and failover. The component control loop repeatedly performs these tasks to make sure the component stays up to date.
The following sections introduce the specific jobs completed in the control loop.
When component reconciliation starts, service reconciliation also kicks off. This process creates and syncs the Services used by the components, such as cluster1-pd
and cluster1-pd-peer
.
The control loop function calls the getNewServiceForTidbCluster
function, which creates a new Service template according to the information recorded in the TidbCluster
custom resource (CR). If the Service doesn't exist, the control loop function creates a Service; if the Service exists, it compares the old Service Spec with the new one and determines whether to update the Service object.
Both regular Services and Headless Services are created so that the components can be accessed by other components and by clients.
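To make the pattern concrete, here is a minimal sketch of the create-or-update flow described above, assuming a simplified serviceClient abstraction in place of the listers and the ServiceControl wrapper that TiDB Operator actually uses; the real implementation also preserves fields such as ClusterIP and merges annotations:

package member

import (
    "reflect"

    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// serviceClient is a simplified, hypothetical abstraction over the Kubernetes API.
type serviceClient interface {
    Get(namespace, name string) (*corev1.Service, error)
    Create(svc *corev1.Service) error
    Update(svc *corev1.Service) error
}

// syncService creates the desired Service if it is missing, or updates the
// existing one only when its spec differs from the newly generated template.
func syncService(cli serviceClient, desired *corev1.Service) error {
    existing, err := cli.Get(desired.Namespace, desired.Name)
    if apierrors.IsNotFound(err) {
        // The Service doesn't exist yet: create it from the template
        // generated from the TidbCluster CR.
        return cli.Create(desired)
    }
    if err != nil {
        return err
    }
    // The Service exists: compare the old spec with the new one and
    // update only when they differ.
    if !reflect.DeepEqual(existing.Spec, desired.Spec) {
        updated := existing.DeepCopy()
        updated.Spec = desired.Spec
        return cli.Update(updated)
    }
    return nil
}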
After the Service is synced, the components are connected to the cluster network, so they can access the cluster and be accessed from within the cluster. The control loop enters the syncStatefulSetForTidbCluster
function and starts reconciling the StatefulSet.
The first step in reconciling the StatefulSet is to sync the component status by running the syncTidbClusterStatus
function. Then, according to the status information, other operations will be carried out such as upgrade, scaling in, scaling out, and failover.
One of TiDB Operator's key operations is to sync the component status, which is recorded in the TidbCluster object and drives the steps that follow.
After syncing the status, TiDB Operator checks tc.Spec.Paused to determine whether synchronization should stop. If the cluster is paused, TiDB Operator skips all the following operations that update the StatefulSet.
Next, the syncConfigMap function updates the ConfigMap, which contains the component's configuration file and startup script.
The configuration file is extracted from the Config item in the object's Spec. TiDB Operator supports using TOML configuration directly (recommended) or converting the configuration from YAML.
The startup script contains the startup parameters required for the component and launches the component process using the parameters.
When a component needs to obtain its startup parameters from TiDB Operator, the parameter processing is handled by the Discovery service. For example, when PD needs to determine whether it should initialize a new cluster or join an existing one, it uses wget to query Discovery for the corresponding parameters. Because the parameters are obtained in the startup script rather than written into the StatefulSet, updating them doesn't trigger an unexpected rolling update of the StatefulSet, which could impact online services.
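As a rough illustration of how these two pieces travel together, the sketch below assembles a ConfigMap with one key for the TOML configuration and one for the startup script; the key names, the -pd suffix, and the buildPDConfigMap helper are simplified assumptions rather than the exact implementation:

package member

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildPDConfigMap sketches how a component ConfigMap could be assembled:
// one key for the TOML configuration and one for the startup script that
// queries the Discovery service for its startup parameters at run time.
func buildPDConfigMap(namespace, clusterName, tomlConfig, startupScript string) *corev1.ConfigMap {
    return &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Namespace: namespace,
            Name:      clusterName + "-pd",
        },
        Data: map[string]string{
            // The component's configuration file, rendered from the Config item in Spec.
            "config-file": tomlConfig,
            // The startup script; fetching parameters from Discovery at Pod start
            // keeps them out of the StatefulSet spec, so changing them doesn't
            // trigger an unexpected rolling update.
            "startup-script": startupScript,
        },
    }
}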
The getNewPDSetForTidbCluster function generates a new StatefulSet template, which contains references to the newly synced Service and ConfigMap. The function uses the latest status and spec to generate the other items, such as env, container, and volume.
This new StatefulSet then goes through three processes: rolling update, scaling, and failover. Finally, the UpdateStatefulSet
function compares the existing StatefulSet and the new one and determines whether to update the existing one.
The m.upgrader.Upgrade function performs operations related to rolling updates, mainly updating UpdateStrategy.Type and UpdateStrategy.RollingUpdate.Partition in the StatefulSet.
The rolling update is performed through the StatefulSet's own RollingUpdate strategy: during component reconciliation, the update strategy of the StatefulSet is set to RollingUpdate. In Kubernetes, you can control the rolling update progress by configuring UpdateStrategy.RollingUpdate.Partition: the StatefulSet only updates Pods that have not been updated yet and whose ordinal is greater than or equal to the Partition value. TiDB Operator uses this mechanism to ensure that the next Pod is rolling updated only after the previously updated Pod can serve external applications normally.
When the cluster is not being updated, or an update is just starting, the component reconciliation sets the Partition to the largest Pod ordinal in the StatefulSet, which prevents Pods from being updated unexpectedly. After the update starts, once a Pod has been updated and serves traffic again after its restart, the Pod is deemed successfully upgraded. TiDB Operator then decrements the Partition value and updates the next Pod.
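For illustration only, a reduced version of this partition handling might look like the sketch below; upgradedAndHealthy is a hypothetical callback standing in for the real readiness, member, and store health checks:

package member

import appsv1 "k8s.io/api/apps/v1"

// setPartition pins the StatefulSet's rolling update boundary: only Pods whose
// ordinal is greater than or equal to partition may be recreated with the new revision.
func setPartition(set *appsv1.StatefulSet, ordinal int32) {
    set.Spec.UpdateStrategy = appsv1.StatefulSetUpdateStrategy{
        Type: appsv1.RollingUpdateStatefulSetStrategyType,
        RollingUpdate: &appsv1.RollingUpdateStatefulSetStrategy{
            Partition: &ordinal,
        },
    }
}

// upgradeOneByOne walks the Pods from the highest ordinal down. It lowers the
// partition to the next ordinal only after the previously updated Pod serves
// traffic normally, so at most one Pod is being replaced at any time.
func upgradeOneByOne(newSet *appsv1.StatefulSet, replicas int32, upgradedAndHealthy func(ordinal int32) bool) {
    for i := replicas - 1; i >= 0; i-- {
        if !upgradedAndHealthy(i) {
            // This Pod still needs to be (or is being) updated; stop here
            // and retry in the next round of reconciliation.
            setPartition(newSet, i)
            return
        }
    }
    // Every Pod is on the new revision; nothing left to gate.
    setPartition(newSet, 0)
}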
m.scaler.Scale
scales components in and out. Its primary task is to update Replicas
of components in the StatefulSet.
The m.scaler.Scale function scales components one replica at a time. It compares the desired number of Replicas declared in the TidbCluster CR (for example, tc.Spec.PD.Replicas) with the current Replicas of the StatefulSet, determines whether to scale out or scale in, and performs the scaling operation on a single replica. It then enters the next round of component reconciliation. Through multiple rounds of reconciliation, m.scaler.Scale fulfills all scaling requirements.
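A stripped-down version of this one-replica-per-round logic could look like the following sketch, where scaleOutOne and scaleInOne are hypothetical placeholders for the per-component pre-scaling work (such as PD Leader transfer or TiKV store deletion) discussed next:

package member

import appsv1 "k8s.io/api/apps/v1"

// scaleOneStep moves the StatefulSet at most one replica closer to the desired
// count per reconciliation round; repeated rounds converge on the target.
// It assumes oldSet.Spec.Replicas is already set.
func scaleOneStep(desired int32, oldSet, newSet *appsv1.StatefulSet,
    scaleOutOne, scaleInOne func(ordinal int32) error) error {

    current := *oldSet.Spec.Replicas
    switch {
    case desired > current:
        // Scale out: prepare and add exactly one replica this round.
        if err := scaleOutOne(current); err != nil {
            return err
        }
        current++
    case desired < current:
        // Scale in: prepare and remove exactly one replica this round.
        if err := scaleInOne(current - 1); err != nil {
            return err
        }
        current--
    }
    // Hand the adjusted count to the new StatefulSet; the next reconciliation
    // round repeats the comparison until desired equals current.
    newSet.Spec.Replicas = &current
    return nil
}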
During the scaling process, PD needs to transfer the Leader, and TiKV needs to delete stores. These operations use PD APIs. During reconciliation, the m.scaler.Scale
function uses PD APIs to perform the operation and check whether the operation is successful. If so, it moves to the next scaling operation.
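To picture that dependency, the scaler can be thought of as working against a small PD client interface like the one below; the interface and its method names are illustrative assumptions, as the real PD client in TiDB Operator exposes a richer API:

package member

// pdAPI is an illustrative abstraction of the PD HTTP API calls the scaler
// relies on while scaling in.
type pdAPI interface {
    // TransferLeader asks PD to move the PD Leader off the member that is
    // about to be removed.
    TransferLeader(toMember string) error
    // DeleteStore asks PD to offline the TiKV store backing the Pod that is
    // about to be removed.
    DeleteStore(storeID uint64) error
    // StoreTombstoned reports whether the store has finished offlining, so the
    // scaler knows it may proceed to the next scaling operation.
    StoreTombstoned(storeID uint64) (bool, error)
}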
The m.failover.Failover
function performs operations related to disaster recovery, including discovering failure, recording failure status, and recovering from failure.
If you enable AutoFailover
when you deploy TiDB Operator, TiDB Operator monitors the component failure status. When it finds one, TiDB Operator records the failure information to FailureStores
or FailureMembers
. Next, it starts a new component Pod to take over the workload of the failed Pod. After the failed Pod recovers, TiDB Operator modifies the number of replicas in the StatefulSet to scale in the new Pod.
When TiDB Operator performs failover for TiKV and TiFlash, the newly-created Pod is not scaled in by default. You need to configure spec.tikv.recoverFailover: true
to enable the auto scaling-in.
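Sketched in code, the core failover bookkeeping might look roughly like this; failureStore, its fields, and the helper names are simplified assumptions modeled on the behavior described above, not the exact types in TiDB Operator:

package member

import "time"

// failureStore records one failed TiKV store, mirroring the idea of the
// FailureStores bookkeeping described above (field names are illustrative).
type failureStore struct {
    PodName   string
    StoreID   string
    CreatedAt time.Time
}

// markStoreFailed records a newly discovered failure. Each recorded failure
// later causes the member manager to request one extra replica, so a new Pod
// is started to take over the failed Pod's workload.
func markStoreFailed(failureStores map[string]failureStore, podName, storeID string) {
    if _, exist := failureStores[storeID]; exist {
        return // already recorded in a previous round of reconciliation
    }
    failureStores[storeID] = failureStore{
        PodName:   podName,
        StoreID:   storeID,
        CreatedAt: time.Now(),
    }
}

// desiredReplicas is the replica count handed to the StatefulSet: the
// user-specified replicas plus one replacement Pod per recorded failure.
func desiredReplicas(specReplicas int32, failureStores map[string]failureStore) int32 {
    return specReplicas + int32(len(failureStores))
}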
In the last phase, the new StatefulSet is created. The control loop now enters the UpdateStatefulSet
function, which compares the new StatefulSet with the existing one. If the two StatefulSets are inconsistent, the function updates the existing StatefulSet.
The function also checks whether there are StatefulSets that are not managed by TiDB Operator. Because earlier versions of TiDB Operator used Helm Chart to deploy TiDB, TiDB Operator needs to add dependency marks to these old StatefulSets and include them in the lifecycle management.
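A condensed version of that final compare-and-update step, leaving out the adoption of Helm-era StatefulSets and the revision bookkeeping, might look like this sketch; statefulSetClient is a hypothetical stand-in for the StatefulSetControl dependency:

package member

import (
    appsv1 "k8s.io/api/apps/v1"
    apiequality "k8s.io/apimachinery/pkg/api/equality"
)

type statefulSetClient interface {
    Update(set *appsv1.StatefulSet) error
}

// updateStatefulSet applies the new template only when it actually differs
// from what is running, so a no-op reconciliation round doesn't touch the API.
func updateStatefulSet(cli statefulSetClient, oldSet, newSet *appsv1.StatefulSet) error {
    if apiequality.Semantic.DeepEqual(oldSet.Spec, newSet.Spec) &&
        apiequality.Semantic.DeepEqual(oldSet.Annotations, newSet.Annotations) {
        return nil // nothing changed; skip the update
    }
    set := oldSet.DeepCopy()
    set.Spec = newSet.Spec
    set.Annotations = newSet.Annotations
    return cli.Update(set)
}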
After the operations above, the status of the TidbCluster CR is updated to the latest version, the related Service and ConfigMap are created, and the new StatefulSet is generated, which performs rolling updates, scaling, and failover. The component reconciliation keeps running, monitoring the component lifecycle and responding to lifecycle status changes and user-specified changes, so the whole cluster keeps running normally.
Apart from the reconciliation of TiDB's major components, the remaining functions perform other lifecycle management operations:

reclaimPolicyManager.Sync syncs tc.Spec.PVReclaimPolicy. By default, the PV reclaim policy is set to Retain to reduce the risk of data loss.

orphanPodsCleaner.Clean cleans up orphan Pods.

discoveryManager.Reconcile reconciles the Discovery service, which provides startup parameters and peer discovery for the TidbCluster.

metaManager.Sync syncs StoreIDLabel, MemberIDLabel, and NamespaceLabel to the labels of Pods, PVCs, and PVs.

pvcCleaner.Clean and pvcResizer.Resize clean up and resize the PVCs used by the cluster.

tidbClusterStatusManager.Sync syncs the cluster status, including the information related to TidbMonitor and TiDB Dashboard.

By now, you have learned the design of the component control loop in TiDB, including how the Service, ConfigMap, and StatefulSet of a component are synced, and how rolling update, scaling, and failover are performed.
Almost all control loops in TiDB Operator conform to the design logic described in this article. In future posts, I'll further explain how this logic is applied to each component to manage the component lifecycle.
If you have any questions or ideas about TiDB Operator, feel free to join our Slack channel or join our discussions at pingcap/tidb-operator!