Industry: Information Technology
Author: Min Zhou (Staff engineer at Databricks)
Editors: Yajing Wang, Tom Dewan
Databricks is an enterprise software company founded by the creators of Apache Spark. The company has also created many other popular open-source projects like Delta Lake and MLflow, helping data teams solve some of the world's toughest problems.
In this article, I'll discuss some of the design decisions we made as well as some of the lessons we learned from adopting TiDB, an elastic, real-time Hybrid Transactional/Analytical Processing (HTAP) database, at Databricks.
At Databricks, we use Amazon Relational Database Service (RDS) and MySQL as our standard databases. Unfortunately, MySQL cannot scale beyond a single node. The lack of a scalable database has been a source of challenges at Databricks.
Currently, we deploy our MySQL databases across three clouds: Amazon Web Services (AWS), Azure, and Google Cloud Platform, in close to 100 regions. Because of MySQL's scalability limit, we have to deploy many MySQL instances in each region—one instance per service. The data on top of those regions totals tens of terabytes. To address this issue, we needed to find an alternative that is:
Our research led us to TiDB.
We decided to run TiDB ourselves. However, using self-operating TiDB to replace cloud providers’ managed MySQL services brought the following challenges:
We deployed TiDB in a dedicated Kubernetes cluster for each region. The application services are behind an L7 proxy on another Kubernetes cluster. The application service uses the DNS name of each TiDB cluster's load balancer to query TiDB. This is exactly the same way application services connect to RDS MySQL. Therefore, migrating from MySQL to TiDB did not need significant configuration changes.
TiDB Operator helps manage TIDB on Kubernetes and automates operating tasks. TiDB Operator is in the tidb-admin
namespace. One Kubernetes cluster can run multiple TiDB clusters. The clusters are separated by Kubernetes namespaces. Each of the component parts is exclusively associated with one node for consistent performance.
TiDB is designed to survive software and hardware failures from server restarts to datacenter outages. The three cloud providers we use offer the option of deploying TiDB clusters across multiple availability zones (AZs).
The diagram above shows a minimal configuration of TiDB zonal redundancy. In each of the three regions, there is one Placement Driver (PD), one TiDB server, one TiKV server, and one change data capture framework (TiCDC).
We do not rely on cloud providers’ zonal redundancy features like cross-zone remote disk attachment or remote snapshotting to recover TiDB cluster data. This is because:
Instead, we rely on the zonal redundancy provided by TiDB. All replicas are distributed among three AZs with high availability and disaster recovery. If one AZ goes down, the other two AZs automatically restart the leader election and resume the service within a configurable period.
TiDB provides two approaches for regional redundancy: Raft-based and binlog-based.
The core mechanism of TiDB's disaster recovery is Raft: a consensus algorithm based on logs and state machines. The Raft-based approach takes advantage of TiDB's native Raft support by placing two replicas in two zones, and the third replica in a different data region, or three replicas in three regions.
The binlog-based approach is akin to the traditional MySQL replication method, which used asynchronous replication. We deploy one TiDB cluster in each region and pair the clusters through an internal replication service.
We chose the binlog-based approach because:
We also provide a complete backup plan. We will back up TiDB clusters to external storage such as AWS S3 or Azure Blob storage once a day, and then retain the backup for seven days. We use TiDB's BR for backup and restoration. BR does not yet allow backup to Azure; however, we're working with PingCAP to implement this feature.
In a TiDB cluster, failures can occur at several levels, including the disk, Pod, Kubernetes node, and network partition. TiDB has demonstrated its excellent stability and viability through formal testing. Examples include:
Note:
If a network partition fails, TiDB is unavailable. This failure needs human intervention.
TiDB offers many types of manual operations, including the following:
Cloud providers usually offer both local storage physically attached to the machine as well as remote disks. According to our benchmarks, the local disk and the remote disk have the same read performance. Although local disks offer the best write latency, they add significant overhead to operations. For example, If we replace a host, we must migrate all the data on it. In some extreme scenarios, if we update all the machines in the cluster, we must copy all the data on that cluster to another fleet. Furthermore, because local disks are ephemeral, a local disk configuration cannot tolerate concurrent maintenance of machines.
To reduce the overhead of remote disks, we propose a cost-effective solution. This also improves the throughput of the remote disk and lowers the latency to an acceptable level.
Data sharing comes with penalties. The queries of one service may affect the queries of another. In other words, a bad query may occupy the resource of a certain node, thereby slowing down other services on it. To avoid this situation, we propose:
We began evaluating TiDB earlier this year, starting with a single TiDB cluster without isolation, and we're now deploying TiDB in two of our regions. We expect to implement Placement Rules in SQL through joint efforts with PingCAP in Q4 2021.
If you'd like to learn more about our experience with TiDB, or if you have any questions, you can join the TiDB community on Slack. If you're interested, you can get started with TiDB here.
This article is based on a talk given by Min Zhou at PingCAP DevCon 2021.