Industry: Knowledge Sharing
Author: Xiaoguang Sun (Backend Search Manager at Zhihu, TiKV Project Maintainer)
Zhihu, which means “Do you know?” in classical Chinese, is the Quora of China: a question-and-answer website where all kinds of questions are created, answered, edited, and organized by the community of its users. As China's biggest knowledge sharing platform, we currently have 220 million registered users, and 30 million questions with more than 130 million answers on the site. In August 2018, we announced that we had raised $270 million in series E funding.
As our business boomed, the data size of our applications grew beyond what we could handle. About 1.3 trillion rows of data were stored in our Moneta application (which stores posts users have already read). With approximately 100 billion rows of data accruing each month and growing, this number will reach 3 trillion within two years. We faced severe challenges in scaling our backend system while maintaining a good user experience.
In this post, I'll dive deep into how we keep query response times in the milliseconds over such a large amount of data, and how TiDB, an open source MySQL-compatible NewSQL Hybrid Transactional/Analytical Processing (HTAP) database, empowered us to get real-time insights into our data. I'll cover why we chose TiDB, how we use it, what we learned, our best practices, and some thoughts about the future.
This section covers the architecture of our Moneta application, the solutions we tried while building an ideal architecture, and why database scalability remained our major pain point.
Zhihu's Post Feed service is a crucial system through which users are exposed to content posted on the site. The Moneta application in the backend stores the posts users have read, and filters out these posts in the post stream of Zhihu's Recommendations page.
The Moneta application has the following characteristics:
Considering the facts above, we need an application architecture with the following features:
To build an ideal architecture with the features above, we incorporated three key components in our previous architecture:
MySQL sharding combined with MHA was not a good solution for us, because both MySQL sharding and MHA have their shortcomings.
Database scalability was still the weak point of the overall system until we found TiDB and migrated data from MySQL to TiDB.
The TiDB platform is a collection of components that, when used together, become a NewSQL database with HTAP capabilities.
Inside the TiDB platform, the main components are as follows:
Beyond these main components, TiDB also has an ecosystem of tools, such as Ansible scripts for quick deployment, Syncer and TiDB Data Migration for seamless migration from MySQL, and TiDB Binlog for collecting the logical changes made to a TiDB cluster and providing incremental backup and replication to the downstream (TiDB, Kafka, or MySQL).
TiDB's key features include the following:
In this section, I'll show you how TiDB functions in Moneta's architecture and its performance metrics for the Moneta application.
We deployed TiDB in our system, and the overall architecture of the Moneta application became:
In this system, all the components are self-recoverable, and the entire system has a global failure-monitoring mechanism. We use Kubernetes to orchestrate the system and guarantee high availability of the service as a whole.
Since we applied TiDB in our production environment, our system has become highly available and easy to scale, and system performance has improved remarkably.
As an example, take a set of performance metrics for the Moneta application from June 2019:
Our migration to TiDB didn't go off without a hitch. Here we'd like to share some lessons we learned.
We used TiDB Data Migration (DM) to collect MySQL incremental binlog files and then used TiDB Lightning to quickly import the data into the TiDB cluster.
To our surprise, it only took four days to import these 1.1 trillion records into TiDB. If we had written the data into the system logically, it might have taken a month or more. We could have imported the data even faster if we'd had more hardware resources.
After we finished the migration, we tested a small amount of read traffic. When the Moneta application first went online, we found that the query latency didn't meet our requirements. To solve the latency issue, we worked with PingCAP engineers to tune system performance.
During this process, we gained valuable experience:
Some queries are sensitive to query latency and some are not. We deployed a separate TiDB database to process latency-sensitive queries. (Other queries that are not latency-sensitive are processed in a different TiDB database.) This way, large queries and latency-sensitive queries are processed in different databases, and the execution of the former won't affect the latter.
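As a rough illustration of this split, here's a minimal sketch at the application layer, assuming two hypothetical TiDB endpoints and the MySQL Connector/Python driver; the hostnames, credentials, table, and queries are placeholders rather than our production setup.

```python
# Minimal sketch: route latency-sensitive queries and large/batch queries to
# two separate TiDB clusters so heavy queries can't starve the fast path.
# Hostnames, credentials, and the example queries are hypothetical.
import mysql.connector

LATENCY_CRITICAL_DSN = dict(host="tidb-latency-critical", port=4000,
                            user="app", password="secret", database="moneta")
BATCH_DSN = dict(host="tidb-batch", port=4000,
                 user="app", password="secret", database="moneta")

def run_query(sql, params=(), latency_sensitive=True):
    """Execute a query against the cluster that matches its latency profile."""
    dsn = LATENCY_CRITICAL_DSN if latency_sensitive else BATCH_DSN
    conn = mysql.connector.connect(**dsn)
    try:
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur.fetchall()
    finally:
        conn.close()

# Fast path: has this user already read these posts?
read_posts = run_query(
    "SELECT post_id FROM read_history WHERE user_id = %s AND post_id IN (%s, %s)",
    (42, 1001, 1002), latency_sensitive=True)

# Slow path: a large analytical scan goes to the batch cluster.
daily_counts = run_query(
    "SELECT DATE(read_at), COUNT(*) FROM read_history GROUP BY DATE(read_at)",
    latency_sensitive=False)
```

The point of the split is isolation: a long-running scan on the batch cluster cannot queue up behind, or slow down, the point lookups on the latency-critical cluster.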
For queries that didn't have an ideal execution plan, we wrote SQL hints to help the execution engine select an optimal execution plan.
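For example, the sketch below shows two common ways to steer the TiDB optimizer: a MySQL-style index hint and a TiDB-specific `/*+ ... */` comment hint. The table, index, and join are hypothetical, and the hint names should be checked against the TiDB version you run.

```python
# Minimal sketch of steering the optimizer with hints, using hypothetical
# tables. TiDB accepts MySQL-style index hints as well as its own
# /*+ ... */ optimizer hints (e.g. TIDB_INLJ to request an index join).
import mysql.connector

conn = mysql.connector.connect(host="tidb", port=4000, user="app",
                               password="secret", database="moneta")
cur = conn.cursor()

# MySQL-style index hint: force the (hypothetical) idx_user_time index.
cur.execute(
    "SELECT post_id FROM read_history USE INDEX (idx_user_time) "
    "WHERE user_id = %s AND read_at > NOW() - INTERVAL 2 DAY", (42,))
print(cur.fetchall())

# TiDB optimizer hint: ask for an index nested-loop join between two tables.
cur.execute(
    "SELECT /*+ TIDB_INLJ(r) */ p.title "
    "FROM posts p JOIN read_history r ON p.id = r.post_id "
    "WHERE r.user_id = %s", (42,))
print(cur.fetchall())

conn.close()
```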
We used low-precision timestamp oracle (TSO) and prepared statements to reduce network round-trips.
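A minimal sketch of both techniques is shown below, assuming the `tidb_low_resolution_tso` session variable (check that your TiDB version supports it) and a hypothetical `read_history` table; the prepared cursor lets the server parse the statement once and then accept only the bound parameters on each execution.

```python
# Minimal sketch: enable low-precision TSO for reads on this session and reuse
# a server-side prepared statement, so repeated lookups avoid re-parsing the
# SQL text. The variable name tidb_low_resolution_tso and the table are
# assumptions -- verify them against the TiDB version you run.
import mysql.connector

conn = mysql.connector.connect(host="tidb", port=4000, user="app",
                               password="secret", database="moneta")

# Trade a little read freshness for fewer TSO requests (reads only).
setup = conn.cursor()
setup.execute("SET SESSION tidb_low_resolution_tso = 1")
setup.close()

# Prepared cursor: the statement is prepared once on the server, then each
# execute() only ships the bound parameters.
cur = conn.cursor(prepared=True)
lookup = "SELECT 1 FROM read_history WHERE user_id = %s AND post_id = %s"
for user_id, post_id in [(42, 1001), (42, 1002), (7, 2005)]:
    cur.execute(lookup, (user_id, post_id))
    already_read = cur.fetchone() is not None
    print(user_id, post_id, already_read)

conn.close()
```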
Before we tried TiDB, we didn't analyze how many hardware resources we would need to support the same amount of data we had on the MySQL side. To reduce maintenance costs, we had deployed MySQL in a single-primary, single-secondary topology. In contrast, the Raft protocol implemented in TiDB requires at least three replicas. Therefore, we needed more hardware resources to support our business data in TiDB, and we needed to prepare machine resources in advance.
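As a back-of-the-envelope illustration of that difference, the sketch below compares the raw storage footprint of a two-copy MySQL topology with a three-replica TiKV deployment; the numbers are made up for illustration and ignore compression, which differs between the two systems.

```python
# Back-of-the-envelope storage estimate: MySQL with one primary and one
# secondary keeps 2 copies of the data, while TiKV's Raft replication keeps
# at least 3. The input figure below is illustrative, not Zhihu's real number.
logical_data_tb = 30          # hypothetical logical data size, in TB
mysql_copies = 2              # primary + secondary
tikv_replicas = 3             # Raft requires at least three replicas

mysql_raw_tb = logical_data_tb * mysql_copies
tikv_raw_tb = logical_data_tb * tikv_replicas

print(f"MySQL raw storage: {mysql_raw_tb} TB")
print(f"TiKV raw storage:  {tikv_raw_tb} TB "
      f"({tikv_raw_tb / mysql_raw_tb:.1f}x the MySQL footprint before compression)")
```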
Once our data center was set up properly, we could quickly complete our evaluation of TiDB.
At Zhihu, the anti-spam and Moneta applications are architected the same way. We tried Titan and Table Partition in the release candidate versions of TiDB 3.0 (TiDB 3.0.0-rc.1 and TiDB 3.0.0-rc.2) in our anti-spam application on in-production data. As a result, we found that these two features worked incredibly well.
Titan reduced latency
The anti-spam application has always been plagued by high query and write latency.
We heard that TiDB 3.0 would introduce Titan, a key-value storage engine, to reduce write amplification in RocksDB (the underlying storage engine in TiKV) when using large values.
To give this feature a try, we enabled Titan after TiDB 3.0.0-rc.2 was released. The following figure compares write and query latency with RocksDB versus with Titan:
The statistics show that after we enabled Titan, both write and query latencies dropped sharply. It was really astonishing! When we saw the statistics, we couldn't believe our eyes.
Table Partition improved query performance
We have also used TiDB 3.0's Table Partition feature in the anti-spam application. Using this feature, we can divide a table into multiple partitions by time. When a query comes, it is performed on a partition that covers the target time range. This has drastically improved our query performance.
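As a concrete illustration, here's a minimal sketch of a table range-partitioned by day and a query bounded to the last 48 hours; the schema and partition boundaries are hypothetical, not the real anti-spam tables.

```python
# Minimal sketch: a range-partitioned table split by day, so queries that
# filter on a recent time window only scan the matching partitions.
# The schema and partition boundaries are hypothetical.
import mysql.connector

DDL = """
CREATE TABLE IF NOT EXISTS spam_events (
    id          BIGINT NOT NULL,
    user_id     BIGINT NOT NULL,
    detected_at DATETIME NOT NULL,
    reason      VARCHAR(255),
    PRIMARY KEY (id, detected_at)
)
PARTITION BY RANGE (TO_DAYS(detected_at)) (
    PARTITION p20190610 VALUES LESS THAN (TO_DAYS('2019-06-11')),
    PARTITION p20190611 VALUES LESS THAN (TO_DAYS('2019-06-12')),
    PARTITION p20190612 VALUES LESS THAN (TO_DAYS('2019-06-13')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
)
"""

conn = mysql.connector.connect(host="tidb", port=4000, user="app",
                               password="secret", database="antispam")
cur = conn.cursor()
cur.execute(DDL)

# This query is bounded to the last 48 hours, so only the newest partitions
# are scanned instead of the whole table.
cur.execute(
    "SELECT COUNT(*) FROM spam_events "
    "WHERE detected_at >= NOW() - INTERVAL 48 HOUR AND user_id = %s", (42,))
print(cur.fetchone())
conn.close()
```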
Let's consider what may happen when we adopt TiDB 3.0 in both the Moneta and anti-spam applications in the future.
TiDB 3.0 has outstanding features, such as Batch Message in gRPC, multi-thread Raftstore, SQL plan management, and TiFlash. We believe these will add lustre to the Moneta application.
Moneta's write throughput is more than 40,000 transactions per second (TPS), which is no small number. It's exciting news that TiDB 3.0 can send and receive Raft messages in batches and process Region Raft logic in multiple threads. We believe these features will noticeably improve our system's concurrency capacity.
As mentioned above, we had written quite a few SQL hints to make the query optimizer select an optimal execution plan. TiDB 3.0 adds an SQL plan management feature that binds queries to specific execution plans directly within the TiDB server. With this feature, we don't need to modify query text to inject a hint.
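A minimal sketch of what such a binding can look like is below; the table, index, and binding statement are illustrative, and the exact binding syntax should be checked against the TiDB version you run.

```python
# Minimal sketch of a TiDB SQL binding: tie a query's shape to a hinted plan
# so application SQL doesn't have to carry hints itself. Table and index
# names are hypothetical.
import mysql.connector

conn = mysql.connector.connect(host="tidb", port=4000, user="app",
                               password="secret", database="moneta")
cur = conn.cursor()

cur.execute("""
CREATE GLOBAL BINDING FOR
    SELECT post_id FROM read_history WHERE user_id = 1 AND read_at > '2019-06-01'
USING
    SELECT post_id FROM read_history USE INDEX (idx_user_time)
    WHERE user_id = 1 AND read_at > '2019-06-01'
""")

# From now on, statements matching this shape (modulo literal values) run with
# the bound plan; the application query text stays unchanged.
cur.execute(
    "SELECT post_id FROM read_history WHERE user_id = %s AND read_at > %s",
    (42, "2019-06-20"))
print(cur.fetchall())
conn.close()
```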
At TiDB DevCon 2019, I first heard about TiFlash, an extended analytical engine for TiDB. It uses column-oriented storage to achieve a high data compression rate, and it applies an extended Raft consensus algorithm in data replication to ensure data safety and consistency.
Because we have massive data with high write throughput, we can't afford to use ETL every day to replicate data to Hadoop for analysis. But with TiFlash, we're optimistic that we can easily analyze our tremendous data volume.
In contrast to the Moneta application's enormous historical data size, the anti-spam application has much higher write throughput. However, it only queries the data stored in the last 48 hours. In this application, data increases by 8 billion records and 1.5 TB every day.
Because TiDB 3.0 can send and receive Raft messages in batches and process Region Raft logic in multiple threads, we can manage the application with fewer nodes. Previously, we used seven physical nodes, but now we only need five. These features boost performance even when we use commodity hardware.
TiDB is a MySQL-compatible database, so we can use it just as we used MySQL. Owing to TiDB's horizontal scalability, now we can freely scale our database even when we have more than one trillion records to cope with. Besides, TiDB's high availability and excellent performance have also strengthened our system.
Up to now, we've used quite a bit of open-source software in our applications, and we've learned a lot about handling our system problems with TiDB. We've decided to participate in developing open-source tools and in the long-term growth of the community. Based on our joint efforts with PingCAP, TiDB will become more powerful and robust.