Industry: E-commerce
Author: Yihao Zhang (Senior Development Engineer at Xiaohongshu)
Transcreator: Caitin Chen; Editor: Tom Dewan
Xiaohongshu is a popular social media and e-commerce platform in China. Maybe you know us by our other name: RED or Little Red Book. The Xiaohongshu app allows users to post and share product reviews, travel blogs, and lifestyle stories via short videos and photos. By July 2019, we had over 300 million registered users.
Our business is growing quickly, and our application receives more than 100 million rows of data every day. We looked for a database that supports both transactional and analytical queries, real-time data analytics, and horizontal scalability. We found that TiDB, an open-source, distributed SQL database that supports Hybrid Transactional/Analytical Processing (HTAP) workloads, was a good solution.
In this post, I'll describe why we chose TiDB and how TiDB's real-time HTAP capabilities help us manage our data in some scenarios.
Like many companies, our business is data-driven. When we were looking for a database, we wanted one that supported both transactional and analytical queries, delivered real-time data analytics, and scaled out horizontally. TiDB met each of these requirements, and it was the clear choice for our new database.
At Xiaohongshu, we're using TiDB in multiple scenarios. To give you an idea of the kinds of problems TiDB solves for us, I have selected three typical applications: data reports, queries on our online sharded MySQL database, and anti-fraud data analytics.
The figure below shows our infrastructure before we used TiDB. In terms of application logic, from top to bottom, we had the online application layer, the offline data warehouse layer, and the data service layer.
Let's review our three application scenarios in a non-TiDB world.
For data reports, the data in Hadoop was pre-aggregated in T+1 mode through Hive (that is, each day's data was processed and became available the next day), and the aggregated, high-dimensional data was loaded into MySQL for queries. Then, we built business intelligence (BI) systems to graphically display the report queries, so our analysts could see their customized reports.
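To make this concrete, here is a minimal sketch of the kind of daily Hive pre-aggregation job such a pipeline runs; the table and column names are hypothetical, not our production schema:

```sql
-- Hypothetical Hive query: roll up one day's raw events into a
-- low-dimensional daily summary that is then exported to MySQL for BI.
INSERT OVERWRITE TABLE daily_order_summary PARTITION (dt = '2020-06-01')
SELECT category,
       region,
       COUNT(*)    AS orders,
       SUM(amount) AS gmv
FROM   raw_orders
WHERE  dt = '2020-06-01'
GROUP BY category, region;
```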
What's wrong with this approach? With the rapid growth of our business, the reports became more diverse, and MySQL's scalability became a challenge. If we kept adding MySQL nodes, they would eventually become difficult to manage. If you've operated and maintained a database, you know how cumbersome this can be.
Let's look at the online sharded MySQL database in the primary-secondary architecture. We need to perform data queries on it without affecting the online application, so we can only query the secondary database. This secondary database is, of course, also sharded.
A series of questions arises here. First, we still have operation and maintenance issues: every shard and every secondary instance is one more database to deploy, monitor, and keep in sync. This is the complexity of operation and maintenance.
In addition, if I want to run a transaction on a sharded MySQL database, can I use sharding middleware? If I want to do a `JOIN`, or even a `GROUP BY` aggregate query, is that possible with sharding middleware? It may be possible, but it will not be simple, so we needed a solution that could easily handle complex, distributed queries.
In the anti-fraud data analytics scenario, we want to reduce the latency between data ingestion and processing. Before TiDB, we wrote data to the backend data warehouse in T+1 mode, and the time to insights didn't meet our requirements. Fraud can happen very quickly, so it's important to see the details of issued coupons within seconds and take measures immediately.
To solve the challenges mentioned above, we introduced the TiDB 3.0 HTAP solution into our application architecture. As shown in the figure below, the data service layer with TiDB can provide all the data services our applications need.
Let's see how TiDB improved our data service capabilities in the three application scenarios mentioned above.
In the data report scenario, TiDB replaces MySQL, which solves the complicated problem of scaling out MySQL as our business grows. We were able to migrate seamlessly because TiDB is highly compatible with the MySQL protocol—an excellent decision by the TiDB team. Because TiDB is MySQL-compatible, the frontend BI tools team didn't have to develop a TiDB driver; they could use the MySQL driver directly. As for horizontal scalability, this is what TiDB is best at. We can add a node directly, and the data will be automatically rebalanced, which is very convenient.
As our application data quickly grew, a standalone MySQL database couldn't store very much data, so we sharded the data into 10,000 tables. But a single TiDB cluster can store the large amounts of data we have.
We replicated MySQL data to TiDB via the binlog in real time, with replication latency under one second, and merged the sharded tables into one large table in TiDB.
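As an illustration, here is a minimal sketch of how such shard merging can be configured with TiDB Data Migration (DM); the host, source ID, and schema/table patterns are placeholders, not our production settings:

```yaml
# Hypothetical DM task: replicate sharded MySQL tables into one TiDB table.
name: "shard-merge-task"
task-mode: all          # full import plus incremental binlog replication

target-database:
  host: "tidb.example.com"
  port: 4000
  user: "dm_user"
  password: "***"

mysql-instances:
  - source-id: "mysql-shard-01"
    route-rules: ["shard-route"]

routes:
  shard-route:
    schema-pattern: "order_db_*"   # all sharded schemas...
    table-pattern: "orders_*"      # ...and all sharded tables...
    target-schema: "order_db"      # ...merge into one schema
    target-table: "orders"         # ...and one large table
```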
Finally, we see only one large table in TiDB: no more shards. We use MySQL as the primary database and TiDB as the secondary database, so we can query TiDB without affecting MySQL. In addition, TiDB supports ACID transactions, `JOIN` operations, and aggregate functions.
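For example, a cross-shard aggregation that would be painful through sharding middleware becomes an ordinary SQL statement against the merged table; the tables and columns here are hypothetical:

```sql
-- Runs on the single merged table in TiDB; no shard routing required.
SELECT u.city,
       COUNT(*)      AS order_count,
       SUM(o.amount) AS total_amount
FROM   orders o
JOIN   users  u ON u.id = o.user_id
WHERE  o.created_at >= '2020-06-01'
GROUP BY u.city;
```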
After we applied TiDB in our applications, we switched from T+1 batch writes to writing data in real time with SQL statements in Apache Flink. The tracked events are generated quickly, and their peak queries per second (QPS) can reach 30,000 or 40,000. A single table may receive about 500 million rows of data per day; if we keep data for 10 days, a single table holds roughly 5 billion rows. After writing the data, we mainly perform ad hoc queries. If an analyst wants to see how coupons are used and understand their distribution within minutes, we bypass the Hadoop data warehouse and query TiDB directly in real time.
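A minimal sketch of what this write path can look like in Flink SQL, assuming a Kafka source table (`kafka_coupon_events`) defined elsewhere and placeholder connection settings:

```sql
-- Hypothetical Flink SQL sink: TiDB speaks the MySQL protocol, so the
-- standard JDBC connector with a MySQL URL is enough.
CREATE TABLE coupon_events_tidb (
  event_id   BIGINT,
  user_id    BIGINT,
  coupon_id  BIGINT,
  event_time TIMESTAMP(3),
  PRIMARY KEY (event_id) NOT ENFORCED
) WITH (
  'connector'  = 'jdbc',
  'url'        = 'jdbc:mysql://tidb.example.com:4000/antifraud',
  'table-name' = 'coupon_events',
  'username'   = 'writer',
  'password'   = '***'
);

-- Stream tracked events into TiDB continuously instead of in T+1 batches.
INSERT INTO coupon_events_tidb
SELECT event_id, user_id, coupon_id, event_time
FROM   kafka_coupon_events;
```

Analysts can then run their ad hoc queries against the same `coupon_events` table in TiDB, seconds after the events are written.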
TiKV is TiDB's row-based storage engine. Compared with TiDB 3.0, TiDB 4.0 has improved its analytical query performance by introducing a columnar storage engine, TiFlash.
TiFlash is an extended analytical engine and columnar store for TiDB. A TiDB database that incorporates TiFlash lets users perform real-time HTAP analytics.
TiFlash has several great design points: it replicates data from TiKV through the Raft Learner protocol, so the columnar replicas stay consistent without slowing down transactional writes; analytical (OLAP) and transactional (OLTP) workloads run on physically separate nodes, so they don't interfere with each other; and TiDB's optimizer can choose between the row store and the column store for each query.
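In practice, enabling TiFlash for a table and steering an analytical query to it takes only a couple of statements; the `orders` table below is a placeholder:

```sql
-- Create two columnar (TiFlash) replicas of the table; TiDB replicates
-- the data and keeps it consistent automatically.
ALTER TABLE orders SET TIFLASH REPLICA 2;

-- Check replication progress.
SELECT * FROM information_schema.tiflash_replica
WHERE  TABLE_NAME = 'orders';

-- Force an aggregate query to read from TiFlash instead of TiKV.
SELECT /*+ READ_FROM_STORAGE(TIFLASH[orders]) */
       warehouse_id, SUM(amount) AS total_amount
FROM   orders
GROUP BY warehouse_id;
```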
We also tested TiFlash, taking the logistics scenario as an example. We evaluated 393 production queries and found that for aggregate queries such as `GROUP BY` and `SUM`, compared with TiKV, TiFlash's query performance improved about three to ten times, and the average execution time was reduced by about 68%.

TiDB 4.0 also introduces pessimistic locking. In logistics scenarios, many tables require a `JOIN`, but `JOIN`s are expensive. To avoid performing a `JOIN`, we combine these tables into a large, wide table in advance. For example, if I combine three tables into one large, wide table, three streams will update this table at the same time, and they might update the same row. TiDB 3.0 uses an optimistic concurrency control mechanism, which causes transaction conflicts in this situation. TiDB 4.0 supports pessimistic locking, which effectively solves this problem.
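As a minimal sketch (with a hypothetical wide table), this is how a session opts into pessimistic transactions so that concurrent streams updating the same row wait on row locks instead of failing at commit time:

```sql
-- Use pessimistic transactions for this session.
SET SESSION tidb_txn_mode = 'pessimistic';

BEGIN;
-- If another stream is updating the same row, this statement waits for
-- the row lock rather than aborting later with a write conflict.
UPDATE logistics_wide
SET    shipping_status = 'dispatched'
WHERE  order_id = 10001;
COMMIT;
```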
We have an excellent relationship with PingCAP's TiFlash team, and we often suggest new requirements. For example, an earlier version of TiFlash did not support `DISTINCT COUNT`. The development team understood our need for optimization, and soon after our request, TiFlash supported `DISTINCT COUNT`.
You may be more familiar with the ClickHouse column storage engine, so let's compare it with TiFlash. Although ClickHouse has faster computing performance than TiFlash, we chose TiFlash because ClickHouse has some issues. For example:
Many of our applications have transactions and need frequent updates. However, ClickHouse doesn't support updates. If we converted those update operations to `APPEND` or `INSERT` operations, the application team would have to make a lot of changes, such as deduplicating the data. Because many of our scenarios require frequent updates, we chose TiFlash.
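For example, an in-place update or a MySQL-style upsert like the following (with a hypothetical table) is routine in TiDB but has no direct equivalent in ClickHouse:

```sql
-- A plain update, which ClickHouse does not support.
UPDATE user_profile
SET    coupon_balance = coupon_balance - 1
WHERE  user_id = 42;

-- A MySQL-style upsert, which TiDB supports out of the box.
INSERT INTO user_profile (user_id, coupon_balance)
VALUES (42, 10)
ON DUPLICATE KEY UPDATE coupon_balance = VALUES(coupon_balance);
```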
As our business quickly develops, more than 100 million rows of data flood into our application every day. We adopted TiDB because it is a one-stop data warehouse for both OLTP and OLAP workloads, it supports real-time data analytics, and it's easy to scale out. Since 2017, we've used TiDB in our production environment.
If you'd like to learn more about our experience with TiDB, you can join the TiDB community on Slack. Now HTAP is supported in TiDB Cloud, the fully managed database service provided by PingCAP. You can apply for a two-week free trial here.