Industry: E-Commerce
Author: Ruixing Luo (Senior Big Data Engineer)
Yiguo.com is a fresh produce e-commerce platform, serving close to 5 million users and more than 1,000 enterprise customers. We have long devoted ourselves to providing fresh food for ordinary consumers and gained increasing popularity since our founding in 2005. With a rapidly growing user base and data, we need a highly performant, horizontally scalable realtime database system that can support both Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) workloads, in order to make timely and accurate decisions to improve the quality of service to our users.
Combining TiDB and TiSpark, we got exactly that–a horizontally scalable, strongly consistent, and highly available data processing platform that supports both OLTP and OLAP workloads on quickly growing datasets in near real-time. By adopting TiDB and TiSpark, our applications now have a new competitive edge to deliver better services faster than ever before.
Previously, our data analysis system was built on top of the Hadoop ecosystem and SQL Server. We ran a near real-time system on SQL Server in the very beginning, implemented by our developers through writing and maintaining corresponding stored procedures. Because the data size was not that large, SQL Server was enough to meet our demand. As our business and user base grows, the amount of data has also grown to a point beyond what SQL Server could handle, to provide timely and accurate decision making. Our existing system became a bottleneck, and we had to look for a new solution.
Our requirements for a new system are:
Based on our requirements, we evaluated three options – Greenplum, Apache Kudu, and TiDB/TiSpark. We decided to go with TiDB/TiSpark because of the following reasons:
TiDB is an open source distributed HTAP database, inspired by Google Spanner/F1. It has the following core features:
TiSpark is a thin layer built for running Apache Spark on top of TiDB/TiKV to accelerate complex OLAP queries. It takes the performance boost of Spark and marries it with the rest of the TiDB cluster components to deliver a one-stop solution for both online transactions and analysis.
TiDB also has a wealth of other tools in its ecosystem, such as the Ansible scripts for quick deployment, Syncer for seamless migration from MySQL, Wormhole for migrating heterogeneous data, and TiDB Binlog, a tool to collect binlog files.
In mid-October 2017, after careful design and rigorous testing, we migrated our real-time analysis system to TiDB/TiSpark in production.
The architecture of our new system is as follows:
This platform works in the following way:
The data for transactions in the existing SQL Servers is ingested to TiDB through Flume, Kafka, and Spark Streaming in real time, while the data from MySQL is written to TiDB via binlog and TiDB Syncer, a data synchronization tool which can read data from MySQL in real-time.
Inside the TiDB project, we have several components:
In the architecture, TiDB/TiSpark serves as a hybrid database that provides the following benefits:
A real-time data warehouse. The upstream OLTP data write in real time using TiDB, and the downstream OLAP applications perform real-time analysis using TiDB/TiSpark. This enables our data scientists to make quicker decisions based on more current and transactionally consistent data.
A single data warehouse that accelerates T+1
(this means dumping data per day to the analytical system, where T
refers to the Transaction date, and +1
refers to 1
day after the Transaction date) asynchronous and ETL (Extract, Transform and Load) processing.
T+1
analysis can be extracted directly from TiDB using TiSpark, which is much faster than using Datax and Sqoop to read data from a traditional relational database.The compatibility of TiDB with MySQL has significantly reduced the migration cost. The transactional data from MySQL can flow to TiDB via Syncer (a tool in the TiDB ecosystem). In addition, we can reuse many of the exiting tools from the MySQL community without making too many changes to the existing applications.
The auto-failover and high availability features of TiDB help to guarantee the stability and availability of the entire system.
The design of TiDB enables even and dynamically balanced data distribution. We can use TiDB as a backup database for hot data, or directly migrate hot data to TiDB in the future.
We used TiSpark for complex data analysis on the entire dataset of Singles’ Day, China's largest online shopping day. We saw up to 4x performance improvement using TiSpark compared to SQL Server. For example, we have one complex SQL query that joins 12 tables together with several tables having over 1 million rows. This query took only 8 minutes to run on TiSpark whereas it would have taken more than 30 minutes on SQL Server.
The success of our TiDB/TiSpark based real-time data warehouse platform gives us confidence that more of our services will migrate to TiDB in the future. This unified and real-time hybrid database enables us to harness the power of data and focus on creating values for our users.
Our migration to TiDB/TiSpark was not without challenges. Here are some lessons learned:
It is highly recommended to use TiDB Ansible for deployment, upgrade and operations. In the test environment, we manually deployed TiDB using the binary package. It was easy, but the upgrade became troublesome after TiDB released its 1.0 version in October 2017. After consulting the TiDB support team, we re-deployed the 1.0 version using the TiDB Ansible scripts and the upgrade became much easier. Meanwhile, TiDB's Ansible scripts install Prometheus and Grafana monitoring components by default. TiDB provides a lot of excellent Grafana templates, which made monitoring and configuration much simpler.
To achieve the best performance, using machines with SSDs are highly recommended. Initially, we were using mechanical hard disks in our cluster. Later on, we realized the latency caused by the unstable write workload significantly impacted the performance. We contacted the TiDB support team and confirmed that the TiDB architecture is mainly designed for the storage performance of SSD, so we changed the hard disks to SSD/NVMe, and the write performance significantly improved.
For complex queries with more than millions of rows, the performance of TiSpark is much better than TiDB. In the very beginning, we were merely using TiDB for complex queries, but the performance of some complex scripts did not show any advantage over SQL Server. After tuning some parameters for parallel computing such as tidb_distsql_scan_concurrency
, tidb_index_serial_scan_concurrency
, and tidb_index_join_batch_size
, the analytical tasks that used to take 15 minutes took only half that time. But this still could not fully meet our needs. So we turned to TiSpark which turns out to be a perfect match for our scenario.
The online scheduling of the cluster cannot work together with offline scheduling. Each time the ETL script initiates some Spark parameters, it's quite time-consuming. We plan to use Spark Streaming as a scheduling tool in the future. Spark Streaming records the timestamp after each execution and only needs to monitor the timestamp changes, avoiding the time consumed by multiple initializations. With Spark monitoring, we can also clearly see the delay and some other states of the task.
TiDB and TiSpark together provide a scalable solution for both OLTP and OLAP workloads. TiDB and TiSpark have been adopted by many companies and proved to work well in many application scenarios. Moreover, TiDB and TiSpark reduces operation overhead and maintenance cost for our engineering team. The TiDB project has tremendous potential, and we hope it gains more adoption globally and serve as a critical piece of every company's infrastructure.
This post was originally published on Datanami.