From Li SHEN: shenli@pingcap.com
My last blog post introduced how TiDB stores data, which also covers the basic concepts of TiKV. In this article, I'll elaborate on how TiDB uses the underlying Key-Value store to hold data, maps the relational model to the Key-Value model, and performs SQL computing.
Let's simplify the Relational Model to just Tables and SQL statements. What we need to figure out is how to store a Table and run SQL statements on top of the Key-Value structure.
Assuming we have a Table as follows:
CREATE TABLE User (
    ID int,
    Name varchar(20),
    Role varchar(20),
    Age int,
    PRIMARY KEY (ID),
    KEY idxAge (Age)
);
Given the huge differences between the SQL and Key-Value structures, how to map SQL onto Key-Value conveniently and efficiently becomes vital. This article will first go through the requirements and characteristics of data computing, which are essential for judging whether a mapping solution is good or not.
A Table contains three parts of data:
- metadata of the table
- rows in the Table
- index data
Note that metadata will not be discussed in this article.
Data can be stored either in rows or in columns; both have their advantages and disadvantages. The primary goal of TiDB is online transactional processing (OLTP), which needs to read, save, update, and delete a single row of data quickly. Therefore, row store seems the better fit.
TiDB supports both the Primary Index and the Secondary Index. An Index serves two purposes: faster queries and Constraints. There are two forms of query:
- Point query, such as select name from user where id=1;, which locates a certain row of data through the index.
- Range query, such as select name from user where age > 30 and age < 35;, which queries the data whose age is between 30 and 35 through idxAge.
There are two types of Indexes: Unique Index and Non-unique Index, both of which are supported by TiDB.
After analyzing the characteristics of the data to be stored, let's move on to what we need to do to manipulate the data, including the Insert/Update/Delete/Select statements.
- Insert statement writes a Row into Key-Value and creates the corresponding index data.
- Update statement updates the Row and the index data if necessary.
- Delete statement deletes both the Row and the index.
- Select statement deals with the most complicated situation: it needs to read a single row quickly as well as scan consecutive rows, such as Select * from user;, which reads all data.
A globally ordered and distributed Key-Value engine satisfies the above needs. The feature of being globally ordered helps us solve quite a few problems. Take the following as two examples:
- To get a row of data quickly, as long as we can construct a certain Key (or several Keys) for that row, we can use the Seek operation provided by TiKV to locate it directly.
- To scan a whole table, if the table can be mapped to a continuous Key Range, we can simply scan from StartKey to EndKey.
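To make this concrete, here is a minimal sketch (in Go, since TiDB itself is written in Go) of the kind of interface such an ordered Key-Value engine offers. The OrderedKV, Seek, and Scan names are illustrative placeholders, not TiKV's actual client API:

package kvsketch

// KVPair is one Key-Value entry in the store.
type KVPair struct {
	Key   []byte
	Value []byte
}

// OrderedKV is an illustrative interface for a globally ordered Key-Value
// engine; the method names are placeholders, not TiKV's real client API.
type OrderedKV interface {
	// Seek returns the first pair whose Key is >= key, if any.
	Seek(key []byte) (KVPair, bool)
	// Scan returns every pair in the half-open Key range [startKey, endKey).
	Scan(startKey, endKey []byte) []KVPair
}

// pointGet locates one row: construct that row's Key and Seek to it.
func pointGet(kv OrderedKV, rowKey []byte) (KVPair, bool) {
	return kv.Seek(rowKey)
}

// fullTableScan reads a whole table: all of its rows share one prefix, so
// they form a single continuous Key range.
func fullTableScan(kv OrderedKV, startKey, endKey []byte) []KVPair {
	return kv.Scan(startKey, endKey)
}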
Now let's see how this works in TiDB.
TiDB allocates a TableID to each table, an IndexID to each index, and a RowID to each row. If the table has an integer Primary Key, its value is used as the RowID. The TableID is unique in the whole cluster, while the IndexID/RowID are unique within the table. All of these IDs are int64.
Each row of data is encoded into a Key-Value pair according to the following rule:
Key: tablePrefix{tableID}_recordPrefixSep{rowID}
Value: [col1, col2, col3, col4]
The tablePrefix and recordPrefixSep of the Key are specific string constants used to differentiate this data from other data in the Key-Value space.
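As a rough illustration (not the real TiDB codec, which encodes the IDs in a memcomparable binary form), the row Key in the human-readable shape used throughout this article could be built like this:

package codecsketch

import "fmt"

// EncodeRowKey builds tablePrefix{tableID}_recordPrefixSep{rowID} in the
// readable form used in this article, e.g. EncodeRowKey(10, 1) == "t10_r1".
// The real TiDB codec package encodes the IDs as memcomparable bytes instead
// of decimal strings.
func EncodeRowKey(tableID, rowID int64) []byte {
	return []byte(fmt.Sprintf("t%d_r%d", tableID, rowID))
}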
Index data is encoded into a Key-Value pair according to the following rule:
Key: tablePrefix{tableID}_indexPrefixSep{indexID}_indexedColumnsValue
Value: rowID
The above encoding rule applies to a Unique Index, but it cannot create a unique Key for a Non-unique Index: the tablePrefix{tableID}_indexPrefixSep{indexID} part is the same for every entry of one index, and the indexedColumnsValue of multiple rows may also be the same. Therefore, we've made some changes to encode the Non-unique Index:
Key: tablePrefix{tableID}_indexPrefixSep{indexID}_indexedColumnsValue_rowID
Value: null
In this way, a unique Key can be created for each row of data.
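In the same simplified, readable form (again a sketch rather than the real binary codec), the two index encodings look like this:

package codecsketch

import "fmt"

// EncodeUniqueIndexKey builds
// tablePrefix{tableID}_indexPrefixSep{indexID}_indexedColumnsValue;
// the Value of this pair stores the rowID.
func EncodeUniqueIndexKey(tableID, indexID int64, indexedValue string) []byte {
	return []byte(fmt.Sprintf("t%d_i%d_%s", tableID, indexID, indexedValue))
}

// EncodeNonUniqueIndexKey appends the rowID as a suffix so that rows sharing
// the same indexed value still get distinct Keys; the Value is null.
func EncodeNonUniqueIndexKey(tableID, indexID int64, indexedValue string, rowID int64) []byte {
	return []byte(fmt.Sprintf("t%d_i%d_%s_%d", tableID, indexID, indexedValue, rowID))
}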
In the above rules, all xxPrefix parts of the Keys are string constants that separate namespaces, so that different types of data do not conflict:
var (
	tablePrefix     = []byte{'t'}
	recordPrefixSep = []byte("_r")
	indexPrefixSep  = []byte("_i")
)
Note that in this Key encoding solution, all Rows in a Table share the same prefix, and so do all entries of one Index. Data with the same prefix is arranged together in the Key space of TiKV. Therefore, we just need to carefully design the encoding of the suffix so that the comparison relation remains unchanged after encoding, and Row or Index data will be stored in TiKV in order. An encoding that keeps the comparison relation unchanged before and after encoding is called Memcomparable: for any type of value, the comparison result of two objects before encoding is consistent with the byte-wise comparison of their encoded forms (note: both Key and Value in TiKV are primitive byte arrays). For more details, please refer to the codec package of TiDB. With this encoding, all Row data of a table is arranged in the Key space of TiKV in RowID order, and the data of a certain Index is arranged in the order of that Index's ColumnValue.
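To give a feel for the Memcomparable idea, here is a minimal sketch for int64 values: flip the sign bit and write the result big-endian, so that byte-wise comparison of the encoded form matches numeric comparison of the original values. This mirrors the idea behind TiDB's codec package, but it is a sketch rather than that package's actual code:

package codecsketch

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// signMask flips the sign bit so that negative numbers sort before
// non-negative ones in unsigned byte order.
const signMask uint64 = 0x8000000000000000

// EncodeIntToCmpBytes encodes v such that for any a < b,
// bytes.Compare(EncodeIntToCmpBytes(a), EncodeIntToCmpBytes(b)) < 0.
func EncodeIntToCmpBytes(v int64) []byte {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], uint64(v)^signMask)
	return buf[:]
}

// ExampleOrder shows that the byte order matches the numeric order.
func ExampleOrder() {
	a, b := EncodeIntToCmpBytes(-5), EncodeIntToCmpBytes(3)
	fmt.Println(bytes.Compare(a, b) < 0) // true: -5 encodes to smaller bytes than 3
}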
Now we take the previous requirements and TiDB's mapping solution into consideration and verify the feasibility of the solution.
Up to now, we've already covered how to map Table onto Key-Value. Here is one more case with the same table structure. Assume that the table has three rows of data:
1, "TiDB", "SQL Layer", 10
2, "TiKV", "KV Engine", 20
3, "PD", "Manager", 30
First, each row of data is mapped to a Key-Value pair. As this table has an Int Primary Key, the value of the Primary Key is used as the RowID. Assume that the TableID of this table is 10; its Row data is:
t10_r1 --> ["TiDB", "SQL Layer", 10]
t10_r2 --> ["TiKV", "KV Engine", 20]
t10_r3 --> ["PD", "Manager", 30]
In addition to the Primary Key, this table also has an Index. Assume that the ID of the Index is 1; its data is:
t10_i1_10_1 --> null
t10_i1_20_2 --> null
t10_i1_30_3 --> null
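The following small, self-contained program reproduces the example above in the readable form (the real codec is binary and memcomparable); the struct and constant names are purely illustrative:

package main

import "fmt"

// user mirrors the example table; the field names are just for illustration.
type user struct {
	ID   int64
	Name string
	Role string
	Age  int64
}

func main() {
	const tableID, idxAgeID = 10, 1
	rows := []user{
		{1, "TiDB", "SQL Layer", 10},
		{2, "TiKV", "KV Engine", 20},
		{3, "PD", "Manager", 30},
	}
	// Row data: tablePrefix{tableID}_recordPrefixSep{rowID} --> column values.
	for _, r := range rows {
		fmt.Printf("t%d_r%d --> [%q, %q, %d]\n", tableID, r.ID, r.Name, r.Role, r.Age)
	}
	// Non-unique index idxAge: ..._indexedColumnsValue_rowID --> null.
	for _, r := range rows {
		fmt.Printf("t%d_i%d_%d_%d --> null\n", tableID, idxAgeID, r.Age, r.ID)
	}
}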
The previous encoding rules should help you make sense of this example. We hope you can see why we chose this mapping solution and what it buys us.
After explaining how the data and indexes of a table are mapped to Key-Value, this section introduces the storage of metadata.
Both Databases and Tables have metadata, which refers to their definitions and various attributes. All of this information needs to be persisted in TiKV. Each Database/Table gets a unique ID, and this ID serves as its unique identifier. When encoding into Key-Value, this ID is encoded into the Key with an m_ prefix. This creates a Key, and the corresponding Value stores the serialized metadata.
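A hedged sketch of that metadata mapping might look like the following; the tableInfo fields and the JSON serialization are illustrative assumptions, not TiDB's actual meta structures:

package metasketch

import (
	"encoding/json"
	"fmt"
)

// tableInfo is a stand-in for a table's definition; the real structures live
// in TiDB's meta/model code and carry far more attributes.
type tableInfo struct {
	ID      int64    `json:"id"`
	Name    string   `json:"name"`
	Columns []string `json:"columns"`
}

// encodeMeta builds the metadata Key (the m_ prefix plus the unique ID) and a
// serialized Value for it.
func encodeMeta(t tableInfo) (key, value []byte, err error) {
	value, err = json.Marshal(t)
	if err != nil {
		return nil, nil, err
	}
	key = []byte(fmt.Sprintf("m_%d", t.ID))
	return key, value, nil
}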
Apart from this, a specialized Key-Value pair stores the version of the current Schema information. TiDB uses the Online Schema Change algorithm of Google F1: a background thread constantly checks whether the Schema version stored in TiKV has changed, and if so, it makes sure to apply the change within a certain period of time. For more details, please refer to The Implementation of Asynchronous Schema Change of TiDB (in Chinese).
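As a rough sketch of that background check (the polling interval and the loadVersion/reload callbacks are placeholders, not TiDB's actual implementation of the F1 algorithm):

package schemasketch

import "time"

// watchSchema polls the Schema version stored in TiKV and reloads the local
// schema whenever the version changes.
func watchSchema(loadVersion func() int64, reload func(version int64)) {
	current := loadVersion()
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		if v := loadVersion(); v != current {
			reload(v) // apply the new schema within the allowed window
			current = v
		}
	}
}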
The following diagram shows the entire architecture of TiDB:
The main function of the TiKV Cluster is to serve as the Key-Value engine that stores data, which was thoroughly introduced in the last article. This article focuses on the SQL layer, i.e. the TiDB Servers. Nodes of this layer are stateless: they store no data and are completely equivalent to one another. The TiDB Server is responsible for handling user requests and executing SQL operation logic.
The mapping solution from SQL to Key-Value shows how to store relational data. Next, we need to understand how to use this data to answer queries, in other words, how a query statement accesses the data stored in the underlying layer.
The easiest way is to map the SQL query to Key-Value query through the mapping solution that I introduced in the last section, and then get the corresponding data through the Key-Value interface before executing any kind of operations.
As for the statement Select count(*) from user where name="TiDB";, we need to read all the data in the table, check whether the Name field is TiDB, and if so, return this row. The operation is translated into the following Key-Value operation process:
- Construct the Key Range: all RowIDs of a table fall in [0, MaxInt64), so we can use 0 and MaxInt64, together with the Row Key encoding rule, to construct a left-closed, right-open [StartKey, EndKey) range.
- Scan the Key Range: read the data in TiKV according to the Key Range constructed above.
- Filter the data: for each row that is read, evaluate the name="TiDB" expression; if it is true, keep this row, otherwise discard it.
- Compute Count(*): count each row that meets the condition toward the result.
See the following diagram for the process:
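A minimal sketch of this naive process, written against a hypothetical Key-Value client that offers only a Scan call (none of these names are TiDB's real executor or client API):

package naivesketch

// KVPair is one Key-Value entry returned by a scan.
type KVPair struct {
	Key   []byte
	Value []byte
}

// kvClient is a hypothetical client interface for the ordered Key-Value store.
type kvClient interface {
	// Scan returns every pair in the half-open Key range [startKey, endKey).
	Scan(startKey, endKey []byte) []KVPair
}

// countRowsMatching scans the whole Key range of a table, filters every row in
// tidb-server, and counts the survivors. Every row costs a read and a network
// transfer even if it doesn't match, which is exactly the drawback discussed
// below.
func countRowsMatching(kv kvClient, startKey, endKey []byte, match func(value []byte) bool) int {
	count := 0
	for _, pair := range kv.Scan(startKey, endKey) {
		if match(pair.Value) {
			count++
		}
	}
	return count
}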
This solution works, but it still has the following drawbacks:
- When scanning the data, every row has to be read from TiKV through at least one Key-Value RPC. If there is a lot of data to scan, this overhead becomes very high.
- Rows that don't meet the condition don't need to be read at all, yet they are still fetched and transferred.
- Only the number of matching rows is needed for this query, not their values.
It is simple to avoid the above drawbacks:
- Move the computing close to the storage nodes to avoid lots of RPC calls.
- Push the Filter down to the storage nodes so that only valid rows are returned and meaningless network transfer is avoided.
- Push the aggregate functions and GroupBy down to the storage nodes for pre-aggregation; each node then only returns a partial Count(*) result, and tidb-server sums them up.
The following sketch shows how data returns layer by layer:
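A hedged sketch of this pushdown idea: each storage node filters and counts within its own Key range and returns only a partial count, which tidb-server then sums. The regionCounter interface is an illustrative assumption, not TiKV's real coprocessor API:

package pushdownsketch

// regionCounter stands for one storage node (or Region) that can evaluate the
// filter and the Count(*) locally.
type regionCounter interface {
	CountWhere(name string) int64
}

// countUsersNamed pushes the filter and the count down to every storage node
// and only sums the partial results, so just a handful of numbers cross the
// network instead of all the rows.
func countUsersNamed(regions []regionCounter, name string) int64 {
	var total int64
	for _, r := range regions {
		total += r.CountWhere(name)
	}
	return total
}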
You can refer to MPP and SMP in TiDB (in Chinese) to know how TiDB makes the SQL statement run faster.
The previous sections introduced some functions of the SQL layer, and I hope you now have a basic idea of how SQL statements are processed. In fact, TiDB's SQL Layer is much more complicated, with many modules and layers. The following diagram lists the important modules and their call relationships:
SQL requests are sent, directly or via a Load Balancer, to tidb-server, which parses the MySQL Protocol Packet to get the request content. It then performs syntax parsing, makes and optimizes the query plan, and executes the plan to access and process the data. As all data is stored in the TiKV cluster, tidb-server needs to interact with tikv-server to access data during this process. Finally, tidb-server returns the query result to the user.
So far, we've seen how data is stored and how it is used for computing from the SQL perspective. More information about the SQL layer, such as how the optimizer works and the details of the distributed execution framework, will be introduced in the future.
In the next article, we will elaborate on PD, especially cluster management and scheduling. This is an interesting part, as you will see things that are invisible when using TiDB but important to the whole cluster.