*** DRAFT ***
SQLite B-Tree Module
Table Of Contents
1 Document Overview
1.1 Scope and Purpose
1.2 Document and Requirements Organization
1.3 Glossary
2 Module Requirements
2.1 Functional Requirements
2.1.1 Opening and Closing Connections
2.1.2 New Database Image Configuration
2.1.3 Transaction and Savepoint Functions
2.1.4 Reading From the Database Image
2.1.5 Writing to the Database Image
2.1.6 Page-Cache Configuration Requirements
2.1.7 Multi-User Database Requirements
2.1.8 Backup/Vacuum API Requirements
2.1.9 Integrity Check Requirements
2.2 Other Requirements and Constraints
2.2.1 Caching and Memory Management Requirements
2.2.2 Exception Handling Requirements
2.2.3 Well-Formedness Requirements
3 Module API
3.1 Opening and Closing Connections
3.1.1 sqlite3BtreeOpen
3.1.2 sqlite3BtreeClose
3.2 Database Image Configuration
3.3 Connection Configuration
3.4 Query Interfaces
3.5 Mutex Functions
3.6 Transaction and Savepoint API
3.7 Reading and Traversing the Database Image
3.7.1 sqlite3BtreeMovetoUnpacked
3.7.2 sqlite3BtreeGetMeta
3.8 Modifying the Database Image
3.8.1 sqlite3BtreeCreateTable
3.8.2 sqlite3BtreeDropTable
3.8.3 sqlite3BtreeClearTable
3.8.4 sqlite3BtreeCursorHasMoved
3.8.5 sqlite3BtreePutData
3.8.6 sqlite3BtreeUpdateMeta
3.8.7 sqlite3BtreeDelete
3.8.8 sqlite3BtreeInsert
3.8.9 sqlite3BtreeIncrVacuum
3.9 Advisory B-Tree Locks
3.9.1 sqlite3BtreeLockTable
3.9.2 sqlite3BtreeSchemaLocked
3.10 What do these do?
3.11 APIs not branded sqlite3BtreeXXX()
4 Module Implementation
4.1 Database Image Traversal
4.2 Database Image Manipulation
4.2.1 Creating a B-Tree Structure
4.2.2 Clearing a B-Tree Structure
4.2.3 Deleting a B-Tree Structure
4.2.4 Inserting, Replacing and Deleting B-Tree Entries
4.2.4.1 B-Tree Insert/Replace Entry
4.2.4.2 B-Tree Delete Entry
4.2.5 B-Tree Balancing Algorithm
4.2.5.1 Balance Deeper
4.2.5.2 Balance Shallower
4.2.5.3 Balance Quick
4.2.5.4 Balance Siblings
4.2.6 Page Allocation and Deallocation
4.2.6.1 Moving an overflow-chain to the free-list
4.2.7 Incremental Vacuum Step
4.3 Transactions and Savepoints
5 References

1 Document Overview

1.1 Scope and Purpose

This document provides a description of the functionality of and public interface offered by the SQLite b-tree module. It also, to a certain extent, describes the algorithm and implementation techniques used internally by the module.

It is important to note that, even given the second bullet point above, the interfaces and sub-systems described in this document are not stable. They may be changed in any way with each new SQLite release. Any external software development that uses these interfaces must be prepared to adapt to interface refactoring without notice.

1.2 Document and Requirements Organization

Change the following so that those requirements that describe the API are "low-level" requirements.

Requirement ids Contents
H50**** Requirement statements specifying the functionality required of the B-Tree module.
H51**** Requirement statements specifying the API provided by the B-Tree module.
L****** Requirement statements specifying some details of the internal workings of the B-Tree module.

1.3 Glossary

Balance-Siblings Algorithm The balance-siblings algorithm is one of four algorithms that may be used used to redistribute data within a b-tree structure after an insert or delete operation that causes a b-tree node to become overfull or underfull. See section 4.2.5.4 for details.
B-Tree Cursor Define this.
B-Tree Database Connection A B-Tree database connection is a single client connection to an in-memory page cache, through which a single temporary or persistent database may be accessed. This term is used throughout this document to avoid confusing such connections with SQL level SQLite client connections, which are sometime simply termed "database connections".
Lazy-write cache Define this.
Overflow Cell Define this.
Page cache Define this.
Persistent database Define this.
Read-through cache Define this.
Shared-cache mode Define this.
SQLite Error Code Define this.
Temporary database Define this.

2 Module Requirements

The SQLite B-Tree module, the software module described by this document, is designed to query and modify a database stored using the database image format described in [1]. Database images may exist only in volatile main-memory (in-memory databases), or may be stored persistently within the file-system (also described in [1]). Or, a database image may be stored primarily in main-memory with the file-system used as secondary storage if the database image grows too large. Database images stored only in main-memory, and those stored primarily in main-memory with the file-system used only to provide secondary storage space are known collectively as temporary databases. Database images stored persistently in the file-system are termed persistent databases.

This module implements an in-memory page cache to manage database image content. The size of the pages managed by the cache are the same as the page-size of the database image. When operating on a persistent database, the cache operates as a read-through, lazy-write cache. When committing a database transaction, the user explicitly directs the cache to flush all dirty pages through to persistent storage. A single in-memory page cache used to access the content of a persistent database may support multiple logical client connections. Some brief explanation of what this means. And maybe a pointer to the "Multi-User Database Requirements" section.

When operating on a temporary database, there may only be one client for each page cache. Depending on the SQLite configuration, either the database or journal file, or both, may be omitted from the system.

Figure 1 - Role of Page Cache

Figure 1 depicts...

Roughly what is encapsulated by the module.

2.1 Functional Requirements

This section contains requirements describing the functionality required from the B-Tree module.

Figure out where integrity-check goes.

2.1.1 Opening and Closing Connections

The B-Tree module provides an interface to open new b-tree database connections.

The B-Tree module shall provide an interface to open a connection to either a named persistent database file, or an anonymous temporary database. (C: * H51001)

When opening a persistent database, the B-Tree module shall allow the user to specify that the connection be opened for read-only access.

When opening a persistent database, the B-Tree module shall allow the user to specify that the connection only be opened if the specified file exists.

If SQLite is configured to run in shared-cache mode, and a connection is opened to a persistent database file for which there exists already a page-cache within the current processes address space, then the connection opened shall be a connection to the existing page-cache.

If a new B-Tree database connection is opened and requirement H50040 does not apply, then a new page-cache shall be created within the processes address space. The opened connection shall be a connection to the new page-cache.

The B-Tree module also provides an interface to close existing b-tree database connections.

The B-Tree module shall provide an interface to close a B-Tree database connection.

If a B-Tree database connection is closed and this causes the associated page-cache to have zero connections to it, then the page-cache shall be closed and all associated resources released.

2.1.2 New Database Image Configuration

The following requirements describe database configuration options that are only applicable to new database images. For the purposes of the following requirements, a "new database image" is defined as one that is zero pages in size.

The B-Tree module shall provide an interface to configure the page-size of a new database image.

The B-Tree module shall provide an interface to configure whether or not a new database image is auto-vacuum capable.

2.1.3 Transaction and Savepoint Functions

This needs a lot of work...

All read and write operations performed on a database image via the B-Tree module interfaces occur within the context of a read or write transaction. Something about the ACID nature of transactions and how this applies to read and write transactions)

The B-Tree module shall provide an interface to open (start) a read-only transaction.

The B-Tree module shall provide an interface to close (finish) a read-only transaction.

Read/write:

The B-Tree module shall provide an interface to open a read/write transaction or to upgrade from a read-only transaction to a read/write transaction.

The B-Tree module shall provide an interface to commit a read/write transaction.

The B-Tree module shall provide an interface to rollback a read/write transaction.

Multi-file transaction support.

Transaction state query:

The B-Tree module shall provide an interface to query a B-Tree database connection to determine if there is an open transaction, and if so if the open transaction is read-only or read/write.

Savepoints:

Define "savepoint transactions" and fix the following requirements.

The B-Tree module shall provide an interface to open savepoint transactions.

The B-Tree module shall provide an interface to commit savepoint transactions.

The B-Tree module shall provide an interface to rollback savepoint transactions.

2.1.4 Reading From the Database Image

The B-Tree module allows the user to read a subset of the fields from the database image header. Each such field is stored in the header as a 4-byte unsigned big-endian integer. A complete description of each field and its interpretation may be found in [1].

The B-Tree module shall provide an interface to read the value of any of the 4-byte unsigned big-endian integer fields beginning at byte offset 36 of the database image. (C: H51015 H51016 H51017)

In other words, the database image header fields that may be read via this module are:

With the exception of the database image header fields described above, all data is read from the database image using B-Tree cursors. A B-Tree cursor is a control structure for traversing the contents of a single table or index b-tree structure within a database image. As well as "forward" and "back" operations, a B-Tree cursor supports fast seeking to a table entry identified by key value, or to the first or last entry in the table.

When a B-Tree cursor is created, the specific table or index b-tree that it is used to traverse is identified by the database image page number of its root page. Since the root-page of the schema table is always page 1, and the contents of the schema table includes the root page numbers of all other index and table b-tree structures in the database image, it is possible for the application to determine the set of valid root-page numbers by first traversing the schema table.

The B-Tree module shall provide an interface to open a B-Tree cursor on any table or index b-tree within the database image, given its root page number.

The B-Tree module shall provide an interface to close a B-Tree cursor.

The B-Tree module shall provide an interface to move an open B-Tree cursor to the entry associated with the largest key in the open b-tree structure.

The B-Tree module shall provide an interface to move an open B-Tree cursor to the entry associated with the smallest key in the open b-tree structure.

The B-Tree module shall provide an interface to move an open B-Tree cursor that currently points at a valid b-tree entry to the next entry in the b-tree structure, sorted in order of key value, if any.

The B-Tree module shall provide an interface to move an open B-Tree cursor that currently points at a valid b-tree entry to the previous entry in the b-tree structure, sorted in order of key value, if any.

The B-Tree module shall provide an interface to retrieve the key value associated with the b-tree structure entry that a B-Tree cursor is pointing to, if any.

The B-Tree module shall provide an interface to retrieve the blob of data (the database record) associated with the b-tree structure entry that a B-Tree cursor open on a table b-tree is pointing to, if any.

The B-Tree module shall provide an interface to return the number of entries currently stored in the b-tree structure that a B-Tree cursor is open on.

As well as traversing a b-tree structure using the operations enumerated by the above requirements, it is also possible to use a cursor to search a b-tree structure for a specified key value. If the key value can be found, the cursor is left pointing at the entry with the specified key value. Otherwise, the cursor is left pointing at either the entry with the largest key that is smaller than the specified key, or to the entry with the smallest key that is larger than the specified key. For table b-tree structures, where the key values are 64-bit integers, the definition of smaller, larger and equal to is straightforward. For index b-tree structures, where the key values are database records, the manner in which key values must be compared is more complicated. Refer to [1] for a full explanation.

There is a specific section in [1] devoted to record sort order in index b-tree structures. There needs to be some way to point to it. Or, better, to the requirement or range of requirements.

Maybe a system that automatically links text like H30100 to the corresponding requirement. Within a document if it can find it, or a summary page (hlreq.html for example).

Given a key value, the B-Tree module shall provide an interface to move a B-Tree cursor open on a b-tree structure to the B-Tree entry with the matching key value, if such an entry exists.

If the interface required by H50119 is used to search for a key value that is not present in the b-tree structure and the b-tree is not empty, the cursor shall be moved to an existing entry that would be adjacent to a hypothetical entry with the specified key value.

The interface required by H50119 shall provide an indication to the caller as to whether the cursor is left pointing at an entry with a key value that is smaller, larger or equal to the requested value, or if it is pointing to no entry at all (because the b-tree structure is empty).

Does it depend on the structure of the tree whether the cursor is left pointing to a smaller or larger entry after a failed search? Or is it possible to determine which it will be based only on the set of keys stored in the tree?

As well as the standard search operation described by the above requirements, cursors open on index b-tree structures are required to support several variants, as follows:

Finish the bullet points above and add HLR for each search mode.

More than one cursor can be open on a single b-tree structure at one time. It is also possible for a write-cursor to modify the contents of a b-tree structure while other cursors are open on it. The b-tree module does not include any type of row-locking mechanism. It is possible for a write-cursor to be used to delete an entry from a b-tree structure even if there are one or more other cursors currently pointing to the entry being deleted.

Requirements to do with how the above is handled. Traceability to sqlite3BtreeCursorHasMoved is required.

2.1.5 Writing to the Database Image

The B-Tree module allows the user to write values to a subset of the fields from the database image header. The set of writable fields is the same as the set of fields enumerated in section 2.1.4 that the B-Tree module is required to provide read access to by requirement H50109.

The B-Tree module shall provide an interface to write a value to any of the 4-byte unsigned big-endian integer fields beginning at byte offset 36 of the database image.

The B-Tree module also supports operations to create new b-tree structures within the database image. Existing b-tree structures may be deleted from the database image entirely, or their entire contents may be deleted, leaving an empty b-tree structure.

The B-Tree module shall provide an interface to create a new index or table b-tree structures within the database image. The interface shall automatically assign a root-page to the new b-tree structure.

The B-Tree module shall provide an interface to remove an existing index or table b-tree structure from the database image, given the root page number of the b-tree to remove.

The B-Tree module shall provide an interface to remove all entries from (delete the contents of) an index or table b-tree, given the root page number of the b-tree to empty.

As one would expect, the B-Tree module also provides an interface to insert and delete entries from b-tree structures. These operations are performed using a B-Tree write cursor, a special type of B-Tree cursor (see section 2.1.4).

When opening a B-Tree cursor using the interface required by H50110, it shall be possible to specify that the new cursor be a write cursor, or an ordinary read-only cursor.

The B-Tree module shall provide an interface that allows the user to delete the b-tree entry that a write cursor points to, if any. (C: L50013)

The B-Tree module shall provide an interface to insert new entries into a table or index B-Tree, given a write cursor open on the table or index b-tree the new entry is to be inserted into. (C: L50001 L50002 L50003 L50004 L50012)

Incremental vacuum step.

2.1.6 Page-Cache Configuration Requirements

A page-cache has a number of operational parameters that may be configured at run-time via an open b-tree database connection. Note that even though the interfaces provided by this module allow these parameters to be set via a b-tree database connection, they are properties of the page-cache, not the b-tree database connection. In situations where more than one b-tree database connection is connected to a single page-cache, writes made via one b-tree database connection may overwrite the values set by another. The following table summarizes the available configuration parameters.

Parameter Description Requirements
Locking-mode This! H50138, H50139, H50140
Journal-mode This! H50141, H50142, H50143, H50144, H50145, H50146
Journal-file size limit The journal-file size limit parameter may be set to any integer value within the range of a 64-bit signed integer. Any negative values is interpreted as "no limit". Otherwise, if the journal-file size limit is set to zero or a positive number, it represents an upper limit on the size of the journal file in bytes. If the application executes a database write operation that would normally cause the journal file to grow larger than this configured limit, the operation fails and an error is returned to the user. The default value of this parameter is -1 (no limit). H50147, H50148, H50149
Database-file size limit The database-image size limit parameter may be set to any integer value greater than zero within the range of a 32-bit signed integer. The configured value represents an upper limit on the size of the database image in pages. If the application executes a database write operation that would normally cause the database image to grow larger than this configured limit, the operation fails and an error is returned to the user. H50150, H50151, H50152
Cache size The cache-size parameter may be set to any integer value. How it affects operation depends on the specific P-Cache implementation used by the page-cache. Refer to details for the behaviour of the built-in default P-Cache. H50153
Safety level The safety-level parameter may be set to "none", "normal" or "full". Where will the effect of this defined/required? H50154, H50155

The B-Tree module shall provide an interface allowing the application to set the locking-mode of a page-cache to either "normal" or "exclusive", given an open b-tree database connection to that page-cache.

If the locking-mode of a page-cache is set to "normal" when a read/write or read-only transaction is ended, any locks held on the database file-system representation by the page-cache shall be relinquished.

If the locking-mode of a page-cache is set to "exclusive" when a read/write or read-only transaction is ended, any locks held on the database file-system representation by the page-cache shall be retained.

And if a read/write transaction is downgraded to a read-only transaction? This scenario should also be dealt with in section 2.1.3.

The B-Tree module shall provide an interface allowing the application to set the journal-mode of a page-cache to one of "off", "memory", "delete", "persist", or "truncate", given an open b-tree database connection to that page-cache.

If the journal-mode of a page-cache is set to "off" when a read/write transaction is opened, then the transaction shall use no journal file.

If the journal-mode of a page-cache is set to "memory" when a read/write transaction is opened, then instead of using the journal file located in the file-system, journal-file data shall be stored in main-memory.

If the journal-mode of a page-cache is set to "delete" when a read/write transaction is opened, then any journal file used by the transaction shall be deleted at the conclusion of the transaction.

If the journal-mode of a page-cache is set to "truncate" when a read/write transaction is opened, then any journal file used by the transaction shall be truncated to zero bytes in size at the conclusion of the transaction.

If the journal-mode of a page-cache is set to "persist" when a read/write transaction is opened, then any journal file used by the transaction shall remain in the file-system at the conclusion of the transaction.

The difference in functionality provided by "off", "memory" and the 3 modes that use a real journal file should also feature in 2.1.3.

The B-Tree module shall provide an interface to set the value of the journal-file size limit configuration parameter of a page-cache, given an open b-tree database connection to that page-cache.

The default value assigned to the journal-file size limit configuration of a page-cache shall be -1.

If the journal-file size limit parameter is set to a non-negative value, and the user executes a write operation that would otherwise require the journal file to be extended to a size greater than the configured value in bytes, then the operation shall fail and an error be returned to the user.

The B-Tree module shall provide an interface to set the value of the database-image size limit configuration parameter of a page-cache, given an open b-tree database connection to that page-cache.

The default value assigned to the database-image size limit configuration of a page-cache shall be the value of the compile time symbol SQLITE_MAX_PAGE_COUNT (1073741823 by default).

If the database-image size limit parameter is set to a non-negative value, and the user executes a write operation that would otherwise require the journal file to be extended to a size greater than the configured value in bytes, then the operation shall fail and an error be returned to the user.

The B-Tree module shall provide an interface to set the value of the cache-size configuration parameter of a page-cache, given an open b-tree database connection to that page-cache.

See section 2.2.1 for a description of and requirements specifying how the value of the cache-size parameter affects the operation of a page-cache. Check this reference is relevant after it is written. Refer to a specific requirement if possible too.

The B-Tree module shall provide an interface allowing the application to set the safety-level of a page-cache to one of "off", "normal" or "full", given an open b-tree database connection to that page-cache.

The default value assigned to the safety-level configuration parameter of a page-cache shall be "full".

Description of what the safety-level actually does. Or pointer to where a description and requirements can be found (transactions section?).

Interface to set the codec function (encryption).

The busy-handler. Where exactly does this come in? Transactions and savepoints section?

The six page-cache operational parameters listed above may also be queried. The following requirements specify the required query interfaces.

The B-Tree module shall provide an interface to query the current locking-mode of a page-cache, given an open b-tree database connection to that page-cache.

The B-Tree module shall provide an interface to query the current journal-mode of a page-cache, given an open b-tree database connection to that page-cache.

The B-Tree module shall provide an interface to query the current journal file size-limit of a page-cache, given an open b-tree database connection to that page-cache.

The B-Tree module shall provide an interface to query the current database file size-limit of a page-cache, given an open b-tree database connection to that page-cache.

The B-Tree module shall provide an interface to query the current cache-size of a page-cache, given an open b-tree database connection to that page-cache.

The B-Tree module shall provide an interface to query the current safety-level of a page-cache, given an open b-tree database connection to that page-cache.

It is also possible to interrogate a b-tree database handle to determine if it was opened on a temporary or persistent database. An b-tree database handle opened on a persistent database may be queried for the name of (full-path to) either the database or journal file associated with the open database.

The B-Tree module shall provide an interface to query an open b-tree database handle to determine if the underlying database is a persistent database or a temporary database.

The B-Tree module shall provide an interface allowing the application to query a b-tree database connection open on a persistent database for the name of the underlying database file within the file-system.

The B-Tree module shall provide an interface allowing the application to query a b-tree database connection open on a persistent database for the name of the underlying journal file within the file-system.

2.1.7 Multi-User Database Requirements

The b-tree module shall provide an interface allowing database clients to acquire advisory read (shared) or write (exclusive) locks on a specific b-tree structure within the database.

The b-tree module preventing deadlock (by always grabbing mutexes in order of BtShared pointer) should be required here.

2.1.8 Backup/Vacuum API Requirements

2.1.9 Integrity Check Requirements

2.2 Other Requirements and Constraints

2.2.1 Caching and Memory Management Requirements

2.2.2 Exception Handling Requirements

System failure. Do not corrupt the database image.

Three kinds of exception:

2.2.3 Well-Formedness Requirements

3 Module API

Description of the interface in btree.h. Also other interfaces accessed by external modules. Including release_memory() and those pager interfaces that are accessed directly by other modules. All of these requirements will be descended/derived from requirements in the previous sections. Some of the text could/should be pulled in from btree.h.

The name of sqlite3BtreeBeginStmt() should probably change to sqlite3BtreeOpenSavepoint(). Matches the pager layer and is a more accurate description of the function.

There are only a few places in which the pager object is used directly, always to call some trivial get/set configuration function. These should be replaced somehow with sqlite3BtreeXXX() APIs. Also, the current approach is probably Ok, but worth looking it over for thread-safety issues.

It would be easier to write up if the dependency between the B-Tree layer and the sqlite3 structure did not exist. At present, it is used for:
* The unlock-notify feature (arguments to sqlite3ConnectionBlocked() are database handles),
* Accessing the SQLITE_ReadUncommitted flag,
* Invoking the busy-handler callback,
* During sqlite3BtreeOpen(), to find the VFS to use,
* Accessing the SQLITE_SharedCache flag (for setting it),
* To check the same B-Tree is not attached more than once in shared-cache mode,
* To link the B-Tree into the pointer-order list of shared-cache b-trees used by the same handle (used for mutexes).
* To determine if an in-memory sub-journal should be used.
* To know how many savepoints are open in BtreeBeginTrans().
* Many, many times to assert() that the db mutex is held when the b-tree layer is accessed..

3.1 Opening and Closing Connections

This section describes the API offered by the B-Tree module to other SQLite sub-systems to open and close B-Tree database connections.

typedef struct Btree Btree;

A B-Tree database connection is accessed by other SQLite sub-systems using an opaque handle, modelled in C code using the type "Btree*".

3.1.1 sqlite3BtreeOpen

int sqlite3BtreeOpen(
  const char *zFilename,   /* Name of database file to open */
  sqlite3 *db,             /* Associated database connection */
  Btree **ppBtree,         /* Return open Btree* here */
  int flags,               /* Flags */
  int vfsFlags             /* Flags passed through to VFS open */
);

If successful, a call to the sqlite3BtreeOpen function shall return SQLITE_OK and set the value of *ppBtree to contain a new B-Tree database connection handle. (P: H50010)

If unsuccessful, a call to the sqlite3BtreeOpen function shall return an SQLite error code other than SQLITE_OK indicating the reason for the failure. The value of *ppBtree shall not be modified in this case.

If the zFilename parameter to a call to sqlite3BtreeOpen is NULL or a pointer to a buffer of which the first byte is a nul (0x00), then sqlite3BtreeOpen shall attempt to open a connection to a temporary database.

If the zFilename parameter to a call to sqlite3BtreeOpen is a pointer to a buffer containing a nul-terminated UTF-8 encoded string, sqlite3BtreeOpen shall attempt to open a connection to a persistent database.

The combination of the above two requirements implies that if the zFilename argument passed to sqlite3BtreeOpen is other than a NULL pointer or a pointer to a nul-terminated string, the type of or filename of the database that sqlite3BtreeOpen attempts to open a connection to are undefined.

Valid values for the flags argument to the sqlite3BtreeOpen function consist of the bitwise OR of zero or more of the following symbols.

#define BTREE_OMIT_JOURNAL  1  /* Do not use journal.  No argument */
#define BTREE_NO_READLOCK   2  /* Omit readlocks on readonly files */

If the BTREE_OMIT_JOURNAL bit is set in the flags parameter passed to a successful call to sqlite3BtreeOpen to open a temporary database, then the page-cache created as a result shall not open or use a journal file for any purpose.

When opening a connection to a persistent database, the value of the BTREE_OMIT_JOURNAL bit in the flags parameter is ignored by sqlite3BtreeOpen.

If the BTREE_NO_READLOCK bit is set in the flags parameter passed to a successful call to sqlite3BtreeOpen to open a persistent database and a new page-cache is created as a result of the call, then the new page-cache shall only lock the database file-system representation when writing to it.

When opening a connection to a temporary database, the value of the BTREE_NO_READLOCK bit in the flags parameter is ignored, as temporary databases are never locked for either reading or writing (reference to some requirement for this statement.). Whether or not a new page-cache is created when a connection to a persistent database is opened is governed by requirements H50040 and H50050.

If the sqlite3BtreeOpen function is called to open a connection to a persistent database, and the call causes a new page-cache to be created, when opening the database file using the VFS interface xOpen method the 4th parameter passed to xOpen (flags) shall be a copy of the vfsFlags value passed to sqlite3BtreeOpen.

If the sqlite3BtreeOpen function is called to open a connection to a temporary database, if and when a temporary file is opened to use as secondary storage using the VFS interface xOpen method the 4th parameter passed to xOpen (flags) shall be a copy of the vfsFlags value passed to sqlite3BtreeOpen with the SQLITE_OPEN_READWRITE, SQLITE_OPEN_CREATE, SQLITE_OPEN_EXCLUSIVE and SQLITE_OPEN_DELETEONCLOSE bits also set.

Requirements explaining how the db parameter to sqlite3BtreeOpen is used. Must be there for something.

3.1.2 sqlite3BtreeClose

int sqlite3BtreeClose(Btree*);

A call to the sqlite3BtreeClose function with a valid b-tree database connection handle passed as the only argument shall invalidate the handle, close the b-tree database connection and release all associated resources.

If a call to sqlite3BtreeClose is made with a value that is not a valid b-tree database connection handle passed as the only argument, the results are undefined.

If a call to sqlite3BtreeClose is made to close a b-tree database connection while there exist open B-Tree cursors that were opened using the specified b-tree database connection, they shall be closed automatically from within sqlite3BtreeClose, just as if their handles were passed to sqlite3BtreeCloseCursor.

See also requirement H50070.

3.2 Database Image Configuration

This category doesn't work all that well. These APIs are used for other things too (i.e. switching to incremental-vacuum mode).

#define BTREE_AUTOVACUUM_NONE 0        /* Do not do auto-vacuum */
#define BTREE_AUTOVACUUM_FULL 1        /* Do full auto-vacuum */
#define BTREE_AUTOVACUUM_INCR 2        /* Incremental vacuum */
int sqlite3BtreeSetAutoVacuum(Btree *, int);
int sqlite3BtreeSetPageSize(Btree *p, int nPagesize, int nReserve, int eFix);

Queries:

int sqlite3BtreeGetPageSize(Btree*);
int sqlite3BtreeGetReserve(Btree*);
int sqlite3BtreeGetAutoVacuum(Btree *);

3.3 Connection Configuration

int sqlite3BtreeSetCacheSize(Btree*,int);
int sqlite3BtreeSetSafetyLevel(Btree*,int,int);
int sqlite3BtreeMaxPageCount(Btree*,int);

And functions to query the current configuration:

int sqlite3BtreeSyncDisabled(Btree*);

3.4 Query Interfaces

const char *sqlite3BtreeGetFilename(Btree *);

A call to the sqlite3BtreeGetFilename function with a valid B-Tree database connection handle opened on a persistent database as the first argument shall return a pointer to a buffer containing the full-path of the database file formatted as a nul-terminated, UTF-8 string.

A call to the sqlite3BtreeGetFilename function with a valid B-Tree database connection handle opened on a temporary database as the first argument shall return a pointer to a buffer to a nul-terminated string zero bytes in length (i.e. the first byte of the buffer shall be 0x00).

const char *sqlite3BtreeGetJournalname(Btree *);

A call to the sqlite3BtreeGetJournalname function with a valid B-Tree database connection handle opened on a persistent database as the first argument shall return a pointer to a buffer containing the full-path of the journal file formatted as a nul-terminated, UTF-8 string.

A call to the sqlite3BtreeGetJournalname function with a valid B-Tree database connection handle opened on a temporary database as the first argument shall return a pointer to a buffer to a nul-terminated string zero bytes in length (i.e. the first byte of the buffer shall be 0x00).

Requirement H51013 holds true even if the B-Tree database connection is configured to use an in-memory journal file or no journal file at all (ref requirements). In these cases the buffer returned contains the full-path of the journal file that would be used if the connection were configured to use a journal file.

3.5 Mutex Functions

typedef struct BtreeMutexArray BtreeMutexArray;
struct BtreeMutexArray {
  int nMutex;
  Btree *aBtree}; hd_resolve_one {SQLITE_MAX_ATTACHED+1}; hd_puts {;
};
void sqlite3BtreeEnter(Btree*);
void sqlite3BtreeEnterAll(sqlite3*);
void sqlite3BtreeLeave(Btree*);
void sqlite3BtreeEnterCursor(BtCursor*);
void sqlite3BtreeLeaveCursor(BtCursor*);
void sqlite3BtreeLeaveAll(sqlite3*);
void sqlite3BtreeMutexArrayEnter(BtreeMutexArray*);
void sqlite3BtreeMutexArrayLeave(BtreeMutexArray*);
void sqlite3BtreeMutexArrayInsert(BtreeMutexArray*, Btree*);

3.6 Transaction and Savepoint API

int sqlite3BtreeBeginTrans(Btree*,int);
int sqlite3BtreeCommitPhaseOne(Btree*, const char *zMaster);
int sqlite3BtreeCommitPhaseTwo(Btree*);
int sqlite3BtreeCommit(Btree*);
int sqlite3BtreeRollback(Btree*);
int sqlite3BtreeBeginStmt(Btree*,int);
int sqlite3BtreeSavepoint(Btree *, int, int);
int sqlite3BtreeIsInTrans(Btree*);
int sqlite3BtreeIsInReadTrans(Btree*);
int sqlite3BtreeIsInBackup(Btree*);

3.7 Reading and Traversing the Database Image

sqlite3BtreeMoveto is never called from outside of the b-tree layer. It could/should be removed from the API.

The "bias" argument to sqlite3BtreeMovetoUnpacked is only ever true when it is called from within sqlite3BtreeInsert. This argument could/should also be removed from the API, if only to make it simpler to describe.

typedef struct BtCursor BtCursor;
int sqlite3BtreeCursor(
  Btree*,                              /* BTree containing table to open */
  int iTable,                          /* Index of root page */
  int wrFlag,                          /* 1 for writing.  0 for read-only */
  struct KeyInfo*,                     /* First argument to compare function */
  BtCursor *pCursor                    /* Space to write cursor structure */
);
int sqlite3BtreeCursorSize(void);
int sqlite3BtreeCloseCursor(BtCursor*);
void sqlite3BtreeClearCursor(BtCursor *);
int sqlite3BtreeFirst(BtCursor*, int *pRes);
int sqlite3BtreeLast(BtCursor*, int *pRes);
int sqlite3BtreeNext(BtCursor*, int *pRes);
int sqlite3BtreePrevious(BtCursor*, int *pRes);
int sqlite3BtreeEof(BtCursor*);
int sqlite3BtreeKeySize(BtCursor*, i64 *pSize);
int sqlite3BtreeKey(BtCursor*, u32 offset, u32 amt, void*);
const void *sqlite3BtreeKeyFetch(BtCursor*, int *pAmt);
const void *sqlite3BtreeDataFetch(BtCursor*, int *pAmt);
int sqlite3BtreeDataSize(BtCursor*, u32 *pSize);
int sqlite3BtreeData(BtCursor*, u32 offset, u32 amt, void*);
int sqlite3BtreeCount(BtCursor *, i64 *);

3.7.1 sqlite3BtreeMovetoUnpacked

The sqlite3BtreeMovetoUnpacked function is used to

int sqlite3BtreeMovetoUnpacked(
  BtCursor*,
  UnpackedRecord *pUnKey,
  i64 intKey,
  int bias,
  int *pRes
);

The following requirements specify exactly how a b-tree cursor is to be moved by a successful call to sqlite3BtreeMovetoUnpacked.

If a call is made to sqlite3BtreeMovetoUnpacked specifying a key value for which there exists an entry with a matching key value in the b-tree structure, the b-tree cursor shall be moved to point to this entry. In this case *pRes (the value of the "int" variable pointed to by the pointer passed as the fifth parameter to sqlite3BtreeMovetoUnpacked) shall be set to 0 before returning.

If a call is made to sqlite3BtreeMovetoUnpacked specifying a key value for which there does not exist an entry with a matching key value in the b-tree structure, the b-tree cursor shall be moved to point to an entry located on the leaf page that would contain the requested entry, were it present.

If the condition specified in L50009 is met and the b-tree structure contains one or more entries (is not empty), the b-tree cursor shall be left pointing to an entry that would lie adjacent (immediately before or after in order by key) to the requested entry on the leaf page, were it present.

If the condition specified in L50009 is met and the b-tree cursor is left pointing to an entry with a smaller key than that requested, or the cursor is left pointing a no entry at all because the b-tree structure is completely empty, *pRes (the value of the "int" variable pointed to by the pointer passed as the fifth parameter to sqlite3BtreeMovetoUnpacked) shall be set to -1. Otherwise, if the b-tree cursor is left pointing to an entry with a larger key than that requested, *pRes shall be set to 1.

Not clear how to deal with these. Define an external module to encapsulate these and define sorting order etc.? That's tricky as things are because the UnpackedRecord.flags field defines the "search mode" used by sqlite3BtreeMovetoUnpacked.

typedef struct KeyInfo KeyInfo;
struct KeyInfo {
  sqlite3 *db;        /* The database connection */
  u8 enc;             /* Text encoding - one of the SQLITE_UTF* values */
  u16 nField;         /* Number of entries in aColl}; hd_resolve_one {}; hd_puts { */
  u8 *aSortOrder;     /* Sort order for each column.  May be NULL */
  CollSeq *aColl}; hd_resolve_one {1}; hd_puts {;  /* Collating sequence for each term of the key */
};
typedef struct UnpackedRecord UnpackedRecord;
struct UnpackedRecord {
  KeyInfo *pKeyInfo;  /* Collation and sort-order information */
  u16 nField;         /* Number of entries in apMem}; hd_resolve_one {}; hd_puts { */
  u16 flags;          /* Boolean settings.  UNPACKED_... below */
  i64 rowid;          /* Used by UNPACKED_PREFIX_SEARCH */
  Mem *aMem;          /* Values */
};

3.7.2 sqlite3BtreeGetMeta

The sqlite3BtreeGetMeta interface may be used to retrieve the current value of certain fields from the database image header.

void sqlite3BtreeGetMeta(Btree *pBtree, int idx, u32 *pValue);

If the first parameter is a b-tree database connection handle with an open read-only or read-write transaction, and the second parameter is an integer between 0 and 7 inclusive, and the database image consists of zero pages, a call to the sqlite3BtreeGetMeta function shall set the value of *pValue to zero. (P: H50109)

If the first parameter is a b-tree database connection handle with an open read-only or read-write transaction, and the second parameter is an integer between 0 and 7 inclusive, and the database image consists of one or more pages, a call to the sqlite3BtreeGetMeta function shall set the value of *pValue to the current value of the specified 32-bit unsigned integer in the database image database header. (P: H50109)

The two requirements above imply that if sqlite3BtreeGetMeta is called with anything other than a b-tree database connection handle with an open read-only or read-write transaction as the first argument, or with anything other than an integer between 0 and 7 (inclusive) as the second, the results are undefined.

#define BTREE_FREE_PAGE_COUNT     0
#define BTREE_SCHEMA_VERSION      1
#define BTREE_FILE_FORMAT         2
#define BTREE_DEFAULT_CACHE_SIZE  3
#define BTREE_LARGEST_ROOT_PAGE   4
#define BTREE_TEXT_ENCODING       5
#define BTREE_USER_VERSION        6
#define BTREE_INCR_VACUUM         7

The database header field read from the database image by a call to sqlite3BtreeGetMeta in the situation specified by H51016 shall be the 32-bit unsigned integer header field stored at byte offset (36 + 4 * idx) of the database header, where idx is the value of the second parameter passed to sqlite3BtreeGetMeta. (P: H50109)

3.8 Modifying the Database Image

3.8.1 sqlite3BtreeCreateTable

int sqlite3BtreeCreateTable(Btree*, int*, int flags);
#define BTREE_INTKEY     1    /* Table has only 64-bit signed integer keys */
#define BTREE_ZERODATA   2    /* Table has keys only - no data */
#define BTREE_LEAFDATA   4    /* Data stored in leaves only.  Implies INTKEY */

3.8.2 sqlite3BtreeDropTable

int sqlite3BtreeDropTable(Btree*, int, int*);

3.8.3 sqlite3BtreeClearTable

int sqlite3BtreeClearTable(Btree*, int, int*);

3.8.4 sqlite3BtreeCursorHasMoved

int sqlite3BtreeCursorHasMoved(BtCursor*, int*);

3.8.5 sqlite3BtreePutData

int sqlite3BtreePutData(BtCursor*, u32 offset, u32 amt, void*);

3.8.6 sqlite3BtreeUpdateMeta

int sqlite3BtreeUpdateMeta(Btree*, int idx, u32 value);

3.8.7 sqlite3BtreeDelete

int sqlite3BtreeDelete(BtCursor*);

A successful call to the sqlite3BtreeDelete function made with a read/write b-tree cursor passed as the first argument shall remove the entry pointed to by the b-tree cursor from the b-tree structure. (P: H50127)

Effect of a delete operation on other cursors that are pointing to the deleted b-tree entry.

Malloc and IO error handling. Same as for sqlite3BtreeInsert.

3.8.8 sqlite3BtreeInsert

int sqlite3BtreeInsert(BtCursor*, const void *pKey, i64 nKey,
                                  const void *pData, int nData,
                                  int nZero, int bias, int seekResult);

A successful call to the sqlite3BtreeInsert function made with a read/write b-tree cursor passed as the first argument shall insert a new entry into the b-tree structure the b-tree cursor is open on. (P: H50128)

The requirement above implies that the results of passing anything else as the first argument to sqlite3BtreeInsert, for example a read-only b-tree cursor, are undefined.

If a call to sqlite3BtreeInsert is made to insert an entry specifying a key value for which there already exists a matching key within the b-tree structure, the entry with the matching key shall be removed from the b-tree structure before the new entry is inserted. (P: H50128)

In other words, the sqlite3BtreeInsert API could easily be renamed sqlite3BtreeInsertOrReplace. We will probably need a module requirement for the "replace" operation.

If the b-tree cursor passed to sqlite3BtreeInsert as the first argument is open on a table b-tree, then the values passed as the second parameter (pKey) shall be ignored. The value passed as the third parameter (nKey) shall be used as the integer key for the new entry. (P: H50128)

If the b-tree cursor passed to sqlite3BtreeInsert as the first argument is open on a table b-tree, then the database record associated with the new entry shall consist of a copy of the first nData bytes of the buffer pointed to by pData followed by nZero zero (0x00) bytes, where pData, nData and nZero are the fourth, fifth and sixth parameters passed to sqlite3BtreeInsert, respectively. (P: H50128)

If the b-tree cursor passed to sqlite3BtreeInsert as the first argument is open on an index b-tree, then the values passed as the fourth, fifth and sixth parameters shall be ignored. The key (a database record) used by the new entry shall consist of the first nKey bytes of the buffer pointed to by pKey, where pKey and nKey are the second and third parameters passed to sqlite3BtreeInsert, respectively. (P: H50128)

The following requirements describe the seventh and eighth paramaters passed to the sqlite3BtreeInsert function. Both of these are used to provide extra information used by sqlite3BtreeInsert to optimize the insert operation. They may be safely ignored by alternative b-tree implementations.

There should be some rationalization for these, eventually. Some traceability from somewhere to show how the b-tree module offering these slightly esoteric interfaces is helpful to SQLite overall.

If the value passed as the seventh parameter to a call to sqlite3BtreeInsert is non-zero, sqlite3BtreeInsert shall interpret this to mean that it is likely (but not certain) that the key belonging to the new entry is larger than the largest key currently stored in the b-tree structure, and optimize accordingly.

If the value passed as the eighth parameter to a call to sqlite3BtreeInsert is non-zero, then the B-Tree module shall interpret this to mean that the the b-tree cursor has already been positioned by a successful call to sqlite3BtreeMovetoUnpacked specifying the same key value as is being inserted, and that sqlite3BtreeMovetoUnpacked has set the output value required by L50011 to this value.

If a non-zero value is passed as the eighth parameter to sqlite3BtreeInsert and the b-tree cursor has not been positioned as assumed by L50006, the results are undefined.

Malloc and IO error handling. Maybe these should be grouped together for a whole bunch of APIs. And hook into the above via a definition of "successful call".

3.8.9 sqlite3BtreeIncrVacuum

int sqlite3BtreeIncrVacuum(Btree *);

3.9 Advisory B-Tree Locks

This section describes the b-tree module interfaces used for acquiring and querying the advisory locks that can be placed on database image pages. The locking mechanisms described in this section are only used to arbitrate between multiple clients of the same in-memory page-cache. The locking mechanism used to control access to a file-system representation of the database when multiple in-memory page caches (possibly located in different OS processes) are open on it is described in this.

As well as obtaining advisory locks explicitly using the sqlite3BtreeLockTable API (see below), a read-lock on page 1 of the database image is automatically obtained whenever a b-tree database connection opens a read-only or read-write transaction (see requirement number). Note that this means that a write-lock on page 1 is effectively an exclusive lock on the entire page-cache, as it prevents any other connection from opening a transaction of any kind.

3.9.1 sqlite3BtreeLockTable

int sqlite3BtreeLockTable(Btree *pBtree, int iTab, u8 isWriteLock);

The sqlite3BtreeLockTable API allows database clients to place advisory read or write locks on a specified page of the database image. The specified page need not exist within the database image. By convention, SQLite acquires read and write locks on the root pages of table b-trees only, but this is not required to be enforced by the b-tree module. Locks may only be obtained when a database client has an open transaction. All locks are automatically released when the open transaction is concluded.

A call to sqlite3BtreeLockTable, specifying a b-tree database connection handle with an open read-only or read-write transaction as the first parameter, and zero as the third parameter, shall attempt to obtain a read-lock on the database page specified by the second parameter.

A call to sqlite3BtreeLockTable, specifying a b-tree database connection handle with an open read-write transaction as the first parameter, and a non-zero value as the third parameter, shall attempt to obtain a write-lock on the database page specified by the second parameter.

The two requirements above imply that the results of calling sqlite3BtreeLockTable on a b-tree database connection handle that does not currently have an open transaction, or attempting to obtain a write-lock using a b-tree database connection handle that only has a read-only transaction open are undefined.

If, when attempting to obtain a read-lock as described in L50016, there exists another b-tree database connection connected to the same page-cache that is holding a write-lock on the same database image page, the read-lock shall not be granted and the call to sqlite3BtreeLockTable shall return SQLITE_LOCKED_SHAREDCACHE.

If, when attempting to obtain a write-lock as described in L50017, there exists another b-tree database connection connected to the same page-cache that is holding a read or write-lock on the same database image page, the write-lock shall not be granted and the call to sqlite3BtreeLockTable shall return SQLITE_LOCKED_SHAREDCACHE.

Requirement L50020 is overly conservative. Because a write-lock may only be requested if the b-tree database connection has an open read-write transaction (L50017), and at most a single b-tree database connection may have such an open transaction at one time, it is not possible for a request for a write-lock to fail because another connection is holding a write-lock on the same b-tree database image page. It may, however, fail because another connection is holding a read-lock.

All locks are held until the current transaction is concluded. Or, if a read-write transaction is downgraded to a read-only transaction, all currently held write-locks are changed to read-locks.

When a read-only or read-write transaction is concluded, all advisory b-tree locks held by the b-tree database connection shall be relinquished.

When a read-write transaction is downgraded to a read-only transaction, all advisory b-tree write-locks held by the b-tree database connection shall be changed to read-locks.

The requirement above should refer to some other requirement that defines when a read-write transaction is downgraded to a read-only transaction.

Malloc failure?

Read uncommitted flag. Maybe this should be handled outside of the b-tree module. Is there anything to stop connections with this flag set simply not obtaining read locks? There are assert() statements in the b-tree module that need to take this flag into account, but not actual functionality.

3.9.2 sqlite3BtreeSchemaLocked

int sqlite3BtreeSchemaLocked(Btree *pBtree);

A call to the sqlite3BtreeSchemaLocked function with a valid b-tree database connection as the only argument shall return SQLITE_LOCKED_SHAREDCACHE if there exists another b-tree database connection connected to the same page-cache that currently holds a write-lock on database image page 1.

A call to the sqlite3BtreeSchemaLocked function with a valid b-tree database connection as the only argument shall return SQLITE_OK if H51017 does not apply.

3.10 What do these do?

Where do the following go?

char *sqlite3BtreeIntegrityCheck(Btree*, int *aRoot, int nRoot, int, int*);
struct Pager *sqlite3BtreePager(Btree*);
int sqlite3BtreeCopyFile(Btree *, Btree *);
void *sqlite3BtreeSchema(Btree *, int, void(*)(void *));
int sqlite3BtreeSchemaLocked(Btree *pBtree);
int sqlite3BtreeLockTable(Btree *pBtree, int iTab, u8 isWriteLock);
void sqlite3BtreeTripAllCursors(Btree*, int);

I know what the following do, but is this mechanism ever used? Or has it been superceded by other tricks in OP_NewRowid?

void sqlite3BtreeSetCachedRowid(BtCursor*, sqlite3_int64);
sqlite3_int64 sqlite3BtreeGetCachedRowid(BtCursor*);

Should move to btreeInt.h

typedef struct BtShared BtShared;

3.11 APIs not branded sqlite3BtreeXXX()

4 Module Implementation

4.1 Database Image Traversal

4.2 Database Image Manipulation

This section should describe exactly how bits and bytes are shifted around when the database image is traversed and modified. i.e. how the b-tree balancing works, deleting an internal cell from an index b-tree etc.

4.2.1 Creating a B-Tree Structure

4.2.2 Clearing a B-Tree Structure

4.2.3 Deleting a B-Tree Structure

4.2.4 Inserting, Replacing and Deleting B-Tree Entries

The following two sections describe the way entries are added and removed from B-Tree structures within a database image.

As one might expect, the algorithms described in the following sections involve adding and removing b-tree cells to and from b-tree node pages. The format of b-tree node pages is described in detail in [1]. This document does not describe the exact way in which content is manipulated within a page, as these details are considered not considered high-level enough to be documented outside of the SQLite source code itself. For the purposes of the descriptions in the following sections, a b-tree node page is considered to be a container for an ordered list of b-tree cells. Cells may be inserted into or removed from any position in the ordered list as required.

A b-tree node page has a finite capacity. If one of the algorithms described here is required to insert a cell into a b-tree node page, and there is not enough free space within the page to accommodate the cell, it is still nominally inserted into the requested position within the node, but becomes an overflow cell. Overflow cells never remain so for very long. If an insert, replace or delete entry operation creates one or more overflow cells, the b-tree structure is rearranged so that all cells are stored within the body of a b-tree node page before the operation is considered complete. This process of rearranging the b-tree structure is termed b-tree balancing, and is described in section 4.2.5.

4.2.4.1 B-Tree Insert/Replace Entry

This section describes the way in which new entries may be inserted into a b-tree structure, and how existing entries may be replaced. Both of these operations are accessed using the sqlite3BtreeInsert API.

An insert/replace operation involves the following steps:

  1. Based on the supplied key and value, and the type of b-tree being inserted into, allocate and populate any required overflow pages. Should reference file-format requirements that provide the formula for doing this.
  2. Attempt to move the b-tree write cursor to an entry with a key that matches the new key being inserted. If a matching entry is found, then the operation is a replace. Otherwise, if the key is not found, an insert.
    1. Requirements L50008, L50009, L50010 and L50011 apply to the cursor seek operation here. This ensures that if the search does not find an exact match, the cursor is left pointing to the leaf page that the new entry should be added into.
    2. As specified by L50006, the cursor may already be positioned. In this case the seek operation is not required.
  3. If a matching key was found in the b-tree, then it must be removed and the new entry added in its place.
    1. If there are one or more overflow pages associated with the entry being replaced, they are moved to the free-list.
    2. The cell corresponding to the entry being removed is removed from the b-tree node page.
    3. The new cell is inserted in the position previously occupied by the cell removed in the previous step. If the page is not a leaf page, then the first four-bytes (the child-page pointer) of the old cell are copied to the first four bytes of the new cell. If the new cell is larger than the cell that it replaced, then it may become an overflow cell.
  4. If no matching key was found in the b-tree, then the new cell is inserted into the leaf page that the cursor was left pointing to by step 1. The new cell may become an overflow cell.
  5. If the new cell is now an overflow cell, then the balancing algorithm (see section 4.2.5) is run on the overflowing b-tree node page.

4.2.4.2 B-Tree Delete Entry

This section describes the way in which entries may be removed from a b-tree structure, as required when the sqlite3BtreeDelete (section 3.8.7) API is invoked. Removing an entry from a b-tree table involves the following steps:

  1. All overflow pages in the overflow page chain (if any) associated with the entry must be moved to the database free-list. If the database image is an autovacuum database, the pointer-map entries that correspond to each overflow page in the chain must be updated.
  2. The b-tree cell corresponding to the entry must be removed from the b-tree structure.

Note about the optimization that makes it possible to move overflow pages to the free-list without reading their contents (i.e. without loading them into the cache).

If the b-tree entry being removed is located on a leaf page (as is always the case with table b-tree structures), then deleting an entry from a b-tree is quite simple.

Figure 2 - Delete from an Internal Node

4.2.5 B-Tree Balancing Algorithm

4.2.5.1 Balance Deeper

  1. Allocate a new page (the child-page).
  2. Copy page data from root-page to child-page (including overflow cells).
  3. Fix pointer map entries associated with new child-page content.
  4. Zero the root-page.
  5. Set the right-child pointer of the root-page to point to the new child-page.
  6. Set the pointer map entry for the new child page.
  7. Execute the balance procedure on the new child page.

Figure 3 - Example Balance Deeper Transform

4.2.5.2 Balance Shallower

  1. Copy node data from child-page to root-page.
  2. Fix pointer map entries associated with new root-page content.
  3. Move child-page to database image free-list.

Figure 4 - Example Balance Shallower Transform

4.2.5.3 Balance Quick

  1. Allocate a new page (the new sibling-page).
  2. Populate the new sibling page with the new b-tree entry.
  3. Add a new divider cell to the parent. The divider cell contains a pointer to the page that is currently the right-child of the parent. The key in the new divider cell is a copy of the largest key in the page that is currently the right-child of the parent.
  4. Set the right-child of the parent page to point to the new sibling page.
  5. If the database is an auto-vacuum database, set the pointer map entry associated with the new sibling page. If the cell on the new sibling page contains a pointer to an overflow page, set the pointer map entry associated with the overflow page.
  6. Execute the balance procedure on the parent page.

Figure 5 - Example Balance Quick Transform

4.2.5.4 Balance Siblings

The balance-siblings algorithm shall redistribute the b-tree cells currently stored on a overfull or underfull page and up to two sibling pages, adding or removing siblings as required, such that no sibling page is overfull and the minimum possible number of sibling pages is used to store the redistributed b-tree cells.

The following description describes how balance() is to be implemented. This represents (I think) the lowest level of detail that should be in this document. One skilled in the art could use this description to reimplement SQLite's balance-siblings algorithm. We also need requirements at a higher level of detail in this section. Something to test!

The balance-siblings algorithm, as implemented by SQLite, is described as a series of steps below. there are a few terms used below that need definitions/clarifications.

  1. Determine the set of sibling pages to redistribute the cells of, using the following rules:
    1. If the parent page has three or fewer child pages, then all child pages are deemed to be sibling pages for the purposes of the balance-siblings algorithm.
    2. If the page being balanced is the left-most child of the parent page, then the three left-most child pages are used as the siblings.
    3. If the page being balanced is the right-most child of the parent page, then the three right-most child pages are used as the siblings.
    4. Otherwise, if none of the above three conditions are true, then the sibling pages are page being balanced and the child pages immediately to the left and right of it.
  2. Determine an ordered list of cells to redistribute. There are several variations of this step, depending on the type of page being balanced.
    1. If the page being balanced is a leaf page of a table b-tree, then the list of cells to redistribute is simply the concatenation of the ordered lists of cells stored on each sibling page, in order from left-most sibling to right-most.
    2. If the page being balanced is a leaf page of an index b-tree, then the list of cells to redistribute is comprised of the cells on each of the sibling pages and the divider cells in the parent page that contain the pointers to each sibling page except the right-most. The list is arranged so that it contains:
      • The cells from the left-most sibling page, in order, followed by
      • the divider cell from the parent page that contains the pointer to the left-most sibling (if there is more than one sibling page), followed by
      • the divider cell that contains the pointer to the second left-most sibling and the cells from the remaining sibling page (if there are three sibling pages).
    3. If the page being balanced is an internal b-tree node, then the list of cells to redistribute is determined as described in the previous case. However, when balancing an internal node each cell is associated with the page number of a child page of one of the sibling pages. The page number associated with cells stored on a sibling page is the same as the page number stored as the first four bytes of the cell. The page number associated with a divider cell within the parent page is the page number of the right-child page of the sibling page to which the divider cell contains a pointer.
  3. Determine the new cell distribution, using the following steps:
    1. Assign as may cells as will fit from the start of the ordered list of cells to the left-most sibling page. Then, if any cells remain, assign one to be a divider cell, and as many as will fit to the next sibling page. Repeat until all cells have been assigned a location. no divider cells for table b-tree leaf balance
    2. The previous step generates a distribution that is biased towards the left-hand side. The right-most sibling may even be completely empty (if the last cell in the ordered list was assigned to be a divider cell). To rectify this, cells are moved out of the second right-most sibling page and into the right-most, one at a time, until there is at least one cell in the right-most sibling page and to move another cell would mean that the right-most sibling page is more full than the next to right-most sibling page. This is repeated for the next right-most pair of sibling pages, shifting cells out of the third right-most sibling page and into the second right-most, and so on. note about divider cells
  4. Determine the set of database pages to use as the new sibling pages.
    1. If there were an equal or greater number of siblings identified in step 1 than are required by the distribution calculated in step 3, reuse as many as possible, starting with the left-most. If step 3 calculated a distribution that requires more sibling pages than were identified in step 1, allocate the required extra pages using the Refer to ??? algorithm.
    2. Arrange the new sibling pages from left to right in ascending page number order. The new sibling page with the smallest page number becomes the left-most sibling page, and so forth.
  5. Populate the new sibling pages.
    1. Populate each new sibling page with the required set of cells. If the page being balanced is not a leaf page, then the child-page pointer field of each cell is populated with the page-number associated with the cell as part of step 2 above.
    2. If the page being balanced is not a leaf page, then the right-child pointer stored in the page header of each new sibling page must also be populated. For each new sibling page except the right-most, this field is set to the page number associated with the cell that immediately follows the cells stored on the page (the cell that was assigned to be divider cell in step 3). For the right-most sibling page, the right-child pointer is set to the value that was stored in the right-child pointer of the right-most original sibling page identified in step 1.
  6. Populate the parent page.
    1. If the page being balanced is (was) not a leaf page of a table b-tree, the cells that contained pointers to the old sibling pages are replaced by the cells designated as divider cells as part of step 3. The right-child pointer field of the first divider cell is overwritten with the page number of the first new sibling page, and so on.
    2. If the page being balanced is (was) a leaf page of a table b-tree, the cells that contained pointers to the old sibling pages are replaced by a divider cell associated with all but the right-most sibling page. The child-page number stored in each divider cell is set to the page number of the associated sibling. The ingeger key value stored in each divider cell is a copy of the largest integer key value stored on the associated sibling page.
    3. Before balancing, the parent page contained a pointer to the right-most sibling page, either as part of a cell or as the right-child pointer stored in the page header. Either way, this value must be overwritten with the page number of the new right-most sibling page.
  7. Populate pointer map entries.
    1. For each sibling page that was not also an original sibling page, the associated pointer-map entry must be updated. Similarly, the pointer-map entry associated with each original sibling page that is no longer a sibling page must be updated.
    2. For each cell containing an overflow pointer that has been moved from one page to another, the pointer-map entry associated with the overflow page must be updated.
    3. If the page being balanced is (was) not a leaf, then for each cell that has moved from one page to another the pointer-map entry associated with the cell's child page must be updated.
    4. If the page being balanced is (was) not a leaf, then the pointer-map entry associated with each sibling's right-child page may need to be updated.

4.2.6 Page Allocation and Deallocation

Amongst other things, this section needs to explain our old pals the DontWrite() and DontRollback() optimizations.

4.2.6.1 Moving an overflow-chain to the free-list

Describe how this can sometimes be done without reading the content of overflow pages.

4.2.7 Incremental Vacuum Step

4.3 Transactions and Savepoints

Requirements surrounding how transactions are made atomic and isolated. Also how savepoints are implemented. What happens to active cursors after a rollback or savepoint-rollback.

5 References

[1] SQLite Online Documentation,SQLite Database File Format, http://www.sqlite.org/fileformat.html.
[2] SQLite Online Documentation,Application Defined Page Cache, http://www.sqlite.org/c3ref/pcache_methods.html.
[3] SQLite Online Documentation,OS Interface Object, http://www.sqlite.org/c3ref/vfs.html.

*** DRAFT ***