Database Theory
Types of databases
There are many different types of databases, each designed to handle specific kinds of data and workloads. The categories below are not mutually exclusive; many databases fit into more than one.
Relational databases (RDBMS)
- Structure: Data is stored in tables with rows and columns. Tables can be related to one another.
- Examples: MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server, and Redshift.
NoSQL databases
NoSQL databases were originally intended to address the limitations of relational databases in terms of scalability, flexibility, and performance.
The term NoSQL is somewhat vague and can refer to a wide range of database technologies that are different from traditional relational databases.
NoSQL databases are designed for unstructured or semi-structured data.
Some people use the term schema-less to describe NoSQL databases; however, this is not entirely accurate. There is an implicit schema at read time, because applications will expect data to be in a certain format.
Origin
NoSQL was originally intended just as a Twitter hashtag for advertising a meeting about “open-source, distributed, non-relational databases”. It has since been used to refer to a wide range of database technologies that are different from traditional relational databases. The term has been retrofitted to mean “Not Only SQL”.
Types
A non-exhaustive list of NoSQL database types includes:
- Key–value cache: Apache Ignite, Couchbase, Coherence, eXtreme Scale, Hazelcast, Infinispan, Memcached, Redis, Velocity.
- Key–value store: Azure Cosmos DB, ArangoDB, Amazon DynamoDB, Aerospike, Couchbase, ScyllaDB.
- Key–value store (eventually consistent): Azure Cosmos DB, Oracle NoSQL Database, Riak, Voldemort.
- Key–value store (ordered): FoundationDB, InfinityDB, LMDB, MemcacheDB.
- Tuple store: Apache River, GigaSpaces, Tarantool, TIBCO ActiveSpaces, OpenLink Virtuoso.
- Triplestore: AllegroGraph, MarkLogic, Ontotext-OWLIM, Oracle NoSQL Database, Profium Sense, Virtuoso Universal Server.
- Object database: Objectivity/DB, Perst, ZODB, db4o, GemStone/S, InterSystems Caché, JADE, ObjectDatabase++, ObjectDB, ObjectStore, ODABA, Realm, OpenLink Virtuoso, Versant Object Database.
- Document store: Azure Cosmos DB, ArangoDB, BaseX, Clusterpoint, Couchbase, CouchDB, DocumentDB, eXist-db, Google Cloud Firestore, IBM Domino, MarkLogic, MongoDB, RavenDB, Qizx, RethinkDB, Elasticsearch, OrientDB.
- Wide-column store: Azure Cosmos DB, Amazon DynamoDB, Bigtable, Cassandra, Google Cloud Datastore, HBase, Hypertable, ScyllaDB.
- Native multi-model database: ArangoDB, Azure Cosmos DB, OrientDB, MarkLogic, Apache Ignite, Couchbase, FoundationDB, Oracle Database.
- Graph database: Azure Cosmos DB, AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4J, OrientDB, Virtuoso.
- Multivalue database: D3 Pick database, Extensible Storage Engine (ESE/NT), InfinityDB, InterSystems Caché, jBASE Pick database, mvBase Rocket Software, mvEnterprise Rocket Software, Northgate Information Solutions Reality, OpenQM, Revelation Software’s OpenInsight, UniData Rocket U2, UniVerse Rocket U2.
In-Memory databases
- Structure: Store data in memory for fast access.
- Examples: Redis, Memcached.
Embedded databases
- Structure: Embedded within an application and run within the same process.
- Examples: SQLite, H2, HSQLDB.
Time-series databases
- Structure: Optimized for time-stamped data.
- Examples: InfluxDB, TimescaleDB.
Hierarchical databases
- Structure: Data is organized in a tree-like structure.
- Examples: IBM IMS (Information Management System).
Network databases
- Structure: Use a graph-like structure to represent relationships.
- Examples: Integrated Data Store (IDS).
NewSQL databases
- Structure: Combine features of SQL and NoSQL databases for scalability and consistency.
- Examples: Google Spanner, CockroachDB.
Spatial databases
- Structure: Designed to handle geographic and spatial data.
- Examples: PostGIS, Oracle Spatial.
Blockchain databases
- Structure: Immutable ledger storing data in cryptographic blocks.
- Examples: Ethereum, Hyperledger Fabric.
OLTP vs OLAP databases
Some databases are optimized for Online Transaction Processing (OLTP), while others are optimized for Online Analytical Processing (OLAP). Some databases can handle both types of workloads with varying degrees of efficiency; usually they are optimized for one or the other.
There are databases that are neither OLTP nor OLAP databases, such as document-oriented databases.
The primary purpose of online analytical processing (OLAP) is to analyze aggregated data, while the primary purpose of online transaction processing (OLTP) is to process database transactions.
OLAP use cases include real-time analytics, search engines, time-series data, and analysis of large datasets.
Comparison
Data Formatting
- OLAP: Uses multidimensional data models and stores data in a cube format. Each dimension represents a different attribute, allowing for complex queries from different perspectives.
- OLTP: Uses a unidimensional relational database model, organizing data into tables with rows and columns, focusing on individual transactions.
Data Architecture
- OLAP: Prioritizes read operations over write operations, suitable for complex queries on large data volumes, with low priority on availability.
- OLTP: Optimized for high-frequency, high-volume write operations, ensuring data integrity and high availability through multiple backups.
Performance
- OLAP: Processing times range from minutes to hours, with periodic batch updates.
- OLTP: Processing times are in milliseconds, with real-time updates triggered by users.
Requirements
- OLAP: Centralized data store requiring terabytes to petabytes of storage and high-performing servers for compute-intensive reads.
- OLTP: Storage requirements in gigabytes, with high compute requirements, and data often cleared after being loaded into OLAP systems.
Notes
- OLTP examples: MySQL, MariaDB, Microsoft SQL Server, and PostgreSQL.
- OLAP examples: Apache Doris, Apache Druid, Apache HBase, Apache Pinot, ClickHouse, StarRocks, Dremio, Elasticsearch, Meilisearch, OpenSearch, Quickwit, Typesense, Citus, TiDB, Grafana Mimir, TimescaleDB, Amazon Redshift, Azure Synapse Analytics, Databricks, Firebolt, Google BigQuery, and Snowflake.
- Transactional processing implies allowing clients to make low-latency requests to read and write data; when transactional systems were invented, batch processing was the alternative. OLAP databases came later.
- For OLAP think analytics; for OLTP think transactions.
- Not all OLTP databases are inherently ACID compliant. While many OLTP databases strive to be ACID compliant to ensure data integrity, some may offer configurations or features that relax one or more of the ACID properties to achieve higher performance or scalability. It is essential to check the specific database’s documentation and configuration settings to determine its level of ACID compliance.
Schemas for OLAP databases
- OLAP databases can use a star schema or snowflake schema. These schemas are optimized for OLAP workloads.
- There is a central fact table containing the data to be analyzed.
- A fact table is surrounded by dimension tables that provide context for the data in the fact table.
- The fact table contains foreign keys that reference the dimension tables.
- This is known as a star schema because the fact table is in the centre with dimension tables surrounding it.
- A snowflake schema is similar to a star schema, but the dimension tables can be further normalized into sub-dimension tables. A minimal sketch of a star schema follows this list.
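As a minimal sketch of a star schema in practice, the example below uses Python's standard-library sqlite3 module; all table and column names (fact_sales, dim_product, dim_date) are made up for illustration.

```python
# A star schema: one central fact table, joined out to dimension tables.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE fact_sales  (            -- the central fact table
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'electronics'), (2, 'books');
    INSERT INTO dim_date    VALUES (1, '2024-01'), (2, '2024-02');
    INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 150.0), (2, 1, 40.0);
""")

# A typical analytical query: aggregate facts, grouped by dimension attributes.
for row in db.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    JOIN dim_date    d USING (date_id)
    GROUP BY p.category, d.month
"""):
    print(row)
```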
Data warehouses and data lakes
Both data warehouses and data lakes are meant to support Online Analytical Processing (OLAP).
A data warehouse is a system that stores highly structured information from various sources. Data warehouses typically store current and historical data from one or more systems. The goal of using a data warehouse is to combine disparate data sources in order to analyze the data, look for insights, and create business intelligence (BI) in the form of reports and dashboards.
You might be wondering, “Is a data warehouse a database?” Yes, a data warehouse is a giant database that is optimized for analytics.
Extract, transform, load (ETL) processes move data from its original source to the data warehouse. The ETL processes move data on a regular schedule (for example, hourly or daily), so data in the data warehouse may not reflect the most up-to-date state of the systems.
Data warehouses typically have a pre-defined and fixed relational schema. Therefore, they work well with structured data. Some data warehouses also support semi-structured data.
In practice, a data warehouse is usually also an OLAP database (e.g. BigQuery). However, not all OLAP databases are data warehouses.
Data warehouses are also usually columnar databases but this is not a rule. Row-oriented databases (and other types) can also be used as data warehouses.
- Examples: Amazon Redshift, Google BigQuery, IBM Db2 Warehouse, Microsoft Azure Synapse, Oracle Autonomous Data Warehouse, Snowflake, and Teradata Vantage.
A data lake is a repository of data from disparate sources that is stored in its original, raw format. Like data warehouses, data lakes store large amounts of current and historical data. What sets data lakes apart is their ability to store data in a variety of formats including JSON, BSON, CSV, TSV, Avro, ORC, and Parquet.
Data lakes are a cost-effective way to store huge amounts of data. Use a data lake when you want to gain insights into your current and historical data in its raw form without having to transform and move it. Data lakes also support machine learning and predictive analytics.
- Examples: Amazon S3, Azure Data Lake Storage, and Google Cloud Storage.
- Think raw unstructured data for data lakes and structured data for data warehouses.
Factors to consider when choosing a database
High-level considerations
Consider all the general systems design principles when choosing a database, listed here (/theory/systems_design.html).
Specifically, zooming in on the database, consider the following:
- Data model: Choose a database that fits your data model (e.g. relational, document, key-value, etc).
- Consistency: Decide between strong consistency and eventual consistency. Eventual consistency is a vague term, and you should be more specific about the consistency model you need.
- Storage engine: Consider the storage engine used by the database.
- Replication: Consider how the database handles replication.
- Indexing: Consider the indexing options provided by the database.
Data models
- If you are storing documents (e.g. JSON or XML), a document store might be a good fit.
- If you need to join data from multiple tables, a relational database might be a better choice.
- Picking the wrong data model will lead to excessively complex application code and possibly poor performance.
- If you need flexible schemas, a NoSQL database might be a better choice.
- If you need ACID transactions, a relational database might be a better choice.
- If there is no relationship between records, or relationships are mostly one-to-many, a document store might be a good fit.
- If there are lots of many-to-many relationships, a graph database might be a good fit.
Storage engines
There are two main types of storage engines:
- Log-structured storage engines: Write-ahead logging is used to append data to the log, which is then periodically compacted into a more efficient format. SSTables and LSM-trees are commonly used.
  - Log-structured storage engines are a newer technology; they are designed to be more efficient for write-heavy workloads on SSDs.
- Page-oriented storage engines: Data is stored in fixed-size pages, which are read and written to disk in a random-access manner. B-trees are commonly used.
See the section on database internals for more information. Particularly, see the comparison of LSM-trees and B-trees.
Data orientation (on disk)
Row-oriented
Given a table:
| column 1 | column 2 | column 3 |
|---|---|---|
| item 11 | item 12 | item 13 |
| item 21 | item 22 | item 23 |
The values of each row item are stored together one after the other. Note the values from the first row are followed by the values from the second row.
Structure on disk:
item 11, item 12, item 13, item 21, item 22, item 23
- Used in: Traditional relational databases like MySQL and PostgreSQL.
Column-oriented
Given the same table:
| column 1 | column 2 | column 3 |
|---|---|---|
| item 11 | item 12 | item 13 |
| item 21 | item 22 | item 23 |
The values of each column are stored together. Note the values from the first column are followed by the values from the second column.
Structure on disk:
item 11, item 21, item 12, item 22, item 13, item 23
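A minimal Python sketch makes the two layouts concrete, with plain lists standing in for bytes on disk:

```python
# The same logical table, flattened in row order and in column order.
table = [
    ["item 11", "item 12", "item 13"],
    ["item 21", "item 22", "item 23"],
]

# Row-oriented: all values of a row are stored contiguously.
row_layout = [value for row in table for value in row]
# ['item 11', 'item 12', 'item 13', 'item 21', 'item 22', 'item 23']

# Column-oriented: all values of a column are stored contiguously.
column_layout = [row[i] for i in range(len(table[0])) for row in table]
# ['item 11', 'item 21', 'item 12', 'item 22', 'item 13', 'item 23']
```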
- Used in: Apache Cassandra and Amazon Redshift.
- Write performance is impacted as an individual row needs to be written to multiple columns.
- Additionally, columns should be sorted by the values in a specific column to improve read performance. As columns must be stored in the same order, this can also impact write performance.
- Data can be easily encoded/compressed using bitmap encoding to reduce disk space usage.
  - In addition to reducing disk space, bitmap encoding may also improve performance because a common bottleneck for these types of workloads is between the disk, memory, L1 cache, and CPU.
- Columnar databases may also store the same data in multiple places. Each different set of data is optimized for a different query pattern.
- LSM-trees (see database internals) are often used to store the data on disk. Writes might go to a sorted in-memory store first and then be flushed to disk when the in-memory store is full.
Trade-offs
Random Access
Row-oriented benefits from fast random access of rows. Column-oriented benefits from fast random access of columns. In both cases, this is the result of fewer page or cache misses when accessing the data.
Insert
Row-oriented benefits from fast insertion of a new row. Column-oriented benefits from fast insertion of a new column.
This dimension is an important reason why row-oriented formats are more commonly used in Online Transaction Processing (OLTP), as it results in faster transactions compared to column-oriented formats.
Conditional Access
Row-oriented benefits from fast access under a filter. Column-oriented benefits from fast access under a projection.
Compute Performance
Column-oriented benefits from fast analytics operations. This is the result of being able to leverage SIMD instructions.
Uncompressed Size
Column-oriented benefits from a smaller uncompressed size. This is due to the possibility of using dedicated encodings for certain data types.
For example:
- A table of 128 rows with a Boolean column requires 128 bytes in a row-oriented format (one byte per Boolean) but 128 bits (16 bytes) in a column-oriented format (via a bitmap).
- Another example is the use of run-length encoding to compress a column. Both encodings are sketched after this list.
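Both encodings can be sketched in a few lines of Python; the column contents and helper names here are made up for illustration:

```python
# Bitmap encoding: 128 Booleans pack into 128 bits = 16 bytes.
flags = [True, False] * 64            # a Boolean column with 128 rows
bitmap = 0
for i, flag in enumerate(flags):
    if flag:
        bitmap |= 1 << i
packed = bitmap.to_bytes(16, "little")
assert len(packed) == 16              # vs 128 bytes at one byte per value

# Run-length encoding: store (value, run length) pairs instead of raw values.
def run_length_encode(column):
    runs, prev, count = [], None, 0
    for value in column:
        if value == prev:
            count += 1
        else:
            if prev is not None:
                runs.append((prev, count))
            prev, count = value, 1
    if prev is not None:
        runs.append((prev, count))
    return runs

print(run_length_encode(["US", "US", "US", "UK", "UK", "US"]))
# [('US', 3), ('UK', 2), ('US', 1)]
```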
Compressed Size
Column-oriented benefits from a smaller compressed size. This is due to the higher homogeneity within a column than within multiple rows.
Database concepts
Schema
A database schema is a structured framework that defines how data is organized and stored in a database. It specifies the blueprint or architecture of the database, including:
- Tables: The structure of tables, including their names, columns, and data types.
- Relationships: The connections between tables (e.g., one-to-one, one-to-many, or many-to-many relationships).
- Constraints: Rules that ensure data integrity, such as primary keys, foreign keys, unique constraints, and not-null constraints.
- Indexes: Structures that improve the efficiency of data retrieval.
- Views: Virtual tables that represent the result of a query.
- Stored Procedures and Triggers: Predefined functions and actions that the database performs automatically in response to events.
Relational databases usually have a (formal) schema (as outlined above). NoSQL databases may have a schema, but it is often more flexible and can be schema-less. In practice, most NoSQL databases have an implicit schema at read time.
Types of database schemas
Physical schema
- Defines how data is physically stored in the database.
- Includes file storage, partitions, and access methods.
Logical schema
- Represents the structure of the database at a logical level.
- Includes tables, columns, relationships, and constraints but not physical storage details.
External schema
- Describes how end-users view the data.
- Can include multiple views tailored for different users or applications.
Example of a simple logical schema
```sql
CREATE TABLE Customers (
CustomerID INT PRIMARY KEY,
Name VARCHAR(100) NOT NULL,
Email VARCHAR(100) UNIQUE,
Phone VARCHAR(15)
);
CREATE TABLE Orders (
OrderID INT PRIMARY KEY,
CustomerID INT,
OrderDate DATE,
TotalAmount DECIMAL(10, 2),
FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);
```
Migrations
- Schema migrations are feasible and common.
- Rewriting the entire database into a new schema is feasible but not common, because it is very costly, complex, and risky.
Replication
Replication is the process of keeping multiple copies of the same data on different nodes. Replication provides redundancy and can improve performance.
Replication is generally done for three reasons:
- To keep data geographically close to users.
- To increase availability.
- To increase read throughput.
Each copy of the data is called a replica.
There are several different replication strategies.
Active/passive replication
One of the replicas is designated the leader (also called the master, primary, active, or writer). The other replicas are called followers (also called slaves, standbys, secondaries, or passive replicas).
The leader handles all writes and allows reads. The followers replicate the leader’s writes and can handle reads.
The leader sends data to the followers as part of a replication log (also called a change stream or replication stream).
Also called master/slave replication, primary/replica replication, leader/follower replication, leader-based replication or primary/secondary replication.
Replication can be synchronous or asynchronous.
Synchronous replication guarantees that a follower has received a write before the write is acknowledged to the client. Synchronous replication is not scalable because it requires waiting for the slowest follower to acknowledge the write. In practice, synchronous replication is only used for a small number of replicas.
Asynchronous replication acknowledges the write before the follower has received it. Asynchronous replication is more scalable because it does not require waiting for the slowest follower to acknowledge the write. However, asynchronous replication can impact data durability because writes can be lost if the leader crashes before the followers have received the write.
Semi-synchronous replication is the use of a combination of synchronous and asynchronous replication. Usually a single follower is synchronous and the rest are asynchronous.
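The difference in acknowledgment timing can be sketched with a toy Python model (no networking or failure handling; all class and method names are made up):

```python
# Leader-based replication with one synchronous and one asynchronous follower.
class Follower:
    def __init__(self):
        self.log = []

    def apply(self, record):
        self.log.append(record)

class Leader:
    def __init__(self, sync_followers, async_followers):
        self.sync_followers = sync_followers
        self.async_followers = async_followers
        self.log = []
        self.pending = []              # writes not yet sent to async followers

    def write(self, record):
        self.log.append(record)
        for f in self.sync_followers:  # acknowledged only after these apply it
            f.apply(record)
        self.pending.append(record)    # async followers catch up later
        return "ack"                   # the client sees success at this point

    def flush_async(self):
        # In a real system this runs continuously in the background.
        for record in self.pending:
            for f in self.async_followers:
                f.apply(record)
        self.pending.clear()

sync_f, async_f = Follower(), Follower()
leader = Leader([sync_f], [async_f])   # semi-synchronous: one of each
leader.write("x=1")
assert sync_f.log == ["x=1"] and async_f.log == []   # the durability gap
leader.flush_async()
assert async_f.log == ["x=1"]
```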
Partitioning
Partitioning is the database process where very large tables are divided into multiple smaller parts.
In horizontal partitioning, the data is divided into smaller (but complete) subsets of rows. For example, a table with a billion rows might be divided into 10 tables with 100 million rows each. This is also known as sharding.
In vertical partitioning, tables are split into smaller tables with different columns. This is done to improve performance by avoiding reading columns containing long text or BLOB data. It can also be done to split sensitive data (e.g. passwords) into a separate table.
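As a minimal sketch of horizontal partitioning, the hash-based routing below sends each key to one of a fixed number of shards; the shard count and key format are illustrative only:

```python
# Route each key to a shard by hashing it. A stable hash is used because
# Python's built-in hash() is randomized per process.
import hashlib

NUM_SHARDS = 10

def shard_for(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

print(shard_for("customer:42"))   # the same key always routes to the same shard
```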
Views
A view is a virtual table created by transforming and combining the data across multiple base tables. It does not store any data itself.
Materialized views
A materialized view is a duplicate data table created by combining data from multiple existing tables for faster data retrieval. Unlike a view, a materialized view stores the data physically.
Materialized views are useful for improving query performance by pre-computing and storing the results of complex or common queries.
Materialized views are updated periodically to reflect changes in the base tables. As such, they can be stale if not refreshed frequently and they impact write performance and storage.
Data cubes (OLAP cubes)
An extension of the concept of a materialized view, a data cube is a multi-dimensional representation of data.
You might have a materialized view that stores the total sales for each product category by date. As the table contains totals, it is easy to look up not only the total sales for each category (and the overall total), but also the total sales for each category for each month.
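A minimal Python sketch of the idea, with made-up data: pre-aggregate sales into (category, month) cells, then answer roll-up queries from the cells rather than from the raw rows:

```python
from collections import defaultdict

sales = [
    ("electronics", "2024-01", 100.0),
    ("electronics", "2024-02", 150.0),
    ("books",       "2024-01",  40.0),
]

# Build the cube: one cell per (category, month) combination.
cube = defaultdict(float)
for category, month, amount in sales:
    cube[(category, month)] += amount

# Roll up along one dimension without touching the raw rows again.
total_electronics = sum(v for (cat, _), v in cube.items() if cat == "electronics")
print(total_electronics)   # 250.0
```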
Database internals (logs and indexes)
Many databases internally use a log, which is an append-only sequence of records. Appending to a file is much faster than updating existing data in place, because appends are sequential writes. The term log here is a more general concept than application logging.
Scanning through an entire database to find a record is very inefficient (time complexity O(n)). To avoid this, databases use indexes to quickly locate records.
An index is an additional structure derived from the primary data. Maintaining additional structures slows down writes, so indexes are a trade-off between read and write performance.
Hash indexes
A hash index maps keys to values using a hash function. A common design keeps an in-memory hash map in which every key is mapped to a byte offset in a log-structured file on disk. Bitcask uses a hash index.
Hash maps are inefficient for range queries, as they require scanning the entire map. In addition, the hash table must fit in memory, which can be a limitation.
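A toy Bitcask-style store can be sketched in Python, assuming a made-up one-record-per-line format (so keys and values must not contain commas or newlines):

```python
# An append-only log on disk plus an in-memory hash map of key -> byte offset.
class HashIndexStore:
    def __init__(self, path="data.log"):
        self.path = path
        self.index = {}                      # key -> byte offset in the log
        open(self.path, "ab").close()        # create the log if missing

    def put(self, key: str, value: str):
        with open(self.path, "ab") as f:
            offset = f.tell()                # appends go at the end of the file
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset             # the latest write wins

    def get(self, key: str):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])          # one seek, one read
            _, value = f.readline().decode().rstrip("\n").split(",", 1)
            return value

store = HashIndexStore()
store.put("a", "1")
store.put("a", "2")                          # supersedes the first record
print(store.get("a"))                        # "2"
```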
Compaction
As a log file is append-only, it can grow indefinitely and may eventually fill up the disk. To prevent this, databases use compaction to merge log files together. Compaction can be done in the background, so it doesn’t block writes.
Compaction throws away duplicate keys in the log file, and keeps only the most recent value.
Merging segments
Merging is also usually performed at the same time as compaction. Merging creates new segments that contain only the most recent value for each key. The old segments are then deleted.
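Compaction and merging can be sketched together in a few lines, assuming segments are supplied oldest first:

```python
# Keep only the most recent value for each key across a set of segments.
def compact(segments):
    latest = {}
    for segment in segments:        # oldest first
        for key, value in segment:
            latest[key] = value     # later segments overwrite earlier ones
    return sorted(latest.items())

old_segment = [("a", 1), ("b", 2)]
new_segment = [("a", 3)]
print(compact([old_segment, new_segment]))   # [('a', 3), ('b', 2)]
```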
Problems with naive compaction and merging
So far, the discussion has involved a very simple theoretical model. In reality, there are many complications:
- File format: Binary formats are more efficient than text formats.
- Deletion: Deleting records requires a special record called a tombstone.
- Crash recovery: When a database crashes, the in-memory data structures are lost. Restoring them from disk can be slow, so there needs to be a way to recover more quickly.
- Partial writes: If the database crashes while writing a record, the record may be only partially written.
- Concurrency control: Writes need to be appended to the log in strictly sequential order, which makes concurrent writes difficult. One solution is to have a single writer thread and multiple reader threads.
Sorted String Table (SSTable)
An SSTable (Sorted String Table) is a log file sorted by key. SSTables are considered a file format.
Compared to hash indexes, you no longer need to maintain an index of the entire log file in memory. Instead, you can maintain a sparse (partial) index.
If the index does not contain the key you are looking for, you can use existing keys that are close to it to scan the log file efficiently.
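A Python sketch of a sparse index: index every second key of a sorted segment, then look up a key by scanning forward from the nearest preceding indexed key (bisect is from the standard library; the data is made up):

```python
import bisect

segment = [("apple", 1), ("banana", 2), ("cherry", 3), ("date", 4), ("fig", 5)]
sparse_index = [(segment[i][0], i) for i in range(0, len(segment), 2)]
# [('apple', 0), ('cherry', 2), ('fig', 4)]

def lookup(key):
    keys = [k for k, _ in sparse_index]
    pos = bisect.bisect_right(keys, key) - 1   # nearest indexed key <= key
    if pos < 0:
        return None                            # key sorts before the segment
    start = sparse_index[pos][1]
    for k, v in segment[start:]:               # short scan; keys are sorted
        if k == key:
            return v
        if k > key:
            return None                        # passed it: the key is absent
    return None

print(lookup("date"))   # 4
```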
SSTables have several advantages over a hash index:
- Merging log files is more efficient because the files are sorted.
- Memory usage is reduced because you only need to store a sparse index.
- Compression can occur efficiently at the same time as keys are scanned, since keys have to be scanned anyway.
As SSTable log files are sorted, data cannot simply be appended to the end.
While you could use an algorithm (such as a red-black tree or AVL tree) to maintain a sorted structure on disk, it is easier to maintain it in memory.
An in-memory balanced tree data structure used this way is called a memtable. When a memtable reaches a certain size, it is written to disk as an SSTable. New writes go to a new memtable in the meantime.
Reads are performed by checking the memtable first, then the SSTables.
Merging and compaction are performed in the background as before.
The major problem with this approach is that the memtable is lost if the system crashes. To solve this problem, every write to the memtable is also written to a separate append-only log file that is not sorted. This file is only used to recover the memtable in the event of a crash.
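Putting the pieces together, here is a toy Python sketch of the LSM write and read path (the tiny size limit, and a dict standing in for a balanced tree, are simplifications; the WAL is omitted):

```python
MEMTABLE_LIMIT = 2

memtable = {}        # stands in for an in-memory balanced tree
sstables = []        # immutable sorted segments on disk, newest last

def put(key, value):
    global memtable
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        sstables.append(sorted(memtable.items()))   # flush as a sorted run
        memtable = {}

def get(key):
    if key in memtable:                  # check the memtable first
        return memtable[key]
    for sstable in reversed(sstables):   # then the SSTables, newest first
        for k, v in sstable:
            if k == key:
                return v
    return None

put("b", 1)
put("a", 2)          # triggers a flush: [('a', 2), ('b', 1)]
put("b", 3)          # the newer value lives in the memtable
print(get("b"))      # 3
```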
The terms SSTable and memtable came out of Google’s Bigtable paper. Cassandra and HBase are examples of databases that use SSTables. Lucene, an indexing engine used by Elasticsearch and Solr, also uses SSTables.
The approach described above is called Log-Structured Merge-Tree (LSM-Tree). Storage engines that use this approach are called LSM storage engines.
Performance considerations
The LSM-tree approach is not very efficient when a key does not exist in the database. The database must check the memtable and all SSTables to determine that the key does not exist. Bloom filters can be used to reduce the number of SSTables that need to be checked.
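A minimal Bloom filter sketch (the bit-array size and hashing scheme are arbitrary choices for illustration): a negative answer is definite, so the SSTable can be skipped; a positive answer may be a false positive:

```python
import hashlib

SIZE, NUM_HASHES = 1024, 3
bits = bytearray(SIZE)

def _positions(key: str):
    # Derive NUM_HASHES positions from the key by salting a hash function.
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % SIZE

def add(key):
    for pos in _positions(key):
        bits[pos] = 1

def might_contain(key):
    return all(bits[pos] for pos in _positions(key))

add("apple")
print(might_contain("apple"))    # True
print(might_contain("durian"))   # almost certainly False: skip this SSTable
```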
There are also different types of compaction.
In size-tiered compaction, newer and smaller SSTables are merged into larger and older SSTables.
In leveled compaction, the key range is split up into smaller SSTables and older data is moved to separate “levels”, which allows the compaction to proceed more incrementally and use less disk space.
B-trees
B-trees break the database down into fixed-size blocks called pages, traditionally 4KB in size. Only one page is read or written at a time.
Like SSTables, B-trees are sorted by key.
Each page has its own address. There are references within pages to other pages, which can be used to construct a tree of pages.
Pages contain references to other pages or values.
One page is designated as the root page.
When a value needs to be inserted, the database traverses the tree from the root page to the leaf page, where the value is inserted.
The number of references to child pages in one page is called the branching factor. Typically, the branching factor is several hundred.
To change a value in a B-tree, the database must read the page into memory, change the value, and write the page back to disk.
If the page is full, the database must split the page into two pages and update the parent page to point to the new pages.
This process is called page splitting.
The algorithm ensures the tree remains balanced. A B-tree with n keys has a depth of O(log n). Most databases can fit into a B-tree that is three or four levels deep.
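A Python sketch of the lookup path only (no insertion or page splitting; the node layout is simplified for illustration):

```python
# Each Node stands in for one page; `children` is set on internal pages
# and `values` on leaf pages.
class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted keys within the page
        self.children = children
        self.values = values

def search(node, key):
    if node.children is None:       # leaf page: look for the key here
        if key in node.keys:
            return node.values[node.keys.index(key)]
        return None
    for i, k in enumerate(node.keys):
        if key < k:                 # follow the child covering this key range
            return search(node.children[i], key)
    return search(node.children[-1], key)

leaf1 = Node(keys=[10, 20], values=["a", "b"])
leaf2 = Node(keys=[30, 40], values=["c", "d"])
root = Node(keys=[30], children=[leaf1, leaf2])   # branching factor of 2 here
print(search(root, 40))                           # "d"
```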
Problems
Overwriting pages in a B-tree can be slow because the database must read the page into memory, change the value, and write the page back to disk. More importantly, some operations require several pages to be overwritten (for example, a page split); if the database crashes partway through, there is some risk of data/index loss or corruption.
A write-ahead log (WAL) can be used to ensure that the database can recover from a crash. This is an append-only log that records all changes to the B-tree before they are applied. A WAL is used to replay changes to the B-tree in the event of a crash.
Another problem is concurrency control. If multiple threads are at play, they may see the tree in different states. Latches (lightweight locks) can be used to prevent this.
Optimizations
- Instead of using a WAL, some databases use copy-on-write.
- Keys can be abbreviated to save space.
- Many B-tree implementations try to ensure leaf pages are laid out in a contiguous block on disk, making scans faster.
- Additional pointers can be added to the tree to make searches faster.
- Fractal trees can be used to reduce disk seeks.
Comparison of LSM-trees and B-trees
For both types of engine, writing a single record will result in multiple writes to disk, due to the inherent way these algorithms work (see above). These multiple writes are known as write amplification.
- LSM-trees:
  - Are typically faster for writes because they only need to append to the log file.
  - Reads can be slower because they need to check several data structures.
  - Can be compressed more efficiently to save disk space.
  - Use disk space more efficiently because SSTable rewrites remove fragmentation.
  - Background compaction can sometimes interfere with foreground reads and writes.
  - Require more disk bandwidth for compaction as the database grows.
  - When not configured correctly, compaction may not be able to keep up with the write load. This results in slower reads and may result in a full disk.
- B-trees:
  - Are typically faster for reads.
  - Leave some disk space unused due to fragmentation.
  - Keep each key in exactly one place in the tree, which is advantageous for implementing transactions.
- Ultimately, these are generalizations; performance should be tested on a case-by-case basis.
Indexing continued
- Indexes can store the entire record or just a reference to the record.
- If indexes store a reference to the record, the place where the rows are stored is called the heap file.
- Using heap files can be efficient if the record can be overwritten in place. However, it is not so efficient if the record cannot be overwritten.
- The extra hop to the heap file can be too much of a performance hit for some read operations.
- Storing the indexed row in the index itself is called a clustered index.
- A covering index (or index with included columns) is an index that stores some of the columns of the table in the index itself.
- Clustered indexes and covering indexes make it more difficult to implement transactions.
- A concatenated index is an index made up of multiple columns.
- LSM-trees and B-trees cannot answer two-dimensional queries efficiently. R-trees can be used for this purpose. Alternatively, two-dimensional data can be converted to one-dimensional data using a space-filling curve (see the sketch after this list).
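As a sketch of one such space-filling curve, the Z-order (Morton) curve interleaves the bits of x and y, so that nearby 2-D points tend to receive nearby 1-D keys that an ordinary ordered index can store:

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    # Interleave the bits of x and y into a single sortable integer key.
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits at odd positions
    return key

print(z_order(3, 5))   # 39: one one-dimensional key for the point (3, 5)
```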
CAP theorem
Definition
In a distributed system, it is impossible to simultaneously achieve all three of the following guarantees:
- Consistency
- Availability
- Partition tolerance
One of these guarantees must be sacrificed in order to achieve the other two.
Consistency
All nodes in the system have the same data at the same time. Any data written to the system is immediately visible to all nodes.
Consistency is also known as linearizability, which is informally defined as:
If operation B started after operation A successfully completed, then operation B must see the system in the same state as it was on completion of operation A, or a newer state.
Consistency in CAP Theorem is different from the consistency in ACID transactions.
Availability
Every request received by a non-failing [database] node in the system must result in a [non-error] response.
Availability in CAP Theorem is different from High Availability in the context of system reliability.
Partition tolerance
The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.
In this context, a partition refers to a network failure that prevents some nodes from communicating with each other.
When a partition occurs, the system must choose between consistency and availability.
Relevance
The CAP theorem has a very specific system model, which makes it just one of many possible trade-offs in distributed systems; specifically:
- The CAP system model is a single, read-write register – that’s all. For example, the CAP theorem says nothing about transactions that touch multiple objects: they are simply out of scope of the theorem, unless you can somehow reduce them down to a single register.
- The only fault considered by the CAP theorem is a network partition (i.e. nodes remain up, but the network between some of them is not working). That kind of fault absolutely does happen, but it’s not the only kind of thing that can go wrong: nodes can crash or be rebooted, you can run out of disk space, you can hit a bug in the software, etc. In building distributed systems, you need to consider a much wider range of trade-offs, and focussing too much on the CAP theorem leads to ignoring other important issues.
- Also, the CAP theorem says nothing about latency, which people tend to care about more than availability. In fact, CAP-available systems are allowed to be arbitrarily slow to respond, and can still be called “available”. Going out on a limb, I’d guess that your users wouldn’t call your system “available” if it takes 2 minutes to load a page.
Types of systems
Attempts have been made to classify distributed systems based on the CAP theorem into three categories:
- CP systems: Consistency and partition tolerance.
- AP systems: Availability and partition tolerance.
- CA systems: Consistency and availability.
CA systems do not really exist in practice; they are theoretical systems that sacrifice partition tolerance, which real networks do not allow.
In reality, most systems today cannot easily be categorized as CP, AP, or CA; they may not meet the strict definitions of these categories, or they may exhibit a mixture of characteristics.
Eventual vs strong consistency
- Eventual Consistency: Under this model, updates made to the system will propagate and reach all nodes eventually. While this means that the system may not immediately reflect the latest changes, it is still considered “consistent.”
- Strong Consistency: Unlike eventual consistency, systems following strong consistency guarantee that all nodes will have the most recent version of data at all times. This approach prioritizes consistency over availability in the face of network partitions.
PACELC theorem
The PACELC theorem is an extension to the CAP theorem. It states that in case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C) (as per the CAP theorem), but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and loss of consistency (C).
The PACELC theorem is not very useful in practice because real-world systems always have to deal with network partitions.
Criticism
The CAP theorem has been criticized for being too simplistic and for not providing a complete picture of the trade-offs involved in distributed systems design. It may have outlived its usefulness as a guiding principle for architecting modern distributed systems.
Useful resources
- https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
- https://codahale.com/you-cant-sacrifice-partition-tolerance/