Database scaling
Vertical vs Horizontal
Vertical: increase the capacity of a single server by adding more powerful hardware resources such as CPU, RAM, and storage
- Pros
- Less complexity
- simpler to scale up by adding more resources to a single server
- no additional work to maintain data across machines like horizontal scaling does
- Data Consistency
- all data lives on a single machine, so there is no replication lag or cross-node synchronisation to worry about
- Cons
- Limited Growth
- There are physical limits to how much you can scale a single server. Once those limits are reached, you’ll need to look at horizontal scaling or other options.
- Single point of failure
- In between
- Cost
- Low-cost initially; less cost-effective over time
- High-end hardware can also be expensive
- Downtime
- You might need a backup server to handle operations while scaling
- With some platforms, you can add resources without causing downtime
- Use cases
- If you are operating within a limited budget and require a rapid and cost-effective expansion of your infrastructure, vertical scaling is a good option
- You don’t know how much traffic or how many users you will get
- infrequent upgrades
Horizontal: distribute the database load across multiple servers by adding more instances to the existing pool of servers
- pros
- Reliability
- If one server fails, other servers can take over the load as you are not relying on a single machine
- Flexibility
- you can add or remove servers as demand changes
- Scalability
- Horizontal scaling allows you to handle larger loads by simply adding more servers
- Geographic distribution
- to serve nationwide or global clients
- cons
- More Complexity
- It requires changes to the application architecture such as implementing sharding or replication strategies
- Data partitioning and replication strategies can be complex to implement and maintain. Data integrity and avoiding data duplication can also be challenging
- Data inconsistency
- ensuring data consistency can be challenging due to issues like network partitioning and latency
- requires more complex coordination and consensus protocols
- In between
- Cost
- Higher cost initially; more cost-effective over time
- Downtime
- operations such as adding or removing nodes don’t require downtime
- unlike vertical scaling, there is no need to switch an old machine off
- However, changes such as data sharding, repartitioning, or rebalancing data across nodes can potentially require downtime
- Performance
- Traffic can be spread across multiple machines
- However, more servers mean more network traffic and potential for latency as data may need to be synchronized between servers.
- Use cases
- Serve your clients in different geographical locations at lower latency
- Expect your service will be in high demand
Data consistency
- Strong consistency:
- it requires a synchronisation process between all replica servers before a write operation can be considered complete
- Use cases
- a banking application might require strong consistency to prevent issues like double spending
- Eventual Consistency:
- the write operation might take some time to propagate to all servers
- a read operation might return the old data before the write operation propagates to the replica server
- Use cases
- a social media application might be fine with eventual consistency
Note that MySQL replication (asynchronous by default) only provides eventual consistency, not strong consistency
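A minimal sketch of how eventual consistency can show up as a stale read, using one in-memory dict as the primary and another as an asynchronously updated replica (all names are made up for illustration):

import threading
import time

primary, replica = {}, {}

def replicate_async(key, value, lag=0.5):
    # the replica catches up in the background after some lag
    def copy():
        time.sleep(lag)
        replica[key] = value
    threading.Thread(target=copy).start()

def write(key, value):
    primary[key] = value          # the write is acknowledged immediately
    replicate_async(key, value)   # replication happens later

write("user:1:name", "Alice")
print(replica.get("user:1:name", "<stale>"))  # likely "<stale>" right after the write
time.sleep(1)
print(replica.get("user:1:name", "<stale>"))  # eventually consistent: "Alice"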
Sharding and Partitioning
- sharding
- the data is spread across multiple machines
- partitioning
- break up a large data set into smaller subsets
breaking up large tables into partitions and storing each partition on separate servers is called “sharding”
Common methods of database sharding
- Geo-based sharding
- based on the user’s geographic location
- pros
- data for each region can be served from nearby servers, lowering latency for local users
- cons
- users might not be evenly distributed across locations, so shard sizes can be uneven
- Range-based sharding
- split rows based on a range of values, for example by the first letter of the name (A to I, J to S, T to Z)
- pros
- it is easier to implement
- cons
- can result in the overloading of data on a single physical node
- for example shard A might contain a much larger number of rows of data than shard C
- Hash-based sharding
- splits rows by hashing a shard key for each row and using the hash value to pick a shard (see the sketch after this list)
- for example, 4 rows can be grouped under 2 different hash values
- pros
- tends to distribute rows evenly across shards
- cons
- adding more physical shards changes the hash assignment, so existing rows may need to be re-hashed and moved
- List-based sharding
- each shard corresponds to a particular value in a list
- pros
- easy to understand where data is located
- cons
- may lead to unbalanced shards
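A minimal sketch of how range-based and hash-based sharding pick a shard for a row (the shard counts and boundaries are made up; hashlib keeps the hash stable across runs):

import hashlib

def range_shard(name):
    # range-based: split by the first letter of the name (A-I, J-S, T-Z)
    first = name[0].upper()
    if first <= "I":
        return 0
    if first <= "S":
        return 1
    return 2

def hash_shard(user_id, num_shards=4):
    # hash-based: a stable hash of the shard key, modulo the number of shards
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

print(range_shard("Alice"), range_shard("Tom"))  # 0 2
print(hash_shard(42), hash_shard(43))            # two shard ids in the range 0-3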
The pros and cons of sharding
pros
- scalability
- smaller data on each shard means better performance
- more reliable than a single database
- more affordable
cons
- Not all data can be sharded, e.g. foreign key relationships can only be maintained within a single shard
- with manual sharding, each shard runs on a separate server, and cross-shard queries such as table joins can be difficult to achieve
- once sharding is set up, it can be difficult to undo
How to do partitioning and sharding?
- Partitioning: MySQL supports this natively with syntax such as PARTITION BY RANGE( store_id ) (see the sketch below)
- Sharding: MySQL doesn’t support it, but you can implement it at the application layer
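A minimal sketch of the partitioning side, assuming a hypothetical sales table and the mysql-connector-python driver (credentials are placeholders); the DDL is the part MySQL actually understands:

import mysql.connector

ddl = """
CREATE TABLE sales (
    id INT NOT NULL,
    store_id INT NOT NULL,
    amount DECIMAL(10, 2)
)
PARTITION BY RANGE (store_id) (
    PARTITION p0 VALUES LESS THAN (100),
    PARTITION p1 VALUES LESS THAN (200),
    PARTITION p2 VALUES LESS THAN MAXVALUE
)
"""

conn = mysql.connector.connect(host="localhost", user="app",
                               password="***", database="shop")
conn.cursor().execute(ddl)  # all partitions still live on this one server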
Query a partitioned table in MySQL
- Partition key = column in the ORDER BY
- only read data from the relevant partitions, aka partition pruning (see the EXPLAIN sketch below)
- Partition key != column in the ORDER BY
- MySQL has to sort the results after retrieving data from the relevant partitions
If the query involves joining partitioned tables or combining data from multiple partitions, MySQL might create temporary tables to process the query
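To check whether pruning happens, you can EXPLAIN the query; from MySQL 5.7 the output includes a partitions column listing which partitions are read (same hypothetical sales table and driver as the sketch above):

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="***", database="shop")
cur = conn.cursor(dictionary=True)

# the WHERE clause contains the partition key, so MySQL can prune partitions
cur.execute("EXPLAIN SELECT * FROM sales WHERE store_id = 150 ORDER BY store_id")
for row in cur.fetchall():
    print(row["partitions"])  # expected to show only p1 (100 <= store_id < 200)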
How to read data from a shard server?
Before sending a query, we need to determine the shard key so that we know which shard we want to query
For example, use the hash function for user_id to get the shard key, then query the appropriate shard based on the shard key
code example
import hashlib

def hash_function(key, num_shards=4):
    # stable hash so the same key always maps to the same shard
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % num_shards

# Determine the shard
shard = hash_function(key)
# Find the server for this shard (metadata maps shard id -> server)
server = metadata[shard]
data = connect_and_retrieve(server, key)
How to read data across shard servers?
cross-shard query
- Not all databases support this operation
- databases that do support it expose the shards as a single, unified dataset, even though the actual data is distributed across multiple nodes
- e.g. Google Spanner, CockroachDB
There are a few ways to achieve cross-shard queries if the database doesn’t support them
- Application-Level Joins
- retrieve the data separately from each shard and then combine it (see the sketch after this list)
- Denormalization or Data Duplication
- make the data that needed to answer a particular query be in a single shard
- this might involve duplicating some data across shards
- BUT, it increases the complexity
- Use a Database that Supports Cross-Shard Queries
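A minimal sketch of an application-level join under assumed names: hash_function and the metadata shard map are the ones from the earlier sketch, and connect_and_query is a hypothetical helper that runs one query against one shard server:

def get_user_with_orders(user_id, metadata):
    # users are sharded by user_id, so exactly one shard holds the user row
    user_server = metadata[hash_function(user_id)]
    user = connect_and_query(user_server, "SELECT * FROM users WHERE id = %s", (user_id,))

    # orders are assumed to be sharded by order_id, so the user's orders may
    # live anywhere: scatter the query to every shard and gather the results
    orders = []
    for server in metadata.values():
        orders += connect_and_query(server, "SELECT * FROM orders WHERE user_id = %s", (user_id,))

    # the "join" happens in the application, not inside the database
    return {"user": user, "orders": orders}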
CAP theorem
It’s impossible for a distributed data store to simultaneously provide more than two out of these three guarantees:
- Consistency: All nodes see the same data at the same time. A write operation is only considered successful after it has been replicated across all nodes.
- Availability: Every request receives a (non-error) response, though it may not contain the most up-to-date data.
- Partition Tolerance: The system should continue to function correctly despite network failures among partitions
Eventually consistent: The database prioritises Availability and Partition Tolerance over Consistency
- CA database: MySQL, PostgreSQL, MariaDB, Oracle Database, SQL Server
- CP database: Google’s Spanner, MongoDB, HBase, Cosmos DB
- AP database: Cassandra, CouchDB, DynamoDB, Riak
ACID
A set of properties that guarantee that database transactions are processed reliably
- Atomicity: Guarantees that all operations in a transaction are executed or none are. If a transaction fails, all changes are rolled back.
- Consistency: Ensures any data in a transaction will be valid according to all defined rules including SQL constraints, maintaining data integrity.
- constraints are used to specify rules for the data, e.g. NOT NULL, FOREIGN KEY, DEFAULT
- Isolation: Ensures that concurrent transactions do not interfere with each other; at the strictest level (serializable), they behave as if executed one after another.
- CockroachDB always uses serializable isolation
- the strongest of the four isolation levels defined by the SQL standard
- Durability: Guarantees that once a transaction is committed, its changes are permanent, even in case of system failure.
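A minimal sketch of atomicity and consistency using Python’s built-in sqlite3 (chosen only because it needs no setup): a transfer that violates a CHECK constraint is rolled back as a whole, so neither balance changes:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY,"
             " balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

try:
    # try to move 200 from account 2 (which only holds 50) to account 1
    conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 1")  # succeeds
    conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 2")  # violates CHECK
    conn.commit()    # durability: changes only persist once committed
except sqlite3.IntegrityError:
    conn.rollback()  # atomicity: the successful first UPDATE is undone as well

print(conn.execute("SELECT id, balance FROM accounts").fetchall())
# [(1, 100), (2, 50)] -- both rows are unchanged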
SQL vs NoSQL
SQL
- Pros
- Relational
- Data is structured
- ACID compliant
- Cons
- Structure must be created in advance
- Difficult to scale horizontally
NoSQL
- Pros
- Flexible structure (key-value, document, etc., semi-structured or unstructured data)
- Easy to scale horizontally (suitable for handling large volumes of data and high traffic loads)
- Often faster write and query performance for simple access patterns
- Cons
- Many NoSQL databases offer only eventual consistency, not strong consistency
PostgreSQL vs MySQL (InnoDB)
Similarity
- Both are relational databases
- SQL compliance
- ACID compliance
- support JSON
- support stored procedures
- support both synchronous and asynchronous replication
PostgreSQL
- Open source
- support more index types such as B-tree, GiST (Generalized Search Tree), and GIN (Generalized Inverted Index)
- support multi-version concurrency control (MVCC)
- support more data types, including arrays, key-value, binary JSON
- designed for write-heavy workloads
- designed for more complex queries and transactions
- suitable for vertical scaling on a single server
- built-in security features including row-level security and SSL encryption
MySQL
- open source (Community Edition under the GPL) but owned by Oracle, which also sells commercial editions
- the default index type is B-tree, also supports hash indexes
- designed for read-heavy workloads
- suitable for horizontal scaling and distributed systems
AWS Redshift vs MySQL
- Purpose
- Redshift is a large-scale data warehouse service for analytics purposes
- MySQL is a more general purpose database
- Scaling
- Redshift can handle petabyte-scale data
- MySQL performs efficiently when handling medium-to-large scale data
Redshift features
- Data Warehousing
- Redshift is a data warehouse service and it’s highly suitable for performing ETL (Extract, Transform, Load) operations.
- Data Loading
- Data is often loaded into Redshift from other AWS services like S3, DynamoDB, or data streams like Kinesis.
- Real-Time Analytics
- Redshift can be used in conjunction with other AWS services like Kinesis for real-time data streaming and analytics.
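A minimal sketch of loading data from S3 into Redshift with the COPY command; the table, bucket, and IAM role are made up, and the connection uses psycopg2 since Redshift speaks the PostgreSQL wire protocol:

import psycopg2

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="app", password="***")

copy_sql = """
COPY page_views
FROM 's3://my-bucket/page_views/2024/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift loads the S3 files in parallel across its nodes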