Database scaling
Vertical vs Horizontal
Vertical: increase the capacity of a single server by adding more powerful hardware resources such as CPU, RAM, and storage
- Pros
- Less complexity
- simpler to scale up by adding more resources to a single server
- no additional work to maintain data across machines like horizontal scaling does
- Data Consistency
- all data lives on a single machine, so there is no replication lag or cross-node synchronisation to worry about
- Cons
- Limited Growth
- There are physical limits to how much you can scale a single server. Once those limits are reached, you’ll need to look at horizontal scaling or other options.
- Single point of failure
- In between
- Cost
- Low-cost initially; less cost-effective over time
- High-end hardware can also be expensive
- Downtime
- You might need a backup server to handle operations while scaling
- With some platforms, you can add resources without causing downtime
- Use cases
- If you are operating within a limited budget and require a rapid and cost-effective expansion of your infrastructure, vertical scaling is a good option
- You don’t know how much traffic or how many users you will get
- infrequent upgrades
Horizontal: distribute the database load across multiple servers by adding more instances to the existing pool of servers
- pros
- Reliability
- If one server fails, other servers can take over the load as you are not relying on a single machine
- Flexibility
- you can add or remove servers as demand changes
- Scalability
- Horizontal scaling allows you to handle larger loads by simply adding more servers
- Geographic distribution
- to serve nationwide or global clients
- cons
- More Complexity
- It requires changes to the application architecture such as implementing sharding or replication strategies
- Data partitioning and replication strategies can be complex to implement and maintain. Data integrity and avoiding data duplication can also be challenging
- Data inconsistency
- ensuring data consistency can be challenging due to issues like network partitioning and latency
- requires more complex coordination and consensus protocols
- In between
- Cost
- Higher cost initially; more cost-effective over time
- Downtime
- operations such as adding or removing nodes don’t require downtime
- unlike vertical scaling, there is no need to switch an old machine off
- However, changes such as data sharding, repartitioning, or rebalancing data across nodes can potentially require downtime
- Performance
- Traffic can be spread across multiple machines
- However, more servers mean more network traffic and potential for latency as data may need to be synchronized between servers.
- Use cases
- Serve your clients in different geographical locations at lower latency
- Expect your service will be in high demand
Data consistency
- Strong consistency:
- it requires a synchronisation process between all replica servers before a write operation can be considered complete
- Use cases
- a banking application might require strong consistency to prevent issues like double spending
- Eventual Consistency:
- the write operation might take some time to propagate to all servers
- a read operation might return the old data before the write operation propagates to the replica server
- Use cases
- a social media application might be fine with eventual consistency
Note that MySQL replication (asynchronous by default) only provides eventual consistency, not strong consistency
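A minimal sketch of how eventual consistency can show up as a stale read, using one in-memory dict as the primary and another as an asynchronously updated replica (all names are made up for illustration):

import threading
import time

primary, replica = {}, {}

def replicate_async(key, value, lag=0.5):
    # the replica catches up in the background after some lag
    def copy():
        time.sleep(lag)
        replica[key] = value
    threading.Thread(target=copy).start()

def write(key, value):
    primary[key] = value          # the write is acknowledged immediately
    replicate_async(key, value)   # replication happens later

write("user:1:name", "Alice")
print(replica.get("user:1:name", "<stale>"))  # likely "<stale>" right after the write
time.sleep(1)
print(replica.get("user:1:name", "<stale>"))  # eventually consistent: "Alice"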
Sharding and Partitioning
- sharding
- the data is spread across multiple machines
- partitioning
- break up a large data set into smaller subsets
breaking up large tables into partitions and storing each partition on separate servers is called “sharding”
Common methods of database sharding
- Geo-based sharding
- based on the user’s geographic location
- pros
- data for each region can be served from nearby servers, lowering latency for local users
- cons
- users might not be evenly distributed across locations, so shard sizes can be uneven
- Range-based sharding
- split rows based on a range of values, for example by the first letter of the name (A to I, J to S, T to Z)
- pros
- it is easier to implement
- cons
- can result in the overloading of data on a single physical node
- for example shard A might contain a much larger number of rows of data than shard C
- Hash-based sharding
- splits rows by hashing a shard key for each row and using the hash value to pick a shard (see the sketch after this list)
- for example, 4 rows can be grouped under 2 different hash values
- pros
- tends to distribute rows evenly across shards
- cons
- adding more physical shards changes the hash assignment, so existing rows may need to be re-hashed and moved
- List-based sharding
- each shard corresponds to a particular value in a list
- pros
- easy to understand where data is located
- cons
- may lead to unbalanced shards
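A minimal sketch of how range-based and hash-based sharding pick a shard for a row (the shard counts and boundaries are made up; hashlib keeps the hash stable across runs):

import hashlib

def range_shard(name):
    # range-based: split by the first letter of the name (A-I, J-S, T-Z)
    first = name[0].upper()
    if first <= "I":
        return 0
    if first <= "S":
        return 1
    return 2

def hash_shard(user_id, num_shards=4):
    # hash-based: a stable hash of the shard key, modulo the number of shards
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

print(range_shard("Alice"), range_shard("Tom"))  # 0 2
print(hash_shard(42), hash_shard(43))            # two shard ids in the range 0-3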
The pros and cons of sharding
pros
- scalability
- smaller data on each shard means better performance
- more reliable than a single database
- more affordable
cons
- Not all data can be sharded, e.g. foreign key relationships can only be maintained within a single shard
- with manual sharding, each shard runs on a separate server, and cross-shard queries such as table joins can be difficult to achieve
- once sharding is set up, it can be difficult to undo
How to do partitioning and sharding?
- Partitioning: MySQL supports this natively with syntax such as PARTITION BY RANGE( store_id ) (see the sketch below)
- Sharding: MySQL doesn’t support it, but you can implement it at the application layer
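A minimal sketch of the partitioning side, assuming a hypothetical sales table and the mysql-connector-python driver (credentials are placeholders); the DDL is the part MySQL actually understands:

import mysql.connector

ddl = """
CREATE TABLE sales (
    id INT NOT NULL,
    store_id INT NOT NULL,
    amount DECIMAL(10, 2)
)
PARTITION BY RANGE (store_id) (
    PARTITION p0 VALUES LESS THAN (100),
    PARTITION p1 VALUES LESS THAN (200),
    PARTITION p2 VALUES LESS THAN MAXVALUE
)
"""

conn = mysql.connector.connect(host="localhost", user="app",
                               password="***", database="shop")
conn.cursor().execute(ddl)  # all partitions still live on this one server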
Query a partitioned table in MySQL
- Partition key = column in the ORDER BY
- only read data from the relevant partitions, aka partition pruning (see the EXPLAIN sketch below)
- Partition key != column in the ORDER BY
- MySQL has to sort the results after retrieving data from the relevant partitions
If the query involves joining partitioned tables or combining data from multiple partitions, MySQL might create temporary tables to process the query
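To check whether pruning happens, you can EXPLAIN the query; from MySQL 5.7 the output includes a partitions column listing which partitions are read (same hypothetical sales table and driver as the sketch above):

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="***", database="shop")
cur = conn.cursor(dictionary=True)

# the WHERE clause contains the partition key, so MySQL can prune partitions
cur.execute("EXPLAIN SELECT * FROM sales WHERE store_id = 150 ORDER BY store_id")
for row in cur.fetchall():
    print(row["partitions"])  # expected to show only p1 (100 <= store_id < 200)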
How to read data from a shard server?
Before sending a query, we need to determine the shard key so that we know which shard we want to query
For example, use the hash function for user_id to get the shard key, then query the appropriate shard based on the shard key
code example
import hashlib

def hash_function(key, num_shards=4):
    # stable hash so the same key always maps to the same shard
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % num_shards

# Determine the shard
shard = hash_function(key)
# Find the server for this shard (metadata maps shard id -> server)
server = metadata[shard]
data = connect_and_retrieve(server, key)
How to read data across shard servers?
cross-shard query
- Not all databases support this operation
- databases that do support it expose the shards as a single, unified dataset, even though the actual data is distributed across multiple nodes
- e.g. Google Spanner, CockroachDB
There are a few ways to achieve cross-shard queries if the database doesn’t support them
- Application-Level Joins
- retrieve the data separately from each shard and then combine it (see the sketch after this list)
- Denormalization or Data Duplication
- make the data that needed to answer a particular query be in a single shard
- this might involve duplicating some data across shards
- BUT, it increases the complexity
- Use a Database that Supports Cross-Shard Queries
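A minimal sketch of an application-level join under assumed names: hash_function and the metadata shard map are the ones from the earlier sketch, and connect_and_query is a hypothetical helper that runs one query against one shard server:

def get_user_with_orders(user_id, metadata):
    # users are sharded by user_id, so exactly one shard holds the user row
    user_server = metadata[hash_function(user_id)]
    user = connect_and_query(user_server, "SELECT * FROM users WHERE id = %s", (user_id,))

    # orders are assumed to be sharded by order_id, so the user's orders may
    # live anywhere: scatter the query to every shard and gather the results
    orders = []
    for server in metadata.values():
        orders += connect_and_query(server, "SELECT * FROM orders WHERE user_id = %s", (user_id,))

    # the "join" happens in the application, not inside the database
    return {"user": user, "orders": orders}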
CAP theorem
It’s impossible for a distributed data store to simultaneously provide more than two out of these three guarantees:
- Consistency: All nodes see the same data at the same time. A write operation is only considered successful after it has been replicated across all nodes.
- Availability: Every request receives a (non-error) response, though it may not contain the most up-to-date data.
- Partition Tolerance: The system should continue to function correctly despite network failures among partitions
Eventually consistent: The database prioritises Availability and Partition Tolerance over Consistency
- CA database: MySQL, PostgreSQL, MariaDB, Oracle Database, SQL Server
- CP database: Google’s Spanner, MongoDB, HBase, Cosmos DB
- AP database: Cassandra, CouchDB, DynamoDB, Riak
ACID
A set of properties that guarantee that database transactions are processed reliably
- Atomicity: Guarantees that all operations in a transaction are executed or none are. If a transaction fails, all changes are rolled back.
- Consistency: Ensures any data in a transaction will be valid according to all defined rules including SQL constraints, maintaining data integrity.
- constraints are used to specify rules for the data, e.g. NOT NULL, FOREIGN KEY, DEFAULT
- Isolation: Ensures that concurrent transactions do not interfere with each other; at the strictest level (serializable), they behave as if executed one after another.
- CockroachDB always uses serializable isolation
- the strongest of the four isolation levels defined by the SQL standard
- Durability: Guarantees that once a transaction is committed, its changes are permanent, even in case of system failure.
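A minimal sketch of atomicity and consistency using Python’s built-in sqlite3 (chosen only because it needs no setup): a transfer that violates a CHECK constraint is rolled back as a whole, so neither balance changes:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY,"
             " balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

try:
    # try to move 200 from account 2 (which only holds 50) to account 1
    conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 1")  # succeeds
    conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 2")  # violates CHECK
    conn.commit()    # durability: changes only persist once committed
except sqlite3.IntegrityError:
    conn.rollback()  # atomicity: the successful first UPDATE is undone as well

print(conn.execute("SELECT id, balance FROM accounts").fetchall())
# [(1, 100), (2, 50)] -- both rows are unchanged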
SQL vs NoSQL
SQL
- Pros
- Relational
- Data is structured
- ACID compliant
- Cons
- Structure must be created in advance
- Difficult to scale horizontally
NoSQL
- Pros
- Flexible structure (key-value, document, etc., semi-structured or unstructured data)
- Easy to scale horizontally (suitable for handling large volumes of data and high traffic loads)
- Often faster write and query performance for simple access patterns
- Cons
- Many NoSQL databases offer only eventual consistency, not strong consistency
PostgreSQL vs MySQL (InnoDB)
Similarity
- Both are relational databases
- SQL compliance
- ACID compliance
- support JSON
- support stored procedures
- support both synchronous and asynchronous replication
PostgreSQL
- Open source
- support more index types such as B-tree, GiST (Generalized Search Tree), and GIN (Generalized Inverted Index)
- support multi-version concurrency control (MVCC)
- support more data types, including arrays, key-value, binary JSON
- designed for write-heavy workloads
- designed for more complex queries and transactions
- suitable for vertical scaling on a single server
- built-in security features including row-level security and SSL encryption
MySQL
- open source (Community Edition under the GPL) but owned by Oracle, which also sells commercial editions
- the default index type is B-tree, also supports hash indexes
- designed for read-heavy workloads
- suitable for horizontal scaling and distributed systems
AWS Redshift vs MySQL
- Purpose
- Redshift is a large-scale data warehouse service for analytics purposes
- MySQL is a more general purpose database
- Scaling
- Redshift can handle petabyte-scale data
- MySQL performs efficiently when handling medium-to-large scale data
Redshift features
- Data Warehousing
- Redshift is a data warehouse service and it’s highly suitable for performing ETL (Extract, Transform, Load) operations.
- Data Loading
- Data is often loaded into Redshift from other AWS services like S3, DynamoDB, or data streams like Kinesis.
- Real-Time Analytics
- Redshift can be used in conjunction with other AWS services like Kinesis for real-time data streaming and analytics.
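A minimal sketch of loading data from S3 into Redshift with the COPY command; the table, bucket, and IAM role are made up, and the connection uses psycopg2 since Redshift speaks the PostgreSQL wire protocol:

import psycopg2

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="app", password="***")

copy_sql = """
COPY page_views
FROM 's3://my-bucket/page_views/2024/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift loads the S3 files in parallel across its nodes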