How can you generate a key that is guaranteed to be unique?
The possibility of getting duplicates with UUIDv4
UUIDv4 builds an ID from 122 random bits (the remaining 6 bits are fixed by the version and variant fields). The chance of generating a duplicate is extremely low, but not zero.
To understand the probability, consider that the number of version 4 UUIDs which can be generated is 2^122, or about 5.3 × 10^36.
To put these numbers into perspective, the annual risk of a given person being hit by a meteorite is estimated to be one chance in 17 billion, which means the probability is about 0.00000000006 (6 × 10^-11), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate.
In other words, only after generating 1 billion UUIDs every second for roughly 86 years would the probability of creating a single duplicate reach 50%.
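A quick way to sanity-check these figures is the birthday-problem approximation p ≈ 1 - exp(-n^2 / (2 × 2^122)). A minimal Python sketch (standard library only):

```python
import math

# Birthday-problem approximation: P(collision) ~= 1 - exp(-n^2 / (2 * N)),
# where N = 2^122 is the number of possible version-4 UUIDs.
N = 2 ** 122

def collision_probability(n: int) -> float:
    """Approximate probability of at least one duplicate among n UUIDv4s."""
    return 1 - math.exp(-(n * n) / (2 * N))

# ~1 billion UUIDs per second for ~86 years is roughly 2.7 x 10^18 UUIDs.
print(collision_probability(2_700_000_000_000_000_000))   # ~0.5
# 103 trillion UUIDs still only give about a one-in-a-billion chance.
print(collision_probability(103_000_000_000_000))          # ~1e-9
```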
Approaches to mitigate the collision risk
- Combination of UUID and Timestamp or System-Specific Identifier
- e.g. use a timestamp as a prefix or suffix to the UUID (see the sketch after this list)
- Use a Version 1 UUID (UUIDv1)
- e.g. built from the MAC address of the host that generated the UUID and the current time
- caveat: privacy concerns, since the MAC address can be traced back to the machine
- Check Before Use
- check whether the generated key has already been used before accepting it
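A minimal sketch of the timestamp-prefix and check-before-use ideas, with UUIDv1 shown for comparison (`key_exists` is a hypothetical stand-in for whatever lookup your key store provides):

```python
import time
import uuid

def timestamped_uuid() -> str:
    """UUIDv4 with a millisecond timestamp prefix (approach 1)."""
    return f"{int(time.time() * 1000)}-{uuid.uuid4()}"

def uuid_v1() -> uuid.UUID:
    """UUIDv1: derived from the host's MAC address and the current time (approach 2)."""
    return uuid.uuid1()

def key_exists(key: str) -> bool:
    """Hypothetical lookup against your key store (DB, cache, ...)."""
    raise NotImplementedError

def generate_unique_key(max_attempts: int = 5) -> str:
    """Check-before-use (approach 3): retry until an unused key is found."""
    for _ in range(max_attempts):
        key = timestamped_uuid()
        if not key_exists(key):
            return key
    raise RuntimeError("could not generate a unique key")
```

Note that check-before-use is only airtight if the check and the subsequent insert are atomic (e.g. enforced by a unique constraint); otherwise two concurrent writers can both pass the check with the same key.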
Other ways to make sure that the generated key is unique
- Centralized ID Generation
- Using a centralized server or database that generates IDs and keeps track of the ones already used can guarantee uniqueness.
- This could involve using an auto-incrementing field in a database, or a dedicated service that generates unique IDs.
- However, this approach might not be suitable for systems that need to generate IDs in a distributed or offline manner.
- For example
- pre-generate all possible keys in a table (with 6-character keys over [A-Za-z0-9], that is 62^6, around 56.8 billion keys)
- once a key is used, mark it by moving it to a separate table, with appropriate locking in place (see the sketch after this list)
- Distributed Consensus (an alternative when "Centralized ID Generation" isn't feasible)
- In a distributed system where a centralized authority for ID generation isn’t feasible, a consensus algorithm like Paxos or Raft could be used to agree on the next ID to use.
- This adds complexity and communication between nodes, but can guarantee uniqueness.
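A minimal sketch of the pre-generated key table idea, using SQLite for illustration (the table and column names are assumptions; a server database would more likely use row locking such as SELECT ... FOR UPDATE):

```python
import sqlite3

def claim_key(conn: sqlite3.Connection) -> str:
    """Atomically move one pre-generated key from unused_keys to used_keys.

    Assumes the connection was opened with isolation_level=None so the
    transaction is managed explicitly here.
    """
    # BEGIN IMMEDIATE takes the write lock up front, so two concurrent
    # workers cannot select and claim the same key.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute("SELECT key FROM unused_keys LIMIT 1").fetchone()
        if row is None:
            raise RuntimeError("key pool exhausted")
        key = row[0]
        conn.execute("DELETE FROM unused_keys WHERE key = ?", (key,))
        conn.execute("INSERT INTO used_keys (key) VALUES (?)", (key,))
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")
    return key
```

In practice each worker would usually claim a whole batch of keys at once to reduce contention on the key table.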
Types of bounds
- IO Bound
- the program spends most of its time waiting for input/output (see the thread-vs-process sketch after this list)
- e.g. reading from or writing to files, network communication, DB queries, downloading/uploading
- CPU Bound
- relies heavily on computational power; the CPU is the limiting factor
- e.g. mathematical calculations, cryptographic operations, video encoding, image processing, complex algorithms
- Memory Bound
- constrained by available memory
- e.g. processing large datasets
- can be improved by investigating memory leaks, optimising memory usage
- Network Bound
- limited by network factors
- e.g. bandwidth limitations, latency
- e.g. sending or receiving large files, numerous API calls
- can be improved by using a CDN, optimising network configuration, reducing the number of requests
- Disk Bound
- limited by the data transfer rate of the storage device
- increasing CPU or RAM might not help much
- e.g. workloads dominated by disk read/write speed
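The IO-bound/CPU-bound distinction matters for how work is parallelised. A small illustrative Python sketch (the task bodies are placeholders): threads are enough for IO-bound work because the time is spent waiting, while CPU-bound work needs separate processes to use more than one core.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def io_bound_task(task_id: int) -> int:
    """Stand-in for a network/disk call: mostly waiting, not computing."""
    time.sleep(0.5)  # simulate latency
    return task_id

def cpu_bound_task(n: int) -> int:
    """Stand-in for heavy computation: keeps one CPU core busy."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # IO bound: threads are fine; the GIL is released while waiting.
    with ThreadPoolExecutor(max_workers=8) as pool:
        print(list(pool.map(io_bound_task, range(8))))

    # CPU bound: processes sidestep the GIL and use multiple cores.
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(cpu_bound_task, [2_000_000] * 4)))
```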
(TODO) Cache strategy
- Least Recently Used (LRU)
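A minimal LRU cache sketch based on collections.OrderedDict (the class and method names are illustrative; for plain function memoisation, Python's built-in functools.lru_cache already implements this policy):

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)        # mark as most recently used
        return self._data[key]

    def put(self, key, value) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes the most recently used entry
cache.put("c", 3)  # evicts "b"
```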