How can you generate a key that is guaranteed to be unique?
The possibility of getting duplicates with UUIDv4
UUIDv4 builds an ID from 122 random bits (the remaining 6 bits are fixed by the version and variant fields). The chance of generating a duplicate is extremely low, but not zero.
To understand the probability, consider that the number of version 4 UUIDs which can be generated is 2^122, or about 5.3 × 10^36.
To put these numbers into perspective, the annual risk of a given person being hit by a meteorite is estimated to be one chance in 17 billion, which means the probability is about 0.00000000006 (6 × 10^-11), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate.
In other words, only after generating 1 billion UUIDs every second for roughly 86 years would the probability of creating a single duplicate reach 50%.
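A quick way to sanity-check these figures is the birthday-problem approximation p ≈ 1 - exp(-n^2 / (2 × 2^122)). A minimal Python sketch (standard library only):

```python
import math

# Birthday-problem approximation: P(collision) ~= 1 - exp(-n^2 / (2 * N)),
# where N = 2^122 is the number of possible version-4 UUIDs.
N = 2 ** 122

def collision_probability(n: int) -> float:
    """Approximate probability of at least one duplicate among n UUIDv4s."""
    return 1 - math.exp(-(n * n) / (2 * N))

# ~1 billion UUIDs per second for ~86 years is roughly 2.7 x 10^18 UUIDs.
print(collision_probability(2_700_000_000_000_000_000))   # ~0.5
# 103 trillion UUIDs still only give about a one-in-a-billion chance.
print(collision_probability(103_000_000_000_000))          # ~1e-9
```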
Approaches to mitigate the collision risk
- Combination of UUID and Timestamp or System-Specific Identifier
- e.g. use a timestamp as a prefix or suffix to the UUID (see the sketch after this list)
- Use a Version 1 UUID (UUIDv1)
- e.g. built from the MAC address of the host that generated the UUID and the current time
- caveat: privacy concerns, since the MAC address can be traced back to the machine
- Check Before Use
- check whether the generated key has already been used before accepting it
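A minimal sketch of the timestamp-prefix and check-before-use ideas, with UUIDv1 shown for comparison (`key_exists` is a hypothetical stand-in for whatever lookup your key store provides):

```python
import time
import uuid

def timestamped_uuid() -> str:
    """UUIDv4 with a millisecond timestamp prefix (approach 1)."""
    return f"{int(time.time() * 1000)}-{uuid.uuid4()}"

def uuid_v1() -> uuid.UUID:
    """UUIDv1: derived from the host's MAC address and the current time (approach 2)."""
    return uuid.uuid1()

def key_exists(key: str) -> bool:
    """Hypothetical lookup against your key store (DB, cache, ...)."""
    raise NotImplementedError

def generate_unique_key(max_attempts: int = 5) -> str:
    """Check-before-use (approach 3): retry until an unused key is found."""
    for _ in range(max_attempts):
        key = timestamped_uuid()
        if not key_exists(key):
            return key
    raise RuntimeError("could not generate a unique key")
```

Note that check-before-use is only airtight if the check and the subsequent insert are atomic (e.g. enforced by a unique constraint); otherwise two concurrent writers can both pass the check with the same key.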
Other ways to make sure that the generated key is unique
- Centralized ID Generation
- Using a centralized server or database that generates IDs and keeps track of the ones already used can guarantee uniqueness.
- This could involve using an auto-incrementing field in a database, or a dedicated service that generates unique IDs.
- However, this approach might not be suitable for systems that need to generate IDs in a distributed or offline manner.
- For example
- pre-generate all possible keys in a table (with 6-character keys over [A-Za-z0-9], that is 62^6, around 56.8 billion keys)
- once a key is used, mark it by moving it to a separate table, with appropriate locking in place (see the sketch after this list)
- Distributed Consensus (an alternative when "Centralized ID Generation" isn't feasible)
- In a distributed system where a centralized authority for ID generation isn’t feasible, a consensus algorithm like Paxos or Raft could be used to agree on the next ID to use.
- This adds complexity and communication between nodes, but can guarantee uniqueness.
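A minimal sketch of the pre-generated key table idea, using SQLite for illustration (the table and column names are assumptions; a server database would more likely use row locking such as SELECT ... FOR UPDATE):

```python
import sqlite3

def claim_key(conn: sqlite3.Connection) -> str:
    """Atomically move one pre-generated key from unused_keys to used_keys.

    Assumes the connection was opened with isolation_level=None so the
    transaction is managed explicitly here.
    """
    # BEGIN IMMEDIATE takes the write lock up front, so two concurrent
    # workers cannot select and claim the same key.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute("SELECT key FROM unused_keys LIMIT 1").fetchone()
        if row is None:
            raise RuntimeError("key pool exhausted")
        key = row[0]
        conn.execute("DELETE FROM unused_keys WHERE key = ?", (key,))
        conn.execute("INSERT INTO used_keys (key) VALUES (?)", (key,))
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")
    return key
```

In practice each worker would usually claim a whole batch of keys at once to reduce contention on the key table.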
Types of bounds
- IO Bound
- the program spends most of its time waiting for input/output (see the thread-vs-process sketch after this list)
- e.g. reading from or writing to files, network communication, DB queries, downloading/uploading
- CPU Bound
- relies heavily on computational power; the CPU is the limiting factor
- e.g. mathematical calculations, cryptographic operations, video encoding, image processing, complex algorithms
- Memory Bound
- constrained by available memory
- e.g. processing large datasets
- can be improved by investigating memory leaks, optimising memory usage
- Network Bound
- limited by network factors
- e.g. bandwidth limitations, latency
- e.g. sending or receiving large files, numerous API calls
- can be improved by using a CDN, optimising network configuration, reducing the number of requests
- Disk Bound
- limited by the data transfer rate of the storage device
- increasing CPU or RAM might not help much
- e.g. workloads dominated by disk read/write speed
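The IO-bound/CPU-bound distinction matters for how work is parallelised. A small illustrative Python sketch (the task bodies are placeholders): threads are enough for IO-bound work because the time is spent waiting, while CPU-bound work needs separate processes to use more than one core.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def io_bound_task(task_id: int) -> int:
    """Stand-in for a network/disk call: mostly waiting, not computing."""
    time.sleep(0.5)  # simulate latency
    return task_id

def cpu_bound_task(n: int) -> int:
    """Stand-in for heavy computation: keeps one CPU core busy."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # IO bound: threads are fine; the GIL is released while waiting.
    with ThreadPoolExecutor(max_workers=8) as pool:
        print(list(pool.map(io_bound_task, range(8))))

    # CPU bound: processes sidestep the GIL and use multiple cores.
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(cpu_bound_task, [2_000_000] * 4)))
```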
(TODO) Cache strategy
- Least Recently Used (LRU)
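A minimal LRU cache sketch based on collections.OrderedDict (the class and method names are illustrative; for plain function memoisation, Python's built-in functools.lru_cache already implements this policy):

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)        # mark as most recently used
        return self._data[key]

    def put(self, key, value) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes the most recently used entry
cache.put("c", 3)  # evicts "b"
```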