Big O
A mathematical notation used in computer science to describe the performance or complexity of an algorithm
Complexity
O(1) < O(log n) < O(n) < O(n log n) < O(n^2) < O(2^n) < O(n!)
- O(1) constant
- O(log n) logarithmic
- base is 2
- an array storing 1M items needs at most about 20 comparisons with binary search (2^20 ≈ 1M)
- binary search
- O(n) linear
- for loop
- linear search through an array storing 1 to 10
- O(n log n) log linear
- O(n^2) quadratic
- double for loop
- bubble, insertion, selection sort
- O(2^n) exponential
- recursive Fibonacci numbers (see the sketch after this list)
- O(n!) factorial
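To make two of these classes concrete, here is a minimal Go sketch (my own illustration, not from the original notes): binary search halves the remaining range on every comparison (O(log n)), while the naive recursive Fibonacci mentioned above spawns two calls per call (O(2^n)).

```go
package main

import "fmt"

// binarySearch is O(log n): every comparison halves the search range.
func binarySearch(sorted []int, target int) int {
	lo, hi := 0, len(sorted)-1
	for lo <= hi {
		mid := lo + (hi-lo)/2
		switch {
		case sorted[mid] == target:
			return mid
		case sorted[mid] < target:
			lo = mid + 1
		default:
			hi = mid - 1
		}
	}
	return -1 // not found
}

// fib is the naive recursive Fibonacci, O(2^n): each call spawns two more.
func fib(n int) int {
	if n < 2 {
		return n
	}
	return fib(n-1) + fib(n-2)
}

func main() {
	fmt.Println(binarySearch([]int{1, 3, 5, 7, 9}, 7)) // 3
	fmt.Println(fib(10))                               // 55
}
```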
Common Data Structures
- arrays
- Time
- Access O(1)
- Arrays have O(1) access time because elements are stored in a contiguous block of memory, so the address of an element can be computed directly from its index.
- Search O(n)
- Arrays have O(n) search time because finding a specific element typically requires checking the elements one by one, so the time grows linearly with the number of elements.
- Insertion/Deletion O(n)
- Arrays have O(n) insertion and deletion time because inserting or deleting an element in the middle requires shifting every element after that point, which takes time linear in the number of elements.
- For example, inserting or deleting the first item requires shifting all remaining elements one position to the right or left (the sketch after this list walks through these operations).
- Space O(n)
- If the array holds n elements of m bits each, it occupies n * m bits in total. Since m is a constant, memory grows linearly with the number of elements, making the space complexity O(n).
- Note that the space complexity of an array is expressed in terms of the number of elements, not the size of each element.
- unordered data
- stacks
- Time
- Access/Search O(n)
- Stacks take O(n) access time because reaching an arbitrary element requires starting at the top and following the chain of elements down to it, which is linear in the number of elements.
- Stacks take O(n) search time for the same reason: finding an element means walking from the top until it is found.
- Insertion/Deletion O(1)
- Stacks take O(1) insertion and deletion time because they follow the Last-In-First-Out (LIFO) principle, where the most recently added element is always the first to be removed. In a stack, inserting (push) or deleting (pop) an element only requires updating the top pointer, which takes constant time and does not depend on the size of the stack.
- Space O(n)
- Stacks take O(n) space complexity because each element in a stack requires a constant amount of memory to store, and the amount of memory required by a stack is proportional to the number of elements stored in the stack.
- LIFO
- queues
- Time
- Access/Search O(n)
- Queues take O(n) access and search time because, in general, accessing or searching for an element in a queue requires starting at one end of the queue and then following the chain of pointers until the desired element is found, which takes linear time proportional to the number of elements in the queue.
- Insertion/Deletion O(1)
- Queues take O(1) insertion and deletion time because they follow the First-In-First-Out (FIFO) principle: inserting (enqueue) at the rear and deleting (dequeue) from the front only require updating the rear and front pointers, which takes constant time regardless of the queue's size.
- Space O(n)
- FIFO
- linked list
- Time
- Access/Search O(n)
- Linked lists take O(n) access and search time because, in general, accessing or searching for an element in a linked list requires following the chain of pointers from one element to the next until the desired element is found, which takes linear time proportional to the number of elements in the linked list.
- Insertion/Deletion O(1)
- Linked lists take O(1) insertion and deletion time because each element is connected to its neighbors via pointers, so an element can be inserted or deleted by simply updating the neighboring pointers once the position is known (finding that position is a separate O(n) search).
- Space O(n)
- no fixed size
- takes up extra space (each node must store pointers and metadata)
- slow lookup (must walk through the items one by one)
- unordered data
- hash tables
- Time
- Access N/A
- (avg) Search/Insertion/Deletion O(1)
- Hash tables can get an average time complexity of O(1) because they use a hash function to map keys to indices in an array, which allows for fast access to elements. When the hash function is well-designed and the hash table is properly balanced, each key maps to a unique index, and accessing an element in the hash table can be done in constant time.
- (worst) Search/Insertion/Deletion O(n)
- Hash tables use a hash function to map keys to indices in an array, which allows for fast search for elements. However, if the hash function is poorly designed or the hash table becomes too full, collisions can occur, where multiple keys map to the same index.
- In the worst case scenario, when all keys collide and map to the same index, the hash table becomes a linked list, and insertion becomes an O(n) operation because you have to traverse the linked list to find the right place to insert the new element.
- Space O(n)
- trees
- Binary Search Tree
- Time
- (avg) Access/Search/Insertion/Deletion O(log n)
- Binary search trees organize elements so that searching, insertion, and deletion are efficient. When the tree is balanced, its height is logarithmic in the number of elements, which gives an average time complexity of O(log n) for these operations.
- (worst) Access/Search/Insertion/Deletion O(n)
- If the tree becomes unbalanced, its height can be proportional to the number of elements, resulting in a worst-case time complexity of O(n).
- Since the height of a well-balanced BST is O(log n), insertion takes O(log n) time on average; in the worst case, when the BST degenerates into a linear chain, it takes O(n).
- Space O(n)
- pros
- better than O(n)
- ordered
- flexible size
- cons
- degrades to O(n) when the tree becomes unbalanced
- Heap (binary heap)
- Time
- Insertion O(log n)
- inserting a new element into a heap requires adding the element to the end of the heap and then performing a series of swaps to maintain the heap property
- Insert all elements O(n log n)
- Deletion(= extract) O(log n)
- deleting the root element of a heap requires replacing it with the last element in the heap and then performing a series of swaps to maintain the heap property
- Heapify O(log n)
- Building a heap O(n)
- Space O(n)
- pros
- better than O(n)
- priority
- flexible size
- fast insert
- cons
- tries
- graphs
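To ground the complexities listed above, here is a rough Go sketch (my own illustration; the slice-backed stack and queue are just one simple way to implement them):

```go
package main

import (
	"container/list"
	"fmt"
)

func main() {
	// Array: O(1) access by index; O(n) insertion, since later elements must shift.
	arr := []int{1, 2, 4, 5}
	fmt.Println(arr[2])                                    // O(1) access
	arr = append(arr[:2], append([]int{3}, arr[2:]...)...) // O(n) insert at index 2
	fmt.Println(arr)                                       // [1 2 3 4 5]

	// Stack (LIFO): O(1) push and pop at the same end.
	stack := []int{}
	stack = append(stack, 10)    // push
	top := stack[len(stack)-1]   // peek
	stack = stack[:len(stack)-1] // pop
	fmt.Println(top)

	// Queue (FIFO): O(1) enqueue at the rear, dequeue at the front.
	queue := []int{20, 30}
	queue = append(queue, 40) // enqueue
	front := queue[0]
	queue = queue[1:] // dequeue (a ring buffer would avoid leaking the front slots)
	fmt.Println(front, queue)

	// Linked list: O(1) insert/delete once you already hold the node.
	ll := list.New()
	e := ll.PushBack(1)
	ll.InsertAfter(2, e) // O(1)
	ll.Remove(e)         // O(1)
	fmt.Println(ll.Len())

	// Hash table: average O(1) insert, search, and delete.
	ht := map[string]int{}
	ht["a"] = 1      // insert
	_, ok := ht["a"] // search
	delete(ht, "a")  // delete
	fmt.Println(ok)
}
```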
Binary search tree
Demonstration
        101
       /   \
     33     105
    /  \    /  \
   9   37 104  144
Basics
- Every node in the right subtree of a node must be greater than that node, and every node in the left subtree must be smaller
- If the BST is balanced, the average time complexity of all operations is O(log n). However, the worst case (an unbalanced BST) is O(n)
- Pros
- better than O(n)
- ordered
- Flexible size
- Cons
- performance degrades to O(n) when the tree becomes unbalanced
The time complexity of insertion in a BST
The time complexity of inserting a new node into a binary search tree (BST) depends on the height of the tree and ranges from O(log n) to O(n), where n is the number of nodes.
In the best case, when the tree is balanced, the height is O(log n), so insertion is also O(log n): each comparison eliminates half of the remaining tree, and at most log n comparisons are needed to reach a leaf node.
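A minimal Go sketch of BST insertion (my own, assuming a plain unbalanced Node type). The code walks a single root-to-leaf path, which is why the cost is O(h): O(log n) when balanced, O(n) when the tree degenerates into a chain. The values are the ones from the demonstration above.

```go
package main

import "fmt"

type Node struct {
	Val         int
	Left, Right *Node
}

// Insert walks down one path of the tree, so it costs O(h).
func Insert(root *Node, val int) *Node {
	if root == nil {
		return &Node{Val: val}
	}
	if val < root.Val {
		root.Left = Insert(root.Left, val)
	} else {
		root.Right = Insert(root.Right, val)
	}
	return root
}

func main() {
	var root *Node
	for _, v := range []int{101, 33, 105, 9, 37, 104, 144} {
		root = Insert(root, v)
	}
	fmt.Println(root.Left.Right.Val) // 37
}
```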
The implementation difference between Array and Linked List
- Array: several elements must be shifted whenever a node is added or deleted.
- Linked List: avoids this drawback of arrays; insertion and deletion only update pointers.
Heap
Demonstration
max heap
90
/ \
60 50
/ \ / \
40 30 20 10
min heap
10
/ \
20 30
/ \ / \
40 60 50 70
Basics
A Binary Heap is either Min Heap or Max Heap
- max-heap: root is the maximum number
- Every child is less than their parent
- It doesn’t matter whether the left or right child is larger
- heapify up: swap with the parent if the parent is smaller
- heapify down: swap with the biggest child if the parent is smaller than either child
- min-heap: root is the minimum number
- heapify up: swap with the parent if the parent is bigger
- heapify down: swap with the smallest child if the parent is bigger than either child
Use cases
- priority queues (see siftUp below)
Operations
- Insertion
- Keep the tree complete by adding the value to the bottom layer, from left to right
- Maintain the features of a min/max heap by swapping nodes all the way to the top if needed
- Deletion
- Remove the root node, then replace it with the node from the bottom-rightmost layer of the heap
- Maintain the features of a min/max heap by swapping nodes all the way to the bottom if needed
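One possible array-backed max-heap sketch in Go for the two operations above (the MaxHeap/Insert/Extract names are my own): Insert appends at the bottom and heapifies up; Extract replaces the root with the bottom-rightmost node and heapifies down. Both are O(log n).

```go
package main

import "fmt"

type MaxHeap struct{ data []int }

// Insert appends at the bottom layer, then heapifies up: O(log n).
func (h *MaxHeap) Insert(v int) {
	h.data = append(h.data, v)
	i := len(h.data) - 1
	for i > 0 {
		parent := (i - 1) / 2
		if h.data[parent] >= h.data[i] {
			break
		}
		h.data[parent], h.data[i] = h.data[i], h.data[parent]
		i = parent
	}
}

// Extract removes the root, replaces it with the last node,
// then heapifies down: O(log n). Assumes a non-empty heap.
func (h *MaxHeap) Extract() int {
	root := h.data[0]
	last := len(h.data) - 1
	h.data[0] = h.data[last]
	h.data = h.data[:last]
	i := 0
	for {
		l, r, biggest := 2*i+1, 2*i+2, i
		if l < len(h.data) && h.data[l] > h.data[biggest] {
			biggest = l
		}
		if r < len(h.data) && h.data[r] > h.data[biggest] {
			biggest = r
		}
		if biggest == i {
			break
		}
		h.data[i], h.data[biggest] = h.data[biggest], h.data[i]
		i = biggest
	}
	return root
}

func main() {
	h := &MaxHeap{}
	for _, v := range []int{40, 90, 20, 60, 10, 50, 30} {
		h.Insert(v)
	}
	fmt.Println(h.Extract(), h.Extract()) // 90 60
}
```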
The difference in implementation between array and linked list
- array (preferred)
- pros
- Easy implementation
- easy to compute the array index of a node's children: for a node at index i (0-indexed), the children are at 2i+1 and 2i+2 and the parent is at (i-1)/2
- more efficient to find the Kth element of an array than the Kth element of a linked list
- Good for time complexity
- efficient operations such as inserting a new element, deleting the minimum (or maximum) element, and finding the minimum (or maximum) element, which can be done in O(log n) time.
- in-place algorithm
- cons
- Memory reallocation: if the heap grows beyond its initial size, it must be resized, leading to wasted space and increased memory usage
- linked list
- pros
- Dynamic memory allocation: with pointers it's easier to grow the data structure dynamically; you can insert new elements without worrying about reallocating memory
- cons
- Difficult implementation (compared to array)
- Bad for time complexity: finding the minimum (or maximum) element, inserting a new element, and deleting the minimum (or maximum) element all have a time complexity of O(n) in the worst case because it may require traversing the entire list to find a specific element
- O(n) space
How do I decide which implementation to use, arrays or linked lists?
- If the heap size changes frequently and the heap is huge
- yes -> linked list
- no -> array
- If the performance of operations such as insertion, deletion matters
- yes -> array
- no -> linked list
Why does heap sort require O(n log n) time?
The time complexity of heap sort is the sum of its two stages: O(n) to build the heap and O(n log n) to remove each node in order. Therefore, the overall time complexity is O(n log n).
Build a heap by inserting all elements O(n log n)
each insertion takes O(log n) time, and we perform this n times, resulting in O(n * log n) time complexity
Build a heap from unsorted binary tree O(n)
Heapify works by processing the elements from the middle of the array to the beginning, rearranging elements to satisfy the heap property. Since the majority of elements are located in the lower part of the heap (which requires less work to rearrange), the overall time complexity is reduced to O(n)
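A sketch of the two stages in Go (my own illustration, not from the notes): buildHeap uses siftDown for the O(n) stage, then each of the n extractions re-heapifies in O(log n).

```go
package main

import "fmt"

// siftDown pushes a[i] down until the max-heap property holds on a[:n].
func siftDown(a []int, i, n int) {
	for {
		l, r, biggest := 2*i+1, 2*i+2, i
		if l < n && a[l] > a[biggest] {
			biggest = l
		}
		if r < n && a[r] > a[biggest] {
			biggest = r
		}
		if biggest == i {
			return
		}
		a[i], a[biggest] = a[biggest], a[i]
		i = biggest
	}
}

// heapSort = O(n) build + n extractions at O(log n) each = O(n log n).
func heapSort(a []int) {
	n := len(a)
	for i := n/2 - 1; i >= 0; i-- { // stage 1: build the max heap, O(n)
		siftDown(a, i, n)
	}
	for end := n - 1; end > 0; end-- { // stage 2: extract max n times, O(n log n)
		a[0], a[end] = a[end], a[0] // move the current max to its final slot
		siftDown(a, 0, end)         // restore the heap on the remaining prefix
	}
}

func main() {
	a := []int{5, 3, 8, 1, 9, 2}
	heapSort(a)
	fmt.Println(a) // [1 2 3 5 8 9]
}
```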
Why does building a heap take O(n)? (siftDown)
n = 31, h = floor(log n) = 4 (the levels below run from h=0 to h=4)
E h=4
/ \
/ \
/ \
/ \
/ \
/ \
/ \
D D h=3
/ \ / \
/ \ / \
C C C C h=2
/ \ / \ / \ / \
B B B B B B B B h=1
/ \ / \ / \ / \ / \ / \ / \ / \
A A A A A A A A A A A A A A A A h=0
Explanation
- there are 16 A nodes at height 0 (n/2^1 = 31/2 ≈ 16)
- there are 8 B nodes at height 1 (n/2^2 = 31/4 ≈ 8)
- there are 4 C nodes at height 2 (n/2^3 = 31/8 ≈ 4)
- there are 2 D nodes at height 3 (n/2^4 = 31/16 ≈ 2)
- there is 1 E node at height 4 (n/2^5 = 31/32 ≈ 1)
there are n/2^(h+1) nodes at height h, and each of those nodes performs at most h swaps
For example, a C node swaps at most 2 times to reach the bottom.
The total number of swaps in the worst case is:
(A: n/2^1 * 0) + (B: n/2^2 * 1) + (C: n/2^3 * 2) + (D: n/2^4 * 3) + (E: n/2^(h+1) * h)
= (31/2 * 0) + (31/4 * 1) + (31/8 * 2) + (31/16 * 3) + (31/32 * 4)
= 16*0 + 8*1 + 4*2 + 2*3 + 1*4
= 26
so the time complexity is O(n)
the full proof is more involved; see https://stackoverflow.com/questions/9755721/how-can-building-a-heap-be-on-time-complexity if you are interested
Demonstration
Start with the last node in the tree that has children (for a binary tree stored in an array, this is index n/2 - 1, rounded down, where n is the total number of nodes, i.e., the array's length)
What if n is odd, e.g. 15 — which node should I start with? Indices must be integers, so round down: 15/2 - 1 = 6. In a 0-indexed array, that is the 7th element.
Original
2 <- h=3
/ \
/ \
/ \
3 1 <- h=2
/ \ / \
7 4 11 9 <- h=1
/ \ / \ / \ / \
8 14 10 5 12 6 13 17 <- h=0
Start with h=1
2
/ \
/ \
/ \
3 1
/ \ / \
14 10 12 17
/ \ / \ / \ / \
8 7 4 5 11 6 13 9
Next, h=2
2
/ \
/ \
/ \
14 17
/ \ / \
8 10 12 13
/ \ / \ / \ / \
3 7 4 5 11 6 1 9
some nodes are swapped down more than once
Then, h=3
17
/ \
/ \
/ \
14 13
/ \ / \
8 10 12 9
/ \ / \ / \ / \
3 7 4 5 11 6 1 2
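The same procedure as a Go sketch (reusing the siftDown from the heap-sort sketch above): buildHeap starts from the last parent and works back toward the root. Running it on the array from the demonstration reproduces the final tree.

```go
package main

import "fmt"

// siftDown pushes a[i] down until the max-heap property holds.
func siftDown(a []int, i int) {
	n := len(a)
	for {
		l, r, biggest := 2*i+1, 2*i+2, i
		if l < n && a[l] > a[biggest] {
			biggest = l
		}
		if r < n && a[r] > a[biggest] {
			biggest = r
		}
		if biggest == i {
			return
		}
		a[i], a[biggest] = a[biggest], a[i]
		i = biggest
	}
}

// buildHeap sifts down every parent, starting from the last one (n/2 - 1).
func buildHeap(a []int) {
	for i := len(a)/2 - 1; i >= 0; i-- {
		siftDown(a, i)
	}
}

func main() {
	a := []int{2, 3, 1, 7, 4, 11, 9, 8, 14, 10, 5, 12, 6, 13, 17}
	buildHeap(a)
	fmt.Println(a) // [17 14 13 8 10 12 9 3 7 4 5 11 6 1 2], the final tree above
}
```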
The difference between siftUp and siftDown
siftUp
- O(n log n)
- siftUp operation is only needed to perform inserts into an existing heap
- can be used to implement a priority queue on top of a binary heap
siftDown
- O(n)
- buildHeap (build the heap from a binary tree) and heapSort will only use siftDown
Which is more efficient, siftUp or siftDown?
siftDown
(0 * n/2) + (1 * n/4) + (2 * n/8) + ... + (h * 1)
O(n)
siftUp
(h * n/2) + ((h-1) * n/4) + ((h-2)*n/8) + ... + (0 * 1)
the first term alone is h * n/2 = (log n) * n/2, which is already O(n log n)
If we need to apply an operation to every node, it is preferable to use siftDown rather than siftUp.
Graph
basics
Components
- vertices (nodes)
- edges (connections between pairs of nodes)
Types
- directed graph
- u → v
- (u, v) and (v, u) are different
- un-directed graph
- u – v
- (u, v) and (v, u) are the same
Implementation using Edge list
graph using array
  2 ---------- 0
 / \
1---3
represent in code
graph = [[0,2], [2,3], [2,1], [1,3]]
Each array element represents a connection between two nodes in the graph
pros
- Simple Representation: each pair in the list represents an edge
- Space Efficiency for Sparse Graphs: the number of edges is much less than the square of the number of vertices
- Easy Addition of Edges: just append a new pair to the list, which takes O(1), assuming you don’t have to check for duplicates
cons
- Inefficient for Dense Graphs: for dense graphs (where the number of edges is close to the square of the number of vertices), the list grows very large and every scan-based operation slows down accordingly
- Slow Lookup: checking whether an edge exists between two vertices can be slow, as it may require scanning the entire list, which takes O(E), where E is the number of edges
- No Quick Adjacency Information: Unlike adjacency lists or matrices, edge lists don’t provide a fast way to find all vertices adjacent to a given vertex. You would need to traverse the entire list and pick out the edges for the given vertex
- Duplicate and Loop Checking: If you need to ensure that there are no duplicate edges or loops in your graph, you’ll need to do extra work every time you add an edge. In the worst case, this could involve scanning the entire list.
use cases
- suitable for sparse graph
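A small Go sketch of the edge list above (same pairs as the graph code): checking whether an edge exists has to scan the whole list, which is the O(E) lookup cost listed in the cons.

```go
package main

import "fmt"

// hasEdge scans the entire edge list: O(E).
func hasEdge(edges [][2]int, u, v int) bool {
	for _, e := range edges {
		// undirected: (u, v) and (v, u) are the same edge
		if (e[0] == u && e[1] == v) || (e[0] == v && e[1] == u) {
			return true
		}
	}
	return false
}

func main() {
	graph := [][2]int{{0, 2}, {2, 3}, {2, 1}, {1, 3}}
	fmt.Println(hasEdge(graph, 1, 2)) // true
	fmt.Println(hasEdge(graph, 0, 3)) // false
}
```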
Implementation using Adjacency list
graph
  2 ---------- 0
 / \
1---3
represent in code using array of linked list
graph = [[2], [2,3], [0,1,3], [1,2]]
- Each array element’s index represents a node in the graph, and the values within that element represent connections to that node.
- For instance, index 0 ([2]) represents a connection between nodes 0 and 2, while index 1 ([2,3]) indicates connections between nodes 1 and 2, as well as nodes 1 and 3.
pros
- Space Efficient
- only store the nodes that actually exists in the graph
- This works nicely for all kinds of graphs, dense or sparse, since the space usage is O(V + E), which is always at most N^2
- Dynamic Allocation
- allows for dynamic allocation of memory, making it easier to add or remove vertices and edges from the graph
- fast to iterate all edges
- Iterating over all edges in an adjacency list representation is easier and faster compared to an adjacency matrix representation, especially for sparse graphs
cons
- Slow to access a specific edge / slow to determine if 2 nodes are connected
- need to list all the nodes that the node is connected to
- O(N)
use cases
- Adjacency list is much more efficient for the storage of the graph, especially sparse graphs, when there is a lot less edges than nodes.
- suitable for sparse graph
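A Go sketch of the adjacency list above: iterating over all edges touches each neighbor list once (O(V + E)), while checking one specific edge means scanning a neighbor list (O(V) in the worst case).

```go
package main

import "fmt"

func main() {
	// index = node, values = its neighbors; space is O(V + E)
	graph := [][]int{{2}, {2, 3}, {0, 1, 3}, {1, 2}}

	// iterating all edges is fast: O(V + E)
	// (each undirected edge appears twice, once per endpoint)
	for u, neighbors := range graph {
		for _, v := range neighbors {
			fmt.Printf("%d - %d\n", u, v)
		}
	}

	// checking a specific edge scans one neighbor list: O(V) worst case
	u, v, connected := 1, 3, false
	for _, w := range graph[u] {
		if w == v {
			connected = true
		}
	}
	fmt.Println(connected) // true
}
```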
Implementation using Adjacency matrix
graph
  2 ---------- 0
 / \
1---3
represent in code using array
graph = [
  [0,0,1,0],
  [0,0,1,1],
  [1,1,0,1],
  [0,1,1,0]
]
- It is similar in concept to the adjacency list.
- The index 0 in the 2-dimensional array represents node 0 in the graph, and the presence of a value 1 at a particular index within that array element signifies a connection between node 0 and the node represented by that index.
- For example, row 0 ([0,0,1,0]) represents a connection between node 0 and node 2, while row 2 ([1,1,0,1]) indicates connections between node 2 and node 0, node 2 and node 1, as well as node 2 and node 3.
pros
- easy to implement
- Accessing and updating information about an edge take O(1) time
- determine if two vertices are adjacent to each other
- add an edge in the graph
- delete an edge from the graph
cons
- consume more memory O(V^2)
- if a graph has 1000 vertices and only 1 edge, storing that one edge still takes an array of size 1000^2 in memory
- not good for a sparse graph
- Traversing the graph using algorithms like DFS/BFS requires O(V^2) time
- very slow for algorithms that require iterating over all edges, as the number of edges is proportional to n^2.
use cases
- If the graph has on the order of V^2 edges, an adjacency matrix is preferable to an adjacency list: the two representations are comparable in size, so the adjacency matrix doesn't necessarily waste a lot of memory.
- If we frequently perform operations like adding/deleting an edge or checking whether two vertices are adjacent, an adjacency matrix is recommended, since these operations take constant time.
- suitable for dense graph
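A Go sketch of the adjacency matrix above: edge checks, additions, and deletions are all O(1) index operations, paid for with O(V^2) space.

```go
package main

import "fmt"

func main() {
	// graph[u][v] == 1 means u and v are adjacent; space is O(V^2)
	graph := [][]int{
		{0, 0, 1, 0},
		{0, 0, 1, 1},
		{1, 1, 0, 1},
		{0, 1, 1, 0},
	}

	fmt.Println(graph[1][2] == 1) // is there an edge 1-2? O(1)

	graph[0][3], graph[3][0] = 1, 1 // add edge 0-3: O(1), both directions
	graph[0][3], graph[3][0] = 0, 0 // delete it again: O(1)
}
```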
Trie
basics
- a tree-like data structure
- can have more than 2 children
- used for string search (e.g., a search engine's autocomplete)
Represent a trie using linked nodes
node attributes
- map[string]*node
- isWord (boolean)
trie: ["cat", "tr", "trid", "trie", "trief"]
[ "c" "t" ]
(false) (false)
/ \
["a"] ["r"]
(false) (true)
/ \
["t"] ["i"]
(true) (false)
\
[ "d" "e" ]
(true) (true)
\
["f"]
(true)
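A hedged Go sketch of this trie, mirroring the attributes above (children in a map[string]*node plus an isWord flag); the Insert/Search method names are my own. Both operations cost O(len(word)).

```go
package main

import "fmt"

type node struct {
	children map[string]*node
	isWord   bool
}

func newNode() *node {
	return &node{children: map[string]*node{}}
}

// Insert walks (or creates) one child per character: O(len(word)).
func (t *node) Insert(word string) {
	cur := t
	for _, r := range word {
		c := string(r)
		if cur.children[c] == nil {
			cur.children[c] = newNode()
		}
		cur = cur.children[c]
	}
	cur.isWord = true
}

// Search follows the same path and checks the isWord flag at the end.
func (t *node) Search(word string) bool {
	cur := t
	for _, r := range word {
		cur = cur.children[string(r)]
		if cur == nil {
			return false
		}
	}
	return cur.isWord
}

func main() {
	root := newNode()
	for _, w := range []string{"cat", "tr", "trid", "trie", "trief"} {
		root.Insert(w)
	}
	fmt.Println(root.Search("trie"), root.Search("tri")) // true false
}
```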
ref