Mastering Database Fundamentals: Storage Fundamentals

This entry is part 5 of 5 in the series Database Fundamentals

Understanding CAP Theorem in Distributed Systems
Mastering Database Fundamentals: ACID Property
Mastering Database Fundamentals: ACID with practical example
Mastering Database Fundamentals: Read Phenomena and Isolation
Mastering Database Fundamentals: How Databases Store and Retrieve Data

Database Storage Fundamentals

Have you ever wondered what happens behind the scenes when you insert or query data in a database? Just like organizing a library, databases have clever ways of storing and retrieving data efficiently. The Database storage fundamentals article explains how data is stored under the hood, combining easy-to-grasp analogies with hands-on examples using Docker and MySQL/PostgreSQL.

By the end of this article, you’ll understand:

How databases store data in pages and organize it with B-trees.
The difference between clustered and non-clustered indexes.
How these mechanisms impact performance, with real-world examples and analogies.

Let’s dive in!

How Databases Store Data: Pages, Rows, and Free Space

Databases don’t store data as one continuous list. Instead, they use pages—fixed-sized chunks of storage (e.g., 16 KB in MySQL InnoDB). Think of a page as a box in a warehouse, where each box can hold multiple items (rows of data).

Each page contains:

Headers: Metadata about the page.
Rows: Your actual records.
Free Space: Reserved for future updates.

Analogy:

Pages = Warehouse boxes
Rows = Items inside the box
Headers = Labels on the box (e.g., box ID and contents info)

When a database retrieves data, it looks at these boxes (pages), not individual items, making retrieval efficient.

Tables in a database are collections of rows, where each row represents a record, and columns define the attributes of those records. The physical storage of tables depends on the database engine (e.g., MySQL, PostgreSQL).

Page Structure

Most relational database systems divide disk storage into fixed-size blocks called pages (typically 4KB, 8KB, or larger). To demonstrate this practically, you can use Docker to set up a MySQL or PostgreSQL instance and inspect how pages are managed. For example, enabling page inspection or using database-specific tools like pg_filedump (for PostgreSQL) allows you to see how pages are allocated, what data they contain, and how rows are distributed within pages.

Each page stores multiple rows, along with metadata like free space.

A page is the smallest unit of I/O.

Table Storage in Pages

[ Page 1 ] [ Page 2 ] [ Page 3 ] ...
| Row 1   | Row 5  | Row 9   |
| Row 2  | Row 6  | Row 10 |
| Row 3  | Row 7  |         |
| Row 4  | Row 8  |         |

Row Format

Rows within a page are stored in a compact format. If a row exceeds the page size, the database splits the row across multiple pages, linking the parts together through pointers or overflow pages. This ensures efficient use of storage while accommodating larger data entries. For instance, a very long string in a VARCHAR column might start in one page and continue in another, with metadata indicating the split. For example:

Fixed-length fields (e.g., integers) occupy a predefined number of bytes.
Variable-length fields (e.g., strings) include a prefix to indicate their length.
Null values are represented by a bitmap.

Hands-On: Setting Up Your Environment

Let’s bring these concepts to life by setting up a database using Docker.

Step 1: Start a Database in Docker

For MySQL:

docker run --name mysql-db -e MYSQL_ROOT_PASSWORD=root -d mysql:latest
docker exec -it mysql-db mysql -uroot -proot

For PostgreSQL:

docker run --name postgres-db -e POSTGRES_PASSWORD=root -d postgres:latest
docker exec -it postgres-db psql -U postgres

Step 2: Create a Table and Insert Data

Let’s create a simple table to represent customer orders:

MySQL/PostgreSQL:

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_name VARCHAR(100),
    total_amount DECIMAL(10, 2)
);

INSERT INTO orders (customer_name, total_amount) VALUES 
('Alice', 100.00),
('Bob', 200.00),
('Charlie', 150.50);

Here, each record (row) gets stored in a page. The database automatically organizes the data inside these pages.

The orders table will look something like below,

order_id	customer_name	total_amount
1	Alice	100.00
2	Bob	200.00
3	Charlie	150.50

Table with Data

Indexes: The Librarian of the Database

When you search for a book in a library, you don’t go through every shelf. Instead, you use the catalog (index) to find the exact location. Databases use indexes the same way.

Clustered Index

A clustered index determines the physical order of data in a table.

Example: The PRIMARY KEY (order_id) acts as a clustered index, meaning the table is stored on disk in this order.
Benefit: Fast lookups for range queries like:
SELECT * FROM orders WHERE order_id < 3;

Non-Clustered Index

A non-clustered index is a separate structure that points to the data.

Example: Create a non-clustered index on customer_name:
CREATE INDEX idx_customer_name ON orders(customer_name);
Now, queries like this are faster:
SELECT * FROM orders WHERE customer_name = 'Alice';

Visualizing Data Storage and Indexing

Here’s how these concepts play out in the real world:

Storage Without Indexes (Full Table Scan)
Without indexes, the database checks every row in every page to find a match. This is like searching for a book in a messy library.
Clustered Index
The data is organized by the order_id (e.g., books sorted by their Dewey Decimal number). This makes range queries lightning fast.
Non-Clustered Index
The database builds a “map” pointing to the rows in the table. This is like an index card pointing to the exact location of a book.

Hands-On: Query Performance Analysis

Let’s analyze how indexes and fragmentation affect performance.

1. Run a Query Without an Index:

EXPLAIN SELECT * FROM orders WHERE customer_name = 'Alice';

Result: Full table scan, slower performance.

2. Add a Non-Clustered Index and Re-Run the Query:

CREATE INDEX idx_customer_name ON orders(customer_name);
EXPLAIN SELECT * FROM orders WHERE customer_name = 'Alice';

Result: Index lookup, faster performance.

Clustered vs. Non-Clustered Index: Key Differences

Feature	Clustered Index	Non-Clustered Index
Data Storage	Determines physical order	Separate structure (pointer)
Use Case	Range queries	Point lookups
Number per Table	One	Many

Indexes: Organized Access Using B-Trees

To retrieve data efficiently, databases use indexes, which are often organized as B-trees (balanced tree structures).

What is a B-Tree?

A B-tree is a self-balancing tree that ensures data is always organized and accessible in logarithmic time. Think of it as a hierarchical structure where:

Each node contains a range of values (keys).
Leaf nodes store pointers to actual rows in the database.

How Does a B-Tree Work?

Imagine you’re looking for the book “The Great Gatsby” in a library. The library catalog (B-tree) helps you narrow your search:

Check the first level: “Books starting with A–G.”
Narrow down to the second level: “Books starting with G.”
Finally, find “The Great Gatsby.”

Simplified B-Tree

        [ 15 ]
       /      \
   [ 7 ]      [ 20 ]
  /  \       /     \
[1]  [10]  [16]   [25]

Searching for 10 starts at the root node, traverses through intermediate nodes, and ends at the leaf node.
Leaf nodes often contain row identifiers (e.g., primary key values) or pointers to rows in the table.

In the database, the B-tree enables fast lookups by dividing and conquering the search space.

Fragmentation: Why Maintenance Matters

Over time, as rows are inserted, updated, or deleted, pages can become fragmented. This affects performance, as the database must scan multiple pages for contiguous data.

Automatic Maintenance

Fortunately, modern databases like MySQL and PostgreSQL automatically handle fragmentation through periodic vacuuming (PostgreSQL) or optimization (MySQL). These processes reclaim space and reorganize pages:

PostgreSQL VACUUM: Runs periodically to clean up dead rows and prevent bloat.
MySQL OPTIMIZE TABLE: Runs as part of background tasks.

You can trigger these manually if needed:

MySQL: OPTIMIZE TABLE orders;
PostgreSQL: VACUUM FULL orders;

Conclusion

In this Database Storage Fundamentals article, we explored row-based storage, B-trees, and indexes. But not all databases are row-based. For analytics workloads, column-based storage (storing data by columns) offers significant advantages.

Stay tuned for the next article, where we’ll compare row-based and column-based databases with hands-on examples!

Series Navigation<< Mastering Database Fundamentals: Read Phenomena and Isolation