Chapter 1: What is SQL?
Structured Query Language (SQL) is the standard language for describing, querying, and changing data in relational databases. It is declarative, which means you describe the result you want and the database engine decides how to produce it. With SQL you can define tables, state relationships, add and modify rows, and ask questions that combine and filter many rows at once. The goal is consistent, reliable data that many applications can share.
This chapter introduces the purpose of relational databases, the difference between declarative and procedural thinking, and how SQL fits inside a typical application stack. You will see the core ideas that guide everything that follows in later chapters.
In SQL you describe a result set; the engine chooses an efficient plan.
Throughout the book you will see code blocks and small schemas. Names, sizes, and data are examples; real projects will have different columns and constraints.
Purpose and scope of relational databases
Relational databases store data in tables that relate to each other by keys. Each table models a single subject (for example, customers or orders). Rows are individual records. Columns are attributes with fixed types. Keys and constraints preserve correctness, while transactions keep changes reliable even when many users work at the same time.
Relational basics
A table has a defined set of columns with types and constraints. A primary key uniquely identifies each row. Foreign keys relate rows between tables. Queries combine tables by matching key values and return a new table as the result set.
-- A tiny, portable demo schema
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name TEXT NOT NULL
);
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER NOT NULL,
total NUMERIC NOT NULL,
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
With this shape you can ask questions such as which customers placed orders, totals per customer, and which orders match given criteria.
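One such question, totals per customer, can be asked with a single declarative statement. The following sketch builds the demo schema above in an in-memory SQLite database using Python's standard sqlite3 module; the customer names and order amounts are invented for illustration.

```python
import sqlite3

# In-memory database loaded with the demo schema from this section.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    total NUMERIC NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Lin');
INSERT INTO orders VALUES (10, 1, 20), (11, 1, 35), (12, 2, 15);
""")

# Totals per customer: one statement, no explicit loops.
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Ada', 55), ('Lin', 15)]
```

The GROUP BY clause collapses each customer's orders into one row, and the engine decides how to match and aggregate them.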
Integrity and transactions
Constraints prevent invalid data from entering the system. Transactions group statements so they succeed or fail together. This produces consistent state, even with concurrent sessions and failures.
BEGIN;
INSERT INTO orders (order_id, customer_id, total)
VALUES (1001, 1, 49.99);
UPDATE customers
SET name = 'Robin N.'
WHERE customer_id = 1;
COMMIT; -- Or ROLLBACK; to undo the whole unit
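The all-or-nothing behavior of a transaction can be observed from application code. This sketch uses Python's built-in sqlite3 module; the duplicate-key INSERT is a deliberately invalid statement added to force a rollback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customers VALUES (1, 'Robin')")
conn.commit()

# Group two changes into one unit; an error anywhere undoes both.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE customers SET name = 'Robin N.' WHERE customer_id = 1")
        conn.execute("INSERT INTO customers VALUES (1, 'Duplicate')")  # violates PRIMARY KEY
except sqlite3.IntegrityError:
    pass

# The UPDATE was rolled back along with the failed INSERT.
name = conn.execute("SELECT name FROM customers WHERE customer_id = 1").fetchone()[0]
print(name)  # Robin
```

Because the failed INSERT aborted the unit, the earlier UPDATE in the same transaction never became visible.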
Declarative vs procedural thinking
Procedural code tells the computer step by step how to get a result. Declarative SQL states the result shape (columns, filters, groupings) and leaves the steps to the optimizer. Thinking in sets, rather than loops, is the key shift.
Set thinking in practice
Instead of looping through rows and testing conditions, write one statement that operates on the whole set. The database uses indexes and joins to find matches efficiently.
-- Declarative: return names of customers with orders over 50
SELECT c.name
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
WHERE o.total > 50;
-- Procedural mindset (illustrative pseudocode)
for customer in customers:
    for order in orders:
        if order.customer_id == customer.customer_id and order.total > 50:
            emit(customer.name)
Both describe the same intent. The declarative version lets the engine choose an efficient plan.
The role of the optimizer
The optimizer examines your SELECT statement, available indexes, statistics, and constraints. It then chooses a strategy (join order, join method, index use) that should run quickly for the current data. You express filters and joins; the engine works out the path.
Result sets have no guaranteed row order unless you specify ORDER BY. If order matters, always include ORDER BY with explicit columns.
How SQL fits into application stacks
Applications call the database through a driver or library. The app sends statements such as SELECT, INSERT, UPDATE, and DELETE over a connection. The database parses and plans the statement, executes it, and returns a result set or a status.
Typical request flow
A common path is: UI or service layer → application code → database driver → database server → result set → application. Connection pooling keeps a small number of open connections that many requests can share.
-- Parameterized query (safe pattern)
SELECT name, total
FROM orders_view
WHERE customer_id = :customer_id
AND total > :min_total;
Parameters avoid string concatenation and reduce the risk of injection. The driver binds values before execution.
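A minimal sketch of this pattern, using Python's standard sqlite3 module: named parameters (`:customer_id`, `:min_total`) are bound by the driver rather than spliced into the SQL text. A plain orders table stands in for the book's orders_view, and its contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    total NUMERIC NOT NULL
);
INSERT INTO orders VALUES (1, 7, 'Ada', 80), (2, 7, 'Ada', 20), (3, 8, 'Lin', 90);
""")

# The driver binds the values; no string concatenation, no injection risk.
params = {"customer_id": 7, "min_total": 50}
rows = conn.execute(
    "SELECT name, total FROM orders "
    "WHERE customer_id = :customer_id AND total > :min_total",
    params,
).fetchall()
print(rows)  # [('Ada', 80)]
```

The same statement can be reused with different parameter values, which also lets the database cache its execution plan.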
Data definition and data manipulation
SQL statements fall into broad families. DDL (data definition) creates or changes schema objects. DML (data manipulation) reads or changes row data. DCL (access control) grants privileges. TCL (transaction control) manages COMMIT and ROLLBACK. You will use all of these across real projects.
With these foundations in place, you can begin to write clear, portable SQL that supports reliable applications at any scale.
Chapter 2: A Brief History of SQL
SQL was born from the idea that data could be managed with a mathematical model rather than through custom file formats or procedural code. In the early 1970s, researchers at IBM applied the relational model (first proposed by Dr. Edgar F. Codd) to practical database systems. They created a language called SEQUEL (Structured English Query Language) to express operations on relational data. It soon evolved into SQL after a trademark issue forced a name change, but the core concept stayed the same: describe what you want, not how to get it.
The key to SQL’s success was abstraction: it hides physical storage and indexing details so that users can think in terms of relations, rows, and columns rather than file structures.
Over the following decades, SQL became the foundation of nearly every relational database system. It went through several standardization phases and vendor interpretations, but its declarative nature and strong theoretical base made it remarkably durable. Even with the rise of NoSQL and newer paradigms, SQL remains the most widely used database language in the world.
From SEQUEL to SQL-92/99/2003/2016/2023
The evolution of SQL can be traced through its major standard releases, each adding features to reflect the changing needs of applications and hardware over time.
Early development (1970s–1980s)
IBM’s first prototypes, System R and SQL/DS, demonstrated that the relational model could outperform file-based systems in flexibility and data integrity. In 1979, Oracle released the first commercial implementation of SQL, followed by IBM’s DB2 and other systems throughout the 1980s. During this period, the core commands (SELECT, INSERT, UPDATE, DELETE, and CREATE) were firmly established.
SQL-86 and SQL-89
The first official standard, SQL-86, was published by ANSI and ISO to provide a common foundation. SQL-89 refined syntax and clarified behaviors but added little new functionality. These versions unified the core vocabulary and helped different vendors align basic behavior.
SQL-92 (SQL2)
SQL-92 marked a huge expansion. It added set operations, CASE expressions, more join types, schema definitions, and stricter data typing rules. Compliance levels (Entry, Intermediate, and Full) were introduced so vendors could claim partial compatibility while adding proprietary features.
SQL:1999 (SQL3)
This version introduced procedural and object-oriented extensions. Features such as recursive queries (WITH RECURSIVE), triggers, user-defined types, and large objects (BLOB, CLOB) made SQL capable of expressing more complex business logic directly in the database. It also introduced BOOLEAN types and regular expression-like pattern matching.
SQL:2003 to SQL:2016
SQL:2003 added XML integration, MERGE statements, and the window function framework. Later standards (SQL:2008, SQL:2011, SQL:2016) refined analytic functions, introduced temporal tables for tracking changes over time, and standardized JSON operations.
SQL:2023 and beyond
The most recent updates include further JSON handling, improvements to polymorphic table functions, and clarifications for modern distributed and cloud databases. Despite decades of growth, the core philosophy remains intact: maintain data integrity, express queries declaratively, and ensure forward compatibility.
ANSI and ISO standards
SQL is governed by two standard bodies. The American National Standards Institute (ANSI) maintains the U.S. standard, and the International Organization for Standardization (ISO) manages the global equivalent. In practice, the two collaborate, and the published standard documents are nearly identical. Each revision is identified by the release year, for example, SQL:2016 or SQL:2023.
Structure of the standard
The standard is divided into numbered parts, each covering an area such as core language, data definition, triggers, or external bindings. The complete text runs to thousands of pages and defines precise behavior for every keyword and clause.
Levels of conformance
Vendors can claim conformance to different levels of the standard. Core conformance covers basic SELECT, INSERT, UPDATE, and DELETE. Higher levels add support for advanced types, recursive queries, and temporal data. This layered structure encourages adoption without forcing every feature at once.
Vendor dialects and cross-compatibility
Although all major relational databases implement SQL, each adds its own syntax extensions, data types, and functions. These variations (often called dialects) arose from competition and innovation. For example, LIMIT in MySQL and PostgreSQL corresponds to TOP in SQL Server, while Oracle uses FETCH FIRST n ROWS ONLY.
Common dialect families
| Vendor | Common Extension Example |
| MySQL | AUTO_INCREMENT columns |
| PostgreSQL | SERIAL and RETURNING clauses |
| SQL Server | TOP n in SELECT |
| Oracle | DUAL table and ROWNUM filtering |
| SQLite | Dynamic typing with relaxed constraints |
Portable practices
To keep your code portable, rely on standard features unless a vendor-specific function is essential. Always consult documentation for type names, string concatenation operators, and date functions, which often differ between engines.
Why dialects persist
Vendors compete by offering extra capabilities such as procedural extensions, optimized data types, integration with cloud services, and special indexes. These differences drive innovation, and many features later migrate into future SQL standards. The result is a rich but sometimes inconsistent ecosystem.
Understanding the shared foundations and differences among dialects helps you write clear, portable queries while still taking advantage of each system’s strengths.
Chapter 3: Relational Database Concepts
At the heart of every relational database lies a simple yet powerful model: data organized into related tables. Each table represents one subject; each row represents one record; and relationships between tables preserve logical connections. This chapter introduces the essential components of the relational model (tables, keys, normalization, constraints, and referential integrity) that make relational databases consistent and reliable.
Tables, rows, columns, keys, and relationships
Data in a relational database is stored in tables. Each table consists of rows (records) and columns (fields). Rows hold the actual data values, while columns define the data’s structure and constraints. A table’s design defines what kind of data is allowed and how it connects to other tables.
Tables and columns
Every table should describe one type of entity. Columns describe attributes of that entity, such as a person’s name or an order’s date. Each column has a data type that restricts the kind of values it can hold.
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
email TEXT UNIQUE,
joined_on DATE DEFAULT CURRENT_DATE
);
This table represents customers. Each column has a purpose: customer_id uniquely identifies the row, name is required, email must be unique, and joined_on defaults to the current date.
Rows and relationships
Each row is a single record, and relationships connect rows in one table to rows in another. By linking tables with keys, you can query data across multiple subjects without duplicating information.
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER NOT NULL,
order_date DATE NOT NULL,
total NUMERIC NOT NULL,
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
Here, orders refers to customers using customer_id. The relationship is one-to-many: one customer can have many orders, but each order belongs to a single customer.
Normalization and denormalization
Normalization is the process of organizing data to reduce redundancy and improve integrity. It divides large, repetitive tables into smaller, related ones. Each table focuses on a single subject, and relationships connect them. The goal is to avoid anomalies when inserting, updating, or deleting data.
Normal forms
Several “normal forms” define stages of improvement. Each form builds upon the previous one. The first three are most commonly applied in practical design.
- First Normal Form (1NF): Each column holds atomic (indivisible) values, and each row is unique.
- Second Normal Form (2NF): Every non-key column depends on the entire primary key, not just part of it.
- Third Normal Form (3NF): No non-key column depends on another non-key column.
-- Bad design: repeated address fields cause redundancy
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name TEXT,
dept_name TEXT,
dept_addr TEXT
);
-- Better design: separate departments into their own table
CREATE TABLE departments (
dept_id INTEGER PRIMARY KEY,
dept_name TEXT,
dept_addr TEXT
);
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name TEXT,
dept_id INTEGER REFERENCES departments(dept_id)
);
Denormalization
Denormalization deliberately combines related data for performance or simplicity. It reduces joins by storing precomputed or redundant information. While this can speed up queries, it increases the chance of inconsistencies if data changes in one place but not another.
For example, an orders table might include a customer_name column to avoid a join in frequent reports. If a customer changes their name, all affected orders must be updated to keep data correct.
Primary vs foreign keys
Keys are special columns (or sets of columns) that define relationships and uniqueness. The primary key uniquely identifies each row in its table. A foreign key references a primary key in another table, establishing a link between the two.
Primary keys
A primary key ensures that each record is distinct. It can be a single column or a combination. A good primary key never changes.
CREATE TABLE products (
product_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
price NUMERIC
);
Each product has a unique product_id. The database will reject any attempt to insert a duplicate.
Foreign keys
A foreign key defines a relationship between two tables by referencing another table’s primary key. It enforces consistency between related records.
CREATE TABLE order_items (
order_id INTEGER NOT NULL,
product_id INTEGER NOT NULL,
quantity INTEGER NOT NULL,
FOREIGN KEY (order_id) REFERENCES orders(order_id),
FOREIGN KEY (product_id) REFERENCES products(product_id),
PRIMARY KEY (order_id, product_id)
);
Here, each item belongs to both an order and a product. The composite key (order_id, product_id) prevents duplicate entries for the same product in the same order.
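The duplicate-prevention behavior of the composite key can be demonstrated directly. This sketch uses Python's standard sqlite3 module and omits the foreign keys so it stays self-contained; the IDs and quantities are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE order_items (
    order_id INTEGER NOT NULL,
    product_id INTEGER NOT NULL,
    quantity INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
)
""")
conn.execute("INSERT INTO order_items VALUES (1, 100, 2)")
conn.execute("INSERT INTO order_items VALUES (1, 101, 1)")  # same order, different product: allowed

# Re-inserting the same (order_id, product_id) pair violates the composite key.
try:
    conn.execute("INSERT INTO order_items VALUES (1, 100, 5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```

Only the pair as a whole must be unique; either column alone may repeat freely.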
Constraints
Constraints are rules that protect data integrity. They define what values are allowed, ensure required data is present, and maintain logical consistency. Most databases enforce constraints automatically at insert or update time.
Common constraint types
- PRIMARY KEY: Ensures unique identification of each row.
- FOREIGN KEY: Links tables and enforces referential integrity.
- UNIQUE: Prevents duplicate values in a column.
- NOT NULL: Requires a value for the column.
- CHECK: Validates data against a condition.
- DEFAULT: Supplies a fallback value when none is provided.
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
salary NUMERIC CHECK (salary >= 0),
hired_on DATE DEFAULT CURRENT_DATE
);
If you try to insert a negative salary or omit the name, the database will reject the statement.
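Both rejections can be observed from a driver. This sketch uses Python's standard sqlite3 module against the employees table above; the names and salaries are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE employees (
    emp_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    salary NUMERIC CHECK (salary >= 0)
)
""")
conn.execute("INSERT INTO employees VALUES (1, 'Ada', 52000)")  # valid row

# A negative salary violates the CHECK constraint.
try:
    conn.execute("INSERT INTO employees VALUES (2, 'Bob', -100)")
    check_rejected = False
except sqlite3.IntegrityError:
    check_rejected = True

# Omitting the name violates NOT NULL.
try:
    conn.execute("INSERT INTO employees (emp_id, salary) VALUES (3, 50)")
    null_rejected = False
except sqlite3.IntegrityError:
    null_rejected = True

print(check_rejected, null_rejected)  # True True
```

In each case the entire statement fails and the table is left unchanged, so only the one valid row remains.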
Deferrable and cascading constraints
Some databases allow constraints to be deferred until transaction commit. This lets you perform several related updates that temporarily violate rules but end up consistent when the transaction completes. Cascading actions automatically apply updates or deletions to related rows.
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id)
ON UPDATE CASCADE
ON DELETE SET NULL
);
If a customer’s ID changes, related orders update automatically. If the customer is deleted, their orders remain but the customer_id becomes NULL.
Referential integrity
Referential integrity ensures that relationships between tables remain valid. Every foreign key must match an existing primary key or be NULL. This rule prevents orphaned records and maintains consistency across tables.
How it works
When you insert or update data, the database checks foreign keys to confirm that referenced rows exist. When you delete data, it checks for dependent rows that might be affected.
-- Insert succeeds only if the customer_id exists
INSERT INTO orders (order_id, customer_id, order_date, total)
VALUES (2001, 1, '2025-11-02', 199.99);
Cascading effects
To preserve referential integrity automatically, most systems support cascading actions. You can specify what should happen when a referenced row is changed or deleted:
- ON UPDATE CASCADE – Update foreign keys when the parent key changes.
- ON DELETE CASCADE – Delete dependent rows when the parent is deleted.
- ON DELETE SET NULL – Set the foreign key to NULL if the parent is deleted.
- ON DELETE RESTRICT – Prevent deletion if dependent rows exist.
CREATE TABLE order_items (
order_id INTEGER REFERENCES orders(order_id) ON DELETE CASCADE,
product_id INTEGER REFERENCES products(product_id),
quantity INTEGER NOT NULL
);
Use caution with ON DELETE CASCADE. It can remove large amounts of data unintentionally if parent rows are deleted without proper checks.
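The cascade can be watched in action. This sketch uses Python's standard sqlite3 module with a pared-down version of the schema above; note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`, and the IDs are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when this is enabled
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY);
CREATE TABLE order_items (
    order_id INTEGER REFERENCES orders(order_id) ON DELETE CASCADE,
    product_id INTEGER,
    quantity INTEGER NOT NULL
);
INSERT INTO orders VALUES (1);
INSERT INTO order_items VALUES (1, 100, 2), (1, 101, 1);
""")

# Deleting the parent order removes its items automatically.
conn.execute("DELETE FROM orders WHERE order_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM order_items").fetchone()[0]
print(remaining)  # 0
```

One DELETE on the parent silently removed both child rows, which is exactly why the warning above applies.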
Maintaining integrity over time
Referential integrity protects against data drift as applications evolve. Even if multiple programs or services write to the same database, properly defined keys and constraints ensure that the structure remains sound.
The principles of tables, keys, constraints, and relationships form the backbone of all relational databases. Mastering them will make every future SQL statement more reliable and every schema easier to maintain.
Chapter 4: Setting Up Your Environment
Before you can start writing and running SQL statements, you need a working database environment. Most systems today use one of a few major engines (MySQL, PostgreSQL, SQLite, or SQL Server) and all of them can be installed quickly on a local machine. This chapter walks through installing a database, connecting from both command-line and graphical tools, and creating your first database and table.
You do not need powerful hardware to learn SQL. Even a small laptop can host lightweight databases such as SQLite or a Docker-based PostgreSQL image.
Installing and connecting to a server
Every database system provides its own installer or package. Choose one that matches your platform and preferred workflow. MySQL and PostgreSQL run as background services, while SQLite is an embedded library that stores data in a single file.
Installing MySQL
MySQL can be downloaded from the official Oracle distribution or installed with a package manager:
# Ubuntu or Debian
sudo apt install mysql-server
# macOS (Homebrew)
brew install mysql
Once installed, start the service and connect to it:
sudo service mysql start
mysql -u root -p
The -u flag specifies a username and -p prompts for a password. After connecting, you can create databases and users directly from the prompt.
Installing PostgreSQL
PostgreSQL offers similar installation methods. On most systems you can use the default package manager:
# Ubuntu or Debian
sudo apt install postgresql
# macOS (Homebrew)
brew install postgresql
To connect, use the psql client:
sudo service postgresql start
psql -U postgres
Inside psql you can execute SQL statements directly. Commands starting with a backslash (like \l or \d) control the session rather than sending SQL to the server.
The default administrative account is postgres; you will often create your own user with limited privileges for daily work.
Using SQLite
SQLite is built into many operating systems and requires no setup beyond downloading a small binary if it is not already installed. To start an interactive session, simply run:
sqlite3 mydata.db
This creates (or opens) a file called mydata.db. All data lives inside this file. SQLite is ideal for experiments, testing, or small applications that do not require a networked server.
Using CLI and GUI tools
Every major database comes with a command-line client, and many also offer graphical tools for browsing and editing data. Both approaches are valuable: the CLI is fast and scriptable, while GUIs provide context and visualization.
Command-line clients
The command line is the quickest way to execute SQL. Once connected, you type statements and press Enter. A semicolon (;) marks the end of a statement. For example:
mysql> SHOW DATABASES;
psql> \l
sqlite> .tables
These commands list databases or tables depending on the system. You can also redirect input from a file to run a complete script:
mysql -u user -p < setup.sql
Keep reusable .sql files for schema creation and testing. They can be version-controlled and reused on new systems.
Graphical tools
If you prefer a visual interface, there are many free and commercial clients available:
- MySQL Workbench – Official graphical client for MySQL.
- pgAdmin – The primary management tool for PostgreSQL.
- SQL Server Management Studio – Used for Microsoft SQL Server.
- DBeaver and HeidiSQL – Popular cross-platform options supporting many databases.
Most GUI tools let you browse tables, run queries, and export data in CSV or JSON formats. They are convenient for learning and for checking results of your SQL scripts.
Even when working in a GUI, review the SQL it executes, especially when performing changes that affect data integrity.
Creating your first database and table
After installing your chosen system, you can create your first database and a simple table. The syntax differs slightly between engines, but the core idea is the same: define a database, connect to it, and create a table with columns and constraints.
Creating a database
-- MySQL and PostgreSQL
CREATE DATABASE sampledb;
-- SQLite (creates a file automatically)
sqlite3 sampledb.db
Once created, connect to the new database:
USE sampledb; -- MySQL
\c sampledb -- PostgreSQL
Creating a table
CREATE TABLE users (
user_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
email TEXT UNIQUE,
joined DATE DEFAULT CURRENT_DATE
);
This table stores user information. Try inserting and selecting rows:
INSERT INTO users (user_id, name, email)
VALUES (1, 'Alice', 'alice@example.com');
SELECT * FROM users;
The output confirms that your database and table are working correctly. You now have a functioning environment for running every example in this book.
With the tools installed and your first database ready, you can begin learning the language itself—starting with how to query and retrieve data using SELECT statements in the next chapter.
Chapter 5: Data Types and Storage
Every piece of data in a relational database has a defined type that determines what values it can hold and how they are stored internally. Choosing the correct types improves precision, saves space, and helps the database process queries efficiently. This chapter introduces the major categories of SQL data types (numeric, string, date/time, and special types) and explains how to select the right ones for your tables.
Numeric, string, date/time, and special types
The SQL standard defines several broad families of data types. Each database vendor implements them with small variations, but the concepts remain the same. The four most common categories are numeric, character, temporal, and special-purpose types.
Numeric types
Numeric types represent numbers used in calculations, identifiers, and measurements. They can be exact or approximate.
- INTEGER (or INT): Whole numbers, typically 32-bit.
- BIGINT: Larger integers for identifiers and counters.
- DECIMAL(p,s) or NUMERIC(p,s): Exact fixed-point values with precision p and scale s.
- FLOAT or REAL: Approximate floating-point numbers for scientific or statistical data.
CREATE TABLE products (
id INTEGER PRIMARY KEY,
name TEXT,
price DECIMAL(8,2),
quantity INT
);
Use DECIMAL or NUMERIC for currency values to avoid rounding errors, and integer types for counts or IDs.
Never store monetary values in FLOAT.
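The rounding problem is easy to reproduce. This short sketch uses Python's standard decimal module to contrast binary floating point with exact decimal arithmetic; the amounts are invented.

```python
from decimal import Decimal

# Ten payments of 0.10 should total exactly 1.00.
float_total = sum([0.10] * 10)                           # binary float: accumulates error
exact_total = sum([Decimal("0.10")] * 10, Decimal("0"))  # exact decimal arithmetic

print(float_total == 1.0)              # False: the float sum drifts from 1.0
print(exact_total == Decimal("1.00"))  # True
```

The same drift occurs inside a database that stores currency in FLOAT or REAL, which is why DECIMAL and NUMERIC exist.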
String types
Character data types store text, identifiers, and other symbolic data. The main distinction is between fixed and variable length.
- CHAR(n): Fixed-length text padded with spaces if shorter than n.
- VARCHAR(n): Variable-length text up to n characters.
- TEXT or CLOB: Large, unbounded text values (nonstandard length).
CREATE TABLE customers (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(200) UNIQUE
);
Choose a length that fits your data comfortably without wasting space. VARCHAR is the usual default for user input and short strings.
When choosing VARCHAR limits, base them on expected real data rather than arbitrary round numbers. This makes validation easier and signals intent to future developers.
Date and time types
Temporal types represent calendar dates, times of day, or combined timestamps. They are crucial for tracking when events occur and for performing time-based queries.
- DATE: Year, month, and day.
- TIME: Time of day (optionally with fractional seconds).
- TIMESTAMP: Combined date and time.
- INTERVAL: Duration (such as “3 days” or “2 hours”).
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
order_date DATE DEFAULT CURRENT_DATE,
shipped_at TIMESTAMP
);
Use date functions like NOW(), CURRENT_DATE, and AGE() (PostgreSQL) to work with temporal values. Storing times in UTC makes it easier to convert and compare across regions.
Special types
Modern SQL engines support additional specialized types for structured and binary data:
- BOOLEAN: Logical true/false values.
- BLOB: Binary large object, for files or images.
- JSON: Structured documents in JavaScript Object Notation.
- UUID: Universally unique identifiers.
- ARRAY (in PostgreSQL and others): Lists of elements of the same type.
CREATE TABLE settings (
id INTEGER PRIMARY KEY,
active BOOLEAN DEFAULT TRUE,
config JSON,
file BLOB
);
Special types allow you to store complex or non-tabular data without breaking the relational model. However, they should be used carefully to avoid portability issues between different database engines.
JSON support varies widely. Some systems store it as text, while others use binary formats and allow direct indexing. Write portable code by using JSON for flexible metadata, not core relational data.
Choosing correct types for efficiency and precision
Efficient database design depends as much on choosing the right types as on writing good queries. Smaller and more precise types save space, speed up joins, and make indexing faster.
Match type to purpose
Choose the simplest type that correctly expresses the data’s meaning. If a value is always whole, use an integer rather than a floating point. If a value can be absent, decide whether NULL is acceptable or if a default should apply.
-- Better: explicit types and constraints
CREATE TABLE inventory (
item_id INTEGER PRIMARY KEY,
quantity SMALLINT CHECK (quantity >= 0),
restocked DATE DEFAULT CURRENT_DATE
);
Consider storage and indexing
Larger data types consume more disk space and can slow queries. For instance, using BIGINT where INTEGER would suffice doubles the index size without benefit. Similarly, overly large VARCHAR columns can waste memory and disk bandwidth.
Precision and scale in decimals
When defining DECIMAL(p,s) or NUMERIC(p,s), choose precision and scale that fit realistic data ranges. For currency, two decimal places usually suffice. Overly large precision can slow arithmetic operations.
-- Price up to 999999.99
price DECIMAL(8,2)
This definition stores up to six digits before the decimal point and two after, using an exact decimal representation rather than approximate binary floating point.
Use defaults wisely
Defaults supply values automatically when none are provided, reducing repetitive input. Use them for predictable data such as creation timestamps, flags, or counters.
CREATE TABLE logs (
log_id INTEGER PRIMARY KEY,
created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
active BOOLEAN DEFAULT TRUE
);
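A sketch of defaults in action, using Python's standard sqlite3 module with the logs table above (SQLite 3.23 or later accepts the TRUE literal): the INSERT names only log_id, and the engine fills in the rest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE logs (
    log_id INTEGER PRIMARY KEY,
    created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    active BOOLEAN DEFAULT TRUE
)
""")
conn.execute("INSERT INTO logs (log_id) VALUES (1)")  # omit created and active

created, active = conn.execute(
    "SELECT created, active FROM logs WHERE log_id = 1"
).fetchone()
print(active)                # 1 (SQLite stores TRUE as the integer 1)
print(created is not None)   # True: the timestamp was filled in automatically
```

Columns with sensible defaults keep INSERT statements short and ensure rows are initialized the same way everywhere.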
NULL instead of substituting a placeholder.
Portability considerations
Different database engines have slightly different names or behaviors for equivalent types. Here are some common equivalents:
| Concept | ANSI SQL | MySQL | PostgreSQL | SQLite |
| Auto-increment integer | GENERATED AS IDENTITY | AUTO_INCREMENT | SERIAL | INTEGER PRIMARY KEY |
| Fixed text | CHAR(n) | CHAR(n) | CHAR(n) | TEXT |
| Variable text | VARCHAR(n) | VARCHAR(n) | VARCHAR(n) | TEXT |
| Date/time | TIMESTAMP | DATETIME | TIMESTAMP | TEXT (ISO format) |
Understanding these differences will help you design schemas that are easy to port between systems or share across teams using different engines.
By selecting appropriate data types and storage options, you lay the foundation for efficient, maintainable databases that behave predictably under any workload.
Chapter 6: Designing Tables
Designing tables is one of the most important steps in building a relational database. A good table design enforces clear structure, minimizes redundancy, and provides a foundation for reliable queries. The CREATE TABLE statement defines the structure of each table—its columns, data types, and integrity rules. This chapter explores the statement in depth, explains how constraints ensure valid data, and shows how to create automatically incrementing identifiers.
The CREATE TABLE statement in depth
The CREATE TABLE statement defines a new table and its structure. You specify column names, data types, optional constraints, and sometimes table-level rules. Here is a simple but complete example:
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name VARCHAR(100) NOT NULL,
position VARCHAR(50),
salary DECIMAL(10,2) CHECK (salary >= 0),
hired_on DATE DEFAULT CURRENT_DATE
);
This statement creates a table with five columns, assigns types, and includes constraints to ensure data validity. The database engine records this definition in its system catalog and enforces the rules automatically.
Basic structure
The general form of CREATE TABLE is:
CREATE TABLE table_name (
column_name data_type [column_constraints],
...,
[table_constraints]
);
Square brackets indicate optional elements. Column-level constraints apply only to one column. Table-level constraints can refer to multiple columns.
Table options
Many databases allow additional options such as specifying storage engines, schema names, or table inheritance (in PostgreSQL). For example:
CREATE TABLE sales.orders (
order_id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL,
total DECIMAL(8,2)
);
Here, the table resides in the sales schema rather than the default one. The exact syntax for such options differs among engines, but the core concept remains consistent.
Some engines add nonstandard table options such as PARTITION BY or TABLESPACE. Stick to standard clauses for maximum portability unless you need vendor-specific features.
Constraints
Constraints define the logical rules that keep data valid. They act as safeguards inside the database engine, preventing bad data from ever being stored. Constraints can be applied per column or across multiple columns at once.
NOT NULL
NOT NULL requires a value in the column. Use it for fields that are always mandatory, such as names, IDs, or timestamps.
CREATE TABLE users (
user_id INTEGER PRIMARY KEY,
username VARCHAR(50) NOT NULL,
email VARCHAR(100)
);
Here, username must always contain a value, but email may be NULL.
DEFAULT
DEFAULT assigns a value when none is provided in an INSERT statement. It can be a constant or an expression supported by the database.
CREATE TABLE logs (
id INTEGER PRIMARY KEY,
created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
active BOOLEAN DEFAULT TRUE
);
Defaults simplify inserts and maintain consistent initialization of rows.
CHECK
CHECK validates that column values satisfy a logical condition. If the condition fails, the statement is rejected.
CREATE TABLE accounts (
account_id INTEGER PRIMARY KEY,
balance DECIMAL(10,2) CHECK (balance >= 0)
);
Here, the database will not allow negative balances. You can also place CHECK constraints at the table level to involve multiple columns.
Some engines historically parsed but did not enforce CHECK constraints (MySQL did this before version 8.0.16). Always verify your system enforces them correctly.
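To see a CHECK constraint reject bad data, here is a minimal sketch using Python's stdlib sqlite3 module (SQLite enforces CHECK; the table and values are illustrative, not from a real schema):

```python
import sqlite3

# Create an in-memory database with a CHECK constraint on balance.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        balance    REAL CHECK (balance >= 0)
    )
""")
conn.execute("INSERT INTO accounts (account_id, balance) VALUES (1, 100.0)")

# A negative balance violates the constraint; the engine rejects the row.
try:
    conn.execute("INSERT INTO accounts (account_id, balance) VALUES (2, -5.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True — the bad row was never stored
```

The invalid insert raises an error and the table still holds only the one valid row.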
UNIQUE
UNIQUE ensures that all values in a column (or combination of columns) are distinct. It can be used alongside a PRIMARY KEY or independently.
CREATE TABLE customers (
id INTEGER PRIMARY KEY,
email VARCHAR(200) UNIQUE,
phone VARCHAR(20)
);
This rule prevents two customers from sharing the same email address.
PRIMARY KEY
The PRIMARY KEY constraint uniquely identifies each row. It can consist of one or more columns. A table can have only one primary key definition.
CREATE TABLE products (
product_id INTEGER,
sku VARCHAR(20),
name TEXT,
PRIMARY KEY (product_id)
);
When a primary key is defined, the database automatically enforces uniqueness and disallows NULL values in the key columns.
FOREIGN KEY
A FOREIGN KEY establishes a relationship between two tables. It ensures that each referenced value exists in the parent table, preserving referential integrity.
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER REFERENCES customers(id)
ON DELETE CASCADE
ON UPDATE RESTRICT
);
Here, if a customer is deleted, their orders will be removed automatically because of ON DELETE CASCADE. The ON UPDATE RESTRICT clause prevents changing a customer ID that still has related orders.
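A quick way to watch ON DELETE CASCADE work is this sketch with Python's sqlite3 module (note that SQLite requires the foreign_keys pragma to be switched on per connection; the tables are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK checks by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id) ON DELETE CASCADE
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (10, 1)")
conn.execute("INSERT INTO orders VALUES (11, 1)")

# Deleting the parent row cascades to its child rows.
conn.execute("DELETE FROM customers WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(remaining)  # 0 — both orders were removed automatically
```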
Auto-increment / identity columns
Most tables need a unique numeric key that increases automatically. Rather than calculating it manually, you can use an auto-increment or identity column. This feature generates a new value for each inserted row.
MySQL
CREATE TABLE employees (
emp_id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100)
);
The AUTO_INCREMENT keyword tells MySQL to assign the next available integer automatically. Values remain unique even when rows are deleted.
PostgreSQL
PostgreSQL traditionally used the SERIAL pseudo-type and now supports the standard GENERATED AS IDENTITY clause:
CREATE TABLE employees (
emp_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name VARCHAR(100)
);
This approach conforms to the SQL standard and provides better control over sequence options.
SQLite
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name TEXT
);
In SQLite, a column defined as INTEGER PRIMARY KEY automatically acts as an alias for the internal row identifier. Omitted values are auto-assigned, typically one greater than the largest existing rowid; add the AUTOINCREMENT keyword if deleted values must never be reused.
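This behavior is easy to confirm with a short Python sketch against an in-memory SQLite database (column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT)")

# emp_id is omitted, so SQLite assigns the next rowid automatically.
cur = conn.execute("INSERT INTO employees (name) VALUES ('Alice')")
first_id = cur.lastrowid
cur = conn.execute("INSERT INTO employees (name) VALUES ('Bob')")
second_id = cur.lastrowid

print(first_id, second_id)  # 1 2
```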
SQL Server
CREATE TABLE employees (
emp_id INT IDENTITY(1,1) PRIMARY KEY,
name NVARCHAR(100)
);
The parameters in IDENTITY(1,1) specify the starting value and increment. The first inserted row receives 1, the next 2, and so on.
Auto-generated keys are convenient, efficient, and widely supported. Combined with proper constraints, they give every table a reliable way to identify its rows.
By mastering CREATE TABLE and constraints, you can build schemas that maintain data integrity and make future queries predictable and efficient.
Chapter 7: Altering and Dropping Structures
Once a database is in use, structures evolve. You add columns, rename objects, create or remove indexes, and carefully roll out changes so applications continue to run. This chapter covers the core statements for structural change; safe patterns for index management; and practical guidance for schema migrations.
ALTER TABLE, RENAME, and DROP TABLE
ALTER TABLE changes an existing table without recreating it. Typical operations include adding or dropping columns, changing data types, setting or removing constraints, and renaming columns or the table itself.
Common ALTER TABLE patterns
-- Add a nullable column with a default (ANSI-style)
ALTER TABLE accounts
ADD COLUMN last_login TIMESTAMP DEFAULT CURRENT_TIMESTAMP;
-- Backfill and then enforce NOT NULL (safer two-step)
UPDATE accounts SET last_login = CURRENT_TIMESTAMP WHERE last_login IS NULL;
ALTER TABLE accounts
ALTER COLUMN last_login SET NOT NULL; -- PostgreSQL
-- MySQL
ALTER TABLE accounts
MODIFY last_login TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;
-- SQL Server
ALTER TABLE accounts
ALTER COLUMN last_login DATETIME2 NOT NULL;
-- Rename a column
-- PostgreSQL / SQLite
ALTER TABLE accounts RENAME COLUMN last_login TO last_seen;
-- MySQL 8+
ALTER TABLE accounts RENAME COLUMN last_login TO last_seen;
-- SQL Server
EXEC sp_rename 'dbo.accounts.last_login', 'last_seen', 'COLUMN';
-- Change a column type (with USING when needed)
-- PostgreSQL
ALTER TABLE orders
ALTER COLUMN total TYPE NUMERIC(12,2)
USING total::NUMERIC(12,2);
-- Add a CHECK constraint
ALTER TABLE accounts
ADD CONSTRAINT chk_age CHECK (age >= 13);
-- Drop a constraint (name varies by engine)
-- PostgreSQL
ALTER TABLE accounts DROP CONSTRAINT chk_age;
-- MySQL (checks are named automatically; use SHOW CREATE TABLE to find the name)
-- SQL Server
ALTER TABLE accounts DROP CONSTRAINT chk_age;
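The add-then-backfill pattern above can be exercised end to end in SQLite via Python's sqlite3 module. This is a sketch only: SQLite cannot add NOT NULL to an existing column without rebuilding the table, so the final enforcement step is left to the engines that support it; table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO accounts (username) VALUES ('jdoe'), ('asmith')")

# Step 1: add the column as nullable, so existing rows are unaffected.
conn.execute("ALTER TABLE accounts ADD COLUMN last_login TEXT")

# Step 2: backfill every existing row before tightening constraints.
conn.execute(
    "UPDATE accounts SET last_login = datetime('now') WHERE last_login IS NULL"
)

nulls = conn.execute(
    "SELECT COUNT(*) FROM accounts WHERE last_login IS NULL"
).fetchone()[0]
print(nulls)  # 0 — safe to enforce NOT NULL on engines that allow it
```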
Renaming tables and columns with RENAME
Most engines provide a simple rename that preserves data and constraints.
-- Rename a table
-- PostgreSQL / SQLite
ALTER TABLE session RENAME TO session_archive;
-- MySQL
RENAME TABLE session TO session_archive;
-- SQL Server
EXEC sp_rename 'dbo.session', 'session_archive';
Dropping tables with DROP TABLE
Dropping a table removes its data and metadata. Use conditional clauses to avoid errors if the table may not exist, and use dependency options carefully.
-- Safer: avoid errors if missing
DROP TABLE IF EXISTS temp_import;
-- Handle dependencies
-- PostgreSQL: CASCADE drops dependent objects; RESTRICT fails if any exist
DROP TABLE report CASCADE;
DROP TABLE report RESTRICT;
-- MySQL: CASCADE/RESTRICT are parsed but not enforced for tables; check FKs
DROP TABLE report;
-- SQL Server
DROP TABLE IF EXISTS dbo.report;
DROP TABLE is irreversible without a backup. Consider ALTER TABLE ... RENAME TO ..._archive for a deprecation period, or export data before removal.
Quick reference: common operations and dialect notes
| Operation | PostgreSQL | MySQL | SQL Server | SQLite |
| Add column | ALTER TABLE ... ADD COLUMN ... | ALTER TABLE ... ADD COLUMN ... | ALTER TABLE ... ADD ... | ALTER TABLE ... ADD COLUMN ... |
| Rename column | ALTER TABLE ... RENAME COLUMN a TO b | ALTER TABLE ... RENAME COLUMN a TO b (8+) | sp_rename 'tbl.a','b','COLUMN' | ALTER TABLE ... RENAME COLUMN a TO b (3.25+) |
| Change type | ALTER COLUMN ... TYPE ... USING ... | MODIFY COLUMN ... | ALTER COLUMN ... | Often requires create-copy-swap |
| Drop constraint | DROP CONSTRAINT name | Drop by altering column/index; FK by name | DROP CONSTRAINT name | Limited; often recreate table |
| Rename table | ALTER TABLE ... RENAME TO ... | RENAME TABLE ... TO ... | sp_rename 'schema.tbl','new' | ALTER TABLE ... RENAME TO ... |
| Drop table | DROP TABLE [IF EXISTS] ... [CASCADE/RESTRICT] | DROP TABLE [IF EXISTS] ... | DROP TABLE IF EXISTS ... | DROP TABLE IF EXISTS ... |
Managing indexes
Indexes accelerate reads by organizing data for fast lookups. Design them for the queries you run; balance read speed against write overhead and storage.
Creating and dropping indexes
-- Basic non-unique index
CREATE INDEX idx_users_email ON users (email);
-- Unique index (enforces no duplicates)
CREATE UNIQUE INDEX uq_users_email ON users (email);
-- Composite (multi-column) index
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
-- Functional / expression index
-- PostgreSQL
CREATE INDEX idx_users_email_lower ON users ((lower(email)));
-- SQLite
CREATE INDEX idx_users_email_lower ON users (lower(email));
-- MySQL (before 8.0.13: use a generated column; 8.0.13+ also supports functional key parts directly)
ALTER TABLE users ADD COLUMN email_lower VARCHAR(255) GENERATED ALWAYS AS (lower(email)) STORED;
CREATE INDEX idx_users_email_lower ON users (email_lower);
-- Include (covering) columns
-- PostgreSQL 11+
CREATE INDEX idx_orders_cover ON orders (customer_id, order_date) INCLUDE (status);
-- SQL Server
CREATE INDEX idx_orders_cover ON dbo.orders (customer_id, order_date) INCLUDE (status);
-- Partial / filtered index
-- PostgreSQL
CREATE INDEX idx_orders_open ON orders (customer_id) WHERE status = 'OPEN';
-- SQL Server
CREATE INDEX idx_orders_open ON dbo.orders (customer_id) WHERE status = 'OPEN';
-- SQLite supports partial indexes with WHERE
CREATE INDEX idx_orders_open ON orders (customer_id) WHERE status = 'OPEN';
-- Drop indexes (syntax differs)
-- PostgreSQL / SQLite
DROP INDEX IF EXISTS idx_users_email;
-- MySQL
DROP INDEX idx_users_email ON users;
-- SQL Server
DROP INDEX idx_users_email ON dbo.users;
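To confirm an index is actually used, every engine can show its query plan. Here is a minimal sketch using SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module (index and table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE UNIQUE INDEX uq_users_email ON users (email)")

# EXPLAIN QUERY PLAN describes how SQLite would execute the statement.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
    ("a@example.com",),
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)  # a SEARCH using uq_users_email, not a full-table SCAN
```

If the output said SCAN instead of SEARCH, the index would not be helping this query.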
Choosing columns and order
- Place the most selective column first for equality lookups; follow with columns used in join predicates and range filters.
- Create one good composite index instead of many overlapping single-column indexes.
- Cover common queries (all needed columns available by index) using include columns where supported; otherwise consider wider composite keys carefully.
- Prefer expression indexes for case-insensitive search, JSON keys, or computed fields (engine support varies).
Rebuilding and concurrent indexing
-- Build an index without blocking writes (engine-specific)
-- PostgreSQL
CREATE INDEX CONCURRENTLY idx_events_created_at ON events (created_at);
-- Rebuild or reorganize to defragment
-- SQL Server
ALTER INDEX ALL ON dbo.orders REORGANIZE; -- light
ALTER INDEX ALL ON dbo.orders REBUILD; -- heavy
Schema migrations
Schema migrations are controlled, versioned changes to database structures applied in sequence. A reliable process reduces risk and keeps environments aligned.
Versioning and tracking
-- A simple migrations ledger
CREATE TABLE schema_migrations (
version VARCHAR(50) PRIMARY KEY,
applied_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Mark a migration as applied
INSERT INTO schema_migrations (version) VALUES ('2025-11-02_1200_add_last_seen');
Each migration is an ordered script with an id (for example 2025-11-02_1200_add_last_seen.sql) that can be applied once. Tools like Flyway and Liquibase manage this automatically; the principle is the same.
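The ledger principle can be sketched as a tiny migration runner in Python with sqlite3. The migration ids and SQL bodies below are illustrative, not from any real project; real tools like Flyway add locking, checksums, and ordering guarantees on top of this idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE schema_migrations (
        version    TEXT PRIMARY KEY,
        applied_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")

# Ordered list of (version, sql) pairs — purely illustrative.
MIGRATIONS = [
    ("001_create_accounts", "CREATE TABLE accounts (id INTEGER PRIMARY KEY)"),
    ("002_add_email", "ALTER TABLE accounts ADD COLUMN email TEXT"),
]

def migrate(conn):
    """Apply every migration whose version is not yet in the ledger."""
    applied = {v for (v,) in conn.execute("SELECT version FROM schema_migrations")}
    for version, sql in MIGRATIONS:
        if version not in applied:
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_migrations (version) VALUES (?)", (version,)
            )

migrate(conn)
migrate(conn)  # idempotent: the second run applies nothing new

count = conn.execute("SELECT COUNT(*) FROM schema_migrations").fetchone()[0]
print(count)  # 2 — each migration recorded exactly once
```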
Designing safe, incremental changes
- Backward compatibility: deploy code that can work with both old and new schema; then migrate the data; then enforce stricter constraints.
- Two-step constraint tightening: add the column as nullable; backfill; set NOT NULL; add foreign keys at the end when possible.
- Online operations: use online/instant algorithms when supported (for example MySQL instant add column; PostgreSQL concurrently built indexes; SQL Server online index operations in suitable editions).
- Roll forward: prefer forward fixes over rollbacks; keep rollback scripts only for small reversible steps (for example dropping a new index).
Example migration set
-- 001_expand_add_column.sql
ALTER TABLE accounts
ADD COLUMN last_seen TIMESTAMP;
-- 002_backfill_last_seen.sql (run in batches in application code or a job)
UPDATE accounts
SET last_seen = GREATEST(last_login, created_at)
WHERE last_seen IS NULL
AND id BETWEEN :min_id AND :max_id;
-- 003_enforce_not_null.sql
-- PostgreSQL
ALTER TABLE accounts ALTER COLUMN last_seen SET NOT NULL;
-- MySQL
ALTER TABLE accounts MODIFY last_seen TIMESTAMP NOT NULL;
-- SQL Server
ALTER TABLE accounts ALTER COLUMN last_seen DATETIME2 NOT NULL;
-- 004_add_index.sql
-- PostgreSQL (concurrent)
CREATE INDEX CONCURRENTLY idx_accounts_last_seen ON accounts (last_seen);
-- MySQL
CREATE INDEX idx_accounts_last_seen ON accounts (last_seen);
-- SQL Server
CREATE INDEX idx_accounts_last_seen ON dbo.accounts (last_seen);
-- 005_contract_drop_old.sql
-- After monitoring period
ALTER TABLE accounts DROP COLUMN last_login;
Blue-green and compatibility layers
For renames that risk breaking code, introduce a compatibility layer:
-- Keep old name available during transition
-- PostgreSQL view
CREATE VIEW session AS SELECT * FROM session_archive;
-- SQL Server synonym
CREATE SYNONYM dbo.session FOR dbo.session_archive;
-- MySQL view
CREATE VIEW session AS SELECT * FROM session_archive;
Structure of a migration file
-- <version>__<summary>.sql
-- 2025-11-02__rename_session_to_session_archive.sql
BEGIN; -- or implicit transaction if the engine handles it
-- … DDL / DML statements here
COMMIT;
Where the engine does not support transactional DDL (or only for some statements), split the migration into safe chunks and ensure idempotency checks are present (for example IF EXISTS, IF NOT EXISTS, or guards in application code).
Some engines commit implicitly around DDL or take heavy locks (for example during ALTER TABLE in some modes). Read engine notes and plan accordingly.
With careful use of ALTER TABLE, disciplined index management, and versioned migrations, you can evolve a schema smoothly while keeping systems available.
Chapter 8: Indexes and Performance Basics
Indexes speed up reads by providing a smaller, ordered structure that the query planner can search quickly; they trade extra storage and slower writes for faster lookups. This chapter explains physical organization (clustered versus non-clustered layouts), composite index design and maintenance settings, and practical guidance on when to index or leave data unindexed.
Clustered vs non-clustered
A clustered organization stores table rows in the same order as a chosen key; a non-clustered index stores a separate structure that points to table rows. Different engines implement this differently.
Engine behaviors at a glance
| Engine | Clustered storage | Notes |
| MySQL (InnoDB) | Yes (primary key defines physical order) | The table is organized by the primary key; secondary indexes store primary key values as row locators. |
| SQL Server | Optional clustered index | One clustered index per table; non-clustered indexes reference the clustered key or a row identifier. |
| PostgreSQL | No permanent clustering | CLUSTER rewrites the table in index order but does not keep it permanently aligned; normal indexes are non-clustered. |
| SQLite | No separate clustered type | Rowids define storage order unless using WITHOUT ROWID; indexes are non-clustered. |
Practical implications
- Range scans: clustered order helps queries that read contiguous key ranges (for example by date).
- Covering lookups: non-clustered indexes can include needed columns to avoid visiting the base table (engine support varies).
- Write cost: maintaining order causes page splits and extra I/O under random insert patterns; bulk loads benefit from pre-sorted input where feasible.
-- SQL Server: create a clustered index on (created_at, id)
CREATE CLUSTERED INDEX cix_orders_created_id
ON dbo.orders (created_at, id);
-- PostgreSQL: opportunistic clustering (one-off rewrite)
CLUSTER orders USING idx_orders_created_at;
-- Later, to refresh the order again:
CLUSTER orders;
-- MySQL (InnoDB): primary key defines clustering
ALTER TABLE orders
ADD PRIMARY KEY (created_at, id);
If most lookups follow a natural composite pattern (for example customer_id then created_at), keep the surrogate primary key for simplicity and add a composite secondary index for query speed.
Composite indexes, fillfactor, and maintenance
Composite indexes store multiple columns in a single index key. The column order matters; the index can support lookups that match a left-most prefix of the key (and sometimes more with advanced checks by the planner).
Designing composite keys
- Put equality filters first, then range filters, then columns used for ordering.
- Prefer one well-chosen composite index over several overlapping single-column indexes.
- Use expression or functional indexes for case-insensitive or computed lookups where supported.
- Use include columns to cover queries without widening the key (SQL Server and PostgreSQL have explicit
INCLUDE).
-- Composite equality then range
CREATE INDEX idx_orders_customer_date
ON orders (customer_id, order_date);
-- Expression index for case-insensitive search
-- PostgreSQL
CREATE INDEX idx_users_email_lower ON users ((lower(email)));
-- SQLite
CREATE INDEX idx_users_email_lower ON users (lower(email));
-- Covering index with include columns
-- PostgreSQL 11+
CREATE INDEX idx_orders_cov ON orders (customer_id, order_date) INCLUDE (status, total);
-- SQL Server
CREATE INDEX idx_orders_cov ON dbo.orders (customer_id, order_date) INCLUDE (status, total);
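The left-most prefix rule can be demonstrated with a short Python sketch against SQLite: a composite index on (customer_id, order_date) serves a customer_id lookup, but a filter on order_date alone cannot seek into it. Table and index names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date  TEXT,
        total       REAL
    )
""")
conn.execute(
    "CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date)"
)

def plan(sql):
    """Return SQLite's query-plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

by_customer = plan("SELECT total FROM orders WHERE customer_id = 7")
by_date = plan("SELECT total FROM orders WHERE order_date = '2025-01-01'")
print(by_customer)  # SEARCH using idx_orders_customer_date
print(by_date)      # SCAN — the index prefix does not match
```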
An index on (a, b) helps queries that filter on a or on a and b. It usually does not help queries that filter only on b. Add a second index only if that pattern is common and material.
Fillfactor (page fullness) and fragmentation
Fillfactor controls how full to pack index pages at build time so that future inserts have room to grow. Lower values reduce page splits at the cost of more disk space.
| Engine | Per-index setting | Example |
| PostgreSQL | Yes | CREATE INDEX ... WITH (fillfactor = 90) |
| SQL Server | Yes | CREATE INDEX ... WITH (FILLFACTOR = 90) |
| MySQL (InnoDB) | Global knob for some operations | Engine manages page fill; per-index setting is not typical in common builds |
| SQLite | No | Engine manages internal layout automatically |
-- PostgreSQL
CREATE INDEX idx_events_created_at
ON events (created_at)
WITH (fillfactor = 90);
-- SQL Server
CREATE INDEX idx_events_created_at
ON dbo.events (created_at)
WITH (FILLFACTOR = 90, PAD_INDEX = ON);
Index maintenance and statistics
- Rebuild versus reorganize: heavy rebuild recreates the index; light reorg defragments leaf pages where supported.
- Statistics: up-to-date statistics inform the planner; autovacuum or auto-stats usually keep these fresh, but manual refresh can help after big changes.
- Concurrent builds: build or rebuild without long blocking where the engine supports it.
-- PostgreSQL
REINDEX INDEX idx_users_email; -- full rebuild
ANALYZE users; -- refresh stats
CREATE INDEX CONCURRENTLY idx_e_c ON events (created_at);
-- SQL Server
ALTER INDEX ALL ON dbo.orders REORGANIZE;
ALTER INDEX ALL ON dbo.orders REBUILD WITH (ONLINE = ON); -- edition dependent
UPDATE STATISTICS dbo.orders WITH FULLSCAN;
-- MySQL
ANALYZE TABLE orders;
OPTIMIZE TABLE orders; -- can rebuild table and indexes (engine specific)
When (and when not) to index
Not every column should be indexed. Good candidates are columns used in frequent search predicates, joins, and orderings where the index avoids scanning large portions of the table. Poor candidates either do not filter much or change so often that the write cost outweighs the read benefit.
Good candidates
- Foreign key columns used in joins (for example orders.customer_id).
- Columns used in equality filters on large tables (for example WHERE email = ?).
- Composite patterns that match common queries (for example (customer_id, created_at) for recent-by-customer listings).
- Columns used for sorting or range scans (for example ORDER BY created_at DESC LIMIT 50).
- Expressions that normalize search (for example lower(email) or JSON key extraction) where the engine supports expression indexes.
-- Typical lookup and ordering needs
CREATE INDEX idx_orders_customer_created
ON orders (customer_id, created_at DESC);
-- Join helper for a foreign key
CREATE INDEX idx_orders_product_id
ON orders (product_id);
Poor candidates
- Columns with very low cardinality (for example a boolean flag) unless paired in a composite index that increases selectivity.
- Small tables that fit in memory or are scanned quickly.
- Columns updated in nearly every write (for example counters that change on each event).
- Leading wildcard patterns that cannot use a normal b-tree (for example LIKE '%term'); consider full-text features instead.
- Highly volatile temporary data that is created and deleted in short cycles.
Measuring impact
- Record query latency and throughput before and after adding an index.
- Inspect plan changes (for example index-only scans versus heap visits; key lookups versus scans).
- Track write latency and storage growth to catch harmful side effects.
-- PostgreSQL: show why the planner chooses a path
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE customer_id = $1
ORDER BY created_at DESC
LIMIT 50;
-- SQL Server
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT TOP (50) *
FROM dbo.orders
WHERE customer_id = @cid
ORDER BY created_at DESC;
With a clear understanding of physical layout, careful composite design, sensible fillfactor choices, and a policy for measuring benefits, you can keep read performance high without over-indexing or harming write throughput.
Chapter 9: Inserting Data
Adding rows to tables is one of the most common database operations. The INSERT family of statements supports single and multiple row inserts, copying data from other tables, and creating new tables directly from query results. This chapter covers the core syntax and useful patterns across engines.
Always list the target columns explicitly in INSERT statements. It prevents silent mismatches when the table structure changes.
INSERT INTO syntax and multi-row inserts
The standard form specifies the target table, a list of columns, and the values to insert. Each value list corresponds to one row.
INSERT INTO employees (first_name, last_name, role, salary)
VALUES ('Alice', 'Green', 'Manager', 90000);
To insert multiple rows in one statement, provide additional value sets separated by commas. This is efficient and supported by all major engines.
INSERT INTO employees (first_name, last_name, role, salary)
VALUES
('Bob', 'White', 'Engineer', 75000),
('Carla', 'Smith', 'Designer', 70000),
('David', 'Brown', 'Engineer', 77000);
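From application code, the same multi-row insert is usually done with a parameterized batch. This sketch uses Python's sqlite3 executemany, which sends one prepared statement per row inside a single transaction (the schema mirrors the example above; data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        emp_id     INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT,
        role       TEXT,
        salary     REAL
    )
""")

rows = [
    ("Bob", "White", "Engineer", 75000),
    ("Carla", "Smith", "Designer", 70000),
    ("David", "Brown", "Engineer", 77000),
]

# The 'with conn' block commits on success and rolls back on error.
with conn:
    conn.executemany(
        "INSERT INTO employees (first_name, last_name, role, salary) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )

count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(count)  # 3
```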
Column omission and defaults
If a column has a default value or allows nulls, you can omit it from the insert list. The database uses the default expression or sets it to NULL.
-- 'created_at' defaults to CURRENT_TIMESTAMP
INSERT INTO accounts (username, email)
VALUES ('jdoe', 'jdoe@example.com');
Defaults apply the same way whether rows arrive from INSERT INTO ... VALUES or from INSERT INTO ... SELECT clauses.
Engine-specific bulk patterns
- PostgreSQL: supports multi-row inserts and COPY for high-volume imports.
- MySQL: supports multi-row inserts and LOAD DATA INFILE.
- SQL Server: supports multi-row inserts and BULK INSERT.
- SQLite: supports multi-row inserts since 3.7.11 and can use transactions for batches.
-- PostgreSQL bulk load (server-side)
COPY employees (first_name, last_name, role, salary)
FROM '/tmp/employees.csv'
WITH (FORMAT csv, HEADER true);
-- MySQL
LOAD DATA INFILE '/tmp/employees.csv'
INTO TABLE employees
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
Inserting from subqueries or SELECT … INTO
You can populate a table using data from another query. This pattern is common for data transformations, archiving, or building summary tables.
Using INSERT INTO ... SELECT
The INSERT INTO ... SELECT form inserts the results of a query into an existing table.
-- Copy all engineers into a new department
INSERT INTO engineers (first_name, last_name, salary)
SELECT first_name, last_name, salary
FROM employees
WHERE role = 'Engineer';
The columns in the SELECT must align in number and type with those in the target list.
Transforming during insert
-- Apply transformations as you insert
INSERT INTO archive_orders (order_id, customer_id, total, archived_at)
SELECT id, customer_id, total, CURRENT_TIMESTAMP
FROM orders
WHERE order_date < CURRENT_DATE - INTERVAL '1 year';
Creating new tables with SELECT ... INTO
Some engines allow you to create a new table directly from a query result. The table is created with column names and types inferred from the query.
-- SQL Server
SELECT customer_id, COUNT(*) AS order_count
INTO customer_summary
FROM orders
GROUP BY customer_id;
-- PostgreSQL and MySQL
CREATE TABLE customer_summary AS
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;
-- SQLite
CREATE TABLE customer_summary AS
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;
SELECT ... INTO creates a physical table only once. If you run it again with the same table name, you get an error unless you drop or truncate first.
Combining inserts with conflict handling
When inserting from a subquery, duplicates can occur. Each engine offers options to handle them gracefully.
-- PostgreSQL: upsert pattern
INSERT INTO users (email, name)
SELECT email, name FROM staging_users
ON CONFLICT (email) DO UPDATE
SET name = EXCLUDED.name;
-- MySQL
INSERT INTO users (email, name)
SELECT email, name FROM staging_users
ON DUPLICATE KEY UPDATE
name = VALUES(name);
-- SQL Server (using MERGE)
MERGE INTO users AS target
USING staging_users AS source
ON target.email = source.email
WHEN MATCHED THEN
UPDATE SET target.name = source.name
WHEN NOT MATCHED THEN
INSERT (email, name) VALUES (source.email, source.name);
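SQLite (3.24+) shares PostgreSQL's ON CONFLICT syntax, which makes the upsert pattern easy to try from Python. A minimal sketch with illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES ('a@example.com', 'Old Name')")

# The second insert collides on email, so the row is updated in place.
conn.execute("""
    INSERT INTO users (email, name) VALUES ('a@example.com', 'New Name')
    ON CONFLICT (email) DO UPDATE SET name = excluded.name
""")

name = conn.execute(
    "SELECT name FROM users WHERE email = 'a@example.com'"
).fetchone()[0]
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count, name)  # 1 New Name — no duplicate row was created
```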
Understanding how to insert efficiently—whether single, multi-row, or from queries—sets the foundation for loading, migrating, and transforming data safely and quickly.
Chapter 10: Updating and Deleting Data
Databases evolve continuously; records change, become obsolete, or must be removed. The UPDATE, DELETE, and TRUNCATE statements perform these operations, and each carries risk if used without precision. This chapter shows best practices, safe filters, and recovery options when changes go wrong.
UPDATE and DELETE best practices
Updates modify data in place; deletes remove it. Both are straightforward, yet dangerous if filters are missing or too broad. Always specify a clear WHERE clause and verify the expected row count.
Basic UPDATE form
UPDATE employees
SET salary = salary * 1.05
WHERE department = 'Engineering';
Multiple columns can be updated in the same statement:
UPDATE products
SET price = price * 0.9,
updated_at = CURRENT_TIMESTAMP
WHERE category = 'Clearance';
Consider audit columns such as updated_by and updated_at. Populate them through triggers or in the same UPDATE statement.
Basic DELETE form
DELETE FROM sessions
WHERE last_active < CURRENT_DATE - INTERVAL '30 days';
Omitting WHERE removes every row from the table, which is almost never intended. Add safety limits in scripts that generate SQL automatically.
DELETE without a filter removes all data but keeps the table definition. Use TRUNCATE only when you are sure you want a complete reset.
Engine-specific batch updates
- PostgreSQL: supports RETURNING to capture changed rows.
- MySQL: supports LIMIT for partial deletes or updates (non-standard).
- SQL Server: supports OUTPUT to capture old and new values.
-- PostgreSQL
UPDATE accounts
SET balance = balance - 50
WHERE id = 123
RETURNING id, balance;
-- SQL Server
UPDATE dbo.accounts
SET balance = balance - 50
OUTPUT deleted.balance AS old_balance, inserted.balance AS new_balance
WHERE id = 123;
-- MySQL
DELETE FROM logs
WHERE created_at < NOW() - INTERVAL 90 DAY
LIMIT 10000; -- avoid long locks
Using WHERE, subqueries, and safeguards
The WHERE clause filters which rows to affect. Combine it with joins or subqueries to match related records precisely.
Targeting related rows
-- Reduce salaries for departments that exceeded budget
UPDATE employees
SET salary = salary * 0.95
WHERE department_id IN (
SELECT id FROM departments WHERE budget_used > budget_limit
);
You can also join directly in engines that support UPDATE ... JOIN or DELETE ... USING syntax.
-- SQL Server
UPDATE e
SET e.salary = e.salary * 0.95
FROM employees e
JOIN departments d ON e.department_id = d.id
WHERE d.budget_used > d.budget_limit;
-- MySQL (JOIN comes before SET)
UPDATE employees e
JOIN departments d ON e.department_id = d.id
SET e.salary = e.salary * 0.95
WHERE d.budget_used > d.budget_limit;
-- PostgreSQL
DELETE FROM orders o
USING customers c
WHERE o.customer_id = c.id
AND c.status = 'INACTIVE';
Verifying before execution
First preview the affected rows with a SELECT that uses the same filter. When certain, replace SELECT with UPDATE or DELETE.
-- Preview what would be deleted
SELECT id, email FROM users
WHERE email LIKE '%@test.invalid';
-- Then safely delete
DELETE FROM users
WHERE email LIKE '%@test.invalid';
Enable client-side safeguards (for example sql_safe_updates=1 in MySQL) to block statements without WHERE or LIMIT.
Transactional safeguards
Use transactions to isolate changes and roll back on error.
BEGIN;
UPDATE inventory
SET quantity = quantity - 1
WHERE product_id = 1001;
DELETE FROM carts
WHERE product_id = 1001
AND user_id = 42;
COMMIT;
Keep transactions short, and always finish with COMMIT or ROLLBACK before stepping away.
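A transaction's safety net is easiest to see when something fails mid-flight. This Python sketch (stdlib sqlite3, illustrative table) simulates an error after an UPDATE and rolls back, leaving the data untouched:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, quantity INTEGER)"
)
conn.execute("INSERT INTO inventory VALUES (1001, 5)")
conn.commit()

try:
    # The UPDATE implicitly opens a transaction in sqlite3's default mode.
    conn.execute(
        "UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 1001"
    )
    raise RuntimeError("simulated failure before COMMIT")
except RuntimeError:
    conn.rollback()  # discard the uncommitted UPDATE

qty = conn.execute(
    "SELECT quantity FROM inventory WHERE product_id = 1001"
).fetchone()[0]
print(qty)  # 5 — the partial change never became visible
```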
Using conditions from other tables
You can base updates or deletes on computed or aggregated values from related tables.
-- Deactivate customers with no orders in the past year
UPDATE customers
SET active = FALSE
WHERE id NOT IN (
SELECT DISTINCT customer_id FROM orders
WHERE order_date > CURRENT_DATE - INTERVAL '1 year'
);
NOT IN can misbehave with nulls. Use NOT EXISTS instead for reliable results.
UPDATE customers c
SET active = FALSE
WHERE NOT EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.id
AND o.order_date > CURRENT_DATE - INTERVAL '1 year'
);
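The NULL pitfall is worth seeing directly: a single NULL in the subquery makes NOT IN return no rows at all, while NOT EXISTS behaves as expected. A sketch with illustrative data via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
# One order has a NULL customer_id — this is the poison value.
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, None)])

not_in = conn.execute("""
    SELECT id FROM customers
    WHERE id NOT IN (SELECT customer_id FROM orders)
""").fetchall()

not_exists = conn.execute("""
    SELECT id FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall()

print(not_in)      # [] — every NOT IN comparison against NULL is unknown
print(not_exists)  # [(2,)] — Grace correctly has no orders
```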
TRUNCATE and restoring from backups
TRUNCATE quickly removes all rows from a table while keeping its structure. It typically resets identity counters and uses less logging than DELETE.
TRUNCATE TABLE logs;
| Engine | Transactional | Resets identity | Notes |
| PostgreSQL | Yes (within transaction) | Yes | Can cascade to related tables with TRUNCATE ... CASCADE. |
| MySQL | No (implicit commit) | Yes | Cannot be rolled back; requires DROP + CREATE privileges. |
| SQL Server | Yes (minimally logged) | Yes | Cannot target tables referenced by foreign keys. |
| SQLite | Emulated via DELETE FROM | Yes | Acts as a fast full delete. |
TRUNCATE is irreversible. It usually skips triggers and does not fire row-level delete events. Always ensure no dependent tables rely on the data being cleared.
Alternatives for safe clearing
-- Use DELETE with WHERE to preserve control
DELETE FROM logs
WHERE created_at < CURRENT_DATE - INTERVAL '90 days';
To reclaim storage space after mass deletions, run engine-specific maintenance such as VACUUM (PostgreSQL, SQLite) or OPTIMIZE TABLE (MySQL).
Restoring deleted or truncated data
Accidental data removal happens. Your ability to recover depends on logging and backup strategies.
- Point-in-time recovery: use write-ahead logs or transaction logs to restore up to the moment before the error.
- Regular full backups: restore the latest backup and replay logs if possible.
- Logical dumps: maintain daily
pg_dump,mysqldump, orBACKUP DATABASEoutputs for selective restore.
-- PostgreSQL (run from the shell)
pg_restore --dbname=production --table=users backup.dump
-- MySQL (run from the shell)
mysql -u root -p production < backup.sql
-- SQL Server
RESTORE DATABASE Production
FROM DISK = 'D:\backups\prod_full.bak'
WITH REPLACE, RECOVERY;
By combining disciplined WHERE filters, short transactions, and reliable backups, you can manage data changes confidently and avoid irreversible mistakes.
Chapter 11: Selecting and Querying Data
The SELECT statement is the heart of SQL. It retrieves and shapes data from one or more tables and can include expressions, filters, sorting, and limits. Every analytic, report, and application query builds on its structure. This chapter covers its syntax, derived columns, and key modifiers for distinctness, limits, and ordering.
SELECT structure and clause order
The typical SELECT query has a predictable layout. Not every clause is required, but the order is fixed by the SQL standard:
SELECT [DISTINCT] column_list
FROM table_name
[JOIN ...]
WHERE condition
GROUP BY grouping_columns
HAVING group_condition
ORDER BY sort_columns
LIMIT number OFFSET start;
Logical evaluation order
Although you write SELECT first, the database engine logically processes the clauses in this sequence:
| Step | Clause | Description |
| 1 | FROM / JOIN | Identify source tables and combine them. |
| 2 | WHERE | Filter rows before grouping. |
| 3 | GROUP BY | Aggregate rows into groups. |
| 4 | HAVING | Filter groups based on aggregate conditions. |
| 5 | SELECT | Produce output columns and expressions. |
| 6 | ORDER BY | Sort the result set. |
| 7 | LIMIT / OFFSET | Return a subset of rows. |
You cannot reference a select-list alias in the WHERE clause because WHERE is evaluated before SELECT. Use the full expression instead, or repeat it in both places if necessary.
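The portable pattern can be checked end to end. This is a small sketch using SQLite through Python's sqlite3 module; the employees table and its data are invented for illustration. It repeats the expression in WHERE (the standard-compliant form) and reuses the alias only in ORDER BY, where every major engine accepts it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ada", 120000), ("Ben", 60000), ("Cy", 90000)])

# Portable form: repeat the expression in WHERE, reuse the alias in ORDER BY.
rows = conn.execute("""
    SELECT name, salary * 2 AS doubled
    FROM employees
    WHERE salary * 2 > 150000      -- expression repeated; alias is not standard here
    ORDER BY doubled DESC          -- alias is fine in ORDER BY
""").fetchall()
print(rows)   # [('Ada', 240000), ('Cy', 180000)]
```

Some engines accept the alias in WHERE as a nonstandard extension, so a query that works locally may fail elsewhere; the repeated-expression form works everywhere.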
Basic example
SELECT first_name, last_name, department, salary
FROM employees
WHERE salary > 80000
ORDER BY department, salary DESC;
Avoid SELECT * in production queries. Listing columns explicitly avoids surprises when new columns are added and reduces I/O.
Column aliases, expressions, and derived columns
Aliases rename columns in the output or simplify references to computed expressions. Use AS for clarity, though it is optional in most engines.
Using aliases
SELECT
first_name AS given,
last_name AS family,
salary * 1.1 AS adjusted_salary
FROM employees;
Aliased columns can make results easier to read and can be reused in the ORDER BY clause.
SELECT
id,
department_id,
salary * 1.1 AS adjusted_salary
FROM employees
ORDER BY adjusted_salary DESC;
Expressions and derived data
SQL expressions allow arithmetic, string concatenation, conditional logic, and type conversion directly in the query.
SELECT
first_name || ' ' || last_name AS full_name,
EXTRACT(YEAR FROM hire_date) AS hire_year,
CASE
WHEN salary >= 100000 THEN 'High'
WHEN salary >= 70000 THEN 'Medium'
ELSE 'Low'
END AS pay_band
FROM employees;
- Use arithmetic for numeric calculations (price * quantity).
- Use functions for transformations (UPPER(name), SUBSTR(email, 1, 10)).
- Use CASE for conditional derivations.
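The expression techniques above can be exercised together. The following sketch runs against SQLite via Python's sqlite3 module; the employees table, names, and salary thresholds are made up for the demo. (SQLite lacks EXTRACT, so the hire-year column is omitted here.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, last_name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("Ada", "Lovelace", 120000), ("Ben", "Hur", 75000), ("Cy", "Young", 50000)])

# String concatenation plus a CASE-derived pay band, all computed in the query.
rows = conn.execute("""
    SELECT first_name || ' ' || last_name AS full_name,
           CASE
               WHEN salary >= 100000 THEN 'High'
               WHEN salary >= 70000  THEN 'Medium'
               ELSE 'Low'
           END AS pay_band
    FROM employees
    ORDER BY salary DESC
""").fetchall()
print(rows)
# [('Ada Lovelace', 'High'), ('Ben Hur', 'Medium'), ('Cy Young', 'Low')]
```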
Table aliases and joins
Short aliases for tables make multi-table queries readable. Use clear, short forms like e for employees.
SELECT e.first_name, e.last_name, d.name AS department
FROM employees e
JOIN departments d ON e.department_id = d.id;
You can even reuse derived data in subqueries or common table expressions (CTEs).
WITH high_earners AS (
SELECT id, first_name, last_name, salary
FROM employees
WHERE salary > 100000
)
SELECT first_name, last_name
FROM high_earners
ORDER BY last_name;
The following modifiers control result uniqueness, size, and ordering. Their use and syntax vary slightly between engines.
DISTINCT
DISTINCT removes duplicate rows from the result set. It applies to the combination of all selected columns.
SELECT DISTINCT department
FROM employees;
SELECT DISTINCT department, role
FROM employees;
DISTINCT can be expensive on large datasets. If possible, filter or aggregate before applying it.
TOP, LIMIT, and OFFSET
These clauses restrict the number of rows returned. Syntax differs among engines.
| Engine | Keyword | Example |
| --- | --- | --- |
| PostgreSQL / MySQL / SQLite | LIMIT / OFFSET | SELECT * FROM employees ORDER BY salary DESC LIMIT 10 OFFSET 20; |
| SQL Server | TOP or OFFSET FETCH | SELECT TOP (10) * FROM employees ORDER BY salary DESC; |
| Oracle | FETCH FIRST | SELECT * FROM employees ORDER BY salary DESC FETCH FIRST 10 ROWS ONLY; |
-- PostgreSQL / MySQL
SELECT first_name, last_name, salary
FROM employees
ORDER BY salary DESC
LIMIT 5;
-- SQL Server
SELECT TOP (5) first_name, last_name, salary
FROM employees
ORDER BY salary DESC;
Use LIMIT with OFFSET for pagination, but prefer keyset pagination (for example, “where id > last_seen_id”) for large tables to avoid scanning overhead.
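The two pagination styles return the same page; the difference is the work the engine does to get there. This sketch uses SQLite via Python's sqlite3 module with an invented 100-row employees table and a page size of 10.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(i, f"emp{i}") for i in range(1, 101)])

PAGE = 10

# OFFSET pagination: simple, but the engine still scans past the skipped rows.
page3_offset = conn.execute(
    "SELECT id FROM employees ORDER BY id LIMIT ? OFFSET ?", (PAGE, 2 * PAGE)
).fetchall()

# Keyset pagination: remember the last id seen and seek directly past it.
last_seen = 20
page3_keyset = conn.execute(
    "SELECT id FROM employees WHERE id > ? ORDER BY id LIMIT ?", (last_seen, PAGE)
).fetchall()

print(page3_offset == page3_keyset)  # True — same page, cheaper plan on big tables
```

Keyset pagination requires a stable, indexed sort key (here the primary key), which is why it scales where deep OFFSETs do not.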
ORDER BY
ORDER BY defines the sort order of the result set. Without it, row order is not guaranteed, even if it appears consistent in testing.
SELECT first_name, last_name, salary
FROM employees
ORDER BY salary DESC, last_name ASC;
- Specify ascending (ASC) or descending (DESC) per column.
- Use column positions (ORDER BY 3 DESC) only for quick tests, not production code.
- Sort by expressions or aliases (ORDER BY adjusted_salary DESC).
Without ORDER BY, SQL does not define any particular row order. Engines may return rows differently between executions.
By mastering SELECT structure, aliases, and output controls such as DISTINCT, LIMIT, and ORDER BY, you can build efficient, readable queries that form the backbone of all SQL usage.
Chapter 12: Filtering Results
Filtering limits query results to only the rows that meet specified conditions. The WHERE clause forms the backbone of filtering, combining comparison operators, ranges, pattern matching, and logical connectors to express precise criteria. This chapter explores essential filters, logical combinations, and how operator precedence affects evaluation.
The engine evaluates WHERE conditions before grouping and sorting, which can save significant processing time.
WHERE, BETWEEN, LIKE, IN, IS NULL
The WHERE clause specifies a condition for each row. Only rows evaluating to true are included in the result set.
Basic comparisons
SELECT *
FROM employees
WHERE department = 'Sales'
AND salary > 60000;
- = equal
- <> or != not equal
- >, <, >=, <= ordering comparisons
Use single quotes for string literals and avoid implicit conversions between text and numbers.
Using BETWEEN for ranges
SELECT id, total
FROM orders
WHERE total BETWEEN 100 AND 500;
BETWEEN includes both boundary values. For exclusive ranges, use explicit comparisons.
WHERE total > 100 AND total < 500;
Using BETWEEN on timestamps can include unintended results if time components differ.
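The timestamp pitfall is easy to reproduce. This sketch uses SQLite via Python's sqlite3 module with invented events and text timestamps: BETWEEN's inclusive upper bound of '2025-03-31' cuts off everything after midnight on March 31, while a half-open range captures the whole month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, ts TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "2025-03-01 00:00:00"),
    (2, "2025-03-31 09:15:00"),
    (3, "2025-03-31 23:59:59"),
    (4, "2025-04-01 00:00:00"),
])

# '2025-03-31 09:15:00' sorts AFTER '2025-03-31', so most of March 31 is dropped.
between = conn.execute(
    "SELECT id FROM events WHERE ts BETWEEN '2025-03-01' AND '2025-03-31' ORDER BY id"
).fetchall()

# Half-open range: >= start of March, < start of April. Catches all of March.
half_open = conn.execute(
    "SELECT id FROM events WHERE ts >= '2025-03-01' AND ts < '2025-04-01' ORDER BY id"
).fetchall()

print(between)    # [(1,)]
print(half_open)  # [(1,), (2,), (3,)]
```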
Pattern matching with LIKE
LIKE uses wildcards to match string patterns:
- % matches any sequence of characters.
- _ matches a single character.
SELECT first_name, last_name
FROM employees
WHERE last_name LIKE 'Sm%'; -- starts with Sm
SELECT *
FROM customers
WHERE email LIKE '%@example.com'; -- ends with @example.com
For case-insensitive matching, use ILIKE (PostgreSQL) or apply LOWER() to both sides. Create functional indexes on LOWER(column) for performance.
Matching sets with IN
SELECT *
FROM products
WHERE category IN ('Electronics', 'Books', 'Home');
IN checks membership in a list of values or subquery results:
SELECT id, name
FROM employees
WHERE department_id IN (
SELECT id FROM departments WHERE region = 'EU'
);
NOT IN behaves differently when subqueries return NULL values. Prefer NOT EXISTS to avoid unexpected filtering.
Checking for NULL values
NULL represents missing or unknown data. You must use IS NULL or IS NOT NULL to test it.
SELECT id, email
FROM customers
WHERE email IS NULL;
SELECT id, email
FROM customers
WHERE email IS NOT NULL;
Comparisons to NULL using = or <> evaluate to unknown, so such rows are never returned. Handle missing values explicitly with IS NULL or COALESCE().
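The three-valued logic around NULL is worth seeing in action. This sketch runs against SQLite via Python's sqlite3 module; the customers table and emails are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "a@example.com"), (2, None), (3, None)])

# '= NULL' evaluates to unknown for every row: nothing qualifies.
eq_null = conn.execute("SELECT id FROM customers WHERE email = NULL").fetchall()

# IS NULL is the correct test for missing values.
is_null = conn.execute(
    "SELECT id FROM customers WHERE email IS NULL ORDER BY id"
).fetchall()

# COALESCE substitutes a fallback value for NULL.
coalesced = conn.execute(
    "SELECT id, COALESCE(email, 'missing') FROM customers ORDER BY id"
).fetchall()

print(eq_null)    # []
print(is_null)    # [(2,), (3,)]
print(coalesced)  # [(1, 'a@example.com'), (2, 'missing'), (3, 'missing')]
```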
Logical operators
Logical operators combine multiple conditions in a WHERE clause. They control how filters interact to include or exclude rows.
AND and OR
SELECT *
FROM employees
WHERE department = 'Engineering'
AND salary > 90000;
All conditions joined by AND must be true for a row to qualify. OR allows any one of its conditions to be true.
SELECT *
FROM employees
WHERE department = 'Sales'
OR department = 'Marketing';
You can mix them, but operator precedence determines evaluation order (see below). Parentheses clarify intent and prevent logical errors.
NOT operator
NOT negates a condition. It can be used with any expression that evaluates to true or false.
SELECT *
FROM employees
WHERE NOT (department = 'HR' OR salary < 50000);
NOT IN and NOT EXISTS differ in how they treat nulls. If the subquery result can include NULL, NOT IN may return no rows at all.
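This NULL trap deserves a concrete demonstration. The sketch below uses SQLite via Python's sqlite3 module with an invented customers/orders pair in which one order has a NULL customer_id: NOT IN returns nothing, while NOT EXISTS returns the expected customers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders VALUES (10, 1), (11, NULL);  -- one order has a NULL customer
""")

# NOT IN against a list containing NULL returns no rows at all:
# "id <> NULL" is unknown, so no row can prove it is absent from the list.
not_in = conn.execute(
    "SELECT id FROM customers WHERE id NOT IN (SELECT customer_id FROM orders)"
).fetchall()

# NOT EXISTS handles the NULL correctly.
not_exists = conn.execute("""
    SELECT c.id FROM customers AS c
    WHERE NOT EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id)
    ORDER BY c.id
""").fetchall()

print(not_in)      # []
print(not_exists)  # [(2,), (3,)]
```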
Combining filters logically
SELECT *
FROM orders
WHERE (status = 'SHIPPED' OR status = 'DELIVERED')
AND total > 100;
Always test combinations carefully. A missing parenthesis or inverted logic can drastically change results.
Operator precedence and parentheses
When combining multiple conditions, SQL follows a fixed order of evaluation:
| Order | Operator | Description |
| --- | --- | --- |
| 1 | NOT | Negation |
| 2 | AND | Conjunction (both must be true) |
| 3 | OR | Disjunction (either condition true) |
Without parentheses, AND conditions are evaluated before OR. This means:
-- This query returns rows where
-- (department = 'Sales' AND salary > 50000) OR department = 'HR'
SELECT *
FROM employees
WHERE department = 'Sales' AND salary > 50000
OR department = 'HR';
To make intent explicit, group conditions with parentheses:
SELECT *
FROM employees
WHERE department = 'HR'
OR (department = 'Sales' AND salary > 50000);
Precedence mistakes are a common source of errors in complex WHERE clauses. Always use parentheses for clarity instead of relying on default order.
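The precedence difference produces visibly different row sets. This sketch uses SQLite via Python's sqlite3 module; the three-row employees table is invented so that the two readings of the same predicates disagree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, department TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (1, "Sales", 40000),   # Sales, below threshold
    (2, "Sales", 60000),   # Sales, above threshold
    (3, "HR",    45000),   # HR, below threshold
])

# AND binds tighter than OR: this means (Sales AND > 50000) OR HR.
no_parens = conn.execute("""
    SELECT id FROM employees
    WHERE department = 'Sales' AND salary > 50000 OR department = 'HR'
    ORDER BY id
""").fetchall()

# Parentheses change the meaning: (Sales OR HR) AND > 50000.
with_parens = conn.execute("""
    SELECT id FROM employees
    WHERE (department = 'Sales' OR department = 'HR') AND salary > 50000
    ORDER BY id
""").fetchall()

print(no_parens)    # [(2,), (3,)]
print(with_parens)  # [(2,)]
```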
Mastering filters and logical combinations makes queries accurate, readable, and efficient. By understanding how SQL evaluates conditions, you can write powerful searches that return exactly the intended data.
Chapter 13: Joining Tables
Relational data lives across multiple tables; joins let you combine rows that share related keys. This chapter shows the core join types, how to read and write them, and how to avoid common mistakes when matching data across tables.
INNER, LEFT, RIGHT, FULL, CROSS joins
A join combines rows from two tables based on a match condition (except for a true Cartesian product). The typical pattern is:
SELECT {column_list …}
FROM table1
[INNER] JOIN table2
ON table1.key = table2.key;
When joining on keys (for example, orders.customer_id to customers.customer_id), prefer a simple equality predicate and select only the columns you need.
INNER JOIN: matches in both tables
Returns only rows where the join condition is true in both tables.
SELECT o.order_id, c.customer_name
FROM orders AS o
INNER JOIN customers AS c
ON o.customer_id = c.customer_id;
If a customer has no orders, or an order references a missing customer, that pair does not appear in the result.
LEFT JOIN and RIGHT JOIN: preserving one side
A LEFT JOIN returns all rows from the left table and fills unmatched columns from the right with NULL. A RIGHT JOIN is symmetric for the right table.
-- All customers; show matching orders if any
SELECT c.customer_id, c.customer_name, o.order_id
FROM customers AS c
LEFT JOIN orders AS o
ON o.customer_id = c.customer_id;
-- All products; show suppliers if any
SELECT p.product_id, p.product_name, s.supplier_name
FROM products AS p
RIGHT JOIN suppliers AS s
ON p.supplier_id = s.supplier_id; -- Example; prefer LEFT with swapped sides
Prefer LEFT JOIN for readability; most teams avoid RIGHT JOIN and instead swap tables so the preserved table is on the left.
FULL OUTER JOIN: union of LEFT and RIGHT
Returns every row from both tables; rows that do not match on either side appear with NULLs for the other side.
SELECT a.id, a.name, b.name AS b_name
FROM table_a AS a
FULL OUTER JOIN table_b AS b
ON a.id = b.id;
Some engines do not support FULL OUTER JOIN. Use LEFT JOIN plus RIGHT JOIN with UNION (or UNION ALL with filters) to emulate it:
SELECT a.id, a.name, b.name AS b_name
FROM table_a AS a
LEFT JOIN table_b AS b ON a.id = b.id
UNION
SELECT a.id, a.name, b.name AS b_name
FROM table_a AS a
RIGHT JOIN table_b AS b ON a.id = b.id;
CROSS JOIN (Cartesian product)
Produces every combination of rows from both tables. Use carefully; it scales as rows(table1) × rows(table2).
-- Equivalent forms in many engines
SELECT c.color, s.size
FROM colors AS c
CROSS JOIN sizes AS s;
SELECT c.color, s.size
FROM colors AS c, sizes AS s; -- Avoid this legacy comma form in new code
Omitting or mistyping the ON clause on a regular JOIN can accidentally produce a large Cartesian product. Always verify your ON predicate.
ON vs WHERE with outer joins
With outer joins, filters in WHERE can turn an outer join into an inner join by removing the NULL-extended rows. Put row-matching logic in ON, and post-join filtering in WHERE only when you intend to filter both matched and unmatched rows.
-- Keep customers even when they have no orders
SELECT c.customer_id, o.order_id
FROM customers AS c
LEFT JOIN orders AS o
ON o.customer_id = c.customer_id
AND o.status = 'OPEN' -- filter applies to right side only
WHERE c.country = 'UK'; -- filter applies to preserved left rows as well
Side-specific match conditions go in ON; global constraints go in WHERE. This keeps outer join semantics correct.
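The ON-versus-WHERE distinction is the single most common outer-join bug, and it is easy to reproduce. This sketch uses SQLite via Python's sqlite3 module with an invented two-customer schema where one customer has only a CLOSED order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo');
    INSERT INTO orders VALUES (10, 1, 'CLOSED');   -- Ana has only a CLOSED order
""")

# Filter in ON: both customers survive; Ana's CLOSED order simply doesn't match.
in_on = conn.execute("""
    SELECT c.name, o.id FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id AND o.status = 'OPEN'
    ORDER BY c.id
""").fetchall()

# Filter in WHERE: NULL-extended rows fail o.status = 'OPEN', so the
# outer join silently becomes an inner join and both customers vanish.
in_where = conn.execute("""
    SELECT c.name, o.id FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    WHERE o.status = 'OPEN'
""").fetchall()

print(in_on)     # [('Ana', None), ('Bo', None)]
print(in_where)  # []
```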
Self-joins and aliasing
A self-join relates a table to itself (for example hierarchical data, predecessor links, or de-duplicating near matches). Aliases keep the references readable.
Aliasing with AS (or implicit aliases)
SELECT o.order_id, c.customer_name
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id;
-- Most engines allow omitting AS (use sparingly for clarity):
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;
Choose mnemonic aliases (c for customers, o for orders). Avoid single-letter aliases that collide (t for many tables).
Classic self-join: employees and managers
-- Each employee row has manager_id pointing to another employee's employee_id
SELECT e.employee_id,
e.employee_name,
m.employee_name AS manager_name
FROM employees AS e
LEFT JOIN employees AS m
ON m.employee_id = e.manager_id;
The alias e stands for the employee row; m stands for the manager row in the same table.
Self-join for de-duplication and matching
-- Find pairs of customers with the same email (case-insensitive), different ids
SELECT c1.customer_id AS id1,
c2.customer_id AS id2,
LOWER(c1.email) AS email_normalized
FROM customers AS c1
JOIN customers AS c2
ON LOWER(c1.email) = LOWER(c2.email)
AND c1.customer_id < c2.customer_id; -- prevent symmetric duplicates
Index the columns used in self-join predicates (employees.manager_id, customers.email) to keep scans efficient.
Hierarchy traversal (adjacency list)
For a simple parent → child relationship stored as (id, parent_id), a self-join returns parent attributes alongside children.
SELECT child.category_id,
child.category_name,
parent.category_name AS parent_name
FROM categories AS child
LEFT JOIN categories AS parent
ON parent.category_id = child.parent_id;
Practical join examples
This section uses a small sales schema:
| Table | Key columns | Notes |
| --- | --- | --- |
| customers | customer_id | One row per customer |
| orders | order_id, customer_id | Order header; one row per order |
| order_items | order_id, product_id | Line items per order |
| products | product_id | Catalog of products |
Joining three or more tables
SELECT o.order_id,
c.customer_name,
p.product_name,
oi.quantity,
oi.unit_price
FROM orders AS o
JOIN customers AS c
ON c.customer_id = o.customer_id
JOIN order_items AS oi
ON oi.order_id = o.order_id
JOIN products AS p
ON p.product_id = oi.product_id;
Add joins one step at a time; verify each join multiplies rows as expected.
Aggregations across joins
-- Revenue per customer
SELECT c.customer_id,
c.customer_name,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM customers AS c
JOIN orders AS o
ON o.customer_id = c.customer_id
JOIN order_items AS oi
ON oi.order_id = o.order_id
GROUP BY c.customer_id, c.customer_name
ORDER BY revenue DESC;
Finding unmatched rows (LEFT JOIN with NULL check)
-- Customers with no orders
SELECT c.customer_id, c.customer_name
FROM customers AS c
LEFT JOIN orders AS o
ON o.customer_id = c.customer_id
WHERE o.order_id IS NULL;
This pattern (outer join plus IS NULL on the non-preserved side) is the standard anti-join approach.
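The anti-join pattern can be verified on a toy dataset. This sketch uses SQLite via Python's sqlite3 module; the three customers and two orders are invented so that exactly one customer has no match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1), (11, 3);
""")

# Anti-join: preserve all customers, then keep only the NULL-extended rows.
no_orders = conn.execute("""
    SELECT c.customer_id, c.customer_name
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE o.order_id IS NULL
""").fetchall()

print(no_orders)  # [(2, 'Bo')]
```

The IS NULL test must target a column that cannot legitimately be NULL in matched rows (a primary key is ideal), otherwise real matches could be mistaken for missing ones.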
Many-to-many bridges
Use a bridge (junction) table to connect two entities. Join through the bridge to reach the far table.
-- Authors <- book_authors -> Books
SELECT a.author_name, b.title
FROM authors AS a
JOIN book_authors AS ba
ON ba.author_id = a.author_id
JOIN books AS b
ON b.book_id = ba.book_id;
Filtering on the joined table safely
-- Customers with at least one OPEN order; preserve the "at least one" intent
SELECT DISTINCT c.customer_id, c.customer_name
FROM customers AS c
JOIN orders AS o
ON o.customer_id = c.customer_id
WHERE o.status = 'OPEN';
If you need to keep customers with zero open orders as well, move the status predicate into ON and use LEFT JOIN.
USING and NATURAL JOIN
-- USING collapses duplicate column names in the output
SELECT o.order_id, customer_id
FROM orders AS o
JOIN customers AS c
USING (customer_id);
-- NATURAL JOIN matches columns with the same name automatically (avoid in production)
SELECT *
FROM orders NATURAL JOIN customers;
NATURAL JOIN depends on column names that might change. Prefer explicit ON or USING so behavior is clear.
Join performance checklist
- Index foreign keys (orders.customer_id, order_items.order_id, order_items.product_id).
- Project only needed columns; wide selects slow I/O.
- Filter early where possible (push selective predicates into ON for outer joins or WHERE for inner joins).
- Watch cardinality multiplication; understand one-to-many and many-to-many expansion.
While building a query, run SELECT COUNT(*) at each step to see how join choices affect row counts, then switch to detailed columns when satisfied.
Dialect notes
| Topic | MySQL | PostgreSQL | SQL Server | SQLite | Oracle |
| --- | --- | --- | --- | --- | --- |
| FULL OUTER JOIN | Not supported; emulate with UNION | Supported | Supported | Not supported; emulate with UNION | Supported |
| USING clause | Supported | Supported | Not supported; use ON | Supported | Supported |
| Legacy outer join syntax | N/A | N/A | *=, =* (deprecated) | N/A | (+) (deprecated; use ANSI joins) |
With these patterns, you can confidently combine tables for analysis and application queries.
Chapter 14: Grouping and Aggregating
Aggregations turn many rows into summarized results; grouping decides which rows belong together. This chapter explains GROUP BY and HAVING, introduces core aggregate functions, and shows advanced multi-level summaries with ROLLUP, CUBE, and GROUPING SETS.
GROUP BY, HAVING
A basic grouped query follows this pattern:
SELECT {group_columns …}, {aggregates …}
FROM {table}
[WHERE {row filters …}]
GROUP BY {group_columns …}
[HAVING {aggregate filters …}]
[ORDER BY {columns …}];
Every column in the SELECT list must be either grouped (appear in GROUP BY) or aggregated. Some engines allow functional dependency shortcuts; write the explicit grouping for clarity and portability.
Single and multi-column GROUP BY
-- Orders per customer
SELECT o.customer_id, COUNT(*) AS order_count
FROM orders AS o
GROUP BY o.customer_id;
-- Revenue per customer and year
SELECT o.customer_id,
EXTRACT(YEAR FROM o.order_date) AS yr,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY o.customer_id, EXTRACT(YEAR FROM o.order_date);
Grouping by expressions is allowed. Not every engine lets you group by a select-list alias, so use the same expression in GROUP BY that you select.
WHERE vs HAVING
Use WHERE to filter input rows before grouping. Use HAVING to filter groups after aggregation.
-- Count OPEN orders per customer; keep customers with at least two
SELECT o.customer_id, COUNT(*) AS open_orders
FROM orders AS o
WHERE o.status = 'OPEN'
GROUP BY o.customer_id
HAVING COUNT(*) >= 2;
HAVING can be slower because it waits until after grouping. Prefer WHERE when possible.
Handling NULL in groups
Rows where the grouping expression evaluates to NULL are grouped together in a single NULL group. Aggregate functions ignore NULL values, except COUNT(*), which counts rows regardless of NULLs.
-- Count orders per coupon; NULL coupon rows form one group
SELECT o.coupon_code, COUNT(*) AS n
FROM orders AS o
GROUP BY o.coupon_code;
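Both behaviors can be confirmed in one run. This sketch uses SQLite via Python's sqlite3 module; the four-order coupon table is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, coupon_code TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "SAVE10"), (2, None), (3, None), (4, "SAVE10")])

# NULL coupons collapse into one group; COUNT(*) counts every row in it.
groups = conn.execute("""
    SELECT coupon_code, COUNT(*) AS n
    FROM orders
    GROUP BY coupon_code
    ORDER BY coupon_code IS NULL, coupon_code
""").fetchall()
print(groups)  # [('SAVE10', 2), (None, 2)]

# COUNT(column) skips NULLs; COUNT(*) does not.
counts = conn.execute(
    "SELECT COUNT(*), COUNT(coupon_code) FROM orders"
).fetchone()
print(counts)  # (4, 2)
```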
Aliasing in GROUP BY and ORDER BY
-- Most engines allow using select-list aliases in ORDER BY
SELECT c.customer_id AS id, COUNT(*) AS orders
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
GROUP BY c.customer_id
ORDER BY orders DESC;
-- Some engines allow ordinal positions in ORDER BY; avoid for readability
-- ORDER BY 2 DESC;
COUNT, SUM, AVG, MIN, MAX, DISTINCT aggregates
These standard aggregates work across engines.
COUNT variants
-- Count rows regardless of NULLs
SELECT COUNT(*) FROM orders;
-- Count non-NULL values in a column
SELECT COUNT(ship_date) FROM orders;
-- Count distinct customers who ordered this year
SELECT COUNT(DISTINCT customer_id)
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2025;
COUNT(DISTINCT a, b) (multiple columns) is not supported in all engines. Use a concatenation or a derived key where needed, or group first then count.
SUM and AVG
-- Revenue per product
SELECT oi.product_id,
SUM(oi.quantity * oi.unit_price) AS revenue,
AVG(oi.unit_price) AS avg_price
FROM order_items AS oi
GROUP BY oi.product_id;
AVG ignores NULL values. Be mindful of integer division in some engines; cast to a decimal type when precision matters.
MIN and MAX
-- First and latest order per customer
SELECT o.customer_id,
MIN(o.order_date) AS first_order,
MAX(o.order_date) AS latest_order
FROM orders AS o
GROUP BY o.customer_id;
DISTINCT inside aggregates
-- Number of distinct products per order
SELECT oi.order_id,
COUNT(DISTINCT oi.product_id) AS unique_products
FROM order_items AS oi
GROUP BY oi.order_id;
DISTINCT applies to the aggregate's argument only. It does not affect other selected columns.
Conditional aggregates
Use CASE expressions for portable conditional sums and counts.
-- Order status breakdown per customer
SELECT o.customer_id,
SUM(CASE WHEN o.status = 'OPEN' THEN 1 ELSE 0 END) AS open_orders,
SUM(CASE WHEN o.status = 'CLOSED' THEN 1 ELSE 0 END) AS closed_orders
FROM orders AS o
GROUP BY o.customer_id;
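The SUM-over-CASE idiom is fully portable and easy to check. This sketch uses SQLite via Python's sqlite3 module; the orders and statuses are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 1, "OPEN"), (2, 1, "CLOSED"), (3, 1, "OPEN"),
    (4, 2, "CLOSED"),
])

# Each CASE contributes 1 for matching rows and 0 otherwise, so SUM counts them.
rows = conn.execute("""
    SELECT customer_id,
           SUM(CASE WHEN status = 'OPEN'   THEN 1 ELSE 0 END) AS open_orders,
           SUM(CASE WHEN status = 'CLOSED' THEN 1 ELSE 0 END) AS closed_orders
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
print(rows)  # [(1, 2, 1), (2, 0, 1)]
```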
PostgreSQL and a few engines support FILTER for concise conditional aggregates:
-- PostgreSQL example
SELECT customer_id,
COUNT(*) FILTER (WHERE status = 'OPEN') AS open_orders,
COUNT(*) FILTER (WHERE status = 'CLOSED') AS closed_orders
FROM orders
GROUP BY customer_id;
Window functions (aggregates with an OVER clause) compute per-row analytics and are different from grouped aggregates. This book covers windows later…
ROLLUP, CUBE, and GROUPING SETS
Advanced grouping operators produce multi-level totals in one query. They extend GROUP BY and return subtotal rows alongside detail rows.
ROLLUP: hierarchical subtotals
ROLLUP creates subtotals moving from right to left through the grouping list and a grand total at the end.
-- Revenue by year, then month, plus yearly totals and a grand total
SELECT EXTRACT(YEAR FROM o.order_date) AS yr,
EXTRACT(MONTH FROM o.order_date) AS mon,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY ROLLUP (EXTRACT(YEAR FROM o.order_date),
EXTRACT(MONTH FROM o.order_date))
ORDER BY yr, mon;
Subtotal rows show NULL in the rolled-up columns. Use GROUPING or GROUPING_ID (where available) to detect them cleanly.
-- Label subtotal rows (syntax varies by engine)
SELECT
EXTRACT(YEAR FROM o.order_date) AS yr,
EXTRACT(MONTH FROM o.order_date) AS mon,
SUM(oi.quantity * oi.unit_price) AS revenue,
GROUPING(EXTRACT(YEAR FROM o.order_date)) AS g_yr,
GROUPING(EXTRACT(MONTH FROM o.order_date)) AS g_mon
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY ROLLUP (EXTRACT(YEAR FROM o.order_date),
EXTRACT(MONTH FROM o.order_date));
Use CASE to replace NULL in subtotal rows (for example CASE WHEN GROUPING(mon)=1 THEN 'All months' ELSE TO_CHAR(mon,'FM00') END).
CUBE: all combinations of groups
CUBE(a, b, c) returns subtotals for every combination of the listed columns plus the grand total. This grows quickly with each additional column.
-- Revenue by region, channel, and product category plus all subtotals
SELECT r.region, o.channel, p.category,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
JOIN products AS p ON p.product_id = oi.product_id
JOIN regions AS r ON r.region_id = o.region_id
GROUP BY CUBE (r.region, o.channel, p.category);
CUBE emits many rows. Limit dimensions or apply HAVING to keep result sizes manageable.
GROUPING SETS: explicit subtotal patterns
GROUPING SETS let you specify exactly which groupings you want.
-- Get revenue by region, by channel, and the grand total; avoid unwanted combos
SELECT r.region, o.channel,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
JOIN regions AS r ON r.region_id = o.region_id
GROUP BY GROUPING SETS (
(r.region),
(o.channel),
() -- grand total
)
ORDER BY r.region NULLS LAST, o.channel NULLS LAST;
Emulating advanced groupings
If your engine lacks these operators, use UNION ALL across several grouped queries.
-- Emulate ROLLUP(year, month)
SELECT yr, mon, SUM(revenue) AS revenue
FROM (
SELECT EXTRACT(YEAR FROM o.order_date) AS yr,
EXTRACT(MONTH FROM o.order_date) AS mon,
oi.quantity * oi.unit_price AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
) AS t
GROUP BY yr, mon
UNION ALL
SELECT yr, NULL AS mon, SUM(revenue)
FROM (
SELECT EXTRACT(YEAR FROM o.order_date) AS yr,
oi.quantity * oi.unit_price AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
) AS y
GROUP BY yr
UNION ALL
SELECT NULL AS yr, NULL AS mon, SUM(oi.quantity * oi.unit_price)
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id;
When emulating with UNION ALL, add a column that tags the row level (for example 'detail', 'by_year', 'total') so you can sort and format consistently.
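The tag-column technique runs as-is on SQLite, which lacks ROLLUP. This sketch uses Python's sqlite3 module with an invented pre-aggregated sales table (year, month, revenue) to keep the emulation compact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (yr INTEGER, mon INTEGER, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (2024, 12, 100.0), (2025, 1, 200.0), (2025, 1, 50.0), (2025, 2, 75.0),
])

# SQLite has no ROLLUP: UNION ALL three grouped queries, tagged by level.
rows = conn.execute("""
    SELECT 'detail' AS level, yr, mon, SUM(revenue) FROM sales GROUP BY yr, mon
    UNION ALL
    SELECT 'by_year', yr, NULL, SUM(revenue) FROM sales GROUP BY yr
    UNION ALL
    SELECT 'total', NULL, NULL, SUM(revenue) FROM sales
    ORDER BY yr, mon
""").fetchall()

for r in rows:          # 6 rows: 3 detail, 2 by_year, 1 total
    print(r)
```

The tag column makes subtotal rows unambiguous even when a genuine NULL could appear in a grouping column.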
Dialect notes
| Feature | MySQL | PostgreSQL | SQL Server | SQLite | Oracle |
| --- | --- | --- | --- | --- | --- |
| ROLLUP / WITH ROLLUP | Supported (WITH ROLLUP and ROLLUP() in recent versions) | Supported | Supported | Not supported; emulate with UNION | Supported |
| CUBE | Not supported; emulate | Supported | Supported | Not supported; emulate | Supported |
| GROUPING SETS | Supported (recent); check version | Supported | Supported | Not supported; emulate | Supported |
| GROUPING / GROUPING_ID | Partial support; no GROUPING_ID in older releases | Supported (GROUPING()) | Supported (both) | N/A | Supported (both) |
Master the basics of grouping, then add subtotals with ROLLUP, full combinatorial summaries with CUBE, or targeted totals with GROUPING SETS as your analysis requires.
Chapter 15: Subqueries and Derived Tables
Subqueries allow you to nest one query inside another. They return values or tables that the outer query can use for filtering, comparison, or joining. This chapter explores scalar, row, and table subqueries, explains EXISTS, ANY, and ALL, and concludes with Common Table Expressions (CTEs) for clarity and reuse.
Scalar, row, and table subqueries
A subquery is enclosed in parentheses and can appear almost anywhere an expression or table name could. The outer query treats the subquery’s result as a single value, a row, or a full virtual table.
Scalar subqueries (single value)
A scalar subquery returns exactly one value (one row and one column). Use it wherever a literal could appear.
-- Find customers whose total spend exceeds the average order total
SELECT c.customer_id, c.customer_name
FROM customers AS c
WHERE (
SELECT SUM(oi.quantity * oi.unit_price)
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
WHERE o.customer_id = c.customer_id
) > (
SELECT AVG(total)
FROM (
SELECT SUM(oi.quantity * oi.unit_price) AS total
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY o.order_id
) AS sub
);
Row subqueries (multiple columns, one row)
A row subquery returns one row with several columns. It can be compared to another row using tuple syntax.
-- Compare composite (city, state) values
SELECT c1.customer_id, c1.customer_name
FROM customers AS c1
WHERE (c1.city, c1.state) = (
SELECT c2.city, c2.state
FROM customers AS c2
WHERE c2.customer_id = 42
);
(col1, col2) = (val1, val2) is supported in PostgreSQL, Oracle, and some others; MySQL supports it partially; SQL Server does not.
Table subqueries (derived tables)
When a subquery returns a full result set, it can be treated as a table source. These are often called derived tables or inline views.
-- Derived table to compute revenue per customer
SELECT t.customer_id, t.revenue
FROM (
SELECT o.customer_id,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY o.customer_id
) AS t
WHERE t.revenue > 1000;
Always give derived tables an alias. Without one, most databases reject the query.
Correlated subqueries
A correlated subquery references columns from its outer query. It executes once per outer row, not just once overall.
-- Customers whose latest order is within 30 days
SELECT c.customer_id, c.customer_name
FROM customers AS c
WHERE EXISTS (
SELECT 1
FROM orders AS o
WHERE o.customer_id = c.customer_id
AND o.order_date > CURRENT_DATE - INTERVAL '30' DAY
);
EXISTS, ANY, ALL
These operators compare outer query rows to subquery results. They provide powerful logic for inclusion and comparison tests.
EXISTS
EXISTS tests whether a subquery returns at least one row. It stops scanning as soon as one match is found.
-- Customers who have ever placed an order
SELECT c.customer_id, c.customer_name
FROM customers AS c
WHERE EXISTS (
SELECT 1
FROM orders AS o
WHERE o.customer_id = c.customer_id
);
EXISTS ignores the subquery’s actual values; only presence matters.
Prefer EXISTS instead of IN when the subquery may produce NULLs or large results. EXISTS often optimizes better.
NOT EXISTS
To find outer rows with no matches, invert the condition.
-- Customers with no orders
SELECT c.customer_id, c.customer_name
FROM customers AS c
WHERE NOT EXISTS (
SELECT 1
FROM orders AS o
WHERE o.customer_id = c.customer_id
);
Avoid NOT IN when subquery columns may contain NULL; it can return no rows unexpectedly. NOT EXISTS is safer.
ANY and ALL
ANY (or SOME) compares a value to each subquery result and succeeds if any comparison is true. ALL requires all comparisons to be true.
-- Orders larger than the average for any customer in Germany
SELECT o.order_id, o.customer_id, o.total_amount
FROM orders AS o
WHERE o.total_amount > ANY (
SELECT AVG(oi.quantity * oi.unit_price)
FROM orders AS g
JOIN order_items AS oi ON oi.order_id = g.order_id
JOIN customers AS c ON c.customer_id = g.customer_id
WHERE c.country = 'Germany'
GROUP BY g.customer_id
);
-- Products cheaper than all others in their category
SELECT p.product_id, p.product_name
FROM products AS p
WHERE p.unit_price < ALL (
SELECT p2.unit_price
FROM products AS p2
WHERE p2.category_id = p.category_id
AND p2.product_id <> p.product_id
);
Read ANY as “at least one” and ALL as “every.” For clarity, many developers replace them with MIN or MAX aggregates where possible.
Common Table Expressions
Common Table Expressions (CTEs) provide named, reusable subqueries that improve readability. They are introduced with WITH and can be referenced by later queries in the same statement.
Basic CTE syntax
WITH totals AS (
SELECT o.customer_id, SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY o.customer_id
)
SELECT c.customer_name, t.revenue
FROM customers AS c
JOIN totals AS t ON t.customer_id = c.customer_id
WHERE t.revenue > 1000;
A CTE acts like a temporary named table visible only to the main query (and to other CTEs declared after it).
You can define several CTEs, separated by commas, under a single WITH clause.
Recursive CTEs
Recursive CTEs reference themselves to walk hierarchies or compute sequences. They use UNION ALL between an anchor member and a recursive member.
-- Organization hierarchy
WITH RECURSIVE org_chart AS (
SELECT employee_id, manager_id, employee_name, 1 AS level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, e.employee_name, oc.level + 1
FROM employees AS e
JOIN org_chart AS oc ON e.manager_id = oc.employee_id
)
SELECT *
FROM org_chart
ORDER BY level, employee_name;
Use UNION ALL instead of UNION in recursive CTEs to avoid redundant sorting and distinct checks.
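Recursive CTEs work unchanged in SQLite, so the hierarchy walk can be executed directly. This sketch uses Python's sqlite3 module with an invented three-level org chart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY,
                            manager_id INTEGER, employee_name TEXT);
    INSERT INTO employees VALUES
        (1, NULL, 'Root'),
        (2, 1, 'Lead'),
        (3, 2, 'Dev');
""")

# Anchor: the root (no manager). Recursive member: direct reports, one level deeper.
rows = conn.execute("""
    WITH RECURSIVE org_chart AS (
        SELECT employee_id, manager_id, employee_name, 1 AS level
        FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.employee_id, e.manager_id, e.employee_name, oc.level + 1
        FROM employees AS e
        JOIN org_chart AS oc ON e.manager_id = oc.employee_id
    )
    SELECT employee_name, level FROM org_chart ORDER BY level, employee_name
""").fetchall()
print(rows)  # [('Root', 1), ('Lead', 2), ('Dev', 3)]
```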
CTEs vs derived tables
CTEs are easier to read and maintain, but engines may not always materialize them as temporary results. Derived tables can be faster in some cases. Most optimizers treat them equivalently unless hints or recursion are involved.
-- Derived table version of the same revenue query
SELECT c.customer_name, t.revenue
FROM customers AS c
JOIN (
SELECT o.customer_id, SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY o.customer_id
) AS t ON t.customer_id = c.customer_id;
Dialect notes
| Feature | MySQL | PostgreSQL | SQL Server | SQLite | Oracle |
| --- | --- | --- | --- | --- | --- |
| CTE support | Supported (8.0+) | Supported | Supported | Supported | Supported |
| Recursive CTEs | Supported (8.0+) | Supported | Supported | Supported | Supported |
| Materialization hints | Not available | MATERIALIZED / NOT MATERIALIZED supported | Optimizer controlled | Optimizer controlled | Optimizer controlled |
Subqueries, derived tables, and CTEs all provide ways to compose complex logic. Choose the simplest structure that expresses your intent clearly while remaining efficient for the database engine.
Chapter 16: Transactions and ACID
Transactions ensure that groups of SQL statements are executed safely and consistently. They bundle multiple operations into a single logical unit of work that either fully succeeds or has no effect at all. This chapter introduces transaction control commands, the ACID properties that define reliability, and how databases handle isolation, concurrency, and locking.
COMMIT, ROLLBACK, SAVEPOINT
A transaction starts automatically when you execute the first statement after a previous one has committed (or when you issue BEGIN or START TRANSACTION). You can then decide whether to make changes permanent or discard them.
-- Basic transaction pattern
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT; -- make changes permanent
-- or
ROLLBACK; -- undo all changes
COMMIT permanently applies all changes made since the transaction began. ROLLBACK cancels them and restores the database to its prior state.
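A minimal runnable sketch of this pattern, using Python's sqlite3 driver with manual transaction control; the accounts table and starting balances are invented for illustration:

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode, so BEGIN/COMMIT/ROLLBACK
# are issued explicitly, mirroring the SQL above.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 500)])

conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance - 100 WHERE account_id = 1")
conn.execute("UPDATE accounts SET balance = balance + 100 WHERE account_id = 2")
conn.execute("ROLLBACK")  # discard: both balances revert to 500

conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance - 100 WHERE account_id = 1")
conn.execute("UPDATE accounts SET balance = balance + 100 WHERE account_id = 2")
conn.execute("COMMIT")    # persist: balances become 400 and 600

balances = [r[0] for r in conn.execute(
    "SELECT balance FROM accounts ORDER BY account_id")]
```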
End every transaction with an explicit COMMIT or ROLLBACK to avoid leaving open transactions that hold locks.
SAVEPOINT and partial rollbacks
SAVEPOINT creates a named checkpoint inside a transaction. You can roll back to it without discarding the whole transaction.
START TRANSACTION;
UPDATE products SET stock = stock - 1 WHERE product_id = 101;
SAVEPOINT before_discount;
UPDATE products SET price = price * 0.8 WHERE category = 'Sale';
-- Change of mind: cancel the discount only
ROLLBACK TO SAVEPOINT before_discount;
COMMIT;
After a ROLLBACK TO SAVEPOINT, the savepoint remains defined in most engines, allowing further rollbacks to earlier points if needed.
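The same savepoint flow can be run in SQLite through Python (the product row is invented): only the discount is undone, while the stock decrement survives the commit.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transaction control
conn.execute("""
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        price REAL, stock INTEGER, category TEXT
    )
""")
conn.execute("INSERT INTO products VALUES (101, 100.0, 5, 'Sale')")

conn.execute("BEGIN")
conn.execute("UPDATE products SET stock = stock - 1 WHERE product_id = 101")
conn.execute("SAVEPOINT before_discount")
conn.execute("UPDATE products SET price = price * 0.8 WHERE category = 'Sale'")
conn.execute("ROLLBACK TO SAVEPOINT before_discount")  # cancel only the discount
conn.execute("COMMIT")

price, stock = conn.execute(
    "SELECT price, stock FROM products WHERE product_id = 101").fetchone()
# stock is now 4; price is back to 100.0
```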
All savepoints are released by a full ROLLBACK or COMMIT. Recreate them if you continue with a new transaction afterward.
Autocommit mode
Autocommit executes each statement in its own implicit transaction. This is convenient for read operations but risky for multi-step updates that must stay consistent.
-- Disable autocommit in MySQL
SET autocommit = 0;
-- Explicitly start and end a transaction
START TRANSACTION;
UPDATE accounts SET balance = balance - 50 WHERE account_id = 10;
UPDATE accounts SET balance = balance + 50 WHERE account_id = 20;
COMMIT;
-- Re-enable autocommit
SET autocommit = 1;
Use autocommit for safe single-statement updates and disable it for atomic multi-statement operations.
Isolation levels
Isolation defines how visible uncommitted changes are between concurrent transactions. SQL defines four standard isolation levels. Each prevents certain anomalies but may reduce performance or concurrency.
| Level | Prevents | Allows |
| READ UNCOMMITTED | Nothing | Dirty reads, non-repeatable reads, phantom rows |
| READ COMMITTED | Dirty reads | Non-repeatable reads, phantom rows |
| REPEATABLE READ | Dirty and non-repeatable reads | Phantom rows (varies by engine) |
| SERIALIZABLE | All above anomalies | None |
Setting isolation level
-- Applies to the next transaction in MySQL and PostgreSQL (add SESSION for session scope)
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- SQL Server
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Oracle default is READ COMMITTED
Higher isolation reduces concurrency, so choose the weakest level that preserves correctness for your workload.
For most workloads, READ COMMITTED often suffices. For financial updates, SERIALIZABLE ensures full consistency.
Common anomalies
- Dirty read: Transaction A reads uncommitted data from B, which B later rolls back.
- Non-repeatable read: Transaction A rereads the same row and sees different data due to B’s commit.
- Phantom read: A range query in A returns new or missing rows after B inserts or deletes matching rows.
Isolation levels balance performance and correctness by deciding which of these anomalies to allow.
Concurrency control
Databases use concurrency control to let many transactions operate safely at once. Two major strategies dominate: pessimistic locking (block conflicting operations) and optimistic control (detect conflicts and retry).
Pessimistic concurrency
Pessimistic systems lock resources as soon as they are read or updated, preventing conflicts but reducing parallelism.
-- Explicit lock in SQL Server
SELECT * FROM accounts WITH (UPDLOCK)
WHERE account_id = 100;
MySQL’s InnoDB and PostgreSQL automatically lock rows written inside a transaction. You can request explicit locks using SELECT ... FOR UPDATE.
-- Reserve a row for editing
SELECT * FROM accounts
WHERE account_id = 100
FOR UPDATE;
Optimistic concurrency
Optimistic control allows concurrent transactions without locking, checking later whether data changed. It suits read-heavy workloads.
-- Example using a version column
UPDATE accounts
SET balance = balance + 50, version = version + 1
WHERE account_id = 10 AND version = 5;
If zero rows are affected, another transaction modified the same row, and the update must be retried.
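Here is the version-column pattern end to end in SQLite; the account row, the add_funds helper, and the version numbers are all invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        balance INTEGER, version INTEGER
    )
""")
conn.execute("INSERT INTO accounts VALUES (10, 100, 5)")

def add_funds(expected_version, amount):
    # Returns True when the update wins; rowcount 0 means another writer
    # bumped the version first and the caller must re-read and retry.
    cur = conn.execute(
        "UPDATE accounts SET balance = balance + ?, version = version + 1 "
        "WHERE account_id = 10 AND version = ?",
        (amount, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1

won = add_funds(5, 50)   # version matches: succeeds, version becomes 6
lost = add_funds(5, 50)  # stale version: zero rows affected, must retry
```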
Locking and deadlocks
Locks control concurrent access to data. They can apply to rows, pages, or entire tables depending on the engine and query. Proper lock management prevents inconsistent reads but may introduce deadlocks when transactions wait on each other.
Lock types
| Lock type | Purpose |
| Shared (S) | Allows concurrent reads but blocks writes. |
| Exclusive (X) | Blocks all other access until commit or rollback. |
| Intent locks | Signal upcoming shared or exclusive locks at a coarser level. |
| Table locks | Used when statements affect entire tables (DDL, TRUNCATE, etc.). |
Most engines acquire and release locks automatically. Manual control is rarely needed except for specialized concurrency logic.
Deadlocks
A deadlock occurs when two transactions each hold a lock that the other needs. Neither can proceed, so the database aborts one to resolve the conflict.
-- Example scenario
-- Transaction A:
UPDATE accounts SET balance = balance - 10 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 10 WHERE account_id = 2;
-- Transaction B (in parallel):
UPDATE accounts SET balance = balance - 10 WHERE account_id = 2;
UPDATE accounts SET balance = balance + 10 WHERE account_id = 1;
Each transaction locks one row, then waits on the other’s lock, creating a circular wait. The engine detects this and rolls back one transaction.
Monitoring locks
Databases provide diagnostic views to inspect locks and waits:
- MySQL: SHOW ENGINE INNODB STATUS;
- PostgreSQL: the pg_locks system view
- SQL Server: the sys.dm_tran_locks DMV
- Oracle: v$lock and v$session
Regularly reviewing these can reveal long-held locks or frequent deadlocks, which suggest missing indexes or inefficient transaction design.
Transactions, isolation, and concurrency control together form the foundation of reliable SQL systems. Mastering them lets you balance correctness with performance and design applications that remain consistent even under heavy load.
Chapter 17: Users, Roles, and Permissions
Security in SQL databases begins with users, roles, and privileges. Every statement runs under a specific identity, and permissions determine which objects and operations are allowed. This chapter covers privilege management with GRANT and REVOKE, shows how to organize permissions with roles, applies the principle of least privilege, and explains how views can act as protective layers around data.
GRANT and REVOKE
Permissions control access to database objects such as tables, views, and routines. You can assign or remove them with GRANT and REVOKE. Each database defines its own privilege types, but core privileges are widely supported.
| Privilege | Purpose |
| SELECT | Read data from tables or views. |
| INSERT | Add new rows. |
| UPDATE | Modify existing rows. |
| DELETE | Remove rows. |
| EXECUTE | Run stored procedures or functions. |
| USAGE | Access objects like sequences or schemas (varies by engine). |
| ALL PRIVILEGES | Grant every available permission for the object. |
Granting privileges
-- Give SELECT and UPDATE rights on a table
GRANT SELECT, UPDATE ON employees TO analyst;
-- Allow a role to run a procedure
GRANT EXECUTE ON PROCEDURE payroll_run TO payroll_role;
-- Grant all privileges to an admin
GRANT ALL PRIVILEGES ON DATABASE company TO admin;
Privileges can be granted to individual users or to roles. Roles then group privileges for easier management.
Name roles after job functions (for example, sales_reader, data_entry). Then assign roles to users instead of repeating the same GRANT statements.
Revoking privileges
REVOKE removes privileges previously granted. Once a privilege is revoked, the user loses the corresponding access to the affected objects.
REVOKE UPDATE ON employees FROM analyst;
REVOKE ALL PRIVILEGES ON DATABASE company FROM temp_user;
Some databases also support REVOKE GRANT OPTION FOR, which prevents a user from passing privileges to others without removing their own access.
Ownership and inheritance
Object owners automatically have full rights to their objects. In some systems (like PostgreSQL), ownership can be transferred with ALTER ... OWNER TO. In others (like MySQL), privileges depend on grant tables and not strict ownership.
Roles may inherit privileges from other roles, forming a hierarchy. Use this to model layered access (for example, junior_analyst inherits reader privileges but adds limited write access).
Managing privileges
Privilege management involves creating users, assigning roles, and auditing access. Each system provides its own syntax for creating accounts and linking authentication credentials.
Creating users and roles
-- MySQL
CREATE USER 'jane'@'%' IDENTIFIED BY 'password123';
GRANT SELECT ON company.* TO 'jane'@'%';
-- PostgreSQL
CREATE ROLE analyst LOGIN PASSWORD 'strongpass';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO analyst;
-- SQL Server
CREATE LOGIN jane WITH PASSWORD = 'password123';
CREATE USER jane FOR LOGIN jane;
EXEC sp_addrolemember 'db_datareader', 'jane';
Use CREATE ROLE to define reusable sets of permissions. In many databases, users and roles are both represented as roles with or without the LOGIN privilege.
Viewing existing privileges
-- MySQL
SHOW GRANTS FOR 'jane'@'%';
-- PostgreSQL
\du jane -- psql shortcut to show roles and privileges
-- SQL Server
SELECT * FROM sys.database_permissions WHERE grantee_principal_id = USER_ID('jane');
Regularly auditing privileges helps maintain a secure environment and ensures compliance with organizational policies.
Grant chains and cascading effects
When a user is granted privileges with WITH GRANT OPTION, they can grant the same privileges to others. Revoking the original grant also removes privileges cascaded from it.
GRANT SELECT ON employees TO alice WITH GRANT OPTION;
GRANT SELECT ON employees TO bob; -- by alice
REVOKE SELECT ON employees FROM alice CASCADE; -- removes bob’s access too
Be cautious with WITH GRANT OPTION privileges. They can spread access beyond intended boundaries.
Least-privilege design
The principle of least privilege means giving users only the minimum rights required to perform their tasks. This limits the impact of mistakes and intrusions.
Implementing least privilege
- Start from no privileges, then grant only what is necessary.
- Use roles to represent job functions (for example, reader, editor, auditor).
- Revoke default privileges that new users inherit in some systems.
- Audit privilege use periodically and remove unused ones.
- Separate duties: do not combine administration and data entry in one account.
Least privilege applies not only to human users but also to application connections. Each application should connect with a role that has access only to the schema objects it needs.
Temporary or time-limited privileges
Some databases support time-limited grants or scripts to revoke privileges after audits. When not supported natively, manage expiry manually via scheduled jobs or automation tools.
-- Example manual revocation
GRANT SELECT ON reports TO contractor;
-- Later
REVOKE SELECT ON reports FROM contractor;
Views as security boundaries
Views provide a controlled interface to data. They can hide sensitive columns, filter rows, or present derived information. By granting access to the view instead of the base table, you restrict what users can see or modify.
Creating secure views
-- Create a view that hides salary details
CREATE VIEW staff_public AS
SELECT employee_id, name, department
FROM employees;
GRANT SELECT ON staff_public TO hr_assistant;
REVOKE SELECT ON employees FROM hr_assistant;
This allows HR assistants to read staff information without seeing salaries or private data.
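SQLite has no GRANT or REVOKE, but the column-hiding half of the pattern can still be sketched there; the employees table, its columns, and the sample row are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name TEXT, department TEXT, salary REAL
    )
""")
conn.execute("INSERT INTO employees VALUES (1, 'Ada', 'IT', 95000)")

# The view simply omits the sensitive column from its SELECT list.
conn.execute("""
    CREATE VIEW staff_public AS
    SELECT employee_id, name, department
    FROM employees
""")

cols = [d[0] for d in conn.execute("SELECT * FROM staff_public").description]
# 'salary' never appears among the view's result columns
```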
Updatable views
In many systems, simple views are updatable. More complex views (with joins, aggregates, or groupings) are read-only unless triggers or INSTEAD OF rules are defined.
-- SQL Server example
CREATE VIEW v_customers AS
SELECT customer_id, name, email FROM customers
WHERE active = 1
WITH CHECK OPTION; -- prevents updates that violate the filter
Use WITH CHECK OPTION to ensure updates through a view cannot move rows outside its defined filter.
Definer and invoker security
Views can run with the privileges of either the view’s creator (definer rights) or the calling user (invoker rights). Definer rights let users access underlying data indirectly, while invoker rights enforce their own privileges.
-- MySQL example
CREATE DEFINER = 'admin'@'%' VIEW confidential_summary AS
SELECT COUNT(*) AS employee_count FROM employees;
When a normal user queries this view, it executes under the admin privileges defined at creation time, not the caller’s privileges.
Auditing access via views
Views can also log or trace access if combined with triggers or audit tables. For example, you can record every read of a sensitive view by inserting a log entry into an audit table through a trigger or wrapper function.
By combining roles, privileges, and carefully designed views, you can enforce fine-grained access control and maintain both usability and security within your database systems.
Chapter 18: Views and Virtual Tables
Views are stored SQL queries that act like virtual tables. They simplify complex logic, present consistent query interfaces, and protect underlying schema details. This chapter covers how to create and maintain views, explains when they can be updated, introduces materialized views for stored query results, and outlines caching strategies for performance.
Creating and updating views
A view is defined with a CREATE VIEW statement that stores a query for later use. When queried, the database runs the underlying SQL dynamically, presenting results as a table.
-- Create a view of active customers
CREATE VIEW active_customers AS
SELECT customer_id, name, country
FROM customers
WHERE active = 1;
-- Query it like a table
SELECT * FROM active_customers;
Views can join multiple tables, filter rows, or include calculated columns. They are often used to encapsulate business logic or simplify repetitive queries.
Adopt a naming convention for views (for example, v_sales_summary, v_active_users). This helps distinguish them from base tables.
Replacing or altering a view
To modify an existing view, use CREATE OR REPLACE VIEW where supported, or drop and recreate it.
-- PostgreSQL and Oracle
CREATE OR REPLACE VIEW active_customers AS
SELECT customer_id, name, email
FROM customers
WHERE active = 1;
-- Where CREATE OR REPLACE is unavailable, drop and recreate
DROP VIEW IF EXISTS active_customers;
CREATE VIEW active_customers AS
SELECT customer_id, name, email
FROM customers
WHERE active = 1;
Altering a view does not change dependent privileges or references, but if underlying columns are renamed or removed, the view may break until redefined.
Dropping views
DROP VIEW active_customers;
Dropping a view removes only its definition; it does not affect underlying tables or data.
Updatable vs read-only views
Some views can be updated directly with INSERT, UPDATE, or DELETE. Others are inherently read-only because their definition makes it ambiguous which base rows should change.
Updatable views
Views are generally updatable if they meet these conditions:
- They reference a single base table.
- They include all non-nullable columns from the base table.
- They contain no GROUP BY, DISTINCT, aggregates, or set operations.
-- Updatable view
CREATE VIEW department_staff AS
SELECT employee_id, name, department_id
FROM employees;
-- Update through the view
UPDATE department_staff
SET department_id = 3
WHERE employee_id = 102;
The database propagates such updates directly to the base table.
Read-only views
Views that use joins, aggregates, or derived columns are read-only because updates cannot be unambiguously mapped back to base data.
-- Read-only view
CREATE VIEW department_summary AS
SELECT d.department_name,
COUNT(e.employee_id) AS staff_count
FROM departments AS d
LEFT JOIN employees AS e ON e.department_id = d.department_id
GROUP BY d.department_name;
Attempting to insert or update through such a view usually causes an error.
Some engines let you work around this with INSTEAD OF triggers, which translate view updates into custom logic.
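As a concrete sketch, SQLite supports INSTEAD OF triggers on views. All of its views are read-only by default, so this hypothetical department_staff view only accepts updates because the trigger routes them to the base table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name TEXT, department_id INTEGER
    )
""")
conn.execute("INSERT INTO employees VALUES (102, 'Ben', 1)")
conn.execute("""
    CREATE VIEW department_staff AS
    SELECT employee_id, name, department_id FROM employees
""")

# Translate an UPDATE against the view into an UPDATE on the base table.
conn.execute("""
    CREATE TRIGGER department_staff_upd
    INSTEAD OF UPDATE ON department_staff
    BEGIN
        UPDATE employees SET department_id = NEW.department_id
        WHERE employee_id = NEW.employee_id;
    END
""")

conn.execute("UPDATE department_staff SET department_id = 3 WHERE employee_id = 102")
dept = conn.execute(
    "SELECT department_id FROM employees WHERE employee_id = 102").fetchone()[0]
# dept is now 3
```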
Enforcing rules with WITH CHECK OPTION
WITH CHECK OPTION ensures that inserts or updates through a view do not produce rows that fall outside the view’s WHERE condition.
CREATE VIEW active_employees AS
SELECT employee_id, name, department_id
FROM employees
WHERE active = 1
WITH CHECK OPTION;
If a user tries to insert an inactive employee through this view, the database rejects the operation because it violates the condition.
Materialized views
A materialized view stores the results of a query physically, unlike a regular view which executes the query each time. This improves performance for expensive aggregations or joins but requires refresh management to stay up to date.
Creating a materialized view
-- PostgreSQL and Oracle
CREATE MATERIALIZED VIEW sales_summary AS
SELECT region_id,
SUM(total_amount) AS total_sales,
COUNT(order_id) AS num_orders
FROM orders
GROUP BY region_id;
After creation, querying the materialized view reads from the stored result instead of recomputing the query.
Refreshing materialized views
Because data changes over time, materialized views must be refreshed manually or automatically.
-- Manual refresh (PostgreSQL)
REFRESH MATERIALIZED VIEW sales_summary;
-- Oracle supports ON COMMIT refresh
CREATE MATERIALIZED VIEW sales_summary
REFRESH FAST ON COMMIT AS
SELECT region_id, SUM(total_amount) AS total_sales
FROM orders
GROUP BY region_id;
Fast refresh updates only changed rows if the engine supports materialized view logs. Full refresh rebuilds the entire dataset.
Emulating materialized views
Engines that lack native materialized views (such as MySQL and SQLite) can emulate them with a normal table and a scheduled job that repopulates it.
CREATE TABLE sales_summary AS
SELECT region_id,
SUM(total_amount) AS total_sales
FROM orders
GROUP BY region_id;
-- Later: rebuild periodically
TRUNCATE TABLE sales_summary;
INSERT INTO sales_summary
SELECT region_id,
SUM(total_amount)
FROM orders
GROUP BY region_id;
This approach gives similar performance benefits, though it requires external logic for consistency.
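The rebuild step can be scripted. This SQLite sketch uses DROP plus CREATE TABLE ... AS in place of TRUNCATE and INSERT ... SELECT; the orders and amounts are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        region_id INTEGER, total_amount REAL
    )
""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5)])

def refresh_sales_summary():
    # Full refresh: throw away the old snapshot and rebuild it from orders.
    conn.execute("DROP TABLE IF EXISTS sales_summary")
    conn.execute("""
        CREATE TABLE sales_summary AS
        SELECT region_id, SUM(total_amount) AS total_sales
        FROM orders
        GROUP BY region_id
    """)
    conn.commit()

refresh_sales_summary()
summary = conn.execute(
    "SELECT region_id, total_sales FROM sales_summary ORDER BY region_id").fetchall()
# summary -> [(1, 25.0), (2, 7.5)]
```

In production the refresh would run from a scheduler (cron, an event loop, or the engine's own job system) rather than inline.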
Caching strategies
Query results and materialized data can be cached at several layers to improve performance. The best strategy depends on data volatility and application workload.
Database-level caching
Most engines cache query execution plans and recently accessed data pages automatically. Tuning memory buffers and cache size parameters can greatly improve performance.
- MySQL: innodb_buffer_pool_size controls caching of table data and indexes.
- PostgreSQL: shared_buffers and work_mem influence cache behavior.
- SQL Server: uses automatic plan and page caching.
Application-level caching
Applications can store query results in memory (for example, Redis or Memcached) or local caches to avoid repeated database reads. This is effective for rarely changing data such as configuration lists or product catalogs.
# Example (Python): cache-aside read with a Redis-style client
def get_sales_summary(db, cache):
    key = "sales_summary_2025"
    data = cache.get(key)
    if data is None:
        data = db.query("SELECT * FROM sales_summary")
        cache.set(key, data, ttl=3600)  # expire after one hour
    return data
Hybrid caching with materialized views
Combining materialized views with in-memory caching can balance freshness and performance. The database maintains semi-fresh summaries, and the application caches them briefly to reduce load further.
For example, a dashboard might read from a refreshed hourly materialized view, caching the result in memory for only a few minutes.
Monitoring and invalidation
Effective caching requires monitoring to ensure accuracy and performance. Invalidate or refresh caches when base data changes significantly. Many frameworks provide cache invalidation hooks tied to database triggers or data pipelines.
Views, materialized views, and caching layers together let you balance speed and accuracy. Used well, they simplify queries and reduce computation costs while maintaining a consistent, reliable interface to your data.
Chapter 19: Stored Procedures and Functions
Procedural SQL lets you package logic close to the data. You can create PROCEDURE blocks that perform actions and FUNCTION blocks that return values. The ANSI core describes the idea of routines, while each engine supplies its own procedural dialect. Use routines to reduce round trips, to encapsulate business rules, and to share logic across applications.
When to use procedural SQL
Procedural code inside the database can be helpful, but it is not always the right tool. Start with set-based SQL, then reach for routines when you need orchestration, reusable logic, or server-side validation.
Good use cases
Use a PROCEDURE when you need multi-step workflows (validate input, insert rows, update aggregates). Use a FUNCTION when an expression needs a computed value (normalize a code, compute a score). Routines are helpful for enforcing rules that must run no matter which client calls the database.
When to avoid
Avoid routines when a single declarative statement does the job. If a query can be expressed as INSERT … SELECT or a well-written UPDATE with a JOIN, prefer the set-based version. Keep heavy business logic in application code when portability, versioning, and testing outweigh server-side benefits.
If you repeat the same multi-step workflow from several clients, move it into a PROCEDURE. If you repeat the same scalar calculation inside queries, extract it into a FUNCTION.
Portability considerations
Each vendor uses different routine syntax. MySQL and MariaDB use CREATE PROCEDURE and CREATE FUNCTION with BEGIN and END. PostgreSQL uses CREATE FUNCTION and CREATE PROCEDURE with LANGUAGE plpgsql or another language. SQL Server uses CREATE PROCEDURE and CREATE FUNCTION with T-SQL syntax (use of inserted and deleted pseudo-tables for triggers). Keep bodies simple, and isolate vendor specifics in the smallest possible surface area.
Variables, parameters, and return values
Routines introduce parameters, local variables, control flow, and return mechanisms. The details vary per engine; the goals are the same: validate input, compute results, and return a clear outcome.
CREATE PROCEDURE essentials
A PROCEDURE encapsulates actions. It can have IN, OUT, and INOUT parameters (support depends on the engine). It does not return a value in expressions. You call it with CALL (MySQL) or EXEC/EXECUTE (SQL Server, some others).
-- MySQL / MariaDB example
DELIMITER //
CREATE PROCEDURE add_user(IN p_email VARCHAR(255), IN p_name VARCHAR(100), OUT p_user_id BIGINT)
BEGIN
DECLARE v_exists INT DEFAULT 0;
SELECT COUNT(*) INTO v_exists
FROM users
WHERE email = p_email;
IF v_exists > 0 THEN
SET p_user_id = NULL;
SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Email already exists';
ELSE
INSERT INTO users(email, full_name, created_at)
VALUES(p_email, p_name, NOW());
SET p_user_id = LAST_INSERT_ID();
END IF;
END//
DELIMITER ;
CALL add_user('a@example.com', 'A User', @new_id);
SELECT @new_id;
Change the DELIMITER when defining routines in the MySQL CLI because semicolons appear inside the body. In GUI tools this is often handled for you.
-- SQL Server example
CREATE OR ALTER PROCEDURE dbo.AddUser
@Email NVARCHAR(255),
@Name NVARCHAR(100),
@UserId BIGINT OUTPUT
AS
BEGIN
SET NOCOUNT ON;
IF EXISTS(SELECT 1 FROM dbo.Users WHERE Email = @Email)
BEGIN
SET @UserId = NULL;
THROW 50000, 'Email already exists', 1;
END;
INSERT INTO dbo.Users(Email, FullName, CreatedAt)
VALUES(@Email, @Name, SYSDATETIME());
SET @UserId = SCOPE_IDENTITY();
END;
DECLARE @id BIGINT;
EXEC dbo.AddUser @Email = N'a@example.com', @Name = N'A User', @UserId = @id OUTPUT;
SELECT @id AS NewId;
CREATE FUNCTION essentials
A FUNCTION returns a value and can be used inside expressions. It typically has only IN parameters. Keep functions deterministic when possible, and avoid side effects (engines may restrict data modification inside a function).
-- PostgreSQL example (plpgsql)
CREATE OR REPLACE FUNCTION public.slugify(p_text text)
RETURNS text
LANGUAGE plpgsql
AS $$
DECLARE
v_text text := lower(trim(p_text));
BEGIN
-- collapse runs of non-alphanumeric characters (including spaces) into single hyphens
v_text := regexp_replace(v_text, '[^a-z0-9]+', '-', 'g');
v_text := btrim(v_text, '-');
RETURN v_text;
END;
$$;
SELECT slugify('Hello World & More'); -- returns 'hello-world-more'
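For comparison, a Python sketch of the same slugify idea: lowercase and trim, then collapse every run of non-alphanumeric characters into a single hyphen.

```python
import re

def slugify(text: str) -> str:
    text = text.strip().lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)  # runs of punctuation/spaces -> one hyphen
    return text.strip("-")                   # drop any leading/trailing hyphens

result = slugify("Hello World & More")
# result == "hello-world-more"
```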
If the logic belongs in a SELECT list or WHERE clause, prefer a FUNCTION. If the logic changes data across multiple tables, prefer a PROCEDURE.
Local variables and control flow
Engines provide DECLARE for locals, and IF, CASE, LOOP or WHILE for control flow. Use them sparingly; first consider whether a set-based expression is clearer.
-- MySQL / MariaDB: loop and exit
CREATE PROCEDURE demo_loop(IN p_n INT)
BEGIN
DECLARE i INT DEFAULT 1;
my_loop: LOOP
IF i > p_n THEN
LEAVE my_loop;
END IF;
INSERT INTO t(counter) VALUES(i);
SET i = i + 1;
END LOOP my_loop;
END;
Error handling and signaling
Use the engine’s error constructs to signal problems. MySQL uses SIGNAL. PostgreSQL uses RAISE. SQL Server uses THROW. Prefer meaningful messages and consistent SQLSTATE or error numbers.
-- PostgreSQL
CREATE OR REPLACE FUNCTION safe_divide(a numeric, b numeric)
RETURNS numeric
LANGUAGE plpgsql
AS $$
BEGIN
IF b = 0 THEN
RAISE EXCEPTION 'Division by zero';
END IF;
RETURN a / b;
END;
$$;
Transactions inside routines
Transaction control varies. PostgreSQL functions run inside the caller’s transaction (you cannot COMMIT in a function). SQL Server procedures may issue BEGIN TRAN and COMMIT. MySQL procedures run within the current transaction; you can use START TRANSACTION and COMMIT if the storage engine supports it. Keep boundaries clear and avoid long transactions inside routines.
Triggers
A trigger runs automatically in response to table events. Use triggers to enforce invariants, maintain audit trails, and derive values. Keep trigger bodies brief and predictable.
Trigger timing and events
Common timings are BEFORE, AFTER, and sometimes INSTEAD OF (often for views). Common events are INSERT, UPDATE, and DELETE. You can create separate triggers per event or a combined trigger depending on the engine.
Accessing rows inside triggers
MySQL and PostgreSQL use NEW and OLD row references. SQL Server uses the virtual tables inserted and deleted. Use these to read changed values and to validate or transform data.
-- MySQL: BEFORE INSERT to normalize and enforce
CREATE TRIGGER users_bi BEFORE INSERT ON users
FOR EACH ROW
BEGIN
SET NEW.email = LOWER(TRIM(NEW.email));
IF NEW.email IS NULL OR NEW.email = '' THEN
SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Email required';
END IF;
END;
-- PostgreSQL: AFTER UPDATE audit trail
CREATE TABLE user_audit(
user_id bigint,
changed_at timestamptz,
old_email text,
new_email text
);
CREATE OR REPLACE FUNCTION audit_user_email()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
IF NEW.email IS DISTINCT FROM OLD.email THEN
INSERT INTO user_audit(user_id, changed_at, old_email, new_email)
VALUES(NEW.id, now(), OLD.email, NEW.email);
END IF;
RETURN NEW;
END;
$$;
CREATE TRIGGER users_au
AFTER UPDATE ON users
FOR EACH ROW
EXECUTE FUNCTION audit_user_email();
-- SQL Server: AFTER INSERT trigger to maintain a counter
CREATE OR ALTER TRIGGER dbo.Users_AI ON dbo.Users
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;
UPDATE dbo.Counters
SET UserCount = UserCount + (SELECT COUNT(*) FROM inserted)
WHERE CounterName = 'Users';
END;
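SQLite folds the trigger function and the trigger into a single statement, so the email-audit idea above can be tested end to end with a simplified, invented schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("""
    CREATE TABLE user_audit (
        user_id INTEGER, old_email TEXT, new_email TEXT
    )
""")

# IS NOT is SQLite's null-safe inequality, mirroring IS DISTINCT FROM.
conn.execute("""
    CREATE TRIGGER users_au AFTER UPDATE ON users
    FOR EACH ROW WHEN NEW.email IS NOT OLD.email
    BEGIN
        INSERT INTO user_audit(user_id, old_email, new_email)
        VALUES (NEW.id, OLD.email, NEW.email);
    END
""")

conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")
conn.execute("UPDATE users SET email = 'b@example.com' WHERE id = 1")
conn.execute("UPDATE users SET email = 'b@example.com' WHERE id = 1")  # no change: no audit row
audit = conn.execute(
    "SELECT user_id, old_email, new_email FROM user_audit").fetchall()
```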
If trigger logic grows complex, move it into a PROCEDURE and call it explicitly to avoid hidden side effects.
Order, recursion, and performance
Trigger firing order can be unspecified or configurable. Some engines allow multiple triggers per timing; if order matters, consolidate logic into one trigger. Watch for unintended recursion when a trigger modifies the same table; guard with conditions or use session settings to disable nesting if available.
Best practices
Keep bodies small. Avoid network calls. Avoid nondeterministic behavior that surprises callers. Document every trigger with purpose and invariants. Test using representative data and concurrent sessions to catch locking issues early.
When logic outgrows a trigger, move it into a PROCEDURE or the application layer, where intent and flow are easier to follow.
Chapter 20: Working with Dates and Times
Dates, times, and timestamps represent real-world events. SQL engines provide arithmetic, extraction, formatting, time zone conversion, and interval math. Favor ISO standards, store in UTC when possible, and convert at the edges when displaying to users.
Prefer timestamp with time zone where supported, or store UTC in a plain timestamp and keep a separate time zone column for the user context.
Date arithmetic, extraction, and formatting
SQL supports adding and subtracting intervals, extracting fields like year or month, and formatting values for presentation. Use arithmetic for logic, use formatting only at the final presentation step.
EXTRACT and basic arithmetic
EXTRACT returns a part of a date or timestamp. Arithmetic uses intervals, or engine-specific numeric offsets. Prefer interval-aware syntax to avoid month-length surprises.
-- ANSI-style extraction
SELECT
EXTRACT(YEAR FROM CURRENT_DATE) AS year_num,
EXTRACT(MONTH FROM CURRENT_DATE) AS month_num,
EXTRACT(DAY FROM CURRENT_DATE) AS day_num;
-- ANSI-style addition and subtraction (engines vary)
-- PostgreSQL
SELECT CURRENT_DATE + INTERVAL '7 days' AS in_seven_days;
-- MySQL / MariaDB
SELECT DATE_ADD(CURDATE(), INTERVAL 7 DAY) AS in_seven_days,
DATE_SUB(CURDATE(), INTERVAL 1 MONTH) AS last_month;
-- SQL Server
SELECT DATEADD(DAY, 7, CAST(GETDATE() AS date)) AS in_seven_days,
DATEADD(MONTH, -1, CAST(GETDATE() AS date)) AS last_month;
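SQLite, absent from the list above, does the same arithmetic with modifier strings on its date() function; a quick check using fixed dates instead of CURRENT_DATE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# date(base, modifier) applies calendar-aware offsets like '+7 days' or '-1 month'.
in_seven, last_month = conn.execute(
    "SELECT date('2025-11-03', '+7 days'), date('2025-11-03', '-1 month')"
).fetchone()
# in_seven == '2025-11-10', last_month == '2025-10-03'
```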
DATE_TRUNC, bucketing, and grouping
Truncation rounds down to a boundary (hour, day, month). Use it for grouping facts into time buckets.
-- PostgreSQL bucketing by day and month
SELECT DATE_TRUNC('day', created_at) AS day_bucket, COUNT(*) FROM orders GROUP BY 1;
SELECT DATE_TRUNC('month', created_at) AS month_bucket, COUNT(*) FROM orders GROUP BY 1;
-- SQL Server: use DATETRUNC (SQL Server 2022+) or computed boundaries
SELECT DATETRUNC(day, CreatedAt) AS day_bucket, COUNT(*) FROM dbo.Orders GROUP BY DATETRUNC(day, CreatedAt);
-- MySQL: use DATE(), or format to YYYY-MM-01 for month buckets
SELECT DATE(created_at) AS day_bucket, COUNT(*) FROM orders GROUP BY DATE(created_at);
SELECT DATE_FORMAT(created_at, '%Y-%m-01') AS month_bucket, COUNT(*) FROM orders GROUP BY 1;
Differences and durations
Compute elapsed time with supported constructs. Prefer interval types; if the engine returns numbers, document the unit clearly.
-- PostgreSQL: interval result
SELECT (NOW() - order_placed_at) AS elapsed FROM orders WHERE id = 42;
-- MySQL: duration in seconds or minutes
SELECT TIMESTAMPDIFF(SECOND, order_placed_at, NOW()) AS elapsed_seconds FROM orders WHERE id = 42;
-- SQL Server: DATEDIFF returns integer units
SELECT DATEDIFF(second, OrderPlacedAt, SYSDATETIME()) AS elapsed_seconds FROM dbo.Orders WHERE Id = 42;
TO_CHAR, DATE_FORMAT, FORMAT and friends
Use formatting functions at the presentation boundary. Keep internal comparisons and joins in native date types, not in strings. Patterns differ across engines.
| Engine | Function | Example |
| PostgreSQL | TO_CHAR | TO_CHAR(NOW(), 'YYYY-MM-DD HH24:MI') |
| MySQL / MariaDB | DATE_FORMAT | DATE_FORMAT(NOW(), '%Y-%m-%d %H:%i') |
| SQL Server | FORMAT | FORMAT(SYSDATETIME(), 'yyyy-MM-dd HH:mm') |
If you already format with {fmt …} in application templates, keep the raw timestamp in SQL and format in the application layer for consistency.
Parsing strings into dates safely
Parsing is the inverse of formatting. Use explicit formats so inputs remain unambiguous.
-- PostgreSQL
SELECT TO_TIMESTAMP('2025-11-03 12:00', 'YYYY-MM-DD HH24:MI');
-- MySQL / MariaDB
SELECT STR_TO_DATE('2025-11-03 12:00', '%Y-%m-%d %H:%i');
-- SQL Server
SELECT TRY_CONVERT(datetime2, '2025-11-03 12:00', 120); -- ODBC style 120 = yyyy-mm-dd hh:mi:ss
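The same explicit-format discipline applies in application code. Python's strptime accepts only inputs that match the stated pattern, so an ambiguous regional string fails fast instead of being silently misread:

```python
from datetime import datetime

# ISO-style input matching the pattern parses cleanly.
parsed = datetime.strptime("2025-11-03 12:00", "%Y-%m-%d %H:%M")

# A day-first regional string does not match the pattern and is rejected.
try:
    datetime.strptime("03/11/2025 12:00", "%Y-%m-%d %H:%M")
    ambiguous_accepted = True
except ValueError:
    ambiguous_accepted = False
```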
Avoid ambiguous regional formats like 03/11/2025. Use ISO 8601 strings (YYYY-MM-DD) for portability.
Time zones and intervals
Storing in UTC reduces ambiguity. Convert to a user’s time zone only when rendering or when the business rule depends on local time. Intervals represent human concepts like months or business days; arithmetic with intervals should respect calendar rules.
AT TIME ZONE and conversion
Engines offer explicit conversion functions. Make the source zone and target zone clear. Document whether the stored value is UTC or local time.
-- PostgreSQL: convert UTC to Europe/London at query time
SELECT (created_at AT TIME ZONE 'UTC') AT TIME ZONE 'Europe/London' AS created_local
FROM orders;
-- PostgreSQL alternative when created_at is timestamptz in UTC
SELECT created_at AT TIME ZONE 'Europe/London' AS created_local FROM orders;
-- SQL Server: SWITCHOFFSET uses a fixed offset (it does not track DST);
-- prefer AT TIME ZONE 'GMT Standard Time' on SQL Server 2016+
SELECT SWITCHOFFSET(TODATETIMEOFFSET(CreatedAtUtc, '+00:00'), '+01:00') AS created_local
FROM dbo.Orders;
-- MySQL 8+: convert between zones
SELECT CONVERT_TZ(created_at_utc, '+00:00', 'Europe/London') AS created_local
FROM orders;
Name columns for their zone, for example created_at_utc. Add a computed or indexed persisted column in local time if you group by local dates often.
Intervals vs absolute time shifts
Intervals represent amounts like one month or two days. Absolute shifts like adding 86,400 seconds can be wrong around daylight saving transitions. Prefer calendar-aware intervals.
-- PostgreSQL: calendar-aware
SELECT (DATE '2025-03-28' + INTERVAL '5 days') AS plus_5d,
(TIMESTAMPTZ '2025-03-31 23:30 Europe/London' AT TIME ZONE 'Europe/London')
+ INTERVAL '1 day' AS next_day_local;
-- SQL Server: add calendar units
SELECT DATEADD(day, 5, CAST('2025-03-28' AS date)) AS plus_5d;
-- MySQL: INTERVAL expression
SELECT DATE_ADD('2025-03-28', INTERVAL 5 DAY) AS plus_5d;
Normalizing to day boundaries
Many reports need local-day buckets. Convert to the target zone first, then truncate.
-- PostgreSQL
SELECT DATE_TRUNC('day', created_at AT TIME ZONE 'Europe/London') AS day_local, COUNT(*)
FROM orders
GROUP BY 1
ORDER BY 1;
-- SQL Server (2022+)
SELECT DATETRUNC(day, SWITCHOFFSET(TODATETIMEOFFSET(CreatedAtUtc, '+00:00'), '+01:00')) AS day_local,
COUNT(*)
FROM dbo.Orders
GROUP BY DATETRUNC(day, SWITCHOFFSET(TODATETIMEOFFSET(CreatedAtUtc, '+00:00'), '+01:00'))
ORDER BY day_local;
-- MySQL
SELECT DATE(CONVERT_TZ(created_at_utc, '+00:00', 'Europe/London')) AS day_local, COUNT(*)
FROM orders
GROUP BY day_local
ORDER BY day_local;
Generating date ranges and calendars
Create ranges to fill gaps in reports. Use set-generating constructs or helper tables.
-- PostgreSQL: generate_series over dates
SELECT d::date AS dt
FROM generate_series(DATE '2025-10-01', DATE '2025-10-31', INTERVAL '1 day') AS g(d);
-- SQL Server: tally using recursive CTE
WITH d AS (
SELECT CAST('2025-10-01' AS date) AS dt
UNION ALL
SELECT DATEADD(day, 1, dt) FROM d WHERE dt < '2025-10-31'
)
SELECT dt FROM d OPTION (MAXRECURSION 0);
-- MySQL: recursive CTE (8.0+)
WITH RECURSIVE d AS (
SELECT DATE('2025-10-01') AS dt
UNION ALL
SELECT DATE_ADD(dt, INTERVAL 1 DAY) FROM d WHERE dt < '2025-10-31'
)
SELECT dt FROM d;
For recurring reports, build a persistent calendar table with columns such as dt, is_weekend, is_holiday, and week_start; join to it in reports for speed and clarity.
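The recursive approach can be checked end to end with Python's built-in sqlite3 module (SQLite dialect, used as a portable demonstrator; the date range is the same October 2025 example as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE d(dt) AS (
        SELECT DATE('2025-10-01')
        UNION ALL
        SELECT DATE(dt, '+1 day') FROM d WHERE dt < '2025-10-31'
    )
    SELECT dt FROM d
""").fetchall()
print(len(rows))                   # 31 (days in October)
print(rows[0][0], rows[-1][0])     # 2025-10-01 2025-10-31
```

SQLite's DATE(dt, '+1 day') modifier plays the role of DATE_ADD / DATEADD in the examples above; the recursion terminates once the end date is reached.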
Common pitfalls and checks
Validate inputs, document assumptions, and add constraints where the engine allows them.
- Reject impossible dates at ingest with a CHECK like created_at_utc >= TIMESTAMP '2000-01-01'.
- Avoid mixing local times from different time zones in one column without an explicit zone or separate columns.
- Store durations as intervals or integers in clear units (seconds, minutes); name the column accordingly.
Chapter 21: String and Pattern Operations
Strings are among the most common data types in SQL. Operations include concatenation, trimming, replacing, substring extraction, and pattern matching using wildcards or regular expressions. Each engine follows the ANSI baseline but extends it with extra string and pattern tools.
Normalize text case with LOWER() or UPPER() to maintain consistency.
Concatenation, SUBSTRING, TRIM, REPLACE
String construction and manipulation appear in most queries. Use ANSI forms when possible; switch to dialect-specific variants only for advanced use cases.
Concatenation
ANSI SQL defines || for concatenation, though some engines use + or CONCAT(). Prefer the function form for cross-engine portability.
-- ANSI
SELECT 'Hello' || ' ' || 'World' AS greeting;
-- MySQL / MariaDB
SELECT CONCAT('Hello', ' ', 'World') AS greeting;
-- SQL Server
SELECT 'Hello' + ' ' + 'World' AS greeting;
-- PostgreSQL supports both ANSI and CONCAT
SELECT CONCAT('Hello', ' ', 'World');
The + operator in SQL Server performs arithmetic when any operand is numeric. Ensure all concatenated expressions are explicitly VARCHAR or NVARCHAR.
SUBSTRING extraction
SUBSTRING() (or SUBSTR()) retrieves part of a string. Position numbering starts at 1 in most engines. Negative indexes are engine-specific.
-- ANSI form
SELECT SUBSTRING('abcdef' FROM 2 FOR 3); -- returns 'bcd'
-- MySQL / MariaDB
SELECT SUBSTRING('abcdef', 2, 3);
-- SQL Server
SELECT SUBSTRING('abcdef', 2, 3);
-- PostgreSQL (supports both forms)
SELECT SUBSTR('abcdef', 2, 3);
Use POSITION() or CHARINDEX() to locate delimiters, then apply SUBSTRING() around them for clarity.
TRIM and whitespace cleanup
TRIM() removes leading, trailing, or both types of spaces (or custom characters). SQL also offers LTRIM() and RTRIM() in many dialects.
-- ANSI
SELECT TRIM(' text ') AS cleaned;
SELECT TRIM(BOTH 'x' FROM 'xxxtrimxx') AS cleaned;
-- MySQL / MariaDB
SELECT TRIM(' text '), LTRIM(' text '), RTRIM(' text ');
-- SQL Server
SELECT TRIM(' text '), LTRIM(' text '), RTRIM(' text '); -- TRIM requires SQL Server 2017+
-- PostgreSQL
SELECT TRIM(' text ');
REPLACE and substitution
REPLACE() substitutes every occurrence of one substring with another. All major engines support it with identical signatures.
SELECT REPLACE('abc def abc', 'abc', 'XYZ'); -- returns 'XYZ def XYZ'
REPLACE() is case-sensitive in most databases. Use LOWER() or UPPER() to normalize the input before substitution if needed.
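Case sensitivity is easy to verify. This sketch uses Python's built-in sqlite3 module (SQLite dialect, as a portable demonstrator; the sample strings are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# REPLACE is case-sensitive: the uppercase 'ABC' is left untouched
out1 = conn.execute("SELECT REPLACE('abc ABC', 'abc', 'X')").fetchone()[0]
print(out1)  # X ABC
# Normalize case first when the match should be case-insensitive
out2 = conn.execute("SELECT REPLACE(LOWER('abc ABC'), 'abc', 'X')").fetchone()[0]
print(out2)  # X X
```

Note that normalizing with LOWER() also lowercases the parts you keep, so apply it deliberately (often in a staging column) rather than in place.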
Other helpful functions
| Function | Purpose | Example |
| LENGTH() / CHAR_LENGTH() | Count characters in a string | SELECT LENGTH('hello'); |
| POSITION() / CHARINDEX() | Find substring index | SELECT POSITION('b' IN 'abc'); |
| LEFT(), RIGHT() | Extract leading or trailing characters | SELECT LEFT('abcdef', 3); |
| CONCAT_WS() | Concatenate with a separator (MySQL, PostgreSQL) | SELECT CONCAT_WS('-', '2025', '11', '03'); |
Regular expressions
Regular expressions extend pattern matching beyond LIKE wildcards. They follow POSIX, PCRE, or T-SQL style syntax depending on the database. Use them for validation, extraction, and conditional filtering, but be cautious with performance.
Matching and filtering
-- PostgreSQL (POSIX)
SELECT 'abc123' ~ '^[a-z]+[0-9]+$' AS is_match;
-- MySQL 8+ (default is case-insensitive unless specified)
SELECT 'abc123' REGEXP '^[a-z]+[0-9]+$';
-- SQL Server (use LIKE or CLR regex functions; no native REGEXP)
-- Example using a pattern for digits only
SELECT CASE WHEN name LIKE '%[0-9]%' THEN 1 ELSE 0 END AS has_digits FROM dbo.People;
Extraction and replacement
Some databases offer regex capture and substitution functions to transform text in one call.
-- PostgreSQL
SELECT REGEXP_REPLACE('abc123def', '[0-9]+', '###', 'g'); -- abc###def
SELECT REGEXP_MATCHES('id=42', '[0-9]+'); -- returns {42} (a text array)
-- MySQL 8+
SELECT REGEXP_REPLACE('abc123def', '[0-9]+', '###');
Use REGEXP_REPLACE() to strip unwanted characters like spaces, punctuation, or control codes, but test expressions carefully before applying them to production data.
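SQLite is another engine that parses REGEXP syntax but ships no implementation, so you must supply one. The sketch below (Python's built-in sqlite3 module, used as a portable demonstrator) registers a user function that SQLite invokes as regexp(pattern, text):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite rewrites `x REGEXP y` into a call to regexp(y, x),
# so the registered function receives the pattern first
conn.create_function("regexp", 2,
                     lambda pattern, text: re.search(pattern, text) is not None)
matched = conn.execute("SELECT 'abc123' REGEXP '^[a-z]+[0-9]+$'").fetchone()[0]
unmatched = conn.execute("SELECT 'abc' REGEXP '^[0-9]+$'").fetchone()[0]
print(matched, unmatched)  # 1 0
```

The same create_function mechanism can wrap re.sub to emulate REGEXP_REPLACE where the engine lacks it.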
Performance and indexes
Regex scans are CPU-intensive because they bypass most index optimizations. Restrict them with other predicates or use precomputed flags to reduce cost.
LIKE patterns
The LIKE operator supports basic wildcard matching using % (any number of characters) and _ (a single character). It works in all engines and often benefits from indexes when the pattern does not begin with a wildcard.
Basic usage
SELECT name FROM customers WHERE name LIKE 'A%'; -- begins with A
SELECT name FROM customers WHERE name LIKE '%son'; -- ends with son
SELECT name FROM customers WHERE name LIKE '_a%'; -- second letter is a
Case sensitivity
Default behavior varies. MySQL is case-insensitive for most collations. PostgreSQL is case-sensitive by default but provides ILIKE for case-insensitive matches. SQL Server depends on the database collation.
-- PostgreSQL
SELECT name FROM users WHERE name ILIKE 'john%'; -- case-insensitive
-- MySQL
SELECT name FROM users WHERE name LIKE 'john%'; -- case-insensitive in utf8_general_ci
-- SQL Server
SELECT name FROM users WHERE name LIKE 'john%'; -- depends on collation
Leading wildcards (patterns like %text%) prevent index use and can slow queries significantly. Consider full-text search features for heavy pattern workloads.
Escaping special characters
To match literal % or _ symbols, define an escape character using ESCAPE.
SELECT filename
FROM logs
WHERE filename LIKE '%\%%' ESCAPE '\'; -- match a literal %
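The ESCAPE clause behaves the same way in SQLite, which makes it easy to verify. A runnable sketch with Python's built-in sqlite3 module (the logs table and filenames are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (filename TEXT)")
conn.executemany("INSERT INTO logs VALUES (?)",
                 [("disk_90%.log",), ("disk_full.log",)])
# '\%' matches a literal percent sign thanks to ESCAPE '\'
rows = conn.execute(
    r"SELECT filename FROM logs WHERE filename LIKE '%\%%' ESCAPE '\'"
).fetchall()
print(rows)  # only the filename containing a literal %
```

Only disk_90%.log matches; the unescaped % wildcards on either side still match anything, while the escaped one demands a literal percent sign.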
Combining with logic
Combine LIKE with logical operators for complex filters. Remember to group conditions with parentheses to maintain precedence.
SELECT title
FROM books
WHERE (title LIKE 'SQL%' OR title LIKE 'Database%')
AND title NOT LIKE '%Advanced%';
Reserve LIKE for light pattern filtering.
Chapter 22: Advanced Queries
Advanced SQL features like window functions, pivoting, and recursive common table expressions (CTEs) let you express complex analytical and hierarchical logic directly in queries. These constructs build on core SELECT syntax but operate across multiple rows at once, providing cumulative, comparative, and hierarchical insights.
Window functions (OVER, PARTITION BY, RANK, LAG/LEAD)
Window functions compute values across related rows without collapsing them into groups. They use the OVER() clause, which can define partitions (logical subgroups) and orderings. Typical examples include running totals, row numbers, moving averages, and value comparisons.
Basic structure
A window function looks like an aggregate but uses OVER() to define its scope. The result set retains one row per input record.
SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY customer_id) AS total_per_customer
FROM orders;
This computes the total amount per customer without grouping rows together. Each order retains its own detail.
ROW_NUMBER(), RANK(), and DENSE_RANK()
Ranking functions assign order within each partition. They differ in how they handle ties.
SELECT
customer_id,
amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS row_num,
RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rank_num,
DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS dense_rank_num
FROM orders;
| Function | Behavior |
| ROW_NUMBER() | Sequential numbering (no gaps) |
| RANK() | Gaps when ties occur |
| DENSE_RANK() | No gaps; next rank follows immediately |
Pair ranking functions with an ORDER BY on unique keys to avoid inconsistent results between runs.
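The tie-handling differences can be seen directly. This sketch uses Python's built-in sqlite3 module (SQLite supports window functions since 3.25; the orders data is invented, with a deliberate tie on 50.0):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 50.0), (1, 50.0), (1, 30.0)])
rows = conn.execute("""
    SELECT amount,
           ROW_NUMBER() OVER w AS row_num,
           RANK()       OVER w AS rank_num,
           DENSE_RANK() OVER w AS dense_num
    FROM orders
    WINDOW w AS (ORDER BY amount DESC)
    ORDER BY row_num
""").fetchall()
print(rows)
```

The tied 50.0 rows share RANK 1, so 30.0 receives RANK 3 (a gap) but DENSE_RANK 2 (no gap), while ROW_NUMBER stays strictly sequential.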
LAG() and LEAD()
These functions look backward or forward in the ordered set to compare values across rows. They are useful for computing differences between consecutive events or detecting changes.
SELECT
customer_id,
order_date,
amount,
LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_amount,
LEAD(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS next_amount,
amount - LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS diff_prev
FROM orders;
Combine COALESCE() with LAG() or LEAD() to substitute defaults for missing neighbors, such as COALESCE(LAG(value) …, 0).
Moving averages and cumulative totals
By defining a frame with ROWS or RANGE, you can compute sliding windows such as a 7-day moving average or cumulative sums.
-- 7-day moving average by date
SELECT
order_date,
AVG(amount) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg
FROM orders;
-- cumulative total
SELECT
customer_id,
order_date,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date ROWS UNBOUNDED PRECEDING) AS running_total
FROM orders;
Frame clauses (ROWS vs RANGE) differ: ROWS counts physical rows, while RANGE uses value intervals (for example, numeric or time-based distances).
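A running total is easy to verify end to end. This sketch uses Python's built-in sqlite3 module (SQLite dialect, as a portable demonstrator; the orders rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "2025-01-01", 10.0),
    (1, "2025-01-02", 20.0),
    (1, "2025-01-03", 5.0),
    (2, "2025-01-01", 7.0),
])
rows = conn.execute("""
    SELECT customer_id, order_date, amount,
           SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date
                             ROWS UNBOUNDED PRECEDING) AS running_total
    FROM orders
    ORDER BY customer_id, order_date
""").fetchall()
print([r[3] for r in rows])  # [10.0, 30.0, 35.0, 7.0]
```

The total resets at the partition boundary: customer 1 accumulates 10, 30, 35, while customer 2 starts over at 7.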
Pivoting and unpivoting data
Pivoting transforms rows into columns, while unpivoting reverses that process. These operations help reshape data for reports or analytics.
Pivoting with aggregates
The concept is universal, though syntax varies widely. You can use conditional aggregation or vendor-specific PIVOT operators.
-- Portable approach: conditional aggregation
SELECT
customer_id,
SUM(CASE WHEN status = 'pending' THEN amount ELSE 0 END) AS pending_total,
SUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END) AS paid_total,
SUM(CASE WHEN status = 'shipped' THEN amount ELSE 0 END) AS shipped_total
FROM orders
GROUP BY customer_id;
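Because conditional aggregation is plain ANSI SQL, it runs anywhere, including SQLite. A runnable sketch using Python's built-in sqlite3 module (the orders rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, status TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "pending", 10.0), (1, "paid", 25.0), (1, "paid", 5.0),
    (2, "shipped", 40.0),
])
rows = conn.execute("""
    SELECT customer_id,
           SUM(CASE WHEN status = 'pending' THEN amount ELSE 0 END) AS pending_total,
           SUM(CASE WHEN status = 'paid'    THEN amount ELSE 0 END) AS paid_total,
           SUM(CASE WHEN status = 'shipped' THEN amount ELSE 0 END) AS shipped_total
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
print(rows)
```

Each status becomes a column; customer 1's two paid orders collapse into a single 30.0 paid_total.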
SQL Server and Oracle include a dedicated PIVOT clause:
-- SQL Server example
SELECT *
FROM (
SELECT customer_id, status, amount FROM orders
) AS src
PIVOT (
SUM(amount) FOR status IN ([pending], [paid], [shipped])
) AS p;
Unpivoting (normalizing columns)
Unpivoting converts columns back into rows, often for cleaner storage or joining with metadata tables.
-- SQL Server example
SELECT customer_id, status, amount
FROM orders_pivoted
UNPIVOT (
amount FOR status IN ([pending], [paid], [shipped])
) AS u;
Portable alternatives rely on UNION ALL constructs:
-- Portable UNION ALL approach
SELECT customer_id, 'pending' AS status, pending_total AS amount FROM orders_pivoted
UNION ALL
SELECT customer_id, 'paid', paid_total FROM orders_pivoted
UNION ALL
SELECT customer_id, 'shipped', shipped_total FROM orders_pivoted;
Recursive CTEs
Recursive common table expressions let you query hierarchical or graph-like data. They start with a base case, then repeatedly join or union new rows until no further matches appear. Use them for trees, bill-of-materials, or dependency chains.
Structure of a recursive CTE
A recursive CTE has two parts: an initial query (the base case) and a recursive query that references the CTE itself. Combine them with UNION ALL.
WITH RECURSIVE subordinates AS (
-- base case
SELECT employee_id, manager_id, full_name, 1 AS level
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- recursive step
SELECT e.employee_id, e.manager_id, e.full_name, s.level + 1
FROM employees e
JOIN subordinates s ON e.manager_id = s.employee_id
)
SELECT * FROM subordinates ORDER BY level, employee_id;
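SQLite supports WITH RECURSIVE with the same shape, so the hierarchy walk can be tested directly. A sketch using Python's built-in sqlite3 module (the four-person org chart is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INT, manager_id INT, full_name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (1, None, "Ada"),   # root of the tree
    (2, 1, "Ben"),
    (3, 1, "Cal"),
    (4, 2, "Dee"),
])
rows = conn.execute("""
    WITH RECURSIVE subordinates AS (
        SELECT employee_id, manager_id, full_name, 1 AS level
        FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.employee_id, e.manager_id, e.full_name, s.level + 1
        FROM employees e
        JOIN subordinates s ON e.manager_id = s.employee_id
    )
    SELECT full_name, level FROM subordinates ORDER BY level, employee_id
""").fetchall()
print(rows)  # [('Ada', 1), ('Ben', 2), ('Cal', 2), ('Dee', 3)]
```

Each recursive step joins the next layer of reports onto the rows found so far, incrementing level until no employee has a manager in the previous layer.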
Limiting recursion depth
Engines typically restrict recursion depth for safety. You can control it explicitly.
-- PostgreSQL / SQLite
WITH RECURSIVE t(n) AS (
SELECT 1
UNION ALL
SELECT n + 1 FROM t WHERE n < 10
)
SELECT * FROM t;
-- SQL Server
WITH t AS (
SELECT 1 AS n
UNION ALL
SELECT n + 1 FROM t WHERE n < 10
)
SELECT * FROM t
OPTION (MAXRECURSION 100);
Always bound recursion (for example WHERE level < 10 or MAXRECURSION) to prevent runaway loops.
Practical uses
- Exploding organizational hierarchies (manager → employee trees)
- Walking parent-child product assemblies
- Computing dependency graphs (modules, files, prerequisites)
- Generating date or number sequences without auxiliary tables
Chapter 23: Importing and Exporting Data
Importing and exporting are essential for moving data between databases and external systems. SQL engines include commands for bulk loading, copying, and exporting tables in various formats. Efficient data transfer relies on proper file formats, encoding, and matching column definitions.
LOAD DATA INFILE, COPY, BULK INSERT
Most databases provide fast server-side commands for reading or writing large text files. These bypass normal client insert loops for performance.
LOAD DATA INFILE (MySQL / MariaDB)
LOAD DATA INFILE reads data directly from a file into a table. Files can be local or server-side depending on privileges and configuration.
LOAD DATA INFILE '/var/data/users.csv'
INTO TABLE users
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(id, name, email, created_at);
Each line represents a row; fields correspond to columns. Use IGNORE 1 LINES to skip headers.
The MySQL secure_file_priv setting may restrict directories. Use LOAD DATA LOCAL INFILE for client-side files (ensure it is enabled).
COPY (PostgreSQL)
PostgreSQL’s COPY reads or writes delimited files in one operation. It supports CSV, text, and binary formats.
-- Import
COPY users (id, name, email, created_at)
FROM '/var/data/users.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',', QUOTE '"');
-- Export
COPY (SELECT * FROM users)
TO '/var/data/users_export.csv'
WITH (FORMAT csv, HEADER true);
When running COPY from client tools, use \copy (psql) to operate on client-local files rather than server-side paths.
BULK INSERT (SQL Server)
SQL Server provides BULK INSERT for high-speed imports. It is similar to PostgreSQL’s COPY.
BULK INSERT dbo.Users
FROM 'C:\data\users.csv'
WITH (
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
FIRSTROW = 2
);
Server permissions and file access policies apply. Use OPENROWSET(BULK …) for more flexible data streaming.
Exporting to CSV or JSON
Export operations let you share query results with other systems, data scientists, or visualization tools. Common targets are CSV and JSON, both supported natively by modern SQL engines.
CSV export
CSV (comma-separated values) is portable and widely supported, though limited in type fidelity. Use text-safe quoting and explicit column names.
-- MySQL / MariaDB
SELECT id, name, email
FROM users
INTO OUTFILE '/var/data/users_out.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';
-- PostgreSQL
COPY (SELECT id, name, email FROM users)
TO '/var/data/users_out.csv'
WITH (FORMAT csv, HEADER true);
-- SQL Server (using BCP utility)
EXEC xp_cmdshell 'bcp "SELECT id, name, email FROM mydb.dbo.Users" queryout "C:\data\users_out.csv" -c -t, -T';
UTF-8 ensures compatibility with modern tools. Avoid locale-dependent encodings like Latin-1 unless required.
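When the database lacks a convenient export path, a few lines of client code do the job. This sketch (Python's stdlib csv and sqlite3 modules; the users table, io.StringIO stand-in for a real file, and sample row are invented) writes query results as CSV with a header derived from cursor metadata:

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INT, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'ada@example.com')")

cur = conn.execute("SELECT id, name, email FROM users")
buf = io.StringIO()  # stands in for an open file
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
writer.writerow([col[0] for col in cur.description])  # header from cursor metadata
writer.writerows(cur)  # stream rows straight from the cursor
print(buf.getvalue())
```

Streaming rows directly from the cursor keeps memory flat even for large result sets; swap the StringIO for open(path, "w", encoding="utf-8", newline="") in real use.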
JSON export
JSON export preserves structure and nested data. Engines that support JSON types can serialize rows directly from queries.
-- PostgreSQL
SELECT json_agg(t)
FROM (SELECT id, name, email FROM users) AS t;
-- MySQL 8+
SELECT JSON_ARRAYAGG(JSON_OBJECT('id', id, 'name', name, 'email', email))
FROM users;
-- SQL Server
SELECT id, name, email
FROM dbo.Users
FOR JSON AUTO;
For large result sets, add LIMIT or TOP clauses, or use pagination for manageable file sizes.
External tools and scripts
For large-scale exports, use database utilities or scripting languages with proper drivers. Examples include mysqldump, pg_dump, bcp, or Python scripts using pandas.to_csv() or json.dump(). These tools can stream large results efficiently and preserve schema metadata.
ETL considerations
ETL (Extract, Transform, Load) pipelines handle structured data flows between systems. SQL often serves as the transformation layer. Keep imports and exports idempotent, validated, and traceable.
Extract
Extract from reliable sources. Use stable queries or incremental filters (WHERE updated_at > last_run) to fetch only new or changed rows. Document query versions for reproducibility.
Transform
Perform type conversions, normalizations, and validations before loading. Use staging tables to isolate intermediate results.
-- Example: clean email data in staging
UPDATE staging_users
SET email = LOWER(TRIM(email))
WHERE email IS NOT NULL;
Load
Load transformed data efficiently. Use bulk commands (COPY, LOAD DATA INFILE, BULK INSERT) where possible. Wrap in transactions for atomicity and to allow rollback on failure.
Validation and audit
After each load, verify counts and constraints. Record timestamps and file hashes for audit trails.
-- Example validation
SELECT COUNT(*) AS src_count FROM staging_users;
SELECT COUNT(*) AS tgt_count FROM users;
Automation and scheduling
Use external schedulers or orchestration tools (cron, Airflow, etc.) to automate ETL. SQL scripts should log start and end times, row counts, and errors for monitoring.
Chapter 24: Optimization and Query Tuning
SQL performance depends on how efficiently the database engine retrieves and processes data. Query tuning involves examining execution plans, managing indexes and statistics, and understanding how caching and memory usage affect results. Even simple queries can slow down when tables grow, so systematic tuning is essential for scalable applications.
Execution plans and EXPLAIN
Every SQL query produces an execution plan (a set of steps the optimizer chooses to retrieve data). The EXPLAIN (or equivalent) command shows how the engine interprets your query, which indexes it uses, and how tables are joined or scanned.
Reading an execution plan
An execution plan reveals the order of operations, join algorithms, and data access paths (index lookup, full table scan, etc.). Use it to detect missing indexes, redundant joins, or expensive sorts.
-- MySQL / MariaDB
EXPLAIN SELECT * FROM orders WHERE customer_id = 5;
-- PostgreSQL
EXPLAIN SELECT * FROM orders WHERE customer_id = 5;
-- SQL Server
SET SHOWPLAN_TEXT ON;
GO
SELECT * FROM orders WHERE customer_id = 5;
GO
SET SHOWPLAN_TEXT OFF;
In MySQL and PostgreSQL, EXPLAIN ANALYZE (or EXPLAIN (ANALYZE, BUFFERS)) executes the query and reports actual run times and row counts.
EXPLAIN ANALYZE
SELECT customer_id, SUM(amount)
FROM orders
GROUP BY customer_id;
EXPLAIN on representative data volumes. Plans may differ dramatically between small test sets and full production sizes.
Key indicators in a plan
- Seq Scan / Table Scan — full table read; may be slow for large data.
- Index Scan / Seek — uses an index to locate rows efficiently.
- Nested Loop / Hash Join / Merge Join — join strategy; the optimizer picks the cheapest for data size and available indexes.
- Sort — sorting step; watch for large memory usage or disk spills.
- Filter — condition applied after scanning; indicates possible index improvements.
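The scan-versus-seek distinction can be observed in miniature with SQLite's EXPLAIN QUERY PLAN (used here via Python's built-in sqlite3 module as a portable demonstrator; the orders table is invented, and the exact plan text varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT)")

# Without an index on customer_id the filter forces a full scan
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 5"
).fetchall()
conn.execute("CREATE INDEX idx_customer ON orders(customer_id)")
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 5"
).fetchall()
print(before[0][3])  # SCAN ...
print(after[0][3])   # SEARCH ... USING INDEX idx_customer ...
```

The detail column flips from SCAN to SEARCH with the index, which is precisely the change you look for when reading any engine's plan output.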
Practical tuning workflow
- Profile the query using EXPLAIN or the database's profiler.
- Check that filtering columns are indexed appropriately.
- Remove unnecessary ORDER BY, DISTINCT, or subqueries.
- Consider breaking very complex queries into smaller, staged ones.
- Re-run EXPLAIN and compare cost estimates and runtime.
Index hints, statistics, and caching
Indexing, accurate statistics, and caching mechanisms are central to query performance. The optimizer relies on them to choose efficient paths. Poor statistics or unsuitable indexes can lead to slow queries even with correct logic.
Index hints
Engines usually select the best index automatically, but hints allow manual overrides when necessary for specific workloads or query patterns.
-- MySQL / MariaDB
SELECT * FROM orders USE INDEX (idx_customer_id) WHERE customer_id = 5;
-- SQL Server
SELECT * FROM dbo.Orders WITH (INDEX(idx_customer_id)) WHERE customer_id = 5;
-- PostgreSQL
-- No direct hint syntax, but you can influence plans via enable_* parameters
SET enable_seqscan = off;
SELECT * FROM orders WHERE customer_id = 5;
RESET enable_seqscan;
Updating and maintaining statistics
Statistics describe data distribution and help the optimizer estimate row counts. Outdated statistics can cause poor plan choices.
-- MySQL / MariaDB
ANALYZE TABLE orders;
-- PostgreSQL
ANALYZE orders;
-- SQL Server
UPDATE STATISTICS dbo.Orders;
Regularly analyze heavily updated tables. Most databases also collect statistics automatically, but manual refreshes may be needed after large data loads.
Caching and memory tuning
Database engines cache recently accessed pages, index blocks, and execution plans to reduce disk I/O. Proper configuration of cache size and query reuse can drastically improve performance.
- MySQL: innodb_buffer_pool_size controls the main cache; ideally set to 60–80% of system memory on dedicated servers.
- PostgreSQL: shared_buffers and work_mem affect caching and sort operations; effective_cache_size informs the planner about OS-level caching.
- SQL Server: uses dynamic memory management; tune max server memory and inspect plan cache hits with sys.dm_exec_cached_plans.
Rewriting queries for index use
Sometimes tuning requires minor rewrites to align with index structures. Common techniques include:
- Use exact column matches in WHERE clauses before applying functions (e.g., WHERE date_col >= '2025-01-01' rather than WHERE YEAR(date_col) = 2025).
- Prefer EXISTS over IN when joining large subqueries.
- Avoid SELECT *; fetch only needed columns to reduce I/O.
- Consolidate redundant joins or nested views.
Monitoring and profiling tools
Use each engine’s built-in monitoring features to detect slow queries and resource bottlenecks.
| Engine | Tools |
| MySQL / MariaDB | SHOW PROFILES, performance_schema, slow query log |
| PostgreSQL | pg_stat_statements, EXPLAIN (ANALYZE, BUFFERS) |
| SQL Server | Query Store, Dynamic Management Views (DMVs), Activity Monitor |
Denormalization and partitioning trade-offs
Normalization improves consistency and avoids redundancy, but highly normalized schemas can cause performance issues for large analytical queries or heavy joins. Denormalization stores repeated data intentionally to speed up reads, while partitioning divides tables into segments for manageability and performance.
When denormalization helps
Denormalization is useful for read-heavy systems, reporting layers, or data warehouses where query simplicity and speed outweigh small risks of redundancy. For example, storing a customer_name directly in an orders table avoids a frequent join.
-- Example of selective denormalization
-- (UPDATE with a FROM clause: PostgreSQL / SQL Server syntax; MySQL uses a multi-table UPDATE)
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100);
UPDATE orders
SET customer_name = c.name
FROM customers c
WHERE c.id = orders.customer_id;
Partitioning strategies
Partitioning splits large tables into smaller physical units (called partitions) that are queried transparently. It improves performance by reducing the data scanned, and aids in maintenance and archiving.
-- PostgreSQL: range partitioning by date
-- (assumes orders was created with PARTITION BY RANGE (order_date))
CREATE TABLE orders_2025 PARTITION OF orders
FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
-- MySQL: range partitioning
CREATE TABLE orders (
id BIGINT,
order_date DATE,
amount DECIMAL(10,2)
)
PARTITION BY RANGE (YEAR(order_date)) (
PARTITION p2024 VALUES LESS THAN (2025),
PARTITION p2025 VALUES LESS THAN (2026)
);
PostgreSQL, MySQL, and Oracle support range, list, and hash partitioning natively. SQL Server uses partitioned tables with schemes and functions.
Balancing normalization, denormalization, and partitioning
Combine approaches where each fits best:
- Keep OLTP (transactional) data normalized for integrity and small updates.
- Use denormalized or materialized summary tables for analytics.
- Partition large tables to reduce maintenance time and improve archival performance.
Chapter 25: Database Design Patterns
Relational design patterns help you express business rules as tables, keys, and constraints. This chapter focuses on the most common relationship shapes and the practical details that keep data consistent (and queries fast). You will see how to model one to many and many to many, how to implement junction tables for flexibility, and how to plan safe schema changes with versioned migrations.
One-to-many relationships
In a one to many relationship, a single parent row is associated with zero or more child rows. Typical examples are authors to books, or departments to employees. The usual implementation is a primary key on the parent and a foreign key from the child to that parent.
Canonical layout for 1:N
-- Parent
CREATE TABLE authors (
author_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name VARCHAR(200) NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Child (many side)
CREATE TABLE books (
book_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
author_id INTEGER NOT NULL,
title VARCHAR(300) NOT NULL,
published_on DATE,
CONSTRAINT fk_books_author
FOREIGN KEY (author_id)
REFERENCES authors(author_id)
ON UPDATE CASCADE
ON DELETE RESTRICT
);
This pattern keeps the relationship in the child, which simplifies inserts for new children and avoids nullable columns in the parent.
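The guarantee the foreign key provides can be exercised directly. This sketch uses Python's built-in sqlite3 module (SQLite dialect; note SQLite enforces foreign keys only after PRAGMA foreign_keys = ON, and the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE books (
        book_id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL
            REFERENCES authors(author_id) ON DELETE RESTRICT,
        title TEXT NOT NULL
    );
    INSERT INTO authors VALUES (1, 'Ada Lovelace');
    INSERT INTO books VALUES (1, 1, 'Notes');
""")
try:
    conn.execute("DELETE FROM authors WHERE author_id = 1")
    blocked = False
except sqlite3.IntegrityError:
    blocked = True  # RESTRICT refuses to orphan the child row
print(blocked)  # True
```

The delete of a parent with surviving children fails with an integrity error, which is exactly the historical-integrity behavior RESTRICT promises.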
Choosing ON DELETE behavior
Choose a deletion rule that matches your business rules. The following table summarizes common options.
| Rule | Effect on children | When to use |
| RESTRICT | Deletion blocked if children exist | Historical integrity (you must move or delete children first) |
| CASCADE | Children deleted automatically | Dependent data that has no meaning without the parent |
| SET NULL | Child foreign key set to NULL | Children can survive without a parent (must allow NULL) |
| SET DEFAULT | Child foreign key set to default value | Route orphans to a default record (rare but useful) |
SQLite requires PRAGMA foreign_keys = ON to enforce constraints, and SET DEFAULT requires a defined default value on the column.
Indexing FOREIGN KEY columns
Always index the child foreign key column for fast joins and to avoid table scans during parent deletes or updates.
CREATE INDEX ix_books_author_id ON books(author_id);
The parent primary key is indexed automatically. The child index helps both JOIN performance and referential checks.
Enforcing optional vs required relationships
Use NOT NULL on the child foreign key to make the relationship required. Allow NULL for optional relationships.
-- Required relationship
ALTER TABLE books
ALTER COLUMN author_id SET NOT NULL; -- Dialect may vary …
Many-to-many relationships
Many to many arises when multiple rows on each side can relate to multiple rows on the other side. Examples include students and courses, or posts and tags. Implement this with a dedicated junction table that holds pairs of foreign keys.
Canonical layout for M:N
CREATE TABLE students (
student_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
full_name VARCHAR(200) NOT NULL
);
CREATE TABLE courses (
course_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
code VARCHAR(20) NOT NULL UNIQUE,
title VARCHAR(200) NOT NULL
);
-- Junction table (no own identity needed unless you add attributes)
CREATE TABLE student_course (
student_id INTEGER NOT NULL,
course_id INTEGER NOT NULL,
enrolled_on DATE NOT NULL DEFAULT CURRENT_DATE,
CONSTRAINT pk_student_course PRIMARY KEY (student_id, course_id),
CONSTRAINT fk_sc_student FOREIGN KEY (student_id)
REFERENCES students(student_id)
ON DELETE CASCADE,
CONSTRAINT fk_sc_course FOREIGN KEY (course_id)
REFERENCES courses(course_id)
ON DELETE CASCADE
);
CREATE INDEX ix_sc_course ON student_course(course_id);
The composite primary key prevents duplicate pairs. Index both sides for symmetric join performance.
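That duplicate-pair guarantee is worth a quick test. A sketch using Python's built-in sqlite3 module (SQLite dialect; the sample ids are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student_course (
        student_id INTEGER NOT NULL,
        course_id  INTEGER NOT NULL,
        PRIMARY KEY (student_id, course_id)
    )
""")
conn.execute("INSERT INTO student_course VALUES (42, 7)")
try:
    conn.execute("INSERT INTO student_course VALUES (42, 7)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True  # composite key blocks the duplicate pair
print(duplicate_rejected)  # True
```

The second insert of the same (student_id, course_id) pair fails at the database level, so no application-side deduplication is needed.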
Adding attributes to relationships
Relationship attributes (for example role, grade, position) belong in the junction table because they describe the pairing.
ALTER TABLE student_course
ADD COLUMN grade VARCHAR(2); -- e.g., 'A', 'B', …
Querying across M:N
-- All courses for a given student
SELECT c.course_id, c.code, c.title
FROM student_course sc
JOIN courses c ON c.course_id = sc.course_id
WHERE sc.student_id = 42;
-- All students in a course
SELECT s.student_id, s.full_name
FROM student_course sc
JOIN students s ON s.student_id = sc.student_id
WHERE sc.course_id = 7;
A composite index on (student_id, course_id) is already provided by the primary key; the extra index on (course_id) makes the second query efficient.
Anti-patterns to avoid
Avoid storing multiple ids in a single column (comma separated values) or variable numbers of columns (for example tag1_id, tag2_id, tag3_id). These break normalization and make queries and constraints unreliable.
Junction tables and referential integrity
Junction tables are the workhorse for flexible relationships. Keep them narrow and well indexed, then enforce integrity with keys and constraints.
Composite keys vs surrogate keys
Use a composite primary key of the two foreign keys when the pair must be unique and you rarely reference the relationship row directly. Add a surrogate key (for example id) only when the relationship itself is a first-class entity (for example you attach many child rows to the relationship or you reference it from other tables).
-- Surrogate key variant for richer relationship entities
CREATE TABLE membership (
membership_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
user_id INTEGER NOT NULL,
team_id INTEGER NOT NULL,
role VARCHAR(40) NOT NULL,
started_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT uq_membership UNIQUE (user_id, team_id),
CONSTRAINT fk_m_user FOREIGN KEY (user_id) REFERENCES users(user_id),
CONSTRAINT fk_m_team FOREIGN KEY (team_id) REFERENCES teams(team_id)
);
Ensuring uniqueness and preventing duplicates
Provide a uniqueness guarantee either through a composite primary key or a separate UNIQUE constraint. This avoids accidental duplicate relationships.
-- If you chose a surrogate key, add an explicit UNIQUE pair
CREATE UNIQUE INDEX ux_membership_user_team
ON membership(user_id, team_id);
Maintaining integrity with CHECK constraints
When additional rules apply (for example role values, valid date ranges, non overlapping intervals), enforce them with CHECK constraints so the database guards your data consistently.
ALTER TABLE membership
ADD CONSTRAINT ck_membership_role
CHECK (role IN ('owner','admin','member'));
Some engines have historically ignored CHECK expressions or require enabling constraint enforcement explicitly. Verify your target engine and test migrations with real data.
Cascade rules that fit reality
In junction tables, ON DELETE CASCADE on both foreign keys is often appropriate because the relationship has no meaning if either side disappears. If junction rows carry historical facts that must remain, prefer RESTRICT with application policies that retire records instead of deleting them.
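A minimal demonstration of the cascade behavior, using SQLite via Python (note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON; table names are illustrative):

```python
# Sketch: ON DELETE CASCADE on a junction table, shown in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this for enforcement
conn.execute("CREATE TABLE students (student_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE courses (course_id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE student_course (
        student_id INTEGER REFERENCES students(student_id) ON DELETE CASCADE,
        course_id  INTEGER REFERENCES courses(course_id)  ON DELETE CASCADE,
        PRIMARY KEY (student_id, course_id)
    )
""")
conn.execute("INSERT INTO students VALUES (1), (2)")
conn.execute("INSERT INTO courses VALUES (7)")
conn.execute("INSERT INTO student_course VALUES (1, 7), (2, 7)")

# Deleting a student removes that student's junction rows automatically.
conn.execute("DELETE FROM students WHERE student_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM student_course").fetchone()[0]
print(remaining)  # only student 2's enrollment survives
```

The junction row for student 1 disappears with the student, which matches the rule that a pairing has no meaning once either side is gone.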
Schema versioning and migration strategies
Databases evolve. Treat schema changes as versioned, repeatable steps that can be applied forward (and sometimes backward) across environments. Your goal is reliability, auditability, and minimal downtime.
Versioning with a migrations ledger
Maintain a dedicated table that records applied migrations. Each migration has an ordered id, a description, a checksum, and timestamps.
CREATE TABLE schema_migrations (
version VARCHAR(50) PRIMARY KEY, -- e.g., 20251103_1400_add_index
description VARCHAR(200) NOT NULL,
checksum VARCHAR(64) NOT NULL,
applied_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
Store migration scripts in version control with filenames such as V20251103_1400__add_index.sql. The double underscore separates the version and description; the rest is free text.
Forward-only vs reversible migrations
Forward only favors simplicity (roll forward with new scripts to fix issues). Reversible migrations include an UP and DOWN step that can revert changes when safe. Use reversible for non destructive changes; use forward only when data loss or complex transformations make reversal unsafe.
| Change | Safe to reverse | Notes |
| Add column (nullable) | Yes | Provide DOWN to drop the column |
| Add index | Yes | DOWN drops the index |
| Rename column | Sometimes | Use two phase copy with populated values … |
| Drop column/table | No | Forward only; export data first |
Zero downtime patterns
When your application must remain available, use additive, compatible changes first, followed by code rollout, then cleanup.
-- Phase 1: additive
ALTER TABLE orders ADD COLUMN order_total_cents BIGINT;
-- Backfill in batches (pseudo code shown in SQL for brevity)
UPDATE orders
SET order_total_cents = (SELECT SUM(quantity * unit_price_cents) FROM order_lines ol
WHERE ol.order_id = orders.order_id)
WHERE order_total_cents IS NULL
AND order_id BETWEEN :lo AND :hi;
-- Phase 2: deploy code reading the new column
-- Phase 3: make non-nullable once all rows are backfilled
ALTER TABLE orders ALTER COLUMN order_total_cents SET NOT NULL; -- Dialect may vary …
Idempotent and environment aware scripts
Migrations should be idempotent where possible. Use IF NOT EXISTS guards or detect state before applying changes, then record the version in schema_migrations.
-- Example guard (dialect specific)
CREATE TABLE IF NOT EXISTS audit_log (
audit_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
actor VARCHAR(100) NOT NULL,
action VARCHAR(100) NOT NULL,
occurred_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
Data migrations with validation
Structural changes often require moving data. Pair transformation steps with validation queries so you can prove correctness before cleanup.
-- Transform
UPDATE membership SET role = 'member' WHERE role = 'basic';
-- Validate (counts must match)
SELECT COUNT(*) FROM membership WHERE role = 'basic'; -- Expect 0
Tooling and workflow
Adopt a migration tool that suits your stack (for example pure SQL runners or code based tools). Integrate migrations into continuous integration so every deployment applies the exact same steps. Gate releases on successful migration application and verification queries.
Keep every migration small, isolated, and named for its purpose. Avoid modifying old migration files after they have shipped. If you must change history, create a new migration that corrects the state.
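The ledger-driven workflow above can be sketched as a tiny forward-only runner. This is an illustration, not a real tool: the migration ids and statements are invented, scripts would normally be read from versioned files, and the in-memory SQLite database stands in for your target engine.

```python
# Sketch: a minimal forward-only migration runner with a schema_migrations ledger.
import hashlib
import sqlite3

migrations = {  # normally loaded from files such as V20250101_0900__create_users.sql
    "20250101_0900_create_users": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
    "20250102_0900_add_email": "ALTER TABLE users ADD COLUMN email TEXT",
}

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS schema_migrations (
        version    TEXT PRIMARY KEY,
        checksum   TEXT NOT NULL,
        applied_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")

def apply_pending(conn, migrations):
    applied = {v for (v,) in conn.execute("SELECT version FROM schema_migrations")}
    for version in sorted(migrations):  # ordered ids apply deterministically
        if version in applied:
            continue  # already recorded: skip, which makes reruns safe
        sql = migrations[version]
        conn.execute(sql)
        conn.execute(
            "INSERT INTO schema_migrations (version, checksum) VALUES (?, ?)",
            (version, hashlib.sha256(sql.encode()).hexdigest()),
        )
    conn.commit()

apply_pending(conn, migrations)
apply_pending(conn, migrations)  # second run is a no-op: the ledger prevents reapplication
count = conn.execute("SELECT COUNT(*) FROM schema_migrations").fetchone()[0]
print(count)  # 2
```

The checksum lets the runner detect a shipped script that was edited after the fact, which is exactly the situation the "never modify old migrations" rule guards against.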
Chapter 26: Integration with Applications
Applications talk to databases through drivers and connection libraries. This chapter shows practical patterns for using SQL from Python, Java, and JavaScript. You will learn how to connect, execute parameterized queries, handle transactions, and avoid security mistakes that lead to injection and data leaks.
SQL in Python (sqlite3/mysql-connector-python)
Python ships with sqlite3 and supports other engines through third party packages. The essential steps are: create a connection, get a cursor, execute statements with parameters, then commit or roll back.
Connecting and querying with sqlite3
# Python 3
import sqlite3
conn = sqlite3.connect("app.db", isolation_level=None)  # isolation_level=None means autocommit mode
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT UNIQUE)")
cur.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("Robin", "r@example.com"))
cur.execute("SELECT id, name, email FROM users WHERE email = ?", ("r@example.com",))
row = cur.fetchone()
print(row)
conn.commit()
conn.close()
Use ? placeholders in sqlite3. The driver binds values safely and handles quoting.
Connecting to MySQL with mysql-connector-python
# pip install mysql-connector-python
import mysql.connector as mysql
conn = mysql.connect(
host="localhost",
user="app_user",
password="…",
database="appdb",
autocommit=False
)
with conn.cursor() as cur:
cur.execute("INSERT INTO accounts (email, plan) VALUES (%s, %s)", ("a@example.com", "pro"))
cur.execute("SELECT id, plan FROM accounts WHERE email = %s", ("a@example.com",))
print(cur.fetchone())
conn.commit()
conn.close()
MySQL drivers use %s placeholders. Do not interpolate strings directly. Let the driver bind values.
Transactions and context managers
from contextlib import contextmanager
import sqlite3
@contextmanager
def tx(db_path):
conn = sqlite3.connect(db_path)
try:
yield conn
conn.commit()
except Exception:
conn.rollback()
raise
finally:
conn.close()
with tx("app.db") as conn:
conn.execute("UPDATE balances SET cents = cents - ? WHERE user_id = ?", (500, 1))
conn.execute("UPDATE balances SET cents = cents + ? WHERE user_id = ?", (500, 2))
Note that sqlite3 uses file level locking, which limits concurrent writes.
SQL in Java (JDBC)
JDBC provides a uniform interface to many databases. Use a connection pool (HikariCP or similar), prepare statements with placeholders, and manage transactions with try with resources.
Basic JDBC pattern with prepared statements
// Requires driver on classpath, for example MySQL: com.mysql:mysql-connector-j …
import java.sql.*;
public class Repo {
private final DataSource ds;
public Repo(DataSource ds) { this.ds = ds; }
public User findByEmail(String email) throws SQLException {
String sql = "SELECT id, name, email FROM users WHERE email = ?";
try (Connection c = ds.getConnection();
PreparedStatement ps = c.prepareStatement(sql)) {
ps.setString(1, email);
try (ResultSet rs = ps.executeQuery()) {
if (rs.next()) {
return new User(rs.getLong("id"), rs.getString("name"), rs.getString("email"));
}
return null;
}
}
}
public void transfer(long fromId, long toId, long cents) throws SQLException {
String debit = "UPDATE balances SET cents = cents - ? WHERE user_id = ?";
String credit = "UPDATE balances SET cents = cents + ? WHERE user_id = ?";
try (Connection c = ds.getConnection()) {
c.setAutoCommit(false);
try (PreparedStatement d = c.prepareStatement(debit);
PreparedStatement cr = c.prepareStatement(credit)) {
d.setLong(1, cents); d.setLong(2, fromId); d.executeUpdate();
cr.setLong(1, cents); cr.setLong(2, toId); cr.executeUpdate();
c.commit();
} catch (SQLException ex) {
c.rollback();
throw ex;
}
}
}
}
HikariCP is a widely used default in modern stacks.
Placeholder styles across drivers
| Driver | Placeholder form | Example |
| MySQL / PostgreSQL | ? | WHERE email = ? |
| Oracle | :name | WHERE email = :email |
| SQL Server | ? (in PreparedStatement) | WHERE id = ? |
SQL in JavaScript (mysql2/pg/better-sqlite3)
In Node, use battle tested drivers. Prefer parameter arrays or named parameters. For pooling, create a singleton pool for the process and reuse it.
MySQL with mysql2 and a pool
// npm i mysql2
const mysql = require("mysql2/promise");
const pool = mysql.createPool({
uri: "mysql://app_user:…@localhost:3306/appdb",
connectionLimit: 10
});
async function getUser(email) {
const [rows] = await pool.query(
"SELECT id, name FROM users WHERE email = ?",
[email]
);
return rows[0] || null;
}
PostgreSQL with pg
// npm i pg
const { Pool } = require("pg");
const pool = new Pool({ connectionString: "postgres://app_user:…@localhost/appdb" });
async function transfer(client, fromId, toId, cents) {
await client.query("UPDATE balances SET cents = cents - $1 WHERE user_id = $2", [cents, fromId]);
await client.query("UPDATE balances SET cents = cents + $1 WHERE user_id = $2", [cents, toId]);
}
async function run() {
const client = await pool.connect();
try {
await client.query("BEGIN");
await transfer(client, 1, 2, 500);
await client.query("COMMIT");
} catch (e) {
await client.query("ROLLBACK");
throw e;
} finally {
client.release();
}
}
SQLite with better-sqlite3 (sync API)
// npm i better-sqlite3
const Database = require("better-sqlite3");
const db = new Database("app.db");
db.prepare("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT, email TEXT UNIQUE)").run();
db.prepare("INSERT INTO users (name, email) VALUES (?, ?)").run("Robin", "r@example.com");
const row = db.prepare("SELECT id, name FROM users WHERE email = ?").get("r@example.com");
console.log(row);
Whichever driver you use, bind values through its placeholder style: ?, $1, or driver specific named parameters.
Avoiding SQL injection
SQL injection happens when user input becomes part of the SQL syntax. The primary defense is parameterized statements. Additional defenses include least privilege accounts, input validation for shape and length, and escaping only as a last resort when a driver lacks parameters for special cases.
Good vs bad patterns
# Bad (string concatenation)
sql = "SELECT * FROM users WHERE email = '" + email + "'"  # vulnerable
# Good (parameters)
sql = "SELECT * FROM users WHERE email = ?"
cur.execute(sql, (email,))
// Bad in Node
const sql = `SELECT * FROM users WHERE id = ${req.query.id}`; // vulnerable
// Good
const [rows] = await pool.query("SELECT * FROM users WHERE id = ?", [req.query.id]);
Dynamic filters without concatenation
# Python example that builds WHERE safely
base = "SELECT id, name FROM products WHERE 1=1"
conds = []
vals = []
if q := params.get("q"):
conds.append("AND name ILIKE ?")
vals.append("%" + q + "%")
if brand := params.get("brand"):
conds.append("AND brand = ?")
vals.append(brand)
sql = " ".join([base] + conds)
cur.execute(sql, tuple(vals))
Handling LIKE wildcards and escaping
Bind the pattern as a parameter and escape driver specific wildcard characters in the value when necessary. Do not inline user input inside quotes.
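A small sketch of this, using SQLite via Python: the user's own % and _ characters are escaped inside the value, the pattern is bound as a parameter, and an explicit ESCAPE clause tells the engine which character is the escape (table and data are invented):

```python
# Sketch: safe LIKE search with user input that contains wildcard characters.
import sqlite3

def escape_like(term, esc="\\"):
    # escape the escape character first, then the LIKE wildcards
    return term.replace(esc, esc + esc).replace("%", esc + "%").replace("_", esc + "_")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.executemany("INSERT INTO products VALUES (?)", [("100% cotton",), ("100g cotton",)])

user_input = "100%"  # the user means a literal percent sign, not a wildcard
pattern = "%" + escape_like(user_input) + "%"
rows = conn.execute(
    "SELECT name FROM products WHERE name LIKE ? ESCAPE '\\'",
    (pattern,),
).fetchall()
print(rows)  # only the row containing a literal '%' matches
```

Without the escaping step, "100%" would match both rows because % acts as a wildcard; with it, only the literal match returns.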
Stored procedures and ORMs are not automatic protection
Procedures and ORMs can still be vulnerable if they concatenate strings. Verify parameter binding and review generated SQL. Inspect logs for the final SQL text with placeholders rather than values.
Security pitfalls
Security depends on layers. The following items cover frequent mistakes and the safer alternative.
Secrets in code and connection strings
| Pitfall | Safer approach |
| Hardcoded passwords in source | Use environment variables or a secrets manager; rotate regularly |
| Plain text connection strings in logs | Redact credentials; configure logger filters |
| Credentials shared across services | Use per service principals with least privileges |
Network and transport
Require TLS for connections, validate server certificates, and restrict inbound access to the database from application subnets only. Disable direct public access. Use firewall rules or security groups that allow only known application nodes.
Authorization inside the database
Create dedicated roles for application tasks, then GRANT only the permissions required. Avoid using a superuser or root in production. Separate read and write roles. For admin jobs, use a separate connection and role with elevated privileges.
Migration safety
Schema changes can cause outages when long locks block queries. Prefer additive changes, backfill in batches, and create needed indexes before setting columns to NOT NULL. Run heavy operations during maintenance windows.
Exposing raw errors
Never return database error messages directly to users. Map them to generic messages. Log full details to a secure sink for operators to review.
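One way to apply this in application code, sketched in Python with sqlite3 (the function and table names are illustrative): catch the driver's exception, log the full detail server side, and surface only a generic message.

```python
# Sketch: map raw database errors to a generic user-facing message
# while logging full detail to an operator-only sink.
import logging
import sqlite3

log = logging.getLogger("db.audit")

def safe_lookup(conn, email):
    try:
        return conn.execute("SELECT id FROM users WHERE email = ?", (email,)).fetchone()
    except sqlite3.Error as exc:
        log.error("query failed: %s", exc)  # full detail stays on the server
        raise RuntimeError("Service temporarily unavailable") from None

conn = sqlite3.connect(":memory:")  # no users table exists, so the query fails
try:
    safe_lookup(conn, "r@example.com")
except RuntimeError as e:
    message = str(e)
print(message)  # the caller never sees the SQL error text
```

The `from None` suppresses exception chaining so the original database error cannot leak through a traceback returned to the client.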
Resource exhaustion
Use connection pools with sensible limits. Set query timeouts and statement timeouts. Monitor for slow queries and add indexes where appropriate. Guard endpoints with rate limits so expensive queries cannot be spammed.
Chapter 27: Working with Large Data Sets
As data volume grows, a single database instance or table can become a bottleneck. Large data sets require strategies that spread data across partitions, nodes, or even multiple engines while still providing a coherent query interface. This chapter introduces partitioning, sharding, and distributed query techniques used by modern systems to manage scale without losing consistency.
Partitioning and sharding basics
Partitioning divides one logical table into multiple physical parts within the same database. Sharding goes a step further, splitting data across multiple database servers. Both approaches improve performance by limiting how much data each query must scan and by parallelizing operations.
Horizontal and vertical partitioning
Partitioning can happen by rows (horizontal) or columns (vertical):
| Type | Description | Use case |
| Horizontal | Each partition holds a subset of rows based on a key (for example date, region, id range) | Common in time series or multi tenant applications |
| Vertical | Split columns into separate tables, often by access pattern | Improves performance when some columns are rarely queried |
Example: range partitioning by date
Most engines support declarative partitioning syntax. This example shows PostgreSQL style range partitions.
CREATE TABLE sales (
id BIGSERIAL PRIMARY KEY,
region TEXT NOT NULL,
sale_date DATE NOT NULL,
amount NUMERIC(10,2)
) PARTITION BY RANGE (sale_date);
CREATE TABLE sales_2025q1 PARTITION OF sales
FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');
CREATE TABLE sales_2025q2 PARTITION OF sales
FOR VALUES FROM ('2025-04-01') TO ('2025-07-01');
Queries automatically use the correct partition when filtering by the partition key, and each partition can have its own index and storage parameters.
Hash partitioning
Hash partitioning distributes rows more evenly when ranges are not predictable, such as user ids or session keys.
CREATE TABLE sessions (
session_id BIGINT,
user_id BIGINT,
started_at TIMESTAMP,
PRIMARY KEY (session_id)
) PARTITION BY HASH (user_id);
CREATE TABLE sessions_p0 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE sessions_p1 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE sessions_p2 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE sessions_p3 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 3);
Sharding across servers
Sharding divides data across multiple servers (each with its own schema). Applications or middleware determine which shard to query based on a shard key. For example, users 1–10000 on shard 1, 10001–20000 on shard 2, and so on.
# Pseudo code
def shard_for_user(user_id):
shard_index = (user_id % NUM_SHARDS)
return connections[shard_index]
conn = shard_for_user(15234)
cur = conn.cursor()
cur.execute("SELECT * FROM orders WHERE user_id = %s", (15234,))
Each shard runs independently, and joins across shards require application logic or special frameworks to merge results.
Rebalancing and growth
Shards can fill unevenly as data grows. To rebalance, resharding tools copy data to new nodes based on a new key distribution. Use consistent hashing to minimize the amount of data moved when adding shards.
-- Conceptual representation
Hash(user_id) % N_shards -- determines target shard
-- When N_shards increases, only a fraction of data moves
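A runnable sketch of that property, assuming a simple hash ring with virtual nodes (shard names and counts are invented): when a fourth shard joins a three-shard ring, only the keys that land between the new node's ring positions and their predecessors move.

```python
# Sketch: consistent hashing ring showing that adding a shard
# moves only a fraction of keys, not all of them.
import bisect
import hashlib

def h(value):
    # stable hash onto the ring; md5 used only for even distribution
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

class Ring:
    def __init__(self, shards, vnodes=64):
        # each shard gets many virtual points on the ring for smoother balance
        self.points = sorted((h(f"{s}-{i}"), s) for s in shards for i in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def shard_for(self, key):
        i = bisect.bisect(self.keys, h(key)) % len(self.keys)
        return self.points[i][1]

before = Ring(["shard1", "shard2", "shard3"])
after = Ring(["shard1", "shard2", "shard3", "shard4"])
moved = sum(1 for uid in range(10_000) if before.shard_for(uid) != after.shard_for(uid))
print(moved / 10_000)  # roughly 1/4 of keys relocate, not all of them
```

With plain modulo hashing, growing from 3 to 4 shards would relocate about three quarters of all keys; the ring keeps the moved fraction near 1/N.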
Federated or external partitions
Some databases (for example MySQL with the FEDERATED engine, or PostgreSQL with postgres_fdw) support tables that point to data in remote servers. These act as partitions hosted elsewhere but queried locally.
CREATE SERVER remote_db
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'reports.example.com', dbname 'analytics');
CREATE FOREIGN TABLE analytics.sales_2025
(id BIGINT, region TEXT, amount NUMERIC)
SERVER remote_db
OPTIONS (table_name 'sales_2025');
Federated queries and distributed SQL
Federated and distributed SQL systems allow a single query to span multiple nodes or databases as if they were one. They handle data location, parallel execution, and result aggregation transparently to the client.
Federated queries
Federation is common when integrating heterogeneous sources such as operational and analytics databases. Tools like postgres_fdw, mysql_fdw, or cross engine layers like Presto or Trino can query across data sets and merge results at runtime.
SELECT o.id, o.amount, c.region
FROM orders o
JOIN foreign_customers c ON o.customer_id = c.id
WHERE c.region = 'EU';
The federated layer manages connections to both data sources and streams results back to the client. This allows unified analysis without duplicating data, though at a performance cost for large joins.
Distributed SQL engines
Distributed SQL platforms (for example CockroachDB, YugabyteDB, Google Spanner) distribute both data and queries automatically. They replicate data across nodes, provide ACID transactions, and scale horizontally without manual sharding.
| System | Model | Key features |
| CockroachDB | Shared nothing, range based | Automatic rebalancing, PostgreSQL compatible |
| YugabyteDB | Tablet partitioned, PostgreSQL compatible | Strong consistency, global replication |
| Google Spanner | Globally distributed, TrueTime API | Serializable transactions, global scale |
These systems appear as a single logical database but partition data internally. They use consensus protocols (such as Raft or Paxos) to coordinate replicas and ensure consistency.
Trade offs and considerations
Distributed SQL simplifies scaling but introduces coordination overhead. Latency increases when replicas span regions, and schema changes may propagate slowly. Evaluate whether your workload requires true distributed consistency or can tolerate eventual models (for example in caching or analytics).
Query federation vs data replication
Federation queries remote data on demand. Replication copies data locally. Federation reduces duplication but adds latency; replication speeds queries but needs synchronization jobs. Many systems combine both—federating rarely used sources while replicating high traffic tables.
Practical scaling workflow
- Start with a single node, strong indexes, and tuned queries.
- Add read replicas to handle heavy read workloads.
- Introduce partitioning for large tables (by date, tenant, or id ranges).
- Move to sharding or distributed SQL only when you outgrow single host capacity.
Chapter 28: Modern Extensions and SQL 2023 Features
SQL continues to evolve beyond its original relational roots. Modern standards extend it with richer data types, JSON support, table value constructors, and new syntactic features that make queries more expressive and concise. SQL:2023 builds on earlier versions by standardizing patterns already used across major engines and introducing new capabilities for analytics and JSON handling.
JSON data types and functions
SQL:2023 consolidates JSON handling that many engines already implemented independently. It introduces a formal JSON data type and standardized functions for creation, extraction, and validation. This allows relational and semi structured data to coexist efficiently within the same schema.
Storing JSON values
Many engines (PostgreSQL, MySQL, SQL Server, Oracle) already provide a JSON or JSONB type. SQL:2023 defines a general form:
CREATE TABLE events (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
payload JSON NOT NULL,
received TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO events (payload)
VALUES (JSON_OBJECT('type': 'click', 'user': 'r@example.com', 'x': 120, 'y': 480));
The JSON_OBJECT constructor creates JSON text directly from SQL expressions. The standard defines JSON_ARRAY, JSON_OBJECTAGG, JSON_ARRAYAGG, and more.
Querying JSON fields
Use JSON_VALUE, JSON_QUERY, and JSON_EXISTS to extract or filter by parts of a JSON document.
SELECT JSON_VALUE(payload, '$.user') AS username
FROM events
WHERE JSON_VALUE(payload, '$.type') = 'click';
JSON_VALUE extracts a scalar; JSON_QUERY extracts an object or array; JSON_EXISTS tests for path existence.
Validation and constraints
The standard adds IS JSON for structural validation and defines JSON_SCHEMA_VALIDATION for checking against schema definitions (implementation dependent).
SELECT id
FROM events
WHERE NOT (payload IS JSON VALUE);
PostgreSQL uses -> and ->> operators; MySQL uses JSON_EXTRACT() and ->; SQL Server uses JSON_VALUE(). SQL:2023 standardizes these under unified function names but adoption may differ.
Indexing JSON data
Some engines support functional or path based indexes for fast JSON lookups. For example, PostgreSQL can index computed expressions:
CREATE INDEX ix_events_user
ON events ((payload->>'user'));
Other engines use virtual columns to materialize JSON keys and index them. Standard SQL now allows generated columns based on JSON_VALUE().
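The same idea can be tried out in SQLite through Python, assuming a build with JSON support (standard in modern Python distributions): an expression index over json_extract plays the role of PostgreSQL's index on (payload->>'user').

```python
# Sketch: an expression index over a JSON path in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)")
# index the extracted JSON value so equality lookups can use it
conn.execute("CREATE INDEX ix_events_user ON events (json_extract(payload, '$.user'))")
conn.execute(
    "INSERT INTO events (payload) VALUES (?), (?)",
    ('{"type": "click", "user": "r@example.com"}',
     '{"type": "view", "user": "a@example.com"}'),
)
row = conn.execute(
    "SELECT id FROM events WHERE json_extract(payload, '$.user') = ?",
    ("r@example.com",),
).fetchone()
print(row)
```

The WHERE clause must use the same expression as the index definition for the engine to consider the index.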
Table value constructors
Table value constructors (TVCs) let you create in-memory sets of rows directly within a query. They are part of the SQL standard but expanded in SQL:2023 for better portability and syntax clarity.
Basic inline table construction
SELECT *
FROM (VALUES
(1, 'Alice', 'UK'),
(2, 'Bob', 'US'),
(3, 'Chen', 'CN')
) AS temp(id, name, country);
This creates a virtual table for quick testing or joins without a permanent table. It is especially useful for parameter lists or lookups.
Using VALUES in INSERT
You can use TVCs to insert multiple rows in a single statement:
INSERT INTO products (id, name, price)
VALUES
(1, 'Pen', 1.20),
(2, 'Pencil', 0.80),
(3, 'Notebook', 3.50);
Combining with JOIN and IN clauses
Table constructors integrate naturally with joins and subqueries:
SELECT p.id, p.name, v.country
FROM products p
JOIN (VALUES
('UK'), ('US')
) AS v(country)
ON p.origin = v.country;
Expanded forms in SQL:2023
SQL:2023 clarifies that table constructors can appear in FROM, WITH, and subquery expressions uniformly, and allows use of default column names when omitted. It also standardizes the optional TABLE keyword for readability.
SELECT *
FROM TABLE (VALUES (1, 'X'), (2, 'Y'));
New syntax updates in SQL:2023
The SQL:2023 revision adds multiple clarifications and new constructs across core syntax, JSON, temporal features, and analytic expressions. The main updates include:
Expanded MERGE capabilities
MERGE now supports multiple WHEN MATCHED and WHEN NOT MATCHED clauses with conditional logic. This makes upserts and complex synchronization operations more expressive.
MERGE INTO customers AS c
USING new_data AS n
ON (c.id = n.id)
WHEN MATCHED AND n.status = 'inactive' THEN
UPDATE SET c.active = FALSE
WHEN MATCHED THEN
UPDATE SET c.email = n.email
WHEN NOT MATCHED THEN
INSERT (id, email, active) VALUES (n.id, n.email, TRUE);
Enhanced WINDOW clause
The WINDOW clause can now be reused across multiple analytic functions in the same query, avoiding repetition of PARTITION BY and ORDER BY definitions.
SELECT
region,
SUM(sales) OVER win AS total_sales,
RANK() OVER win AS rank_in_region
FROM sales
WINDOW win AS (PARTITION BY region ORDER BY sales DESC);
Expanded INTERVAL and temporal precision
SQL:2023 refines temporal arithmetic, allowing microsecond precision and standardizing INTERVAL literals across engines.
SELECT CURRENT_TIMESTAMP + INTERVAL '3.5' SECOND AS in_future;
Standardized TRIM_ARRAY, CONTAINS_SUBSTRING, and MULTISET functions
New built-ins extend operations on strings, arrays, and multisets. These provide cross-engine consistency for manipulating structured values.
SELECT TRIM_ARRAY(ARRAY[1, 2, 3, 4], 2) AS shortened, -- removes the last two elements
CONTAINS_SUBSTRING('This is SQL', 'SQL') AS found;
Row pattern recognition refinements
SQL:2023 builds on MATCH_RECOGNIZE (introduced in SQL:2016) to better support event stream analytics and state detection. This allows declarative sequence pattern queries.
SELECT *
FROM trades
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY ts
MEASURES FIRST(A.ts) AS start_time, LAST(C.ts) AS end_time
PATTERN (A B* C)
DEFINE
A AS price < 100,
B AS price BETWEEN 100 AND 110,
C AS price > 110
);
Summary of adoption
| Feature | Introduced | Common support (2025) |
| JSON data type | SQL:2023 | PostgreSQL, MySQL, SQL Server, Oracle (variant syntax) |
| Table value constructors | SQL:99 – expanded in 2023 | All major engines |
| Enhanced MERGE | SQL:2023 | Partial support (SQL Server, Oracle) |
| Reusable WINDOW clause | SQL:2023 | PostgreSQL, Oracle |
| Row pattern recognition | SQL:2016 – refined 2023 | Oracle, Snowflake, Trino (partial) |
SQL remains a living standard. Each release brings the relational model closer to a unified language for both structured and semi structured data, capable of scaling from small embedded systems to distributed analytical clusters.
Chapter 29: SQL and Analytics
SQL is not only a language for defining and modifying data; it is also a powerful tool for analysis and reporting. Its aggregation, windowing, and statistical functions can produce detailed insights directly within the database, reducing the need for external processing. This chapter explores how to use SQL for analytical workloads and integrate it into business intelligence pipelines.
Aggregate and window functions for reporting
Aggregate functions summarize multiple rows into single results. Window functions extend this idea by allowing aggregates over specific partitions of data while still returning individual rows. Together, they form the backbone of reporting queries.
Common aggregate functions
| Function | Description |
| COUNT() | Counts rows or non null values |
| SUM() | Totals numeric values |
| AVG() | Computes average value |
| MIN() / MAX() | Finds smallest or largest value |
| GROUP_CONCAT() / STRING_AGG() | Concatenates values within a group (syntax varies) |
SELECT department, COUNT(*) AS staff, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC;
Aggregates collapse multiple rows into summaries per group. You can add HAVING to filter results after aggregation.
SELECT department
FROM employees
GROUP BY department
HAVING AVG(salary) > 70000;
Window functions for running totals and rankings
Window functions compute values across a related set of rows, defined by OVER clauses, without collapsing the result set.
SELECT
department,
employee_id,
salary,
SUM(salary) OVER (PARTITION BY department) AS total_in_dept,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank_in_dept
FROM employees;
This calculates per department totals and ranks each employee by salary while still listing individual rows.
Running totals and moving averages
SELECT
sale_date,
SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
AVG(amount) OVER (ORDER BY sale_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg
FROM sales;
These patterns are essential for time series and financial analysis, where each new row builds on previous ones.
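The running-total pattern can be verified end to end with SQLite's window functions (available in SQLite 3.25 and later, as bundled with recent Python releases); the sample rows are invented:

```python
# Sketch: cumulative running total via a window function in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2025-01-01", 10.0), ("2025-01-02", 20.0), ("2025-01-03", 5.0)],
)
rows = conn.execute("""
    SELECT sale_date,
           SUM(amount) OVER (ORDER BY sale_date
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
    FROM sales
    ORDER BY sale_date
""").fetchall()
print(rows)  # each row carries the cumulative total up to that date
```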
Avoid relying on ORDER BY without an explicit frame clause for large data sets. Explicit frames like ROWS BETWEEN improve predictability and may optimize better in some engines.
Combining aggregates and windows
Both types of functions can appear in the same query, allowing summaries and details together:
SELECT
region,
store_id,
SUM(sales) AS store_total,
SUM(SUM(sales)) OVER (PARTITION BY region) AS region_total
FROM sales
GROUP BY region, store_id;
Here, the grouped result provides store_total, and the window computes region_total across grouped rows.
Using SQL for data analysis
SQL can perform most classical analytical steps: filtering, aggregation, correlation, and cohort analysis. When datasets fit within the database, these queries can deliver results faster than exporting to external tools.
Descriptive statistics
Basic statistical summaries can be computed directly in SQL:
SELECT
AVG(amount) AS mean,
STDDEV(amount) AS stddev,
MIN(amount) AS min,
MAX(amount) AS max,
COUNT(*) AS samples
FROM orders;
Some engines provide VARIANCE(), MEDIAN(), and PERCENTILE_CONT() for more advanced measures.
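When the engine lacks these functions entirely (SQLite, for example, has no built-in MEDIAN or PERCENTILE_CONT), one fallback is to pull the values and compute the statistic client side. A minimal sketch with invented data:

```python
# Sketch: computing a median in the application when the engine lacks MEDIAN().
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?)", [(10,), (20,), (30,), (40,), (100,)])

amounts = [a for (a,) in conn.execute("SELECT amount FROM orders")]
median = statistics.median(amounts)
print(median)  # 30
```

This only works while the result set fits in memory; for very large tables, prefer an engine-side percentile function or an approximate algorithm.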
Cohort and retention analysis
Cohort analysis groups users or transactions by their first event date and measures retention or performance over time.
WITH first_use AS (
SELECT user_id, MIN(order_date) AS cohort_date
FROM orders
GROUP BY user_id
)
SELECT
f.cohort_date,
DATE_TRUNC('month', o.order_date) AS active_month,
COUNT(DISTINCT o.user_id) AS active_users
FROM orders o
JOIN first_use f ON o.user_id = f.user_id
GROUP BY f.cohort_date, active_month
ORDER BY f.cohort_date, active_month;
This pattern reveals how many users return after their initial period. It is widely used in product analytics.
Correlation and outlier detection
The SQL standard defines COVAR_POP(), COVAR_SAMP(), CORR(), and REGR_SLOPE() to measure relationships between columns.
SELECT
CORR(height, weight) AS correlation,
REGR_SLOPE(weight, height) AS slope
FROM measurements;
Outliers can be flagged by comparing values to standard deviation ranges:
SELECT id, amount
FROM sales
WHERE amount > (
SELECT AVG(amount) + 3 * STDDEV(amount)
FROM sales
);
Pivoting and unpivoting for summaries
Pivoting turns row data into columns, while unpivoting does the reverse. PIVOT and UNPIVOT constructs are engine extensions (for example SQL Server, Oracle, and Snowflake) rather than universal syntax, so check your engine's support.
SELECT *
FROM sales
PIVOT (
SUM(amount) FOR region IN ('EU', 'US', 'APAC')
) AS p;
This produces one row per grouping key and a column per region. Engines without PIVOT can emulate it with conditional aggregation:
SELECT
product,
SUM(CASE WHEN region = 'EU' THEN amount END) AS eu_sales,
SUM(CASE WHEN region = 'US' THEN amount END) AS us_sales,
SUM(CASE WHEN region = 'APAC' THEN amount END) AS apac_sales
FROM sales
GROUP BY product;
Using SQL for BI integration
Modern business intelligence (BI) tools rely on SQL to extract and visualize insights. Whether connected through live queries or cached extracts, SQL defines the data model for dashboards and analytics layers.
Connecting BI tools
BI platforms such as Power BI, Tableau, Looker, and Metabase connect through standard protocols (ODBC, JDBC, or native drivers). They send SQL queries to the database and render results as charts or tables.
-- Example: a BI tool may issue
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region
ORDER BY total_sales DESC;
Views are ideal interfaces for BI tools, encapsulating business logic and simplifying joins. Expose only the necessary columns and apply row-level permissions where needed.
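Some engines can enforce row-level permissions in the database itself. This PostgreSQL-specific sketch assumes an orders table with a region column and an application-set session variable named app.current_region (both illustrative):

```sql
-- PostgreSQL row-level security sketch; names are assumptions
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

-- Each session only sees rows for the region the application sets
CREATE POLICY region_access ON orders
    FOR SELECT
    USING (region = current_setting('app.current_region'));
```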
Creating analytical views
CREATE VIEW v_sales_summary AS
SELECT
region,
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS total_sales,
COUNT(DISTINCT customer_id) AS customers
FROM orders
GROUP BY region, month;
Analytical views provide a stable surface for BI dashboards to query without risk of schema changes breaking charts.
Caching and extracts
Many BI systems store precomputed extracts for performance. Use scheduled jobs or materialized views to populate them efficiently.
CREATE MATERIALIZED VIEW mv_sales_monthly AS
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS total
FROM orders
GROUP BY month;
Refresh materialized views regularly, or trigger updates based on data change events.
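In PostgreSQL, for example, a refresh is a single statement. The CONCURRENTLY option avoids blocking readers during the rebuild, but requires a unique index on the view (not defined above, so it is an assumption here):

```sql
-- Rebuild the extract without locking out dashboard queries;
-- plain REFRESH MATERIALIZED VIEW works without the unique index
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_sales_monthly;
```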
Data modeling for analytics
Adopt a star schema or snowflake schema to make analytical queries simpler and faster. Separate fact tables (transactions) from dimension tables (descriptive attributes).
-- Simplified star schema
FACT_SALES(order_id, date_id, product_id, amount)
DIM_DATE(date_id, date, month, year)
DIM_PRODUCT(product_id, category, brand)
Analysts can then write concise joins for reports:
SELECT
d.month,
p.category,
SUM(f.amount) AS total_sales
FROM fact_sales f
JOIN dim_date d ON f.date_id = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY d.month, p.category;
Automating analytics pipelines
Use stored procedures, scheduled jobs, or external orchestrators (for example Airflow or dbt) to automate analysis refreshes and report generation.
-- Example task scheduling pseudo SQL
CALL refresh_materialized_view('mv_sales_monthly');
By combining SQL’s analytical power with BI integration, you can build self-updating dashboards, real-time metrics, and consistent data models across your organization.
Chapter 30: Bringing It All Together
This final chapter combines all the principles covered so far into a coherent picture of how SQL fits into real projects. You will see how to design a complete database, write efficient queries for practical problems, explore advanced techniques, compare SQL with NoSQL approaches, and understand SQL’s role in the wider field of data engineering.
Designing and building a complete sample database
Let’s consider an example project for an online retail platform. The database must support customers, products, orders, payments, and inventory tracking. We will use a normalized relational schema, designed for clarity and integrity.
Schema overview
CREATE TABLE customers (
customer_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name VARCHAR(200) NOT NULL,
email VARCHAR(200) UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE products (
product_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name VARCHAR(200) NOT NULL,
category VARCHAR(100),
price NUMERIC(10,2) NOT NULL,
stock_quantity INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE orders (
order_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
customer_id BIGINT NOT NULL,
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
total_amount NUMERIC(10,2),
CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id)
REFERENCES customers(customer_id)
);
CREATE TABLE order_items (
order_id BIGINT NOT NULL,
product_id BIGINT NOT NULL,
quantity INTEGER NOT NULL,
price_each NUMERIC(10,2) NOT NULL,
CONSTRAINT pk_order_items PRIMARY KEY (order_id, product_id),
CONSTRAINT fk_oi_order FOREIGN KEY (order_id) REFERENCES orders(order_id),
CONSTRAINT fk_oi_product FOREIGN KEY (product_id) REFERENCES products(product_id)
);
CREATE TABLE payments (
payment_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
order_id BIGINT NOT NULL,
amount NUMERIC(10,2) NOT NULL,
method VARCHAR(40),
paid_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT fk_payments_order FOREIGN KEY (order_id)
REFERENCES orders(order_id)
);
This schema implements a one-to-many relationship between customers and orders, and a many-to-many relationship between orders and products through order_items.
Declare FOREIGN KEY constraints and keep indexes on these columns for performance. Missing indexes are a common cause of slow joins.
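The foreign-key columns above are natural index candidates; the index names below are illustrative:

```sql
-- Speed up customer-to-order and payment-to-order joins
CREATE INDEX idx_orders_customer ON orders (customer_id);
CREATE INDEX idx_payments_order ON payments (order_id);

-- The order_items primary key (order_id, product_id) already serves
-- lookups by order_id, so only product_id needs its own index
CREATE INDEX idx_oi_product ON order_items (product_id);
```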
Populating the database
INSERT INTO customers (name, email)
VALUES
('Alice', 'alice@example.com'),
('Bob', 'bob@example.com');
INSERT INTO products (name, category, price, stock_quantity)
VALUES
('Laptop', 'Electronics', 899.00, 20),
('Headphones', 'Accessories', 49.99, 100),
('Mouse', 'Accessories', 25.50, 150);
Use INSERT statements, CSV imports, or bulk loading tools depending on volume. After populating tables, test the schema by joining related data.
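For larger volumes, bulk loaders are far faster than row-by-row inserts. This is PostgreSQL's COPY syntax; the file path is illustrative, and other engines use their own commands (MySQL's LOAD DATA, SQL Server's BULK INSERT):

```sql
-- PostgreSQL bulk load from a CSV file with a header row
COPY products (name, category, price, stock_quantity)
FROM '/tmp/products.csv'
WITH (FORMAT csv, HEADER true);
```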
SELECT c.name, o.order_id, p.name AS product, oi.quantity, oi.price_each
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id;
Real-world query examples and optimization
With the schema in place, you can write reporting and analytical queries that mirror business needs. Optimization comes from using indexes, caching, and avoiding unnecessary scans.
Sales by customer and category
SELECT
c.name AS customer,
p.category,
SUM(oi.quantity * oi.price_each) AS total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY c.name, p.category
ORDER BY total_spent DESC;
Identifying top selling products
SELECT
p.name,
SUM(oi.quantity) AS total_sold
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.name
ORDER BY total_sold DESC
LIMIT 10;
Using execution plans
Most engines support an EXPLAIN or EXPLAIN ANALYZE statement to view how a query executes. Look for full table scans, missing indexes, and join order inefficiencies.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 12345;
Run ANALYZE or engine-specific maintenance commands regularly so the planner has accurate data-distribution estimates.
Caching and indexing
Frequently accessed data should be cached at the application layer, or stored in materialized views. Index columns used in JOIN, WHERE, and ORDER BY clauses. Avoid over-indexing; each index slows down inserts and updates.
Advanced SQL
Advanced SQL techniques extend beyond simple querying. They include recursive queries, common table expressions, and analytical window functions that enable complex data transformations.
Recursive CTE for hierarchies
WITH RECURSIVE category_tree AS (
SELECT category_id, parent_id, name
FROM categories
WHERE parent_id IS NULL
UNION ALL
SELECT c.category_id, c.parent_id, c.name
FROM categories c
JOIN category_tree p ON c.parent_id = p.category_id
)
SELECT * FROM category_tree;
This pattern handles hierarchical data like departments, categories, or organizational charts.
Advanced analytics with window functions
SELECT
customer_id,
order_date,
SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM orders;
Window functions let you build progressive totals or rankings directly in SQL, reducing the need for procedural logic.
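Rankings follow the same OVER pattern. For example, ranking each customer's orders by value against the schema defined earlier in this chapter:

```sql
-- Rank orders within each customer, highest value first
SELECT
    customer_id,
    order_id,
    total_amount,
    RANK() OVER (PARTITION BY customer_id
                 ORDER BY total_amount DESC) AS value_rank
FROM orders;
```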
Combining SQL with procedural code
Stored procedures, functions, and triggers allow you to embed logic that runs inside the database. Use them for validation, complex updates, or background processing.
CREATE OR REPLACE FUNCTION update_stock()
RETURNS TRIGGER AS $$
BEGIN
UPDATE products
SET stock_quantity = stock_quantity - NEW.quantity
WHERE product_id = NEW.product_id;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER after_order_item_insert
AFTER INSERT ON order_items
FOR EACH ROW EXECUTE FUNCTION update_stock();
NoSQL contrasts
NoSQL databases evolved to handle unstructured or highly variable data where strict relational modeling may not fit. They trade strong consistency and complex joins for scalability and flexibility. Common types include document stores, key-value stores, column-family stores, and graph databases.
Document stores
Document databases (such as MongoDB) store JSON-like objects. Each record is self-contained, with nested arrays and objects replacing foreign keys. This simplifies reads but duplicates data.
{
"customer": "Alice",
"orders": [
{ "product": "Laptop", "quantity": 1, "price": 899.00 },
{ "product": "Mouse", "quantity": 2, "price": 25.50 }
]
}
Key value and column stores
Key-value databases (like Redis) and wide-column stores (like Cassandra) are optimized for rapid lookups and high-volume writes. They require explicit design around query access patterns instead of flexible joins.
Graph databases
Graph databases model entities as nodes and relationships as edges. They use query languages such as Cypher or Gremlin for path traversal and relationship queries. SQL extensions like SQL/PGQ (in SQL:2023) bridge some of these ideas back into relational databases.
MATCH (p1:Person)-[:FRIEND_OF]->(p2:Person)
WHERE p1.name = 'Alice'
RETURN p2.name;
SQL and NoSQL convergence
Many modern engines blur the line between SQL and NoSQL. PostgreSQL supports JSONB and key value operations; MySQL and SQL Server handle document columns; and distributed SQL platforms integrate both structured and unstructured models under one query engine.
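In PostgreSQL, for instance, a relational table can store and query document data directly. A sketch (the events table and payload contents are illustrative):

```sql
-- A JSONB column alongside ordinary relational columns
CREATE TABLE events (
    event_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    payload  JSONB NOT NULL
);

-- ->> extracts a field as text; @> tests JSON containment
SELECT payload->>'customer' AS customer
FROM events
WHERE payload @> '{"type": "order"}';
```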
Data engineering
Data engineering connects storage, transformation, and analytics into a complete data pipeline. SQL is at the core of this ecosystem, serving as the language of ETL, warehousing, and analytics orchestration.
ETL and ELT patterns
Traditional ETL (extract, transform, load) performs transformations before loading data into a warehouse. Modern systems often use ELT (extract, load, transform), loading raw data first and transforming it with SQL inside the database.
-- Example transformation step
INSERT INTO clean.orders
SELECT
order_id,
customer_id,
CAST(order_total AS NUMERIC(10,2)) AS amount,
order_date::DATE
FROM raw.orders_raw
WHERE order_total IS NOT NULL;
Data pipelines and orchestration
Tools like Apache Airflow, dbt, or Dagster run SQL transformations as scheduled tasks, coordinating dependencies between stages.
dbt run --models staging+ # Example command
Each transformation becomes a versioned SQL file. Tests validate row counts, null constraints, and referential links automatically.
Data warehouses and lakes
Data warehouses (Snowflake, BigQuery, Redshift) use columnar storage optimized for analytics. Data lakes extend this to semi-structured data formats like Parquet and JSON. SQL remains the unifying query interface across both layers.
Streaming and real-time analytics
Streaming SQL engines (Flink SQL, Materialize, ksqlDB) process continuous event streams using familiar syntax.
SELECT region, COUNT(*) FROM orders_stream
WINDOW TUMBLE (INTERVAL '5' MINUTE)
GROUP BY region;
This model enables near real-time dashboards and alerts powered by SQL queries on live data.
The unified data ecosystem
Modern data engineering treats SQL as the universal language across OLTP, OLAP, and streaming systems. Whether data is structured, semi-structured, or distributed, SQL unites diverse tools under a shared analytical vocabulary.
By now, you have seen how SQL evolved from simple relational tables into a full scale analytical and operational language. Its clarity, portability, and power ensure its place at the heart of modern computing for decades to come.
© 2025 Robin Nixon. All rights reserved
No content may be re-used, sold, given away, or used for training AI without express permission