Chapter 1: What is SQL?
Structured Query Language (SQL) is the standard language for describing, querying, and changing data in relational databases. It is declarative, which means you describe the result you want and the database engine decides how to produce it. With SQL you can define tables, state relationships, add and modify rows, and ask questions that combine and filter many rows at once. The goal is consistent, reliable data that many applications can share.
This chapter introduces the purpose of relational databases, the difference between declarative and procedural thinking, and how SQL fits inside a typical application stack. You will see the core ideas that guide everything that follows in later chapters.
In SQL you describe a result set; the engine chooses an efficient plan.
Throughout the book you will see code blocks and small schemas. Names, sizes, and data are examples; real projects will have different columns and constraints.
Purpose and scope of relational databases
Relational databases store data in tables that relate to each other by keys. Each table models a single subject (for example, customers or orders). Rows are individual records. Columns are attributes with fixed types. Keys and constraints preserve correctness, while transactions keep changes reliable even when many users work at the same time.
Relational basics
A table has a defined set of columns with types and constraints. A primary key uniquely identifies each row. Foreign keys relate rows between tables. Queries combine tables by matching key values and return a new table as the result set.
-- A tiny, portable demo schema
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name TEXT NOT NULL
);
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER NOT NULL,
total NUMERIC NOT NULL,
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
With this shape you can ask questions such as which customers placed orders, totals per customer, and which orders match given criteria.
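One such question, totals per customer, can be asked with a single declarative statement. The following sketch builds the demo schema above in an in-memory SQLite database using Python's standard sqlite3 module; the customer names and order amounts are invented for illustration.

```python
import sqlite3

# In-memory database loaded with the demo schema from this section.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    total NUMERIC NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Lin');
INSERT INTO orders VALUES (10, 1, 20), (11, 1, 35), (12, 2, 15);
""")

# Totals per customer: one statement, no explicit loops.
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Ada', 55), ('Lin', 15)]
```

The GROUP BY clause collapses each customer's orders into one row, and the engine decides how to match and aggregate them.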
Integrity and transactions
Constraints prevent invalid data from entering the system. Transactions group statements so they succeed or fail together. This produces consistent state, even with concurrent sessions and failures.
BEGIN;
INSERT INTO orders (order_id, customer_id, total)
VALUES (1001, 1, 49.99);
UPDATE customers
SET name = 'Robin N.'
WHERE customer_id = 1;
COMMIT; -- Or ROLLBACK; to undo the whole unit
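The all-or-nothing behavior of a transaction can be observed from application code. This sketch uses Python's built-in sqlite3 module; the duplicate-key INSERT is a deliberately invalid statement added to force a rollback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customers VALUES (1, 'Robin')")
conn.commit()

# Group two changes into one unit; an error anywhere undoes both.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE customers SET name = 'Robin N.' WHERE customer_id = 1")
        conn.execute("INSERT INTO customers VALUES (1, 'Duplicate')")  # violates PRIMARY KEY
except sqlite3.IntegrityError:
    pass

# The UPDATE was rolled back along with the failed INSERT.
name = conn.execute("SELECT name FROM customers WHERE customer_id = 1").fetchone()[0]
print(name)  # Robin
```

Because the failed INSERT aborted the unit, the earlier UPDATE in the same transaction never became visible.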
Declarative vs procedural thinking
Procedural code tells the computer step by step how to get a result. Declarative SQL states the result shape (columns, filters, groupings) and leaves the steps to the optimizer. Thinking in sets, rather than loops, is the key shift.
Set thinking in practice
Instead of looping through rows and testing conditions, write one statement that operates on the whole set. The database uses indexes and joins to find matches efficiently.
-- Declarative: return names of customers with orders over 50
SELECT c.name
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
WHERE o.total > 50;
-- Procedural mindset (illustrative pseudocode)
for customer in customers:
    for order in orders:
        if order.customer_id == customer.customer_id and order.total > 50:
            emit(customer.name)
Both describe the same intent. The declarative version lets the engine choose an efficient plan.
The role of the optimizer
The optimizer examines your SELECT statement, available indexes, statistics, and constraints. It then chooses a strategy (join order, join method, index use) that should run quickly for the current data. You express filters and joins; the engine works out the path.
Result sets have no guaranteed row order unless you specify ORDER BY. If order matters, always include ORDER BY with explicit columns.
How SQL fits into application stacks
Applications call the database through a driver or library. The app sends statements such as SELECT, INSERT, UPDATE, and DELETE over a connection. The database parses and plans the statement, executes it, and returns a result set or a status.
Typical request flow
A common path is: UI or service layer → application code → database driver → database server → result set → application. Connection pooling keeps a small number of open connections that many requests can share.
-- Parameterized query (safe pattern)
SELECT name, total
FROM orders_view
WHERE customer_id = :customer_id
AND total > :min_total;
Parameters avoid string concatenation and reduce the risk of injection. The driver binds values before execution.
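A minimal sketch of this pattern, using Python's standard sqlite3 module: named parameters (`:customer_id`, `:min_total`) are bound by the driver rather than spliced into the SQL text. A plain orders table stands in for the book's orders_view, and its contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    total NUMERIC NOT NULL
);
INSERT INTO orders VALUES (1, 7, 'Ada', 80), (2, 7, 'Ada', 20), (3, 8, 'Lin', 90);
""")

# The driver binds the values; no string concatenation, no injection risk.
params = {"customer_id": 7, "min_total": 50}
rows = conn.execute(
    "SELECT name, total FROM orders "
    "WHERE customer_id = :customer_id AND total > :min_total",
    params,
).fetchall()
print(rows)  # [('Ada', 80)]
```

The same statement can be reused with different parameter values, which also lets the database cache its execution plan.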
Data definition and data manipulation
SQL statements fall into broad families. DDL (data definition) creates or changes schema objects. DML (data manipulation) reads or changes row data. DCL (access control) grants privileges. TCL (transaction control) manages COMMIT and ROLLBACK. You will use all of these across real projects.
With these foundations in place, you can begin to write clear, portable SQL that supports reliable applications at any scale.
Chapter 2: A Brief History of SQL
SQL was born from the idea that data could be managed with a mathematical model rather than through custom file formats or procedural code. In the early 1970s, researchers at IBM applied the relational model (first proposed by Dr. Edgar F. Codd) to practical database systems. They created a language called SEQUEL (Structured English Query Language) to express operations on relational data. It soon evolved into SQL after a trademark issue forced a name change, but the core concept stayed the same: describe what you want, not how to get it.
The key to SQL’s success was abstraction: it hides physical storage and indexing details so that users can think in terms of relations, rows, and columns rather than file structures.
Over the following decades, SQL became the foundation of nearly every relational database system. It went through several standardization phases and vendor interpretations, but its declarative nature and strong theoretical base made it remarkably durable. Even with the rise of NoSQL and newer paradigms, SQL remains the most widely used database language in the world.
From SEQUEL to SQL-92/99/2003/2016/2023
The evolution of SQL can be traced through its major standard releases, each adding features to reflect the changing needs of applications and hardware over time.
Early development (1970s–1980s)
IBM’s first prototypes, System R and SQL/DS, demonstrated that the relational model could outperform file-based systems in flexibility and data integrity. In 1979, Oracle released the first commercial implementation of SQL, followed by IBM’s DB2 and other systems throughout the 1980s. During this period, the core commands (SELECT, INSERT, UPDATE, DELETE, and CREATE) were firmly established.
SQL-86 and SQL-89
The first official standard, SQL-86, was published by ANSI and ISO to provide a common foundation. SQL-89 refined syntax and clarified behaviors but added little new functionality. These versions unified the core vocabulary and helped different vendors align basic behavior.
SQL-92 (SQL2)
SQL-92 marked a huge expansion. It added set operations, CASE expressions, more join types, schema definitions, and stricter data typing rules. Compliance levels (Entry, Intermediate, and Full) were introduced so vendors could claim partial compatibility while adding proprietary features.
SQL:1999 (SQL3)
This version introduced procedural and object-oriented extensions. Features such as recursive queries (WITH RECURSIVE), triggers, user-defined types, and large objects (BLOB, CLOB) made SQL capable of expressing more complex business logic directly in the database. It also introduced BOOLEAN types and regular expression-like pattern matching.
SQL:2003 to SQL:2016
SQL:2003 added XML integration, MERGE statements, and the window function framework. Later standards (SQL:2008, SQL:2011, SQL:2016) refined analytic functions, introduced temporal tables for tracking changes over time, and standardized JSON operations.
SQL:2023 and beyond
The most recent updates include further JSON handling, improvements to polymorphic table functions, and clarifications for modern distributed and cloud databases. Despite decades of growth, the core philosophy remains intact: maintain data integrity, express queries declaratively, and ensure forward compatibility.
ANSI and ISO standards
SQL is governed by two standard bodies. The American National Standards Institute (ANSI) maintains the U.S. standard, and the International Organization for Standardization (ISO) manages the global equivalent. In practice, the two collaborate, and the published standard documents are nearly identical. Each revision is identified by the release year, for example, SQL:2016 or SQL:2023.
Structure of the standard
The standard is divided into numbered parts, each covering an area such as core language, data definition, triggers, or external bindings. The complete text runs to thousands of pages and defines precise behavior for every keyword and clause.
Levels of conformance
Vendors can claim conformance to different levels of the standard. Core conformance covers basic SELECT, INSERT, UPDATE, and DELETE. Higher levels add support for advanced types, recursive queries, and temporal data. This layered structure encourages adoption without forcing every feature at once.
Vendor dialects and cross-compatibility
Although all major relational databases implement SQL, each adds its own syntax extensions, data types, and functions. These variations (often called dialects) arose from competition and innovation. For example, LIMIT in MySQL and PostgreSQL corresponds to TOP in SQL Server, while Oracle uses FETCH FIRST n ROWS ONLY.
Common dialect families
| Vendor | Common Extension Example |
| MySQL | AUTO_INCREMENT columns |
| PostgreSQL | SERIAL and RETURNING clauses |
| SQL Server | TOP n in SELECT |
| Oracle | DUAL table and ROWNUM filtering |
| SQLite | Dynamic typing with relaxed constraints |
Portable practices
To keep your code portable, rely on standard features unless a vendor-specific function is essential. Always consult documentation for type names, string concatenation operators, and date functions, which often differ between engines.
Why dialects persist
Vendors compete by offering extra capabilities such as procedural extensions, optimized data types, integration with cloud services, and special indexes. These differences drive innovation, and many features later migrate into future SQL standards. The result is a rich but sometimes inconsistent ecosystem.
Understanding the shared foundations and differences among dialects helps you write clear, portable queries while still taking advantage of each system’s strengths.
Chapter 3: Relational Database Concepts
At the heart of every relational database lies a simple yet powerful model: data organized into related tables. Each table represents one subject; each row represents one record; and relationships between tables preserve logical connections. This chapter introduces the essential components of the relational model (tables, keys, normalization, constraints, and referential integrity) that make relational databases consistent and reliable.
Tables, rows, columns, keys, and relationships
Data in a relational database is stored in tables. Each table consists of rows (records) and columns (fields). Rows hold the actual data values, while columns define the data’s structure and constraints. A table’s design defines what kind of data is allowed and how it connects to other tables.
Tables and columns
Every table should describe one type of entity. Columns describe attributes of that entity, such as a person’s name or an order’s date. Each column has a data type that restricts the kind of values it can hold.
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
email TEXT UNIQUE,
joined_on DATE DEFAULT CURRENT_DATE
);
This table represents customers. Each column has a purpose: customer_id uniquely identifies the row, name is required, email must be unique, and joined_on defaults to the current date.
Rows and relationships
Each row is a single record, and relationships connect rows in one table to rows in another. By linking tables with keys, you can query data across multiple subjects without duplicating information.
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER NOT NULL,
order_date DATE NOT NULL,
total NUMERIC NOT NULL,
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
Here, orders refers to customers using customer_id. The relationship is one-to-many: one customer can have many orders, but each order belongs to a single customer.
Normalization and denormalization
Normalization is the process of organizing data to reduce redundancy and improve integrity. It divides large, repetitive tables into smaller, related ones. Each table focuses on a single subject, and relationships connect them. The goal is to avoid anomalies when inserting, updating, or deleting data.
Normal forms
Several “normal forms” define stages of improvement. Each form builds upon the previous one. The first three are most commonly applied in practical design.
- First Normal Form (1NF): Each column holds atomic (indivisible) values, and each row is unique.
- Second Normal Form (2NF): Every non-key column depends on the entire primary key, not just part of it.
- Third Normal Form (3NF): No non-key column depends on another non-key column.
-- Bad design: repeated address fields cause redundancy
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name TEXT,
dept_name TEXT,
dept_addr TEXT
);
-- Better design: separate departments into their own table
CREATE TABLE departments (
dept_id INTEGER PRIMARY KEY,
dept_name TEXT,
dept_addr TEXT
);
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name TEXT,
dept_id INTEGER REFERENCES departments(dept_id)
);
Denormalization
Denormalization deliberately combines related data for performance or simplicity. It reduces joins by storing precomputed or redundant information. While this can speed up queries, it increases the chance of inconsistencies if data changes in one place but not another.
For example, an orders table might include a customer_name column to avoid a join in frequent reports. If a customer changes their name, all affected orders must be updated to keep data correct.
Primary vs foreign keys
Keys are special columns (or sets of columns) that define relationships and uniqueness. The primary key uniquely identifies each row in its table. A foreign key references a primary key in another table, establishing a link between the two.
Primary keys
A primary key ensures that each record is distinct. It can be a single column or a combination. A good primary key never changes.
CREATE TABLE products (
product_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
price NUMERIC
);
Each product has a unique product_id. The database will reject any attempt to insert a duplicate.
Foreign keys
A foreign key defines a relationship between two tables by referencing another table’s primary key. It enforces consistency between related records.
CREATE TABLE order_items (
order_id INTEGER NOT NULL,
product_id INTEGER NOT NULL,
quantity INTEGER NOT NULL,
FOREIGN KEY (order_id) REFERENCES orders(order_id),
FOREIGN KEY (product_id) REFERENCES products(product_id),
PRIMARY KEY (order_id, product_id)
);
Here, each item belongs to both an order and a product. The composite key (order_id, product_id) prevents duplicate entries for the same product in the same order.
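The duplicate-prevention behavior of the composite key can be demonstrated directly. This sketch uses Python's standard sqlite3 module and omits the foreign keys so it stays self-contained; the IDs and quantities are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE order_items (
    order_id INTEGER NOT NULL,
    product_id INTEGER NOT NULL,
    quantity INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
)
""")
conn.execute("INSERT INTO order_items VALUES (1, 100, 2)")
conn.execute("INSERT INTO order_items VALUES (1, 101, 1)")  # same order, different product: allowed

# Re-inserting the same (order_id, product_id) pair violates the composite key.
try:
    conn.execute("INSERT INTO order_items VALUES (1, 100, 5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```

Only the pair as a whole must be unique; either column alone may repeat freely.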
Constraints
Constraints are rules that protect data integrity. They define what values are allowed, ensure required data is present, and maintain logical consistency. Most databases enforce constraints automatically at insert or update time.
Common constraint types
- PRIMARY KEY: Ensures unique identification of each row.
- FOREIGN KEY: Links tables and enforces referential integrity.
- UNIQUE: Prevents duplicate values in a column.
- NOT NULL: Requires a value for the column.
- CHECK: Validates data against a condition.
- DEFAULT: Supplies a fallback value when none is provided.
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
salary NUMERIC CHECK (salary >= 0),
hired_on DATE DEFAULT CURRENT_DATE
);
If you try to insert a negative salary or omit the name, the database will reject the statement.
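Both rejections can be observed from a driver. This sketch uses Python's standard sqlite3 module against the employees table above; the names and salaries are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE employees (
    emp_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    salary NUMERIC CHECK (salary >= 0)
)
""")
conn.execute("INSERT INTO employees VALUES (1, 'Ada', 52000)")  # valid row

# A negative salary violates the CHECK constraint.
try:
    conn.execute("INSERT INTO employees VALUES (2, 'Bob', -100)")
    check_rejected = False
except sqlite3.IntegrityError:
    check_rejected = True

# Omitting the name violates NOT NULL.
try:
    conn.execute("INSERT INTO employees (emp_id, salary) VALUES (3, 50)")
    null_rejected = False
except sqlite3.IntegrityError:
    null_rejected = True

print(check_rejected, null_rejected)  # True True
```

In each case the entire statement fails and the table is left unchanged, so only the one valid row remains.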
Deferrable and cascading constraints
Some databases allow constraints to be deferred until transaction commit. This lets you perform several related updates that temporarily violate rules but end up consistent when the transaction completes. Cascading actions automatically apply updates or deletions to related rows.
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id)
ON UPDATE CASCADE
ON DELETE SET NULL
);
If a customer’s ID changes, related orders update automatically. If the customer is deleted, their orders remain but the customer_id becomes NULL.
Referential integrity
Referential integrity ensures that relationships between tables remain valid. Every foreign key must match an existing primary key or be NULL. This rule prevents orphaned records and maintains consistency across tables.
How it works
When you insert or update data, the database checks foreign keys to confirm that referenced rows exist. When you delete data, it checks for dependent rows that might be affected.
-- Insert succeeds only if the customer_id exists
INSERT INTO orders (order_id, customer_id, order_date, total)
VALUES (2001, 1, '2025-11-02', 199.99);
Cascading effects
To preserve referential integrity automatically, most systems support cascading actions. You can specify what should happen when a referenced row is changed or deleted:
- ON UPDATE CASCADE – Update foreign keys when the parent key changes.
- ON DELETE CASCADE – Delete dependent rows when the parent is deleted.
- ON DELETE SET NULL – Set the foreign key to NULL if the parent is deleted.
- ON DELETE RESTRICT – Prevent deletion if dependent rows exist.
CREATE TABLE order_items (
order_id INTEGER REFERENCES orders(order_id) ON DELETE CASCADE,
product_id INTEGER REFERENCES products(product_id),
quantity INTEGER NOT NULL
);
Use caution with ON DELETE CASCADE. It can remove large amounts of data unintentionally if parent rows are deleted without proper checks.
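The cascade can be watched in action. This sketch uses Python's standard sqlite3 module with a pared-down version of the schema above; note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`, and the IDs are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when this is enabled
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY);
CREATE TABLE order_items (
    order_id INTEGER REFERENCES orders(order_id) ON DELETE CASCADE,
    product_id INTEGER,
    quantity INTEGER NOT NULL
);
INSERT INTO orders VALUES (1);
INSERT INTO order_items VALUES (1, 100, 2), (1, 101, 1);
""")

# Deleting the parent order removes its items automatically.
conn.execute("DELETE FROM orders WHERE order_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM order_items").fetchone()[0]
print(remaining)  # 0
```

One DELETE on the parent silently removed both child rows, which is exactly why the warning above applies.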
Maintaining integrity over time
Referential integrity protects against data drift as applications evolve. Even if multiple programs or services write to the same database, properly defined keys and constraints ensure that the structure remains sound.
The principles of tables, keys, constraints, and relationships form the backbone of all relational databases. Mastering them will make every future SQL statement more reliable and every schema easier to maintain.
Chapter 4: Setting Up Your Environment
Before you can start writing and running SQL statements, you need a working database environment. Most systems today use one of a few major engines (MySQL, PostgreSQL, SQLite, or SQL Server) and all of them can be installed quickly on a local machine. This chapter walks through installing a database, connecting from both command-line and graphical tools, and creating your first database and table.
You do not need powerful hardware to learn SQL. Even a small laptop can host lightweight databases such as SQLite or a Docker-based PostgreSQL image.
Installing and connecting to a server
Every database system provides its own installer or package. Choose one that matches your platform and preferred workflow. MySQL and PostgreSQL run as background services, while SQLite is an embedded library that stores data in a single file.
Installing MySQL
MySQL can be downloaded from the official Oracle distribution or installed with a package manager:
# Ubuntu or Debian
sudo apt install mysql-server
# macOS (Homebrew)
brew install mysql
Once installed, start the service and connect to it:
sudo service mysql start
mysql -u root -p
The -u flag specifies a username and -p prompts for a password. After connecting, you can create databases and users directly from the prompt.
Installing PostgreSQL
PostgreSQL offers similar installation methods. On most systems you can use the default package manager:
# Ubuntu or Debian
sudo apt install postgresql
# macOS (Homebrew)
brew install postgresql
To connect, use the psql client:
sudo service postgresql start
psql -U postgres
Inside psql you can execute SQL statements directly. Commands starting with a backslash (like \l or \d) control the session rather than sending SQL to the server.
The default administrative account is postgres; you will often create your own user with limited privileges for daily work.
Using SQLite
SQLite is built into many operating systems and requires no setup beyond downloading a small binary if it is not already installed. To start an interactive session, simply run:
sqlite3 mydata.db
This creates (or opens) a file called mydata.db. All data lives inside this file. SQLite is ideal for experiments, testing, or small applications that do not require a networked server.
Using CLI and GUI tools
Every major database comes with a command-line client, and many also offer graphical tools for browsing and editing data. Both approaches are valuable: the CLI is fast and scriptable, while GUIs provide context and visualization.
Command-line clients
The command line is the quickest way to execute SQL. Once connected, you type statements and press Enter. A semicolon (;) marks the end of a statement. For example:
mysql> SHOW DATABASES;
psql> \l
sqlite> .tables
These commands list databases or tables depending on the system. You can also redirect input from a file to run a complete script:
mysql -u user -p < setup.sql
Keep reusable .sql files for schema creation and testing. They can be version-controlled and reused on new systems.
Graphical tools
If you prefer a visual interface, there are many free and commercial clients available:
- MySQL Workbench – Official graphical client for MySQL.
- pgAdmin – The primary management tool for PostgreSQL.
- SQL Server Management Studio – Used for Microsoft SQL Server.
- DBeaver and HeidiSQL – Popular cross-platform options supporting many databases.
Most GUI tools let you browse tables, run queries, and export data in CSV or JSON formats. They are convenient for learning and for checking results of your SQL scripts.
Even when working in a GUI, review the SQL it executes, especially when performing changes that affect data integrity.
Creating your first database and table
After installing your chosen system, you can create your first database and a simple table. The syntax differs slightly between engines, but the core idea is the same: define a database, connect to it, and create a table with columns and constraints.
Creating a database
-- MySQL and PostgreSQL
CREATE DATABASE sampledb;
-- SQLite (creates a file automatically)
sqlite3 sampledb.db
Once created, connect to the new database:
USE sampledb; -- MySQL
\c sampledb -- PostgreSQL
Creating a table
CREATE TABLE users (
user_id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
email TEXT UNIQUE,
joined DATE DEFAULT CURRENT_DATE
);
This table stores user information. Try inserting and selecting rows:
INSERT INTO users (user_id, name, email)
VALUES (1, 'Alice', 'alice@example.com');
SELECT * FROM users;
The output confirms that your database and table are working correctly. You now have a functioning environment for running every example in this book.
With the tools installed and your first database ready, you can begin learning the language itself—starting with how to query and retrieve data using SELECT statements in the next chapter.
Chapter 5: Data Types and Storage
Every piece of data in a relational database has a defined type that determines what values it can hold and how they are stored internally. Choosing the correct types improves precision, saves space, and helps the database process queries efficiently. This chapter introduces the major categories of SQL data types (numeric, string, date/time, and special types) and explains how to select the right ones for your tables.
Numeric, string, date/time, and special types
The SQL standard defines several broad families of data types. Each database vendor implements them with small variations, but the concepts remain the same. The four most common categories are numeric, character, temporal, and special-purpose types.
Numeric types
Numeric types represent numbers used in calculations, identifiers, and measurements. They can be exact or approximate.
- INTEGER (or INT): Whole numbers, typically 32-bit.
- BIGINT: Larger integers for identifiers and counters.
- DECIMAL(p,s) or NUMERIC(p,s): Exact fixed-point values with precision p and scale s.
- FLOAT or REAL: Approximate floating-point numbers for scientific or statistical data.
CREATE TABLE products (
id INTEGER PRIMARY KEY,
name TEXT,
price DECIMAL(8,2),
quantity INT
);
Use DECIMAL or NUMERIC for currency values to avoid rounding errors, and integer types for counts or IDs.
Never store monetary values in FLOAT.
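The rounding problem is easy to reproduce. This short sketch uses Python's standard decimal module to contrast binary floating point with exact decimal arithmetic; the amounts are invented.

```python
from decimal import Decimal

# Ten payments of 0.10 should total exactly 1.00.
float_total = sum([0.10] * 10)                           # binary float: accumulates error
exact_total = sum([Decimal("0.10")] * 10, Decimal("0"))  # exact decimal arithmetic

print(float_total == 1.0)              # False: the float sum drifts from 1.0
print(exact_total == Decimal("1.00"))  # True
```

The same drift occurs inside a database that stores currency in FLOAT or REAL, which is why DECIMAL and NUMERIC exist.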
String types
Character data types store text, identifiers, and other symbolic data. The main distinction is between fixed and variable length.
- CHAR(n): Fixed-length text padded with spaces if shorter than n.
- VARCHAR(n): Variable-length text up to n characters.
- TEXT or CLOB: Large, unbounded text values (nonstandard length).
CREATE TABLE customers (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(200) UNIQUE
);
Choose a length that fits your data comfortably without wasting space. VARCHAR is the usual default for user input and short strings.
When choosing VARCHAR limits, base them on expected real data rather than arbitrary round numbers. This makes validation easier and signals intent to future developers.
Date and time types
Temporal types represent calendar dates, times of day, or combined timestamps. They are crucial for tracking when events occur and for performing time-based queries.
- DATE: Year, month, and day.
- TIME: Time of day (optionally with fractional seconds).
- TIMESTAMP: Combined date and time.
- INTERVAL: Duration (such as “3 days” or “2 hours”).
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
order_date DATE DEFAULT CURRENT_DATE,
shipped_at TIMESTAMP
);
Use date functions like NOW(), CURRENT_DATE, and AGE() (PostgreSQL) to work with temporal values. Storing times in UTC makes it easier to convert and compare across regions.
Special types
Modern SQL engines support additional specialized types for structured and binary data:
- BOOLEAN: Logical true/false values.
- BLOB: Binary large object, for files or images.
- JSON: Structured documents in JavaScript Object Notation.
- UUID: Universally unique identifiers.
- ARRAY (in PostgreSQL and others): Lists of elements of the same type.
CREATE TABLE settings (
id INTEGER PRIMARY KEY,
active BOOLEAN DEFAULT TRUE,
config JSON,
file BLOB
);
Special types allow you to store complex or non-tabular data without breaking the relational model. However, they should be used carefully to avoid portability issues between different database engines.
JSON support varies widely. Some systems store it as text, while others use binary formats and allow direct indexing. Write portable code by using JSON for flexible metadata, not core relational data.
Choosing correct types for efficiency and precision
Efficient database design depends as much on choosing the right types as on writing good queries. Smaller and more precise types save space, speed up joins, and make indexing faster.
Match type to purpose
Choose the simplest type that correctly expresses the data’s meaning. If a value is always whole, use an integer rather than a floating point. If a value can be absent, decide whether NULL is acceptable or if a default should apply.
-- Better: explicit types and constraints
CREATE TABLE inventory (
item_id INTEGER PRIMARY KEY,
quantity SMALLINT CHECK (quantity >= 0),
restocked DATE DEFAULT CURRENT_DATE
);
Consider storage and indexing
Larger data types consume more disk space and can slow queries. For instance, using BIGINT where INTEGER would suffice doubles the index size without benefit. Similarly, overly large VARCHAR columns can waste memory and disk bandwidth.
Precision and scale in decimals
When defining DECIMAL(p,s) or NUMERIC(p,s), choose precision and scale that fit realistic data ranges. For currency, two decimal places usually suffice. Overly large precision can slow arithmetic operations.
-- Price up to 999999.99
price DECIMAL(8,2)
This definition stores up to six digits before the decimal point and two after, using an exact decimal representation rather than approximate binary floating point.
Use defaults wisely
Defaults supply values automatically when none are provided, reducing repetitive input. Use them for predictable data such as creation timestamps, flags, or counters.
CREATE TABLE logs (
log_id INTEGER PRIMARY KEY,
created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
active BOOLEAN DEFAULT TRUE
);
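A sketch of defaults in action, using Python's standard sqlite3 module with the logs table above (SQLite 3.23 or later accepts the TRUE literal): the INSERT names only log_id, and the engine fills in the rest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE logs (
    log_id INTEGER PRIMARY KEY,
    created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    active BOOLEAN DEFAULT TRUE
)
""")
conn.execute("INSERT INTO logs (log_id) VALUES (1)")  # omit created and active

created, active = conn.execute(
    "SELECT created, active FROM logs WHERE log_id = 1"
).fetchone()
print(active)                # 1 (SQLite stores TRUE as the integer 1)
print(created is not None)   # True: the timestamp was filled in automatically
```

Columns with sensible defaults keep INSERT statements short and ensure rows are initialized the same way everywhere.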
NULL instead of substituting a placeholder.
Portability considerations
Different database engines have slightly different names or behaviors for equivalent types. Here are some common equivalents:
| Concept | ANSI SQL | MySQL | PostgreSQL | SQLite |
| Auto-increment integer | GENERATED AS IDENTITY | AUTO_INCREMENT | SERIAL | INTEGER PRIMARY KEY |
| Fixed text | CHAR(n) | CHAR(n) | CHAR(n) | TEXT |
| Variable text | VARCHAR(n) | VARCHAR(n) | VARCHAR(n) | TEXT |
| Date/time | TIMESTAMP | DATETIME | TIMESTAMP | TEXT (ISO format) |
Understanding these differences will help you design schemas that are easy to port between systems or share across teams using different engines.
By selecting appropriate data types and storage options, you lay the foundation for efficient, maintainable databases that behave predictably under any workload.
Chapter 6: Designing Tables
Designing tables is one of the most important steps in building a relational database. A good table design enforces clear structure, minimizes redundancy, and provides a foundation for reliable queries. The CREATE TABLE statement defines the structure of each table—its columns, data types, and integrity rules. This chapter explores the statement in depth, explains how constraints ensure valid data, and shows how to create automatically incrementing identifiers.
The CREATE TABLE statement in depth
The CREATE TABLE statement defines a new table and its structure. You specify column names, data types, optional constraints, and sometimes table-level rules. Here is a simple but complete example:
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name VARCHAR(100) NOT NULL,
position VARCHAR(50),
salary DECIMAL(10,2) CHECK (salary >= 0),
hired_on DATE DEFAULT CURRENT_DATE
);
This statement creates a table with five columns, assigns types, and includes constraints to ensure data validity. The database engine records this definition in its system catalog and enforces the rules automatically.
Basic structure
The general form of CREATE TABLE is:
CREATE TABLE table_name (
column_name data_type [column_constraints],
...,
[table_constraints]
);
Square brackets indicate optional elements. Column-level constraints apply only to one column. Table-level constraints can refer to multiple columns.
Table options
Many databases allow additional options such as specifying storage engines, schema names, or table inheritance (in PostgreSQL). For example:
CREATE TABLE sales.orders (
order_id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL,
total DECIMAL(8,2)
);
Here, the table resides in the sales schema rather than the default one. The exact syntax for such options differs among engines, but the core concept remains consistent.
Some engines add nonstandard table options such as PARTITION BY or TABLESPACE. Stick to standard clauses for maximum portability unless you need vendor-specific features.
Constraints
Constraints define the logical rules that keep data valid. They act as safeguards inside the database engine, preventing bad data from ever being stored. Constraints can be applied per column or across multiple columns at once.
NOT NULL
NOT NULL requires a value in the column. Use it for fields that are always mandatory, such as names, IDs, or timestamps.
CREATE TABLE users (
user_id INTEGER PRIMARY KEY,
username VARCHAR(50) NOT NULL,
email VARCHAR(100)
);
Here, username must always contain a value, but email may be NULL.
DEFAULT
DEFAULT assigns a value when none is provided in an INSERT statement. It can be a constant or an expression supported by the database.
CREATE TABLE logs (
id INTEGER PRIMARY KEY,
created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
active BOOLEAN DEFAULT TRUE
);
Defaults simplify inserts and maintain consistent initialization of rows.
CHECK
CHECK validates that column values satisfy a logical condition. If the condition fails, the statement is rejected.
CREATE TABLE accounts (
account_id INTEGER PRIMARY KEY,
balance DECIMAL(10,2) CHECK (balance >= 0)
);
Here, the database will not allow negative balances. You can also place CHECK constraints at the table level to involve multiple columns.
Some engines historically parsed but did not enforce CHECK constraints (MySQL did this before version 8.0.16). Always verify your system enforces them correctly.
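To see a CHECK constraint reject bad data, here is a minimal sketch using Python's stdlib sqlite3 module (SQLite enforces CHECK; the table and values are illustrative, not from a real schema):

```python
import sqlite3

# Create an in-memory database with a CHECK constraint on balance.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        balance    REAL CHECK (balance >= 0)
    )
""")
conn.execute("INSERT INTO accounts (account_id, balance) VALUES (1, 100.0)")

# A negative balance violates the constraint; the engine rejects the row.
try:
    conn.execute("INSERT INTO accounts (account_id, balance) VALUES (2, -5.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True — the bad row was never stored
```

The invalid insert raises an error and the table still holds only the one valid row.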
UNIQUE
UNIQUE ensures that all values in a column (or combination of columns) are distinct. It can be used alongside a PRIMARY KEY or independently.
CREATE TABLE customers (
id INTEGER PRIMARY KEY,
email VARCHAR(200) UNIQUE,
phone VARCHAR(20)
);
This rule prevents two customers from sharing the same email address.
PRIMARY KEY
The PRIMARY KEY constraint uniquely identifies each row. It can consist of one or more columns. A table can have only one primary key definition.
CREATE TABLE products (
product_id INTEGER,
sku VARCHAR(20),
name TEXT,
PRIMARY KEY (product_id)
);
When a primary key is defined, the database automatically enforces uniqueness and disallows NULL values in the key columns.
FOREIGN KEY
A FOREIGN KEY establishes a relationship between two tables. It ensures that each referenced value exists in the parent table, preserving referential integrity.
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER REFERENCES customers(id)
ON DELETE CASCADE
ON UPDATE RESTRICT
);
Here, if a customer is deleted, their orders will be removed automatically because of ON DELETE CASCADE. The ON UPDATE RESTRICT clause prevents changing a customer ID that still has related orders.
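A quick way to watch ON DELETE CASCADE work is this sketch with Python's sqlite3 module (note that SQLite requires the foreign_keys pragma to be switched on per connection; the tables are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK checks by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id) ON DELETE CASCADE
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (10, 1)")
conn.execute("INSERT INTO orders VALUES (11, 1)")

# Deleting the parent row cascades to its child rows.
conn.execute("DELETE FROM customers WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(remaining)  # 0 — both orders were removed automatically
```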
Auto-increment / identity columns
Most tables need a unique numeric key that increases automatically. Rather than calculating it manually, you can use an auto-increment or identity column. This feature generates a new value for each inserted row.
MySQL
CREATE TABLE employees (
emp_id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100)
);
The AUTO_INCREMENT keyword tells MySQL to assign the next available integer automatically. Values remain unique even when rows are deleted.
PostgreSQL
PostgreSQL traditionally used the SERIAL pseudo-type and now supports the standard GENERATED AS IDENTITY clause:
CREATE TABLE employees (
emp_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name VARCHAR(100)
);
This approach conforms to the SQL standard and provides better control over sequence options.
SQLite
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name TEXT
);
In SQLite, a column defined as INTEGER PRIMARY KEY automatically acts as an alias for the internal row identifier. Omitted values are auto-assigned, typically one greater than the largest existing rowid; add the AUTOINCREMENT keyword if deleted values must never be reused.
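This behavior is easy to confirm with a short Python sketch against an in-memory SQLite database (column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT)")

# emp_id is omitted, so SQLite assigns the next rowid automatically.
cur = conn.execute("INSERT INTO employees (name) VALUES ('Alice')")
first_id = cur.lastrowid
cur = conn.execute("INSERT INTO employees (name) VALUES ('Bob')")
second_id = cur.lastrowid

print(first_id, second_id)  # 1 2
```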
SQL Server
CREATE TABLE employees (
emp_id INT IDENTITY(1,1) PRIMARY KEY,
name NVARCHAR(100)
);
The parameters in IDENTITY(1,1) specify the starting value and increment. The first inserted row receives 1, the next 2, and so on.
Auto-generated keys are convenient, efficient, and widely supported. Combined with proper constraints, they give every table a reliable way to identify its rows.
By mastering CREATE TABLE and constraints, you can build schemas that maintain data integrity and make future queries predictable and efficient.
Chapter 7: Altering and Dropping Structures
Once a database is in use, structures evolve. You add columns, rename objects, create or remove indexes, and carefully roll out changes so applications continue to run. This chapter covers the core statements for structural change; safe patterns for index management; and practical guidance for schema migrations.
ALTER TABLE, RENAME, and DROP TABLE
ALTER TABLE changes an existing table without recreating it. Typical operations include adding or dropping columns, changing data types, setting or removing constraints, and renaming columns or the table itself.
Common ALTER TABLE patterns
-- Add a nullable column with a default (ANSI-style)
ALTER TABLE accounts
ADD COLUMN last_login TIMESTAMP DEFAULT CURRENT_TIMESTAMP;
-- Backfill and then enforce NOT NULL (safer two-step)
UPDATE accounts SET last_login = CURRENT_TIMESTAMP WHERE last_login IS NULL;
ALTER TABLE accounts
ALTER COLUMN last_login SET NOT NULL; -- PostgreSQL
-- MySQL
ALTER TABLE accounts
MODIFY last_login TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;
-- SQL Server
ALTER TABLE accounts
ALTER COLUMN last_login DATETIME2 NOT NULL;
-- Rename a column
-- PostgreSQL / SQLite
ALTER TABLE accounts RENAME COLUMN last_login TO last_seen;
-- MySQL 8+
ALTER TABLE accounts RENAME COLUMN last_login TO last_seen;
-- SQL Server
EXEC sp_rename 'dbo.accounts.last_login', 'last_seen', 'COLUMN';
-- Change a column type (with USING when needed)
-- PostgreSQL
ALTER TABLE orders
ALTER COLUMN total TYPE NUMERIC(12,2)
USING total::NUMERIC(12,2);
-- Add a CHECK constraint
ALTER TABLE accounts
ADD CONSTRAINT chk_age CHECK (age >= 13);
-- Drop a constraint (name varies by engine)
-- PostgreSQL
ALTER TABLE accounts DROP CONSTRAINT chk_age;
-- MySQL (checks are named automatically; use SHOW CREATE TABLE to find the name)
-- SQL Server
ALTER TABLE accounts DROP CONSTRAINT chk_age;
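The add-then-backfill pattern above can be exercised end to end in SQLite via Python's sqlite3 module. This is a sketch only: SQLite cannot add NOT NULL to an existing column without rebuilding the table, so the final enforcement step is left to the engines that support it; table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO accounts (username) VALUES ('jdoe'), ('asmith')")

# Step 1: add the column as nullable, so existing rows are unaffected.
conn.execute("ALTER TABLE accounts ADD COLUMN last_login TEXT")

# Step 2: backfill every existing row before tightening constraints.
conn.execute(
    "UPDATE accounts SET last_login = datetime('now') WHERE last_login IS NULL"
)

nulls = conn.execute(
    "SELECT COUNT(*) FROM accounts WHERE last_login IS NULL"
).fetchone()[0]
print(nulls)  # 0 — safe to enforce NOT NULL on engines that allow it
```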
Renaming tables and columns with RENAME
Most engines provide a simple rename that preserves data and constraints.
-- Rename a table
-- PostgreSQL / SQLite
ALTER TABLE session RENAME TO session_archive;
-- MySQL
RENAME TABLE session TO session_archive;
-- SQL Server
EXEC sp_rename 'dbo.session', 'session_archive';
Dropping tables with DROP TABLE
Dropping a table removes its data and metadata. Use conditional clauses to avoid errors if the table may not exist, and use dependency options carefully.
-- Safer: avoid errors if missing
DROP TABLE IF EXISTS temp_import;
-- Handle dependencies
-- PostgreSQL: CASCADE drops dependent objects; RESTRICT fails if any exist
DROP TABLE report CASCADE;
DROP TABLE report RESTRICT;
-- MySQL: CASCADE/RESTRICT are parsed but not enforced for tables; check FKs
DROP TABLE report;
-- SQL Server
DROP TABLE IF EXISTS dbo.report;
DROP TABLE is irreversible without a backup. Consider ALTER TABLE ... RENAME TO ..._archive for a deprecation period, or export data before removal.
Quick reference: common operations and dialect notes
| Operation | PostgreSQL | MySQL | SQL Server | SQLite |
| Add column | ALTER TABLE ... ADD COLUMN ... | ALTER TABLE ... ADD COLUMN ... | ALTER TABLE ... ADD ... | ALTER TABLE ... ADD COLUMN ... |
| Rename column | ALTER TABLE ... RENAME COLUMN a TO b | ALTER TABLE ... RENAME COLUMN a TO b (8+) | sp_rename 'tbl.a','b','COLUMN' | ALTER TABLE ... RENAME COLUMN a TO b (3.25+) |
| Change type | ALTER COLUMN ... TYPE ... USING ... | MODIFY COLUMN ... | ALTER COLUMN ... | Often requires create-copy-swap |
| Drop constraint | DROP CONSTRAINT name | Drop by altering column/index; FK by name | DROP CONSTRAINT name | Limited; often recreate table |
| Rename table | ALTER TABLE ... RENAME TO ... | RENAME TABLE ... TO ... | sp_rename 'schema.tbl','new' | ALTER TABLE ... RENAME TO ... |
| Drop table | DROP TABLE [IF EXISTS] ... [CASCADE/RESTRICT] | DROP TABLE [IF EXISTS] ... | DROP TABLE IF EXISTS ... | DROP TABLE IF EXISTS ... |
Managing indexes
Indexes accelerate reads by organizing data for fast lookups. Design them for the queries you run; balance read speed against write overhead and storage.
Creating and dropping indexes
-- Basic non-unique index
CREATE INDEX idx_users_email ON users (email);
-- Unique index (enforces no duplicates)
CREATE UNIQUE INDEX uq_users_email ON users (email);
-- Composite (multi-column) index
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
-- Functional / expression index
-- PostgreSQL
CREATE INDEX idx_users_email_lower ON users ((lower(email)));
-- SQLite
CREATE INDEX idx_users_email_lower ON users (lower(email));
-- MySQL (before 8.0.13: use a generated column; 8.0.13+ also supports functional key parts directly)
ALTER TABLE users ADD COLUMN email_lower VARCHAR(255) GENERATED ALWAYS AS (lower(email)) STORED;
CREATE INDEX idx_users_email_lower ON users (email_lower);
-- Include (covering) columns
-- PostgreSQL 11+
CREATE INDEX idx_orders_cover ON orders (customer_id, order_date) INCLUDE (status);
-- SQL Server
CREATE INDEX idx_orders_cover ON dbo.orders (customer_id, order_date) INCLUDE (status);
-- Partial / filtered index
-- PostgreSQL
CREATE INDEX idx_orders_open ON orders (customer_id) WHERE status = 'OPEN';
-- SQL Server
CREATE INDEX idx_orders_open ON dbo.orders (customer_id) WHERE status = 'OPEN';
-- SQLite supports partial indexes with WHERE
CREATE INDEX idx_orders_open ON orders (customer_id) WHERE status = 'OPEN';
-- Drop indexes (syntax differs)
-- PostgreSQL / SQLite
DROP INDEX IF EXISTS idx_users_email;
-- MySQL
DROP INDEX idx_users_email ON users;
-- SQL Server
DROP INDEX idx_users_email ON dbo.users;
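To confirm an index is actually used, every engine can show its query plan. Here is a minimal sketch using SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module (index and table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE UNIQUE INDEX uq_users_email ON users (email)")

# EXPLAIN QUERY PLAN describes how SQLite would execute the statement.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
    ("a@example.com",),
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)  # a SEARCH using uq_users_email, not a full-table SCAN
```

If the output said SCAN instead of SEARCH, the index would not be helping this query.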
Choosing columns and order
- Place the most selective column first for equality lookups; follow with columns used in join predicates and range filters.
- Create one good composite index instead of many overlapping single-column indexes.
- Cover common queries (all needed columns available by index) using include columns where supported; otherwise consider wider composite keys carefully.
- Prefer expression indexes for case-insensitive search, JSON keys, or computed fields (engine support varies).
Rebuilding and concurrent indexing
-- Build an index without blocking writes (engine-specific)
-- PostgreSQL
CREATE INDEX CONCURRENTLY idx_events_created_at ON events (created_at);
-- Rebuild or reorganize to defragment
-- SQL Server
ALTER INDEX ALL ON dbo.orders REORGANIZE; -- light
ALTER INDEX ALL ON dbo.orders REBUILD; -- heavy
Schema migrations
Schema migrations are controlled, versioned changes to database structures applied in sequence. A reliable process reduces risk and keeps environments aligned.
Versioning and tracking
-- A simple migrations ledger
CREATE TABLE schema_migrations (
version VARCHAR(50) PRIMARY KEY,
applied_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Mark a migration as applied
INSERT INTO schema_migrations (version) VALUES ('2025-11-02_1200_add_last_seen');
Each migration is an ordered script with an id (for example 2025-11-02_1200_add_last_seen.sql) that can be applied once. Tools like Flyway and Liquibase manage this automatically; the principle is the same.
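The ledger principle can be sketched as a tiny migration runner in Python with sqlite3. The migration ids and SQL bodies below are illustrative, not from any real project; real tools like Flyway add locking, checksums, and ordering guarantees on top of this idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE schema_migrations (
        version    TEXT PRIMARY KEY,
        applied_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")

# Ordered list of (version, sql) pairs — purely illustrative.
MIGRATIONS = [
    ("001_create_accounts", "CREATE TABLE accounts (id INTEGER PRIMARY KEY)"),
    ("002_add_email", "ALTER TABLE accounts ADD COLUMN email TEXT"),
]

def migrate(conn):
    """Apply every migration whose version is not yet in the ledger."""
    applied = {v for (v,) in conn.execute("SELECT version FROM schema_migrations")}
    for version, sql in MIGRATIONS:
        if version not in applied:
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_migrations (version) VALUES (?)", (version,)
            )

migrate(conn)
migrate(conn)  # idempotent: the second run applies nothing new

count = conn.execute("SELECT COUNT(*) FROM schema_migrations").fetchone()[0]
print(count)  # 2 — each migration recorded exactly once
```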
Designing safe, incremental changes
- Backward compatibility: deploy code that can work with both old and new schema; then migrate the data; then enforce stricter constraints.
- Two-step constraint tightening: add the column as nullable; backfill; set NOT NULL; add foreign keys at the end when possible.
- Online operations: use online/instant algorithms when supported (for example MySQL instant add column; PostgreSQL concurrently built indexes; SQL Server online index operations in suitable editions).
- Roll forward: prefer forward fixes over rollbacks; keep rollback scripts only for small reversible steps (for example dropping a new index).
Example migration set
-- 001_expand_add_column.sql
ALTER TABLE accounts
ADD COLUMN last_seen TIMESTAMP;
-- 002_backfill_last_seen.sql (run in batches in application code or a job)
UPDATE accounts
SET last_seen = GREATEST(last_login, created_at)
WHERE last_seen IS NULL
AND id BETWEEN :min_id AND :max_id;
-- 003_enforce_not_null.sql
-- PostgreSQL
ALTER TABLE accounts ALTER COLUMN last_seen SET NOT NULL;
-- MySQL
ALTER TABLE accounts MODIFY last_seen TIMESTAMP NOT NULL;
-- SQL Server
ALTER TABLE accounts ALTER COLUMN last_seen DATETIME2 NOT NULL;
-- 004_add_index.sql
-- PostgreSQL (concurrent)
CREATE INDEX CONCURRENTLY idx_accounts_last_seen ON accounts (last_seen);
-- MySQL
CREATE INDEX idx_accounts_last_seen ON accounts (last_seen);
-- SQL Server
CREATE INDEX idx_accounts_last_seen ON dbo.accounts (last_seen);
-- 005_contract_drop_old.sql
-- After monitoring period
ALTER TABLE accounts DROP COLUMN last_login;
Blue-green and compatibility layers
For renames that risk breaking code, introduce a compatibility layer:
-- Keep old name available during transition
-- PostgreSQL view
CREATE VIEW session AS SELECT * FROM session_archive;
-- SQL Server synonym
CREATE SYNONYM dbo.session FOR dbo.session_archive;
-- MySQL view
CREATE VIEW session AS SELECT * FROM session_archive;
Structure of a migration file
-- <version>__<summary>.sql
-- 2025-11-02__rename_session_to_session_archive.sql
BEGIN; -- or implicit transaction if the engine handles it
-- … DDL / DML statements here
COMMIT;
Where the engine does not support transactional DDL (or only for some statements), split the migration into safe chunks and ensure idempotency checks are present (for example IF EXISTS, IF NOT EXISTS, or guards in application code).
Some engines commit implicitly around DDL or take heavy locks (for example during ALTER TABLE in some modes). Read engine notes and plan accordingly.
With careful use of ALTER TABLE, disciplined index management, and versioned migrations, you can evolve a schema smoothly while keeping systems available.
Chapter 8: Indexes and Performance Basics
Indexes speed up reads by providing a smaller, ordered structure that the query planner can search quickly; they trade extra storage and slower writes for faster lookups. This chapter explains physical organization (clustered versus non-clustered layouts), composite index design and maintenance settings, and practical guidance on when to index or leave data unindexed.
Clustered vs non-clustered
A clustered organization stores table rows in the same order as a chosen key; a non-clustered index stores a separate structure that points to table rows. Different engines implement this differently.
Engine behaviors at a glance
| Engine | Clustered storage | Notes |
| MySQL (InnoDB) | Yes (primary key defines physical order) | The table is organized by the primary key; secondary indexes store primary key values as row locators. |
| SQL Server | Optional clustered index | One clustered index per table; non-clustered indexes reference the clustered key or a row identifier. |
| PostgreSQL | No permanent clustering | CLUSTER rewrites the table in index order but does not keep it permanently aligned; normal indexes are non-clustered. |
| SQLite | No separate clustered type | Rowids define storage order unless using WITHOUT ROWID; indexes are non-clustered. |
Practical implications
- Range scans: clustered order helps queries that read contiguous key ranges (for example by date).
- Covering lookups: non-clustered indexes can include needed columns to avoid visiting the base table (engine support varies).
- Write cost: maintaining order causes page splits and extra I/O under random insert patterns; bulk loads benefit from pre-sorted input where feasible.
-- SQL Server: create a clustered index on (created_at, id)
CREATE CLUSTERED INDEX cix_orders_created_id
ON dbo.orders (created_at, id);
-- PostgreSQL: opportunistic clustering (one-off rewrite)
CLUSTER orders USING idx_orders_created_at;
-- Later, to refresh the order again:
CLUSTER orders;
-- MySQL (InnoDB): primary key defines clustering
ALTER TABLE orders
ADD PRIMARY KEY (created_at, id);
If most lookups follow a natural composite pattern (for example customer_id then created_at), keep the surrogate primary key for simplicity and add a composite secondary index for query speed.
Composite indexes, fillfactor, and maintenance
Composite indexes store multiple columns in a single index key. The column order matters; the index can support lookups that match a left-most prefix of the key (and sometimes more with advanced checks by the planner).
Designing composite keys
- Put equality filters first, then range filters, then columns used for ordering.
- Prefer one well-chosen composite index over several overlapping single-column indexes.
- Use expression or functional indexes for case-insensitive or computed lookups where supported.
- Use include columns to cover queries without widening the key (SQL Server and PostgreSQL have explicit
INCLUDE).
-- Composite equality then range
CREATE INDEX idx_orders_customer_date
ON orders (customer_id, order_date);
-- Expression index for case-insensitive search
-- PostgreSQL
CREATE INDEX idx_users_email_lower ON users ((lower(email)));
-- SQLite
CREATE INDEX idx_users_email_lower ON users (lower(email));
-- Covering index with include columns
-- PostgreSQL 11+
CREATE INDEX idx_orders_cov ON orders (customer_id, order_date) INCLUDE (status, total);
-- SQL Server
CREATE INDEX idx_orders_cov ON dbo.orders (customer_id, order_date) INCLUDE (status, total);
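The left-most prefix rule can be demonstrated with a short Python sketch against SQLite: a composite index on (customer_id, order_date) serves a customer_id lookup, but a filter on order_date alone cannot seek into it. Table and index names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date  TEXT,
        total       REAL
    )
""")
conn.execute(
    "CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date)"
)

def plan(sql):
    """Return SQLite's query-plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

by_customer = plan("SELECT total FROM orders WHERE customer_id = 7")
by_date = plan("SELECT total FROM orders WHERE order_date = '2025-01-01'")
print(by_customer)  # SEARCH using idx_orders_customer_date
print(by_date)      # SCAN — the index prefix does not match
```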
An index on (a, b) helps queries that filter on a or on a and b. It usually does not help queries that filter only on b. Add a second index only if that pattern is common and material.
Fillfactor (page fullness) and fragmentation
Fillfactor controls how full to pack index pages at build time so that future inserts have room to grow. Lower values reduce page splits at the cost of more disk space.
| Engine | Per-index setting | Example |
| PostgreSQL | Yes | CREATE INDEX ... WITH (fillfactor = 90) |
| SQL Server | Yes | CREATE INDEX ... WITH (FILLFACTOR = 90) |
| MySQL (InnoDB) | Global knob for some operations | Engine manages page fill; per-index setting is not typical in common builds |
| SQLite | No | Engine manages internal layout automatically |
-- PostgreSQL
CREATE INDEX idx_events_created_at
ON events (created_at)
WITH (fillfactor = 90);
-- SQL Server
CREATE INDEX idx_events_created_at
ON dbo.events (created_at)
WITH (FILLFACTOR = 90, PAD_INDEX = ON);
Index maintenance and statistics
- Rebuild versus reorganize: heavy rebuild recreates the index; light reorg defragments leaf pages where supported.
- Statistics: up-to-date statistics inform the planner; autovacuum or auto-stats usually keep these fresh, but manual refresh can help after big changes.
- Concurrent builds: build or rebuild without long blocking where the engine supports it.
-- PostgreSQL
REINDEX INDEX idx_users_email; -- full rebuild
ANALYZE users; -- refresh stats
CREATE INDEX CONCURRENTLY idx_e_c ON events (created_at);
-- SQL Server
ALTER INDEX ALL ON dbo.orders REORGANIZE;
ALTER INDEX ALL ON dbo.orders REBUILD WITH (ONLINE = ON); -- edition dependent
UPDATE STATISTICS dbo.orders WITH FULLSCAN;
-- MySQL
ANALYZE TABLE orders;
OPTIMIZE TABLE orders; -- can rebuild table and indexes (engine specific)
When (and when not) to index
Not every column should be indexed. Good candidates are columns used in frequent search predicates, joins, and orderings where the index avoids scanning large portions of the table. Poor candidates either do not filter much or change so often that the write cost outweighs the read benefit.
Good candidates
- Foreign key columns used in joins (for example orders.customer_id).
- Columns used in equality filters on large tables (for example WHERE email = ?).
- Composite patterns that match common queries (for example (customer_id, created_at) for recent-by-customer listings).
- Columns used for sorting or range scans (for example ORDER BY created_at DESC LIMIT 50).
- Expressions that normalize search (for example lower(email) or JSON key extraction) where the engine supports expression indexes.
-- Typical lookup and ordering needs
CREATE INDEX idx_orders_customer_created
ON orders (customer_id, created_at DESC);
-- Join helper for a foreign key
CREATE INDEX idx_orders_product_id
ON orders (product_id);
Poor candidates
- Columns with very low cardinality (for example a boolean flag) unless paired in a composite index that increases selectivity.
- Small tables that fit in memory or are scanned quickly.
- Columns updated in nearly every write (for example counters that change on each event).
- Leading wildcard patterns that cannot use a normal b-tree (for example LIKE '%term'); consider full-text features instead.
- Highly volatile temporary data that is created and deleted in short cycles.
Measuring impact
- Record query latency and throughput before and after adding an index.
- Inspect plan changes (for example index-only scans versus heap visits; key lookups versus scans).
- Track write latency and storage growth to catch harmful side effects.
-- PostgreSQL: show why the planner chooses a path
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE customer_id = $1
ORDER BY created_at DESC
LIMIT 50;
-- SQL Server
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT TOP (50) *
FROM dbo.orders
WHERE customer_id = @cid
ORDER BY created_at DESC;
With a clear understanding of physical layout, careful composite design, sensible fillfactor choices, and a policy for measuring benefits, you can keep read performance high without over-indexing or harming write throughput.
Chapter 9: Inserting Data
Adding rows to tables is one of the most common database operations. The INSERT family of statements supports single and multiple row inserts, copying data from other tables, and creating new tables directly from query results. This chapter covers the core syntax and useful patterns across engines.
Always list the target columns explicitly in INSERT statements. It prevents silent mismatches when the table structure changes.
INSERT INTO syntax and multi-row inserts
The standard form specifies the target table, a list of columns, and the values to insert. Each value list corresponds to one row.
INSERT INTO employees (first_name, last_name, role, salary)
VALUES ('Alice', 'Green', 'Manager', 90000);
To insert multiple rows in one statement, provide additional value sets separated by commas. This is efficient and supported by all major engines.
INSERT INTO employees (first_name, last_name, role, salary)
VALUES
('Bob', 'White', 'Engineer', 75000),
('Carla', 'Smith', 'Designer', 70000),
('David', 'Brown', 'Engineer', 77000);
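From application code, the same multi-row insert is usually done with a parameterized batch. This sketch uses Python's sqlite3 executemany, which sends one prepared statement per row inside a single transaction (the schema mirrors the example above; data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        emp_id     INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT,
        role       TEXT,
        salary     REAL
    )
""")

rows = [
    ("Bob", "White", "Engineer", 75000),
    ("Carla", "Smith", "Designer", 70000),
    ("David", "Brown", "Engineer", 77000),
]

# The 'with conn' block commits on success and rolls back on error.
with conn:
    conn.executemany(
        "INSERT INTO employees (first_name, last_name, role, salary) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )

count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(count)  # 3
```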
Column omission and defaults
If a column has a default value or allows nulls, you can omit it from the insert list. The database uses the default expression or sets it to NULL.
-- 'created_at' defaults to CURRENT_TIMESTAMP
INSERT INTO accounts (username, email)
VALUES ('jdoe', 'jdoe@example.com');
Defaults apply the same way whether rows arrive from INSERT INTO ... VALUES or from INSERT INTO ... SELECT clauses.
Engine-specific bulk patterns
- PostgreSQL: supports multi-row inserts and COPY for high-volume imports.
- MySQL: supports multi-row inserts and LOAD DATA INFILE.
- SQL Server: supports multi-row inserts and BULK INSERT.
- SQLite: supports multi-row inserts since 3.7.11 and can use transactions for batches.
-- PostgreSQL bulk load (server-side)
COPY employees (first_name, last_name, role, salary)
FROM '/tmp/employees.csv'
WITH (FORMAT csv, HEADER true);
-- MySQL
LOAD DATA INFILE '/tmp/employees.csv'
INTO TABLE employees
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
Inserting from subqueries or SELECT … INTO
You can populate a table using data from another query. This pattern is common for data transformations, archiving, or building summary tables.
Using INSERT INTO ... SELECT
The INSERT INTO ... SELECT form inserts the results of a query into an existing table.
-- Copy all engineers into a new department
INSERT INTO engineers (first_name, last_name, salary)
SELECT first_name, last_name, salary
FROM employees
WHERE role = 'Engineer';
The columns in the SELECT must align in number and type with those in the target list.
Transforming during insert
-- Apply transformations as you insert
INSERT INTO archive_orders (order_id, customer_id, total, archived_at)
SELECT id, customer_id, total, CURRENT_TIMESTAMP
FROM orders
WHERE order_date < CURRENT_DATE - INTERVAL '1 year';
Creating new tables with SELECT ... INTO
Some engines allow you to create a new table directly from a query result. The table is created with column names and types inferred from the query.
-- SQL Server
SELECT customer_id, COUNT(*) AS order_count
INTO customer_summary
FROM orders
GROUP BY customer_id;
-- PostgreSQL and MySQL
CREATE TABLE customer_summary AS
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;
-- SQLite
CREATE TABLE customer_summary AS
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;
SELECT ... INTO creates a physical table only once. If you run it again with the same table name, you get an error unless you drop or truncate first.
Combining inserts with conflict handling
When inserting from a subquery, duplicates can occur. Each engine offers options to handle them gracefully.
-- PostgreSQL: upsert pattern
INSERT INTO users (email, name)
SELECT email, name FROM staging_users
ON CONFLICT (email) DO UPDATE
SET name = EXCLUDED.name;
-- MySQL
INSERT INTO users (email, name)
SELECT email, name FROM staging_users
ON DUPLICATE KEY UPDATE
name = VALUES(name);
-- SQL Server (using MERGE)
MERGE INTO users AS target
USING staging_users AS source
ON target.email = source.email
WHEN MATCHED THEN
UPDATE SET target.name = source.name
WHEN NOT MATCHED THEN
INSERT (email, name) VALUES (source.email, source.name);
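SQLite (3.24+) shares PostgreSQL's ON CONFLICT syntax, which makes the upsert pattern easy to try from Python. A minimal sketch with illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES ('a@example.com', 'Old Name')")

# The second insert collides on email, so the row is updated in place.
conn.execute("""
    INSERT INTO users (email, name) VALUES ('a@example.com', 'New Name')
    ON CONFLICT (email) DO UPDATE SET name = excluded.name
""")

name = conn.execute(
    "SELECT name FROM users WHERE email = 'a@example.com'"
).fetchone()[0]
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count, name)  # 1 New Name — no duplicate row was created
```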
Understanding how to insert efficiently—whether single, multi-row, or from queries—sets the foundation for loading, migrating, and transforming data safely and quickly.
Chapter 10: Updating and Deleting Data
Databases evolve continuously; records change, become obsolete, or must be removed. The UPDATE, DELETE, and TRUNCATE statements perform these operations, and each carries risk if used without precision. This chapter shows best practices, safe filters, and recovery options when changes go wrong.
UPDATE and DELETE best practices
Updates modify data in place; deletes remove it. Both are straightforward, yet dangerous if filters are missing or too broad. Always specify a clear WHERE clause and verify the expected row count.
Basic UPDATE form
UPDATE employees
SET salary = salary * 1.05
WHERE department = 'Engineering';
Multiple columns can be updated in the same statement:
UPDATE products
SET price = price * 0.9,
updated_at = CURRENT_TIMESTAMP
WHERE category = 'Clearance';
Consider audit columns such as updated_by and updated_at. Populate them through triggers or in the same UPDATE statement.
Basic DELETE form
DELETE FROM sessions
WHERE last_active < CURRENT_DATE - INTERVAL '30 days';
Omitting WHERE removes every row from the table, which is almost never intended. Add safety limits in scripts that generate SQL automatically.
DELETE without a filter removes all data but keeps the table definition. Use TRUNCATE only when you are sure you want a complete reset.
Engine-specific batch updates
- PostgreSQL: supports RETURNING to capture changed rows.
- MySQL: supports LIMIT for partial deletes or updates (non-standard).
- SQL Server: supports OUTPUT to capture old and new values.
-- PostgreSQL
UPDATE accounts
SET balance = balance - 50
WHERE id = 123
RETURNING id, balance;
-- SQL Server
UPDATE dbo.accounts
SET balance = balance - 50
OUTPUT deleted.balance AS old_balance, inserted.balance AS new_balance
WHERE id = 123;
-- MySQL
DELETE FROM logs
WHERE created_at < NOW() - INTERVAL 90 DAY
LIMIT 10000; -- avoid long locks
Using WHERE, subqueries, and safeguards
The WHERE clause filters which rows to affect. Combine it with joins or subqueries to match related records precisely.
Targeting related rows
-- Reduce salaries for departments that exceeded budget
UPDATE employees
SET salary = salary * 0.95
WHERE department_id IN (
SELECT id FROM departments WHERE budget_used > budget_limit
);
You can also join directly in engines that support UPDATE ... JOIN or DELETE ... USING syntax.
-- SQL Server
UPDATE e
SET e.salary = e.salary * 0.95
FROM employees e
JOIN departments d ON e.department_id = d.id
WHERE d.budget_used > d.budget_limit;
-- MySQL (JOIN comes before SET)
UPDATE employees e
JOIN departments d ON e.department_id = d.id
SET e.salary = e.salary * 0.95
WHERE d.budget_used > d.budget_limit;
-- PostgreSQL
DELETE FROM orders o
USING customers c
WHERE o.customer_id = c.id
AND c.status = 'INACTIVE';
Verifying before execution
First preview the affected rows with a SELECT that uses the same filter. When certain, replace SELECT with UPDATE or DELETE.
-- Preview what would be deleted
SELECT id, email FROM users
WHERE email LIKE '%@test.invalid';
-- Then safely delete
DELETE FROM users
WHERE email LIKE '%@test.invalid';
Enable client-side safeguards (for example sql_safe_updates=1 in MySQL) to block statements without WHERE or LIMIT.
Transactional safeguards
Use transactions to isolate changes and roll back on error.
BEGIN;
UPDATE inventory
SET quantity = quantity - 1
WHERE product_id = 1001;
DELETE FROM carts
WHERE product_id = 1001
AND user_id = 42;
COMMIT;
Keep transactions short, and always finish with COMMIT or ROLLBACK before stepping away.
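A transaction's safety net is easiest to see when something fails mid-flight. This Python sketch (stdlib sqlite3, illustrative table) simulates an error after an UPDATE and rolls back, leaving the data untouched:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, quantity INTEGER)"
)
conn.execute("INSERT INTO inventory VALUES (1001, 5)")
conn.commit()

try:
    # The UPDATE implicitly opens a transaction in sqlite3's default mode.
    conn.execute(
        "UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 1001"
    )
    raise RuntimeError("simulated failure before COMMIT")
except RuntimeError:
    conn.rollback()  # discard the uncommitted UPDATE

qty = conn.execute(
    "SELECT quantity FROM inventory WHERE product_id = 1001"
).fetchone()[0]
print(qty)  # 5 — the partial change never became visible
```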
Using conditions from other tables
You can base updates or deletes on computed or aggregated values from related tables.
-- Deactivate customers with no orders in the past year
UPDATE customers
SET active = FALSE
WHERE id NOT IN (
SELECT DISTINCT customer_id FROM orders
WHERE order_date > CURRENT_DATE - INTERVAL '1 year'
);
NOT IN can misbehave with nulls. Use NOT EXISTS instead for reliable results.
UPDATE customers c
SET active = FALSE
WHERE NOT EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.id
AND o.order_date > CURRENT_DATE - INTERVAL '1 year'
);
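The NULL pitfall is worth seeing directly: a single NULL in the subquery makes NOT IN return no rows at all, while NOT EXISTS behaves as expected. A sketch with illustrative data via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
# One order has a NULL customer_id — this is the poison value.
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, None)])

not_in = conn.execute("""
    SELECT id FROM customers
    WHERE id NOT IN (SELECT customer_id FROM orders)
""").fetchall()

not_exists = conn.execute("""
    SELECT id FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall()

print(not_in)      # [] — every NOT IN comparison against NULL is unknown
print(not_exists)  # [(2,)] — Grace correctly has no orders
```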
TRUNCATE and restoring from backups
TRUNCATE quickly removes all rows from a table while keeping its structure. It typically resets identity counters and uses less logging than DELETE.
TRUNCATE TABLE logs;
| Engine | Transactional | Resets identity | Notes |
| PostgreSQL | Yes (within transaction) | Yes | Can cascade to related tables with TRUNCATE ... CASCADE. |
| MySQL | No (implicit commit) | Yes | Cannot be rolled back; requires DROP + CREATE privileges. |
| SQL Server | Yes (minimally logged) | Yes | Cannot target tables referenced by foreign keys. |
| SQLite | Emulated via DELETE FROM | Yes | Acts as a fast full delete. |
TRUNCATE is irreversible. It usually skips triggers and does not fire row-level delete events. Always ensure no dependent tables rely on the data being cleared.
Alternatives for safe clearing
-- Use DELETE with WHERE to preserve control
DELETE FROM logs
WHERE created_at < CURRENT_DATE - INTERVAL '90 days';
To reclaim storage space after mass deletions, run engine-specific maintenance such as VACUUM (PostgreSQL, SQLite) or OPTIMIZE TABLE (MySQL).
Restoring deleted or truncated data
Accidental data removal happens. Your ability to recover depends on logging and backup strategies.
- Point-in-time recovery: use write-ahead logs or transaction logs to restore up to the moment before the error.
- Regular full backups: restore the latest backup and replay logs if possible.
- Logical dumps: maintain daily
pg_dump,mysqldump, orBACKUP DATABASEoutputs for selective restore.
-- PostgreSQL (run from the shell)
pg_restore --dbname=production --table=users backup.dump
-- MySQL (run from the shell)
mysql -u root -p production < backup.sql
-- SQL Server
RESTORE DATABASE Production
FROM DISK = 'D:\backups\prod_full.bak'
WITH REPLACE, RECOVERY;
By combining disciplined WHERE filters, short transactions, and reliable backups, you can manage data changes confidently and avoid irreversible mistakes.
Chapter 11: Selecting and Querying Data
The SELECT statement is the heart of SQL. It retrieves and shapes data from one or more tables and can include expressions, filters, sorting, and limits. Every analytic, report, and application query builds on its structure. This chapter covers its syntax, derived columns, and key modifiers for distinctness, limits, and ordering.
SELECT structure and clause order
The typical SELECT query has a predictable layout. Not every clause is required, but the order is fixed by the SQL standard:
SELECT [DISTINCT] column_list
FROM table_name
[JOIN ...]
WHERE condition
GROUP BY grouping_columns
HAVING group_condition
ORDER BY sort_columns
LIMIT number OFFSET start;
Logical evaluation order
Although you write SELECT first, the database engine logically processes the clauses in this sequence:
| Step | Clause | Description |
| 1 | FROM / JOIN | Identify source tables and combine them. |
| 2 | WHERE | Filter rows before grouping. |
| 3 | GROUP BY | Aggregate rows into groups. |
| 4 | HAVING | Filter groups based on aggregate conditions. |
| 5 | SELECT | Produce output columns and expressions. |
| 6 | ORDER BY | Sort the result set. |
| 7 | LIMIT / OFFSET | Return a subset of rows. |
You cannot reference a select-list alias in the WHERE clause because WHERE is evaluated before SELECT. Use the full expression instead, or repeat it in both places if necessary.
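The portable pattern can be checked end to end. This is a small sketch using SQLite through Python's sqlite3 module; the employees table and its data are invented for illustration. It repeats the expression in WHERE (the standard-compliant form) and reuses the alias only in ORDER BY, where every major engine accepts it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ada", 120000), ("Ben", 60000), ("Cy", 90000)])

# Portable form: repeat the expression in WHERE, reuse the alias in ORDER BY.
rows = conn.execute("""
    SELECT name, salary * 2 AS doubled
    FROM employees
    WHERE salary * 2 > 150000      -- expression repeated; alias is not standard here
    ORDER BY doubled DESC          -- alias is fine in ORDER BY
""").fetchall()
print(rows)   # [('Ada', 240000), ('Cy', 180000)]
```

Some engines accept the alias in WHERE as a nonstandard extension, so a query that works locally may fail elsewhere; the repeated-expression form works everywhere.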
Basic example
SELECT first_name, last_name, department, salary
FROM employees
WHERE salary > 80000
ORDER BY department, salary DESC;
Avoid SELECT * in production queries. Listing columns explicitly avoids surprises when new columns are added and reduces I/O.
Column aliases, expressions, and derived columns
Aliases rename columns in the output or simplify references to computed expressions. Use AS for clarity, though it is optional in most engines.
Using aliases
SELECT
first_name AS given,
last_name AS family,
salary * 1.1 AS adjusted_salary
FROM employees;
Aliased columns can make results easier to read and can be reused in the ORDER BY clause.
SELECT
id,
department_id,
salary * 1.1 AS adjusted_salary
FROM employees
ORDER BY adjusted_salary DESC;
Expressions and derived data
SQL expressions allow arithmetic, string concatenation, conditional logic, and type conversion directly in the query.
SELECT
first_name || ' ' || last_name AS full_name,
EXTRACT(YEAR FROM hire_date) AS hire_year,
CASE
WHEN salary >= 100000 THEN 'High'
WHEN salary >= 70000 THEN 'Medium'
ELSE 'Low'
END AS pay_band
FROM employees;
- Use arithmetic for numeric calculations (price * quantity).
- Use functions for transformations (UPPER(name), SUBSTR(email, 1, 10)).
- Use CASE for conditional derivations.
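The expression techniques above can be exercised together. The following sketch runs against SQLite via Python's sqlite3 module; the employees table, names, and salary thresholds are made up for the demo. (SQLite lacks EXTRACT, so the hire-year column is omitted here.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, last_name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("Ada", "Lovelace", 120000), ("Ben", "Hur", 75000), ("Cy", "Young", 50000)])

# String concatenation plus a CASE-derived pay band, all computed in the query.
rows = conn.execute("""
    SELECT first_name || ' ' || last_name AS full_name,
           CASE
               WHEN salary >= 100000 THEN 'High'
               WHEN salary >= 70000  THEN 'Medium'
               ELSE 'Low'
           END AS pay_band
    FROM employees
    ORDER BY salary DESC
""").fetchall()
print(rows)
# [('Ada Lovelace', 'High'), ('Ben Hur', 'Medium'), ('Cy Young', 'Low')]
```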
Table aliases and joins
Short aliases for tables make multi-table queries readable. Use clear, short forms like e for employees.
SELECT e.first_name, e.last_name, d.name AS department
FROM employees e
JOIN departments d ON e.department_id = d.id;
You can even reuse derived data in subqueries or common table expressions (CTEs).
WITH high_earners AS (
SELECT id, first_name, last_name, salary
FROM employees
WHERE salary > 100000
)
SELECT first_name, last_name
FROM high_earners
ORDER BY last_name;
The following modifiers control result uniqueness, size, and ordering. Their use and syntax vary slightly between engines.
DISTINCT
DISTINCT removes duplicate rows from the result set. It applies to the combination of all selected columns.
SELECT DISTINCT department
FROM employees;
SELECT DISTINCT department, role
FROM employees;
DISTINCT can be expensive on large datasets. If possible, filter or aggregate before applying it.
TOP, LIMIT, and OFFSET
These clauses restrict the number of rows returned. Syntax differs among engines.
| Engine | Keyword | Example |
| --- | --- | --- |
| PostgreSQL / MySQL / SQLite | LIMIT / OFFSET | SELECT * FROM employees ORDER BY salary DESC LIMIT 10 OFFSET 20; |
| SQL Server | TOP or OFFSET FETCH | SELECT TOP (10) * FROM employees ORDER BY salary DESC; |
| Oracle | FETCH FIRST | SELECT * FROM employees ORDER BY salary DESC FETCH FIRST 10 ROWS ONLY; |
-- PostgreSQL / MySQL
SELECT first_name, last_name, salary
FROM employees
ORDER BY salary DESC
LIMIT 5;
-- SQL Server
SELECT TOP (5) first_name, last_name, salary
FROM employees
ORDER BY salary DESC;
Use LIMIT with OFFSET for pagination, but prefer keyset pagination (for example, “where id > last_seen_id”) for large tables to avoid scanning overhead.
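The two pagination styles return the same page; the difference is the work the engine does to get there. This sketch uses SQLite via Python's sqlite3 module with an invented 100-row employees table and a page size of 10.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(i, f"emp{i}") for i in range(1, 101)])

PAGE = 10

# OFFSET pagination: simple, but the engine still scans past the skipped rows.
page3_offset = conn.execute(
    "SELECT id FROM employees ORDER BY id LIMIT ? OFFSET ?", (PAGE, 2 * PAGE)
).fetchall()

# Keyset pagination: remember the last id seen and seek directly past it.
last_seen = 20
page3_keyset = conn.execute(
    "SELECT id FROM employees WHERE id > ? ORDER BY id LIMIT ?", (last_seen, PAGE)
).fetchall()

print(page3_offset == page3_keyset)  # True — same page, cheaper plan on big tables
```

Keyset pagination requires a stable, indexed sort key (here the primary key), which is why it scales where deep OFFSETs do not.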
ORDER BY
ORDER BY defines the sort order of the result set. Without it, row order is not guaranteed, even if it appears consistent in testing.
SELECT first_name, last_name, salary
FROM employees
ORDER BY salary DESC, last_name ASC;
- Specify ascending (ASC) or descending (DESC) per column.
- Use column positions (ORDER BY 3 DESC) only for quick tests, not production code.
- Sort by expressions or aliases (ORDER BY adjusted_salary DESC).
Without ORDER BY, SQL does not define any particular row order. Engines may return rows differently between executions.
By mastering SELECT structure, aliases, and output controls such as DISTINCT, LIMIT, and ORDER BY, you can build efficient, readable queries that form the backbone of all SQL usage.
Chapter 12: Filtering Results
Filtering limits query results to only the rows that meet specified conditions. The WHERE clause forms the backbone of filtering, combining comparison operators, ranges, pattern matching, and logical connectors to express precise criteria. This chapter explores essential filters, logical combinations, and how operator precedence affects evaluation.
The engine evaluates WHERE conditions before grouping and sorting, which can save significant processing time.
WHERE, BETWEEN, LIKE, IN, IS NULL
The WHERE clause specifies a condition for each row. Only rows evaluating to true are included in the result set.
Basic comparisons
SELECT *
FROM employees
WHERE department = 'Sales'
AND salary > 60000;
- = equal
- <> or != not equal
- >, <, >=, <= ordering comparisons
Use single quotes for string literals and avoid implicit conversions between text and numbers.
Using BETWEEN for ranges
SELECT id, total
FROM orders
WHERE total BETWEEN 100 AND 500;
BETWEEN includes both boundary values. For exclusive ranges, use explicit comparisons.
WHERE total > 100 AND total < 500;
Using BETWEEN on timestamps can include unintended results if time components differ.
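The timestamp pitfall is easy to reproduce. This sketch uses SQLite via Python's sqlite3 module with invented events and text timestamps: BETWEEN's inclusive upper bound of '2025-03-31' cuts off everything after midnight on March 31, while a half-open range captures the whole month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, ts TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "2025-03-01 00:00:00"),
    (2, "2025-03-31 09:15:00"),
    (3, "2025-03-31 23:59:59"),
    (4, "2025-04-01 00:00:00"),
])

# '2025-03-31 09:15:00' sorts AFTER '2025-03-31', so most of March 31 is dropped.
between = conn.execute(
    "SELECT id FROM events WHERE ts BETWEEN '2025-03-01' AND '2025-03-31' ORDER BY id"
).fetchall()

# Half-open range: >= start of March, < start of April. Catches all of March.
half_open = conn.execute(
    "SELECT id FROM events WHERE ts >= '2025-03-01' AND ts < '2025-04-01' ORDER BY id"
).fetchall()

print(between)    # [(1,)]
print(half_open)  # [(1,), (2,), (3,)]
```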
Pattern matching with LIKE
LIKE uses wildcards to match string patterns:
- % matches any sequence of characters.
- _ matches a single character.
SELECT first_name, last_name
FROM employees
WHERE last_name LIKE 'Sm%'; -- starts with Sm
SELECT *
FROM customers
WHERE email LIKE '%@example.com'; -- ends with @example.com
For case-insensitive matching, use ILIKE (PostgreSQL) or apply LOWER() to both sides. Create functional indexes on LOWER(column) for performance.
Matching sets with IN
SELECT *
FROM products
WHERE category IN ('Electronics', 'Books', 'Home');
IN checks membership in a list of values or subquery results:
SELECT id, name
FROM employees
WHERE department_id IN (
SELECT id FROM departments WHERE region = 'EU'
);
NOT IN behaves differently when subqueries return NULL values. Prefer NOT EXISTS to avoid unexpected filtering.
Checking for NULL values
NULL represents missing or unknown data. You must use IS NULL or IS NOT NULL to test it.
SELECT id, email
FROM customers
WHERE email IS NULL;
SELECT id, email
FROM customers
WHERE email IS NOT NULL;
Comparisons to NULL using = or <> evaluate to unknown, so such rows are never returned. Handle missing values explicitly with IS NULL or COALESCE().
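The three-valued logic around NULL is worth seeing in action. This sketch runs against SQLite via Python's sqlite3 module; the customers table and emails are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "a@example.com"), (2, None), (3, None)])

# '= NULL' evaluates to unknown for every row: nothing qualifies.
eq_null = conn.execute("SELECT id FROM customers WHERE email = NULL").fetchall()

# IS NULL is the correct test for missing values.
is_null = conn.execute(
    "SELECT id FROM customers WHERE email IS NULL ORDER BY id"
).fetchall()

# COALESCE substitutes a fallback value for NULL.
coalesced = conn.execute(
    "SELECT id, COALESCE(email, 'missing') FROM customers ORDER BY id"
).fetchall()

print(eq_null)    # []
print(is_null)    # [(2,), (3,)]
print(coalesced)  # [(1, 'a@example.com'), (2, 'missing'), (3, 'missing')]
```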
Logical operators
Logical operators combine multiple conditions in a WHERE clause. They control how filters interact to include or exclude rows.
AND and OR
SELECT *
FROM employees
WHERE department = 'Engineering'
AND salary > 90000;
All conditions joined by AND must be true for a row to qualify. OR allows any one of its conditions to be true.
SELECT *
FROM employees
WHERE department = 'Sales'
OR department = 'Marketing';
You can mix them, but operator precedence determines evaluation order (see below). Parentheses clarify intent and prevent logical errors.
NOT operator
NOT negates a condition. It can be used with any expression that evaluates to true or false.
SELECT *
FROM employees
WHERE NOT (department = 'HR' OR salary < 50000);
NOT IN and NOT EXISTS differ in how they treat nulls. If the subquery result can include NULL, NOT IN may return no rows at all.
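This NULL trap deserves a concrete demonstration. The sketch below uses SQLite via Python's sqlite3 module with an invented customers/orders pair in which one order has a NULL customer_id: NOT IN returns nothing, while NOT EXISTS returns the expected customers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders VALUES (10, 1), (11, NULL);  -- one order has a NULL customer
""")

# NOT IN against a list containing NULL returns no rows at all:
# "id <> NULL" is unknown, so no row can prove it is absent from the list.
not_in = conn.execute(
    "SELECT id FROM customers WHERE id NOT IN (SELECT customer_id FROM orders)"
).fetchall()

# NOT EXISTS handles the NULL correctly.
not_exists = conn.execute("""
    SELECT c.id FROM customers AS c
    WHERE NOT EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id)
    ORDER BY c.id
""").fetchall()

print(not_in)      # []
print(not_exists)  # [(2,), (3,)]
```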
Combining filters logically
SELECT *
FROM orders
WHERE (status = 'SHIPPED' OR status = 'DELIVERED')
AND total > 100;
Always test combinations carefully. A missing parenthesis or inverted logic can drastically change results.
Operator precedence and parentheses
When combining multiple conditions, SQL follows a fixed order of evaluation:
| Order | Operator | Description |
| --- | --- | --- |
| 1 | NOT | Negation |
| 2 | AND | Conjunction (both must be true) |
| 3 | OR | Disjunction (either condition true) |
Without parentheses, AND conditions are evaluated before OR. This means:
-- This query returns rows where
-- (department = 'Sales' AND salary > 50000) OR department = 'HR'
SELECT *
FROM employees
WHERE department = 'Sales' AND salary > 50000
OR department = 'HR';
To make intent explicit, group conditions with parentheses:
SELECT *
FROM employees
WHERE department = 'HR'
OR (department = 'Sales' AND salary > 50000);
Precedence mistakes are a common source of errors in complex WHERE clauses. Always use parentheses for clarity instead of relying on default order.
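The precedence difference produces visibly different row sets. This sketch uses SQLite via Python's sqlite3 module; the three-row employees table is invented so that the two readings of the same predicates disagree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, department TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (1, "Sales", 40000),   # Sales, below threshold
    (2, "Sales", 60000),   # Sales, above threshold
    (3, "HR",    45000),   # HR, below threshold
])

# AND binds tighter than OR: this means (Sales AND > 50000) OR HR.
no_parens = conn.execute("""
    SELECT id FROM employees
    WHERE department = 'Sales' AND salary > 50000 OR department = 'HR'
    ORDER BY id
""").fetchall()

# Parentheses change the meaning: (Sales OR HR) AND > 50000.
with_parens = conn.execute("""
    SELECT id FROM employees
    WHERE (department = 'Sales' OR department = 'HR') AND salary > 50000
    ORDER BY id
""").fetchall()

print(no_parens)    # [(2,), (3,)]
print(with_parens)  # [(2,)]
```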
Mastering filters and logical combinations makes queries accurate, readable, and efficient. By understanding how SQL evaluates conditions, you can write powerful searches that return exactly the intended data.
Chapter 13: Joining Tables
Relational data lives across multiple tables; joins let you combine rows that share related keys. This chapter shows the core join types, how to read and write them, and how to avoid common mistakes when matching data across tables.
INNER, LEFT, RIGHT, FULL, CROSS joins
A join combines rows from two tables based on a match condition (except for a true Cartesian product). The typical pattern is:
SELECT {column_list …}
FROM table1
[INNER] JOIN table2
ON table1.key = table2.key;
When joining on keys (for example, orders.customer_id to customers.customer_id), prefer a simple equality predicate and select only the columns you need.
INNER JOIN: matches in both tables
Returns only rows where the join condition is true in both tables.
SELECT o.order_id, c.customer_name
FROM orders AS o
INNER JOIN customers AS c
ON o.customer_id = c.customer_id;
If a customer has no orders, or an order references a missing customer, that pair does not appear in the result.
LEFT JOIN and RIGHT JOIN: preserving one side
A LEFT JOIN returns all rows from the left table and fills unmatched columns from the right with NULL. A RIGHT JOIN is symmetric for the right table.
-- All customers; show matching orders if any
SELECT c.customer_id, c.customer_name, o.order_id
FROM customers AS c
LEFT JOIN orders AS o
ON o.customer_id = c.customer_id;
-- All products; show suppliers if any
SELECT p.product_id, p.product_name, s.supplier_name
FROM products AS p
RIGHT JOIN suppliers AS s
ON p.supplier_id = s.supplier_id; -- Example; prefer LEFT with swapped sides
Prefer LEFT JOIN for readability; most teams avoid RIGHT JOIN and instead swap tables so the preserved table is on the left.
FULL OUTER JOIN: union of LEFT and RIGHT
Returns every row from both tables; rows that do not match on either side appear with NULLs for the other side.
SELECT a.id, a.name, b.name AS b_name
FROM table_a AS a
FULL OUTER JOIN table_b AS b
ON a.id = b.id;
Some engines do not support FULL OUTER JOIN. Use LEFT JOIN plus RIGHT JOIN with UNION (or UNION ALL with filters) to emulate it:
SELECT a.id, a.name, b.name AS b_name
FROM table_a AS a
LEFT JOIN table_b AS b ON a.id = b.id
UNION
SELECT a.id, a.name, b.name AS b_name
FROM table_a AS a
RIGHT JOIN table_b AS b ON a.id = b.id;
CROSS JOIN (Cartesian product)
Produces every combination of rows from both tables. Use carefully; it scales as rows(table1) × rows(table2).
-- Equivalent forms in many engines
SELECT c.color, s.size
FROM colors AS c
CROSS JOIN sizes AS s;
SELECT c.color, s.size
FROM colors AS c, sizes AS s; -- Avoid this legacy comma form in new code
Omitting or mistyping the ON clause on a regular JOIN can accidentally produce a large Cartesian product. Always verify your ON predicate.
ON vs WHERE with outer joins
With outer joins, filters in WHERE can turn an outer join into an inner join by removing the NULL-extended rows. Put row-matching logic in ON, and post-join filtering in WHERE only when you intend to filter both matched and unmatched rows.
-- Keep customers even when they have no orders
SELECT c.customer_id, o.order_id
FROM customers AS c
LEFT JOIN orders AS o
ON o.customer_id = c.customer_id
AND o.status = 'OPEN' -- filter applies to right side only
WHERE c.country = 'UK'; -- filter applies to preserved left rows as well
Side-specific match conditions go in ON; global constraints go in WHERE. This keeps outer join semantics correct.
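The ON-versus-WHERE distinction is the single most common outer-join bug, and it is easy to reproduce. This sketch uses SQLite via Python's sqlite3 module with an invented two-customer schema where one customer has only a CLOSED order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo');
    INSERT INTO orders VALUES (10, 1, 'CLOSED');   -- Ana has only a CLOSED order
""")

# Filter in ON: both customers survive; Ana's CLOSED order simply doesn't match.
in_on = conn.execute("""
    SELECT c.name, o.id FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id AND o.status = 'OPEN'
    ORDER BY c.id
""").fetchall()

# Filter in WHERE: NULL-extended rows fail o.status = 'OPEN', so the
# outer join silently becomes an inner join and both customers vanish.
in_where = conn.execute("""
    SELECT c.name, o.id FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    WHERE o.status = 'OPEN'
""").fetchall()

print(in_on)     # [('Ana', None), ('Bo', None)]
print(in_where)  # []
```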
Self-joins and aliasing
A self-join relates a table to itself (for example hierarchical data, predecessor links, or de-duplicating near matches). Aliases keep the references readable.
Aliasing with AS (or implicit aliases)
SELECT o.order_id, c.customer_name
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id;
-- Most engines allow omitting AS (use sparingly for clarity):
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;
Choose mnemonic aliases (c for customers, o for orders). Avoid single-letter aliases that collide (t for many tables).
Classic self-join: employees and managers
-- Each employee row has manager_id pointing to another employee's employee_id
SELECT e.employee_id,
e.employee_name,
m.employee_name AS manager_name
FROM employees AS e
LEFT JOIN employees AS m
ON m.employee_id = e.manager_id;
The alias e stands for the employee row; m stands for the manager row in the same table.
Self-join for de-duplication and matching
-- Find pairs of customers with the same email (case-insensitive), different ids
SELECT c1.customer_id AS id1,
c2.customer_id AS id2,
LOWER(c1.email) AS email_normalized
FROM customers AS c1
JOIN customers AS c2
ON LOWER(c1.email) = LOWER(c2.email)
AND c1.customer_id < c2.customer_id; -- prevent symmetric duplicates
Index the columns used in self-join predicates (employees.manager_id, customers.email) to keep scans efficient.
Hierarchy traversal (adjacency list)
For a simple parent → child relationship stored as (id, parent_id), a self-join returns parent attributes alongside children.
SELECT child.category_id,
child.category_name,
parent.category_name AS parent_name
FROM categories AS child
LEFT JOIN categories AS parent
ON parent.category_id = child.parent_id;
Practical join examples
This section uses a small sales schema:
| Table | Key columns | Notes |
| --- | --- | --- |
| customers | customer_id | One row per customer |
| orders | order_id, customer_id | Order header; one row per order |
| order_items | order_id, product_id | Line items per order |
| products | product_id | Catalog of products |
Joining three or more tables
SELECT o.order_id,
c.customer_name,
p.product_name,
oi.quantity,
oi.unit_price
FROM orders AS o
JOIN customers AS c
ON c.customer_id = o.customer_id
JOIN order_items AS oi
ON oi.order_id = o.order_id
JOIN products AS p
ON p.product_id = oi.product_id;
Add joins one step at a time; verify each join multiplies rows as expected.
Aggregations across joins
-- Revenue per customer
SELECT c.customer_id,
c.customer_name,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM customers AS c
JOIN orders AS o
ON o.customer_id = c.customer_id
JOIN order_items AS oi
ON oi.order_id = o.order_id
GROUP BY c.customer_id, c.customer_name
ORDER BY revenue DESC;
Finding unmatched rows (LEFT JOIN with NULL check)
-- Customers with no orders
SELECT c.customer_id, c.customer_name
FROM customers AS c
LEFT JOIN orders AS o
ON o.customer_id = c.customer_id
WHERE o.order_id IS NULL;
This pattern (outer join plus IS NULL on the non-preserved side) is the standard anti-join approach.
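The anti-join pattern can be verified on a toy dataset. This sketch uses SQLite via Python's sqlite3 module; the three customers and two orders are invented so that exactly one customer has no match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1), (11, 3);
""")

# Anti-join: preserve all customers, then keep only the NULL-extended rows.
no_orders = conn.execute("""
    SELECT c.customer_id, c.customer_name
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE o.order_id IS NULL
""").fetchall()

print(no_orders)  # [(2, 'Bo')]
```

The IS NULL test must target a column that cannot legitimately be NULL in matched rows (a primary key is ideal), otherwise real matches could be mistaken for missing ones.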
Many-to-many bridges
Use a bridge (junction) table to connect two entities. Join through the bridge to reach the far table.
-- Authors <- book_authors -> Books
SELECT a.author_name, b.title
FROM authors AS a
JOIN book_authors AS ba
ON ba.author_id = a.author_id
JOIN books AS b
ON b.book_id = ba.book_id;
Filtering on the joined table safely
-- Customers with at least one OPEN order; preserve the "at least one" intent
SELECT DISTINCT c.customer_id, c.customer_name
FROM customers AS c
JOIN orders AS o
ON o.customer_id = c.customer_id
WHERE o.status = 'OPEN';
If you need to keep customers with zero open orders as well, move the status predicate into ON and use LEFT JOIN.
USING and NATURAL JOIN
-- USING collapses duplicate column names in the output
SELECT o.order_id, customer_id
FROM orders AS o
JOIN customers AS c
USING (customer_id);
-- NATURAL JOIN matches columns with the same name automatically (avoid in production)
SELECT *
FROM orders NATURAL JOIN customers;
NATURAL JOIN depends on column names that might change. Prefer explicit ON or USING so behavior is clear.
Join performance checklist
- Index foreign keys (orders.customer_id, order_items.order_id, order_items.product_id).
- Project only needed columns; wide selects slow I/O.
- Filter early where possible (push selective predicates into ON for outer joins or WHERE for inner joins).
- Watch cardinality multiplication; understand one-to-many and many-to-many expansion.
While building a query, run SELECT COUNT(*) at each step to see how join choices affect row counts, then switch to detailed columns when satisfied.
Dialect notes
| Topic | MySQL | PostgreSQL | SQL Server | SQLite | Oracle |
| --- | --- | --- | --- | --- | --- |
| FULL OUTER JOIN | Not supported; emulate with UNION | Supported | Supported | Not supported; emulate with UNION | Supported |
| USING clause | Supported | Supported | Not supported; use ON | Supported | Supported |
| Legacy outer join syntax | N/A | N/A | *=, =* (deprecated) | N/A | (+) (deprecated; use ANSI joins) |
With these patterns, you can confidently combine tables for analysis and application queries.
Chapter 14: Grouping and Aggregating
Aggregations turn many rows into summarized results; grouping decides which rows belong together. This chapter explains GROUP BY and HAVING, introduces core aggregate functions, and shows advanced multi-level summaries with ROLLUP, CUBE, and GROUPING SETS.
GROUP BY, HAVING
A basic grouped query follows this pattern:
SELECT {group_columns …}, {aggregates …}
FROM {table}
[WHERE {row filters …}]
GROUP BY {group_columns …}
[HAVING {aggregate filters …}]
[ORDER BY {columns …}];
Every column in the SELECT list must be either grouped (appear in GROUP BY) or aggregated. Some engines allow functional dependency shortcuts; write the explicit grouping for clarity and portability.
Single and multi-column GROUP BY
-- Orders per customer
SELECT o.customer_id, COUNT(*) AS order_count
FROM orders AS o
GROUP BY o.customer_id;
-- Revenue per customer and year
SELECT o.customer_id,
EXTRACT(YEAR FROM o.order_date) AS yr,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY o.customer_id, EXTRACT(YEAR FROM o.order_date);
Grouping by expressions is allowed. Not every engine lets you group by a select-list alias, so use the same expression in GROUP BY that you select.
WHERE vs HAVING
Use WHERE to filter input rows before grouping. Use HAVING to filter groups after aggregation.
-- Count OPEN orders per customer; keep customers with at least two
SELECT o.customer_id, COUNT(*) AS open_orders
FROM orders AS o
WHERE o.status = 'OPEN'
GROUP BY o.customer_id
HAVING COUNT(*) >= 2;
HAVING can be slower because it waits until after grouping. Prefer WHERE when possible.
Handling NULL in groups
Rows where the grouping expression evaluates to NULL are grouped together in a single NULL group. Aggregate functions ignore NULL values, except COUNT(*), which counts rows regardless of NULLs.
-- Count orders per coupon; NULL coupon rows form one group
SELECT o.coupon_code, COUNT(*) AS n
FROM orders AS o
GROUP BY o.coupon_code;
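Both behaviors can be confirmed in one run. This sketch uses SQLite via Python's sqlite3 module; the four-order coupon table is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, coupon_code TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "SAVE10"), (2, None), (3, None), (4, "SAVE10")])

# NULL coupons collapse into one group; COUNT(*) counts every row in it.
groups = conn.execute("""
    SELECT coupon_code, COUNT(*) AS n
    FROM orders
    GROUP BY coupon_code
    ORDER BY coupon_code IS NULL, coupon_code
""").fetchall()
print(groups)  # [('SAVE10', 2), (None, 2)]

# COUNT(column) skips NULLs; COUNT(*) does not.
counts = conn.execute(
    "SELECT COUNT(*), COUNT(coupon_code) FROM orders"
).fetchone()
print(counts)  # (4, 2)
```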
Aliasing in GROUP BY and ORDER BY
-- Most engines allow using select-list aliases in ORDER BY
SELECT c.customer_id AS id, COUNT(*) AS orders
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
GROUP BY c.customer_id
ORDER BY orders DESC;
-- Some engines allow ordinal positions in ORDER BY; avoid for readability
-- ORDER BY 2 DESC;
COUNT, SUM, AVG, MIN, MAX, DISTINCT aggregates
These standard aggregates work across engines.
COUNT variants
-- Count rows regardless of NULLs
SELECT COUNT(*) FROM orders;
-- Count non-NULL values in a column
SELECT COUNT(ship_date) FROM orders;
-- Count distinct customers who ordered this year
SELECT COUNT(DISTINCT customer_id)
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2025;
COUNT(DISTINCT a, b) (multiple columns) is not supported in all engines. Use a concatenation or a derived key where needed, or group first then count.
SUM and AVG
-- Revenue per product
SELECT oi.product_id,
SUM(oi.quantity * oi.unit_price) AS revenue,
AVG(oi.unit_price) AS avg_price
FROM order_items AS oi
GROUP BY oi.product_id;
AVG ignores NULL values. Be mindful of integer division in some engines; cast to a decimal type when precision matters.
MIN and MAX
-- First and latest order per customer
SELECT o.customer_id,
MIN(o.order_date) AS first_order,
MAX(o.order_date) AS latest_order
FROM orders AS o
GROUP BY o.customer_id;
DISTINCT inside aggregates
-- Number of distinct products per order
SELECT oi.order_id,
COUNT(DISTINCT oi.product_id) AS unique_products
FROM order_items AS oi
GROUP BY oi.order_id;
DISTINCT applies to the aggregate's argument only. It does not affect other selected columns.
Conditional aggregates
Use CASE expressions for portable conditional sums and counts.
-- Order status breakdown per customer
SELECT o.customer_id,
SUM(CASE WHEN o.status = 'OPEN' THEN 1 ELSE 0 END) AS open_orders,
SUM(CASE WHEN o.status = 'CLOSED' THEN 1 ELSE 0 END) AS closed_orders
FROM orders AS o
GROUP BY o.customer_id;
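The SUM-over-CASE idiom is fully portable and easy to check. This sketch uses SQLite via Python's sqlite3 module; the orders and statuses are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 1, "OPEN"), (2, 1, "CLOSED"), (3, 1, "OPEN"),
    (4, 2, "CLOSED"),
])

# Each CASE contributes 1 for matching rows and 0 otherwise, so SUM counts them.
rows = conn.execute("""
    SELECT customer_id,
           SUM(CASE WHEN status = 'OPEN'   THEN 1 ELSE 0 END) AS open_orders,
           SUM(CASE WHEN status = 'CLOSED' THEN 1 ELSE 0 END) AS closed_orders
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
print(rows)  # [(1, 2, 1), (2, 0, 1)]
```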
PostgreSQL and a few engines support FILTER for concise conditional aggregates:
-- PostgreSQL example
SELECT customer_id,
COUNT(*) FILTER (WHERE status = 'OPEN') AS open_orders,
COUNT(*) FILTER (WHERE status = 'CLOSED') AS closed_orders
FROM orders
GROUP BY customer_id;
Window functions (aggregates with an OVER clause) compute per-row analytics and are different from grouped aggregates. This book covers windows later…
ROLLUP, CUBE, and GROUPING SETS
Advanced grouping operators produce multi-level totals in one query. They extend GROUP BY and return subtotal rows alongside detail rows.
ROLLUP: hierarchical subtotals
ROLLUP creates subtotals moving from right to left through the grouping list and a grand total at the end.
-- Revenue by year, then month, plus yearly totals and a grand total
SELECT EXTRACT(YEAR FROM o.order_date) AS yr,
EXTRACT(MONTH FROM o.order_date) AS mon,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY ROLLUP (EXTRACT(YEAR FROM o.order_date),
EXTRACT(MONTH FROM o.order_date))
ORDER BY yr, mon;
Subtotal rows show NULL in the rolled-up columns. Use GROUPING or GROUPING_ID (where available) to detect them cleanly.
-- Label subtotal rows (syntax varies by engine)
SELECT
EXTRACT(YEAR FROM o.order_date) AS yr,
EXTRACT(MONTH FROM o.order_date) AS mon,
SUM(oi.quantity * oi.unit_price) AS revenue,
GROUPING(EXTRACT(YEAR FROM o.order_date)) AS g_yr,
GROUPING(EXTRACT(MONTH FROM o.order_date)) AS g_mon
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY ROLLUP (EXTRACT(YEAR FROM o.order_date),
EXTRACT(MONTH FROM o.order_date));
Use CASE to replace NULL in subtotal rows (for example CASE WHEN GROUPING(mon)=1 THEN 'All months' ELSE TO_CHAR(mon,'FM00') END).
CUBE: all combinations of groups
CUBE(a, b, c) returns subtotals for every combination of the listed columns plus the grand total. This grows quickly with each additional column.
-- Revenue by region, channel, and product category plus all subtotals
SELECT r.region, o.channel, p.category,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
JOIN products AS p ON p.product_id = oi.product_id
JOIN regions AS r ON r.region_id = o.region_id
GROUP BY CUBE (r.region, o.channel, p.category);
CUBE emits many rows. Limit dimensions or apply HAVING to keep result sizes manageable.
GROUPING SETS: explicit subtotal patterns
GROUPING SETS let you specify exactly which groupings you want.
-- Get revenue by region, by channel, and the grand total; avoid unwanted combos
SELECT r.region, o.channel,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
JOIN regions AS r ON r.region_id = o.region_id
GROUP BY GROUPING SETS (
(r.region),
(o.channel),
() -- grand total
)
ORDER BY r.region NULLS LAST, o.channel NULLS LAST;
Emulating advanced groupings
If your engine lacks these operators, use UNION ALL across several grouped queries.
-- Emulate ROLLUP(year, month)
SELECT yr, mon, SUM(revenue) AS revenue
FROM (
SELECT EXTRACT(YEAR FROM o.order_date) AS yr,
EXTRACT(MONTH FROM o.order_date) AS mon,
oi.quantity * oi.unit_price AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
) AS t
GROUP BY yr, mon
UNION ALL
SELECT yr, NULL AS mon, SUM(revenue)
FROM (
SELECT EXTRACT(YEAR FROM o.order_date) AS yr,
oi.quantity * oi.unit_price AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
) AS y
GROUP BY yr
UNION ALL
SELECT NULL AS yr, NULL AS mon, SUM(oi.quantity * oi.unit_price)
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id;
When emulating with UNION ALL, add a column that tags the row level (for example 'detail', 'by_year', 'total') so you can sort and format consistently.
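The tag-column technique runs as-is on SQLite, which lacks ROLLUP. This sketch uses Python's sqlite3 module with an invented pre-aggregated sales table (year, month, revenue) to keep the emulation compact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (yr INTEGER, mon INTEGER, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (2024, 12, 100.0), (2025, 1, 200.0), (2025, 1, 50.0), (2025, 2, 75.0),
])

# SQLite has no ROLLUP: UNION ALL three grouped queries, tagged by level.
rows = conn.execute("""
    SELECT 'detail' AS level, yr, mon, SUM(revenue) FROM sales GROUP BY yr, mon
    UNION ALL
    SELECT 'by_year', yr, NULL, SUM(revenue) FROM sales GROUP BY yr
    UNION ALL
    SELECT 'total', NULL, NULL, SUM(revenue) FROM sales
    ORDER BY yr, mon
""").fetchall()

for r in rows:          # 6 rows: 3 detail, 2 by_year, 1 total
    print(r)
```

The tag column makes subtotal rows unambiguous even when a genuine NULL could appear in a grouping column.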
Dialect notes
| Feature | MySQL | PostgreSQL | SQL Server | SQLite | Oracle |
| --- | --- | --- | --- | --- | --- |
| ROLLUP / WITH ROLLUP | Supported (WITH ROLLUP and ROLLUP() in recent versions) | Supported | Supported | Not supported; emulate with UNION | Supported |
| CUBE | Not supported; emulate | Supported | Supported | Not supported; emulate | Supported |
| GROUPING SETS | Supported (recent); check version | Supported | Supported | Not supported; emulate | Supported |
| GROUPING / GROUPING_ID | Partial support; no GROUPING_ID in older releases | Supported (GROUPING()) | Supported (both) | N/A | Supported (both) |
Master the basics of grouping, then add subtotals with ROLLUP, full combinatorial summaries with CUBE, or targeted totals with GROUPING SETS as your analysis requires.
Chapter 15: Subqueries and Derived Tables
Subqueries allow you to nest one query inside another. They return values or tables that the outer query can use for filtering, comparison, or joining. This chapter explores scalar, row, and table subqueries, explains EXISTS, ANY, and ALL, and concludes with Common Table Expressions (CTEs) for clarity and reuse.
Scalar, row, and table subqueries
A subquery is enclosed in parentheses and can appear almost anywhere an expression or table name could. The outer query treats the subquery’s result as a single value, a row, or a full virtual table.
Scalar subqueries (single value)
A scalar subquery returns exactly one value (one row and one column). Use it wherever a literal could appear.
-- Find customers whose total spend exceeds the average order total
SELECT c.customer_id, c.customer_name
FROM customers AS c
WHERE (
SELECT SUM(oi.quantity * oi.unit_price)
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
WHERE o.customer_id = c.customer_id
) > (
SELECT AVG(total)
FROM (
SELECT SUM(oi.quantity * oi.unit_price) AS total
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY o.order_id
) AS sub
);
Row subqueries (multiple columns, one row)
A row subquery returns one row with several columns. It can be compared to another row using tuple syntax.
-- Compare composite (city, state) values
SELECT c1.customer_id, c1.customer_name
FROM customers AS c1
WHERE (c1.city, c1.state) = (
SELECT c2.city, c2.state
FROM customers AS c2
WHERE c2.customer_id = 42
);
(col1, col2) = (val1, val2) is supported in PostgreSQL, Oracle, and some others; MySQL supports it partially; SQL Server does not.
Table subqueries (derived tables)
When a subquery returns a full result set, it can be treated as a table source. These are often called derived tables or inline views.
-- Derived table to compute revenue per customer
SELECT t.customer_id, t.revenue
FROM (
SELECT o.customer_id,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY o.customer_id
) AS t
WHERE t.revenue > 1000;
Always give derived tables an alias. Without one, most databases reject the query.
Correlated subqueries
A correlated subquery references columns from its outer query. It executes once per outer row, not just once overall.
-- Customers whose latest order is within 30 days
SELECT c.customer_id, c.customer_name
FROM customers AS c
WHERE EXISTS (
SELECT 1
FROM orders AS o
WHERE o.customer_id = c.customer_id
AND o.order_date > CURRENT_DATE - INTERVAL '30' DAY
);
EXISTS, ANY, ALL
These operators compare outer query rows to subquery results. They provide powerful logic for inclusion and comparison tests.
EXISTS
EXISTS tests whether a subquery returns at least one row. It stops scanning as soon as one match is found.
-- Customers who have ever placed an order
SELECT c.customer_id, c.customer_name
FROM customers AS c
WHERE EXISTS (
SELECT 1
FROM orders AS o
WHERE o.customer_id = c.customer_id
);
EXISTS ignores the subquery’s actual values; only presence matters.
Prefer EXISTS instead of IN when the subquery may produce NULLs or large results. EXISTS often optimizes better.
NOT EXISTS
To find outer rows with no matches, invert the condition.
-- Customers with no orders
SELECT c.customer_id, c.customer_name
FROM customers AS c
WHERE NOT EXISTS (
SELECT 1
FROM orders AS o
WHERE o.customer_id = c.customer_id
);
Avoid NOT IN when subquery columns may contain NULL; it can return no rows unexpectedly. NOT EXISTS is safer.
ANY and ALL
ANY (or SOME) compares a value to each subquery result and succeeds if any comparison is true. ALL requires all comparisons to be true.
-- Orders larger than the average for any customer in Germany
SELECT o.order_id, o.customer_id, o.total_amount
FROM orders AS o
WHERE o.total_amount > ANY (
SELECT AVG(oi.quantity * oi.unit_price)
FROM orders AS g
JOIN order_items AS oi ON oi.order_id = g.order_id
JOIN customers AS c ON c.customer_id = g.customer_id
WHERE c.country = 'Germany'
GROUP BY g.customer_id
);
-- Products cheaper than all others in their category
SELECT p.product_id, p.product_name
FROM products AS p
WHERE p.unit_price < ALL (
SELECT p2.unit_price
FROM products AS p2
WHERE p2.category_id = p.category_id
AND p2.product_id <> p.product_id
);
Read ANY as “at least one” and ALL as “every.” For clarity, many developers replace them with MIN or MAX aggregates where possible.
Common Table Expressions
Common Table Expressions (CTEs) provide named, reusable subqueries that improve readability. They are introduced with WITH and can be referenced by later queries in the same statement.
Basic CTE syntax
WITH totals AS (
SELECT o.customer_id, SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY o.customer_id
)
SELECT c.customer_name, t.revenue
FROM customers AS c
JOIN totals AS t ON t.customer_id = c.customer_id
WHERE t.revenue > 1000;
A CTE acts like a temporary named table visible only to the main query (and to other CTEs declared after it).
You can define several CTEs, separated by commas, under a single WITH clause.
Recursive CTEs
Recursive CTEs reference themselves to walk hierarchies or compute sequences. They use UNION ALL between an anchor member and a recursive member.
-- Organization hierarchy
WITH RECURSIVE org_chart AS (
SELECT employee_id, manager_id, employee_name, 1 AS level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, e.employee_name, oc.level + 1
FROM employees AS e
JOIN org_chart AS oc ON e.manager_id = oc.employee_id
)
SELECT *
FROM org_chart
ORDER BY level, employee_name;
Use UNION ALL instead of UNION in recursive CTEs to avoid redundant sorting and distinct checks.
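Recursive CTEs work unchanged in SQLite, so the hierarchy walk can be executed directly. This sketch uses Python's sqlite3 module with an invented three-level org chart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY,
                            manager_id INTEGER, employee_name TEXT);
    INSERT INTO employees VALUES
        (1, NULL, 'Root'),
        (2, 1, 'Lead'),
        (3, 2, 'Dev');
""")

# Anchor: the root (no manager). Recursive member: direct reports, one level deeper.
rows = conn.execute("""
    WITH RECURSIVE org_chart AS (
        SELECT employee_id, manager_id, employee_name, 1 AS level
        FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.employee_id, e.manager_id, e.employee_name, oc.level + 1
        FROM employees AS e
        JOIN org_chart AS oc ON e.manager_id = oc.employee_id
    )
    SELECT employee_name, level FROM org_chart ORDER BY level, employee_name
""").fetchall()
print(rows)  # [('Root', 1), ('Lead', 2), ('Dev', 3)]
```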
CTEs vs derived tables
CTEs are easier to read and maintain, but engines may not always materialize them as temporary results. Derived tables can be faster in some cases. Most optimizers treat them equivalently unless hints or recursion are involved.
-- Derived table version of the same revenue query
SELECT c.customer_name, t.revenue
FROM customers AS c
JOIN (
SELECT o.customer_id, SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
GROUP BY o.customer_id
) AS t ON t.customer_id = c.customer_id;
Dialect notes
| Feature | MySQL | PostgreSQL | SQL Server | SQLite | Oracle |
| --- | --- | --- | --- | --- | --- |
| CTE support | Supported (8.0+) | Supported | Supported | Supported | Supported |
| Recursive CTEs | Supported (8.0+) | Supported | Supported | Supported | Supported |
| Materialization hints | Not available | MATERIALIZED / NOT MATERIALIZED supported | Optimizer controlled | Optimizer controlled | Optimizer controlled |
Subqueries, derived tables, and CTEs all provide ways to compose complex logic. Choose the simplest structure that expresses your intent clearly while remaining efficient for the database engine.
Chapter 16: Transactions and ACID
Transactions ensure that groups of SQL statements are executed safely and consistently. They bundle multiple operations into a single logical unit of work that either fully succeeds or has no effect at all. This chapter introduces transaction control commands, the ACID properties that define reliability, and how databases handle isolation, concurrency, and locking.
COMMIT, ROLLBACK, SAVEPOINT
A transaction starts automatically when you execute the first statement after a previous one has committed (or when you issue BEGIN or START TRANSACTION). You can then decide whether to make changes permanent or discard them.
-- Basic transaction pattern
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT; -- make changes permanent
-- or
ROLLBACK; -- undo all changes
COMMIT permanently applies all changes made since the transaction began. ROLLBACK cancels them and restores the database to its prior state.
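A minimal runnable sketch of this pattern, using Python's sqlite3 driver with manual transaction control; the accounts table and starting balances are invented for illustration:

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode, so BEGIN/COMMIT/ROLLBACK
# are issued explicitly, mirroring the SQL above.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 500)])

conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance - 100 WHERE account_id = 1")
conn.execute("UPDATE accounts SET balance = balance + 100 WHERE account_id = 2")
conn.execute("ROLLBACK")  # discard: both balances revert to 500

conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance - 100 WHERE account_id = 1")
conn.execute("UPDATE accounts SET balance = balance + 100 WHERE account_id = 2")
conn.execute("COMMIT")    # persist: balances become 400 and 600

balances = [r[0] for r in conn.execute(
    "SELECT balance FROM accounts ORDER BY account_id")]
```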
End every transaction with an explicit COMMIT or ROLLBACK to avoid leaving open transactions that hold locks.
SAVEPOINT and partial rollbacks
SAVEPOINT creates a named checkpoint inside a transaction. You can roll back to it without discarding the whole transaction.
START TRANSACTION;
UPDATE products SET stock = stock - 1 WHERE product_id = 101;
SAVEPOINT before_discount;
UPDATE products SET price = price * 0.8 WHERE category = 'Sale';
-- Change of mind: cancel the discount only
ROLLBACK TO SAVEPOINT before_discount;
COMMIT;
After a ROLLBACK TO SAVEPOINT, the savepoint remains defined in most engines, allowing further rollbacks to earlier points if needed.
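The same savepoint flow can be run in SQLite through Python (the product row is invented): only the discount is undone, while the stock decrement survives the commit.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transaction control
conn.execute("""
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        price REAL, stock INTEGER, category TEXT
    )
""")
conn.execute("INSERT INTO products VALUES (101, 100.0, 5, 'Sale')")

conn.execute("BEGIN")
conn.execute("UPDATE products SET stock = stock - 1 WHERE product_id = 101")
conn.execute("SAVEPOINT before_discount")
conn.execute("UPDATE products SET price = price * 0.8 WHERE category = 'Sale'")
conn.execute("ROLLBACK TO SAVEPOINT before_discount")  # cancel only the discount
conn.execute("COMMIT")

price, stock = conn.execute(
    "SELECT price, stock FROM products WHERE product_id = 101").fetchone()
# stock is now 4; price is back to 100.0
```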
All savepoints are released by a full ROLLBACK or COMMIT. Recreate them if you continue with a new transaction afterward.
Autocommit mode
Autocommit executes each statement in its own implicit transaction. This is convenient for read operations but risky for multi-step updates that must stay consistent.
-- Disable autocommit in MySQL
SET autocommit = 0;
-- Explicitly start and end a transaction
START TRANSACTION;
UPDATE accounts SET balance = balance - 50 WHERE account_id = 10;
UPDATE accounts SET balance = balance + 50 WHERE account_id = 20;
COMMIT;
-- Re-enable autocommit
SET autocommit = 1;
Use autocommit for safe single-statement updates and disable it for atomic multi-statement operations.
Isolation levels
Isolation defines how visible uncommitted changes are between concurrent transactions. SQL defines four standard isolation levels. Each prevents certain anomalies but may reduce performance or concurrency.
| Level | Prevents | Allows |
| READ UNCOMMITTED | Nothing | Dirty reads, non-repeatable reads, phantom rows |
| READ COMMITTED | Dirty reads | Non-repeatable reads, phantom rows |
| REPEATABLE READ | Dirty and non-repeatable reads | Phantom rows (varies by engine) |
| SERIALIZABLE | All above anomalies | None |
Setting isolation level
-- Applies to the next transaction in MySQL and PostgreSQL (add SESSION for session scope)
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- SQL Server
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Oracle default is READ COMMITTED
Higher isolation reduces concurrency, so choose the weakest level that preserves correctness for your workload.
For most workloads, READ COMMITTED often suffices. For financial updates, SERIALIZABLE ensures full consistency.
Common anomalies
- Dirty read: Transaction A reads uncommitted data from B, which B later rolls back.
- Non-repeatable read: Transaction A rereads the same row and sees different data due to B’s commit.
- Phantom read: A range query in A returns new or missing rows after B inserts or deletes matching rows.
Isolation levels balance performance and correctness by deciding which of these anomalies to allow.
Concurrency control
Databases use concurrency control to let many transactions operate safely at once. Two major strategies dominate: pessimistic locking (block conflicting operations) and optimistic control (detect conflicts and retry).
Pessimistic concurrency
Pessimistic systems lock resources as soon as they are read or updated, preventing conflicts but reducing parallelism.
-- Explicit lock in SQL Server
SELECT * FROM accounts WITH (UPDLOCK)
WHERE account_id = 100;
MySQL’s InnoDB and PostgreSQL automatically lock rows written inside a transaction. You can request explicit locks using SELECT ... FOR UPDATE.
-- Reserve a row for editing
SELECT * FROM accounts
WHERE account_id = 100
FOR UPDATE;
Optimistic concurrency
Optimistic control allows concurrent transactions without locking, checking later whether data changed. It suits read-heavy workloads.
-- Example using a version column
UPDATE accounts
SET balance = balance + 50, version = version + 1
WHERE account_id = 10 AND version = 5;
If zero rows are affected, another transaction modified the same row, and the update must be retried.
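Here is the version-column pattern end to end in SQLite; the account row, the add_funds helper, and the version numbers are all invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        balance INTEGER, version INTEGER
    )
""")
conn.execute("INSERT INTO accounts VALUES (10, 100, 5)")

def add_funds(expected_version, amount):
    # Returns True when the update wins; rowcount 0 means another writer
    # bumped the version first and the caller must re-read and retry.
    cur = conn.execute(
        "UPDATE accounts SET balance = balance + ?, version = version + 1 "
        "WHERE account_id = 10 AND version = ?",
        (amount, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1

won = add_funds(5, 50)   # version matches: succeeds, version becomes 6
lost = add_funds(5, 50)  # stale version: zero rows affected, must retry
```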
Locking and deadlocks
Locks control concurrent access to data. They can apply to rows, pages, or entire tables depending on the engine and query. Proper lock management prevents inconsistent reads but may introduce deadlocks when transactions wait on each other.
Lock types
| Lock type | Purpose |
| Shared (S) | Allows concurrent reads but blocks writes. |
| Exclusive (X) | Blocks all other access until commit or rollback. |
| Intent locks | Signal upcoming shared or exclusive locks at a coarser level. |
| Table locks | Used when statements affect entire tables (DDL, TRUNCATE, etc.). |
Most engines acquire and release locks automatically. Manual control is rarely needed except for specialized concurrency logic.
Deadlocks
A deadlock occurs when two transactions each hold a lock that the other needs. Neither can proceed, so the database aborts one to resolve the conflict.
-- Example scenario
-- Transaction A:
UPDATE accounts SET balance = balance - 10 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 10 WHERE account_id = 2;
-- Transaction B (in parallel):
UPDATE accounts SET balance = balance - 10 WHERE account_id = 2;
UPDATE accounts SET balance = balance + 10 WHERE account_id = 1;
Each transaction locks one row, then waits on the other’s lock, creating a circular wait. The engine detects this and rolls back one transaction.
Monitoring locks
Databases provide diagnostic views to inspect locks and waits:
- MySQL: SHOW ENGINE INNODB STATUS;
- PostgreSQL: the pg_locks system view
- SQL Server: the sys.dm_tran_locks DMV
- Oracle: v$lock and v$session
Regularly reviewing these can reveal long-held locks or frequent deadlocks, which suggest missing indexes or inefficient transaction design.
Transactions, isolation, and concurrency control together form the foundation of reliable SQL systems. Mastering them lets you balance correctness with performance and design applications that remain consistent even under heavy load.
Chapter 17: Users, Roles, and Permissions
Security in SQL databases begins with users, roles, and privileges. Every statement runs under a specific identity, and permissions determine which objects and operations are allowed. This chapter covers privilege management with GRANT and REVOKE, shows how to organize permissions with roles, applies the principle of least privilege, and explains how views can act as protective layers around data.
GRANT and REVOKE
Permissions control access to database objects such as tables, views, and routines. You can assign or remove them with GRANT and REVOKE. Each database defines its own privilege types, but core privileges are widely supported.
| Privilege | Purpose |
| SELECT | Read data from tables or views. |
| INSERT | Add new rows. |
| UPDATE | Modify existing rows. |
| DELETE | Remove rows. |
| EXECUTE | Run stored procedures or functions. |
| USAGE | Access objects like sequences or schemas (varies by engine). |
| ALL PRIVILEGES | Grant every available permission for the object. |
Granting privileges
-- Give SELECT and UPDATE rights on a table
GRANT SELECT, UPDATE ON employees TO analyst;
-- Allow a role to run a procedure
GRANT EXECUTE ON PROCEDURE payroll_run TO payroll_role;
-- Grant all privileges to an admin
GRANT ALL PRIVILEGES ON DATABASE company TO admin;
Privileges can be granted to individual users or to roles. Roles then group privileges for easier management.
Name roles after job functions (for example, sales_reader, data_entry). Then assign roles to users instead of repeating the same GRANT statements.
Revoking privileges
REVOKE removes privileges previously granted. Once a privilege is revoked, the user loses the corresponding access to the affected objects.
REVOKE UPDATE ON employees FROM analyst;
REVOKE ALL PRIVILEGES ON DATABASE company FROM temp_user;
Some databases also support REVOKE GRANT OPTION FOR, which prevents a user from passing privileges to others without removing their own access.
Ownership and inheritance
Object owners automatically have full rights to their objects. In some systems (like PostgreSQL), ownership can be transferred with ALTER ... OWNER TO. In others (like MySQL), privileges depend on grant tables and not strict ownership.
Roles may inherit privileges from other roles, forming a hierarchy. Use this to model layered access (for example, junior_analyst inherits reader privileges but adds limited write access).
Managing privileges
Privilege management involves creating users, assigning roles, and auditing access. Each system provides its own syntax for creating accounts and linking authentication credentials.
Creating users and roles
-- MySQL
CREATE USER 'jane'@'%' IDENTIFIED BY 'password123';
GRANT SELECT ON company.* TO 'jane'@'%';
-- PostgreSQL
CREATE ROLE analyst LOGIN PASSWORD 'strongpass';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO analyst;
-- SQL Server
CREATE LOGIN jane WITH PASSWORD = 'password123';
CREATE USER jane FOR LOGIN jane;
EXEC sp_addrolemember 'db_datareader', 'jane';
Use CREATE ROLE to define reusable sets of permissions. In many databases, users and roles are both represented as roles with or without the LOGIN privilege.
Viewing existing privileges
-- MySQL
SHOW GRANTS FOR 'jane'@'%';
-- PostgreSQL
\du jane -- psql shortcut to show roles and privileges
-- SQL Server
SELECT * FROM sys.database_permissions WHERE grantee_principal_id = USER_ID('jane');
Regularly auditing privileges helps maintain a secure environment and ensures compliance with organizational policies.
Grant chains and cascading effects
When a user is granted privileges with WITH GRANT OPTION, they can grant the same privileges to others. Revoking the original grant also removes privileges cascaded from it.
GRANT SELECT ON employees TO alice WITH GRANT OPTION;
GRANT SELECT ON employees TO bob; -- by alice
REVOKE SELECT ON employees FROM alice CASCADE; -- removes bob’s access too
Be cautious with WITH GRANT OPTION privileges. They can spread access beyond intended boundaries.
Least-privilege design
The principle of least privilege means giving users only the minimum rights required to perform their tasks. This limits the impact of mistakes and intrusions.
Implementing least privilege
- Start from no privileges, then grant only what is necessary.
- Use roles to represent job functions (for example, reader, editor, auditor).
- Revoke default privileges that new users inherit in some systems.
- Audit privilege use periodically and remove unused ones.
- Separate duties: do not combine administration and data entry in one account.
Least privilege applies not only to human users but also to application connections. Each application should connect with a role that has access only to the schema objects it needs.
Temporary or time-limited privileges
Some databases support time-limited grants or scripts to revoke privileges after audits. When not supported natively, manage expiry manually via scheduled jobs or automation tools.
-- Example manual revocation
GRANT SELECT ON reports TO contractor;
-- Later
REVOKE SELECT ON reports FROM contractor;
Views as security boundaries
Views provide a controlled interface to data. They can hide sensitive columns, filter rows, or present derived information. By granting access to the view instead of the base table, you restrict what users can see or modify.
Creating secure views
-- Create a view that hides salary details
CREATE VIEW staff_public AS
SELECT employee_id, name, department
FROM employees;
GRANT SELECT ON staff_public TO hr_assistant;
REVOKE SELECT ON employees FROM hr_assistant;
This allows HR assistants to read staff information without seeing salaries or private data.
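SQLite has no GRANT or REVOKE, but the column-hiding half of the pattern can still be sketched there; the employees table, its columns, and the sample row are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name TEXT, department TEXT, salary REAL
    )
""")
conn.execute("INSERT INTO employees VALUES (1, 'Ada', 'IT', 95000)")

# The view simply omits the sensitive column from its SELECT list.
conn.execute("""
    CREATE VIEW staff_public AS
    SELECT employee_id, name, department
    FROM employees
""")

cols = [d[0] for d in conn.execute("SELECT * FROM staff_public").description]
# 'salary' never appears among the view's result columns
```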
Updatable views
In many systems, simple views are updatable. More complex views (with joins, aggregates, or groupings) are read-only unless triggers or INSTEAD OF rules are defined.
-- SQL Server example
CREATE VIEW v_customers AS
SELECT customer_id, name, email FROM customers
WHERE active = 1
WITH CHECK OPTION; -- prevents updates that violate the filter
Use WITH CHECK OPTION to ensure updates through a view cannot move rows outside its defined filter.
Definer and invoker security
Views can run with the privileges of either the view’s creator (definer rights) or the calling user (invoker rights). Definer rights let users access underlying data indirectly, while invoker rights enforce their own privileges.
-- MySQL example
CREATE DEFINER = 'admin'@'%' VIEW confidential_summary AS
SELECT COUNT(*) AS employee_count FROM employees;
When a normal user queries this view, it executes under the admin privileges defined at creation time, not the caller’s privileges.
Auditing access via views
Views can also log or trace access if combined with triggers or audit tables. For example, you can record every read of a sensitive view by inserting a log entry into an audit table through a trigger or wrapper function.
By combining roles, privileges, and carefully designed views, you can enforce fine-grained access control and maintain both usability and security within your database systems.
Chapter 18: Views and Virtual Tables
Views are stored SQL queries that act like virtual tables. They simplify complex logic, present consistent query interfaces, and protect underlying schema details. This chapter covers how to create and maintain views, explains when they can be updated, introduces materialized views for stored query results, and outlines caching strategies for performance.
Creating and updating views
A view is defined with a CREATE VIEW statement that stores a query for later use. When queried, the database runs the underlying SQL dynamically, presenting results as a table.
-- Create a view of active customers
CREATE VIEW active_customers AS
SELECT customer_id, name, country
FROM customers
WHERE active = 1;
-- Query it like a table
SELECT * FROM active_customers;
Views can join multiple tables, filter rows, or include calculated columns. They are often used to encapsulate business logic or simplify repetitive queries.
Adopt a naming convention for views (for example, v_sales_summary, v_active_users). This helps distinguish them from base tables.
Replacing or altering a view
To modify an existing view, use CREATE OR REPLACE VIEW where supported, or drop and recreate it.
-- PostgreSQL and Oracle
CREATE OR REPLACE VIEW active_customers AS
SELECT customer_id, name, email
FROM customers
WHERE active = 1;
-- Where CREATE OR REPLACE is unavailable, drop and recreate
DROP VIEW IF EXISTS active_customers;
CREATE VIEW active_customers AS
SELECT customer_id, name, email
FROM customers
WHERE active = 1;
Altering a view does not change dependent privileges or references, but if underlying columns are renamed or removed, the view may break until redefined.
Dropping views
DROP VIEW active_customers;
Dropping a view removes only its definition; it does not affect underlying tables or data.
Updatable vs read-only views
Some views can be updated directly with INSERT, UPDATE, or DELETE. Others are inherently read-only because their definition makes it ambiguous which base rows should change.
Updatable views
Views are generally updatable if they meet these conditions:
- They reference a single base table.
- They include all non-nullable columns from the base table.
- They contain no GROUP BY, DISTINCT, aggregates, or set operations.
-- Updatable view
CREATE VIEW department_staff AS
SELECT employee_id, name, department_id
FROM employees;
-- Update through the view
UPDATE department_staff
SET department_id = 3
WHERE employee_id = 102;
The database propagates such updates directly to the base table.
Read-only views
Views that use joins, aggregates, or derived columns are read-only because updates cannot be unambiguously mapped back to base data.
-- Read-only view
CREATE VIEW department_summary AS
SELECT d.department_name,
COUNT(e.employee_id) AS staff_count
FROM departments AS d
LEFT JOIN employees AS e ON e.department_id = d.department_id
GROUP BY d.department_name;
Attempting to insert or update through such a view usually causes an error.
Some engines let you work around this with INSTEAD OF triggers, which translate view updates into custom logic.
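As a concrete sketch, SQLite supports INSTEAD OF triggers on views. All of its views are read-only by default, so this hypothetical department_staff view only accepts updates because the trigger routes them to the base table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name TEXT, department_id INTEGER
    )
""")
conn.execute("INSERT INTO employees VALUES (102, 'Ben', 1)")
conn.execute("""
    CREATE VIEW department_staff AS
    SELECT employee_id, name, department_id FROM employees
""")

# Translate an UPDATE against the view into an UPDATE on the base table.
conn.execute("""
    CREATE TRIGGER department_staff_upd
    INSTEAD OF UPDATE ON department_staff
    BEGIN
        UPDATE employees SET department_id = NEW.department_id
        WHERE employee_id = NEW.employee_id;
    END
""")

conn.execute("UPDATE department_staff SET department_id = 3 WHERE employee_id = 102")
dept = conn.execute(
    "SELECT department_id FROM employees WHERE employee_id = 102").fetchone()[0]
# dept is now 3
```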
Enforcing rules with WITH CHECK OPTION
WITH CHECK OPTION ensures that inserts or updates through a view do not produce rows that fall outside the view’s WHERE condition.
CREATE VIEW active_employees AS
SELECT employee_id, name, department_id
FROM employees
WHERE active = 1
WITH CHECK OPTION;
If a user tries to insert an inactive employee through this view, the database rejects the operation because it violates the condition.
Materialized views
A materialized view stores the results of a query physically, unlike a regular view which executes the query each time. This improves performance for expensive aggregations or joins but requires refresh management to stay up to date.
Creating a materialized view
-- PostgreSQL and Oracle
CREATE MATERIALIZED VIEW sales_summary AS
SELECT region_id,
SUM(total_amount) AS total_sales,
COUNT(order_id) AS num_orders
FROM orders
GROUP BY region_id;
After creation, querying the materialized view reads from the stored result instead of recomputing the query.
Refreshing materialized views
Because data changes over time, materialized views must be refreshed manually or automatically.
-- Manual refresh (PostgreSQL)
REFRESH MATERIALIZED VIEW sales_summary;
-- Oracle supports ON COMMIT refresh
CREATE MATERIALIZED VIEW sales_summary
REFRESH FAST ON COMMIT AS
SELECT region_id, SUM(total_amount) AS total_sales
FROM orders
GROUP BY region_id;
Fast refresh updates only changed rows if the engine supports materialized view logs. Full refresh rebuilds the entire dataset.
Emulating materialized views
Engines that lack native materialized views (such as MySQL and SQLite) can emulate them with a normal table and a scheduled job that repopulates it.
CREATE TABLE sales_summary AS
SELECT region_id,
SUM(total_amount) AS total_sales
FROM orders
GROUP BY region_id;
-- Later: rebuild periodically
TRUNCATE TABLE sales_summary;
INSERT INTO sales_summary
SELECT region_id,
SUM(total_amount)
FROM orders
GROUP BY region_id;
This approach gives similar performance benefits, though it requires external logic for consistency.
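The rebuild step can be scripted. This SQLite sketch uses DROP plus CREATE TABLE ... AS in place of TRUNCATE and INSERT ... SELECT; the orders and amounts are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        region_id INTEGER, total_amount REAL
    )
""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5)])

def refresh_sales_summary():
    # Full refresh: throw away the old snapshot and rebuild it from orders.
    conn.execute("DROP TABLE IF EXISTS sales_summary")
    conn.execute("""
        CREATE TABLE sales_summary AS
        SELECT region_id, SUM(total_amount) AS total_sales
        FROM orders
        GROUP BY region_id
    """)
    conn.commit()

refresh_sales_summary()
summary = conn.execute(
    "SELECT region_id, total_sales FROM sales_summary ORDER BY region_id").fetchall()
# summary -> [(1, 25.0), (2, 7.5)]
```

In production the refresh would run from a scheduler (cron, an event loop, or the engine's own job system) rather than inline.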
Caching strategies
Query results and materialized data can be cached at several layers to improve performance. The best strategy depends on data volatility and application workload.
Database-level caching
Most engines cache query execution plans and recently accessed data pages automatically. Tuning memory buffers and cache size parameters can greatly improve performance.
- MySQL: innodb_buffer_pool_size controls caching of table data and indexes.
- PostgreSQL: shared_buffers and work_mem influence cache behavior.
- SQL Server: uses automatic plan and page caching.
Application-level caching
Applications can store query results in memory (for example, Redis or Memcached) or local caches to avoid repeated database reads. This is effective for rarely changing data such as configuration lists or product catalogs.
# Example (Python): cache-aside read with a Redis-style client
def get_sales_summary(db, cache):
    key = "sales_summary_2025"
    data = cache.get(key)
    if data is None:
        data = db.query("SELECT * FROM sales_summary")
        cache.set(key, data, ttl=3600)  # expire after one hour
    return data
Hybrid caching with materialized views
Combining materialized views with in-memory caching can balance freshness and performance. The database maintains semi-fresh summaries, and the application caches them briefly to reduce load further.
For example, a dashboard might read from a refreshed hourly materialized view, caching the result in memory for only a few minutes.
Monitoring and invalidation
Effective caching requires monitoring to ensure accuracy and performance. Invalidate or refresh caches when base data changes significantly. Many frameworks provide cache invalidation hooks tied to database triggers or data pipelines.
Views, materialized views, and caching layers together let you balance speed and accuracy. Used well, they simplify queries and reduce computation costs while maintaining a consistent, reliable interface to your data.
Chapter 19: Stored Procedures and Functions
Procedural SQL lets you package logic close to the data. You can create PROCEDURE blocks that perform actions and FUNCTION blocks that return values. The ANSI core describes the idea of routines, while each engine supplies its own procedural dialect. Use routines to reduce round trips, to encapsulate business rules, and to share logic across applications.
When to use procedural SQL
Procedural code inside the database can be helpful, but it is not always the right tool. Start with set-based SQL, then reach for routines when you need orchestration, reusable logic, or server-side validation.
Good use cases
Use a PROCEDURE when you need multi-step workflows (validate input, insert rows, update aggregates). Use a FUNCTION when an expression needs a computed value (normalize a code, compute a score). Routines are helpful for enforcing rules that must run no matter which client calls the database.
When to avoid
Avoid routines when a single declarative statement does the job. If a query can be expressed as INSERT … SELECT or a well-written UPDATE with a JOIN, prefer the set-based version. Keep heavy business logic in application code when portability, versioning, and testing outweigh server-side benefits.
If you repeat the same multi-step workflow from several clients, move it into a PROCEDURE. If you repeat the same scalar calculation inside queries, extract it into a FUNCTION.
Portability considerations
Each vendor uses different routine syntax. MySQL and MariaDB use CREATE PROCEDURE and CREATE FUNCTION with BEGIN and END. PostgreSQL uses CREATE FUNCTION and CREATE PROCEDURE with LANGUAGE plpgsql or another language. SQL Server uses CREATE PROCEDURE and CREATE FUNCTION with T-SQL syntax (use of inserted and deleted pseudo-tables for triggers). Keep bodies simple, and isolate vendor specifics in the smallest possible surface area.
Variables, parameters, and return values
Routines introduce parameters, local variables, control flow, and return mechanisms. The details vary per engine; the goals are the same: validate input, compute results, and return a clear outcome.
CREATE PROCEDURE essentials
A PROCEDURE encapsulates actions. It can have IN, OUT, and INOUT parameters (support depends on the engine). It does not return a value in expressions. You call it with CALL (MySQL) or EXEC/EXECUTE (SQL Server, some others).
-- MySQL / MariaDB example
DELIMITER //
CREATE PROCEDURE add_user(IN p_email VARCHAR(255), IN p_name VARCHAR(100), OUT p_user_id BIGINT)
BEGIN
DECLARE v_exists INT DEFAULT 0;
SELECT COUNT(*) INTO v_exists
FROM users
WHERE email = p_email;
IF v_exists > 0 THEN
SET p_user_id = NULL;
SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Email already exists';
ELSE
INSERT INTO users(email, full_name, created_at)
VALUES(p_email, p_name, NOW());
SET p_user_id = LAST_INSERT_ID();
END IF;
END//
DELIMITER ;
CALL add_user('a@example.com', 'A User', @new_id);
SELECT @new_id;
Change the DELIMITER when defining routines in the MySQL CLI because semicolons appear inside the body. In GUI tools this is often handled for you.
-- SQL Server example
CREATE OR ALTER PROCEDURE dbo.AddUser
@Email NVARCHAR(255),
@Name NVARCHAR(100),
@UserId BIGINT OUTPUT
AS
BEGIN
SET NOCOUNT ON;
IF EXISTS(SELECT 1 FROM dbo.Users WHERE Email = @Email)
BEGIN
SET @UserId = NULL;
THROW 50000, 'Email already exists', 1;
END;
INSERT INTO dbo.Users(Email, FullName, CreatedAt)
VALUES(@Email, @Name, SYSDATETIME());
SET @UserId = SCOPE_IDENTITY();
END;
DECLARE @id BIGINT;
EXEC dbo.AddUser @Email = N'a@example.com', @Name = N'A User', @UserId = @id OUTPUT;
SELECT @id AS NewId;
CREATE FUNCTION essentials
A FUNCTION returns a value and can be used inside expressions. It typically has only IN parameters. Keep functions deterministic when possible, and avoid side effects (engines may restrict data modification inside a function).
-- PostgreSQL example (plpgsql)
CREATE OR REPLACE FUNCTION public.slugify(p_text text)
RETURNS text
LANGUAGE plpgsql
AS $$
DECLARE
v_text text := lower(trim(p_text));
BEGIN
-- collapse runs of non-alphanumeric characters (including spaces) into single hyphens
v_text := regexp_replace(v_text, '[^a-z0-9]+', '-', 'g');
v_text := btrim(v_text, '-');
RETURN v_text;
END;
$$;
SELECT slugify('Hello World & More'); -- returns 'hello-world-more'
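For comparison, a Python sketch of the same slugify idea: lowercase and trim, then collapse every run of non-alphanumeric characters into a single hyphen.

```python
import re

def slugify(text: str) -> str:
    text = text.strip().lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)  # runs of punctuation/spaces -> one hyphen
    return text.strip("-")                   # drop any leading/trailing hyphens

result = slugify("Hello World & More")
# result == "hello-world-more"
```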
If the logic belongs in a SELECT list or WHERE clause, prefer a FUNCTION. If the logic changes data across multiple tables, prefer a PROCEDURE.
Local variables and control flow
Engines provide DECLARE for locals, and IF, CASE, LOOP or WHILE for control flow. Use them sparingly; first consider whether a set-based expression is clearer.
-- MySQL / MariaDB: loop and exit
CREATE PROCEDURE demo_loop(IN p_n INT)
BEGIN
DECLARE i INT DEFAULT 1;
my_loop: LOOP
IF i > p_n THEN
LEAVE my_loop;
END IF;
INSERT INTO t(counter) VALUES(i);
SET i = i + 1;
END LOOP my_loop;
END;
Error handling and signaling
Use the engine’s error constructs to signal problems. MySQL uses SIGNAL. PostgreSQL uses RAISE. SQL Server uses THROW. Prefer meaningful messages and consistent SQLSTATE or error numbers.
-- PostgreSQL
CREATE OR REPLACE FUNCTION safe_divide(a numeric, b numeric)
RETURNS numeric
LANGUAGE plpgsql
AS $$
BEGIN
IF b = 0 THEN
RAISE EXCEPTION 'Division by zero';
END IF;
RETURN a / b;
END;
$$;
Transactions inside routines
Transaction control varies. PostgreSQL functions run inside the caller’s transaction (you cannot COMMIT in a function). SQL Server procedures may issue BEGIN TRAN and COMMIT. MySQL procedures run within the current transaction; you can use START TRANSACTION and COMMIT if the storage engine supports it. Keep boundaries clear and avoid long transactions inside routines.
Triggers
A trigger runs automatically in response to table events. Use triggers to enforce invariants, maintain audit trails, and derive values. Keep trigger bodies brief and predictable.
Trigger timing and events
Common timings are BEFORE, AFTER, and sometimes INSTEAD OF (often for views). Common events are INSERT, UPDATE, and DELETE. You can create separate triggers per event or a combined trigger depending on the engine.
Accessing rows inside triggers
MySQL and PostgreSQL use NEW and OLD row references. SQL Server uses the virtual tables inserted and deleted. Use these to read changed values and to validate or transform data.
-- MySQL: BEFORE INSERT to normalize and enforce
CREATE TRIGGER users_bi BEFORE INSERT ON users
FOR EACH ROW
BEGIN
SET NEW.email = LOWER(TRIM(NEW.email));
IF NEW.email IS NULL OR NEW.email = '' THEN
SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Email required';
END IF;
END;
-- PostgreSQL: AFTER UPDATE audit trail
CREATE TABLE user_audit(
user_id bigint,
changed_at timestamptz,
old_email text,
new_email text
);
CREATE OR REPLACE FUNCTION audit_user_email()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
IF NEW.email IS DISTINCT FROM OLD.email THEN
INSERT INTO user_audit(user_id, changed_at, old_email, new_email)
VALUES(NEW.id, now(), OLD.email, NEW.email);
END IF;
RETURN NEW;
END;
$$;
CREATE TRIGGER users_au
AFTER UPDATE ON users
FOR EACH ROW
EXECUTE FUNCTION audit_user_email();
-- SQL Server: AFTER INSERT trigger to maintain a counter
CREATE OR ALTER TRIGGER dbo.Users_AI ON dbo.Users
AFTER INSERT
AS
BEGIN
SET NOCOUNT ON;
UPDATE dbo.Counters
SET UserCount = UserCount + (SELECT COUNT(*) FROM inserted)
WHERE CounterName = 'Users';
END;
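SQLite folds the trigger function and the trigger into a single statement, so the email-audit idea above can be tested end to end with a simplified, invented schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("""
    CREATE TABLE user_audit (
        user_id INTEGER, old_email TEXT, new_email TEXT
    )
""")

# IS NOT is SQLite's null-safe inequality, mirroring IS DISTINCT FROM.
conn.execute("""
    CREATE TRIGGER users_au AFTER UPDATE ON users
    FOR EACH ROW WHEN NEW.email IS NOT OLD.email
    BEGIN
        INSERT INTO user_audit(user_id, old_email, new_email)
        VALUES (NEW.id, OLD.email, NEW.email);
    END
""")

conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")
conn.execute("UPDATE users SET email = 'b@example.com' WHERE id = 1")
conn.execute("UPDATE users SET email = 'b@example.com' WHERE id = 1")  # no change: no audit row
audit = conn.execute(
    "SELECT user_id, old_email, new_email FROM user_audit").fetchall()
```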
If trigger logic grows complex, move it into a PROCEDURE and call it explicitly to avoid hidden side effects.
Order, recursion, and performance
Trigger firing order can be unspecified or configurable. Some engines allow multiple triggers per timing; if order matters, consolidate logic into one trigger. Watch for unintended recursion when a trigger modifies the same table; guard with conditions or use session settings to disable nesting if available.
Best practices
Keep bodies small. Avoid network calls. Avoid nondeterministic behavior that surprises callers. Document every trigger with purpose and invariants. Test using representative data and concurrent sessions to catch locking issues early.
When logic outgrows a trigger, move it into a PROCEDURE or the application layer, where intent and flow are easier to follow.
Chapter 20: Working with Dates and Times
Dates, times, and timestamps represent real-world events. SQL engines provide arithmetic, extraction, formatting, time zone conversion, and interval math. Favor ISO standards, store in UTC when possible, and convert at the edges when displaying to users.
Prefer timestamp with time zone where supported, or store UTC in a plain timestamp and keep a separate time zone column for the user context.
Date arithmetic, extraction, and formatting
SQL supports adding and subtracting intervals, extracting fields like year or month, and formatting values for presentation. Use arithmetic for logic, use formatting only at the final presentation step.
EXTRACT and basic arithmetic
EXTRACT returns a part of a date or timestamp. Arithmetic uses intervals, or engine-specific numeric offsets. Prefer interval-aware syntax to avoid month-length surprises.
-- ANSI-style extraction
SELECT
EXTRACT(YEAR FROM CURRENT_DATE) AS year_num,
EXTRACT(MONTH FROM CURRENT_DATE) AS month_num,
EXTRACT(DAY FROM CURRENT_DATE) AS day_num;
-- ANSI-style addition and subtraction (engines vary)
-- PostgreSQL
SELECT CURRENT_DATE + INTERVAL '7 days' AS in_seven_days;
-- MySQL / MariaDB
SELECT DATE_ADD(CURDATE(), INTERVAL 7 DAY) AS in_seven_days,
DATE_SUB(CURDATE(), INTERVAL 1 MONTH) AS last_month;
-- SQL Server
SELECT DATEADD(DAY, 7, CAST(GETDATE() AS date)) AS in_seven_days,
DATEADD(MONTH, -1, CAST(GETDATE() AS date)) AS last_month;
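SQLite, absent from the list above, does the same arithmetic with modifier strings on its date() function; a quick check using fixed dates instead of CURRENT_DATE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# date(base, modifier) applies calendar-aware offsets like '+7 days' or '-1 month'.
in_seven, last_month = conn.execute(
    "SELECT date('2025-11-03', '+7 days'), date('2025-11-03', '-1 month')"
).fetchone()
# in_seven == '2025-11-10', last_month == '2025-10-03'
```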
DATE_TRUNC, bucketing, and grouping
Truncation rounds down to a boundary (hour, day, month). Use it for grouping facts into time buckets.
-- PostgreSQL bucketing by day and month
SELECT DATE_TRUNC('day', created_at) AS day_bucket, COUNT(*) FROM orders GROUP BY 1;
SELECT DATE_TRUNC('month', created_at) AS month_bucket, COUNT(*) FROM orders GROUP BY 1;
-- SQL Server: use DATETRUNC (SQL Server 2022+) or computed boundaries
SELECT DATETRUNC(day, CreatedAt) AS day_bucket, COUNT(*) FROM dbo.Orders GROUP BY DATETRUNC(day, CreatedAt);
-- MySQL: use DATE(), or format to YYYY-MM-01 for month buckets
SELECT DATE(created_at) AS day_bucket, COUNT(*) FROM orders GROUP BY DATE(created_at);
SELECT DATE_FORMAT(created_at, '%Y-%m-01') AS month_bucket, COUNT(*) FROM orders GROUP BY 1;
Differences and durations
Compute elapsed time with supported constructs. Prefer interval types; if the engine returns numbers, document the unit clearly.
-- PostgreSQL: interval result
SELECT (NOW() - order_placed_at) AS elapsed FROM orders WHERE id = 42;
-- MySQL: duration in seconds or minutes
SELECT TIMESTAMPDIFF(SECOND, order_placed_at, NOW()) AS elapsed_seconds FROM orders WHERE id = 42;
-- SQL Server: DATEDIFF returns integer units
SELECT DATEDIFF(second, OrderPlacedAt, SYSDATETIME()) AS elapsed_seconds FROM dbo.Orders WHERE Id = 42;
TO_CHAR, DATE_FORMAT, FORMAT and friends
Use formatting functions at the presentation boundary. Keep internal comparisons and joins in native date types, not in strings. Patterns differ across engines.
| Engine | Function | Example |
| PostgreSQL | TO_CHAR | TO_CHAR(NOW(), 'YYYY-MM-DD HH24:MI') |
| MySQL / MariaDB | DATE_FORMAT | DATE_FORMAT(NOW(), '%Y-%m-%d %H:%i') |
| SQL Server | FORMAT | FORMAT(SYSDATETIME(), 'yyyy-MM-dd HH:mm') |
If you already format with {fmt …} in application templates, keep the raw timestamp in SQL and format in the application layer for consistency.
Parsing strings into dates safely
Parsing is the inverse of formatting. Use explicit formats so inputs remain unambiguous.
-- PostgreSQL
SELECT TO_TIMESTAMP('2025-11-03 12:00', 'YYYY-MM-DD HH24:MI');
-- MySQL / MariaDB
SELECT STR_TO_DATE('2025-11-03 12:00', '%Y-%m-%d %H:%i');
-- SQL Server
SELECT TRY_CONVERT(datetime2, '2025-11-03 12:00', 120); -- ODBC style 120 = yyyy-mm-dd hh:mi:ss
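The same explicit-format discipline applies in application code. Python's strptime accepts only inputs that match the stated pattern, so an ambiguous regional string fails fast instead of being silently misread:

```python
from datetime import datetime

# ISO-style input matching the pattern parses cleanly.
parsed = datetime.strptime("2025-11-03 12:00", "%Y-%m-%d %H:%M")

# A day-first regional string does not match the pattern and is rejected.
try:
    datetime.strptime("03/11/2025 12:00", "%Y-%m-%d %H:%M")
    ambiguous_accepted = True
except ValueError:
    ambiguous_accepted = False
```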
Avoid ambiguous regional formats like 03/11/2025. Use ISO 8601 strings (YYYY-MM-DD) for portability.
Time zones and intervals
Storing in UTC reduces ambiguity. Convert to a user’s time zone only when rendering or when the business rule depends on local time. Intervals represent human concepts like months or business days; arithmetic with intervals should respect calendar rules.
AT TIME ZONE and conversion
Engines offer explicit conversion functions. Make the source zone and target zone clear. Document whether the stored value is UTC or local time.
-- PostgreSQL: convert UTC to Europe/London at query time
SELECT (created_at AT TIME ZONE 'UTC') AT TIME ZONE 'Europe/London' AS created_local
FROM orders;
-- PostgreSQL alternative when created_at is timestamptz in UTC
SELECT created_at AT TIME ZONE 'Europe/London' AS created_local FROM orders;
-- SQL Server: SWITCHOFFSET uses a fixed offset (it does not track DST);
-- prefer AT TIME ZONE 'GMT Standard Time' on SQL Server 2016+
SELECT SWITCHOFFSET(TODATETIMEOFFSET(CreatedAtUtc, '+00:00'), '+01:00') AS created_local
FROM dbo.Orders;
-- MySQL 8+: convert between zones
SELECT CONVERT_TZ(created_at_utc, '+00:00', 'Europe/London') AS created_local
FROM orders;
Name columns for their zone, for example created_at_utc. Add a computed or indexed persisted column in local time if you group by local dates often.
Intervals vs absolute time shifts
Intervals represent amounts like one month or two days. Absolute shifts like adding 86,400 seconds can be wrong around daylight saving transitions. Prefer calendar-aware intervals.
-- PostgreSQL: calendar-aware
SELECT (DATE '2025-03-28' + INTERVAL '5 days') AS plus_5d,
(TIMESTAMPTZ '2025-03-31 23:30 Europe/London' AT TIME ZONE 'Europe/London')
+ INTERVAL '1 day' AS next_day_local;
-- SQL Server: add calendar units
SELECT DATEADD(day, 5, CAST('2025-03-28' AS date)) AS plus_5d;
-- MySQL: INTERVAL expression
SELECT DATE_ADD('2025-03-28', INTERVAL 5 DAY) AS plus_5d;
Normalizing to day boundaries
Many reports need local-day buckets. Convert to the target zone first, then truncate.
-- PostgreSQL
SELECT DATE_TRUNC('day', created_at AT TIME ZONE 'Europe/London') AS day_local, COUNT(*)
FROM orders
GROUP BY 1
ORDER BY 1;
-- SQL Server (2022+)
SELECT DATETRUNC(day, SWITCHOFFSET(TODATETIMEOFFSET(CreatedAtUtc, '+00:00'), '+01:00')) AS day_local,
COUNT(*)
FROM dbo.Orders
GROUP BY DATETRUNC(day, SWITCHOFFSET(TODATETIMEOFFSET(CreatedAtUtc, '+00:00'), '+01:00'))
ORDER BY day_local;
-- MySQL
SELECT DATE(CONVERT_TZ(created_at_utc, '+00:00', 'Europe/London')) AS day_local, COUNT(*)
FROM orders
GROUP BY day_local
ORDER BY day_local;
Generating date ranges and calendars
Create ranges to fill gaps in reports. Use set-generating constructs or helper tables.
-- PostgreSQL: generate_series over dates
SELECT d::date AS dt
FROM generate_series(DATE '2025-10-01', DATE '2025-10-31', INTERVAL '1 day') AS g(d);
-- SQL Server: tally using recursive CTE
WITH d AS (
SELECT CAST('2025-10-01' AS date) AS dt
UNION ALL
SELECT DATEADD(day, 1, dt) FROM d WHERE dt < '2025-10-31'
)
SELECT dt FROM d OPTION (MAXRECURSION 0);
-- MySQL: recursive CTE (8.0+)
WITH RECURSIVE d AS (
SELECT DATE('2025-10-01') AS dt
UNION ALL
SELECT DATE_ADD(dt, INTERVAL 1 DAY) FROM d WHERE dt < '2025-10-31'
)
SELECT dt FROM d;
For recurring reports, build a persistent calendar table with columns such as dt, is_weekend, is_holiday, and week_start; join to it in reports for speed and clarity.
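The recursive approach can be checked end to end with Python's built-in sqlite3 module (SQLite dialect, used as a portable demonstrator; the date range is the same October 2025 example as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE d(dt) AS (
        SELECT DATE('2025-10-01')
        UNION ALL
        SELECT DATE(dt, '+1 day') FROM d WHERE dt < '2025-10-31'
    )
    SELECT dt FROM d
""").fetchall()
print(len(rows))                   # 31 (days in October)
print(rows[0][0], rows[-1][0])     # 2025-10-01 2025-10-31
```

SQLite's DATE(dt, '+1 day') modifier plays the role of DATE_ADD / DATEADD in the examples above; the recursion terminates once the end date is reached.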
Common pitfalls and checks
Validate inputs, document assumptions, and add constraints where the engine allows them.
- Reject impossible dates at ingest with a CHECK like created_at_utc >= TIMESTAMP '2000-01-01'.
- Avoid mixing local times from different time zones in one column without an explicit zone or separate columns.
- Store durations as intervals or integers in clear units (seconds, minutes); name the column accordingly.
Chapter 21: String and Pattern Operations
Strings are among the most common data types in SQL. Operations include concatenation, trimming, replacing, substring extraction, and pattern matching using wildcards or regular expressions. Each engine follows the ANSI baseline but extends it with extra string and pattern tools.
Normalize text case with LOWER() or UPPER() to maintain consistency.
Concatenation, SUBSTRING, TRIM, REPLACE
String construction and manipulation appear in most queries. Use ANSI forms when possible; switch to dialect-specific variants only for advanced use cases.
Concatenation
ANSI SQL defines || for concatenation, though some engines use + or CONCAT(). Prefer the function form for cross-engine portability.
-- ANSI
SELECT 'Hello' || ' ' || 'World' AS greeting;
-- MySQL / MariaDB
SELECT CONCAT('Hello', ' ', 'World') AS greeting;
-- SQL Server
SELECT 'Hello' + ' ' + 'World' AS greeting;
-- PostgreSQL supports both ANSI and CONCAT
SELECT CONCAT('Hello', ' ', 'World');
The + operator in SQL Server performs arithmetic when any operand is numeric. Ensure all concatenated expressions are explicitly VARCHAR or NVARCHAR.
SUBSTRING extraction
SUBSTRING() (or SUBSTR()) retrieves part of a string. Position numbering starts at 1 in most engines. Negative indexes are engine-specific.
-- ANSI form
SELECT SUBSTRING('abcdef' FROM 2 FOR 3); -- returns 'bcd'
-- MySQL / MariaDB
SELECT SUBSTRING('abcdef', 2, 3);
-- SQL Server
SELECT SUBSTRING('abcdef', 2, 3);
-- PostgreSQL (supports both forms)
SELECT SUBSTR('abcdef', 2, 3);
Use POSITION() or CHARINDEX() to locate delimiters, then apply SUBSTRING() around them for clarity.
TRIM and whitespace cleanup
TRIM() removes leading, trailing, or both types of spaces (or custom characters). SQL also offers LTRIM() and RTRIM() in many dialects.
-- ANSI
SELECT TRIM(' text ') AS cleaned;
SELECT TRIM(BOTH 'x' FROM 'xxxtrimxx') AS cleaned;
-- MySQL / MariaDB
SELECT TRIM(' text '), LTRIM(' text '), RTRIM(' text ');
-- SQL Server
SELECT TRIM(' text '), LTRIM(' text '), RTRIM(' text '); -- TRIM requires SQL Server 2017+
-- PostgreSQL
SELECT TRIM(' text ');
REPLACE and substitution
REPLACE() substitutes every occurrence of one substring with another. All major engines support it with identical signatures.
SELECT REPLACE('abc def abc', 'abc', 'XYZ'); -- returns 'XYZ def XYZ'
REPLACE() is case-sensitive in most databases. Use LOWER() or UPPER() to normalize the input before substitution if needed.
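Case sensitivity is easy to verify. This sketch uses Python's built-in sqlite3 module (SQLite dialect, as a portable demonstrator; the sample strings are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# REPLACE is case-sensitive: the uppercase 'ABC' is left untouched
out1 = conn.execute("SELECT REPLACE('abc ABC', 'abc', 'X')").fetchone()[0]
print(out1)  # X ABC
# Normalize case first when the match should be case-insensitive
out2 = conn.execute("SELECT REPLACE(LOWER('abc ABC'), 'abc', 'X')").fetchone()[0]
print(out2)  # X X
```

Note that normalizing with LOWER() also lowercases the parts you keep, so apply it deliberately (often in a staging column) rather than in place.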
Other helpful functions
| Function | Purpose | Example |
| LENGTH() / CHAR_LENGTH() | Count characters in a string | SELECT LENGTH('hello'); |
| POSITION() / CHARINDEX() | Find substring index | SELECT POSITION('b' IN 'abc'); |
| LEFT(), RIGHT() | Extract leading or trailing characters | SELECT LEFT('abcdef', 3); |
| CONCAT_WS() | Concatenate with a separator (MySQL, PostgreSQL) | SELECT CONCAT_WS('-', '2025', '11', '03'); |
Regular expressions
Regular expressions extend pattern matching beyond LIKE wildcards. They follow POSIX, PCRE, or T-SQL style syntax depending on the database. Use them for validation, extraction, and conditional filtering, but be cautious with performance.
Matching and filtering
-- PostgreSQL (POSIX)
SELECT 'abc123' ~ '^[a-z]+[0-9]+$' AS is_match;
-- MySQL 8+ (default is case-insensitive unless specified)
SELECT 'abc123' REGEXP '^[a-z]+[0-9]+$';
-- SQL Server (use LIKE or CLR regex functions; no native REGEXP)
-- Example using a pattern for digits only
SELECT CASE WHEN name LIKE '%[0-9]%' THEN 1 ELSE 0 END AS has_digits FROM dbo.People;
Extraction and replacement
Some databases offer regex capture and substitution functions to transform text in one call.
-- PostgreSQL
SELECT REGEXP_REPLACE('abc123def', '[0-9]+', '###', 'g'); -- abc###def
SELECT REGEXP_MATCHES('id=42', '[0-9]+'); -- returns {42} (a text array)
-- MySQL 8+
SELECT REGEXP_REPLACE('abc123def', '[0-9]+', '###');
Use REGEXP_REPLACE() to strip unwanted characters like spaces, punctuation, or control codes, but test expressions carefully before applying them to production data.
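SQLite is another engine that parses REGEXP syntax but ships no implementation, so you must supply one. The sketch below (Python's built-in sqlite3 module, used as a portable demonstrator) registers a user function that SQLite invokes as regexp(pattern, text):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite rewrites `x REGEXP y` into a call to regexp(y, x),
# so the registered function receives the pattern first
conn.create_function("regexp", 2,
                     lambda pattern, text: re.search(pattern, text) is not None)
matched = conn.execute("SELECT 'abc123' REGEXP '^[a-z]+[0-9]+$'").fetchone()[0]
unmatched = conn.execute("SELECT 'abc' REGEXP '^[0-9]+$'").fetchone()[0]
print(matched, unmatched)  # 1 0
```

The same create_function mechanism can wrap re.sub to emulate REGEXP_REPLACE where the engine lacks it.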
Performance and indexes
Regex scans are CPU-intensive because they bypass most index optimizations. Restrict them with other predicates or use precomputed flags to reduce cost.
LIKE patterns
The LIKE operator supports basic wildcard matching using % (any number of characters) and _ (a single character). It works in all engines and often benefits from indexes when the pattern does not begin with a wildcard.
Basic usage
SELECT name FROM customers WHERE name LIKE 'A%'; -- begins with A
SELECT name FROM customers WHERE name LIKE '%son'; -- ends with son
SELECT name FROM customers WHERE name LIKE '_a%'; -- second letter is a
Case sensitivity
Default behavior varies. MySQL is case-insensitive for most collations. PostgreSQL is case-sensitive by default but provides ILIKE for case-insensitive matches. SQL Server depends on the database collation.
-- PostgreSQL
SELECT name FROM users WHERE name ILIKE 'john%'; -- case-insensitive
-- MySQL
SELECT name FROM users WHERE name LIKE 'john%'; -- case-insensitive in utf8_general_ci
-- SQL Server
SELECT name FROM users WHERE name LIKE 'john%'; -- depends on collation
Leading wildcards (patterns like %text%) prevent index use and can slow queries significantly. Consider full-text search features for heavy pattern workloads.
Escaping special characters
To match literal % or _ symbols, define an escape character using ESCAPE.
SELECT filename
FROM logs
WHERE filename LIKE '%\%%' ESCAPE '\'; -- match a literal %
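The ESCAPE clause behaves the same way in SQLite, which makes it easy to verify. A runnable sketch with Python's built-in sqlite3 module (the logs table and filenames are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (filename TEXT)")
conn.executemany("INSERT INTO logs VALUES (?)",
                 [("disk_90%.log",), ("disk_full.log",)])
# '\%' matches a literal percent sign thanks to ESCAPE '\'
rows = conn.execute(
    r"SELECT filename FROM logs WHERE filename LIKE '%\%%' ESCAPE '\'"
).fetchall()
print(rows)  # only the filename containing a literal %
```

Only disk_90%.log matches; the unescaped % wildcards on either side still match anything, while the escaped one demands a literal percent sign.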
Combining with logic
Combine LIKE with logical operators for complex filters. Remember to group conditions with parentheses to maintain precedence.
SELECT title
FROM books
WHERE (title LIKE 'SQL%' OR title LIKE 'Database%')
AND title NOT LIKE '%Advanced%';
Reserve LIKE for light pattern filtering.
Chapter 22: Advanced Queries
Advanced SQL features like window functions, pivoting, and recursive common table expressions (CTEs) let you express complex analytical and hierarchical logic directly in queries. These constructs build on core SELECT syntax but operate across multiple rows at once, providing cumulative, comparative, and hierarchical insights.
Window functions (OVER, PARTITION BY, RANK, LAG/LEAD)
Window functions compute values across related rows without collapsing them into groups. They use the OVER() clause, which can define partitions (logical subgroups) and orderings. Typical examples include running totals, row numbers, moving averages, and value comparisons.
Basic structure
A window function looks like an aggregate but uses OVER() to define its scope. The result set retains one row per input record.
SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY customer_id) AS total_per_customer
FROM orders;
This computes the total amount per customer without grouping rows together. Each order retains its own detail.
ROW_NUMBER(), RANK(), and DENSE_RANK()
Ranking functions assign order within each partition. They differ in how they handle ties.
SELECT
customer_id,
amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS row_num,
RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rank_num,
DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS dense_rank_num
FROM orders;
| Function | Behavior |
| ROW_NUMBER() | Sequential numbering (no gaps) |
| RANK() | Gaps when ties occur |
| DENSE_RANK() | No gaps; next rank follows immediately |
Pair ranking functions with an ORDER BY on unique keys to avoid inconsistent results between runs.
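The tie-handling differences can be seen directly. This sketch uses Python's built-in sqlite3 module (SQLite supports window functions since 3.25; the orders data is invented, with a deliberate tie on 50.0):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 50.0), (1, 50.0), (1, 30.0)])
rows = conn.execute("""
    SELECT amount,
           ROW_NUMBER() OVER w AS row_num,
           RANK()       OVER w AS rank_num,
           DENSE_RANK() OVER w AS dense_num
    FROM orders
    WINDOW w AS (ORDER BY amount DESC)
    ORDER BY row_num
""").fetchall()
print(rows)
```

The tied 50.0 rows share RANK 1, so 30.0 receives RANK 3 (a gap) but DENSE_RANK 2 (no gap), while ROW_NUMBER stays strictly sequential.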
LAG() and LEAD()
These functions look backward or forward in the ordered set to compare values across rows. They are useful for computing differences between consecutive events or detecting changes.
SELECT
customer_id,
order_date,
amount,
LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_amount,
LEAD(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS next_amount,
amount - LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS diff_prev
FROM orders;
Combine COALESCE() with LAG() or LEAD() to substitute defaults for missing neighbors, such as COALESCE(LAG(value) …, 0).
Moving averages and cumulative totals
By defining a frame with ROWS or RANGE, you can compute sliding windows such as a 7-day moving average or cumulative sums.
-- 7-day moving average by date
SELECT
order_date,
AVG(amount) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg
FROM orders;
-- cumulative total
SELECT
customer_id,
order_date,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date ROWS UNBOUNDED PRECEDING) AS running_total
FROM orders;
Frame clauses (ROWS vs RANGE) differ: ROWS counts physical rows, while RANGE uses value intervals (for example, numeric or time-based distances).
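A running total is easy to verify end to end. This sketch uses Python's built-in sqlite3 module (SQLite dialect, as a portable demonstrator; the orders rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "2025-01-01", 10.0),
    (1, "2025-01-02", 20.0),
    (1, "2025-01-03", 5.0),
    (2, "2025-01-01", 7.0),
])
rows = conn.execute("""
    SELECT customer_id, order_date, amount,
           SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date
                             ROWS UNBOUNDED PRECEDING) AS running_total
    FROM orders
    ORDER BY customer_id, order_date
""").fetchall()
print([r[3] for r in rows])  # [10.0, 30.0, 35.0, 7.0]
```

The total resets at the partition boundary: customer 1 accumulates 10, 30, 35, while customer 2 starts over at 7.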
Pivoting and unpivoting data
Pivoting transforms rows into columns, while unpivoting reverses that process. These operations help reshape data for reports or analytics.
Pivoting with aggregates
The concept is universal, though syntax varies widely. You can use conditional aggregation or vendor-specific PIVOT operators.
-- Portable approach: conditional aggregation
SELECT
customer_id,
SUM(CASE WHEN status = 'pending' THEN amount ELSE 0 END) AS pending_total,
SUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END) AS paid_total,
SUM(CASE WHEN status = 'shipped' THEN amount ELSE 0 END) AS shipped_total
FROM orders
GROUP BY customer_id;
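Because conditional aggregation is plain ANSI SQL, it runs anywhere, including SQLite. A runnable sketch using Python's built-in sqlite3 module (the orders rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, status TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "pending", 10.0), (1, "paid", 25.0), (1, "paid", 5.0),
    (2, "shipped", 40.0),
])
rows = conn.execute("""
    SELECT customer_id,
           SUM(CASE WHEN status = 'pending' THEN amount ELSE 0 END) AS pending_total,
           SUM(CASE WHEN status = 'paid'    THEN amount ELSE 0 END) AS paid_total,
           SUM(CASE WHEN status = 'shipped' THEN amount ELSE 0 END) AS shipped_total
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
print(rows)
```

Each status becomes a column; customer 1's two paid orders collapse into a single 30.0 paid_total.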
SQL Server and Oracle include a dedicated PIVOT clause:
-- SQL Server example
SELECT *
FROM (
SELECT customer_id, status, amount FROM orders
) AS src
PIVOT (
SUM(amount) FOR status IN ([pending], [paid], [shipped])
) AS p;
Unpivoting (normalizing columns)
Unpivoting converts columns back into rows, often for cleaner storage or joining with metadata tables.
-- SQL Server example
SELECT customer_id, status, amount
FROM orders_pivoted
UNPIVOT (
amount FOR status IN ([pending], [paid], [shipped])
) AS u;
Portable alternatives rely on UNION ALL constructs:
-- Portable UNION ALL approach
SELECT customer_id, 'pending' AS status, pending_total AS amount FROM orders_pivoted
UNION ALL
SELECT customer_id, 'paid', paid_total FROM orders_pivoted
UNION ALL
SELECT customer_id, 'shipped', shipped_total FROM orders_pivoted;
Recursive CTEs
Recursive common table expressions let you query hierarchical or graph-like data. They start with a base case, then repeatedly join or union new rows until no further matches appear. Use them for trees, bill-of-materials, or dependency chains.
Structure of a recursive CTE
A recursive CTE has two parts: an initial query (the base case) and a recursive query that references the CTE itself. Combine them with UNION ALL.
WITH RECURSIVE subordinates AS (
-- base case
SELECT employee_id, manager_id, full_name, 1 AS level
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- recursive step
SELECT e.employee_id, e.manager_id, e.full_name, s.level + 1
FROM employees e
JOIN subordinates s ON e.manager_id = s.employee_id
)
SELECT * FROM subordinates ORDER BY level, employee_id;
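SQLite supports WITH RECURSIVE with the same shape, so the hierarchy walk can be tested directly. A sketch using Python's built-in sqlite3 module (the four-person org chart is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INT, manager_id INT, full_name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (1, None, "Ada"),   # root of the tree
    (2, 1, "Ben"),
    (3, 1, "Cal"),
    (4, 2, "Dee"),
])
rows = conn.execute("""
    WITH RECURSIVE subordinates AS (
        SELECT employee_id, manager_id, full_name, 1 AS level
        FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.employee_id, e.manager_id, e.full_name, s.level + 1
        FROM employees e
        JOIN subordinates s ON e.manager_id = s.employee_id
    )
    SELECT full_name, level FROM subordinates ORDER BY level, employee_id
""").fetchall()
print(rows)  # [('Ada', 1), ('Ben', 2), ('Cal', 2), ('Dee', 3)]
```

Each recursive step joins the next layer of reports onto the rows found so far, incrementing level until no employee has a manager in the previous layer.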
Limiting recursion depth
Engines typically restrict recursion depth for safety. You can control it explicitly.
-- PostgreSQL / SQLite
WITH RECURSIVE t(n) AS (
SELECT 1
UNION ALL
SELECT n + 1 FROM t WHERE n < 10
)
SELECT * FROM t;
-- SQL Server
WITH t AS (
SELECT 1 AS n
UNION ALL
SELECT n + 1 FROM t WHERE n < 10
)
SELECT * FROM t
OPTION (MAXRECURSION 100);
Always bound recursion (for example WHERE level < 10 or MAXRECURSION) to prevent runaway loops.
Practical uses
- Exploding organizational hierarchies (manager → employee trees)
- Walking parent-child product assemblies
- Computing dependency graphs (modules, files, prerequisites)
- Generating date or number sequences without auxiliary tables
Chapter 23: Importing and Exporting Data
Importing and exporting are essential for moving data between databases and external systems. SQL engines include commands for bulk loading, copying, and exporting tables in various formats. Efficient data transfer relies on proper file formats, encoding, and matching column definitions.
LOAD DATA INFILE, COPY, BULK INSERT
Most databases provide fast server-side commands for reading or writing large text files. These bypass normal client insert loops for performance.
LOAD DATA INFILE (MySQL / MariaDB)
LOAD DATA INFILE reads data directly from a file into a table. Files can be local or server-side depending on privileges and configuration.
LOAD DATA INFILE '/var/data/users.csv'
INTO TABLE users
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(id, name, email, created_at);
Each line represents a row; fields correspond to columns. Use IGNORE 1 LINES to skip headers.
The MySQL secure_file_priv setting may restrict directories. Use LOAD DATA LOCAL INFILE for client-side files (ensure it is enabled).
COPY (PostgreSQL)
PostgreSQL’s COPY reads or writes delimited files in one operation. It supports CSV, text, and binary formats.
-- Import
COPY users (id, name, email, created_at)
FROM '/var/data/users.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',', QUOTE '"');
-- Export
COPY (SELECT * FROM users)
TO '/var/data/users_export.csv'
WITH (FORMAT csv, HEADER true);
When running COPY from client tools, use \copy (psql) to operate on client-local files rather than server-side paths.
BULK INSERT (SQL Server)
SQL Server provides BULK INSERT for high-speed imports. It is similar to PostgreSQL’s COPY.
BULK INSERT dbo.Users
FROM 'C:\data\users.csv'
WITH (
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
FIRSTROW = 2
);
Server permissions and file access policies apply. Use OPENROWSET(BULK …) for more flexible data streaming.
Exporting to CSV or JSON
Export operations let you share query results with other systems, data scientists, or visualization tools. Common targets are CSV and JSON, both supported natively by modern SQL engines.
CSV export
CSV (comma-separated values) is portable and widely supported, though limited in type fidelity. Use text-safe quoting and explicit column names.
-- MySQL / MariaDB
SELECT id, name, email
FROM users
INTO OUTFILE '/var/data/users_out.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';
-- PostgreSQL
COPY (SELECT id, name, email FROM users)
TO '/var/data/users_out.csv'
WITH (FORMAT csv, HEADER true);
-- SQL Server (using BCP utility)
EXEC xp_cmdshell 'bcp "SELECT id, name, email FROM mydb.dbo.Users" queryout "C:\data\users_out.csv" -c -t, -T';
UTF-8 ensures compatibility with modern tools. Avoid locale-dependent encodings like Latin-1 unless required.
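When the database lacks a convenient export path, a few lines of client code do the job. This sketch (Python's stdlib csv and sqlite3 modules; the users table, io.StringIO stand-in for a real file, and sample row are invented) writes query results as CSV with a header derived from cursor metadata:

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INT, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'ada@example.com')")

cur = conn.execute("SELECT id, name, email FROM users")
buf = io.StringIO()  # stands in for an open file
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
writer.writerow([col[0] for col in cur.description])  # header from cursor metadata
writer.writerows(cur)  # stream rows straight from the cursor
print(buf.getvalue())
```

Streaming rows directly from the cursor keeps memory flat even for large result sets; swap the StringIO for open(path, "w", encoding="utf-8", newline="") in real use.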
JSON export
JSON export preserves structure and nested data. Engines that support JSON types can serialize rows directly from queries.
-- PostgreSQL
SELECT json_agg(t)
FROM (SELECT id, name, email FROM users) AS t;
-- MySQL 8+
SELECT JSON_ARRAYAGG(JSON_OBJECT('id', id, 'name', name, 'email', email))
FROM users;
-- SQL Server
SELECT id, name, email
FROM dbo.Users
FOR JSON AUTO;
For large result sets, add LIMIT or TOP clauses, or use pagination for manageable file sizes.
External tools and scripts
For large-scale exports, use database utilities or scripting languages with proper drivers. Examples include mysqldump, pg_dump, bcp, or Python scripts using pandas.to_csv() or json.dump(). These tools can stream large results efficiently and preserve schema metadata.
ETL considerations
ETL (Extract, Transform, Load) pipelines handle structured data flows between systems. SQL often serves as the transformation layer. Keep imports and exports idempotent, validated, and traceable.
Extract
Extract from reliable sources. Use stable queries or incremental filters (WHERE updated_at > last_run) to fetch only new or changed rows. Document query versions for reproducibility.
Transform
Perform type conversions, normalizations, and validations before loading. Use staging tables to isolate intermediate results.
-- Example: clean email data in staging
UPDATE staging_users
SET email = LOWER(TRIM(email))
WHERE email IS NOT NULL;
Load
Load transformed data efficiently. Use bulk commands (COPY, LOAD DATA INFILE, BULK INSERT) where possible. Wrap in transactions for atomicity and to allow rollback on failure.
Validation and audit
After each load, verify counts and constraints. Record timestamps and file hashes for audit trails.
-- Example validation
SELECT COUNT(*) AS src_count FROM staging_users;
SELECT COUNT(*) AS tgt_count FROM users;
Automation and scheduling
Use external schedulers or orchestration tools (cron, Airflow, etc.) to automate ETL. SQL scripts should log start and end times, row counts, and errors for monitoring.
Chapter 24: Optimization and Query Tuning
SQL performance depends on how efficiently the database engine retrieves and processes data. Query tuning involves examining execution plans, managing indexes and statistics, and understanding how caching and memory usage affect results. Even simple queries can slow down when tables grow, so systematic tuning is essential for scalable applications.
Execution plans and EXPLAIN
Every SQL query produces an execution plan (a set of steps the optimizer chooses to retrieve data). The EXPLAIN (or equivalent) command shows how the engine interprets your query, which indexes it uses, and how tables are joined or scanned.
Reading an execution plan
An execution plan reveals the order of operations, join algorithms, and data access paths (index lookup, full table scan, etc.). Use it to detect missing indexes, redundant joins, or expensive sorts.
-- MySQL / MariaDB
EXPLAIN SELECT * FROM orders WHERE customer_id = 5;
-- PostgreSQL
EXPLAIN SELECT * FROM orders WHERE customer_id = 5;
-- SQL Server
SET SHOWPLAN_TEXT ON;
GO
SELECT * FROM orders WHERE customer_id = 5;
GO
SET SHOWPLAN_TEXT OFF;
In MySQL and PostgreSQL, EXPLAIN ANALYZE (or EXPLAIN (ANALYZE, BUFFERS)) executes the query and reports actual run times and row counts.
EXPLAIN ANALYZE
SELECT customer_id, SUM(amount)
FROM orders
GROUP BY customer_id;
EXPLAIN on representative data volumes. Plans may differ dramatically between small test sets and full production sizes.
Key indicators in a plan
- Seq Scan / Table Scan — full table read; may be slow for large data.
- Index Scan / Seek — uses an index to locate rows efficiently.
- Nested Loop / Hash Join / Merge Join — join strategy; the optimizer picks the cheapest for data size and available indexes.
- Sort — sorting step; watch for large memory usage or disk spills.
- Filter — condition applied after scanning; indicates possible index improvements.
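The scan-versus-seek distinction can be observed in miniature with SQLite's EXPLAIN QUERY PLAN (used here via Python's built-in sqlite3 module as a portable demonstrator; the orders table is invented, and the exact plan text varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT)")

# Without an index on customer_id the filter forces a full scan
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 5"
).fetchall()
conn.execute("CREATE INDEX idx_customer ON orders(customer_id)")
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 5"
).fetchall()
print(before[0][3])  # SCAN ...
print(after[0][3])   # SEARCH ... USING INDEX idx_customer ...
```

The detail column flips from SCAN to SEARCH with the index, which is precisely the change you look for when reading any engine's plan output.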
Practical tuning workflow
- Profile the query using EXPLAIN or the database's profiler.
- Check that filtering columns are indexed appropriately.
- Remove unnecessary ORDER BY, DISTINCT, or subqueries.
- Consider breaking very complex queries into smaller, staged ones.
- Re-run EXPLAIN and compare cost estimates and runtime.
Index hints, statistics, and caching
Indexing, accurate statistics, and caching mechanisms are central to query performance. The optimizer relies on them to choose efficient paths. Poor statistics or unsuitable indexes can lead to slow queries even with correct logic.
Index hints
Engines usually select the best index automatically, but hints allow manual overrides when necessary for specific workloads or query patterns.
-- MySQL / MariaDB
SELECT * FROM orders USE INDEX (idx_customer_id) WHERE customer_id = 5;
-- SQL Server
SELECT * FROM dbo.Orders WITH (INDEX(idx_customer_id)) WHERE customer_id = 5;
-- PostgreSQL
-- No direct hint syntax, but you can influence plans via enable_* parameters
SET enable_seqscan = off;
SELECT * FROM orders WHERE customer_id = 5;
RESET enable_seqscan;
Updating and maintaining statistics
Statistics describe data distribution and help the optimizer estimate row counts. Outdated statistics can cause poor plan choices.
-- MySQL / MariaDB
ANALYZE TABLE orders;
-- PostgreSQL
ANALYZE orders;
-- SQL Server
UPDATE STATISTICS dbo.Orders;
Regularly analyze heavily updated tables. Most databases also collect statistics automatically, but manual refreshes may be needed after large data loads.
Caching and memory tuning
Database engines cache recently accessed pages, index blocks, and execution plans to reduce disk I/O. Proper configuration of cache size and query reuse can drastically improve performance.
- MySQL: innodb_buffer_pool_size controls the main cache; ideally set to 60–80% of system memory on dedicated servers.
- PostgreSQL: shared_buffers and work_mem affect caching and sort operations; effective_cache_size informs the planner about OS-level caching.
- SQL Server: uses dynamic memory management; tune max server memory and inspect plan cache hits with sys.dm_exec_cached_plans.
Rewriting queries for index use
Sometimes tuning requires minor rewrites to align with index structures. Common techniques include:
- Use exact column matches in WHERE clauses before applying functions (e.g., WHERE date_col >= '2025-01-01' rather than WHERE YEAR(date_col) = 2025).
- Prefer EXISTS over IN when joining large subqueries.
- Avoid SELECT *; fetch only needed columns to reduce I/O.
- Consolidate redundant joins or nested views.
Monitoring and profiling tools
Use each engine’s built-in monitoring features to detect slow queries and resource bottlenecks.
| Engine | Tools |
| MySQL / MariaDB | SHOW PROFILES, performance_schema, slow query log |
| PostgreSQL | pg_stat_statements, EXPLAIN (ANALYZE, BUFFERS) |
| SQL Server | Query Store, Dynamic Management Views (DMVs), Activity Monitor |
Denormalization and partitioning trade-offs
Normalization improves consistency and avoids redundancy, but highly normalized schemas can cause performance issues for large analytical queries or heavy joins. Denormalization stores repeated data intentionally to speed up reads, while partitioning divides tables into segments for manageability and performance.
When denormalization helps
Denormalization is useful for read-heavy systems, reporting layers, or data warehouses where query simplicity and speed outweigh small risks of redundancy. For example, storing a customer_name directly in an orders table avoids a frequent join.
-- Example of selective denormalization
-- (UPDATE with a FROM clause: PostgreSQL / SQL Server syntax; MySQL uses a multi-table UPDATE)
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100);
UPDATE orders
SET customer_name = c.name
FROM customers c
WHERE c.id = orders.customer_id;
Partitioning strategies
Partitioning splits large tables into smaller physical units (called partitions) that are queried transparently. It improves performance by reducing the data scanned, and aids in maintenance and archiving.
-- PostgreSQL: range partitioning by date
-- (assumes orders was created with PARTITION BY RANGE (order_date))
CREATE TABLE orders_2025 PARTITION OF orders
FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
-- MySQL: range partitioning
CREATE TABLE orders (
id BIGINT,
order_date DATE,
amount DECIMAL(10,2)
)
PARTITION BY RANGE (YEAR(order_date)) (
PARTITION p2024 VALUES LESS THAN (2025),
PARTITION p2025 VALUES LESS THAN (2026)
);
PostgreSQL, MySQL, and Oracle support range, list, and hash partitioning natively. SQL Server uses partitioned tables with schemes and functions.
Balancing normalization, denormalization, and partitioning
Combine approaches where each fits best:
- Keep OLTP (transactional) data normalized for integrity and small updates.
- Use denormalized or materialized summary tables for analytics.
- Partition large tables to reduce maintenance time and improve archival performance.
Chapter 25: Database Design Patterns
Relational design patterns help you express business rules as tables, keys, and constraints. This chapter focuses on the most common relationship shapes and the practical details that keep data consistent (and queries fast). You will see how to model one to many and many to many, how to implement junction tables for flexibility, and how to plan safe schema changes with versioned migrations.
One-to-many relationships
In a one to many relationship, a single parent row is associated with zero or more child rows. Typical examples are authors to books, or departments to employees. The usual implementation is a primary key on the parent and a foreign key from the child to that parent.
Canonical layout for 1:N
-- Parent
CREATE TABLE authors (
author_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name VARCHAR(200) NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Child (many side)
CREATE TABLE books (
book_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
author_id INTEGER NOT NULL,
title VARCHAR(300) NOT NULL,
published_on DATE,
CONSTRAINT fk_books_author
FOREIGN KEY (author_id)
REFERENCES authors(author_id)
ON UPDATE CASCADE
ON DELETE RESTRICT
);
This pattern keeps the relationship in the child, which simplifies inserts for new children and avoids nullable columns in the parent.
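The guarantee the foreign key provides can be exercised directly. This sketch uses Python's built-in sqlite3 module (SQLite dialect; note SQLite enforces foreign keys only after PRAGMA foreign_keys = ON, and the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE books (
        book_id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL
            REFERENCES authors(author_id) ON DELETE RESTRICT,
        title TEXT NOT NULL
    );
    INSERT INTO authors VALUES (1, 'Ada Lovelace');
    INSERT INTO books VALUES (1, 1, 'Notes');
""")
try:
    conn.execute("DELETE FROM authors WHERE author_id = 1")
    blocked = False
except sqlite3.IntegrityError:
    blocked = True  # RESTRICT refuses to orphan the child row
print(blocked)  # True
```

The delete of a parent with surviving children fails with an integrity error, which is exactly the historical-integrity behavior RESTRICT promises.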
Choosing ON DELETE behavior
Choose a deletion rule that matches your business rules. The following table summarizes common options.
| Rule | Effect on children | When to use |
| RESTRICT | Deletion blocked if children exist | Historical integrity (you must move or delete children first) |
| CASCADE | Children deleted automatically | Dependent data that has no meaning without the parent |
| SET NULL | Child foreign key set to NULL | Children can survive without a parent (must allow NULL) |
| SET DEFAULT | Child foreign key set to default value | Route orphans to a default record (rare but useful) |
SQLite requires PRAGMA foreign_keys = ON to enforce constraints, and SET DEFAULT requires a defined default value on the column.
Indexing FOREIGN KEY columns
Always index the child foreign key column for fast joins and to avoid table scans during parent deletes or updates.
CREATE INDEX ix_books_author_id ON books(author_id);
The parent primary key is indexed automatically. The child index helps both JOIN performance and referential checks.
Enforcing optional vs required relationships
Use NOT NULL on the child foreign key to make the relationship required. Allow NULL for optional relationships.
-- Required relationship
ALTER TABLE books
ALTER COLUMN author_id SET NOT NULL; -- Dialect may vary …
Many-to-many relationships
Many to many arises when multiple rows on each side can relate to multiple rows on the other side. Examples include students and courses, or posts and tags. Implement this with a dedicated junction table that holds pairs of foreign keys.
Canonical layout for M:N
CREATE TABLE students (
student_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
full_name VARCHAR(200) NOT NULL
);
CREATE TABLE courses (
course_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
code VARCHAR(20) NOT NULL UNIQUE,
title VARCHAR(200) NOT NULL
);
-- Junction table (no own identity needed unless you add attributes)
CREATE TABLE student_course (
student_id INTEGER NOT NULL,
course_id INTEGER NOT NULL,
enrolled_on DATE NOT NULL DEFAULT CURRENT_DATE,
CONSTRAINT pk_student_course PRIMARY KEY (student_id, course_id),
CONSTRAINT fk_sc_student FOREIGN KEY (student_id)
REFERENCES students(student_id)
ON DELETE CASCADE,
CONSTRAINT fk_sc_course FOREIGN KEY (course_id)
REFERENCES courses(course_id)
ON DELETE CASCADE
);
CREATE INDEX ix_sc_course ON student_course(course_id);
The composite primary key prevents duplicate pairs. Index both sides for symmetric join performance.
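That duplicate-pair guarantee is worth a quick test. A sketch using Python's built-in sqlite3 module (SQLite dialect; the sample ids are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student_course (
        student_id INTEGER NOT NULL,
        course_id  INTEGER NOT NULL,
        PRIMARY KEY (student_id, course_id)
    )
""")
conn.execute("INSERT INTO student_course VALUES (42, 7)")
try:
    conn.execute("INSERT INTO student_course VALUES (42, 7)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True  # composite key blocks the duplicate pair
print(duplicate_rejected)  # True
```

The second insert of the same (student_id, course_id) pair fails at the database level, so no application-side deduplication is needed.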
Adding attributes to relationships
Relationship attributes (for example role, grade, position) belong in the junction table because they describe the pairing.
ALTER TABLE student_course
ADD COLUMN grade VARCHAR(2); -- e.g., 'A', 'B', …
Querying across M:N
-- All courses for a given student
SELECT c.course_id, c.code, c.title
FROM student_course sc
JOIN courses c ON c.course_id = sc.course_id
WHERE sc.student_id = 42;
-- All students in a course
SELECT s.student_id, s.full_name
FROM student_course sc
JOIN students s ON s.student_id = sc.student_id
WHERE sc.course_id = 7;
A composite index on (student_id, course_id) is already provided by the primary key; the extra index on (course_id) makes the second query efficient.
Anti-patterns to avoid
Avoid storing multiple ids in a single column (comma separated values) or variable numbers of columns (for example tag1_id, tag2_id, tag3_id). These break normalization and make queries and constraints unreliable.
Junction tables and referential integrity
Junction tables are the workhorse for flexible relationships. Keep them narrow and well indexed, then enforce integrity with keys and constraints.
Composite keys vs surrogate keys
Use a composite primary key of the two foreign keys when the pair must be unique and you rarely reference the relationship row directly. Add a surrogate key (for example id) only when the relationship itself is a first-class entity (for example you attach many child rows to the relationship or you reference it from other tables).
-- Surrogate key variant for richer relationship entities
CREATE TABLE membership (
membership_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
user_id INTEGER NOT NULL,
team_id INTEGER NOT NULL,
role VARCHAR(40) NOT NULL,
started_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT uq_membership UNIQUE (user_id, team_id),
CONSTRAINT fk_m_user FOREIGN KEY (user_id) REFERENCES users(user_id),
CONSTRAINT fk_m_team FOREIGN KEY (team_id) REFERENCES teams(team_id)
);
Ensuring uniqueness and preventing duplicates
Provide a uniqueness guarantee either through a composite primary key or a separate UNIQUE constraint. This avoids accidental duplicate relationships.
-- If you chose a surrogate key, add an explicit UNIQUE pair
CREATE UNIQUE INDEX ux_membership_user_team
ON membership(user_id, team_id);
Maintaining integrity with CHECK constraints
When additional rules apply (for example role values, valid date ranges, non overlapping intervals), enforce them with CHECK constraints so the database guards your data consistently.
ALTER TABLE membership
ADD CONSTRAINT ck_membership_role
CHECK (role IN ('owner','admin','member'));
Some engines have historically ignored CHECK expressions or require enabling constraint enforcement explicitly. Verify your target engine and test migrations with real data.
Cascade rules that fit reality
In junction tables, ON DELETE CASCADE on both foreign keys is often appropriate because the relationship has no meaning if either side disappears. If junction rows carry historical facts that must remain, prefer RESTRICT with application policies that retire records instead of deleting them.
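A minimal demonstration of the cascade behavior, using SQLite via Python (note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON; table names are illustrative):

```python
# Sketch: ON DELETE CASCADE on a junction table, shown in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this for enforcement
conn.execute("CREATE TABLE students (student_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE courses (course_id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE student_course (
        student_id INTEGER REFERENCES students(student_id) ON DELETE CASCADE,
        course_id  INTEGER REFERENCES courses(course_id)  ON DELETE CASCADE,
        PRIMARY KEY (student_id, course_id)
    )
""")
conn.execute("INSERT INTO students VALUES (1), (2)")
conn.execute("INSERT INTO courses VALUES (7)")
conn.execute("INSERT INTO student_course VALUES (1, 7), (2, 7)")

# Deleting a student removes that student's junction rows automatically.
conn.execute("DELETE FROM students WHERE student_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM student_course").fetchone()[0]
print(remaining)  # only student 2's enrollment survives
```

The junction row for student 1 disappears with the student, which matches the rule that a pairing has no meaning once either side is gone.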
Schema versioning and migration strategies
Databases evolve. Treat schema changes as versioned, repeatable steps that can be applied forward (and sometimes backward) across environments. Your goal is reliability, auditability, and minimal downtime.
Versioning with a migrations ledger
Maintain a dedicated table that records applied migrations. Each migration has an ordered id, a description, a checksum, and timestamps.
CREATE TABLE schema_migrations (
version VARCHAR(50) PRIMARY KEY, -- e.g., 20251103_1400_add_index
description VARCHAR(200) NOT NULL,
checksum VARCHAR(64) NOT NULL,
applied_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
Store migration scripts in version control with filenames such as V20251103_1400__add_index.sql. The double underscore separates the version and description; the rest is free text.
Forward-only vs reversible migrations
Forward only favors simplicity (roll forward with new scripts to fix issues). Reversible migrations include an UP and DOWN step that can revert changes when safe. Use reversible for non destructive changes; use forward only when data loss or complex transformations make reversal unsafe.
| Change | Safe to reverse | Notes |
| Add column (nullable) | Yes | Provide DOWN to drop the column |
| Add index | Yes | DOWN drops the index |
| Rename column | Sometimes | Use two phase copy with populated values … |
| Drop column/table | No | Forward only; export data first |
Zero downtime patterns
When your application must remain available, use additive, compatible changes first, followed by code rollout, then cleanup.
-- Phase 1: additive
ALTER TABLE orders ADD COLUMN order_total_cents BIGINT;
-- Backfill in batches (pseudo code shown in SQL for brevity)
UPDATE orders
SET order_total_cents = (SELECT SUM(quantity * unit_price_cents) FROM order_lines ol
WHERE ol.order_id = orders.order_id)
WHERE order_total_cents IS NULL
AND order_id BETWEEN :lo AND :hi;
-- Phase 2: deploy code reading the new column
-- Phase 3: make non-nullable once all rows are backfilled
ALTER TABLE orders ALTER COLUMN order_total_cents SET NOT NULL; -- Dialect may vary …
Idempotent and environment aware scripts
Migrations should be idempotent where possible. Use IF NOT EXISTS guards or detect state before applying changes, then record the version in schema_migrations.
-- Example guard (dialect specific)
CREATE TABLE IF NOT EXISTS audit_log (
audit_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
actor VARCHAR(100) NOT NULL,
action VARCHAR(100) NOT NULL,
occurred_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
Data migrations with validation
Structural changes often require moving data. Pair transformation steps with validation queries so you can prove correctness before cleanup.
-- Transform
UPDATE membership SET role = 'member' WHERE role = 'basic';
-- Validate (counts must match)
SELECT COUNT(*) FROM membership WHERE role = 'basic'; -- Expect 0
Tooling and workflow
Adopt a migration tool that suits your stack (for example pure SQL runners or code based tools). Integrate migrations into continuous integration so every deployment applies the exact same steps. Gate releases on successful migration application and verification queries.
Keep every migration small, isolated, and named for its purpose. Avoid modifying old migration files after they have shipped. If you must change history, create a new migration that corrects the state.
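The ledger-driven workflow above can be sketched as a tiny forward-only runner. This is an illustration, not a real tool: the migration ids and statements are invented, scripts would normally be read from versioned files, and the in-memory SQLite database stands in for your target engine.

```python
# Sketch: a minimal forward-only migration runner with a schema_migrations ledger.
import hashlib
import sqlite3

migrations = {  # normally loaded from files such as V20250101_0900__create_users.sql
    "20250101_0900_create_users": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
    "20250102_0900_add_email": "ALTER TABLE users ADD COLUMN email TEXT",
}

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS schema_migrations (
        version    TEXT PRIMARY KEY,
        checksum   TEXT NOT NULL,
        applied_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")

def apply_pending(conn, migrations):
    applied = {v for (v,) in conn.execute("SELECT version FROM schema_migrations")}
    for version in sorted(migrations):  # ordered ids apply deterministically
        if version in applied:
            continue  # already recorded: skip, which makes reruns safe
        sql = migrations[version]
        conn.execute(sql)
        conn.execute(
            "INSERT INTO schema_migrations (version, checksum) VALUES (?, ?)",
            (version, hashlib.sha256(sql.encode()).hexdigest()),
        )
    conn.commit()

apply_pending(conn, migrations)
apply_pending(conn, migrations)  # second run is a no-op: the ledger prevents reapplication
count = conn.execute("SELECT COUNT(*) FROM schema_migrations").fetchone()[0]
print(count)  # 2
```

The checksum lets the runner detect a shipped script that was edited after the fact, which is exactly the situation the "never modify old migrations" rule guards against.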
Chapter 26: Integration with Applications
Applications talk to databases through drivers and connection libraries. This chapter shows practical patterns for using SQL from Python, Java, and JavaScript. You will learn how to connect, execute parameterized queries, handle transactions, and avoid security mistakes that lead to injection and data leaks.
SQL in Python (sqlite3/mysql-connector-python)
Python ships with sqlite3 and supports other engines through third party packages. The essential steps are: create a connection, get a cursor, execute statements with parameters, then commit or roll back.
Connecting and querying with sqlite3
# Python 3
import sqlite3
conn = sqlite3.connect("app.db", isolation_level=None)  # isolation_level=None means autocommit mode
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT UNIQUE)")
cur.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("Robin", "r@example.com"))
cur.execute("SELECT id, name, email FROM users WHERE email = ?", ("r@example.com",))
row = cur.fetchone()
print(row)
conn.commit()
conn.close()
Use ? placeholders in sqlite3. The driver binds values safely and handles quoting.
Connecting to MySQL with mysql-connector-python
# pip install mysql-connector-python
import mysql.connector as mysql
conn = mysql.connect(
host="localhost",
user="app_user",
password="…",
database="appdb",
autocommit=False
)
with conn.cursor() as cur:
cur.execute("INSERT INTO accounts (email, plan) VALUES (%s, %s)", ("a@example.com", "pro"))
cur.execute("SELECT id, plan FROM accounts WHERE email = %s", ("a@example.com",))
print(cur.fetchone())
conn.commit()
conn.close()
MySQL drivers use %s placeholders. Do not interpolate strings directly. Let the driver bind values.
Transactions and context managers
from contextlib import contextmanager
import sqlite3
@contextmanager
def tx(db_path):
conn = sqlite3.connect(db_path)
try:
yield conn
conn.commit()
except Exception:
conn.rollback()
raise
finally:
conn.close()
with tx("app.db") as conn:
conn.execute("UPDATE balances SET cents = cents - ? WHERE user_id = ?", (500, 1))
conn.execute("UPDATE balances SET cents = cents + ? WHERE user_id = ?", (500, 2))
Note that sqlite3 uses file level locking, which limits concurrent writes.
SQL in Java (JDBC)
JDBC provides a uniform interface to many databases. Use a connection pool (HikariCP or similar), prepare statements with placeholders, and manage transactions with try with resources.
Basic JDBC pattern with prepared statements
// Requires driver on classpath, for example MySQL: com.mysql:mysql-connector-j …
import java.sql.*;
public class Repo {
private final DataSource ds;
public Repo(DataSource ds) { this.ds = ds; }
public User findByEmail(String email) throws SQLException {
String sql = "SELECT id, name, email FROM users WHERE email = ?";
try (Connection c = ds.getConnection();
PreparedStatement ps = c.prepareStatement(sql)) {
ps.setString(1, email);
try (ResultSet rs = ps.executeQuery()) {
if (rs.next()) {
return new User(rs.getLong("id"), rs.getString("name"), rs.getString("email"));
}
return null;
}
}
}
public void transfer(long fromId, long toId, long cents) throws SQLException {
String debit = "UPDATE balances SET cents = cents - ? WHERE user_id = ?";
String credit = "UPDATE balances SET cents = cents + ? WHERE user_id = ?";
try (Connection c = ds.getConnection()) {
c.setAutoCommit(false);
try (PreparedStatement d = c.prepareStatement(debit);
PreparedStatement cr = c.prepareStatement(credit)) {
d.setLong(1, cents); d.setLong(2, fromId); d.executeUpdate();
cr.setLong(1, cents); cr.setLong(2, toId); cr.executeUpdate();
c.commit();
} catch (SQLException ex) {
c.rollback();
throw ex;
}
}
}
}
HikariCP is a widely used default in modern stacks.
Placeholder styles across drivers
| Driver | Placeholder form | Example |
| MySQL / PostgreSQL | ? | WHERE email = ? |
| Oracle | :name | WHERE email = :email |
| SQL Server | ? (in PreparedStatement) | WHERE id = ? |
SQL in JavaScript (mysql2/pg/better-sqlite3)
In Node, use battle tested drivers. Prefer parameter arrays or named parameters. For pooling, create a singleton pool for the process and reuse it.
MySQL with mysql2 and a pool
// npm i mysql2
const mysql = require("mysql2/promise");
const pool = mysql.createPool({
uri: "mysql://app_user:…@localhost:3306/appdb",
connectionLimit: 10
});
async function getUser(email) {
const [rows] = await pool.query(
"SELECT id, name FROM users WHERE email = ?",
[email]
);
return rows[0] || null;
}
PostgreSQL with pg
// npm i pg
const { Pool } = require("pg");
const pool = new Pool({ connectionString: "postgres://app_user:…@localhost/appdb" });
async function transfer(client, fromId, toId, cents) {
await client.query("UPDATE balances SET cents = cents - $1 WHERE user_id = $2", [cents, fromId]);
await client.query("UPDATE balances SET cents = cents + $1 WHERE user_id = $2", [cents, toId]);
}
async function run() {
const client = await pool.connect();
try {
await client.query("BEGIN");
await transfer(client, 1, 2, 500);
await client.query("COMMIT");
} catch (e) {
await client.query("ROLLBACK");
throw e;
} finally {
client.release();
}
}
SQLite with better-sqlite3 (sync API)
// npm i better-sqlite3
const Database = require("better-sqlite3");
const db = new Database("app.db");
db.prepare("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT, email TEXT UNIQUE)").run();
db.prepare("INSERT INTO users (name, email) VALUES (?, ?)").run("Robin", "r@example.com");
const row = db.prepare("SELECT id, name FROM users WHERE email = ?").get("r@example.com");
console.log(row);
Whichever driver you use, bind values through its placeholder style: ?, $1, or driver specific named parameters.
Avoiding SQL injection
SQL injection happens when user input becomes part of the SQL syntax. The primary defense is parameterized statements. Additional defenses include least privilege accounts, input validation for shape and length, and escaping only as a last resort when a driver lacks parameters for special cases.
Good vs bad patterns
# Bad (string concatenation)
sql = "SELECT * FROM users WHERE email = '" + email + "'"  # vulnerable
# Good (parameters)
sql = "SELECT * FROM users WHERE email = ?"
cur.execute(sql, (email,))
// Bad in Node
const sql = `SELECT * FROM users WHERE id = ${req.query.id}`; // vulnerable
// Good
const [rows] = await pool.query("SELECT * FROM users WHERE id = ?", [req.query.id]);
Dynamic filters without concatenation
# Python example that builds WHERE safely
base = "SELECT id, name FROM products WHERE 1=1"
conds = []
vals = []
if q := params.get("q"):
conds.append("AND name ILIKE ?")
vals.append("%" + q + "%")
if brand := params.get("brand"):
conds.append("AND brand = ?")
vals.append(brand)
sql = " ".join([base] + conds)
cur.execute(sql, tuple(vals))
Handling LIKE wildcards and escaping
Bind the pattern as a parameter and escape driver specific wildcard characters in the value when necessary. Do not inline user input inside quotes.
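A small sketch of this, using SQLite via Python: the user's own % and _ characters are escaped inside the value, the pattern is bound as a parameter, and an explicit ESCAPE clause tells the engine which character is the escape (table and data are invented):

```python
# Sketch: safe LIKE search with user input that contains wildcard characters.
import sqlite3

def escape_like(term, esc="\\"):
    # escape the escape character first, then the LIKE wildcards
    return term.replace(esc, esc + esc).replace("%", esc + "%").replace("_", esc + "_")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.executemany("INSERT INTO products VALUES (?)", [("100% cotton",), ("100g cotton",)])

user_input = "100%"  # the user means a literal percent sign, not a wildcard
pattern = "%" + escape_like(user_input) + "%"
rows = conn.execute(
    "SELECT name FROM products WHERE name LIKE ? ESCAPE '\\'",
    (pattern,),
).fetchall()
print(rows)  # only the row containing a literal '%' matches
```

Without the escaping step, "100%" would match both rows because % acts as a wildcard; with it, only the literal match returns.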
Stored procedures and ORMs are not automatic protection
Procedures and ORMs can still be vulnerable if they concatenate strings. Verify parameter binding and review generated SQL. Inspect logs for the final SQL text with placeholders rather than values.
Security pitfalls
Security depends on layers. The following items cover frequent mistakes and the safer alternative.
Secrets in code and connection strings
| Pitfall | Safer approach |
| Hardcoded passwords in source | Use environment variables or a secrets manager; rotate regularly |
| Plain text connection strings in logs | Redact credentials; configure logger filters |
| Credentials shared across services | Use per service principals with least privileges |
Network and transport
Require TLS for connections, validate server certificates, and restrict inbound access to the database from application subnets only. Disable direct public access. Use firewall rules or security groups that allow only known application nodes.
Authorization inside the database
Create dedicated roles for application tasks, then GRANT only the permissions required. Avoid using a superuser or root in production. Separate read and write roles. For admin jobs, use a separate connection and role with elevated privileges.
Migration safety
Schema changes can cause outages when long locks block queries. Prefer additive changes, backfill in batches, and create needed indexes before setting columns to NOT NULL. Run heavy operations during maintenance windows.
Exposing raw errors
Never return database error messages directly to users. Map them to generic messages. Log full details to a secure sink for operators to review.
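One way to apply this in application code, sketched in Python with sqlite3 (the function and table names are illustrative): catch the driver's exception, log the full detail server side, and surface only a generic message.

```python
# Sketch: map raw database errors to a generic user-facing message
# while logging full detail to an operator-only sink.
import logging
import sqlite3

log = logging.getLogger("db.audit")

def safe_lookup(conn, email):
    try:
        return conn.execute("SELECT id FROM users WHERE email = ?", (email,)).fetchone()
    except sqlite3.Error as exc:
        log.error("query failed: %s", exc)  # full detail stays on the server
        raise RuntimeError("Service temporarily unavailable") from None

conn = sqlite3.connect(":memory:")  # no users table exists, so the query fails
try:
    safe_lookup(conn, "r@example.com")
except RuntimeError as e:
    message = str(e)
print(message)  # the caller never sees the SQL error text
```

The `from None` suppresses exception chaining so the original database error cannot leak through a traceback returned to the client.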
Resource exhaustion
Use connection pools with sensible limits. Set query timeouts and statement timeouts. Monitor for slow queries and add indexes where appropriate. Guard endpoints with rate limits so expensive queries cannot be spammed.
Chapter 27: Working with Large Data Sets
As data volume grows, a single database instance or table can become a bottleneck. Large data sets require strategies that spread data across partitions, nodes, or even multiple engines while still providing a coherent query interface. This chapter introduces partitioning, sharding, and distributed query techniques used by modern systems to manage scale without losing consistency.
Partitioning and sharding basics
Partitioning divides one logical table into multiple physical parts within the same database. Sharding goes a step further, splitting data across multiple database servers. Both approaches improve performance by limiting how much data each query must scan and by parallelizing operations.
Horizontal and vertical partitioning
Partitioning can happen by rows (horizontal) or columns (vertical):
| Type | Description | Use case |
| Horizontal | Each partition holds a subset of rows based on a key (for example date, region, id range) | Common in time series or multi tenant applications |
| Vertical | Split columns into separate tables, often by access pattern | Improves performance when some columns are rarely queried |
Example: range partitioning by date
Most engines support declarative partitioning syntax. This example shows PostgreSQL style range partitions.
CREATE TABLE sales (
id BIGSERIAL PRIMARY KEY,
region TEXT NOT NULL,
sale_date DATE NOT NULL,
amount NUMERIC(10,2)
) PARTITION BY RANGE (sale_date);
CREATE TABLE sales_2025q1 PARTITION OF sales
FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');
CREATE TABLE sales_2025q2 PARTITION OF sales
FOR VALUES FROM ('2025-04-01') TO ('2025-07-01');
Queries automatically use the correct partition when filtering by the partition key, and each partition can have its own index and storage parameters.
Hash partitioning
Hash partitioning distributes rows more evenly when ranges are not predictable, such as user ids or session keys.
CREATE TABLE sessions (
session_id BIGINT,
user_id BIGINT,
started_at TIMESTAMP,
PRIMARY KEY (session_id)
) PARTITION BY HASH (user_id);
CREATE TABLE sessions_p0 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE sessions_p1 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE sessions_p2 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE sessions_p3 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 3);
Sharding across servers
Sharding divides data across multiple servers (each with its own schema). Applications or middleware determine which shard to query based on a shard key. For example, users 1–10000 on shard 1, 10001–20000 on shard 2, and so on.
# Pseudo code
def shard_for_user(user_id):
shard_index = (user_id % NUM_SHARDS)
return connections[shard_index]
conn = shard_for_user(15234)
cur = conn.cursor()
cur.execute("SELECT * FROM orders WHERE user_id = %s", (15234,))
Each shard runs independently, and joins across shards require application logic or special frameworks to merge results.
Rebalancing and growth
Shards can fill unevenly as data grows. To rebalance, resharding tools copy data to new nodes based on a new key distribution. Use consistent hashing to minimize the amount of data moved when adding shards.
-- Conceptual representation
Hash(user_id) % N_shards -- determines target shard
-- When N_shards increases, only a fraction of data moves
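A runnable sketch of that property, assuming a simple hash ring with virtual nodes (shard names and counts are invented): when a fourth shard joins a three-shard ring, only the keys that land between the new node's ring positions and their predecessors move.

```python
# Sketch: consistent hashing ring showing that adding a shard
# moves only a fraction of keys, not all of them.
import bisect
import hashlib

def h(value):
    # stable hash onto the ring; md5 used only for even distribution
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

class Ring:
    def __init__(self, shards, vnodes=64):
        # each shard gets many virtual points on the ring for smoother balance
        self.points = sorted((h(f"{s}-{i}"), s) for s in shards for i in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def shard_for(self, key):
        i = bisect.bisect(self.keys, h(key)) % len(self.keys)
        return self.points[i][1]

before = Ring(["shard1", "shard2", "shard3"])
after = Ring(["shard1", "shard2", "shard3", "shard4"])
moved = sum(1 for uid in range(10_000) if before.shard_for(uid) != after.shard_for(uid))
print(moved / 10_000)  # roughly 1/4 of keys relocate, not all of them
```

With plain modulo hashing, growing from 3 to 4 shards would relocate about three quarters of all keys; the ring keeps the moved fraction near 1/N.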
Federated or external partitions
Some databases (for example MySQL with the FEDERATED engine, or PostgreSQL with postgres_fdw) support tables that point to data in remote servers. These act as partitions hosted elsewhere but queried locally.
CREATE SERVER remote_db
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'reports.example.com', dbname 'analytics');
CREATE FOREIGN TABLE analytics.sales_2025
(id BIGINT, region TEXT, amount NUMERIC)
SERVER remote_db
OPTIONS (table_name 'sales_2025');
Federated queries and distributed SQL
Federated and distributed SQL systems allow a single query to span multiple nodes or databases as if they were one. They handle data location, parallel execution, and result aggregation transparently to the client.
Federated queries
Federation is common when integrating heterogeneous sources such as operational and analytics databases. Tools like postgres_fdw, mysql_fdw, or cross engine layers like Presto or Trino can query across data sets and merge results at runtime.
SELECT o.id, o.amount, c.region
FROM orders o
JOIN foreign_customers c ON o.customer_id = c.id
WHERE c.region = 'EU';
The federated layer manages connections to both data sources and streams results back to the client. This allows unified analysis without duplicating data, though at a performance cost for large joins.
Distributed SQL engines
Distributed SQL platforms (for example CockroachDB, YugabyteDB, Google Spanner) distribute both data and queries automatically. They replicate data across nodes, provide ACID transactions, and scale horizontally without manual sharding.
| System | Model | Key features |
| CockroachDB | Shared nothing, range based | Automatic rebalancing, PostgreSQL compatible |
| YugabyteDB | Tablet partitioned, PostgreSQL compatible | Strong consistency, global replication |
| Google Spanner | Globally distributed, TrueTime API | Serializable transactions, global scale |
These systems appear as a single logical database but partition data internally. They use consensus protocols (such as Raft or Paxos) to coordinate replicas and ensure consistency.
Trade offs and considerations
Distributed SQL simplifies scaling but introduces coordination overhead. Latency increases when replicas span regions, and schema changes may propagate slowly. Evaluate whether your workload requires true distributed consistency or can tolerate eventual models (for example in caching or analytics).
Query federation vs data replication
Federation queries remote data on demand. Replication copies data locally. Federation reduces duplication but adds latency; replication speeds queries but needs synchronization jobs. Many systems combine both—federating rarely used sources while replicating high traffic tables.
Practical scaling workflow
- Start with a single node, strong indexes, and tuned queries.
- Add read replicas to handle heavy read workloads.
- Introduce partitioning for large tables (by date, tenant, or id ranges).
- Move to sharding or distributed SQL only when you outgrow single host capacity.
Chapter 28: Modern Extensions and SQL 2023 Features
SQL continues to evolve beyond its original relational roots. Modern standards extend it with richer data types, JSON support, table value constructors, and new syntactic features that make queries more expressive and concise. SQL:2023 builds on earlier versions by standardizing patterns already used across major engines and introducing new capabilities for analytics and JSON handling.
JSON data types and functions
SQL:2023 consolidates JSON handling that many engines already implemented independently. It introduces a formal JSON data type and standardized functions for creation, extraction, and validation. This allows relational and semi structured data to coexist efficiently within the same schema.
Storing JSON values
Many engines (PostgreSQL, MySQL, SQL Server, Oracle) already provide a JSON or JSONB type. SQL:2023 defines a general form:
CREATE TABLE events (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
payload JSON NOT NULL,
received TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO events (payload)
VALUES (JSON_OBJECT('type': 'click', 'user': 'r@example.com', 'x': 120, 'y': 480));
The JSON_OBJECT constructor creates JSON text directly from SQL expressions. The standard defines JSON_ARRAY, JSON_OBJECTAGG, JSON_ARRAYAGG, and more.
Querying JSON fields
Use JSON_VALUE, JSON_QUERY, and JSON_EXISTS to extract or filter by parts of a JSON document.
SELECT JSON_VALUE(payload, '$.user') AS username
FROM events
WHERE JSON_VALUE(payload, '$.type') = 'click';
JSON_VALUE extracts a scalar; JSON_QUERY extracts an object or array; JSON_EXISTS tests for path existence.
Validation and constraints
The standard adds IS JSON for structural validation and defines JSON_SCHEMA_VALIDATION for checking against schema definitions (implementation dependent).
SELECT id
FROM events
WHERE NOT (payload IS JSON VALUE);
PostgreSQL uses -> and ->> operators; MySQL uses JSON_EXTRACT() and ->; SQL Server uses JSON_VALUE(). SQL:2023 standardizes these under unified function names but adoption may differ.
Indexing JSON data
Some engines support functional or path based indexes for fast JSON lookups. For example, PostgreSQL can index computed expressions:
CREATE INDEX ix_events_user
ON events ((payload->>'user'));
Other engines use virtual columns to materialize JSON keys and index them. Standard SQL now allows generated columns based on JSON_VALUE().
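The same idea can be tried out in SQLite through Python, assuming a build with JSON support (standard in modern Python distributions): an expression index over json_extract plays the role of PostgreSQL's index on (payload->>'user').

```python
# Sketch: an expression index over a JSON path in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)")
# index the extracted JSON value so equality lookups can use it
conn.execute("CREATE INDEX ix_events_user ON events (json_extract(payload, '$.user'))")
conn.execute(
    "INSERT INTO events (payload) VALUES (?), (?)",
    ('{"type": "click", "user": "r@example.com"}',
     '{"type": "view", "user": "a@example.com"}'),
)
row = conn.execute(
    "SELECT id FROM events WHERE json_extract(payload, '$.user') = ?",
    ("r@example.com",),
).fetchone()
print(row)
```

The WHERE clause must use the same expression as the index definition for the engine to consider the index.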
Table value constructors
Table value constructors (TVCs) let you create in-memory sets of rows directly within a query. They are part of the SQL standard but expanded in SQL:2023 for better portability and syntax clarity.
Basic inline table construction
SELECT *
FROM (VALUES
(1, 'Alice', 'UK'),
(2, 'Bob', 'US'),
(3, 'Chen', 'CN')
) AS temp(id, name, country);
This creates a virtual table for quick testing or joins without a permanent table. It is especially useful for parameter lists or lookups.
Using VALUES in INSERT
You can use TVCs to insert multiple rows in a single statement:
INSERT INTO products (id, name, price)
VALUES
(1, 'Pen', 1.20),
(2, 'Pencil', 0.80),
(3, 'Notebook', 3.50);
Combining with JOIN and IN clauses
Table constructors integrate naturally with joins and subqueries:
SELECT p.id, p.name, v.country
FROM products p
JOIN (VALUES
('UK'), ('US')
) AS v(country)
ON p.origin = v.country;
Expanded forms in SQL:2023
SQL:2023 clarifies that table constructors can appear in FROM, WITH, and subquery expressions uniformly, and allows use of default column names when omitted. It also standardizes the optional TABLE keyword for readability.
SELECT *
FROM TABLE (VALUES (1, 'X'), (2, 'Y'));
New syntax updates in SQL:2023
The SQL:2023 revision adds multiple clarifications and new constructs across core syntax, JSON, temporal features, and analytic expressions. The main updates include:
Expanded MERGE capabilities
MERGE now supports multiple WHEN MATCHED and WHEN NOT MATCHED clauses with conditional logic. This makes upserts and complex synchronization operations more expressive.
MERGE INTO customers AS c
USING new_data AS n
ON (c.id = n.id)
WHEN MATCHED AND n.status = 'inactive' THEN
UPDATE SET c.active = FALSE
WHEN MATCHED THEN
UPDATE SET c.email = n.email
WHEN NOT MATCHED THEN
INSERT (id, email, active) VALUES (n.id, n.email, TRUE);
Enhanced WINDOW clause
The WINDOW clause can now be reused across multiple analytic functions in the same query, avoiding repetition of PARTITION BY and ORDER BY definitions.
SELECT
region,
SUM(sales) OVER win AS total_sales,
RANK() OVER win AS rank_in_region
FROM sales
WINDOW win AS (PARTITION BY region ORDER BY sales DESC);
Expanded INTERVAL and temporal precision
SQL:2023 refines temporal arithmetic, allowing microsecond precision and standardizing INTERVAL literals across engines.
SELECT CURRENT_TIMESTAMP + INTERVAL '3.5' SECOND AS in_future;
Standardized TRIM_ARRAY, CONTAINS_SUBSTRING, and MULTISET functions
New built-ins extend operations on strings, arrays, and multisets. These provide cross-engine consistency for manipulating structured values.
SELECT TRIM_ARRAY(ARRAY[1, 2, 3, 4], 2) AS shortened, -- removes the last two elements
CONTAINS_SUBSTRING('This is SQL', 'SQL') AS found;
Row pattern recognition refinements
SQL:2023 builds on MATCH_RECOGNIZE (introduced in SQL:2016) to better support event stream analytics and state detection. This allows declarative sequence pattern queries.
SELECT *
FROM trades
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY ts
MEASURES FIRST(A.ts) AS start_time, LAST(C.ts) AS end_time
PATTERN (A B* C)
DEFINE
A AS price < 100,
B AS price BETWEEN 100 AND 110,
C AS price > 110
);
Summary of adoption
| Feature | Introduced | Common support (2025) |
| JSON data type | SQL:2023 | PostgreSQL, MySQL, SQL Server, Oracle (variant syntax) |
| Table value constructors | SQL:99 – expanded in 2023 | All major engines |
| Enhanced MERGE | SQL:2023 | Partial support (SQL Server, Oracle) |
| Reusable WINDOW clause | SQL:2023 | PostgreSQL, Oracle |
| Row pattern recognition | SQL:2016 – refined 2023 | Oracle, Snowflake, Trino (partial) |
SQL remains a living standard. Each release brings the relational model closer to a unified language for both structured and semi structured data, capable of scaling from small embedded systems to distributed analytical clusters.
Chapter 29: SQL and Analytics
SQL is not only a language for defining and modifying data; it is also a powerful tool for analysis and reporting. Its aggregation, windowing, and statistical functions can produce detailed insights directly within the database, reducing the need for external processing. This chapter explores how to use SQL for analytical workloads and integrate it into business intelligence pipelines.
Aggregate and window functions for reporting
Aggregate functions summarize multiple rows into single results. Window functions extend this idea by allowing aggregates over specific partitions of data while still returning individual rows. Together, they form the backbone of reporting queries.
Common aggregate functions
| Function | Description |
| COUNT() | Counts rows or non null values |
| SUM() | Totals numeric values |
| AVG() | Computes average value |
| MIN() / MAX() | Finds smallest or largest value |
| GROUP_CONCAT() / STRING_AGG() | Concatenates values within a group (syntax varies) |
SELECT department, COUNT(*) AS staff, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC;
Aggregates collapse multiple rows into summaries per group. You can add HAVING to filter results after aggregation.
SELECT department
FROM employees
GROUP BY department
HAVING AVG(salary) > 70000;
Window functions for running totals and rankings
Window functions compute values across a related set of rows, defined by OVER clauses, without collapsing the result set.
SELECT
department,
employee_id,
salary,
SUM(salary) OVER (PARTITION BY department) AS total_in_dept,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank_in_dept
FROM employees;
This calculates per department totals and ranks each employee by salary while still listing individual rows.
Running totals and moving averages
SELECT
sale_date,
SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
AVG(amount) OVER (ORDER BY sale_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg
FROM sales;
These patterns are essential for time series and financial analysis, where each new row builds on previous ones.
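The running-total pattern can be verified end to end with SQLite's window functions (available in SQLite 3.25 and later, as bundled with recent Python releases); the sample rows are invented:

```python
# Sketch: cumulative running total via a window function in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2025-01-01", 10.0), ("2025-01-02", 20.0), ("2025-01-03", 5.0)],
)
rows = conn.execute("""
    SELECT sale_date,
           SUM(amount) OVER (ORDER BY sale_date
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
    FROM sales
    ORDER BY sale_date
""").fetchall()
print(rows)  # each row carries the cumulative total up to that date
```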
Avoid relying on ORDER BY without an explicit frame clause for large data sets. Explicit frames like ROWS BETWEEN improve predictability and may optimize better in some engines.
Combining aggregates and windows
Both types of functions can appear in the same query, allowing summaries and details together:
SELECT
region,
store_id,
SUM(sales) AS store_total,
SUM(SUM(sales)) OVER (PARTITION BY region) AS region_total
FROM sales
GROUP BY region, store_id;
Here, the grouped result provides store_total, and the window computes region_total across grouped rows.
Using SQL for data analysis
SQL can perform most classical analytical steps: filtering, aggregation, correlation, and cohort analysis. When datasets fit within the database, these queries can deliver results faster than exporting to external tools.
Descriptive statistics
Basic statistical summaries can be computed directly in SQL:
SELECT
AVG(amount) AS mean,
STDDEV(amount) AS stddev,
MIN(amount) AS min,
MAX(amount) AS max,
COUNT(*) AS samples
FROM orders;
Some engines provide VARIANCE(), MEDIAN(), and PERCENTILE_CONT() for more advanced measures.
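When the engine lacks these functions entirely (SQLite, for example, has no built-in MEDIAN or PERCENTILE_CONT), one fallback is to pull the values and compute the statistic client side. A minimal sketch with invented data:

```python
# Sketch: computing a median in the application when the engine lacks MEDIAN().
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?)", [(10,), (20,), (30,), (40,), (100,)])

amounts = [a for (a,) in conn.execute("SELECT amount FROM orders")]
median = statistics.median(amounts)
print(median)  # 30
```

This only works while the result set fits in memory; for very large tables, prefer an engine-side percentile function or an approximate algorithm.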
Cohort and retention analysis
Cohort analysis groups users or transactions by their first event date and measures retention or performance over time.
WITH first_use AS (
SELECT user_id, MIN(order_date) AS cohort_date
FROM orders
GROUP BY user_id
)
SELECT
f.cohort_date,
DATE_TRUNC('month', o.order_date) AS active_month,
COUNT(DISTINCT o.user_id) AS active_users
FROM orders o
JOIN first_use f ON o.user_id = f.user_id
GROUP BY f.cohort_date, active_month
ORDER BY f.cohort_date, active_month;
This pattern reveals how many users return after their initial period. It is widely used in product analytics.
Correlation and outlier detection
The SQL standard defines COVAR_POP(), COVAR_SAMP(), CORR(), and REGR_SLOPE() to measure relationships between columns.
SELECT
CORR(height, weight) AS correlation,
REGR_SLOPE(weight, height) AS slope
FROM measurements;
Outliers can be flagged by comparing values to standard deviation ranges:
SELECT id, amount
FROM sales
WHERE amount > (
SELECT AVG(amount) + 3 * STDDEV(amount)
FROM sales
);
Pivoting and unpivoting for summaries
Pivoting turns row data into columns, while unpivoting does the reverse. PIVOT and UNPIVOT constructs are engine extensions (for example SQL Server, Oracle, and Snowflake) rather than universal syntax, so check your engine's support.
SELECT *
FROM sales
PIVOT (
SUM(amount) FOR region IN ('EU', 'US', 'APAC')
) AS p;
This produces one row per grouping key and a column per region. Engines without PIVOT can emulate it with conditional aggregation:
SELECT
product,
SUM(CASE WHEN region = 'EU' THEN amount END) AS eu_sales,
SUM(CASE WHEN region = 'US' THEN amount END) AS us_sales,
SUM(CASE WHEN region = 'APAC' THEN amount END) AS apac_sales
FROM sales
GROUP BY product;
Using SQL for BI integration
Modern business intelligence (BI) tools rely on SQL to extract and visualize insights. Whether connected through live queries or cached extracts, SQL defines the data model for dashboards and analytics layers.
Connecting BI tools
BI platforms such as Power BI, Tableau, Looker, and Metabase connect through standard protocols (ODBC, JDBC, or native drivers). They send SQL queries to the database and render results as charts or tables.
-- Example: a BI tool may issue
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region
ORDER BY total_sales DESC;
Views are ideal interfaces for BI tools, encapsulating business logic and simplifying joins. Expose only the necessary columns and apply row-level permissions where needed.
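Some engines can enforce row-level permissions in the database itself. This PostgreSQL-specific sketch assumes an orders table with a region column and an application-set session variable named app.current_region (both illustrative):

```sql
-- PostgreSQL row-level security sketch; names are assumptions
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

-- Each session only sees rows for the region the application sets
CREATE POLICY region_access ON orders
    FOR SELECT
    USING (region = current_setting('app.current_region'));
```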
Creating analytical views
CREATE VIEW v_sales_summary AS
SELECT
region,
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS total_sales,
COUNT(DISTINCT customer_id) AS customers
FROM orders
GROUP BY region, month;
Analytical views provide a stable surface for BI dashboards to query without risk of schema changes breaking charts.
Caching and extracts
Many BI systems store precomputed extracts for performance. Use scheduled jobs or materialized views to populate them efficiently.
CREATE MATERIALIZED VIEW mv_sales_monthly AS
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS total
FROM orders
GROUP BY month;
Refresh materialized views regularly, or trigger updates based on data change events.
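In PostgreSQL, for example, a refresh is a single statement. The CONCURRENTLY option avoids blocking readers during the rebuild, but requires a unique index on the view (not defined above, so it is an assumption here):

```sql
-- Rebuild the extract without locking out dashboard queries;
-- plain REFRESH MATERIALIZED VIEW works without the unique index
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_sales_monthly;
```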
Data modeling for analytics
Adopt a star schema or snowflake schema to make analytical queries simpler and faster. Separate fact tables (transactions) from dimension tables (descriptive attributes).
-- Simplified star schema
FACT_SALES(order_id, date_id, product_id, amount)
DIM_DATE(date_id, date, month, year)
DIM_PRODUCT(product_id, category, brand)
Analysts can then write concise joins for reports:
SELECT
d.month,
p.category,
SUM(f.amount) AS total_sales
FROM fact_sales f
JOIN dim_date d ON f.date_id = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY d.month, p.category;
Automating analytics pipelines
Use stored procedures, scheduled jobs, or external orchestrators (for example Airflow or dbt) to automate analysis refreshes and report generation.
-- Example task scheduling pseudo SQL
CALL refresh_materialized_view('mv_sales_monthly');
By combining SQL’s analytical power with BI integration, you can build self-updating dashboards, real-time metrics, and consistent data models across your organization.
Chapter 30: Bringing It All Together
This final chapter combines all the principles covered so far into a coherent picture of how SQL fits into real projects. You will see how to design a complete database, write efficient queries for practical problems, explore advanced techniques, compare SQL with NoSQL approaches, and understand SQL’s role in the wider field of data engineering.
Designing and building a complete sample database
Let’s consider an example project for an online retail platform. The database must support customers, products, orders, payments, and inventory tracking. We will use a normalized relational schema, designed for clarity and integrity.
Schema overview
CREATE TABLE customers (
customer_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name VARCHAR(200) NOT NULL,
email VARCHAR(200) UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE products (
product_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name VARCHAR(200) NOT NULL,
category VARCHAR(100),
price NUMERIC(10,2) NOT NULL,
stock_quantity INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE orders (
order_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
customer_id BIGINT NOT NULL,
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
total_amount NUMERIC(10,2),
CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id)
REFERENCES customers(customer_id)
);
CREATE TABLE order_items (
order_id BIGINT NOT NULL,
product_id BIGINT NOT NULL,
quantity INTEGER NOT NULL,
price_each NUMERIC(10,2) NOT NULL,
CONSTRAINT pk_order_items PRIMARY KEY (order_id, product_id),
CONSTRAINT fk_oi_order FOREIGN KEY (order_id) REFERENCES orders(order_id),
CONSTRAINT fk_oi_product FOREIGN KEY (product_id) REFERENCES products(product_id)
);
CREATE TABLE payments (
payment_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
order_id BIGINT NOT NULL,
amount NUMERIC(10,2) NOT NULL,
method VARCHAR(40),
paid_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT fk_payments_order FOREIGN KEY (order_id)
REFERENCES orders(order_id)
);
This schema implements a one-to-many relationship between customers and orders, and a many-to-many relationship between orders and products through order_items.
Declare FOREIGN KEY constraints and keep indexes on these columns for performance. Missing indexes are a common cause of slow joins.
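The foreign-key columns above are natural index candidates; the index names below are illustrative:

```sql
-- Speed up customer-to-order and payment-to-order joins
CREATE INDEX idx_orders_customer ON orders (customer_id);
CREATE INDEX idx_payments_order ON payments (order_id);

-- The order_items primary key (order_id, product_id) already serves
-- lookups by order_id, so only product_id needs its own index
CREATE INDEX idx_oi_product ON order_items (product_id);
```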
Populating the database
INSERT INTO customers (name, email)
VALUES
('Alice', 'alice@example.com'),
('Bob', 'bob@example.com');
INSERT INTO products (name, category, price, stock_quantity)
VALUES
('Laptop', 'Electronics', 899.00, 20),
('Headphones', 'Accessories', 49.99, 100),
('Mouse', 'Accessories', 25.50, 150);
Use INSERT statements, CSV imports, or bulk loading tools depending on volume. After populating tables, test the schema by joining related data.
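For larger volumes, bulk loaders are far faster than row-by-row inserts. This is PostgreSQL's COPY syntax; the file path is illustrative, and other engines use their own commands (MySQL's LOAD DATA, SQL Server's BULK INSERT):

```sql
-- PostgreSQL bulk load from a CSV file with a header row
COPY products (name, category, price, stock_quantity)
FROM '/tmp/products.csv'
WITH (FORMAT csv, HEADER true);
```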
SELECT c.name, o.order_id, p.name AS product, oi.quantity, oi.price_each
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id;
Real-world query examples and optimization
With the schema in place, you can write reporting and analytical queries that mirror business needs. Optimization comes from using indexes, caching, and avoiding unnecessary scans.
Sales by customer and category
SELECT
c.name AS customer,
p.category,
SUM(oi.quantity * oi.price_each) AS total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY c.name, p.category
ORDER BY total_spent DESC;
Identifying top selling products
SELECT
p.name,
SUM(oi.quantity) AS total_sold
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.name
ORDER BY total_sold DESC
LIMIT 10;
Using execution plans
Most engines support an EXPLAIN or EXPLAIN ANALYZE statement to view how a query executes. Look for full table scans, missing indexes, and join order inefficiencies.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 12345;
Run ANALYZE or engine-specific maintenance commands regularly so the planner has accurate data-distribution estimates.
Caching and indexing
Frequently accessed data should be cached at the application layer, or stored in materialized views. Index columns used in JOIN, WHERE, and ORDER BY clauses. Avoid over-indexing; each index slows down inserts and updates.
Advanced SQL
Advanced SQL techniques extend beyond simple querying. They include recursive queries, common table expressions, and analytical window functions that enable complex data transformations.
Recursive CTE for hierarchies
WITH RECURSIVE category_tree AS (
SELECT category_id, parent_id, name
FROM categories
WHERE parent_id IS NULL
UNION ALL
SELECT c.category_id, c.parent_id, c.name
FROM categories c
JOIN category_tree p ON c.parent_id = p.category_id
)
SELECT * FROM category_tree;
This pattern handles hierarchical data like departments, categories, or organizational charts.
Advanced analytics with window functions
SELECT
customer_id,
order_date,
SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM orders;
Window functions let you build progressive totals or rankings directly in SQL, reducing the need for procedural logic.
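Rankings follow the same OVER pattern. For example, ranking each customer's orders by value against the schema defined earlier in this chapter:

```sql
-- Rank orders within each customer, highest value first
SELECT
    customer_id,
    order_id,
    total_amount,
    RANK() OVER (PARTITION BY customer_id
                 ORDER BY total_amount DESC) AS value_rank
FROM orders;
```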
Combining SQL with procedural code
Stored procedures, functions, and triggers allow you to embed logic that runs inside the database. Use them for validation, complex updates, or background processing.
CREATE OR REPLACE FUNCTION update_stock()
RETURNS TRIGGER AS $$
BEGIN
UPDATE products
SET stock_quantity = stock_quantity - NEW.quantity
WHERE product_id = NEW.product_id;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER after_order_item_insert
AFTER INSERT ON order_items
FOR EACH ROW EXECUTE FUNCTION update_stock();
NoSQL contrasts
NoSQL databases evolved to handle unstructured or highly variable data where strict relational modeling may not fit. They trade strong consistency and complex joins for scalability and flexibility. Common types include document stores, key-value stores, column-family stores, and graph databases.
Document stores
Document databases (such as MongoDB) store JSON-like objects. Each record is self-contained, with nested arrays and objects replacing foreign keys. This simplifies reads but duplicates data.
{
"customer": "Alice",
"orders": [
{ "product": "Laptop", "quantity": 1, "price": 899.00 },
{ "product": "Mouse", "quantity": 2, "price": 25.50 }
]
}
Key value and column stores
Key-value databases (like Redis) and wide-column stores (like Cassandra) are optimized for rapid lookups and high-volume writes. They require explicit design around query access patterns instead of flexible joins.
Graph databases
Graph databases model entities as nodes and relationships as edges. They use query languages such as Cypher or Gremlin for path traversal and relationship queries. SQL extensions like SQL/PGQ (in SQL:2023) bridge some of these ideas back into relational databases.
MATCH (p1:Person)-[:FRIEND_OF]->(p2:Person)
WHERE p1.name = 'Alice'
RETURN p2.name;
SQL and NoSQL convergence
Many modern engines blur the line between SQL and NoSQL. PostgreSQL supports JSONB and key value operations; MySQL and SQL Server handle document columns; and distributed SQL platforms integrate both structured and unstructured models under one query engine.
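In PostgreSQL, for instance, a relational table can store and query document data directly. A sketch (the events table and payload contents are illustrative):

```sql
-- A JSONB column alongside ordinary relational columns
CREATE TABLE events (
    event_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    payload  JSONB NOT NULL
);

-- ->> extracts a field as text; @> tests JSON containment
SELECT payload->>'customer' AS customer
FROM events
WHERE payload @> '{"type": "order"}';
```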
Data engineering
Data engineering connects storage, transformation, and analytics into a complete data pipeline. SQL is at the core of this ecosystem, serving as the language of ETL, warehousing, and analytics orchestration.
ETL and ELT patterns
Traditional ETL (extract, transform, load) performs transformations before loading data into a warehouse. Modern systems often use ELT (extract, load, transform), loading raw data first and transforming it with SQL inside the database.
-- Example transformation step
INSERT INTO clean.orders
SELECT
order_id,
customer_id,
CAST(order_total AS NUMERIC(10,2)) AS amount,
order_date::DATE
FROM raw.orders_raw
WHERE order_total IS NOT NULL;
Data pipelines and orchestration
Tools like Apache Airflow, dbt, or Dagster run SQL transformations as scheduled tasks, coordinating dependencies between stages.
dbt run --models staging+ # Example command
Each transformation becomes a versioned SQL file. Tests validate row counts, null constraints, and referential links automatically.
Data warehouses and lakes
Data warehouses (Snowflake, BigQuery, Redshift) use columnar storage optimized for analytics. Data lakes extend this to semi-structured data formats like Parquet and JSON. SQL remains the unifying query interface across both layers.
Streaming and real-time analytics
Streaming SQL engines (Flink SQL, Materialize, ksqlDB) process continuous event streams using familiar syntax.
SELECT region, COUNT(*) FROM orders_stream
WINDOW TUMBLE (INTERVAL '5' MINUTE)
GROUP BY region;
This model enables near real-time dashboards and alerts powered by SQL queries on live data.
The unified data ecosystem
Modern data engineering treats SQL as the universal language across OLTP, OLAP, and streaming systems. Whether data is structured, semi-structured, or distributed, SQL unites diverse tools under a shared analytical vocabulary.
By now, you have seen how SQL evolved from simple relational tables into a full scale analytical and operational language. Its clarity, portability, and power ensure its place at the heart of modern computing for decades to come.
© 2025 Robin Nixon. All rights reserved
No content may be re-used, sold, given away, or used for training AI without express permission