This article is a chapter from my enterprise Kotlin handbook.
For many years, relational databases were the only serious way to keep relevant data saved in a safe and retrievable way, whenever needed.
Some would say they still are the only serious and safe way.
Relational databases emerged victorious over alternative ways to store large volumes of enterprise data back in the 80's; there is a fine documentary about it.
They are worth using instead of just shoving all the data into a plain file because most of them offer the ACID properties: atomicity, consistency, isolation and durability.
We'll look at several relevant details in this chapter, but let's start with the sample code. In fact, I will present things first and explain later: dealing with database management in most enterprise scenarios is more of a business-related job than a purely technical challenge. What really matters is what the data means, not how it is stored.
This is a small example of how to store things.
Kotlin inherits from Java an API called JDBC, and you can use it to query the database:
// ...
fun list() {
    // crude table header
    println("|#\t|description\t|done\t|updated\t")
    connection.prepareStatement(
        """
        select * from todos
        """.trimIndent()
    ).use { // use {} closes the statement when we're done
        val rs = it.executeQuery()
        while (rs.next()) { // walk the result set, row by row
            println(
                "|${rs.getLong("id")}\t" +
                        "|${rs.getString("description")}\t" +
                        "|${rs.getBoolean("done")}\t" +
                        "|${rs.getTimestamp("updated")}"
            )
        }
    }
}
// ...
JDBC isn't famous for a concise, easy-to-use API, but it's a good foundation for the people writing alternative database interfacing libraries.
You can check the complete sample code here.
When people talk about databases, what they really mean is the collection of software needed to provide database management. RDBMS (Relational Database Management System) is the term to look up to find out more.
Some of them are quite humble, just one program saving into a file, while others are huge, state-of-the-art beasts intended for brutally large data volumes.
There are many RDBMS around, but let's focus on some that will help you to deliver things fast.
Let's start with the tiny guy, SQLite. They say it's not meant to be used for serious applications, yet it is easily the most ubiquitous database engine in the entire world.
There are several ways to install the command line tool and the library, but we'll focus on how to handle the database using the command line, since we already sampled how a Kotlin program could consume the data.
Also, we'll make use of SQL (Structured Query Language) to create the sample todo list database. More about it, in detail, later in this chapter.
Open the command line terminal, call the sqlite3 command you installed in the previous step and:
sombriks@zagreus Documents % sqlite3 todos.db
SQLite version 3.43.2 2023-10-10 13:08:14
Enter ".help" for usage hints.
sqlite> create table todos(id integer, description text, updated timestamp);
sqlite> insert into todos(id,description,updated) values(1, 'do the dishes', '2024-05-26');
sqlite>
Hit CTRL+D to exit; the todos.db file is now there.
You can go back into your database with the same command, sqlite3 todos.db. Now try to query the data:
sombriks@zagreus Documents % sqlite3 todos.db
SQLite version 3.43.2 2023-10-10 13:08:14
Enter ".help" for usage hints.
sqlite> select * from todos;
1|do the dishes|2024-05-26
sqlite>
There is a lot to explain, yet look how simple that is!
SQLite figures in this list because it is by far the easiest one to use when we need a small and simple database for testing purposes. As you can see, you don't even need to set up a real project to test things with it.
H2 is another tiny, embeddable database runtime that you can use in your Kotlin/Spring projects.
In fact, Spring Boot will provision an embedded H2 instance if your tests need a DataSource but you have configured no such thing.
Instead of a command line tool, H2 offers a feature-complete web admin console, so you get a better visualization of the data.
MySQL is a popular, robust RDBMS widely used on internet solutions.
It's quite feature-complete (I still don't forgive the lack of the RETURNING clause; I don't care that it's not ANSI SQL) and it's easy to find years of documentation about it.
It has a notable fork, MariaDB (which does have RETURNING).
Now, PostgreSQL is the one you pick when you want to do big, serious business. It's also one of the oldest RDBMS in activity, dating from the time of relational database invention (I mean WARS!!).
With Postgres you can handle internet-scale operations; your hardware is the limit. It is in another league, competing with systems like Oracle, IBM DB2 and SQL Server. And winning.
Pick postgres whenever possible.
Most RDBMS offer tools like the sqlite3 command line tool we used earlier in this chapter, or something like the H2 embedded web console.
But if we keep using only the official tool for each RDBMS, we'll end up overwhelmed with so much tooling to learn for something that is an ISO standard.
There are tools able to deal with several distinct database runtimes, so you can learn one single tool and get the job done even if you don't completely understand the database in use.
DBeaver supports a wide range of RDBMS. Everything the JVM touches, it touches too.
DataGrip supports a wide range of RDBMS. Everything the JVM touches, it touches too.
DB Browser for SQLite is a dedicated client for SQLite, and it deserves a place here because SQLite is the one we pick for fast, expendable test databases.
Sure, the command line is cool, but sometimes visual tools allow you to go home early.
Now let's talk about the way we interact with databases.
The Structured Query Language is the standard way to retrieve, model, modify and evolve data stored in an RDBMS.
Since the language existed for years before the standardization appeared (and for decades before that standardization started to be taken seriously), a few SQL dialects exist across different RDBMS. We'll face them up ahead.
It can be divided into several categories, but the principal ones are DDL and DML.
DDL stands for Data Definition Language. You use it to describe the data schema, tables and relations between them. We saw it in action in the initDb function back in the sample project.
We talk SQL, we talk tables.
A table is the foundational structure used to model our data. It holds columns; each column represents a characteristic of the data we want stored.
Let's say we want to keep a list of friends. We need their names.
create table friends
(
    name text
);
This table represents exactly what we requested, although it is horribly, poorly designed.
We can recreate our table to fix those problems I just invented:
-- throw away previous table definition
drop table friends;
-- do it again
create table friends
(
    id integer primary key,
    name text not null,
    created timestamp not null default current_timestamp
);
Or, instead of throwing it away, we can just alter the table definition:
alter table friends
    add column phone text;
alter table friends
    add column email text;
alter table friends
    add column address text;
Some database engines (SQLite, for example!) won't let you add primary keys via alter table, which is why the example above is the way it is.
What?? What is a primary key, and why do 9 out of 10 DBAs call it id?
Whenever you insert data into a table, a record is created. A record is a 'line' in the table.
If you insert the same data twice, how do you tell the records apart? They are the same, yet they're not.
The concept of identity, from psychology, helps to solve it. You can read about the Ship of Theseus paradox to make things even clearer: data is subject to change over time, yet each record is supposed to remain equal to itself no matter the changes.
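A minimal sketch of the problem, assuming the friends table above and an engine that auto-assigns integer primary keys (SQLite does):

insert into friends(name) values ('Joe');
insert into friends(name) values ('Joe');
-- two identical Joes; the id is the only way to tell the records apart and
-- to reference one specific Joe no matter how its data changes later:
update friends set name = 'Joe Jr.' where id = 2;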
Another common debate is whether to adopt 'natural' or 'artificial' keys.
By natural, understand some trait that already exists among the characteristics of the entity being represented by the table.
A social security number could be considered a natural key.
The id column is an artificial key. It makes sense in the table context but is completely alien to a person in the real world.
In most cases, prefer artificial keys. Natural ones are subject to change due to real-world events, and thus are outside the RDBMS's control.
If for some reason you want to use special names for your table, use double quotes:
create table "never tell me the odds"
(
    id integer primary key,
    "odd col name" text not null
);
insert into "never tell me the odds" ("odd col name")
values ('han shoots first');
select *
from "never tell me the odds";
Resulting in:
id | odd col name |
---|---|
1 | han shoots first |
Indexes are a way to inform the RDBMS that a certain column will be requested by queries pretty often.
Primary keys are usually indexed automatically.
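For an ordinary column you filter by often, you can create the index yourself. A minimal sketch against the friends table (the index name is my own choice):

-- a plain, non-unique index on a column queries will filter by often
create index friends_idx_name on friends (name);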
You can add uniqueness on your column definition and it will also be indexed.
The table below defines unique indexes on email and phone: no two friends share that information, but it's OK to have more than one friend with the same address.
-- drop table friends;
create table friends
(
    id integer primary key,
    name text not null,
    email text unique not null,
    phone text unique not null,
    address text not null,
    created timestamp not null default current_timestamp
);
This is an alternative syntax; the advantage is that we know the constraint name:
-- drop table friends;
create table friends
(
    id integer primary key,
    name text not null,
    email text not null,
    phone text not null,
    address text not null,
    created timestamp not null default current_timestamp,
    constraint friends_uq_email unique (email),
    constraint friends_uq_phone unique (phone)
);
This also works for some dialects (H2 and the rest of the world, OK; SQLite, no):
-- drop table friends;
create table friends
(
    id integer primary key,
    name text not null,
    email text not null,
    phone text not null,
    address text not null,
    created timestamp not null default current_timestamp
);
alter table friends
    add constraint friends_uq_email unique (email);
alter table friends
    add constraint friends_uq_phone unique (phone);
On the other hand, sometimes we want just a subset of results, or there are two or more tables we join very often (more on joins a little further in this chapter). For those scenarios, we can create views:
create view friend_emails as
select id, email
from friends;
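Once created, a view is queried like any other table. For example:

select * from friend_emails where id = 1;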
Those two concepts will make more sense in later topics.
Let's talk about the "Relational" in Relational Database Management System.
There are relations between the tables. Let's evolve our friends database: one table for friends, another for addresses, because yes.
drop table friends;
create table addresses
(
    id integer primary key,
    name text not null,
    number text,
    created timestamp not null default current_timestamp,
    constraint addresses_uq_name_number unique (name, number)
);
create table friends
(
    id integer primary key,
    name text not null,
    email text unique not null,
    phone text unique not null,
    addresses_id integer not null,
    created timestamp not null default current_timestamp,
    foreign key (addresses_id) references addresses (id)
);
The way address is defined allows us to share the same address with more than one friend. It is represented in the friends table by a foreign key pointing to the addresses primary key.
But what happens if I delete an address?
Depending on the database runtime, the delete operation is either prevented, or it goes through and leaves you with a database in an invalid state.
Databases honoring the ACID properties will prevent you from doing that.
But you can solve it by explaining what to do in such cases:
create table friends
(
    id integer primary key,
    name text not null,
    email text unique not null,
    phone text unique not null,
    -- nullable now, so the reference really can be set to null on delete
    addresses_id integer,
    created timestamp not null default current_timestamp,
    foreign key (addresses_id) references addresses (id) on delete set null
);
We'll come back to this cascade thing in upcoming topics.
DML stands for Data Manipulation Language. It differs from what we've seen so far: DDL describes what our data should look like, while DML feeds data into it.
To add data to your database, you use insert statements.
For this table (tested on sqlite, trust me):
-- drop table if exists friends;
create table friends
(
    id integer primary key,
    name text not null,
    email text,
    created timestamp not null default current_timestamp
);
You can insert data like this:
insert into friends(name)
values ('Joe');
The output should be something like this:
1 row affected in 5 ms
And you can check your data with a select (more on that later):
select *
from friends;
And the result would be:
id | name | email | created |
---|---|---|---|
1 | Joe | null | 2024-05-28 11:46:18 |
Note that the insert didn't have any info about the id or created columns, yet there are values in them. This is how column definitions work: you can define keys, default values, restrictions, you name it (as long as they're supported by the database engine in use).
For the following table:
-- drop table if exists friends;
create table friends
(
    id integer primary key,
    name text not null,
    email text not null,
    created timestamp not null default current_timestamp
);
The same insert that served us well will fail:
insert into friends(name)
values ('Joe');
With an error message more or less like this:
[2024-05-28 09:02:09] [19] [SQLITE_CONSTRAINT_NOTNULL] A NOT NULL constraint failed (NOT NULL constraint failed: friends.email)
What does it mean? It means that, according to our table definition, I don't accept friends who don't have an email. Uncivilized people.
You fix that insert by simply providing an email:
insert into friends(name, email)
values ('Joe', 'email@joe.com');
It's possible to provide several friends in a single insert:
insert into friends(name, email)
values ('Joe', 'email@joe.com'),
('Anne', 'anne@ggg.co');
You can specify values for columns with default values if you want to:
insert into friends(name, email, created)
values ('Joe', 'email@joe.com', '2024-05-28');
But beware! Primary and unique keys will complain about duplicate values and your insert will fail. The identity thing from psychology, remember?
insert into friends(id, name, email)
values (1, 'Joe', 'email@joe.com'),
(1, 'Anne', 'anne@ggg.co');
Expect something like this:
[2024-05-28 09:13:28] [19] [SQLITE_CONSTRAINT_PRIMARYKEY] A PRIMARY KEY constraint failed (UNIQUE constraint failed: friends.id)
A funny one, if every column has a default value:
create table my_messages
(
    id integer primary key,
    message varchar(255) not null default 'yo!',
    created timestamp not null default current_timestamp
);
You can insert data like this:
insert into my_messages default values;
Tables with foreign keys marked with the not null restriction will perform extra checks on your insert attempts.
For this table schema:
create table orders
(
    id integer not null primary key,
    created timestamp not null default current_timestamp
);
create table items
(
    id integer not null primary key,
    product text not null,
    amount integer not null default 1,
    orders_id integer not null references orders (id),
    created timestamp not null default current_timestamp
);
This insert alone will fail (except in SQLite, where it fails only if you enable the foreign key check):
-- pragma foreign_keys = 1;
insert into items (product, orders_id)
values ('cell phone', 10);
Resulting error looks like this:
[2024-05-28 10:00:57] [19] [SQLITE_CONSTRAINT_FOREIGNKEY] A foreign key constraint failed (FOREIGN KEY constraint failed)
Avoid that by inserting an order first:
insert into orders default values returning *;
The returning clause (which MySQL doesn't have, but every other database runtime does, even SQLite) helps you to know what the newly created record looks like. Here's a sample output:
id | created |
---|---|
1 | 2024-05-28 13:05:48 |
Now the item insert will pass, as long as the orders_id exists as an id in the orders table:
-- pragma foreign_keys = 1;
insert into items (product, orders_id)
values ('bicycle', 1);
Similar to the table schema itself, stored data must evolve too, either by modifying or removing records. For example, given this table:
create table foods
(
    id integer not null primary key,
    description text not null,
    is_sweet boolean not null default false,
    created timestamp not null default current_timestamp
);
And this data:
insert into foods(description)
values ('noodles'),
('milky ''n honey chocolate'),
('lemon'),
('steak');
By checking the content we'll notice an error:
select *
from foods;
id | description | is_sweet | created |
---|---|---|---|
5 | noodles | false | 2024-05-28 23:42:08.814539 |
6 | milky 'n honey chocolate | false | 2024-05-28 23:42:08.814539 |
7 | lemon | false | 2024-05-28 23:42:08.814539 |
8 | steak | false | 2024-05-28 23:42:08.814539 |
The chocolate thing is supposed to be sweet.
To fix that, update the row:
update foods
set is_sweet = true
where description = 'milky ''n honey chocolate';
But the name is too difficult to use, and what is that '' thing? (It's how you escape a single quote inside a SQL string literal.)
You can use an alternative update:
update foods
set is_sweet = true
where description like '%honey%';
The like operator matches values against a pattern instead of an exact string.
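A quick sketch of the two standard wildcards against the foods table: % matches any run of characters, _ matches exactly one.

select * from foods where description like 'le%';   -- lemon
select * from foods where description like '_emon'; -- lemon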
But the safest, most straightforward way to update a record is by knowing its id:
update foods
set is_sweet = true
where id = 6;
We've started using the where clause in our queries; here is a bit more on it.
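Conditions can be combined with and, or and not. A small sketch against the foods table:

select *
from foods
where is_sweet = true
   or description like '%steak%';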
If we need to get rid of some data, the delete can be used:
delete
from foods
where id = 6;
Note that, like update, a where clause should be used to specify which records you want to affect. In both cases, if no where clause is provided, the statement will affect all records in the table, which is undesirable in most cases.
Deletes also take relations into account. Given this schema:
create table team
(
    id identity not null primary key,
    description text not null
);
create table member
(
    id identity not null primary key,
    team_id integer not null,
    name text not null,
    constraint fk_member_team_id foreign key (team_id) references team (id)
);
insert into team (description)
values ('team 1'),
('team 2');
insert into member(name, team_id)
values ('Joe', 1),
('Anna', 1),
('Lucas', 1),
('Ana', 2),
('Kelly', 2);
An attempt to delete team 2 will likely fail (tested on H2):
delete
from team
where id = 2;
With an output like this:
[2024-05-28 21:12:15] [23503][23503] Referential integrity constraint violation: "FK_MEMBER_TEAM_ID: PUBLIC.MEMBER FOREIGN KEY(TEAM_ID) REFERENCES PUBLIC.TEAM(ID) (2)"; SQL statement:
[2024-05-28 21:12:15] delete from team where id = 2 [23503-220]
Yes, yes, you can delete all members where team_id = 2 and then perform the deletion on team, but you can instead define a cascade on team_id:
alter table member
    drop constraint fk_member_team_id;
alter table member
    add constraint fk_member_team_id foreign key (team_id) references team (id) on delete cascade;
Now your deletion goes through, taking team 2's members along with it:
TODOS.H2.DB.PUBLIC> delete from team where id = 2
[2024-05-28 21:26:35] 1 row affected in 2 ms
Check what's left in the database:
select *
from team
join member on member.team_id = team.id;
team.id | description | member.id | team_id | name |
---|---|---|---|---|
1 | team 1 | 1 | 1 | Joe |
1 | team 1 | 2 | 1 | Anna |
1 | team 1 | 3 | 1 | Lucas |
We've been using selects for a while in this chapter, but now let's look at them in more detail.
To perform a select, you need very little, really. Just an RDBMS running and accepting connections (SQLite is a file; H2, MySQL, PostgreSQL, Oracle and so on usually work as client/server applications). This is a valid select:
select 1;
No tables. Just a number. Works.
You can select arbitrary values:
select 'Hi mom',
123,
current_timestamp,
lower('DON''T YELL AT ME!'),
true;
When selecting, you can mix arbitrary values with data from your tables:
select concat('the team member ', name, ' from team ', description) as headline
from team
join member on member.team_id = team.id;
It is possible to use the select output as the values for an insert statement. It is a powerful move when transforming data already present inside the database.
For example, say I want to know all possible color combinations from my color database. First I create a table to save my colors and add some; then I create a second table to store my combinations, and instead of feeding the combinations one by one I can do the insert into / select operation:
-- drop table if exists colors;
create table colors
(
    id integer primary key,
    name varchar(255) not null
);
insert into colors (name)
values ('red'),
('green'),
('blue'),
('cyan'),
('purple');
-- drop table if exists color_combinations;
create table color_combinations
(
    color1 integer not null,
    color2 integer not null,
    constraint color_combinations_fk_colors_id_1 foreign key (color1) references colors (id) on delete cascade,
    constraint color_combinations_fk_colors_id_2 foreign key (color2) references colors (id) on delete cascade,
    constraint color_combinations_pk primary key (color1, color2)
);
-- feed the combinations
insert into color_combinations
select c1.id, c2.id
from colors c1,
colors c2
where c1.id <> c2.id;
select *
from color_combinations cc
join colors c1 on cc.color1 = c1.id
join colors c2 on cc.color2 = c2.id;
This results in this remarkable dataset:
color1 | color2 | id | name | id | name |
---|---|---|---|---|---|
1 | 2 | 1 | red | 2 | green |
1 | 3 | 1 | red | 3 | blue |
1 | 4 | 1 | red | 4 | cyan |
1 | 5 | 1 | red | 5 | purple |
2 | 1 | 2 | green | 1 | red |
2 | 3 | 2 | green | 3 | blue |
2 | 4 | 2 | green | 4 | cyan |
2 | 5 | 2 | green | 5 | purple |
3 | 1 | 3 | blue | 1 | red |
3 | 2 | 3 | blue | 2 | green |
3 | 4 | 3 | blue | 4 | cyan |
3 | 5 | 3 | blue | 5 | purple |
4 | 1 | 4 | cyan | 1 | red |
4 | 2 | 4 | cyan | 2 | green |
4 | 3 | 4 | cyan | 3 | blue |
4 | 5 | 4 | cyan | 5 | purple |
5 | 1 | 5 | purple | 1 | red |
5 | 2 | 5 | purple | 2 | green |
5 | 3 | 5 | purple | 3 | blue |
5 | 4 | 5 | purple | 4 | cyan |
How to make it a combination instead of a permutation is left to you as an exercise.
In a similar fashion, you can use a select to create a table:
create table ccombs as
select c1.name as name1, c2.name as name2
from color_combinations cc
join colors c1 on cc.color1 = c1.id
join colors c2 on cc.color2 = c2.id
where c1.id < c2.id;
select *
from ccombs;
name1 | name2 |
---|---|
red | green |
red | blue |
red | cyan |
red | purple |
green | blue |
green | cyan |
green | purple |
blue | cyan |
blue | purple |
cyan | purple |
Note, however, that the resulting table is pretty poorly designed: it does not have any constraints, no primary key, just raw, synthetic data. Sure, you can fix some of those issues with alter table statements, but the victorious approach here is to create the table first and then go with the insert/select strategy.
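A sketch of that approach for the same data, with the constraints the generated table lacked (the table and constraint names are my own):

create table ccombs2
(
    name1 text not null,
    name2 text not null,
    constraint ccombs2_pk primary key (name1, name2)
);
insert into ccombs2 (name1, name2)
select c1.name, c2.name
from color_combinations cc
join colors c1 on cc.color1 = c1.id
join colors c2 on cc.color2 = c2.id
where c1.id < c2.id;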
Join operations (which we've been using for a while now) combine the records of two or more tables. Keep that in mind: it involves multiple tables, but the operation happens at the record level. Why is that important?
select c1.*, c2.*
from colors c1,
colors c2;
id | name | id | name |
---|---|---|---|
1 | red | 1 | red |
1 | red | 2 | green |
1 | red | 3 | blue |
1 | red | 4 | cyan |
1 | red | 5 | purple |
2 | green | 1 | red |
2 | green | 2 | green |
2 | green | 3 | blue |
2 | green | 4 | cyan |
2 | green | 5 | purple |
3 | blue | 1 | red |
3 | blue | 2 | green |
3 | blue | 3 | blue |
3 | blue | 4 | cyan |
3 | blue | 5 | purple |
4 | cyan | 1 | red |
4 | cyan | 2 | green |
4 | cyan | 3 | blue |
4 | cyan | 4 | cyan |
4 | cyan | 5 | purple |
5 | purple | 1 | red |
5 | purple | 2 | green |
5 | purple | 3 | blue |
5 | purple | 4 | cyan |
5 | purple | 5 | purple |
Each record from the colors table aliased as c1 was combined with each record from the same table aliased as c2. This is called a cartesian product, and it is not always desirable.
As we saw previously, you avoid that by properly defining either a where clause or conditions in the join. Both join forms below solve the issue:
select c1.*, c2.*
from colors c1,
colors c2
where c1.id < c2.id;
select c1.*, c2.*
from colors c1
join colors c2 on c1.id < c2.id;
Joins have variations: the inner join (the default) only combines records that match the join condition, while left, right and full outer joins also keep unmatched records from one side (or both), pairing them with nulls, as sketched below.
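A minimal sketch of the two you'll use most, assuming the team and member tables from the delete example:

-- inner join: only teams that have at least one member show up
select t.description, m.name
from team t
join member m on m.team_id = t.id;

-- left join: every team shows up; memberless teams are paired with nulls
select t.description, m.name
from team t
left join member m on m.team_id = t.id;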
Now, if you want to append distinct yet slightly similar results, you need a union. For example, for this schema and data:
create table colonels
(
    id integer primary key,
    name text not null
);
create table captains
(
    id integer primary key,
    name text not null
);
create table soldiers
(
    id integer primary key,
    name text not null
);
insert into colonels(name)
values ('Clark'),
('Mark'),
('Oliver');
insert into captains(name)
values ('Barry'),
('Larry');
insert into soldiers(name)
values ('Joe');
-- I can query each table, one by one, of course, but I can union them and even
-- polish results a little so no information is left out:
select '1 - colonel' as rank, name
from colonels
union
select '2 - captain' as rank, name
from captains
union
select '3 - soldier' as rank, name
from soldiers
order by rank;
rank | name |
---|---|
1 - colonel | Clark |
1 - colonel | Mark |
1 - colonel | Oliver |
2 - captain | Barry |
2 - captain | Larry |
3 - soldier | Joe |
Pretty nice, right?
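One detail worth knowing: plain union removes duplicate rows across the combined results, while union all keeps everything (and skips the deduplication work). A quick sketch with the same tables:

select name from colonels
union all
select name from captains;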
Queries can grow more and more complex depending on the kind of information we must extract. Joins, subqueries, views: things can grow in complexity pretty fast.
One approach to ease that pain is to identify the duplication in your subqueries and promote them to CTEs (Common Table Expressions).
Let's exemplify the issue, and then how to ease things, with this database schema:
-- pragma foreign_keys = 1;
-- a table for products. key characteristics, hardly changing, go here.
create table products
(
    id integer primary key,
    description text not null,
    created timestamp not null default current_timestamp
);
-- a table for price history. prices will vary over time.
create table prices_history
(
    id integer primary key,
    value decimal(10, 4) not null,
    created timestamp not null default current_timestamp,
    products_id integer not null,
    foreign key (products_id) references products (id),
    unique (products_id, created)
);
-- a table to know the product stock position at a given moment in time.
create table stock_history
(
    id integer primary key,
    amount integer not null default 0,
    created timestamp not null default current_timestamp,
    products_id integer not null,
    foreign key (products_id) references products (id),
    unique (products_id, created)
);
-- orders table to identify a transaction
create table orders
(
    id integer primary key,
    -- this should be calculated from all items in this order
    total_price decimal(10, 4) not null,
    created timestamp not null default current_timestamp
);
-- order item to know how many products were involved in the transaction, how
-- much they cost and what the stock position was when this transaction
-- happened.
create table order_items
(
    id integer primary key,
    amount integer not null default 1,
    created timestamp not null default current_timestamp,
    orders_id integer not null,
    products_id integer not null,
    prices_history_id integer not null,
    stock_history_id integer not null,
    foreign key (orders_id) references orders (id),
    foreign key (products_id) references products (id),
    foreign key (prices_history_id) references prices_history (id),
    foreign key (stock_history_id) references stock_history (id),
    -- product appears a single time per transaction,
    -- with a stock and a price properly trackable in time
    unique (orders_id, products_id),
    unique (orders_id, products_id, prices_history_id),
    unique (orders_id, products_id, stock_history_id)
);
I want to sell things and keep track of how exactly it affects my product stock and how much it cost at the time I sold it. It's not that complex but, believe me, selling things is hard.
Let's feed some data:
-- pragma foreign_keys = 1;
-- some base products
insert into products(id, description)
values (1, 'Apple'),
(2, 'Banana'),
(3, 'Milk'),
(4, 'Toy Airplane');
-- initial prices
insert into prices_history (id, products_id, value)
values (1, 1, 5),
(2, 2, 3),
(3, 3, 5.5),
(4, 4, 25);
-- initial stock
insert into stock_history (id, products_id, amount)
values (1, 1, 100),
(2, 2, 500),
(3, 3, 100),
(4, 4, 30);
-- now create an order
insert into orders(id, total_price)
values (1, 100);
-- and some items to it
insert into order_items(orders_id, products_id, prices_history_id, stock_history_id, amount)
values (1, 1, 1, 1, 10), -- 10 apples at $5 each
(1, 4, 4, 4, 2); -- 2 toys at $25 each
-- let's update our stocks
insert into stock_history (id, products_id, amount)
values (5, 1, 90), -- I just sold 10 apples
(6, 4, 28); -- just sold 2 airplane toys
-- now let's change some prices just because.
insert into prices_history (id, products_id, value)
values (5, 1, 10),
(6, 2, 9);
-- and place a new order
insert into orders(id, total_price)
values (2, 100);
insert into order_items(orders_id, products_id, prices_history_id, stock_history_id, amount)
values (2, 1, 5, 5, 20), -- 20 apples at $10 each
(2, 2, 6, 2, 100); -- 100 bananas at $9 each
-- now update the stock, keep the stock up to date
insert into stock_history (id, products_id, amount)
values (7, 1, 70),
(8, 2, 400);
-- oof! boy, that reminds me of my days as a salesman.
Now, let's ask this database some questions:
1. How many distinct products have we sold?
2. What was the smallest amount paid in a single order item?
3. What is the latest stock position of each product?
4. What were the first and the latest prices of each product?
The following queries can answer them:
-- Answer #1
select count(distinct products_id) as different_items
from order_items;
-- Answer #2
select min(paid_price) as paid_amount
from (select amount * value as paid_price
from order_items oi
join prices_history ph on oi.prices_history_id = ph.id);
-- Answer #3
select latest.latest_id, latest.products_id, sh.amount
from (select max(id) as latest_id, products_id
from stock_history
group by products_id) as latest
join stock_history sh on sh.id = latest.latest_id;
-- Answer #4
select mm.products_id, phmin.value, phmax.value
from (select min(ph.id) first_id, max(ph.id) last_id, ph.products_id
from prices_history ph
group by products_id) mm
join prices_history phmin on phmin.id = mm.first_id
join prices_history phmax on phmax.id = mm.last_id;
But boy, that is a lot of queries and subqueries, and subqueries are hard to read. We can improve that a little with CTEs:
-- Answer #2
with minimum as (select oi.amount * ph.value as paid_price
from order_items oi
join prices_history ph on oi.prices_history_id = ph.id)
select min(paid_price) as paid_amount
from minimum;
Now the subquery is more readable and also reusable.
Let's check another one:
-- Answer #3
with latest_history as (select max(id) as latest_id, products_id
from stock_history
group by products_id),
latest_products as (select lh.latest_id, lh.products_id, sh.amount
from latest_history lh
join stock_history sh on sh.id = lh.latest_id)
select latest_products.latest_id, latest_products.products_id, latest_products.amount
from latest_products;
By composing CTEs you can more easily drill down into what your query really does. Sequencing is better than nesting.
One more:
with mm as (select min(ph.id) first_id, max(ph.id) last_id, ph.products_id
from prices_history ph
group by products_id)
select mm.products_id, phmin.value, phmax.value
from mm
join prices_history phmin on phmin.id = mm.first_id
join prices_history phmax on phmax.id = mm.last_id;
You can think of the reusable-query hierarchy like this: a subquery lives inside a single statement, a CTE names that subquery so it can be reused along one statement, and a view stores the query in the schema so every statement can reuse it.
We can use aggregate functions (like we've been using!) to extract even more cool information from our database.
You can figure out total sold items:
select sum(amount)
from order_items;
The average historic price of a product:
select avg(value)
from prices_history
where products_id = 1;
How many transactions we've done so far:
select count(id)
from orders;
By grouping, you can narrow that information down to a specific aspect.
For example, this is how to figure out how many price changes each product had:
select count(id) price_changes, products_id
from prices_history
group by products_id;
Every column present in the select list and not being aggregated (count, sum, avg, etc.) must be present in the group by clause.
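For instance, the sketch below selects value without aggregating or grouping it; strict engines (PostgreSQL, H2, MySQL with ONLY_FULL_GROUP_BY) reject it, while SQLite is famously lenient and silently picks the value from an arbitrary row:

-- broken on strict engines: 'value' is neither aggregated nor grouped
select products_id, value, count(id)
from prices_history
group by products_id;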
Order by sorts the results. For example:
select *
from products
order by description;
Results in:
id | description | created |
---|---|---|
1 | Apple | 2024-05-30 20:29:20 |
2 | Banana | 2024-05-30 20:29:20 |
3 | Milk | 2024-05-30 20:29:20 |
4 | Toy Airplane | 2024-05-30 20:29:20 |
While:
select *
from products
order by description desc;
id | description | created |
---|---|---|
4 | Toy Airplane | 2024-05-30 20:29:20 |
3 | Milk | 2024-05-30 20:29:20 |
2 | Banana | 2024-05-30 20:29:20 |
1 | Apple | 2024-05-30 20:29:20 |
You can get creative with your order clauses:
select *
from products
order by length(description) desc;
id | description | created |
---|---|---|
4 | Toy Airplane | 2024-05-30 20:29:20 |
2 | Banana | 2024-05-30 20:29:20 |
1 | Apple | 2024-05-30 20:29:20 |
3 | Milk | 2024-05-30 20:29:20 |
Limit and offset help to trim down the total results. Very useful when your database grows or when you don't want to waste bandwidth:
select *
from products
order by length(description) desc limit 2;
id | description | created |
---|---|---|
4 | Toy Airplane | 2024-05-30 20:29:20 |
2 | Banana | 2024-05-30 20:29:20 |
select *
from products
order by length(description) desc limit 2 offset 1;
id | description | created |
---|---|---|
2 | Banana | 2024-05-30 20:29:20 |
1 | Apple | 2024-05-30 20:29:20 |
You might be tempted to drop the order by and paginate anyway. I mean, you can do that, but it will give you unstable results: without the order by clause, it's up to the database to decide what goes in each offset. Not good if you want to recover precise information chunks.
Remember that: keep your data in order.
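A common pagination sketch, assuming pages of 10 items counted from 1, so page n skips (n - 1) * 10 rows:

-- page 3 of products; the order by keeps the pages stable
select *
from products
order by id
limit 10 offset 20;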
Last but not least, let's talk about window functions.
Regular queries are like unrolling a scroll. Sometimes it's a very long scroll, so you use limit, offset and order by to get just the part you want. Row by row, your data will be presented. Some RDBMS have a concept called a cursor; it's not quite pagination, but it serves a similar purpose.
Aggregation queries are like taking that scroll, folding it over itself and performing the aggregation operation you want over the grouped results. You can still order, limit and offset the result, but it's a different dataset now.
Window functions are what happens when you decide that you need a small piece of aggregate data alongside your regular row information.
So let's mix things up and see what it looks like:
select id, description, count(id) over ()
from products;
id | description | "count(id) over()" |
---|---|---|
1 | Apple | 4 |
2 | Banana | 4 |
3 | Milk | 4 |
4 | Toy Airplane | 4 |
The first two columns are regular ones, from existing rows in the products table, but the last one is an aggregation result.
The query shows us each product in the database alongside the total count of products created.
What if we want to know the product, its stock position, and how it compares with the total of all items in stock?
with latest_history as (select distinct p.id,
p.description,
max(sh.id) over (partition by sh.products_id) sh_id
from products p
join stock_history sh on sh.products_id = p.id)
select lh.id,
lh.description,
sh2.amount product_amount,
sum(sh2.amount) over () all_items_stock
from latest_history lh
join stock_history sh2 on sh2.id = lh.sh_id;
id | description | product_amount | all_items_stock |
---|---|---|---|
1 | Apple | 70 | 598 |
2 | Banana | 400 | 598 |
3 | Milk | 100 | 598 |
4 | Toy Airplane | 28 | 598 |
Yes, by combining record data and aggregate data in a single result set, you start to get better analytical insights.
The general form of a windowed column is an aggregate (or ranking) function followed by an over (...) clause, which may carry partition by and order by parts.
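A minimal sketch against the prices_history table from earlier: partition by restarts the aggregate per product, and adding order by inside the window turns it into a running one.

select products_id,
       value,
       avg(value) over (partition by products_id) as avg_product_price,
       sum(value) over (partition by products_id order by id) as running_total
from prices_history;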
Ranking offers information about who's your top thing.
For example, what if I want to know which product is the most sold?
select oi.products_id,
oi.orders_id,
oi.id as order_items_id,
p.description,
oi.amount,
rank() over (order by amount desc) as most_sold
from products p
left join order_items oi on oi.products_id = p.id
order by most_sold, description;
products_id | orders_id | order_items_id | description | amount | most_sold |
---|---|---|---|---|---|
2 | 2 | 4 | Banana | 100 | 1 |
1 | 2 | 3 | Apple | 20 | 2 |
1 | 1 | 1 | Apple | 10 | 3 |
4 | 1 | 2 | Toy Airplane | 2 | 4 |
null | null | null | Milk | null | 5 |
Another important thing to know: order by inside an over () clause works differently from a regular order by.
It defines a sliding window, remember that. Because of that, the following query computes accumulative sums instead of a single total sum for all rows:
select oi.id,
oi.products_id,
sum(oi.amount) over (order by oi.id) as acc_amount
-- sum(oi.amount) over () as total_amount
from order_items oi
order by oi.id;
id | products_id | acc_amount |
---|---|---|
1 | 1 | 10 |
2 | 4 | 12 |
3 | 1 | 32 |
4 | 2 | 132 |
This is quite useful when plotting charts.
What a ride, huh?
This was a long chapter but, if you're going enterprise, then you're going databases and queries.
Check the links, try to model things around you, and get good at it.
In the next chapter we get back to our normal kotlin programming.