Assignment 7 — Query Optimization

The goal of this assignment is to practice reading query execution plans and optimizing queries.

Schema

You will be working with four tables, R, S, T, U, connected by foreign keys:

The DDL creating these tables is as follows:
create table U(u_id int not null primary key, u_5 int, u_filler varchar) create table T(t_id int not null primary key, t_5 int, t_filler varchar) create table S(s_id int not null primary key, s_4 int, s_20 int, s_filler varchar, t_id int references T, u_id int references U) create table R(r_id int not null primary key, r_2 int, r_10 int, r_1p int, r_10p int, r_50p int, r_filler varchar, s_id int references S)

Data

Row counts are as follows:

The column naming scheme describes the selectivity of the columns. As you know, that information is crucial for understanding performance and optimizing queries.

So, for example:

Query Output
select count(*)
from r
where r_2 = 0
2
select count(*)
from r
where r_10 = 0
10
select count(*)
from r
where r_1p = 0
20,000
select count(*)
from r
where r_10p = 0
200,000

Working with the database

Setup

Download this archive. It contains the following: Load the data into your database as follows:
psql -h postgres.cs.tufts.edu < loaddb.sql

If you are using a postgres database on your own computer, you don't need the -h postgres.cs.tufts.edu arguments.

This should take under 2 minutes to run. (25 seconds on my laptop, probably slower on the homework and postgres servers.) You should only need to run this once, because to work on this assignment, you will not be modifying data. You will be adding and dropping indexes, but those are easily reversible changes.

To verify that the database is set up correctly, check the row counts, (e.g. select count(*) from r). The expected counts are given above.

Workflow

As you work on this assignment, you will do the following:

I strongly suggest that you work on each query by putting the explains, index creation, and index deletion steps in a script. You will spend far too much time typing and getting things subtly wrong otherwise.

For example, suppose that I am working on this query:
select * from s where s_10p between 1 and 2

I want to examine the cost without an index, and with. I would create a script, test.sql as follows:
select 'ORIGINAL -- NO INDEXES'; explain select * from r where r_10p between 1 and 2; select 'ADD INDEX ON r_10P'; create index on r(r_10p); analyze r; explain select * from r where r_10p between 1 and 2; drop index r_r_10p_idx;

Note the following:

(To delete these indexes: drop index r_r_10p_idx; drop index r_r_10p_idx1; drop index r_r_10p_idx2;)

Don't worry about the time taken by all the creation and deletion of indexes. The largest table has only 2M rows, and these steps will run quickly, (1-2 seconds).

I run my script as follows:
psql < test.sql > test.out

Here is the output (in test.out):
Pager usage is off. ?column? ------------------------ ORIGINAL -- NO INDEXES (1 row) QUERY PLAN ----------------------------------------------------------- Seq Scan on r (cost=0.00..46667.00 rows=402200 width=39) Filter: ((r_10p >= 1) AND (r_10p <= 2)) (2 rows) ?column? -------------------- ADD INDEX ON r_10P (1 row) CREATE INDEX ANALYZE QUERY PLAN ---------------------------------------------------------------------------------- Bitmap Heap Scan on r (cost=8532.24..31228.24 rows=401933 width=39) Recheck Cond: ((r_10p >= 1) AND (r_10p <= 2)) -> Bitmap Index Scan on r_r_10p_idx (cost=0.00..8431.76 rows=401933 width=0) Index Cond: ((r_10p >= 1) AND (r_10p <= 2)) (4 rows) DROP INDEX

I notice the following:

Finally: The questions

Plans print costs like this:
HashAggregate (cost=35405.68..35607.11 rows=20143 width=129)

The first red number is the cost to produce the first row, and the second one is the cost to produce the entire query result.

In all cases, aim to optimize the time to produce the entire query result, not just the first row.

For each question, submit the .sql file used to run explain, trying various indexing strategies. Also submit the output from each script. For the first question, put your written answers in q1.txt. For the remaining questions, include a text file identifying: 1) the set of indexes that led to the cheapest plan, (not all indexes, just the ones used in the cheapest plan); 2) the cheapest plan; and 3) the cost of that plan. If you just include explain output, that will include the plan and its cost. I.e., include qi.sql, qi.out, qi.txt for i = 1..7.

1) Explain these plans

You should see three different plans.

  1. Explain why these very similar queries get different plans.

  2. Given what you know about the selectivity of the columns, why does each plan make sense?

2) What's the best indexing strategy?

Create one or more indexes to get the best possible performance for this query:
select * from r where r_1p = 0 and r_10p = 0 and r_50p = 0

3) What's the best indexing strategy?

Create one or more indexes to get the best possible performance for this query:
select * from r where r_1p between 0 and 5 and r_10p = 0 and r_50p = 0

4) What's the best indexing strategy?

Create one or more indexes to get the best possible performance for this query:
select t.* from t join s using (t_id) join r using (s_id) join u using (u_id) where u_5 = 0

5) What's the best indexing strategy?

Create one or more indexes to get the best possible performance for this query:
select r_10, s_20 from r join s using (s_id) where r_50p = 0 and s_4 = 0

6) What's the best way to index and/or rewrite this query?

This time, consider indexing strategies as usual, but also consider rewriting the query to obtain a query that can be optimized better.
select * from r where r_1p = 0 union select * from r where r_10 = 0

7) What's the best way to index and/or rewrite this query?

As with #6, consider both indexing strategies, and rewriting the query.
select * from r where s_id in ( select s_id from s where s_20 = 0 and t_id in ( select t_id from t where t_5 = 0 ) )

In case you want to make this interesting ...

Here are the lowest cost estimates I was able to obtain.

Query My laptop
(Postgres 10.6)
postgres.cs.tufts.edu
(Postgres 9.2.24)
Q2 2963.11 3274.77
Q3 11620.25 12372.42
Q4 69.29 107.09
Q5 117.69 118.11
Q6 18286.34 18321.37
Q7 110.02 109.17