Assignment 7 — Query Optimization

The goal of this assignment is to practice reading query execution plans and optimizing queries.

Schema

You will be working with four tables, R, S, T, U, connected by foreign keys:

The DDL creating these tables is as follows:
create table U(u_id int not null primary key, u_5 int, u_filler varchar) create table T(t_id int not null primary key, t_5 int, t_filler varchar) create table S(s_id int not null primary key, s_4 int, s_20 int, s_filler varchar, t_id int references T, u_id int references U) create table R(r_id int not null primary key, r_2 int, r_10 int, r_1p int, r_10p int, r_50p int, r_filler varchar, s_id int references S)

Data

Row counts are as follows:

R: 2,000,000 rows
S: 200,000 rows
T: 200,000 rows
U: 400,000 rows

The column naming scheme describes the selectivity of the columns. As you know, that information is crucial for understanding performance and optimizing queries.

Column name prefixes: All columns in table X have names with the prefix x_.
Primary key: x_id is the primary key of table X.
Equality search returning a fixed number of rows: A query of the form x_n = 0 will return exactly n rows.
Equality search returning a fixed percentage of rows: A query of the form x_np = 0 will return n% of the table's rows.
Filler: Every table X contains a column named x_filler which contains a 10-byte string.
Foreign keys: Tables R and S contain foreign keys. A column named y_id is a reference to table Y.

So, for example:

Query Output

select count(*) from r where r_2 = 0 2

select count(*) from r where r_10 = 0 10

select count(*) from r where r_1p = 0 20,000

select count(*) from r where r_10p = 0 200,000

Working with the database

Setup

Download this archive. It contains the following:

r.csv: Data for table R.
s.csv: Data for table S.
t.csv: Data for table T.
u.csv: Data for table U.
loaddb.sql: A script for loading the data into your database.

Load the data into your database as follows:


psql -h postgres.cs.tufts.edu < loaddb.sql

If you are using a postgres database on your own computer, you don't need the -h postgres.cs.tufts.edu arguments.

This should take under 2 minutes to run. (25 seconds on my laptop, probably slower on the homework and postgres servers.) You should only need to run this once, because to work on this assignment, you will not be modifying data. You will be adding and dropping indexes, but those are easily reversible changes.

To verify that the database is set up correctly, check the row counts, (e.g. select count(*) from r). The expected counts are given above.

Workflow

As you work on this assignment, you will do the following:

Print the execution plan for a query, using explain.
Analyze the plan.
Add or drop indexes to try and obtain a cheaper plan.
Run analyze.
Repeat.

I strongly suggest that you work on each query by putting the explains, index creation, and index deletion steps in a script. You will spend far too much time typing and getting things subtly wrong otherwise.

For example, suppose that I am working on this query:
select * from s where s_10p between 1 and 2

I want to examine the cost without an index, and with. I would create a script, test.sql as follows:
select 'ORIGINAL -- NO INDEXES'; explain select * from r where r_10p between 1 and 2; select 'ADD INDEX ON r_10P'; create index on r(r_10p); analyze r; explain select * from r where r_10p between 1 and 2; drop index r_r_10p_idx;

Note the following:

select statements are used to print comments in the output. This helps you navigate the output more easily.
Very important: After creating an index, the script runs analyze r. This is essential. If you don't do this, the optimizer will not have statistics on the index you just created, and this is likely to result in inferior execution plans.
Even more important: The end of the script drops any indexes that have been created. If you don't do this, then the index continues to exist, and the next time your run the script, the index will be there, which is probably not what was intended. For example, if I leave out the drop index statement above, and run the script three times, I get three indexes on table R (\d R prints schema information on table R):
jao=# \d r Table "public.r" Column | Type | Modifiers ----------+-------------------+----------- r_id | integer | not null r_2 | integer | r_10 | integer | r_1p | integer | r_10p | integer | r_50p | integer | r_filler | character varying | s_id | integer | Indexes: "r_pkey" PRIMARY KEY, btree (r_id) "r_r_10p_idx" btree (r_10p) "r_r_10p_idx1" btree (r_10p) "r_r_10p_idx2" btree (r_10p) Foreign-key constraints: "r_s_id_fkey" FOREIGN KEY (s_id) REFERENCES s(s_id)

(To delete these indexes: drop index r_r_10p_idx; drop index r_r_10p_idx1; drop index r_r_10p_idx2;)

Don't worry about the time taken by all the creation and deletion of indexes. The largest table has only 2M rows, and these steps will run quickly, (1-2 seconds).

I run my script as follows:
psql < test.sql > test.out

Here is the output (in test.out):
Pager usage is off. ?column? ------------------------ ORIGINAL -- NO INDEXES (1 row) QUERY PLAN ----------------------------------------------------------- Seq Scan on r (cost=0.00..46667.00 rows=402200 width=39) Filter: ((r_10p >= 1) AND (r_10p <= 2)) (2 rows) ?column? -------------------- ADD INDEX ON r_10P (1 row) CREATE INDEX ANALYZE QUERY PLAN ---------------------------------------------------------------------------------- Bitmap Heap Scan on r (cost=8532.24..31228.24 rows=401933 width=39) Recheck Cond: ((r_10p >= 1) AND (r_10p <= 2)) -> Bitmap Index Scan on r_r_10p_idx (cost=0.00..8431.76 rows=401933 width=0) Index Cond: ((r_10p >= 1) AND (r_10p <= 2)) (4 rows) DROP INDEX

I notice the following:

The initial plan is a sequential scan, which is not surprising. The cost is 46667.
Once the index is added, and analyze is run, the optimizer decides to use the index, resulting in a cost of 31228.
However, the plan is not an index scan, but a bitmap index scan. This is because the index is not selective -- the query retrieves 20% of the table. That is 400,000 rows, the estimated row count is quite good. Because the index is not highly selective, an index scan would result in a lot of random accesses. A bitmap scan should reduce the number of random accesses somewhat.

Finally: The questions

Plans print costs like this:
HashAggregate (cost=35405.68..35607.11 rows=20143 width=129)

The first red number is the cost to produce the first row, and the second one is the cost to produce the entire query result.

In all cases, aim to optimize the time to produce the entire query result, not just the first row.

For each question, submit the .sql file used to run explain, trying various indexing strategies. Also submit the output from each script. For the first question, put your written answers in q1.txt. For the remaining questions, include a text file identifying: 1) the set of indexes that led to the cheapest plan, (not all indexes, just the ones used in the cheapest plan); 2) the cheapest plan; and 3) the cost of that plan. If you just include explain output, that will include the plan and its cost. I.e., include qi.sql, qi.out, qi.txt for i = 1..7.

1) Explain these plans

Create indexes on each of these columns of R: r_2, r_1p, r_50p.
Don't forget to run analyze!
Run explain on each of these queries:
- select * from r where r_2 = 0
- select * from r where r_1p = 0
- select * from r where r_50p = 0

You should see three different plans.

Explain why these very similar queries get different plans.
Given what you know about the selectivity of the columns, why does each plan make sense?

2) What's the best indexing strategy?

Create one or more indexes to get the best possible performance for this query:
select * from r where r_1p = 0 and r_10p = 0 and r_50p = 0

3) What's the best indexing strategy?

Create one or more indexes to get the best possible performance for this query:
select * from r where r_1p between 0 and 5 and r_10p = 0 and r_50p = 0

4) What's the best indexing strategy?

Create one or more indexes to get the best possible performance for this query:
select t.* from t join s using (t_id) join r using (s_id) join u using (u_id) where u_5 = 0

5) What's the best indexing strategy?

Create one or more indexes to get the best possible performance for this query:
select r_10, s_20 from r join s using (s_id) where r_50p = 0 and s_4 = 0

6) What's the best way to index and/or rewrite this query?

This time, consider indexing strategies as usual, but also consider rewriting the query to obtain a query that can be optimized better.
select * from r where r_1p = 0 union select * from r where r_10 = 0

7) What's the best way to index and/or rewrite this query?

As with #6, consider both indexing strategies, and rewriting the query.
select * from r where s_id in ( select s_id from s where s_20 = 0 and t_id in ( select t_id from t where t_5 = 0 ) )

In case you want to make this interesting ...

Here are the lowest cost estimates I was able to obtain.

Query My laptop
(Postgres 10.6) postgres.cs.tufts.edu
(Postgres 9.2.24)

Q2 2963.11 3274.77

Q3 11620.25 12372.42

Q4 69.29 107.09

Q5 117.69 118.11

Q6 18286.34 18321.37

Q7 110.02 109.17

Query	Output
`select count(*) from r where r_2 = 0`	2
`select count(*) from r where r_10 = 0`	10
`select count(*) from r where r_1p = 0`	20,000
`select count(*) from r where r_10p = 0`	200,000

Query	My laptop (Postgres 10.6)	`postgres.cs.tufts.edu` (Postgres 9.2.24)
`Q2`	`2963.11`	`3274.77`
`Q3`	`11620.25`	`12372.42`
`Q4`	`69.29`	`107.09`
`Q5`	`117.69`	`118.11`
`Q6`	`18286.34`	`18321.37`
`Q7`	`110.02`	`109.17`