Assignment 6 — Query Processor Implementation

In assignment 2 you implemented Relational Algebra. Then in assignment 4, you implemented queries using Relational Algebra. In this assignment, you will be doing something similar: implementing a few query processing operators, and then using them to implement queries.

A query processing operator is much like a Relational Algebra operator except:

As in the earlier assignments, you will start with code that I provide (in this archive), which contains an implementation of Database, Row, Table and Index classes; and some unit tests. You will need to 1) implement query processing operators, and 2) use these operators to implement some of the same queries you implemented in Assignment 4.

Implementing the query processing operators is likely to be the most demanding part of this assignment. Unlike Relational Algebra operators, which return tables containing all output rows at once, you must implement the query processing operators as Iterators, which return one row each time next() is called. Structuring your code to return one row at a time is going to be trickier than simply generating all of the rows at once, (especially for join).

Writing the queries should be familiar to you from Assignment 4. The style of coding — deeply nested expressions combining operators — is similar.

What's in the archive

The archive contains the following:

Guided Tour

A query

It is probably easiest to understand the pieces of this assignment by examining a query, and seeing how it is implemented in terms of operators. Here is test_q1 from test_query_plans.cpp. (The name test_q1 is unchanged from Assignment 4, and in fact we are implementing the same query. But this time, using query processing operators instead of relational algebra.) static bool q1_predicate(const Row *row) { return row->at(1) == "Tweetii"; } static void test_q1() { Table *control1 = Database::new_table("control1", ColumnNames{"birth_date"}); add(control1, {"1984/02/28"}); Iterator* q1 = project( select(table_scan(user), q1_predicate), {2}); Iterator* c1 = table_scan(control1); CHECK(match(c1, q1)); delete q1; delete c1; }

Query Processing Interface

In the code above, the functions project, select, and table_scan are part of the query processing interface, declared in QueryProcessor.h. table_scan creates an Iterator that scans the user table. The select invocation selects those rows satisfying q1_predicate. Note that q1_predicate refers to columns by number. In this case, the username is stored in column 1 (the second column) of the input row. project takes Rows from the Iterator returned by select, and keeps only the last column (number 2).

Here is the complete interface (minus the documentation present in the header file):
Iterator* table_scan(Table* table); Iterator* index_scan(Index* index, Row* lo, Row* hi = NULL); Iterator* select(Iterator* input, RowPredicate predicate); Iterator* project(Iterator* input, initializer_list project_columns); Iterator* nested_loops_join(Iterator* left, const initializer_list& left_join_columns, Iterator* right, const initializer_list& right_join_columns); Iterator* sort(Iterator* input, const initializer_list& sort_columns); Iterator* unique(Iterator* input);

These functions are implemented in QueryProcessor.cpp, and all they do is to create query processing operator objects, e.g.
Iterator* project(Iterator* input, initializer_list project_columns) { return new Project(input, project_columns); }

So the project function is implemented by the Project class, the table_scan function is implemented by the TableScan class, etc.

Query Processing Operators

Project is a query processing operator, declared in Operators.h, and implemented in Operators.cpp. Here is the declaration:
class Project : public Iterator { public: unsigned n_columns() override; void open() override; Row* next() override; void close() override; public: Project(Iterator* input, const initializer_list& columns); ~Project(); private: Iterator* _input; ColumnSelector _column_selector; };

A few things to note about this class:

Here is the implementation of Project, from Operators.cpp:
unsigned Project::n_columns() { return _column_selector.n_selected(); } void Project::open() { _input->open(); } Row* Project::next() { Row* projected = NULL; Row* row = _input->next(); if (row) { projected = new Row(); for (unsigned i = 0; i < _column_selector.n_selected(); i++) { projected->append(row->at(_column_selector.selected(i))); } Row::reclaim(row); } return projected; } void Project::close() { _input->close(); } Project::Project(Iterator* input, const initializer_list& columns) : _input(input), _column_selector(input->n_columns(), columns) {} Project::~Project() { delete _input; }

Important things to notice:

What you need to do

All of the code in this walkthrough can be found in the provided archive. I've implemented test_q1 and the member functions of Project. However, you will need to implement the member functions of the other operators in Operators.cpp, and the other queries in test_query_plans.cpp, (look for IMPLEMENT_ME).

Major classes

Database, Table, Row

The Database class is pretty much the same as in previous assignments.

In previous assignments, Table was a set of Rows, (using the RowSet type). To reflect database system internals more accurately, this assignment defines a Table as a list of Rows. This means that duplicates may occur, (just like a sequential file).

Row is a bit different. As in the past, a Table contains Rows, and you can access Row field values by using a column name, e.g.
Row* person = ... printf("name: %s\n", person->value("name");

However, column names are only useful in conjunction with tables. Projects and joins can easily generate rows that have little resemblance to any tables. So for this assignment, operators will index fields of Rows by position, starting from 0. If a Row r has n fields, then you can access the ith field as r->at(i), i = 0 .. n-1. (All fields are strings, as before.)

ColumnSelector

Several of the operators require special columns to be identified. For example, Project needs to know which columns to keep. NestedLoopsJoin needs to know, for each input iterator, what columns to compare for the join. The ColumnSelector class is used for this purpose.

For example, suppose we have an Iterator whose rows have 5 columns (numbered 0 through 4), and we want to project on columns 1 and 2. The Project operator would have a ColumnSelector object set up as follows:

RowCompare

RowCompare is different than in previous assignments. Previously, RowCompare::only was the only RowCompare object needed, to compare whether two rows are equal, and to determine row ordering. But now, the Sort operator needs to be able to sort by a given set of columns. RowCompare has been modified to help you.

Your Sort implementation should be able to use std::sort to do the actual sorting. The last argument to std::sort should be a RowCompare object that compares rows according to the Sort operator's sort columns. I.e., your Sort::open() implementation should do the sorting by creating an approprate RowCompare object and then passing it to std::sort.

Indexes

One new feature of the database classes is that tables can now have indexes. Query execution plans can start by scanning tables, or by doing index lookups, as in a real database system.

An IndexScan operator is specified by giving the lo and hi key values of interest. The scan is implemented by searching for lo, and scanning forward until hi is exceeded. IndexScan::next() returns a Row pointed to by the index.

For example, suppose we have a Table, t, with columns x, y, z:
x y z --------------- "5" "10" "123" "5" "71" "456" "6" "66" "789" "7" "42" "543" "8" "83" "765" "5" "99" "678" "3" "16" "432" "8" "25" "987"

We can create an index on y as follows:
Index* ty = t->add_index(ColumnNames{"y"});

This index provides access to Rows of t in order of y values: "10", "16", "25", "42", "66", "71", "83", "99".

To search for all the Rows with y between "20" and "50", we can do this:
Row* lo = new Row(1); lo->append("20"); Row* hi = new Row(1); hi->append("50"); Iterator* i = index_scan(ty, lo, hi);

Using this iterator yields the rows ("8", "25", "987"), ("7", "42", "543").

Iterator

Each operator class is a subclass of Iterator, defined in Iterator.h:
class Iterator { public: virtual unsigned n_columns() = 0; virtual void open() = 0; virtual Row* next() = 0; virtual void close() = 0; virtual ~Iterator() {} };

open(), next(), and close() are as discussed in class.

n_columns() returns the number of columns in the Iterator's rows, (i.e., the Rows returned by next()). So for example, TableIterator::n_columns() returns the number of columns in the underlying table. Project::n_columns() returns the number of columns being projected.

Schema and data

The schema for this assignment is similar to Assignment 4. We have the same tables:
user( user_id primary key, username, birth_date) message( message_id primary key, send_date, text) routing( from_user_id references user, to_user_id references user, message_id, primary key(from_user_id, to_user_id, message_id)

In addition, the user table has an index on username, (this is created in the setup() function of test_query_plans.cpp).

The db directory contains the same data as in Assignment 4.

Memory management

I have tried to eliminate as many memory management headaches for you as possible. (E.g., I implemented the destructor for each operator class. You're welcome.) There is just one thing you need to watch out for: deleting rows created by your operator implementations, that will never be needed again.

For example, note the call to Row::reclaim() in Project::next(). The input row is never returned. Instead, a new row is constructed (in the variable projected), and that is returned from next(). The input row will never be used again, so it is reclaimed.

Don't delete rows by using the C++ delete statement. Use Row::reclaim() instead. This function deletes a row unless it belongs to a table. We want to keep those of, course. We only want to delete rows generated by operators which will never be needed again.

Operator tests

Here is a typical operator test:
void sort_non_empty() { Table* t = Database::new_table("t", ColumnNames{"a", "b", "c"}); add(t, {"a", "z", "30"}); add(t, {"c", "y", "20"}); add(t, {"e", "x", "10"}); add(t, {"g", "w", "40"}); add(t, {"a", "z", "31"}); add(t, {"c", "y", "21"}); add(t, {"e", "x", "11"}); add(t, {"g", "w", "41"}); Iterator* i = sort(table_scan(t), {1, 2, 0}); Table* control = Database::new_table("control", ColumnNames{"a", "b", "c"}); add(control, {"g", "w", "40"}); add(control, {"g", "w", "41"}); add(control, {"e", "x", "10"}); add(control, {"e", "x", "11"}); add(control, {"c", "y", "20"}); add(control, {"c", "y", "21"}); add(control, {"a", "z", "30"}); add(control, {"a", "z", "31"}); Iterator* control_iterator = table_scan(control); CHECK(i->n_columns() == 3); TWICE { CHECK(match(control_iterator, i)); }; delete i; delete control_iterator; }

How to attack this assignment

I suggest proceeding as follows:

  1. Don't worry about deleting rows (Row::reclaim() calls) until all tests are passing. Otherwise, you will waste mental energy constantly worrying about memory management as you work on correctness.

  2. Implement the operators one at a time. In main.cpp, disable the call of test_queries. In test_operators, disable the tests for operators that you haven't implemented yet. Once you get an operator working correctly (tests pass, valgrind shows no memory corruption), go on to the next.

  3. Re-enable the call of test_queries, and then implement the queries one at a time, starting with the simplest one. You will need to fill in the expressions where you see IMPLEMENT_ME.

  4. If you are having trouble with a query, use print_iterator() to check intermediate results. E.g. if you have an expression like this:
    select( project( nested_loops_join( table_scan(r), {3} table_scan(s), {1}), {0, 1, 4}), some_predicate)

    and it is crashing, or returning incorrect results, then you can debug by examining sub-expressions, e.g.
    print_iterator("r scan", table_scan(r)); print_iterator("s scan", table_scan(s)); print_iterator("join", nested_loops_join(table_scan(r), {3}, table_scan(s), {1})));

  5. Valgrind is your friend, not your enemy. Run it frequently, and if it reports memory corruption, fix these problems as soon immediately. If you neglect to do this, problems will get harder to debug as you add more code, and you will be subject to random crashes due to memory corruption.

  6. Note that NestedLoopsJoin has a data member named _left_row, which is intended to store a row from the _left input to the join. If you understand the nested join algorithm, and how it should be implemented as an Iterator, then you should understand why this is a useful thing to have.

  7. Once all tests are passing, start eliminating memory leaks. The basic idea is to reclaim a row as soon as you are sure that it will not be passed to another operator by being returned from next().

Building the code

Unpack the archive, and cd into the a6 directory.

Run make: jao@mintyzack ~/115/assignments/a6/a6_jao $ make g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c ColumnNames.cpp -o ColumnNames.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c ColumnSelector.cpp -o ColumnSelector.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c Database.cpp -o Database.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c Index.cpp -o Index.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c main.cpp -o main.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c Operators.cpp -o Operators.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c QueryProcessor.cpp -o QueryProcessor.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c Row.cpp -o Row.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c RowCompare.cpp -o RowCompare.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c Table.cpp -o Table.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c test_operators.cpp -o test_operators.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c test_query_plans.cpp -o test_query_plans.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c unittest.cpp -o unittest.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c util.cpp -o util.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 ColumnNames.o ColumnSelector.o Database.o Index.o main.o Operators.o QueryProcessor.o Row.o RowCompare.o Table.o test_operators.o test_query_plans.o unittest.o util.o -o a6

Before you fill in the IMPLEMENT_ME sections, you will probably see compiler warnings about statements with no effect. Those will disappear as you fill in code.

Running the code

Run the code as follows:
jao@mintyzack ~/115/a6/a6_jao $ ./a6 db

You will see two batches of tests run, the operator tests and the query tests.

Your submission

You need to do the following: