Assignment 6 — Query Processor Implementation

In assignment 2 you implemented Relational Algebra. Then in assignment 4, you implemented queries using Relational Algebra. In this assignment, you will be doing something similar: implementing a few query processing operators, and then using them to implement queries.

A query processing operator is much like a Relational Algebra operator except:

A query processing operator operates on streams of rows, not sets of rows.
A query processing operator may generate duplicates, (since streams, unlike sets, are not guaranteed to have unique elements).
Columns are identified by position, not by name.

As in the earlier assignments, you will start with code that I provide (in this archive), which contains an implementation of Database, Row, Table and Index classes; and some unit tests. You will need to 1) implement query processing operators, and 2) use these operators to implement some of the same queries you implemented in Assignment 4.

Implementing the query processing operators is likely to be the most demanding part of this assignment. Unlike Relational Algebra operators, which return tables containing all output rows at once, you must implement the query processing operators as Iterators, which return one row each time next() is called. Structuring your code to return one row at a time is going to be trickier than simply generating all of the rows at once, (especially for join).

Writing the queries should be familiar to you from Assignment 4. The style of coding — deeply nested expressions combining operators — is similar.

What's in the archive

The archive contains the following:

Database classes: There is an implementation of database classes, similar to what you implemented in Assignment 1:
- ColumnNames.{h,cpp}
- ColumnSelector.{h,cpp}
- Database.{h,cpp}
- dbexceptions.h
- Index.{h,cpp}
- Row.{h,cpp}
- RowCompare.{h,cpp}
- Table.{h,cpp}
You should use these source files without modification.
Query processor interface: QueryProcessor.{h,cpp}. Don't modify this either. This code has the functions you will use to implement queries.
Query processor implementation: This is in Iterator.h, and Operators.{h,cpp}. Iterator is the base class of the query processing operators' implementations. Operators.{h,cpp} contain the classes implementing each operator. One of your main tasks is to fill in the implementations of the member functions in Operators.cpp.
Tests: main.cpp, test_operators.cpp, test_query_plans.cpp, unittest.{h,cpp}. These are the tests exercising your code. test_operators.cpp ensures the correct functioning of your operator implementations. test_query_plans.cpp is mostly filled in, but you need to write some query plans, as you did in Assignment 4.
Makefile: So that you can build the code.
db directory: The db directory contains the data for the member, message, and routing tables, as in previous assignments. This data is used by the query plan tests.

Guided Tour

A query

It is probably easiest to understand the pieces of this assignment by examining a query, and seeing how it is implemented in terms of operators. Here is test_q1 from test_query_plans.cpp. (The name test_q1 is unchanged from Assignment 4, and in fact we are implementing the same query. But this time, using query processing operators instead of relational algebra.) static bool q1_predicate(const Row *row) { return row->at(1) == "Tweetii"; } static void test_q1() { Table *control1 = Database::new_table("control1", ColumnNames{"birth_date"}); add(control1, {"1984/02/28"}); Iterator* q1 = project( select(table_scan(user), q1_predicate), {2}); Iterator* c1 = table_scan(control1); CHECK(match(c1, q1)); delete q1; delete c1; }

Query Processing Interface

In the code above, the functions project, select, and table_scan are part of the query processing interface, declared in QueryProcessor.h. table_scan creates an Iterator that scans the user table. The select invocation selects those rows satisfying q1_predicate. Note that q1_predicate refers to columns by number. In this case, the username is stored in column 1 (the second column) of the input row. project takes Rows from the Iterator returned by select, and keeps only the last column (number 2).

Here is the complete interface (minus the documentation present in the header file):
Iterator* table_scan(Table* table); Iterator* index_scan(Index* index, Row* lo, Row* hi = NULL); Iterator* select(Iterator* input, RowPredicate predicate); Iterator* project(Iterator* input, initializer_list project_columns); Iterator* nested_loops_join(Iterator* left, const initializer_list& left_join_columns, Iterator* right, const initializer_list& right_join_columns); Iterator* sort(Iterator* input, const initializer_list& sort_columns); Iterator* unique(Iterator* input);

These functions are implemented in QueryProcessor.cpp, and all they do is to create query processing operator objects, e.g.
Iterator* project(Iterator* input, initializer_list project_columns) { return new Project(input, project_columns); }

So the project function is implemented by the Project class, the table_scan function is implemented by the TableScan class, etc.

Query Processing Operators

Project is a query processing operator, declared in Operators.h, and implemented in Operators.cpp. Here is the declaration:
class Project : public Iterator { public: unsigned n_columns() override; void open() override; Row* next() override; void close() override; public: Project(Iterator* input, const initializer_list& columns); ~Project(); private: Iterator* _input; ColumnSelector _column_selector; };

A few things to note about this class:

Project is a subclass of Iterator. All of the operator classes are Iterator subclasses, which means they must implement n_columns(), open(), next(), and close(); you will provide these implementations. (The Iterator class is discussed later.)
A Project object is represented by two fields, _input and _column_selector. I have provided the field declarations for each operator class. It is your job to implement class using these fields.
The constructor takes input (passed in from the project() function), and will obviously use these arguments to initialize fields.

Here is the implementation of Project, from Operators.cpp:
unsigned Project::n_columns() { return _column_selector.n_selected(); } void Project::open() { _input->open(); } Row* Project::next() { Row* projected = NULL; Row* row = _input->next(); if (row) { projected = new Row(); for (unsigned i = 0; i < _column_selector.n_selected(); i++) { projected->append(row->at(_column_selector.selected(i))); } Row::reclaim(row); } return projected; } void Project::close() { _input->close(); } Project::Project(Iterator* input, const initializer_list& columns) : _input(input), _column_selector(input->n_columns(), columns) {} Project::~Project() { delete _input; }

Important things to notice:

The constructor initializes the fields.
Observe how much delegation there is. Project::open() simply calls _input->open(). Project::close() simply calls _input->close(). And the destructor deletes _input.
The major work of this class is done by next(). An input row is scanned, the columns selected for projection are written to a new row, and that new row is returned.
Project::next() reclaims rows that will never be used again, to avoid memory leaks. In general, if an operator receives a row and does not return it from next(), then the row should be reclaimed. A row that is read and returned by next() should not be reclaimed by the operator — it will be the responsibility of some later operator to do the reclamation.

What you need to do

All of the code in this walkthrough can be found in the provided archive. I've implemented test_q1 and the member functions of Project. However, you will need to implement the member functions of the other operators in Operators.cpp, and the other queries in test_query_plans.cpp, (look for IMPLEMENT_ME).

Major classes

Database, Table, Row

The Database class is pretty much the same as in previous assignments.

In previous assignments, Table was a set of Rows, (using the RowSet type). To reflect database system internals more accurately, this assignment defines a Table as a list of Rows. This means that duplicates may occur, (just like a sequential file).

Row is a bit different. As in the past, a Table contains Rows, and you can access Row field values by using a column name, e.g.
Row* person = ... printf("name: %s\n", person->value("name");

However, column names are only useful in conjunction with tables. Projects and joins can easily generate rows that have little resemblance to any tables. So for this assignment, operators will index fields of Rows by position, starting from 0. If a Row r has n fields, then you can access the ith field as r->at(i), i = 0 .. n-1. (All fields are strings, as before.)

ColumnSelector

Several of the operators require special columns to be identified. For example, Project needs to know which columns to keep. NestedLoopsJoin needs to know, for each input iterator, what columns to compare for the join. The ColumnSelector class is used for this purpose.

For example, suppose we have an Iterator whose rows have 5 columns (numbered 0 through 4), and we want to project on columns 1 and 2. The Project operator would have a ColumnSelector object set up as follows:

n_columns(): 5, the number of columns in an input row.
n_selected(): 2, the number of columns in an output row.
n_unselected(): 3, the number of columns not selected.
selected(i): The position in the input row of the ith output column. So selected(0) = 1 and selected(1) = 2.
unselected(i): The position in the intput row of the unselected columns. So unselected(0) = 0, unselected(1) = 3, unselected(2) = 4.

RowCompare

RowCompare is different than in previous assignments. Previously, RowCompare::only was the only RowCompare object needed, to compare whether two rows are equal, and to determine row ordering. But now, the Sort operator needs to be able to sort by a given set of columns. RowCompare has been modified to help you.

Your Sort implementation should be able to use std::sort to do the actual sorting. The last argument to std::sort should be a RowCompare object that compares rows according to the Sort operator's sort columns. I.e., your Sort::open() implementation should do the sorting by creating an approprate RowCompare object and then passing it to std::sort.

Indexes

One new feature of the database classes is that tables can now have indexes. Query execution plans can start by scanning tables, or by doing index lookups, as in a real database system.

An IndexScan operator is specified by giving the lo and hi key values of interest. The scan is implemented by searching for lo, and scanning forward until hi is exceeded. IndexScan::next() returns a Row pointed to by the index.

For example, suppose we have a Table, t, with columns x, y, z:
x y z --------------- "5" "10" "123" "5" "71" "456" "6" "66" "789" "7" "42" "543" "8" "83" "765" "5" "99" "678" "3" "16" "432" "8" "25" "987"

We can create an index on y as follows:
Index* ty = t->add_index(ColumnNames{"y"});

This index provides access to Rows of t in order of y values: "10", "16", "25", "42", "66", "71", "83", "99".

To search for all the Rows with y between "20" and "50", we can do this:
Row* lo = new Row(1); lo->append("20"); Row* hi = new Row(1); hi->append("50"); Iterator* i = index_scan(ty, lo, hi);

Using this iterator yields the rows ("8", "25", "987"), ("7", "42", "543").

Iterator

Each operator class is a subclass of Iterator, defined in Iterator.h:
class Iterator { public: virtual unsigned n_columns() = 0; virtual void open() = 0; virtual Row* next() = 0; virtual void close() = 0; virtual ~Iterator() {} };

open(), next(), and close() are as discussed in class.

n_columns() returns the number of columns in the Iterator's rows, (i.e., the Rows returned by next()). So for example, TableIterator::n_columns() returns the number of columns in the underlying table. Project::n_columns() returns the number of columns being projected.

Schema and data

The schema for this assignment is similar to Assignment 4. We have the same tables:
user( user_id primary key, username, birth_date) message( message_id primary key, send_date, text) routing( from_user_id references user, to_user_id references user, message_id, primary key(from_user_id, to_user_id, message_id)

In addition, the user table has an index on username, (this is created in the setup() function of test_query_plans.cpp).

The db directory contains the same data as in Assignment 4.

Memory management

I have tried to eliminate as many memory management headaches for you as possible. (E.g., I implemented the destructor for each operator class. You're welcome.) There is just one thing you need to watch out for: deleting rows created by your operator implementations, that will never be needed again.

For example, note the call to Row::reclaim() in Project::next(). The input row is never returned. Instead, a new row is constructed (in the variable projected), and that is returned from next(). The input row will never be used again, so it is reclaimed.

Don't delete rows by using the C++ delete statement. Use Row::reclaim() instead. This function deletes a row unless it belongs to a table. We want to keep those of, course. We only want to delete rows generated by operators which will never be needed again.

Operator tests

Here is a typical operator test:
void sort_non_empty() { Table* t = Database::new_table("t", ColumnNames{"a", "b", "c"}); add(t, {"a", "z", "30"}); add(t, {"c", "y", "20"}); add(t, {"e", "x", "10"}); add(t, {"g", "w", "40"}); add(t, {"a", "z", "31"}); add(t, {"c", "y", "21"}); add(t, {"e", "x", "11"}); add(t, {"g", "w", "41"}); Iterator* i = sort(table_scan(t), {1, 2, 0}); Table* control = Database::new_table("control", ColumnNames{"a", "b", "c"}); add(control, {"g", "w", "40"}); add(control, {"g", "w", "41"}); add(control, {"e", "x", "10"}); add(control, {"e", "x", "11"}); add(control, {"c", "y", "20"}); add(control, {"c", "y", "21"}); add(control, {"a", "z", "30"}); add(control, {"a", "z", "31"}); Iterator* control_iterator = table_scan(control); CHECK(i->n_columns() == 3); TWICE { CHECK(match(control_iterator, i)); }; delete i; delete control_iterator; }

We populate a table, t.
Next, we create an Iterator, i, that will scan and sort the table.
We create a Table, control, that specifies the expected output from the operator, and get an iterator over it, control_iterator.
Then we check correctness. First, we verify that the iterator yields the expected number of columns.
Finally, we compare the rows returned by the two iterators.
The TWICE macro runs the test twice. This ensures that Iterators can be reused — opened, scanned, closed, and then opened and scanned again. This is important for some operators, (e.g. nested_loops_join, which obviously has an inner loop that is run multiple times).

How to attack this assignment

I suggest proceeding as follows:

Don't worry about deleting rows (Row::reclaim() calls) until all tests are passing. Otherwise, you will waste mental energy constantly worrying about memory management as you work on correctness.
Implement the operators one at a time. In main.cpp, disable the call of test_queries. In test_operators, disable the tests for operators that you haven't implemented yet. Once you get an operator working correctly (tests pass, valgrind shows no memory corruption), go on to the next.
Re-enable the call of test_queries, and then implement the queries one at a time, starting with the simplest one. You will need to fill in the expressions where you see IMPLEMENT_ME.
If you are having trouble with a query, use print_iterator() to check intermediate results. E.g. if you have an expression like this:
select( project( nested_loops_join( table_scan(r), {3} table_scan(s), {1}), {0, 1, 4}), some_predicate)

and it is crashing, or returning incorrect results, then you can debug by examining sub-expressions, e.g.
print_iterator("r scan", table_scan(r)); print_iterator("s scan", table_scan(s)); print_iterator("join", nested_loops_join(table_scan(r), {3}, table_scan(s), {1})));
Valgrind is your friend, not your enemy. Run it frequently, and if it reports memory corruption, fix these problems as soon immediately. If you neglect to do this, problems will get harder to debug as you add more code, and you will be subject to random crashes due to memory corruption.
Note that NestedLoopsJoin has a data member named _left_row, which is intended to store a row from the _left input to the join. If you understand the nested join algorithm, and how it should be implemented as an Iterator, then you should understand why this is a useful thing to have.
Once all tests are passing, start eliminating memory leaks. The basic idea is to reclaim a row as soon as you are sure that it will not be passed to another operator by being returned from next().

Building the code

Unpack the archive, and cd into the a6 directory.

Run make: jao@mintyzack ~/115/assignments/a6/a6_jao $ make g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c ColumnNames.cpp -o ColumnNames.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c ColumnSelector.cpp -o ColumnSelector.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c Database.cpp -o Database.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c Index.cpp -o Index.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c main.cpp -o main.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c Operators.cpp -o Operators.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c QueryProcessor.cpp -o QueryProcessor.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c Row.cpp -o Row.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c RowCompare.cpp -o RowCompare.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c Table.cpp -o Table.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c test_operators.cpp -o test_operators.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c test_query_plans.cpp -o test_query_plans.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c unittest.cpp -o unittest.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 -c util.cpp -o util.o g++ -g -Wall -Wno-unused-function -O0 -std=c++11 ColumnNames.o ColumnSelector.o Database.o Index.o main.o Operators.o QueryProcessor.o Row.o RowCompare.o Table.o test_operators.o test_query_plans.o unittest.o util.o -o a6

Before you fill in the IMPLEMENT_ME sections, you will probably see compiler warnings about statements with no effect. Those will disappear as you fill in code.

Running the code

Run the code as follows:
jao@mintyzack ~/115/a6/a6_jao $ ./a6 db

You will see two batches of tests run, the operator tests and the query tests.

Your submission

You need to do the following:

Implement all of the operator member functions marked IMPLEMENT_ME in Operators.cpp. Feel free to add fields and member functions to the operator classes. (However, I suspect it will not be necessary to add any fields.)
Implement test_q2_table_scan, test_q2_index_scan, and test_q3 in test_queries.cpp. Note that in test_q2_table_scan you MUST use a table_scan of the user table, while in test_q2_index_scan, you must use, instead, an index_scan of username_index.
Zip up everything (code, Makefile, db directory) and submit it using provide.