Jack Orenstein

home | education | industry | research | teaching | contact


Citations for the publications referred to below can be found here.

Erdös number: 3.
Engheta number: 2.

Table Grouping (2009-present)

Akiban Server is a database system designed to optimize complex queries, characterized by many joins. The basic idea is to colocate frequently joined rows in a single b-tree. When these joins are specified in a query, the necessary rows can then be retrieved in a way that makes effective use of caches at all levels. While there is extensive research literature non-first normal form query processing, and on hierarchical query languages, Akiban had a unique problem: how to execute relational queries (SQL) over hierarchical structures. This led to a number of research problems:

  • Storing a hierarchy in a b-tree: The basic storage scheme is straightforward -- the b-tree key is formed by concatenating primary keys from the root of the hierarchy (defined by grouping foreign keys) to each table. Table identifiers separate the key segments to allow for branching hierarchies, (as opposed to just deep hierarchies). A number of research problems had to be solved to make this hierarchical representation practical: how to maintain b-tree keys across primary and foreign key updates; how to deal with "orphan" rows, (which have no parent in the hierarchy); and how to do locking to allow for good concurrency while maintaining correct operation.
  • Algebra for query processing: We needed a set of query processing operators for supporting SQL queries over hierarchically structured data. "Flattening" operators have been used in other non-first normal algebras. We also needed operators for navigating the hierarchy, and we needed to generalize more conventional operators (e.g. for selection) to accomodate hierarchically structured data. Factors influencing the design of the algebra included expressiveness, performance, and the needs of the query optimizer.
  • Indexing: Indexes over hierarchical data provide a number of new optimization opportunities. I contributed to research on how to implement indexes whose fields span the tables in a hierarchy.
Scalable Data Structures (2003-2009)

At Archivas, I was responsible for the design and implementation of the Metadata Manager, the component which maintained metadata on archived files. This metadata includes POSIX metadata, and locations of file copies within the archive. We initially considered one large database that would scale up as needed. But this database would have been a single point of failure. It would not scale as well as the rest of the system, and would become a bottlneck. As nodes are added to the cluster, throughput would increase, but only up to a limit imposed by the database.

For these reasons, we decided to distribute the database, using techniques from the field of scalable data structures. I conducted original research in this area:

  • How to avoid rehashing everything: When nodes fail and leave the cluster, or when nodes are added to the cluster, data needs to be load-balanced across the modified set of nodes. But rehashing everything is prohibitively expensive. The traditional approach to this problem is to use consistent hashing. We pursued an approach based on linear hashing instead.
  • Maintaining synchronous replicas: Each linear hash bucket has multiple copies. These copies have to be kept tightly synchronized so that, following a node failure, surviving copies can be used to continue servicing request.
  • Mapping hierarchical filesystem to hash-based data structure: POSIX semantics require directory metadata to be updated each time a file is created or deleted. There is no easy way to map a hierarchy to a set of hash buckets -- does a directory go to the same bucket as the child or the parent? I designed a mapping that maintains POSIX semantics correctly with correct transaction semantics and good performance. The implementation supports up to 200,000 files per directory.

Object/Relational Mapping (1995-2000)

At Object Design, I became interested in the problems of linking the object-oriented and relational data models, based on some problems faced by one of our customers, USWest. USWest liked the programming model supported by Object Design's product, ObjectStore, but wanted to be able to use that model in conjunction with traditional relational database systems. I wrote a series of internal papers describing technical approaches to these problems. This design work led to the establishment of a group whose goal was to create products mapping in both directions between object-oriented and relational databases. A customer application built on top of one of these tools was described in a 1995 conference publication.

In late 1995 I decided to build an object/relational mapping system for the Java language, which had just been released by Sun. Object Design was not interested in pursuing this work so I left to build the product on my own.

A few other object/relational mapping systems existed, mostly for C++. However, they all had serious performance problems. An object-oriented application spends much of its time navigating from one object to another. Turning each navigation into a retrieval from a relational database is an easy translation, but this is an inefficient way to access a relational database. I defined and solved a number of research problems in producing a system whose performance was comparable to what could be achieved by hand-coded relational database applications. A paper describing some of this research was published in 1999. Other original work was not published.

I sold my company to Novera Software, Inc. in 1996. My product became a central piece of the Novera product line.

Object-Oriented Database Systems (1989-1996)

Traditional database systems are based on the relational model of data. Virtually all relational database systems use a server-centric architecture. Many applications cannot use relational database systems because the data model is not adequate for modeling the concepts on which the application is based, and because the performance characteristics of the application call for a more client-centric architecture.

Object-oriented database systems offer an alternative. The data model of an object-oriented database system is much closer to that of a programming language, (an object-oriented one), and most of the persistence mechanism is implemented on the client side. These approaches combine to produce a database system that is well-suited to applications involving detailed analysis of complex data, e.g. geographic information systems, ECAD, MCAD systems.

At CCA, on the DARPA-funded PROBE research project, I started working on the problem of how to extend the capabilities of database systems with object-oriented features. My work on spatial data processing represented part of this work. Papers describing the PROBE project were published in several journals and conference proceedings.

PROBE was focused on adding object-oriented capabilities to traditional database systems. Another project I worked on, Persistent Ada, dealt with a new approach to persistence, adding persistence to a programming language. Persistent Ada was a brief, DARPA-funded research project, and produced no software.

I worked on ideas related to Persistent Ada at Object Design. The company's goal was to create a commercial product based on C++, so in addition to producing a database system based on an object-oriented model, we also had to develop a client-centric architecture. My contributions were in the area of data modeling, query languages and query processing. I surveyed the relevant research literature in these areas, selected a set of the more practical ideas, and implemented them. In a number of instances, the implementation work required solutions to new research problems. Conference and journal publications on the novel aspects of the system appeared between 1991 and 1993.

Spatial Data Processing (1982-1991)

Traditional database systems are designed to handle alphanumeric information, such as names and numbers. Many important applications need both the data management capabilities of traditional database systems, as well as the ability to handle spatial data, e.g. maps, solid models, and terrain models. Three approaches have been used to build these applications: 1) Ad hoc implementations; 2) Systems geared to applications within a domain, e.g. Geographic Information Systems; 3) Hybrid systems built by combining a traditional database system with extensions for geographic data (as an example).

My research in this area has pursued a fourth possibility: extending the capabilities of database systems to handle arbitrary spatial data types. This work has been carried out in the context of traditional relational database systems, as well as extended-relational and object-oriented database systems. The results have been described in conference and journal publications appearing between 1983 and 1991. The research has also led to the creation of a software product, a spatial index library named Geophile. The product has been licensed to two object database vendors (Object Design, Objectivity), and is in use in deployed applications by their customers.

There are at least two other implementations of these research ideas:
  • I consulted with IBMs Santa Teresa Labs on my research, and they have used it in conjunction with a medical imaging application.
  • Microsoft has adopted my algorithms for use in their TerraServer application, which is a showcase for their SQL Server database system.
Professional Activities
  • Member of the program committees for various database conferences, including SIGMOD and VLDB, (several times each).
  • Reviewer for several journals.
  • Reviewer for several M.Sc. dissertations.