Osh (Object SHell) is a tool that integrates the processing of
structured data, database access, and remote access to a cluster of
nodes. These capabilities are made available through a command-line
interface (CLI) and a Python application programming interface (API).
Osh processes streams of Python objects using simple
commands. Complex data processing is achieved by command sequences in
which the output from one command is passed to the input of the
next. This is similar to composing Unix commands using pipes.
However, Unix commands pass strings from one command to the next, and
the commands (grep, awk, sed, etc.) are heavily string-oriented. Osh
commands send primitive Python types such as strings and numbers;
composite types such as tuples, lists and maps; objects representing
files, dates and times; or even user-defined objects.
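The difference between passing strings and passing objects can be sketched in plain Python with generators. This is only an illustration of the idea, not osh itself: the names gen_requests, keep and transform are hypothetical helpers invented for this sketch, and the request data is made up.

```python
from datetime import date


def gen_requests():
    """Hypothetical source stage: emits tuples of Python objects, not text lines."""
    yield (1, "open", date(2006, 1, 10))
    yield (2, "closed", date(2006, 1, 12))
    yield (3, "open", date(2006, 2, 1))


def keep(stream, predicate):
    """Pass along only the objects for which predicate(*obj) is true."""
    for obj in stream:
        if predicate(*obj):
            yield obj


def transform(stream, function):
    """Replace each object with function(*obj)."""
    for obj in stream:
        yield function(*obj)


# Each stage consumes the objects produced by the previous one, so the date
# stays a date object and no string parsing is needed between stages.
pipeline = transform(
    keep(gen_requests(), lambda rid, state, opened: state == "open"),
    lambda rid, state, opened: (rid, opened.year),
)
result = list(pipeline)
print(result)
```

Because the stages exchange tuples rather than text, a later stage can call opened.year directly instead of re-parsing a date string, which is the kind of work grep, awk and sed pipelines usually have to do.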
Suppose you have a cluster named fred, consisting of nodes
101, 102, and 103. Each node has a database tracking work requests
with a table named request. You can find the total number of open
requests in each database as follows (using the CLI):
jao@zack$ osh @fred [ sql "select count(*) from request where state = 'open'" ] ^ out
- osh: Invokes the osh interpreter.
- @fred [ ... ]:
fred is the name of a cluster (configured in the osh configuration file, .oshrc).
A thread is created for each node of the cluster, and the bracketed command is executed on each thread.
- sql "select count(*) from request where state = 'open'": sql is an osh
command that submits a query to a relational database. The query output is returned as
a stream of tuples.
- ^ out: ^ is the osh operator for piping objects from one command to the next.
In this case, the input objects are tuples resulting from execution of a SQL query on each
node of the cluster. The out command renders each object as a string and prints it to stdout.
- Each output row identifies the node of origin (e.g. 101, 102)
and includes a tuple from the database on that node.
So ('103', 5) means that the database on node 103 has 5 open requests.
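The pattern described above (one thread per node, the same query on each node, each result row tagged with its node) can be approximated in plain Python. This is a minimal sketch and not how osh is implemented: it assumes in-memory SQLite databases as stand-ins for the per-node databases, run_on_node and run_on_cluster are hypothetical names, and the per-node data is invented so that it agrees with the ('103', 5) tuple discussed in the text.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-node data: the state of each request stored on that node.
NODE_REQUESTS = {
    "101": ["open", "closed"],
    "102": ["closed"],
    "103": ["open"] * 5 + ["closed"] * 2,
}

QUERY = "select count(*) from request where state = 'open'"


def run_on_node(node):
    """Stand-in for remote execution: build a throwaway database for one node,
    run the query, and prefix every result row with the node name."""
    db = sqlite3.connect(":memory:")
    try:
        db.execute("create table request (state text)")
        db.executemany(
            "insert into request values (?)",
            [(state,) for state in NODE_REQUESTS[node]],
        )
        return [(node,) + tuple(row) for row in db.execute(QUERY)]
    finally:
        db.close()


def run_on_cluster(nodes):
    """Run the query for every node on its own thread and merge the rows."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        per_node = pool.map(run_on_node, nodes)  # results keep the input order
        return [row for rows in per_node for row in rows]


rows = run_on_cluster(sorted(NODE_REQUESTS))
for row in rows:
    print(row)
```

Each connection is created and used inside a single worker thread, which sidesteps sqlite3's default restriction on sharing a connection across threads.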
Now suppose you want to find the total number of open requests
across the cluster. You can pipe the (node, request count) tuples into an aggregation command:
jao@zack$ osh @fred [ sql "select count(*) from request where state = 'open'" ] ^ agg 0 'total, node, count: total + count' $
Note that this example combines remote execution on cluster nodes, database access (on each cluster node),
and data processing (the aggregation step) in a single framework.
- agg: agg is the aggregation command. Tuples from across the cluster are piped
into the agg command, which will accumulate results from all inputs.
- 0: agg will maintain a total, which is initialized to 0.
- 'total, node, count: total + count': This specifies an aggregation function.
total is the running total,
which was initialized to 0.
node and count come from the sql command executed on each node of the cluster.
total + count accumulates the counts from each node.
- $: An alternative to ^ out that can be used at the end of a command only.
- 6: The total of the counts from across the cluster.
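The aggregation step behaves like a fold over the incoming tuples: start from the initial value, then combine the running total with each tuple's fields. A minimal plain-Python sketch of that idea follows. The helper name aggregate is hypothetical (it is not the osh agg command), and the input rows are made-up values chosen to be consistent with the ('103', 5) tuple and the total of 6 mentioned in the text.

```python
from functools import reduce


def aggregate(stream, initial, function):
    """Fold a stream of tuples into a single value.

    Each tuple is unpacked into the trailing arguments of `function`,
    whose first argument is the running result.
    """
    return reduce(lambda acc, obj: function(acc, *obj), stream, initial)


# Illustrative (node, count) tuples, as they might arrive from the cluster.
rows = [("101", 1), ("102", 0), ("103", 5)]

# Same shape as the function in the command line above:
# total is the running total, node and count come from each tuple.
total = aggregate(rows, 0, lambda total, node, count: total + count)
print(total)
```

The node field is accepted by the function but ignored, which mirrors the aggregation function in the example: it has to name every field of the incoming tuple even when only count contributes to the result.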
The same computation can be done using the API as follows:
from osh.api import *
osh(fork("fred",
         remote(sql("select count(*) from request where state = 'open'"))),
    agg(0, lambda total, node, count: total + count))
- from osh.api import *: Imports the osh API.
- osh(...): Invokes the osh interpreter.
- fork("fred", remote(sql(...))): Runs the sql command on each node of cluster fred, in parallel.
- agg(...): Aggregates query results from across the cluster.
Command Reference Guide
Software with similar goals
PyCon 2006 paper on osh
PyCon 2006 talk on osh