|
Abstract
Our current concern is a scalable infrastructure for
information retrieval (IR) with up-to-date retrieval
results in the presence of frequent, continuous updates.
Timely processing of updates is important in novel
application domains, e.g., e-commerce. We want to use
off-the-shelf hardware and software as much as possible.
These issues are challenging, given the additional
requirement that the resulting system must scale well. We
have built PowerDB-IR, a system that has the
characteristics sought. This paper describes its design,
implementation, and evaluation. PowerDB-IR is a
coordination layer for a database cluster. The rationale
behind a database cluster is to "scale out", i.e., to add
further cluster nodes whenever necessary for better
performance. We build on IR-to-database mappings and
service decomposition to support high-level parallelism.
We follow a three-tier architecture with the database
cluster as the bottom layer for storage management. The
middle tier provides IR-specific processing and update
services. PowerDB-IR has the following features: It allows
documents to be inserted and retrieved concurrently, and it
ensures freshness with almost no overhead. Alternative
physical data organization schemes provide adequate
performance for different workloads. Query processing
techniques for the different data organizations efficiently
integrate the ranked retrieval results from the cluster
nodes. We have run extensive experiments with our prototype
using commercial database systems and middleware software
products. The main result is that PowerDB-IR shows
surprisingly ideal scalability and low response times.
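
To illustrate what integrating ranked retrieval results from the cluster nodes can look like, the following is a minimal sketch and not PowerDB-IR's actual implementation. It assumes that documents are partitioned across nodes, that each node returns its local hits sorted by descending score, and that scores are comparable across nodes. The function name `merge_ranked` and the `(doc_id, score)` tuple layout are hypothetical.

```python
import heapq
from itertools import islice


def merge_ranked(node_results, k):
    """Merge per-node ranked lists into a global top-k list.

    node_results: list of lists of (doc_id, score) tuples, each list
                  already sorted by descending score (assumption).
    k:            number of hits to return.
    """
    # heapq.merge lazily merges sorted inputs; reverse=True tells it
    # that the inputs are in descending order of the key.
    merged = heapq.merge(*node_results, key=lambda hit: hit[1], reverse=True)
    # Only the first k hits are pulled from the lazy merge.
    return list(islice(merged, k))


# Hypothetical results from two cluster nodes.
node_a = [("d1", 0.9), ("d4", 0.5)]
node_b = [("d7", 0.8), ("d2", 0.3)]
top = merge_ranked([node_a, node_b], 3)
```

Because the merge is lazy, the coordinator in this sketch inspects only as many hits as it needs to produce the top k, regardless of how long the per-node lists are.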
|