2 System Overview

In Thor, applications running at client machines make use of persistent objects stored at servers. Applications interact with Thor by calling methods of objects; these calls occur within atomic transactions. The methods are implemented in a type-safe language called Theta [LCD+94].

To speed up applications, client machines cache copies of persistent objects. Clients communicate with servers only to fetch pages or to commit transactions; we use optimistic concurrency control [Gru97][AGLM95] to serialize transactions. The fact that computations run as transactions does not preclude the use of HAC in non-transactional systems: the main impact of transactions is to define when modifications are sent to servers, and this could be defined differently, e.g., a file system could send updates every few seconds.
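To make the division of labor concrete, here is a minimal sketch of the commit-retry loop a client runs under optimistic concurrency control. The names (ThorClient, fetch, commit) are assumptions for illustration; Thor applications invoke Theta methods rather than calling such an interface directly.

```
// Illustrative sketch only: interface names are assumed, not Thor's API.
#include <cstdint>

using Oref = uint32_t;          // 32-bit object reference (Section 2.2)
struct Object { /* persistent state */ };

class ThorClient {
public:
    void begin_transaction();   // start tracking reads and writes locally
    Object* fetch(Oref o);      // on a cache miss, fetch the containing page
    bool commit();              // ship modified objects to the servers; they
                                // validate optimistically and may force abort
};

void run_transaction(ThorClient& client, Oref root) {
    do {
        client.begin_transaction();
        Object* obj = client.fetch(root);
        // ... invoke methods on obj and the objects it references ...
        (void)obj;
    } while (!client.commit()); // an aborted transaction is simply retried
}
```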

Good performance for a distributed object storage system requires good solutions for client cache management, storage management at servers, and concurrency control for transactions. Performance studies of our previous implementation, Thor-0, showed that its techniques for server storage management [Ghe95] and concurrency control [Gru97][AGLM95] worked well; these techniques were therefore retained in Thor-1.

2.1 Servers

Servers store objects on disk in pages and maintain a page cache in main memory to speed up client fetch requests. The results presented in this paper were obtained using 8 KB pages, but the system can be configured to use a different page size. To simplify cache management, objects are required not to span page boundaries. Pages are large enough that this requirement does not cause significant internal fragmentation. Objects larger than a page are represented using a tree.

When a transaction commits, its client sends information about its modifications to the servers in the commit request; if the commit succeeds, these changes become visible to later transactions. Because client caches in HAC may hold objects without their containing pages, we must ship modified objects (and not their containing pages) to servers at commit time. Earlier work [WD95][OS94] has shown that this leads to poor performance if it is necessary to immediately read the objects' pages from disk to install them in the database. We avoid this cost by using the modified object buffer (MOB) architecture [Ghe95]. The server maintains an in-memory MOB that holds the latest versions of objects modified by recent transactions. New versions are written to the MOB when transactions commit; when the MOB fills up, versions are written to their disk pages in the background.
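The MOB's two sides can be sketched as follows; the types and the install callback are assumptions for illustration, and the real MOB [Ghe95] additionally deals with recovery and with pacing the background writes.

```
// Sketch of a modified object buffer (types and helper names assumed).
#include <cstdint>
#include <map>
#include <vector>

using Oref  = uint32_t;
using Bytes = std::vector<uint8_t>;

class Mob {
    std::map<Oref, Bytes> latest_;  // newest committed version of each object
    size_t used_ = 0, capacity_;
public:
    explicit Mob(size_t capacity) : capacity_(capacity) {}

    // Commit path: record new versions in memory; no disk reads are needed.
    void apply_commit(const std::map<Oref, Bytes>& writes) {
        for (const auto& [oref, bytes] : writes) {
            Bytes& slot = latest_[oref];    // overwrites any older version
            used_ -= slot.size();
            slot = bytes;
            used_ += slot.size();
        }
    }

    bool full() const { return used_ > capacity_; }

    // Background path: install pending versions into their disk pages (read
    // page, patch objects, write back) and drop them from the MOB.
    template <typename InstallFn>
    void flush(InstallFn install_to_page) {
        for (auto it = latest_.begin(); full() && it != latest_.end(); ) {
            install_to_page(it->first, it->second);
            used_ -= it->second.size();
            it = latest_.erase(it);
        }
    }
};
```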

2.2 Page and Object Format

Keeping objects small at both servers and clients is important because it has a large impact on performance [MBMS95][WD94]. Our objects are small primarily because object references (or orefs) are only 32 bits. Orefs refer to objects at the same server; objects point to objects at other servers indirectly via surrogates. A surrogate is a small object that contains the identifier of the target object's server and its oref within that server; this is similar to designs proposed in [DLMM94][Mos90][Bis77]. Surrogates will not impose much penalty in either space or time, assuming the database can be partitioned among servers so that inter-server references are rare and rarely followed; we believe these are realistic assumptions.
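A surrogate can be pictured as the following small structure; the exact field layout is an assumption for illustration, not Thor's documented format.

```
// Sketch of a surrogate for a cross-server reference (layout assumed).
#include <cstdint>

struct Surrogate {
    uint32_t server_id;    // identifies the target's server (could be wider,
                           // at the cost only of a bigger surrogate)
    uint32_t target_oref;  // the target's 32-bit oref within that server
};
```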

Object headers are also 32 bits. They contain the oref of the object's class object, which contains information such as the number and types of the object's instance variables and the code of its methods.

An oref is a pair consisting of a 22-bit pid and a 9-bit oid (the remaining bit is used at the client as discussed below). The pid identifies the object's page and allows fast location of the page both on disk and in the server cache. The oid identifies the object within its page but does not encode its location. Instead, a page contains an offset table that maps oids to 16-bit offsets within the page. The offset table has an entry for each existing object in a page; this extra 2 bytes, added to the 4-byte object header, yields a total overhead of 6 bytes per object. The offset table is important because it allows servers to compact objects within their pages independently from other servers and clients. It also provides a larger address space, allowing servers to store a maximum of 2 G objects consuming a maximum of 32 GB (2^22 pages × 2^9 objects per page = 2^31 objects; 2^22 pages × 8 KB per page = 2^35 bytes); this size limitation does not unduly restrict servers, since a physical server machine can implement several logical servers.
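The sketch below shows one plausible realization of this format; the bit positions within the oref, the in-page placement of the offset table, and its fixed size are assumptions (a real page would size the table to the objects it actually holds).

```
// Sketch of the oref encoding and per-page offset table (details assumed).
#include <cstdint>

using Oref = uint32_t;

// Assumed layout: | 22-bit pid | 9-bit oid | 1 bit reserved for the client |
constexpr Oref make_oref(uint32_t pid, uint32_t oid) {
    return (pid << 10) | (oid << 1);
}
constexpr uint32_t pid_of(Oref o) { return o >> 10; }
constexpr uint32_t oid_of(Oref o) { return (o >> 1) & 0x1ff; }

struct ObjectHeader { Oref class_oref; };  // the 32-bit object header

struct Page {                      // one 8 KB page
    uint16_t offset_table[512];    // oid -> 16-bit offset from page start
    uint8_t  data[8192 - sizeof(uint16_t) * 512];

    ObjectHeader* locate(uint32_t oid) {
        // Offsets are page-relative, so the server can compact objects by
        // rewriting this table; no reference held elsewhere ever changes.
        return reinterpret_cast<ObjectHeader*>(
            reinterpret_cast<uint8_t*>(this) + offset_table[oid]);
    }
};
```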

Our design allows us to address a very large database. For example, a server identifier of 32 bits allows 2^32 servers and a total database of 2^67 bytes. However, our server identifiers can be larger than 32 bits; the only impact on the system is that surrogates will be bigger. In contrast, most systems that support large address spaces use very large pointers, e.g., 64-bit [LAC+96][CLFL95], 96-bit [Kos95], or even 128-bit pointers [WD92]. In Quickstore [WD94], which also uses 32-bit pointers to address large databases, storage compaction at servers is very expensive because all references to an object must be corrected when it is moved (whereas our design makes it easy to avoid fragmentation).

2.3 Clients

The client cache is partitioned into a set of page-sized frames. Some frames are intact: they hold pages that were fetched from servers. Others are compacted: they hold objects that were retained when their containing server pages were compacted. When a client tries to use an object that is not in its cache, it fetches the object's containing page from the appropriate server and frees a frame to hold it. Pages at the client have the same size and structure as at the server to avoid extra copies.
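A minimal sketch of the two kinds of frames, with assumed names:

```
// Sketch of client cache frames (names assumed).
#include <cstdint>

enum class FrameKind {
    Intact,     // holds a page exactly as fetched from a server
    Compacted   // holds objects retained when server pages were compacted
};

struct Frame {
    FrameKind kind;
    uint8_t   data[8192];  // page-sized in either case
};
```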

It is not practical to represent object pointers as orefs in the client cache because each pointer dereference would require a table lookup to translate the name into the object's memory location. Therefore, clients perform pointer swizzling [WD92][Mos92][KK90], i.e., they replace the orefs in objects' instance variables with virtual memory pointers to speed up pointer traversals. HAC uses indirect pointer swizzling [KK90]: the oref is translated to a pointer to an entry in an indirection table, and the entry points to the target object. Indirection allows HAC to move and evict objects from the client cache with low overhead; indirection has also been found to simplify page eviction in a page-caching system [MS95].
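Indirect swizzling can be sketched as follows; the entry's fields are assumptions, chosen to be consistent with the 16-byte entry size reported below.

```
// Sketch of an indirection-table entry (fields assumed; 16 bytes on a
// 64-bit machine: 8-byte pointer plus two 4-byte words).
#include <cstdint>

struct Object;

struct IndirectionEntry {
    Object*  target;    // the object's current location in the client cache
    uint32_t oref;      // identity, so the object can be re-fetched if evicted
    uint32_t refcount;  // maintained lazily, as described next
};

// A dereference costs one extra load, but moving or evicting an object only
// requires updating its entry, never the pointers held by other objects.
inline Object* deref(IndirectionEntry* e) { return e->target; }
```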

HAC uses a novel lazy reference counting mechanism to discard entries from the indirection table. The reference count in an entry is incremented whenever a pointer is swizzled and decremented when objects are evicted, but no reference count updates are performed when objects are modified. Instead, reference counts are corrected lazily when a transaction commits, to account for the modifications performed during the transaction. This scheme is described in detail in [CAL97], where it is shown to have low overheads.
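One plausible rendering of this scheme, reusing the IndirectionEntry sketch above (helper names are assumed; [CAL97] gives the actual details):

```
// Sketch of lazy reference counting (helper names assumed; see [CAL97]).

// Swizzling a pointer to an object bumps the count of its entry ...
void on_swizzle(IndirectionEntry* e) { ++e->refcount; }

// ... and evicting an object decrements the entries that its swizzled
// pointers refer to.
void on_evict_outgoing(IndirectionEntry* e) { --e->refcount; }

// Nothing is updated while a transaction mutates pointer fields. At commit,
// each modified field's old and new values are compared once and the
// affected counts corrected.
void fix_on_commit(IndirectionEntry* old_e, IndirectionEntry* new_e) {
    if (old_e == new_e) return;
    if (old_e) --old_e->refcount;
    if (new_e) ++new_e->refcount;
}
```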

In-cache pointers are 32 bits, which is the pointer size on most machines. HAC uses 32-bit pointers even on 64-bit machines; it simply ensures that the cache and the indirection table are located in the lower 2^32 bytes of the address space. The results presented in Section 4 were obtained on a machine with 64-bit pointers.

Both pointer swizzling and installation of objects, i.e., allocating an entry for the object in the indirection table, are performed lazily. Pointers are swizzled the first time they are loaded from an instance variable into a register [WD92]; the extra bit in the oref is used to determine whether a pointer has been swizzled or not. Objects are installed in the indirection table the first time a pointer to them is swizzled. The size of an indirection table entry is 16 bytes. Laziness is important because many objects fetched to the cache are never used, and many pointers are never followed. Furthermore, lazy installation reduces the number of entries in the indirection table, and it is cheaper to evict objects that are not installed.
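The check itself can be sketched with the tag bit. The encoding below (tag clear for an unswizzled oref, tag set for a pointer to a 16-byte-aligned entry) and the helper install_or_lookup are assumptions, not HAC's documented scheme.

```
// Sketch of a lazy swizzle on first load (encoding and helper assumed).
#include <cstdint>

struct IndirectionEntry;                             // as sketched above
IndirectionEntry* install_or_lookup(uint32_t oref);  // installs on first use

inline IndirectionEntry* load_field(uint32_t* field) {
    uint32_t v = *field;
    if ((v & 1u) == 0) {                    // tag clear: still an oref
        IndirectionEntry* e = install_or_lookup(v);
        // The cache and table live below 2^32, so the pointer fits in the
        // 32-bit field; the set tag bit marks the field as swizzled.
        *field = static_cast<uint32_t>(reinterpret_cast<uintptr_t>(e)) | 1u;
        return e;
    }
    return reinterpret_cast<IndirectionEntry*>(static_cast<uintptr_t>(v & ~1u));
}
```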
