|
1. Introduction
In Heirich and Moll [hei99] a commodity-based architecture was
presented for scalable visualization in distributed systems. This
architecture allows large numbers of rendering servers to post images
on rectilinear tiles of a large display, and to apply compositing,
blending, and other pixel-level operators to those displayed images.
This paper reports on the application of a second generation implementation
(Sepia-2) of this architecture to structured volume visualization.
The goal of this current work is to demonstrate that large volume
datasets (3D rectilinear grids of scalars) can be interactively rendered
with a networked configuration of off-the-shelf components. Volume
rendering is performed by a proprietary ray-casting engine, the VolumePro
500 [pfi99]. This engine implements object-order ray casting using
a shear-warp factorization of the viewing matrix [lac94].
We present an algorithm for concurrent subvolume rendering followed
by pipelined view-dependent tiling and blending (Figure 1). This
algorithm requires dynamic view-dependent remapping of the blending
pipeline on each frame, similar to requirements posed by Porter-Duff
compositing in XFree86 [por84]. We demonstrate how to map the algorithm
onto a large multi-stage network configuration, consisting of PCs
with blending/compositing (Sepia-2) boards and ray-casting engines,
connected by a high speed network, and utilizing a GL-based graphics
board for display (Figure 2). Our test configuration demonstrates
interactive rendering at 30 frames per second which is the maximum
frame rate of the ray-casting engine. Extrapolations from a small
demonstration cluster suggest that interactivity will be sustainable
in clusters of 1024 servers and larger.
The Sepia architecture is platform neutral beyond the unavoidable
obligation to support device drivers and programming infrastructure.
It is being prototyped for commercial use in various kinds of computing
platforms. These range from clusters of workstations or PCs containing
Intel CPUs, to clusters of Alpha servers, and large NUMA multiprocessor
systems with hundreds of AGP graphics accelerators and Alpha CPUs.
All of these environments support Linux and/or Unix. The configuration
presented in this paper uses Pentium-III graphics workstations running
Microsoft Windows NT. By clustering next generation ray casting engines
we expect to visualize 3D gigavoxel data sets interactively using
as few as 8 of these nodes.
2. Related work
A number of projects are exploring hardware, software, and algorithms
for scalable rendering using commodity components [sto01,bla00,eld00,sam00,hum00,hei99].
The Sepia architecture is probably most similar to the Lightning-2
project [sto01]. Both architectures distribute image pixels across
a network with switching between source and display. Sepia switches
at the level of tiles, Lightning-2 at the level of scanline fragments.
Unlike Lightning-2 Sepia’s approach does not produce visible
artifacts and can support non-commutative operators, of which Porter-Duff
operators are an example. The Metabuffer [bla00] is a more feature-rich
approach to building a scalable display subsystem, with support for
independent viewports that span tile boundaries and have independent
multi-resolution scaling. Like Lightning-2 the Metabuffer is built
on a half duplex mesh topology. In this paper we demonstrate how
to implement pipelined blending operations in the Sepia architecture
using a full-duplex multi-stage network. All three of these projects
(Lightning-2, Metabuffer, Sepia) transfer pixels over a network and
are complementary to coder-decoder approaches like WireGL [hum00]
that use the network to distribute graphical primitives. Both of
these architectural approaches are bandwidth limited in current generation
components but will become practical in high bandwidth commodity
technologies like HyperTransport and InfiniBand.
According to the taxonomy introduced by Molnar [mol94] the Sepia
architecture is compatible with sort-first or sort-last approaches
to building graphics systems. Our main interest has been in sort-last
approaches due to their scaling advantages. The state of the art
in algorithm development for both approaches is represented by companion
papers in these symposium proceedings.
Figure 1. Shear-warp volume rendering
(top), parallel clustering (bottom, left), Sepia-2 block
diagram (bottom, right). Object-order ray casting
provides optimal data throughput but results in a distorted
image when the view vector is not axis-aligned with the
data volume (top). In standard shear-warp rendering the
distorted Base Plane (BP) is corrected by a 2D warp on
an OpenGL accelerator. This approach can be applied in
parallel with concurrent subvolume rendering, followed
by blending and tiling (bottom, left) and then a final
2D warp. The result of tiling a set of BPs is a Superimage
Base Plane (SBP). Blending is performed by the Sepia-2
card (bottom right) which merges a 30 Hz series of locally
computed SBPs with SBPs arriving and departing through
the network interface.
3. TeraVoxel rendering with shear-warp viewing
factorization
Our scientific interest in this problem is driven by two projects
in large data visualization funded by the National Science Foundation.
The TeraVoxel project is constructing an electronic flow visualization
system to capture snapshots of physical fluid experiments in real
time followed by interactive visualization and exploration. The Kilo-Frame
per Second (KFS) camera collects regularly gridded 1024 2 cross sectional
samples of a fluid pressure field, corresponding to data a rate of
one gigavoxel per second. Our current goal is to visualize one second
(1024 samples) of this data interactively. Another scientific interest,
and a focus of the Large Data Visualization initiative, is interactive
rendering for full-resolution MRI data sets. Current MRI data sets
of one quarter to one half gigavoxel are typically rendered as 256
3 reduced data sets. Both MRI systems and the KFS camera produce
rectilinear data sets and can be supported by this demonstration.
Object-order ray casting makes efficient use of general purpose
CPUs by organizing the ray casting computation for optimal data access.
The same principle applies to a dedicated hardware ray casting engine
like the VolumePro. In both cases a shear-warp factorization of the
viewing matrix allows efficient image generation regardless of viewpoint
[lac94]. Leaving aside issues of scaling and axis reordering, the
shear matrix S defines a hexagonal projection of the volumetric data
onto a volumetric axis-aligned Base Plane (BP). The warp matrix W
corrects distortion in the Base Plane image that results from misalignment
with the image plane, and is defined in terms of the standard modelview
matrix M as W = M x S –1 . This warp is a 2D operation and
is usually computed by mapping the BP image onto a polygon as an
OpenGL texture.
Parallelizing this algorithm is straightforward. Data is mapped
statically onto rendering servers (computers equipped with VolumePro
cards) which all generate images simultaneously. When the images
are ready they are merged in a view-dependent order using associative
blending operators. The final tiled and blended result is corrected
by the 2D warp just as in the serial algorithm.
3.1 Blending arithmetic
The ray casting computation blends the successive contributions
of an ordered set of voxels that intersect a line of sight. If V(acc,
s) corresponds to a blending operator that combines
a new voxel sample s with a total accumulated
result accT then the result
of a front to back ray casting computation is:
where s1... sn are
successive voxel samples and 0 represents
the contribution from a transparent voxel. VolumePro [pfi99]
implements the following front-to-back blending operator V for
all color channels C (where α is
the opacity channel, Cacc the
accumulated color, αs the
opacity of sample s, etc.)
Parallelization breaks (1) into two or more concurrent
subcomputations, for example: acc1 = V(V(...V(V(0,s1),s2),...),sn/2) and acc2
= V(V(...V(0,sn/2+1),sn/2+2),...),sn).
The blending computation in the Sepia-2 firmware must compute
a function F such that F(acc1,
acc2) = accT. In addition
this F must be associative (in
order to scale above two) with an identity element 0 that
corresponds to a transparent pixel. We use the standard approach
of pre-multiplication by α and
define a function P on pixels s
We can easily verify that the following F is
associative with identity element 0 (a
transparent black voxel) and further that V(acc,
s) = F(acc, P(s)).
We conclude that F(acc1, acc2) gives
the same arithmetic result as the original ray casting computation.
We note that our computations use the 8 bit pixel values
from the BP, but that VolumePro uses internal 12 bit blending.
As a result our F will produce
more rounding error than the original accT.
This will affect the accuracy of the result at large scales.
3.2 Dynamic mapping for pipelines in full duplex multi-stage
networks
Any distributed algorithm for volume visualization must solve a
problem of routing pixel data according to a dynamically changing
viewpoint. A less stringent form of this problem also occurs in supporting
Porter-Duff compositing for XFree86. Blending operators are non-commutative
and must be computed along a line of sight. When data is mapped statically
onto computers, as it is in our system, the blending order will change
as the viewing direction changes. In the Sepia architecture blending
requires solving a Hamiltonian circuit problem on each frame: given
the current viewing direction, find a contention-free mapping of
the blending pipeline onto the network that visits every Sepia-2
card once and only once.
The network technologies that we work with (VIA and Infiniband)
implement full duplex destination-based routing. Network switches
are programmed with routing tables, lookup functions that provide
the outgoing local port address for each network packet as a deterministic
function of the destination address contained in the packet header.
This enforces a condition that only a single path may exist between
any source and destination pair, although aliasing relaxes this condition
to a finite number of paths. A single routing table must be defined
to support every permutation of node orderings. This contrasts with
other technologies (e.g. Myrinet) that implement source routing,
where every packet is prefixed with instructions for routing it through
the network.
In a single-stage network the mapping problem is solved trivially.
Any full bisection crossbar allows the full set of permutations of
nodes 1...n such that each node i connects to node i+1 by a contention-free
path, with node n identified arbitrarily with the display server.
The resulting connectivity graph is a star, in which the Hamiltonian
circuit is trivially embeddable.
We will solve the mapping problem on three-stage symmetric Clos
networks [clo53] which are isomorphic to two-stage fully interconnected
full duplex networks with identified inputs and outputs. The Slepian-Duguid
theorem [sle90] defines conditions for contention-free mapping in
a Clos network. A set of contention-free routing tables exists if
the number of switches in stage two is greater than or equal to the
number of links connecting each switch in stage two to switches in
stage one. The theorem provides a constructive algorithm for generating
a set of contention-free routing tables under these conditions.
In order to solve our Hamiltonian circuit problem in a full duplex
network we note that the solution is equivalent to solutions of two
concatenated Clos mapping problems, one in the forward direction
and another in the reverse direction, if the outputs of the first
problem are identified with the inputs of the second problem. The
union of the routing table solutions for these two problems is a
set of routing tables (possibly after aliasing) that allow any Hamiltonian
circuit to be embedded without contention. This provides a constructive
solution to mapping in two-stage fully interconnected full duplex
networks using switches with any number of ports. This result can
be extended inductively to as many stages as may be required by any
size configuration.
4. Demonstration system
The algorithm for using the VolumePro in parallel, and a Sepia-2
block diagram, are illustrated in Figure 1. The data volume is distributed
to a set of rendering servers which all generate images of their
local subvolumes. Each of these images is acquired by a Sepia-2 card,
which blends it in pipeline order with an intermediate blended result
from the preceding Sepia-2 card. The fully blended result is delivered
by a Sepia-2 card into the memory of a display server which applies
the final warp and display (Figures 2 and 3).
The VolumePro ray casting engine generates a pre-warped image in
the form of a Base Plane (BP). The BP is a smaller version of a tiled
image known as the Superimage Base Plane (BP). Application software
controls the VolumePro, converts each resulting BP into a larger
SBP, and then controls the blending computation in the Sepia-2 card.
Figure 1 illustrates the distorted BP (top, right) and it’s
place within a larger tiled SBP (bottom, left). The VolumePro delivers
a 512 x 512 BP containing RGBA pixels into memory, within which exists
the hexagonal sheared image of a 256 3 subvolume. Software translates
this BP by a view dependent vector within a larger tiled 1024 x 1024
SBP and also cleans up the BP by removing certain artifacts. This
view dependent vector takes one of twenty-four possible values which
represent all possible viewing orientations. The viewing orientations
are modeled with respect to six cube faces, and four possible planar
orientations (left, right, down, up) with respect to each face, resulting
in twenty-four combinations (six faces times four orientations).
Figure
2. Demonstration cluster configuration, 8 rendering servers plus
1 display server. Every rendering server contains a VolumePro
ray casting engine and a Sepia-2 card. BPs are delivered at 30
frames per second by the VolumePro card. These are converted in
host memory to SBPs and then acquired by the Sepia-2 card. The
pipelined result of blending a series of SBPs is delivered by another
Sepia-2 card into the host memory of the display server. From here
the final SBP is displayed by an OpenGL accelerator.
Our demonstration cluster is depicted schematically in Figure 2
and photographically in Figure 4. We used off-the-shelf graphics
workstations (Compaq SP750) containing Intel i840 motherboard chipsets
and Pentium-III Xeon CPUs. The i840 supports dual PCI busses and
multiple banks of Rambus memory with peak aggregate bandwidth of
3.2 GB/s, in addition to an AGP-4x graphics interface. The VolumePro
occupies a slot in a PCI bus (which may be either 33 or 66 MHz and
32 or 64 bits). To minimize interference the Sepia-2 card is in a
66 MHz 64 bit PCI bus, and the VolumePro is in a separate 33 MHz
32 bit bus.
Once the SBP is prepared software calculates the position of the
rendering server within the pipelined blending sequence using a left-right,
top-bottom, front-back ordering. This calculation is performed locally
based on the current view and information about volume and subvolume
partitioning. This gives the pipeline ordering, which is converted
into a (statically defined) network address of a Sepia-2 card. This
conversion from pipeline position to network address implements the
dynamic pipeline remapping required by the blending operators. Software
then initiates the SBP blending by the Sepia-2 card. Blending is
pipelined, with the first pixels of the final blended result written
to display server memory before the last pixels of the local unblended
SBP have been acquired from the host memory on the rendering server.
The final blended result is an SBP that is delivered into the memory
of a display server where it is warped and displayed by an OpenGL
accelerator (Figure 2).
In order to render at 30 frames per second it is necessary to overlap
the VolumePro BP generation with the subsequent steps of SBP construction,
blending, and display (Figure 3). Each of these processing stages
must be no longer than 33 ms in order to sustain a continuous stream
of BPs coming from the VolumePro. The result is up to four frames
of interactive latency, from the time a viewpoint change is communicated
to the VolumePros, until the time the result is seen on the display.
Figure 3 illustrates the flow of data through the system on each
frame and the associated stages of processing. The VolumePro writes
a new BP to host memory 30 times per second. A BP consists of one
megabye of data (512 x 512 x RGBA). This stage is dominated by the
computational time through the VolumePro and may require a full 1/30
second. When the BP is available it is converted in memory to an
SBP, blended by the Sepia-2 cards, and delivered to the display server.
These four stages occur sequentially with successive frames overlapped
in time as shown in Figure 3 (right). The latency of the display
stage can vary depending on AGP performance and monitor refresh rate,
from a few milliseconds to a full frame.
Figure
3. Data flow and latency analysis for a single rendering server
and single display server. The analysis for larger configurations
is based on this simplified model. Left, the VolumePro writes a
BP to host memory where it is converted by software to an SBP.
The Sepia-2 card reads the SBP into the network where it is blended
in transit. At the display server a Sepia-2 card writes the blended
result SBP to host memory. Host software on the display server
loads the SBP into GL texture memory for final warp and display.
Right, a timing figure shows the overlapped stages of image generation,
SBP construction, acquisition and blending, and final display.
The interactive latency is dominated by the time through these
overlapped stages regardless of node count, and is approximately
four frames prior to optimizations.
5. Results
At the time this paper is written our demonstration cluster demonstrates
30 frames per second operation but with errors. The cluster runs
without errors at 20 frames per second. We have identified the sources
of the errors in the application software that converts the BP into
the SBP. We are currently rewriting the critical software stages
and believe that our final paper will report error-free interactive
operation at 30 frames per second. We have not fully optimized the
display server software, nor taken other steps that we could to reduce
interactive latency, since the current latency is tolerable.
Figure 4 shows a series of images that were generated on the cluster
at 512 3 voxel resolution, using 8 subvolumes of 256 3 voxels each.
Upgrading the VolumePro cards will increase the capacity of this
cluster to 1024 3 voxels.
6. Scalability
We have identified the steps required to scale this configuration
both in object space and in image space. In object space the size
of data volumes being visualized may be increased arbitrarily (in
principle). In image space the size of the tiled image may be increased
until it exceeds the pixel resolution of an individual display, leading
to display partitioning. Both types of scaling can be supported by
software-only methods, and also by firmware or hardware upgrades
that increase the system performance.
In addition to scaling the cluster configuration the VolumePro
ray casting engines are themselves increasing in performance, and
both object and image space may be scaled just by upgrading these
cards. Our demonstration cluster has been configured with first generation
cards that support subvolumes of 256 3 each and render an aggregate
volume of 512 3 voxels using eight servers. A second generation card
that supports 512 3 subvolumes would render a gigavoxel using eight
servers.
Object space scaling involves increasing the number of ray casting
engines, and correspondingly, the size of the aggregate volume being
visualized. The Sepia-2 architecture has been designed to maintain
interactivity at scales of 1024 servers and more [hei99]. As a result
increases in server count should only slightly increase the latency
of the blending stage.
It is easy to scale in object space if the image size does not
increase. The image size is determined by the software that converts
each BP to an SBP. This software may rescale the BP into a smaller
size in order to keep the SBP small enough for a single display.
This rescaling operation would increase the time spent in the software
conversion stage, and this would manifest an increased interactive
latency. But it would allow arbitrary object space scaling for a
fixed pixel resolution image.
The more interesting case involves simultaneous scaling in both
object space and image space, growing the size of the aggregate data
volume while simultaneously growing the pixel resolution of the displayed
image. This introduces a new consideration because SBPs are no longer
sent to a common display server. If images are mapped to arbitrary
positions on a tiled display, images will often cross tile boundaries.
If we assume that all tiles are equal sized and rectilinear then
no individual image can ever appear on more than four tiles. This
produces a requirement that the SBP must be segmentable in up to
four contiguous portions with individual routing for each portion.
The most efficient way to support this would be for the Sepia-2
card to acquire the BP in up to four sections and merge these sections
directly into their correct position in the SBPs as they flow through
the network. This would require maintaining state for four different
SBP streams, which might be distinguished by the use of different
network aliases. Because of the near-stateless nature of our pixel
operators it may be possible to support this very efficiently. An
easier but slightly less efficient approach would be for software
to partition the BP into four parts, move these four parts to separate
SBPs in memory, then direct the Sepia-2 to acquire only the necessary
portions of each of the four SBPs. The least efficient but easiest
approach would be for software to replicate the full SBP four times
and acquire all four copies. This increases the bandwidth requirements
fourfold and would result in a significant drop in frame rate in
our current hardware.
7. Discussion
From a scientific standpoint we are on course to provide a gigavoxel
visualization system in a small footprint. In principle these results
should be representative of a variety of structured and unstructured
volume visualization techniques that make use of GL acceleration.
Our larger scientific goals depend on supporting OpenGL in Linux
environments, and this requires efficient DVI image acquisition.
We believe our PCI design will make it easy to synchronize the DVI
acquisition intervals with the software operations required for auxiliary
(e.g. depth) buffer acquisition. We have conducted distributed programming
experiments using glReadPixels() acquisition and this is clearly
not adequate for a production system. We have seen the best Linux
pixel path performance using Matrox G400Max accelerators with 32
MB frame buffers. In these cards the readback performance is over
300 MB/s and writeback (glDrawPixels) over 900 MB/s for color, depth,
alpha, and stencil buffers. We continue to search for an ideal graphics
engine, which would be a Linux-based OpenGL accelerator with full
double buffering and binary frame buffer access through DVI.
From an architectural standpoint we continue to gain fundamental
insights as we address application exercises and observe other projects.
We have designed an approach to supporting general Xinerama-like
capabilities to allow panning and clipping in tiled displays. This
approach requires different physical packaging so that the pixel
operations can be applied to two image streams in a display server
rather than in a rendering server. The topology and capabilities
of these operators are similar to the scalable mesh used in the Metabuffer
and Lightning-2.
The most unexpected insight we have gained has been an appreciation
for the value of planar operations on image fragments. We experimented
with implementing a limited set of such operations in Sepia-1.
Examples included various forms of image tiling, fragment displacement,
and image reassembly. The need for such operations has appeared in
many guises, for example in the Xinerama-like capabilities described
above. In the application presented in this paper they manifest twice,
in the initial translation of the BP within the SBP, and in partitioning
the SBP for multi-panel displays. We suspect that fragment operations
represent a fundamental requirement for Sepia and related architectures.
In Sepia this requirements might be addressed by an image prefix
string that specifies a scissoring operation and routing for the
scissored fragments. These fragments would then be processed by planar
pixel operators, and could be routed across tile boundaries and later
reassembled. Note that although Sepia can perform planar translations
the pipelined nature of the architecture does not allow rotations.
We are currently studying multi-scale issues. There are a number
of performance enhancements we could pursue if we felt the effort
were justified. Increasing the effective rate of SBP acquisition
to 180 MB/s would allow the blending computation to occur at full
network rates and thus reduce latency (see Figure 1, bottom right).
The most efficient way to do this is for the Sepia-2 card to acquire
the BP from memory and translate it appropriately into the SBP that
is flowing through the network. This makes no change in the effective
frame rate, because the full SBP still travels through the network.
But it allows each BP to be acquired faster, or to be acquired as
four segments (for multi-panel display) without increasing network
bandwidth requirements. We believe the interactive latency could
be cut approximately in half with limited effort.
8. Acknowledgements
We have had helpful discussions about VolumePro with Hugh Lauer,
Andy Vesper, and Larry Seiler of RTViz. This work was supported by
National Science Foundation grant #ACI-9982273, by the U.S. Department
of Energy’s ASCI Center for Simulation of Dynamic Response
of Materials, by Caltech’s Center for Advanced Computing Research,
and by Compaq Computer Corporation’s Tandem Laboratories and
Systems Research Center.
Bibliography
[pfi99] H. Pfister, J. Hardenbergh, J. Knittel, H. Lauer and L.
Seiler. The VolumePro real-time ray-casting system. Proceedings of
ACM SIGGRAPH (1999).
[lac94] P. Lacroute and M. Levoy. Fast volume rendering using a
shear-warp factorization of the viewing transform. Proceedings of
ACM SIGGRAPH (1994), pp. 451-457.
[hei99] A. Heirich and L. Moll. Scalable distributed visualization
using off-the-shelf components. Proceedings of the IEEE symposium
on parallel visualization and graphics (PVG’99) (1999) pp.
55-59.
[por84] T. Porter and T. Duff. Compositing digital images. Proceedings
of ACM SIGGRAPH (1984).
[sle90] D. Slepian and A. M. Duguid, in Joseph Y. Hui, Switching
and traffic theory for integrated broadband networks, pp. 53-84,
Kluwer Academic Publishers (1990).
[bla00] Blanke, W. J., Bajaj, C., Fussel, D. and Zhang, X. The
metabuffer: a scalable multiresolution multidisplay 3D graphics system
using commodity rendering engines. Technical report TR2000-16, University
of Texas at Austin, February 2000.
[sto01] Stoll, G. et al. Lightning-2, a high performance display
subsystem for PC clusters. Proceedings of ACM SIGGRAPH (2001).
[hum00] Humphreys, G., Buck, I., Eldridge, M. and Hanrahan, P. Distributed
rendering for scalable displays. Proceedings of Supercomputing (2000).
[mol94] Molnar, S., Cox, M., Ellsworth, D. and Fuchs, H. A sorting
classification of parallel rendering. IEEE Computer Graphics and
Applications 14 (1994).
[clo53] Clos, C. A study of non-blocking switching networks. Bell
System Technical Journal 32, pp. 406-424 (1953).
[eld00] Eldridge, M., Igehy, H. and Hanrahan, P. Pomegranate: a
fully scalable graphics architecture. Proceedings of ACM SIGGRAPH
(2000).
[sam00] Samanta, R., et al. Hybrid sort-first and sort-last parallel
rendering with a cluster of PCs. Proceedings of SIGGRAPH/Eurographics
workshop on graphics hardware (2000).
Image gallery
Figure 4. Rendered images and photographs
of the hardware. Top, some representative images,
some of which may be at lower resolution. These will be
expanded and replaced with high resolution images in the
final paper. Bottom left, the rendering cluster consists
of eight Compaq SP750 graphics workstations with Pentium-III
Xeon CPUs configured as shown in Figure 2. A ninth workstation
functions as the display server. Bottom right, the Sepia-2
card contains two Xilinx XC4044XLA packages that support
the host and network PCI interfaces, an XC4085XLA for the
merge operator logic, memory (SRAM and DRAM), and a ServerNet-II
network interface ASIC (Colorado-2). Three connectors allow
future expansion for I/O daughtercards that could be used
to support DVI.
|