Research Gems provide an opportunity for researchers at SC2000 to present insights into the solutions of specific research problems in high performance networking and computing. This category of participation replaces the traditional Poster Exhibits and will be prominently placed in the Exhibition Hall.

This year, the Research Gems will have special Open Houses on Wednesday, Nov. 8, and Thursday, Nov. 9, from 10-11am. Don't miss this chance to discuss the posters with the authors.

Special Awards

A $250 award will be granted for the submission selected by the review committee as "Best Research Gem of the Conference". This will be announced at the SC2000 Awards Session.



Adding OpenMP to an Existing MPI Code: Will it be Beneficial?
Joseph D. Blahovec, Keith L. Cartwright, Air Force Research Lab

ICEPIC (Improved Concurrent Electromagnetic Particle In Cell), developed at the Air Force Research Laboratory, is a 3-D particle-in-cell code specifically designed for parallel high performance computing (HPC) resources. ICEPIC simulates collision-less plasma physics phenomena on a Cartesian grid. ICEPIC has several novel features that allow efficient use of parallel architectures. It is written in ANSI C and uses the MPI message passing standard to provide portability to a variety of HPC systems. Currently, we are determining whether adding OpenMP to the existing MPI code will improve our performance on the new IBM SMP.

Advantages of Multi-block Curvilinear Grid and Dual-Level Parallelism in Ocean Circulation Modeling
Phu V. Luong, University of Texas, Austin, ERDC MSRC, Clay P. Breshears, Rice University, ERDC MSRC, Andy Haas, LOGICON, NAVO MSRC

A multi-block curvilinear grid technique is used in ocean modeling to eliminate problems inherent in the traditional one-block rectangular structured grid. A dual-level parallel technique is included to improve the performance of the Multi-block Grid Princeton Ocean Model (MGPOM) ocean circulation model. This technique involves the use of Message Passing Interface (MPI) in which each grid block is assigned to a unique processor. Since not all grid blocks are of the same size, the workload varies between MPI processes. To alleviate this, OpenMP dynamic threading is used to improve load balance.

Application-level Implementation of Asynchronous Methods in a CORBA-based Distributed Object Database
Milorad Tosic, Helen Berman, John Westbrook, Rutgers, The State University of New Jersey

In this poster we consider a concurrency model for method calls in a CORBA-based implementation of a distributed object database for the topology based sub-similarity search of chemical structures. Each object in such databases typically has both computationally intensive and light-weight methods. Applying the concurrency model at the ORB/BOA level in OMG CORBA enforces the same model for all of the methods of an object. We propose a wrapper class for the C++ multi-threading library that enables a concurrency model choice at the method level. This abstraction is developed in a design pattern style for managing threading issues in any application.

Applications of Parallel Process HiMAP for Large-scale Multidisciplinary Problems
Guru Guruswamy, NAS Division, Ames Research Center, David Rodriguez, Mark Postdam, Eloret Inc, Ames Research Center

A modular process to simulate coupled multi-physics interactions of flexible vehicles using high-fidelity equations is developed. The process is designed to execute on massively parallel processors (MPP). Computations of each discipline are spread across processors using the standard Message Passing Interface (MPI) for interprocessor communication. Disciplines can run in parallel using MPIRUN, a middleware layer developed on top of MPI and C++. In addition to parallelization within each discipline and coarse-grain parallelization across disciplines, an embarrassingly parallel capability to run multiple parameter cases is implemented using a script system. The poster will show the development and applications of the process. Results that highlight load balancing and portability issues will be shown.

APPMAP: A New Way to Predict Application Performance in Minutes
John Gustafson, Sun Microsystems, Rajat Todi, Don Heller, Ames Laboratory/Iowa State University

Our research proposes a convolution technique using hardware and software signatures to accurately predict an application's performance on a system in a relatively short time. Hardware signatures are obtained by actually running the broad-spectrum benchmark HINT, or are derived from machine characteristics using analytical HINT (AHINT). Application signatures are extracted from the application, treated as a black box, using hardware performance counters and dynamic tracing. In this poster we will present the model and show predicted results for the SPEC '95 benchmarks. We will validate the model using statistical analyses such as correlation, rank correlation, and linear prediction.

Automatic TCP Socket Buffer Tuning
Jian Liu, Jim Ferguson, National Laboratory for Applied Network Research, National Center for Supercomputing Applications

Relying on the static system socket buffer size for TCP data transfers seriously limits the efficient use of high-speed network bandwidth and system memory. The most popular application run across high-speed networks is FTP, whose performance is directly affected by the TCP window size. We propose an Automatic Buffer Tuning method at the application level as an attempt to solve this problem. An Automatic Buffer Tuning-enabled FTP client demonstrates that dynamically adjusting the TCP socket buffer size at TCP connection setup time yields enhanced FTP application performance, improved utilization of network bandwidth, and flexible reservation of system memory, all without kernel modification.

Core AS (Autonomous System) Internet Topology
Bradley Huffaker, Theresa Ott Boisseau, CAIDA/SDSC/UCSD

When the Internet was in its infancy, monitoring traffic was relatively simple. However, after experiencing phenomenal growth in the 1990s, tracking connectivity has become a daunting task. Recently, CAIDA researchers have attempted to strip away lesser connected autonomous systems (or ASes) in order to find out how Internet connectivity is distributed among ISPs. This graph, showing peering richness and geographic information, clearly reveals the highly "core connected" topology of ASes based in North America. All except one of the top 15 ASes are based in the U.S., and there are few links directly between ISPs in Asia and Europe.

Dynamic Load Balancing Techniques for Improving Adaptive Mesh Refinement
Zhiling Lan, Valerie Taylor, Northwestern University, Gregory Bryan, MIT

ENZO is one of the successful implementations of AMR (Adaptive Mesh Refinement) on parallel and distributed systems. One of the key issues related to this implementation is dynamic load balancing which allows large-scale and adaptive cosmology applications to run efficiently on parallel systems. We present a new technique for dynamic load balancing of AMR applications. We describe our grid-splitting technique and present some results illustrating the performance gains for the ENZO applications executed on parallel systems. Several metrics for measuring the quality of dynamic load balancing will be used.

Efficient Process Migration Mechanisms for a Heterogeneous Distributed Computing Environment
Kasidit Chanchio, Xian-He Sun, Illinois Institute of Technology

This poster presents a solution for process migration in a heterogeneous distributed computing environment. Our approach solves three fundamental problems in transferring the execution, memory, and communication states of a process. To migrate the execution state, we propose a technique using source code annotation. To migrate the memory state, we propose a graphical model and its associated runtime system. Finally, we propose process migration and communication protocols to migrate the communication state. We have tested our migration mechanisms on several sequential and parallel programs. Experimental results demonstrate the correctness and practicality of our approach.

Enabling Technologies for High-Performance Computing Portals: The PUNCH Approach
Nirav H. Kapadia, Sumalatha Adabala, Renato J. Figueiredo, Purdue University, Dolors Royo, UPC, Spain, Jose Miguel-Alonso, UPV, Spain, Jose A. B. Fortes, Mark S. Lundstrom, Purdue University

Network-centric computing promises to revolutionize the way in which computing services are delivered to the end-user. Analogous to the power grids that distribute electricity today, computational grids will distribute and deliver computing services to users anytime, anywhere. PUNCH, the Purdue University Network Computing Hub, is a network-computing system that allows seamless management of applications, data, and machines spread across administrative domains and wide-area networks. Users can access any application from anywhere via standard Web browsers. PUNCH already provides computing services to 800 users across ten countries. The poster will focus on the core technologies that make up the PUNCH infrastructure.

Extending Quality of Service on the Grid: Gara and PBS
Thomas Milford, Jennifer M. Schopf, Northwestern University

Gara is the Quality of Service component of the Globus Toolkit. Currently, it provides a means for making reservations for Globus resources, including disk bandwidth, CPU time, and fork jobs. We have extended Gara to communicate with PBS, thus taking advantage of the PBS reservation features. By managing the reservations in Gara, we allow users to define callback procedures (e.g., invoked when a reservation is about to go "live"). We also gain a central point where we can track all of the reservation status information.

Extending the Portable Batch System with Preemptive Job Scheduling
Gabriel Mateescu, National Research Council, Canada

The Portable Batch System (PBS) provides robust and adaptable job and resource management. Unfortunately, PBS does not support job preemption. In environments with multiple user groups that have dynamic resource requirements, the lack of preemption causes throughput and resource utilization penalties. We propose a preemptive scheduling scheme that avoids these pitfalls. We employ a preemption manager, along with PBS hooks for stopping and resuming jobs. Based on resource requirements, system utilization, and scheduling policies, the manager determines when and what jobs to preempt, then instructs PBS to preempt the jobs. Preempted jobs become eligible for resumption after the requests that caused preemption are satisfied.

General Portable SHMEM Library for High Performance Computing
Krzysztof Parzyszek, Ames Laboratory, Jarek Nieplocha, Pacific Northwest National Laboratory, Ricky A. Kendall, Ames Laboratory

GPSHMEM is a communication library that follows the interface of the original Cray SHMEM library, but attempts to achieve full portability. It is implemented on top of the Aggregate Remote Memory Copy Interface (ARMCI) and a message-passing library (currently MPI). The functionality of GPSHMEM covers most of the SHMEM functionality available on the Cray T3D. The Generalized Portable SHMEM library will benefit CRI T3D/T3E users by providing them a mechanism to use SHMEM on different parallel supercomputers, essentially any system with an ARMCI port. Tests show that the GPSHMEM programming model can be as efficient as the underlying ARMCI memory access implementation.

S. R. Melody, Jennifer M. Schopf, Northwestern University

This poster presents our work in developing a cross-platform tool for resource discovery on the Grid. It expedites the process of searching for a machine and makes the Grid more accessible to application scientists. It attempts to provide a quick, simple, and powerful way to find what you are looking for. To achieve this goal while maintaining as much portability as possible, the software package has two modules: the first searches an MDS tree to create an XML/DOM tree, and the second builds a table and generates an easily searchable representation.

HDF5: High Performance Science Data Solution for the New Millennium
Albert Cheng, Michael Folk, NCSA/University of Illinois

The Hierarchical Data Format (HDF) has been used for scientific data management since 1988. NCSA, in collaboration with NASA, the ASCI project, and others, has developed a new format and library, HDF5, that addresses the needs of vastly expanded computational and storage systems. I/O performance, especially parallel I/O, plays a critical role in the HDF5 project. HDF has also incorporated the Grid Forum access features into the HDF5 library for remote resource access in the Virtual Machine Room environment. An XML-DTD specification of the HDF5 format has been created to support exchange of HDF5 data for Web-based applications.

Model-Based Integration of Heterogeneous Neuroscience Data Sources
Bertram Ludaescher, Amarnath Gupta, San Diego Supercomputer Center, UCSD, Maryann E. Martone, UCSD, Ilya Zaslavsky, San Diego Supercomputer Center, UCSD

We present a novel wrapper-mediator approach called Model-Based Mediation in which views are defined and executed at the level of conceptual models rather than at the structural (XML) level. Another novel feature of our architecture is the use of Domain Maps, a kind of semantic net used to mediate across sources from "multiple worlds," in our case different neuroscience data sources. As part of registering a source's conceptual model with the mediator, the wrapper creates a "semantic index" of its data into the Domain Map. A prototype establishing the viability of the approach is operational. Our knowledge-guided mediation is used within the "Federation of Brain Data" project under NPACI.

Numerical Reproducibility on Distributed Platforms: An Accurate Arithmetics Approach
Yun (Helen) He, Chris H.Q. Ding, Lawrence Berkeley National Laboratory

Numerical reproducibility of large-scale scientific simulations, especially climate modeling, on distributed-memory parallel computers is becoming a critical issue. In particular, global summations and dot products of distributed arrays are very susceptible to rounding errors. We analyzed several accurate summation methods and found that two are particularly effective at improving (indeed, ensuring) reproducibility: Kahan's self-compensated summation and Bailey's double-double precision summation. We provide an MPI operator, MPI_SUMDD, that works with MPI collective operations to ensure a scalable implementation on large numbers of processors. The final methods are particularly simple to adopt in practical codes.

Parallel Programming in Java with OpenMP-like Directives
J. Mark Bull, Edinburgh Parallel Computing Centre, Mark E. Kambites, University of York, Jan Obdrzalek, Masaryk University, Brno

We present a specification of directives, library routines, and system properties (in the spirit of OpenMP) for shared memory parallel programming in Java. A prototype reference implementation consisting of a compiler and a runtime library is described. The compiler translates Java source code with directives to Java source code with calls to the runtime library, which in turn uses Java threads to implement parallelism. The whole system is pure Java, and can be run on any Java virtual machine. The performance of the system is compared to hand-coded Java threads and to commercial C and Fortran OpenMP implementations.

Policy Specification and Restricted Delegation in Globus Proxies
Babu Sundaram, University of Houston, Christopher Nebergall, Western Illinois University, Steven Tuecke, Argonne National Laboratory

Grid security becomes complicated when users need a single log-on to distributed resources that have heterogeneous local policies. Policy languages with rich features and functionality are needed to specify security policies in proxies. This approach is examined for the Globus toolkit, and facilities for site administrators to specify local policies are also considered. Classified Advertisements, from the University of Wisconsin Condor project, were used to specify and evaluate policies as attributes, and the appropriate attributes have been identified. The authentication and authorization processes were successfully implemented by modifying the Globus proxy initiation, Gatekeeper, and Job-manager. This provides fine-grained control and better protection against stolen proxies.

Rapid Design Realization of 3D Woven Composites by HPC
Seung Jo Kim, Chang Sung Lee, Heon Shin, Seoul National University, Jeong Ho Kim, KORDIC Supercomputing Application Lab

A design realization of 3D woven composites is tackled by full-scale FEA (not by the limited "unit cell" approach) through High Performance Computing (HPC) techniques. An accurate finite element model, which includes warp yarns, filler yarns, stuffer yarns, and resin regions and captures their complex geometrical characteristics, is first prepared. An efficient direct solver, a parallel multi-frontal solver, and a parallel explicit dynamic solver with a central difference scheme are used. An optimal design pattern of 3D orthogonal woven composites, which has the required stiffness in the in-plane directions and good through-the-thickness strength without local stress concentrations, can be achieved.

A Storage Broker for the Globus Environment - A ClassAd Based Implementation
Sudharshan S. Vazhkudai, The University of Mississippi, Steven Tuecke, Argonne National Laboratory

An increasing number of scientific applications, ranging from high-energy physics to computational genomics, require access to large amounts of data (on the order of terabytes and petabytes) with varied quality-of-service requirements. This diverse demand has contributed to the proliferation of storage system capabilities, making storage devices an integral part of the Grid environment. To access and utilize storage systems efficiently in the Globus Grid environment, we need an efficient architecture (a Storage Broker) for querying and selecting storage systems based on application requirements, determining their properties, and exposing them to interested Grid services and high-end applications. Our work illustrates the architecture of such a storage broker for the Globus Grid environment.

STWAVE: A Case Study in Dual Level Parallelism
Rebecca A. Fahey, Engineer Research and Development Center, Major Shared Resource Center

This case study explores a dual-level parallel implementation of the Steady-State Spectral Wave Model (STWAVE) code, a near-shore, wind-wave, growth and propagation simulation program with two natural levels of parallelism. The embarrassingly parallel calculations for processing multiple wave runs are distributed via MPI, resulting in a linearly scalable code. However, the achievable speedup is limited by the number of wave runs. To obtain additional speedup, loop-level parallelism is exploited with OpenMP. For STWAVE, the speedup attained with a dual-level approach surpasses the speedup possible with either method alone.

Two Levels of Parallelism for Models of Tracer Particle Transport
Vickie E. Lynch, Benjamin A. Carreras, Nathaniel D. Sizemore, Oak Ridge National Laboratory

Avalanche transport may be a dominant transport mechanism in magnetically confined plasmas. The models used range from a sandpile model to 3-D turbulence models. To understand what happens to transport, we evolve test particles simultaneously with the evolution of the macroscopic field. These models require long time evolutions and large grids to resolve the different scales. With the IBM SP capable of two levels of parallelism, we want to use the shared-memory processors to compute a partition of the grid using OpenMP and then run different batches of particles on the distributed-memory nodes using MPI.

Utilizing Idle Workstations in a Scheduled Parallel Computing System
Scott Hansen, Quinn Snell, Mark Clement, Brigham Young University

Increasing demand for parallel computational resources has motivated many researchers to use idle desktop systems in an attempt to increase computational power with minimal cost increase. This presentation describes the BYU Resource Manager (YRM) that, in conjunction with the Maui Scheduler, supports non-dedicated resources. We show that by effectively handling the complexities of non-dedicated resources, a resource management system can increase job throughput and improve job turnaround time. We also show that the YRM provides a resource management framework necessary for dynamic process management in the next generation of parallel programming libraries such as MPI-2.

VG Cluster: Large Scale Visual Computing System for Volumetric Simulations
Shigeru Muraki, Kazuro Shimokawa, Electrotechnical Laboratory, Masato Ogata, Kagenori Kajihara, Mitsubishi Precision Co., LTD., Kwan-Liu Ma, University of California-Davis, Yutaka Ishikawa, Real World Computing Partnership

This research aims at the development of a parallel visual computing system for large-scale scientific simulations. Our initial design, named the VG cluster, is based on a Linux PC cluster with a high-speed switch, specially tuned system software, and a distributed volume-rendering technique. We subdivide a large simulation space into sub-volumes of similar size and distribute the simulation and visualization computations to individual PCs. A 3-D neuron excitement simulation using confocal microscope data from real neurons will be used for our study on the VG cluster. Test results and our future plans for implementing the rendering process in hardware will be presented.

Visual MD: An Innovative Approach to Molecular Simulation
Jennifer Hare, Betsy Rice, U.S. Army Research Laboratory, Jerry Clarke, Raytheon Systems Company, Margaret Hurley, CCM PET ARL MSRC/OSC, William Mattson, University of Illinois, Urbana Champaign

Computational molecular modeling is recognized as an efficient step in modern development programs. We are developing an integrated suite of user-friendly, highly scalable molecular simulation codes to study reactive and non-reactive processes in solid materials. The software is designed with flexibility, portability, and ease of use in mind, to meet ever-changing user needs and changes in computational platforms. These codes allow both the non-expert in molecular simulation to exercise them and an expert to customize them. Unlike monolithic solutions, this integrated suite of software tools includes a GUI, scientific visualization, and the ability to add custom functionality.

Visualization of Events from the Relativistic Heavy Ion Collider
Michael D. McGuigan, Brookhaven National Lab, Stephen Murtagh, Carnegie-Mellon University

This Research Gem discusses a high-performance visual display of events from the Relativistic Heavy Ion Collider at Brookhaven National Lab, used to search for a new state of matter not seen since the big bang. A comparison is made of several hardware and software combinations for representing the events. Images and animations from an OpenInventor application displaying the collisions and detector responses are presented. Conclusions are drawn by comparing fitted track data directly with the visualizations.

Visualization of Extra Dimensions from String Theory
Michael D. McGuigan, Brookhaven National Lab, Stephen Murtagh, Carnegie-Mellon University

We address the use of advanced visualization techniques (implicit functions, algebraic geometry, and topology) for high-performance visualizations of the internal dimensions predicted to exist by String Theory, the leading candidate for a theory beyond the Standard Model of particle physics. One complication is that the theory is consistent only in ten dimensions. Since we live in four dimensions, three of space and one of time, the extra six dimensions are assumed to be curled up in a space too small to be seen directly. In this research we magnify the internal dimensions by a factor of 10^34 so they can be analyzed visually using a powerful visualization computer.

Workload Characterization and Similarity Analysis of SPECcpu95 Benchmarks
Abdullah I. Almojel, Ali S. AlSwayan, King Fahd University of Petroleum and Minerals

Quantifiable methods can help establish the basis for the scientific design of parallel computers around application needs, optimizing performance relative to cost. This research work is based on the Vector Space model of workload representation and similarity. It is shown that this model can effectively represent real-life workloads. The model also lends itself to many practical applications, including the design of effective benchmark suites, experimental parallel computer design, and performance prediction of real-life applications on real machines. This research work investigates the architectural characteristics of, and similarity between, the SPECcpu95 benchmarks in order to remove redundant workloads (those that exercise the machine similarly) from the benchmark suite.