Ian Foster and Robert Gardner, PIs
Goals: The GriPhyN (Grid Physics Network)
project brings together an outstanding team of
information technology (IT) researchers and
experimental physicists to provide the IT
advances required to enable Petabyte-scale
data-intensive science in the 21st century. Driving the
project are unprecedented requirements for
geographically dispersed extraction of complex
scientific information from very large collections
of measured data. To meet these requirements,
which arise initially from the four physics
experiments involved in this project but will also
be fundamental to science and commerce in the
21st century, the GriPhyN team will pursue IT
advances centered on the creation of Petascale
Virtual Data Grids (PVDG) that meet the data-intensive
computational needs of a diverse
community of thousands of scientists spread
across the globe. The iVDGL (international
Virtual Data Grid Laboratory) project is tasked with
establishing and utilizing a Grid laboratory of unprecedented
scale and scope, comprising heterogeneous
computing and storage resources in the U.S.,
Europe and ultimately other regions linked by
high-speed networks, and operated as a single
system for the purposes of interdisciplinary
experimentation in Grid-enabled, data-intensive
scientific computing.
Our goal in establishing this laboratory is to drive
the development, and transition to everyday
production use, of Petabyte-scale virtual data
applications required by frontier computationally
oriented science. In so doing, we seize the
opportunity presented by a convergence of rapid
advances in networking, information technology,
Data Grid software tools, and application
sciences, as well as substantial investments in
data-intensive science now underway in the
U.S., Europe, and Asia.
Significance: The data analysis for these physics experiments presents enormous IT challenges. Communities of thousands of scientists, distributed globally and served by networks of varying bandwidths, need to extract small signals from enormous backgrounds via computationally demanding analyses of datasets that will grow from the 100 Terabyte to the 100 Petabyte scale over the next decade. The computing and storage resources required will be distributed, for both technical and strategic reasons, across national centers, regional centers, university computing centers, and individual desktops. The scale of this task far outpaces our current ability to manage and process data in a distributed environment and requires fundamental advances in many areas of computer science.
Accomplishments: To meet these challenges, GriPhyN and iVDGL will pursue an aggressive program of fundamental IT research focused on realizing the concept of Virtual Data. Virtual Data encompasses the definition and delivery to a large community of a (potentially unlimited) virtual space of data products derived from experimental data. In this virtual data space, requests can be satisfied via direct access and/or computation, with local and global resource management, policy, and security constraints determining the strategy used. Realizing the Virtual Data concept requires IT advances in three major areas, which GriPhyN will target: virtual data technologies; policy-driven request planning and scheduling of networked data and computational resources; and management of transactions and task execution across national-scale and worldwide virtual organizations.
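The idea that a request may be satisfied either by direct access or by computation can be illustrated with a small sketch. This is not GriPhyN software: the names VirtualDataCatalog, Derivation, request, and allow_compute are hypothetical, and the single allow_compute flag is only a stand-in for the much richer resource management, policy, and security constraints described above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Derivation:
    """Hypothetical recipe: how to compute one data product from named inputs."""
    inputs: Tuple[str, ...]
    transform: Callable[..., object]


@dataclass
class VirtualDataCatalog:
    """Toy catalog in which a product is either stored or derivable on demand."""
    stored: Dict[str, object] = field(default_factory=dict)
    recipes: Dict[str, Derivation] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def request(self, name: str, allow_compute: bool = True) -> object:
        # Strategy 1: direct access to an existing copy.
        if name in self.stored:
            self.log.append(f"fetch {name}")
            return self.stored[name]
        # Strategy 2: computation, if a recipe exists and policy permits it.
        if name not in self.recipes:
            raise KeyError(f"no stored copy or recipe for {name!r}")
        if not allow_compute:
            raise PermissionError(f"policy forbids computing {name!r}")
        recipe = self.recipes[name]
        # Inputs are themselves requests, so derivations can chain.
        args = [self.request(dep, allow_compute) for dep in recipe.inputs]
        result = recipe.transform(*args)
        # Keep the result so later requests become direct accesses.
        self.stored[name] = result
        self.log.append(f"compute {name}")
        return result


# Usage: only the raw events exist at first; derived products are "virtual".
catalog = VirtualDataCatalog(stored={"raw_events": [3, 8, 1, 9, 4]})
catalog.recipes["selected"] = Derivation(
    ("raw_events",), lambda events: [e for e in events if e > 3]
)
catalog.recipes["count"] = Derivation(("selected",), len)

first = catalog.request("count")   # derived through two computation steps
second = catalog.request("count")  # served from the stored copy
print(first, second, catalog.log)
```

In this sketch the first request triggers a chain of computations and the second is a plain fetch, which is the access-versus-computation choice in its simplest form. A real planner would also weigh data location, network bandwidth, and the cost of recomputation against the cost of transfer.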