Data

The SIMDAT Data infrastructure consists of a set of carefully chosen, interoperable components that provide the wide range of data-related services required by the SIMDAT application scenarios:

transparently access data that resides in remote file-based repositories or databases across the Grid;
efficiently transmit large amounts of data between different nodes on the Grid, be it for transmitting files or for synchronizing between data repositories;
effectively manage the storage, replication and synchronization of data at local and remote Grid nodes and support indirect access to data based on (meta)data catalogues;
handle the semantic mediation between different data models and replications, using dynamic techniques based on ontologies.

Following the general SIMDAT philosophy, the Data infrastructure is based on using and extending existing third-party components wherever possible. SIMDAT focuses on hardening these often academic software components, ensuring that they interoperate with each other and integrate with the GRIA-based SIMDAT Grid infrastructure, and extending functionality or improving performance where the application scenarios require it.

The central elements of the Data infrastructure are based on the OGSA-DAI package developed and owned by the University of Edinburgh. OGSA-DAI provides the framework for accessing file repositories and databases through a Web Service interface regardless of their location. Since the beginning of SIMDAT, OGSA-DAI has evolved significantly: it has added industrial-strength, fine-grained access control as suggested by SIMDAT, greatly improved performance in real-world application scenarios, and made it much easier to build higher-level services on top of the basic file access technology by writing OGSA-DAI activities. Today, SIMDAT uses OGSA-DAI as a common interface to local data repositories (for instance, abstracting the large variety of archive systems used by weather centers in the Meteorology application area), and as the standard interface for accessing and manipulating data across a Grid (in the Automotive and Aerospace application scenarios).

Automatic distribution, replication and synchronization of data are performed through the IGOR-FS distributed filesystem developed and owned by the University of Karlsruhe. IGOR-FS partitions files (and directories) into blocks, each of which is uniquely characterized by its hash value. Blocks are looked up by hash value, and chains of blocks are likewise assembled by referencing hash values. In a Grid, a network of IGOR daemons provides access to file blocks: the daemons can uniquely identify and verify each block regardless of its location (since blocks are indexed by their content, not their location), and they manage adaptive, local caches of blocks. Synchronization of changes is fully automatic. IGOR-FS is designed for the case of one or a few writers and many readers, and changes to a file are propagated automatically, since they amount to creating a new sequence of blocks rather than modifying existing blocks. This scheme also provides powerful version control functionality, because earlier block sequences remain valid.
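The content-addressing idea described above can be illustrated with a minimal sketch. This is not IGOR-FS code: the class and function names, the use of SHA-256, and the tiny block size are all assumptions chosen for illustration, since the hash function and block layout that IGOR-FS actually uses are not stated here.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # hypothetical, unrealistically small so the example is easy to follow


class BlockStore:
    """Toy content-addressed store: each block is keyed by the hash of its content."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The key depends only on the content, never on where the block is stored.
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self.blocks[key]
        # Any holder of a block can verify it by re-hashing the content.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("block content does not match its hash")
        return data


def write_file(store: BlockStore, content: bytes) -> List[str]:
    """Split content into blocks and return the sequence of block hashes."""
    return [
        store.put(content[i:i + BLOCK_SIZE])
        for i in range(0, len(content), BLOCK_SIZE)
    ]


def read_file(store: BlockStore, block_hashes: List[str]) -> bytes:
    """Reassemble a file version from its sequence of block hashes."""
    return b"".join(store.get(h) for h in block_hashes)


store = BlockStore()
version1 = write_file(store, b"AAAABBBBCCCC")
# A change produces a new sequence of hashes; existing blocks are never modified.
version2 = write_file(store, b"AAAAXXXXCCCC")
```

In this sketch both versions stay readable through their own hash sequences, and the unchanged blocks are stored only once, which is the property that makes propagation of changes and version control fall out of the same mechanism.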

IGOR-FS is used in the Pharmaceutical scenario to distribute large gene and protein databases amongst partners. It is particularly well suited to this task, since only the blocks an application actually uses are transferred, and since changes and updates are managed transparently. IGOR-FS relies on the Linux FUSE mechanism, so no MS Windows version is currently available.

SIMDAT modules for Data Infrastructure

OGSA-DAI by University of Edinburgh
IGOR-FS by University of Karlsruhe
OntoBroker® by ontoprise
MRS by University of Nijmegen