13 DataGrid
13.1 General information
2003 was the third year of the DataGrid project, which is part of the 5th EU Framework Programme. The project has more than 20 participants coordinated by CERN. This year, the main aim was to create a more stable grid environment in which even larger-scale application computations could be tested.
The CESNET Association team is involved in the activities of work package 1, which is responsible for resource management and for the development of a complex Workload Management Service (WMS). We are specifically responsible for the logging and bookkeeping service and for the security mechanisms in use. CESNET also operates the Certification Authority whose certificates are accepted by all partners of this project and by other European Grid projects as well. As an unfunded activity, CESNET also participates in operating the DataGrid testbed together with the Institute of Physics of the Academy of Sciences of the Czech Republic.
13.2 Logging service
In 2003, the team was involved in the following basic activities:
- Gradual implementation and support of the operating version 2.0.
- Continued development of the next version 2.1 (with the prospect of version 3).
- Integration of the logging service with R-GMA (Grid monitoring architecture).
13.2.1 Operating version 2.0
In February 2003, the DataGrid project successfully passed its second review, and the project management decided to use the new version 2 in the project's production testbed. This decision made it possible to deploy a conceptually new WMS architecture, including advanced logging service features, before the end of the first half of the year. Besides the standard asynchronous mode of event delivery, the logging service supports priority-based and synchronous transfers (immediate event delivery to the designated bookkeeping server) as well as application events.
The new WMS architecture consists of components that hand job control over to each other through a network connection, a disk queue of requests, or a direct call of the corresponding procedure. All these control transfers are recorded as events by the logging service. Each event is logged by both the transferring and the accepting component, which increases robustness and also enables very detailed post-mortem analysis of unexpected states (loss of task information, race conditions, etc.). The logging state automaton, which processes events and reconstructs the task state, was modified accordingly.
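The value of logging each control transfer on both sides can be illustrated with a minimal sketch. It assumes a simplified event record (job identifier, source component, destination component, and which side logged it); the record layout, the function name and the component names in the usage note are hypothetical and are not taken from the actual logging service.

```python
from collections import defaultdict

BOTH_SIDES = {"sender", "receiver"}

def unmatched_transfers(events):
    """Return control transfers that were logged by only one side.

    `events` is an iterable of (job_id, source, destination, logged_by)
    tuples, where logged_by is "sender" or "receiver". This layout is an
    illustrative assumption, not the real event format.
    """
    sides_seen = defaultdict(set)
    for job_id, source, destination, logged_by in events:
        sides_seen[(job_id, source, destination)].add(logged_by)
    # A transfer seen by one side only points at a lost hand-over.
    return sorted(key for key, sides in sides_seen.items() if sides != BOTH_SIDES)
```

For example, if a transfer of job `j1` from a component `A` to a component `B` was logged only by `A`, the function returns `[("j1", "A", "B")]`, which tells the analyst where the job information was lost.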
13.2.2 New extensions
The new production version was much more stable and made large-scale tests of the whole grid environment possible. Sets of several tens of thousands of jobs were processed with a very high success rate: 95-97 % or more of the jobs finished as expected. These stress tests exposed the limits of the existing architecture, but they also generated new user requirements on the functionality of the whole WMS, and of the logging service in particular. The tests have contributed to improving the quality of the developed software.
Unlike the previous version, the state automaton no longer has a buffering function; instead, it calculates the new state of each job immediately, as soon as the server accepts an event about that job. The result of the computation is stored in the database, so the job state does not have to be recalculated even after a server crash. This version of the state automaton also supports jobs composed of multiple sub-jobs, with the dependencies between sub-jobs described by a directed acyclic graph (DAG).
The increasing stability of the WMS and of other DataGrid middleware components has led to increased interest in collecting statistical data about the whole DataGrid and its efficiency. The best source of such data (e.g., the ratio of successful tasks to all submitted tasks, the waiting time in the queue, the ratio of queue waiting time to computation time) is the LB (logging and bookkeeping) service. However, direct access to this information through the user interface causes an unacceptable load on the database itself. Therefore, we proposed and implemented a pair of commands, dump and load, which make it possible to create a copy of all events in the database in a controlled way. The database dump can be loaded into a separate database, where independent and very complex searches can be performed without any impact on the logging and bookkeeping service. A sequence of successive dump commands generates an exact copy of the original database, capturing its development in time. It also preserves finished jobs, which are otherwise purged from the active database (after some grace period). The purge operation can be used to delete from the database all completed jobs whose data have already been uploaded by users. Items marked by the purge command are physically removed only after the next dump command, which ensures that the data provided in this way are complete.
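The following is a minimal sketch of this idea: the job state is computed once, when the event is accepted, and stored persistently, so a restarted server can read it back without replaying events. The event names, state names and table layout are hypothetical, SQLite merely stands in for the real database, and DAG handling is left out.

```python
import sqlite3

# Hypothetical event-to-state mapping; the real event and state names differ.
TRANSITIONS = {
    "Accepted": "WAITING",
    "Running": "RUNNING",
    "Done": "DONE",
}

class BookkeepingServer:
    def __init__(self, conn):
        self.conn = conn
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS job_state "
                "(job_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
            )

    def accept_event(self, job_id, event_type):
        """Compute the new state immediately and persist it."""
        new_state = TRANSITIONS.get(event_type)
        if new_state is not None:  # unknown events leave the state unchanged
            with self.conn:  # commits on success, rolls back on error
                self.conn.execute(
                    "INSERT OR REPLACE INTO job_state (job_id, state) VALUES (?, ?)",
                    (job_id, new_state),
                )
        return self.state(job_id)

    def state(self, job_id):
        """Read the stored state; no event replay is needed."""
        row = self.conn.execute(
            "SELECT state FROM job_state WHERE job_id = ?", (job_id,)
        ).fetchone()
        return row[0] if row else None
```

A second `BookkeepingServer` created over the same database, which simulates a restart after a crash, returns the last stored state directly.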
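The interplay of dump and purge described above can be sketched with a small in-memory model. It assumes that each dump returns only the events added since the previous dump, and that purge only marks events, which are then physically removed by the next dump. The class and method names are hypothetical and do not correspond to the actual commands' interfaces.

```python
class EventStore:
    """Toy model of the active database (illustrative only)."""

    def __init__(self):
        self.events = []  # each item: dict with job, event, dumped, purged

    def log(self, job, event):
        self.events.append(
            {"job": job, "event": event, "dumped": False, "purged": False}
        )

    def purge(self, completed_jobs):
        """Mark events of completed jobs; nothing is removed yet."""
        for item in self.events:
            if item["job"] in completed_jobs:
                item["purged"] = True

    def dump(self):
        """Export events not dumped so far, then drop purge-marked items."""
        exported = [(i["job"], i["event"]) for i in self.events if not i["dumped"]]
        for item in self.events:
            item["dumped"] = True
        # Physical removal happens only here, after the export,
        # so the sequence of dumps never misses a purged event.
        self.events = [i for i in self.events if not i["purged"]]
        return exported
```

Concatenating the results of successive `dump()` calls yields every event ever logged, including those of jobs that have since been purged from the active store.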
13.2.3 R-GMA and the logging service
During 2003 we finished integrating the L&B service with R-GMA (Relational Grid Monitoring Architecture). The main causes of the long development cycle were the constant modifications and the instability of the R-GMA code. At present, an extension of the bookkeeping server is available, and the server is able to continually send information about new job states to the basic R-GMA architecture. Data arrive at the StreamProducer, which forwards them to other layers of the R-GMA infrastructure. Each higher layer registers at the previous one and defines a selection function (an SQL expression) specifying the kind of data it wants to receive. At the end of this chain there is either a user (e.g., receiving all states of his task) or a simple notification service that sends an e-mail or SMS message to the user when a given state occurs.
Unfortunately, the current R-GMA implementation does not provide all the building blocks necessary for full utilization of this infrastructure. First, no security is available: data are transferred unencrypted among the nodes, and the nodes themselves are not authenticated in any way (let alone authorized). Such an infrastructure is too vulnerable to attacks to be used on a production Grid.
A further implementation deficiency is the absence of persistent components. LB sends data to R-GMA continuously and relies on the assumption that R-GMA does not lose any information, which is not true. Furthermore, it is not possible to query R-GMA about the latest state of a particular task.
Consequently, towards the end of 2003 we started to work on our own implementation of the R-GMA infrastructure, which would utilize LB service components and provide persistence and full security.
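The layered registration with selection functions can be sketched as follows. This is only an illustration of the data flow: plain Python predicates stand in for the SQL selection expressions, and the `Layer` class and the row fields are hypothetical, not part of the R-GMA API.

```python
class Layer:
    """One layer in the chain; forwards rows that subscribers select."""

    def __init__(self, select=lambda row: True):
        self.select = select      # stand-in for the SQL selection expression
        self.subscribers = []
        self.received = []

    def register_at(self, upstream):
        """Register this layer at the previous (lower) layer."""
        upstream.subscribers.append(self)
        return self

    def publish(self, row):
        """Record the row and pass it on to every subscriber that selects it."""
        self.received.append(row)
        for subscriber in self.subscribers:
            if subscriber.select(row):
                subscriber.publish(row)


# Usage: a producer, a user interested in her own jobs,
# and a notifier interested only in finished jobs of that user.
producer = Layer()
user = Layer(select=lambda r: r["owner"] == "alice").register_at(producer)
notifier = Layer(select=lambda r: r["state"] == "DONE").register_at(user)
producer.publish({"job": "j1", "owner": "alice", "state": "RUNNING"})
producer.publish({"job": "j1", "owner": "alice", "state": "DONE"})
```

After these two calls the user layer has received both rows, while the notifier has received only the row with the `DONE` state.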
13.3 Security
Security in DataGrid, as in the majority of grid projects, is based on the Public Key Infrastructure and its certificates. Certificates are always issued for a limited period, which complicates the situation for tasks that wait in a queue or run on computing nodes for too long. The certificate can expire before the task completes, and the task can then be rejected from further processing. On the other hand, certificates with too long a validity are more prone to theft and abuse. The solution is to extend the validity of certificates before they expire; we already worked on this solution last year.
At present we extend proxy certificates for tasks that are known to the WMS (running or waiting in a queue). Following our proposals and corrections, the MyProxy server has been modified to support certificate renewal. The Globus job manager has also been modified to enable certificate extension even for running jobs. The modifications were tested by the Condor system developers and accepted into the stable Globus version together with their changes. We take care of certificate renewal for the WMS, and Condor handles the transfer of new certificates to machines with running jobs.
Within the DataGrid project, an authorization service was implemented in 2003 in the form of the Virtual Organization Management Service (VOMS). It keeps basic authorization information and provides it to entities in the form of attribute certificates (de facto authorization certificates). We have extended the MyProxy service so that it also queries the VOMS server during certificate renewal and ensures that the attribute (authorization) certificate is updated.
As VOMS "early adopters", we use this authorization information to control access to the LB data. We support common manipulations with the ACL (Access Control List), in which we accept user DNs, VOMS groups, etc. Regrettably, we have so far been the only users of this information within work package 1, and otherwise only a small group in work package 4 uses it. Consequently, the WMS is for the present not able to offer and support advanced authorization operations, such as cancelling a task other than one's own. For the time being, we do not distribute VOMS information to R-GMA.
The logging service infrastructure is ready to make consistent use of authentication and authorization information, e.g., certificate checks between loggers. However, security requirements within the DataGrid project have so far not been that high.
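The renewal rule described above (renew only for jobs known to the WMS, and do so before the certificate expires) can be sketched as a simple decision function. The function name and the one-hour safety margin are assumptions made for illustration; the text does not state the margin actually used.

```python
from datetime import datetime, timedelta

def needs_renewal(expires_at, now, job_active, margin=timedelta(hours=1)):
    """Decide whether a proxy certificate should be renewed now.

    job_active: True if the job is running or waiting in a queue.
    margin: hypothetical safety interval before expiry.
    """
    return job_active and (expires_at - now) <= margin
```

With this rule, a certificate that expires in 30 minutes is renewed for an active job, while certificates of finished jobs are simply left to expire, which keeps the lifetime of each individual proxy short.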
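An ACL check that accepts either a user DN or a VOMS group might look like the following minimal sketch. The ACL layout, the operation names and the example DN and group in the usage note are hypothetical; the real LB ACL format is not described in this text.

```python
def is_authorized(acl, user_dn, voms_groups, operation):
    """Return True if any ACL entry grants `operation` to the user.

    acl: list of dicts with an "operations" set and either a "dn"
         or a "group" key (illustrative layout).
    voms_groups: set of group names taken from the user's VOMS attributes.
    """
    for entry in acl:
        if operation not in entry["operations"]:
            continue
        if entry.get("dn") == user_dn:
            return True
        if entry.get("group") in voms_groups:
            return True
    return False
```

For instance, with an ACL containing `{"group": "/vo/admins", "operations": {"read"}}`, any user whose VOMS attributes include the group `/vo/admins` may read the job data, regardless of the DN.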
13.4 The EGEE project
During the past year we were already involved in the preparation of the pan-European EGEE project (Enabling Grids for E-science and industry in Europe) within the 6th EU Framework Programme, together with almost a hundred participants from all European countries, Russia and the USA. The aim of this project, which is also coordinated by CERN, is to create a genuinely production-quality, stable pan-European grid infrastructure. The project successfully passed the initial reviews and is expected to start on 1 April 2004. With a budget of almost 32 million euros for two years, the project intends to interconnect all European national, regional and thematically oriented grids into a uniform European Grid infrastructure, which should then be available to all academic users asking for computing or data capacities. At the same time, it should further intensify cooperation at both the European and the global level.
The project coordinator is CERN, the centre of European research in the domain of high energy physics. In total, around 70 institutions should be involved in this project, with the Czech Republic represented by CESNET. Like the Czech Republic, many countries are involved through their national research and education network or national Grid agency. Other partners are research institutions and universities. Russia is explicitly involved as well, and the EU is still investigating the most appropriate form of including the USA and Japan (the condition is their financial participation). We expect each partner to bring national, regional or thematic grid infrastructures to the project (computers, storage capacities, Internet connectivity), while EGEE will provide funding specifically for the management and operation of the whole Grid system and, to a certain extent, for the necessary development and re-engineering of indispensable software. Most project partners will assume the role of a regional support centre, with duties in training and in disseminating information about Grid technology and EGEE.
In 2004, the Grid infrastructure should be built on 25 nodes with a total capacity of approx. 5,000 processors and 50 TB of disk space. At the end of the two-year project, around 100 nodes with 50 thousand processors and one petabyte of disk capacity should be involved in the EGEE Grid. EGEE plans to provide the European research community, in the broadest sense of the word, with a simple and controllable way of accessing all these capacities.
From a certain point of view, we can regard the EGEE project as a natural continuation of the soon-to-be-finished DataGrid project of the 5th EU Framework Programme. We suppose that during its first phase EGEE will use exactly the software that has been developed by the DataGrid project and is now being adapted to the needs of CERN and its users. Apart from the primary target group of high energy physicists, new applications are expected in the domain of bioinformatics and later in Earth sciences (remote sensing), astrophysics, chemistry and others. The first three domains were mentioned in the project proposal; the involvement of users from further domains is nonetheless one of the explicit aims of the EGEE project.
The project itself comprises several interrelated activities. Besides the European Grid operation mentioned above and the activities in training and information dissemination, the following four areas have been identified, each with its own development potential:
- Grid middleware development and integration
- resulting software quality assurance
- security
- specific network services
Thanks especially to its successful contribution to the DataGrid EU project, CESNET was accepted, as the only institution in Central Europe, as a partner in one of the directly financed development activities, namely further middleware development. This can be seen as an explicit recognition of the outstanding abilities of the CESNET grid team.
Along with the accession to the EU, we can thus look forward to a new pan-European infrastructure in the domain of large-scale distributed systems. The explicit and extensive involvement of CESNET (the most extensive of all partners in Central Europe) will provide direct access to the European Grid for all interested users in the Czech Republic.