9 MetaCentre
9.1 General information
The aim of the MetaCentre project is to provide academic users with a sophisticated computing environment, which fully exploits the possibilities of high-speed computer network and provides access to computing and disk resources of the biggest academic computing centres in the Czech Republic. The main emphasis is on the support of operational infrastructure, whose needs and requirements then determine the orientation of necessary research and development activities. Another important objective of the MetaCentre project is a close cooperation with analogical international projects in order to build a sound base of all necessary professional and technical know-how.
The result of the project is a national Grid, i.e., a distributed virtual computer that enables both simultaneous utilization of computing resources and the possibility of using individual nodes without knowing exactly the location and to a certain extent even the architecture of individual computers. While CESNET invests into own cluster computing capacities (based on the IA-32 architecture), the project also integrates external computing systems (Alpha, SGI), many terabytes of disk capacities in individual centres and a tape back-up library, purchased several years ago by CESNET.
The main activity of the project is development and operational maintenance of the Czech national grid. In the development area we pursued a closer cooperates with two international projects of the 5th EU Framework Program - DataGrid (see below) and GridLab, where Czech Republic is represented by the Masaryk University.
The following centres participated in the project in 2003:
- Institute of Computer Science, Masaryk University in Brno
- Institute of Computer Science, Charles University in Prague
- Centre for Information Technology, University of West Bohemia in Pilsen
- Computing Centre, Technical University of Ostrava
Except for the last one, all the centres contribute their computing capacities, particularly large computing systems from SGI and Compaq.
MetaCentre also directly administers or provides professional support to the managers of external clusters, in particular NCBR (National Centre for Biomolecular Research) at Masaryk University in Brno and NTC (New Technologies Research Centre) at University of West Bohemia in Pilsen. The total number of processors administered this way is about 165.
9.2 Operation
The basic structure of MetaCentre has been stable for several years. The computing resources are situated in four localities and three cities: Brno (ÚVT MU), Prague and Pilsen (CIV ZÈU). The two workplaces in Prague are located at the premises of ÚVT UK and CESNET. All centres are directly connected to the high-speed backbone of the CESNET2 network with a 1 Gbps link. This capacity can be upgraded according to the needs.
The bulk of computing power is provided by clusters based on processors with the IA-32 architecture. Processor counts and general features of these clusters are as follows:
- Brno: 32×Pentium III 1 GHz, 64×Pentium 4 Xeon 2.4 GHz
- Prague: 64×Pentium III 700 MHz
- Pilsen: 32×Pentium III 1 GHz
In total 192 processors are instantaneously available in the dual-processor configuration (i.e., 96 nodes) with RAM capacity of 1 GB per unit. The disk size increases from 9 GB per node in the oldest (and least powerful) nodes through 18 GB per node up to 36 GB in the latest nodes.
All cluster nodes are equipped with a Fast Ethernet network adapter (100 Mbps). Cluster nodes in Pilsen and in Brno are interconnected with high-speed networks. In Pilsen it is the Myrinet network (1.2 Gbps full duplex), while in Brno each node includes two active Gigabit Ethernet interfaces (integrated in the cascade of ProCurve switches from HP) and, in addition, 16 nodes are equipped with a Myrinet network interface (transmission speed up to 2 Gbps full duplex). This distribution enables solving jobs with enormous demands on inter-node transmission throughput and also testing the influence of different interfaces on program scalability.
However, the computing capacities of the MetaCentre are not limited only to clusters. At the end of the year 2002, CESNET purchased an HP server with two Itanium II processors (1 GHz, IA-64 architecture), 6 GB of internal memory and 100 GB of local disk space. The primary role of this server was to provide a test environment for this type of architecture. We found out that the real maximum performance was around 140 % of the performance of a Pentium 4 Xeon processor with 2.4 GHz frequency. For the sander program computations (LES - equilibration), we were able to achieve a performance almost identical to that of a dual-processor Pentium 4 Xeon with MPI/shmem compiled with a PGI compiler. On the other hand, operations with big blocks of memory (routines mem* in the libc library) are not really optimal - the achieved performance was often only half of the performance of the Pentium 4 Xeon processor. Compilers for the IA-64 architecture also have aggressive optimisations turned on by default (extensive reorganisations of floating point operations), which can cause computational errors.
The participating centres provide also their own computing capacities, consisting mostly from SGI computers with MIPS processors (ÚVT UK and ÚVT MU) and Compaq computers with Alpha processors (ZÈU). In addition to the computing capacities, disk arrays are also available with a total capacity exceeding 5 TB.
All cluster nodes run the Debian Linux operating system, which was upgraded to version 3.0 during 2003. All computing resources of the MetaCentre are accessed through the PBSpro batch system, maintained by Masaryk University in Brno. The PBSpro system cooperates with native batch systems of big computers (e.g., NQE). Furthermore, we have installed the globus system (version 2.2.4) together with a gateway between this system and PBS. Moreover, in the second part of the year we started the testing of globus version 3.0.
Data backup is realised by a large-volume tape library at ÚVT MU using the NetWorker system by Legato. Although the backup covers all machines and disk capacities and the volumes of backup data steadily increase, the transmission capacity of the CESNET2 network is still sufficient. The capacity of the tape library is now fully utilised (with backups being archived for a period of at least 3 months).
Many programs and software systems are available to the users of the MetaCentre. Their summary can be found at the portal web site meta.cesnet.cz. The project budget is used for updates of the development environment running on clusters, namely compilers (Portland Group and Intel) and monitoring and debugging tools for parallel programs (TotalView, Vampir, and VampirTrace). Among application software a special mention deserves the maintenance of the Matlab system. Data access is managed in part by the distributed AFS file system. On clusters we currently use the free OpenAFS software.
9.2.1 Extension in 2003
Since the equipment ordered in 2002 was delivered as late as in February 2003, the budget allocated for the latter year was used for enlarging capacity of selected nodes instead of increasing the number of processors. In particular, additional 32 disks with the capacity of 36 GB were purchased for the newest cluster (with Xeon processors) and in the half of the nodes the RAM memory capacity was increased up to 2 GB. This extension of memory and disk capacities enables us to utilize more widely the advanced methods of job planning, especially displacing the running jobs with lower priority when a job with a higher priority arrives - the so-called preemptive planning. However, this method requires that both the old job (with lower priority) and the new one (with higher priority) fit into disk and memory at the same time. Preemptive planning enables a faster execution of short application jobs with high priority. In addition to the extension of computing nodes we upgraded the disk capacity of cluster front-ends and purchased another one (dual CPU Intel Pentium 4 Xeon 3 GHz).
In reaction to an increasing demand for processors with a 64-bit architecture, we purchased an IBM p615 computer with two Power4+ processors (1.2 GHz, 6 GB RAM) and a server with two AMD-64 processors (together with the SuSE Linux operating system, as it is the only Linux system really supporting this kind of processors). The IBM computer will be installed in Brno and the AMD server in Pilsen. The experience with these computers will help us decide about further investments in 2004 as well. Furthermore, in 2003 we purchased additional 16 Myrinet cards and the corresponding switch extension card in Brno so all 32 nodes (64 Pentium 4 Xeon processors) will be interconnected with this high-speed network.
Acquisition of new software was limited, apart from the SuSE Linux operation system mentioned above, to the purchase of a supercomputing license for the Gaussian'03 program. Computations with the Gaussian program account for almost 50 % of the total MetaCentre computing capacity, and besides the new version contains many extensions and upgrades requested by end users.
9.2.2 Operational statistics
More detailed statistical data about utilization of all MetaCentre capacities during the last year are available in the MetaCentre Annual Report, what follows are just selected data about utilization of the clusters.
While the annual average utilization of clusters is only 41 %, during the last six months the interest in cluster computing has been evidently increasing:
- Utilization in the last 6 months is 51 %
- Utilization in the last two months is 65 %
- Utilization in the last month (November) is 80 %
Most demanded are the powerful systems with Pentium 4 Xeon processors.
In total 24,524 jobs were processed representing almost 39.5 thousand days of computation (between January 1st and December 12th, 2003). Per-job average processor utilisation was 1.79, indicating that one- and two-processor jobs prevail. On the other hand several users were running their jobs simultaneously on 11-12 processors on average, so there definitely is certain interest in using the clusters for medium-parallel jobs as well (see also below).
MetaCentre has approximately 200 users, about 70 of them being highly active. The 7 most active users have consumed 45 % of all processing time for their computations (these statistics involve clusters, not big computers). Accounts are assigned for one year and for their extension we require the users to deliver a report about their utilization of MetaCentre resources during the previous calendar year.
9.3 Information services
The portal at meta.cesnet.cz is the basic information gateway of the MetaCentre. The portal provides both general information for casual visitors and specific information for registered users - the latter requires authentication.
The current trend in the domain of information services is to use directory services as "shopwindows" for important data. One of the basic objectives is to make the search and acquisition of information as simple as possible. This information may be distributed in different parts of the computing infrastructure, e.g., in Perun or the batch system). Therefore, in 2003 we have been work on an update of MetaCentre directory services, both in terms of technology and contents - in order to make it a viewport to the Perun system.
The update of information services involved the following changes:
- Schema modification
- Update and extension of the LDAP
schema for the presentation of user data. The schema used is based on
a standard schema extended by several specific attributes. For
compatibility reasons, our strategy has been to reuse standard schemas
as much as possible. In particular, we use the
eduPersonschema and recommendations from the Internet2 project (concerning personal data) andrfc2307(concerning account and group data). Description of the new schema is included in the technical report about the Perun system [Køe04]. - Promotion of the Czech language
- The support for Czech diacritics in the clients of directory services is still far from satisfactory. Our solution tries to find a compromise between preserving compatibility and providing information in proper Czech, i.e., with all accents.
- Mechanism of data distribution from the MetaDatabase
- Independently of the data structure, this update allows to propagate database modifications to LDAP in real time. Some clients and applications can greatly benefit from this improvement.
- Infrastructure of directory servers
- The new mechanism of data distribution from the MetaDatabase will enable the replication of LDAP servers via fully standard mechanisms. The OpenLDAP supports optional Kerberos authentication during the replication.
9.3.1 Perun
The principal new feature of the Perun system for user account management is the implementation of the concept of homogeneous clusters, i.e., clusters where user accounts are identical on all nodes. The original system was proposed when these clusters were not widespread yet. Even though the system was able to administer them, the increase in the number of nodes beyond 50 started to cause considerable performance problems. The current solution which has eliminated repeated time-consuming generation of identical data files can accommodate clusters with thousands of nodes.
Another noteworthy achievement related to the administration of big clusters is the newly developed part of the administrative portal. It enables the administrator to observe the state of data propagation on individual nodes. Another extension of the administrative portal is a systematic interface for browsing and updating data about users, administered tools, accounts, etc.
In relation to our cooperation in international projects, the application data scheme has been extended with PKI infrastructure support. The system is able to keep track of the bindings of users to their X509 certificates and subsequently map them to particular accounts.
Data export from the Perun system to the MetaCentre information
service (LDAP) was updated so that it is now compatible with standard
schemas inetOrgPerson, eduPerson, and
rfc2307. Recent features implemented in the OpenLDAP software
enabled us to introduce an incremental propagation of modifications
into the LDAP tree during full server operation. Modifications
thus show up right after they take place, and not only after the regular
overnight shutdown of the LDAP server as before.
Furthermore, we tested the possibility of using the SQL back-end of the LDAP server (i.e., translation of LDAP queries into on-line database queries), but it was found unsatisfactory (results are summarised in the bachelor thesis by Milo¹ Malík at FI MU).
During the second half of 2003 we migrated the central server of the Perun system from the SGI station, which did not meet the performance requirements anymore and started to suffer from hardware errors, to a new IA32-based server running Linux. At the same time we have replaced the Oracle 8.0 database server with the up-to-date 9.2 version. As a by-product of this update, full Kerberos authentication became available.
The current architecture of the entire Perun system architecture is described in an extensive technical report. At the same time we have revised and completely documented all the used database structures.
Needless to mention, throughout the year we continued the routine tasks - system and data maintenance, handling of errors and user support.
9.4 Applications
In 2003 we also aimed at supporting the development and deployment of really large parallel applications utilising synchronously a large number of nodes in individual clusters.
9.4.1 Browsing the state space
In 2003 we finished the development of the application that models conformational behaviour of molecules using the force feedback mechanism (installed in the HCI laboratory at FI MU). The basic part of the application is the computation of state space for the haptic (i.e., force feedback) system. Because of a high computational complexity, it is implemented as a distributed application over the MPI communication model. The computation is distributed to individual nodes using a method called "Transposition Table Driven Scheduling" distribution of computation on particular nodes, thus representing a distributed application with many small asynchronous messages. Such applications are generally appropriate for the cluster environment.
In addition to tuning the application itself [Køe03], we performed a series of experiments focusing on parallel computation properties. Results can be summarised as follows:
- On one homogeneous PC cluster the application scales in accord with the theoretical limit (given by a slight non-uniformity in the distribution of the computation) up to the available 64 CPUs.
- The results for a distributed computation run simultaneously on clusters in Brno and Pilsen do not differ from those obtained on only one of these clusters. It proves the original hypothesis that this type of application is not dependent on the latency of network interconnection.
- Experimental computations on more than 100 CPUs have proved that for a non-trivial input a linear scaling can be achieved even for this number of processors. However, exact verification of this statement will require to find a better distribution of computations taking into account the different performance of non-homogeneous clusters.
Furthermore, we have identified anomalies in MPI properties during the mentioned experiments:
- When dual-processor cluster nodes are fully utilised (i.e., two computation processes allocated on such nodes), a considerable part of their performance is consumed by MPI overhead, resulting in a noticeable degradation of total performance.
- After sending from several hundred to thousand messages we observe MPI congestion causing long execution times of MPI functions calls that are non-blocking by their defined semantics. It results in an incomplete utilization of the CPU, degrading further the total performance. This phenomenon has been observed only with the ch_p4 communicator (i.e., the TCP connection), but not during a direct communication over Myrinet.
All these results were presented on the EuroPVM/MPI conference and are summarised in the article [KPM03].
9.4.2 Fluid dynamics
In 2003 we also tested the scalability of FLUENT (parallel computation software for numeric flow simulation) on PC clusters of the MetaCentre. The test objectives were to complete data from the end 2002 and provide information about the properties of this software using new hardware and/or a new version of the software.
To be able to compare the test results with those from the end of 2002, we used the same series of jobs representing a portfolio of typical jobs with different computing demands. We have also used several computing models and methods. For testing purposes on PC clusters only a few of computation iterations have been carried out. Apart from the duration of one iteration, which is the basic indicator of the computation procedure, we also simultaneously observed further parameters, such as for the start time of FLUENT on chosen parts of the cluster. For these tests we used the clusters in Brno (skirit) and in Pilsen (nympha and minos).
The benchmarks are run on a varying number of cluster processors using different network communicators and network devices. Either one or both processors of the cluster nodes are utilised, depending on the total number of cluster processors used - one for 4-15 processors and both for a higher number.
For 32 or more processors we use 4 parts of the available clusters (nympha, minos, skirit and skurut in Prague) and the number of allocated computing nodes is distributed in a uniform way.
Results for FLUENT 6.0.12
The dependence of computation times on the number of processors and type of communicator used for one of the more demanding jobs is shown in Figure and Figure. An almost linear acceleration is reached for the Myrinet network interface, other interfaces (using 100 Mbps) give solid results as well. Contrary to our expectations the use of Network MPI (NetMPI), which is intended exactly for clusters, does not increase performance in any way, not even if we used Gigabit Ethernet on minos cluster.
The Figure shows that for sockets a slight acceleration of the computation still takes place for 56 processors, then the performance starts to decrease. For Network MPI the situation is even worse.
Results for FLUENT 6.1.22
In the second half of 2003 we installed a newer version of FLUENT on AFS in Pilsen. Ordinary users can access it by means of an appropriate module. Being limited by the number of available licenses, we performed additional mid-sized tests on new machines which form a part of the skirit cluster. We intended to compare FLUENT properties on Myrinet (2 Gbps) and on Gigabit Ethernet. Jobs have been started on cluster via the PBSPro batch system in the ordinary way without applying special privileges.
We found that the new FLUENT version does not pose so strict requirements on installed drivers for Myrinet. The preparation of configuration files for a FLUENT run on Myrinet has also been simplified. On the other hand, we had to carry out some modifications in the FLUENT installation in order to meet the specific requirements of the MetaCentre PC clusters, not envisioned by FLUENT developers.
In Figure we can see iteration times for a job of a medium complexity for various communicators. Regrettably, this FLUENT version does not provide usual detailed reports about the parallel computation procedure when Myrinet is used. The reported values can thus be measured only indirectly and with potentially large errors.
Figure 9.4: Iteration time in relation to the number of processors and to the type of used network interconnection on Nympha
Users can be pleased that Network MPI functions well, from the point of view of iteration process the communicators are equivalent (for all jobs), even though Network MPI and Myrinet have a slightly better performance. The next Figure shows the acceleration of the computation with respect to the least number of processors used for the given benchmark (job). We can see that for jobs running on machines with the the second processor being free (up to 14 CPUs), the scalability is good, at the level of SMP supercomputers.
On the next Figure we can see the startup time of all processes on all participating computing nodes. Here the Network MPI communicator shows an improvement in contrast to the previous version. The figure also gives the startup time of a job with medium demands and its distribution over individual computing nodes. Once again the use of Sockets is not appropriate because of a high number of processors. For comparison, Figure shows the data volume that is transferred per iteration for particular jobs between computing nodes during the computation.
Figure 9.6: Startup time of data distribution in relation to the number of processors and to the type of used network interconnection on Nympha
The FLUENT version 6.1.22 has reached a status of a wider usability and higher stability even for larger-scale PC clusters of the MetaCentre type.
All in all, we are able to answer the question whether an investment in a high-speed system of Myrinet type pays off for FLUENT computations or not: Gigabit Ethernet together with the Network MPI communicator provides the same performance, and for lower number of processors (up to 10) even the Sockets use is acceptable.
Figure 9.7: Time for reading case file in relation to the number of processors and to the type of used network interconnection on Nympha
|
|
contents |
next
|