6   Performance Monitoring and Optimisation

This activity investigates theoretical and practical aspects of end-to-end performance in high-speed long-distance networks. We concentrate in particular on performance monitoring, the study of protocol behaviour and the optimisation of protocol operation. The activity is associated with the research project LOBSTER and with the JRA1 activity of the GN2 project.

In the following sections we summarise the most important results achieved in 2005. In particular, we describe our performance monitoring architecture, the parallel socket library that we have developed and our new method for hardware anonymisation of packet headers in passive monitoring. More information about our research can be found on our web page.

6.1   Performance Monitoring in CESNET2 Network

We want to obtain information about performance characteristics of our network between the major PoPs (Points of Presence). Therefore, we decided to install monitoring stations in all major network nodes. At present, we have 9 stations installed as shown in Figure 6.1. Each station has one network interface for general connectivity and active measurements and one or two additional interfaces for passive monitoring. Active measurements allow us to monitor throughput, delay and packet loss. The measurements are scheduled and their results processed on request via our own scripts. We participate in the JRA1 activity of the GN2 project, which is developing a universal framework for using multiple monitoring tools together in a network. This framework should be available in 2006.

[Figure]

Figure 6.1: Monitoring stations in the CESNET2 network

6.2   Time Synchronisation

Measurements of some performance characteristics, such as one-way delay, require precise time synchronisation between the monitoring stations. In principle, we could synchronise the stations over the network using NTP. However, this does not achieve the required precision and, moreover, it would be difficult to measure one-way delay precisely over the same links that NTP uses for synchronisation. Therefore we decided to install an independent precise time source in each monitoring station. After considering the properties and cost of several GPS and DCF receivers, we decided to use the following configuration:

Monitoring stations use Linux kernel 2.4.29 with the "nanokernel patch" (which is not yet available for 2.6 kernels). Time synchronisation is done by ntpd version 4.2.0.

This configuration is now being used in Brno, Plzeň, České Budějovice and Olomouc. Further installations in Ústí nad Labem and Hradec Králové are in preparation. The station in Prague uses the GPS receiver Trimble Acutime 2000, with the signal distributed by a splitter to several computers. The station in Ostrava uses the previously installed geodetic GPS receiver Topcon GB-1000. The installation in Liberec has not been completed yet due to logistic difficulties.

In order to increase the accuracy of time synchronisation, we installed the rubidium clock PRS-10 from Stanford Research Systems with a frequency stability of 5×10⁻¹². This clock will allow our NTP time servers to keep their system time within microsecond precision for up to several days without the GPS signal.
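The holdover claim can be checked with a rough calculation. The sketch below treats the quoted stability as a constant fractional frequency offset, which is a simplifying assumption (real oscillator drift also depends on ageing and temperature); the function name is hypothetical.

```python
def holdover_drift_us(fractional_frequency_error, seconds):
    """Accumulated time error in microseconds for a constant frequency offset."""
    return fractional_frequency_error * seconds * 1e6

SECONDS_PER_DAY = 86400

# Drift accumulated over one day without the GPS signal,
# assuming a constant offset of 5e-12.
drift_per_day_us = holdover_drift_us(5e-12, SECONDS_PER_DAY)
print(f"drift per day: {drift_per_day_us:.3f} microseconds")
```

Under this assumption the clock drifts by a fraction of a microsecond per day, which is the same order of magnitude as the microsecond-level holdover stated above.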

In 2006 we plan to implement a system for monitoring the state and quality of time synchronisation of our time servers and monitoring stations.

6.3   Parallel Transfers

Congestion control in standard TCP cannot utilise all available bandwidth in fast long-distance networks because its AIMD(1, 0.5) algorithm adjusts the congestion window (cwnd) too slowly for networks with a high bandwidth×delay product. We can use one of the "fast TCP" implementations with more aggressive cwnd adjustment or our own AIMD patch, which allows per-socket configuration of congestion avoidance. However, these solutions require root access, patching the kernel and rebooting the operating system.
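To illustrate why AIMD(1, 0.5) is slow on such paths, the following sketch estimates how long a single flow needs to regrow its window after one loss. The path parameters (1 Gbit/s, 100 ms RTT, 1460-byte segments) are hypothetical and chosen only for illustration; the model ignores slow start, delayed acknowledgements and further losses.

```python
def aimd_recovery_time(bandwidth_bps, rtt_s, mss_bytes=1460,
                       increase=1.0, decrease=0.5):
    """Seconds for AIMD(increase, decrease) to regrow cwnd to the full
    window after a single loss, growing by `increase` segments per RTT."""
    # Window (in segments) needed to fill the bandwidth x delay product.
    full_window = bandwidth_bps * rtt_s / (8 * mss_bytes)
    # Multiplicative decrease on loss.
    window_after_loss = full_window * decrease
    # Additive increase: `increase` segments are regained per round trip.
    rtts_needed = (full_window - window_after_loss) / increase
    return rtts_needed * rtt_s

# Hypothetical long-distance path: 1 Gbit/s, 100 ms round-trip time.
recovery_s = aimd_recovery_time(1e9, 0.100)
print(f"recovery after one loss: about {recovery_s / 60:.1f} minutes")
```

In this model the recovery time grows with the square of the RTT and linearly with the bandwidth, which is why a single loss is so costly on fast long-distance paths.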

An alternative option for increasing throughput is to use several transfers in parallel. We started our work on implementing parallel transfers in 2004. In the last few months we rewrote the psock parallel socket library that provides easy access to parallel transfers in applications. The library can be downloaded from our web pages.

6.3.1   Psock Parallel Socket Library

We started by specifying the following requirements that the implementation of parallel transfers should satisfy:

An application that uses the psock library runs in the library thread. Psock forks off a new protocol thread, which distributes data into individual streams and controls the protocol behaviour. The two threads communicate asynchronously by message passing over the Parallel Socket Control Interface (PSCIF). The psock architecture is shown in Figure 6.2.

[Figure]

Figure 6.2: Psock architecture

The protocol thread uses a control socket (1) to exchange control messages with the peer: to negotiate the number of individual transfers, to announce port numbers or to agree on a parallel transfer driver.

The protocol thread listens for events on data sockets (2), reads block headers from them and prepares the parallel transfer schedule table. The library thread uses this table to control multiplexing and demultiplexing (3) of data to and from the data sockets.

6.3.2   Parallel Transfer Schedule Table

The parallel transfer schedule table consists of two circular buffers, one for sending and one for receiving data. Each buffer item includes the number of the individual transfer that should be used next for sending or receiving data and the number of bytes that can be sent or received. It also has two pointers indicating which item should be used next. If this item is empty (no individual transfer is available for sending or receiving), the library thread must wait until the table is filled by the protocol thread. A table for three individual transfers may look as follows:

Read next data from this stream/bytes     2/1448    -         0/1448
Write next data to this stream/bytes      -         0/1448    1/1448

Table 6.1: Parallel transfer schedule table

Read pointer     0
Write pointer    1

Table 6.2: Pointers to schedule table
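The behaviour of one of the two circular buffers can be sketched as follows. This is a minimal single-threaded illustration, not the psock implementation: the class name, method names and the separate fill index used by the producer are assumptions, and a real implementation would need synchronisation between the protocol thread and the library thread.

```python
class ScheduleRing:
    """One circular buffer of (stream number, byte count) items."""

    def __init__(self, size):
        self.items = [None] * size   # None marks an empty item
        self.pointer = 0             # item the library thread uses next
        self.fill = 0                # item the protocol thread fills next

    def put(self, stream, nbytes):
        """Protocol thread: schedule `nbytes` on `stream`; False if full."""
        if self.items[self.fill] is not None:
            return False
        self.items[self.fill] = (stream, nbytes)
        self.fill = (self.fill + 1) % len(self.items)
        return True

    def take(self):
        """Library thread: return the next item, or None if it must wait."""
        item = self.items[self.pointer]
        if item is None:
            return None
        self.items[self.pointer] = None
        self.pointer = (self.pointer + 1) % len(self.items)
        return item


# A schedule table is then a pair of rings, one per direction.
recv_ring = ScheduleRing(3)
send_ring = ScheduleRing(3)

recv_ring.put(2, 1448)
recv_ring.put(0, 1448)
first = recv_ring.take()    # (2, 1448): read 1448 bytes from stream 2
second = recv_ring.take()   # (0, 1448)
third = recv_ring.take()    # None: the table is empty, the reader waits
```

Returning None from take() corresponds to the case described above in which the library thread must wait until the protocol thread fills the table.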

6.3.3   Round-Robin Driver

The method of distributing data into individual streams is determined by a parallel transfer driver. We implemented two drivers. A round-robin driver divides data into equally-sized blocks and sends them over individual transfers in a round-robin fashion. Because every transfer carries the same share of the data, the slowest transfer sets the pace for all of them. Throughput of the parallel transfer is then c = n × min(c_i), where c_i is the throughput of the i-th individual transfer and n is the number of individual transfers.
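The round-robin rule and the resulting throughput formula can be sketched as follows. The function names are hypothetical, and the rates in the example are the nominal path bandwidths of the testbed in Section 6.3.5, used here only as illustrative inputs and not as measured results.

```python
def round_robin_blocks(data, n_streams, block_size):
    """Split data into equally-sized blocks; block k goes to stream k mod n."""
    streams = [[] for _ in range(n_streams)]
    for k, offset in enumerate(range(0, len(data), block_size)):
        streams[k % n_streams].append(data[offset:offset + block_size])
    return streams


def round_robin_throughput(rates):
    """c = n * min(c_i): every stream is paced by the slowest one."""
    return len(rates) * min(rates)


assignment = round_robin_blocks(b"abcdefgh", n_streams=2, block_size=2)
# assignment == [[b"ab", b"ef"], [b"cd", b"gh"]]

# Two paths with nominal rates of 155 and 620 Mbps.
aggregate_mbps = round_robin_throughput([155, 620])
```

With these inputs the formula gives 310 Mbps, well below the 775 Mbps sum of the two nominal rates, which is the motivation for a driver that adapts to each path.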

6.3.4   Poll-All Driver

Poll-all uses the poll() system call on all sockets of individual transfers. Data are distributed into individual transfers as they become available for sending. The sender can therefore utilise the different and varying available bandwidth of each individual transfer. Data blocks may not arrive in order. If a block arrives so early that it cannot fit in the table, its number is added to the farsched_list structure linked to the item in the table where it should fit later.
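The idea of the poll-all driver can be sketched with Python's select.poll (available on Unix-like systems). This is an illustration under stated assumptions, not the psock code: the block header layout (sequence number and length), the function names and the dictionary used for reordering are all hypothetical, and the reordering is much simpler than the schedule table with farsched_list described above.

```python
import select
import socket
import struct

# Hypothetical block header: sequence number and payload length.
HEADER = struct.Struct("!II")


def poll_all_send(socks, data, block_size):
    """Send each block on whichever socket poll() reports writable."""
    poller = select.poll()
    by_fd = {s.fileno(): s for s in socks}
    for fd in by_fd:
        poller.register(fd, select.POLLOUT)
    offsets = iter(range(0, len(data), block_size))
    pending = next(offsets, None)
    seq = 0
    while pending is not None:
        for fd, event in poller.poll():
            if event & (select.POLLERR | select.POLLHUP):
                raise ConnectionError("individual transfer failed")
            if pending is None or not event & select.POLLOUT:
                continue
            payload = data[pending:pending + block_size]
            by_fd[fd].sendall(HEADER.pack(seq, len(payload)) + payload)
            seq += 1
            pending = next(offsets, None)
    return seq  # number of blocks sent


def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("peer closed the connection")
        buf += chunk
    return buf


def receive_blocks(socks, n_blocks):
    """Read blocks from any readable socket and reorder them by number."""
    poller = select.poll()
    by_fd = {s.fileno(): s for s in socks}
    for fd in by_fd:
        poller.register(fd, select.POLLIN)
    blocks = {}
    while len(blocks) < n_blocks:
        for fd, event in poller.poll():
            if event & select.POLLIN:
                header = recv_exact(by_fd[fd], HEADER.size)
                seq, length = HEADER.unpack(header)
                blocks[seq] = recv_exact(by_fd[fd], length)
    return b"".join(blocks[i] for i in range(n_blocks))
```

A quick local check can use two socket.socketpair() pairs in place of real network paths: send a small buffer with poll_all_send on one end of each pair and reassemble it with receive_blocks on the other ends.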

6.3.5   Psock Evaluation

We tested the psock library in a number of scenarios. Two interesting cases are presented next.

Round-robin and poll-all over two distinct physical paths

We used two paths between the sender and receiver located about 260 km apart in two cities. One path had a bandwidth of 155 Mbps while the other path had a bandwidth of 620 Mbps. The setup is illustrated in Figure 6.3.

[Figure]

Figure 6.3: Testbed with two real network paths

We measured TCP throughput using iperf with socket buffer sizes set slightly above the bandwidth×RTT product.
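The buffer sizing rule can be written out as a short calculation. The round-trip time of the testbed is not stated here, so the 5 ms value below is purely hypothetical, as are the function name and the 10 % headroom standing in for "slightly above".

```python
import math


def socket_buffer_bytes(bandwidth_bps, rtt_s, headroom=1.1):
    """Socket buffer size slightly above the bandwidth x RTT product."""
    bdp_bytes = bandwidth_bps * rtt_s / 8
    return math.ceil(bdp_bytes * headroom)


# Nominal path bandwidths from the testbed, hypothetical RTT of 5 ms.
HYPOTHETICAL_RTT_S = 0.005
buffer_155 = socket_buffer_bytes(155e6, HYPOTHETICAL_RTT_S)
buffer_620 = socket_buffer_bytes(620e6, HYPOTHETICAL_RTT_S)
print(buffer_155, buffer_620)
```

A buffer smaller than the bandwidth×RTT product would cap the throughput of the individual transfer below the path capacity, which is why the buffers are sized above it.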

This test showed that the poll-all driver can better utilise the available bandwidth of disparate paths used for parallel transfers.

Round-robin and poll-all drivers and fluctuation of available bandwidth

We configured both links for the same bandwidth of 310 Mbps. We measured TCP throughput of a parallel transfer with each driver for 15 seconds. During the middle third of these intervals (time offsets from 5 to 10 seconds), we added another TCP stream as cross-traffic to link 1. Measured throughput is shown in Figure 6.4. At the bottom, the figure also shows the throughput of the cross-traffic streams. It is apparent that the poll-all driver maintained significantly higher parallel-transfer throughput than the round-robin driver by utilising the available bandwidth of the less loaded link, with only a small effect on the cross-traffic streams.

[Figure]

Figure 6.4: Throughput over two real links, cross-traffic on one link

The advantages of our psock library over other implementations of parallel transfers are:

6.4   Development of Firmware for a Hardware Monitoring Adapter

CESNET, Masaryk University and Technical University in Brno have been developing firmware for network monitoring with COMBO cards for the past several years. New research projects and activities are emerging that can benefit from the use of programmable hardware, such as the LOBSTER project or the JRA1 activity of the GN2 project. As the hardware development team currently has no spare capacity, we decided to create another development team to work on the new projects. Two PCs with COMBO cards and development software will be installed in the Department of Telecommunications at the Faculty of Electrical Engineering of the Czech Technical University. These computers will be used for student work in courses. Independently of the cooperation with the Department of Telecommunications, we started to work on hardware anonymisation for the LOBSTER project.

6.4.1   Hardware Anonymisation of Packet Headers

In passive monitoring we directly observe real traffic rather than test packets injected by ourselves. Captured traces of real user traffic are very useful for networking research, ranging from security analysis to testing routing algorithms. However, we must ensure privacy of the data sent by users. Therefore, it is very important to remove all sensitive information from captured packets, while preserving the original traffic dynamics.

As part of the LOBSTER project we are working on hardware anonymisation of packet headers. The anonymisation is done in two layers. The first layer is implemented in the hardware of the monitoring adapter while the second runs as software on the host PC. Hardware anonymisation is faster and prevents sensitive information from reaching the host PC (it is removed by the monitoring adapter). Software anonymisation is slower, but can operate on higher-layer protocols, such as HTTP or SNMP. CESNET is responsible for the first layer, hardware anonymisation.

As anonymisation is a kind of data transformation, we designed and implemented a universal Transformation Unit (TU) and added this unit to the SCAMPI firmware. The structure of the TU unit is shown in Figure 6.5. It is a soft core implementing a specialised instruction set. For each incoming packet a sequence of instructions is executed.

[Figure]

Figure 6.5: Structure of Transformation Unit (TU)

The TU unit can currently transform the following fields in the IP header:

Therefore, the TU unit can effectively transform any field in the IP, TCP, UDP or ICMP headers. The following transformations can be applied to the above fields:

We successfully tested the functionality of the TU unit on the COMBO card with real traffic. We are currently extending the set of possible transformations and implementing other functions of the TU unit. In particular, we are working on a prefix-preserving IP address mapping.
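To show what prefix preservation means, the following software sketch maps IPv4 addresses so that two addresses sharing a k-bit prefix also share a k-bit prefix after mapping. It is not the TU firmware and not necessarily the method we will implement in hardware: it follows the general idea of keyed prefix-preserving schemes, uses HMAC-SHA256 as the keyed function for convenience, and the function name and key are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def prefix_preserving_map(addr, key):
    """Map an IPv4 address bit by bit; the decision to flip bit i
    depends only on the key and on the i preceding input bits."""
    bits = format(int(ipaddress.IPv4Address(addr)), "032b")
    out = 0
    for i in range(32):
        digest = hmac.new(key, bits[:i].encode(), hashlib.sha256).digest()
        flip = digest[0] & 1
        out = (out << 1) | (int(bits[i]) ^ flip)
    return str(ipaddress.IPv4Address(out))


def common_prefix_len(a, b):
    """Number of leading bits shared by two IPv4 addresses."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


key = b"example-key"  # hypothetical key, for illustration only
a, b = "192.168.1.10", "192.168.1.77"
mapped_a = prefix_preserving_map(a, key)
mapped_b = prefix_preserving_map(b, key)
# The mapped addresses share exactly as many leading bits as the originals.
same = common_prefix_len(mapped_a, mapped_b) == common_prefix_len(a, b)
```

Because identical prefixes always produce identical flip decisions, subnet structure survives the mapping while the original addresses are hidden from anyone who does not hold the key.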
