6 Performance Monitoring and Optimisation
This activity investigates theoretical and practical aspects of end-to-end performance in high-speed long-distance networks. We particularly concentrate on performance monitoring, study of protocol behaviour and optimisation of their operation. The activity is associated with the research project LOBSTER and with the JRA1 activity of the GN2 project.
In the following sections we summarise the most important results achieved in 2005. In particular, we describe our performance monitoring architecture, the parallel socket library that we have developed and our new method for hardware anonymisation of packet headers in passive monitoring. More information about our research can be found on our web page.
6.1 Performance Monitoring in CESNET2 Network
We want to obtain information about performance characteristics of our network between the major PoPs (Points of Presence). Therefore, we decided to install monitoring stations in all major network nodes. At present, we have 9 stations installed as shown in Figure 6.1. Each station has one network interface for general connectivity and active measurements and one or two additional interfaces for passive monitoring. Active measurements allow us to monitor throughput, delay and packet loss. The measurements are scheduled and their results processed on request via our own scripts. We participate in the JRA1 activity of the GN2 project, which is developing a universal framework for using multiple monitoring tools together in a network. This framework should be available in 2006.
6.2 Time Synchronisation
Measurements of some performance characteristics such as one-way delay require precise time synchronisation between the monitoring stations. In principle, we can use network synchronisation via the NTP protocol. However, in this way we cannot achieve the required precision and, moreover, it would be difficult to precisely measure one-way delay over links that were used by the NTP protocol for synchronisation. Therefore we decided to install an independent precise time source in each monitoring station. After considering properties and cost of several GPS and DCF receivers, we decided to use the following configuration:
- GPS receiver Garmin GPS 18 (or GPS 35)
- interface converters RS-232 <-> RS-422
- receiver connected directly to serial port
Monitoring stations use Linux kernel 2.4.29 with the "nanokernel patch" (which is not yet available for 2.6 kernels). Time synchronisation is done by ntpd version 4.2.0.
This configuration is now being used in Brno, Plzeň, České Budějovice and Olomouc. Further installations in Ústí nad Labem and Hradec Králové are in preparation. The station in Prague uses the GPS receiver Trimble Acutime 2000, with the signal distributed by a splitter to several computers. The station in Ostrava uses the previously installed geodetic GPS receiver Topcon GB-1000. The installation in Liberec has not been completed yet due to logistical difficulties.
In order to increase the accuracy of time synchronisation, we installed the rubidium clock PRS-10 from Stanford Research Systems with a frequency stability of $5\times10^{-12}$. This clock will allow our NTP time servers to keep their system time within microsecond precision for up to several days without the GPS signal.
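As a rough plausibility check of this holdover figure, the following sketch treats the quoted stability as a constant fractional frequency offset. This is a simplification (a real oscillator also ages and reacts to temperature), and the function names and the 1 µs budget are illustrative assumptions rather than part of our setup.

```python
SECONDS_PER_DAY = 86_400


def holdover_error_us(fractional_offset: float, days: float) -> float:
    """Accumulated time error in microseconds after free-running for `days`.

    Assumes a constant fractional frequency offset (simplified model).
    """
    return fractional_offset * days * SECONDS_PER_DAY * 1e6


def days_within_budget(fractional_offset: float, budget_us: float) -> float:
    """Number of days until the accumulated error reaches `budget_us`."""
    return budget_us / (fractional_offset * SECONDS_PER_DAY * 1e6)


if __name__ == "__main__":
    # Stability figure quoted for the rubidium clock in the text.
    print(holdover_error_us(5e-12, days=1))
    print(days_within_budget(5e-12, budget_us=1.0))
```

Under this simplified model the error grows by roughly 0.43 µs per day, so a 1 µs budget lasts a little over two days.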
In 2006 we plan to implement a system for monitoring state and quality of time synchronisation of our time servers and monitoring stations.
6.3 Parallel Transfers
Congestion control in standard TCP cannot utilise all available bandwidth in fast long-distance networks because its AIMD(1, 0.5) algorithm for adjusting the congestion window (cwnd) is slow for networks with a high bandwidth×delay product. We can use one of the "fast TCP" implementations with more aggressive cwnd adjustment or our own AIMD patch, which allows per-socket congestion avoidance configuration. However, these solutions require root access, patching the kernel and rebooting of the operating system.
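To illustrate why AIMD(1, 0.5) is slow on such paths, the sketch below estimates how long a single flow needs to regain its full window after one loss. The 1 Gbps bandwidth and 100 ms RTT are assumed example values, not measurements from our network; the 1448-byte segment size matches the block size used later in this chapter.

```python
def bdp_packets(bandwidth_bps: float, rtt_s: float, mss_bytes: int = 1448) -> float:
    """Bandwidth x delay product expressed in segments of mss_bytes."""
    return bandwidth_bps * rtt_s / (8 * mss_bytes)


def aimd_recovery_time_s(bandwidth_bps: float, rtt_s: float,
                         a: float = 1.0, b: float = 0.5,
                         mss_bytes: int = 1448) -> float:
    """Time to grow cwnd back to the full window W after a single loss.

    AIMD(a, b): on loss cwnd drops by the fraction b (W -> (1 - b) * W),
    then grows by a segments per RTT, so recovery takes b * W / a RTTs.
    """
    window = bdp_packets(bandwidth_bps, rtt_s, mss_bytes)
    return (b * window / a) * rtt_s


if __name__ == "__main__":
    # Assumed example path: 1 Gbps, 100 ms round-trip time.
    print(bdp_packets(1e9, 0.1))
    print(aimd_recovery_time_s(1e9, 0.1))
```

For the assumed path the window is about 8600 segments and recovery from a single loss takes about seven minutes, during which the link is not fully used.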
An alternative option for increasing throughput is to use several transfers in parallel. We started our work on implementing parallel transfers in 2004. In the last few months we rewrote the psock parallel socket library that provides easy access to parallel transfers in applications. The library can be downloaded from our web pages.
6.3.1 Psock Parallel Socket Library
We started by specifying the following requirements that the implementation of parallel transfers should satisfy:
- Parallel transfers must be easily implementable in applications that normally use the standard BSD socket library.
- The distribution of data into individual transfers should efficiently utilise the achievable throughput of individual transfers. A slower individual transfer should not limit a faster one.
- The method of distributing data into individual transfers must be configurable.
- The architecture must allow for parallel execution on a multiprocessor computer.
An application that uses the psock library runs in the library thread. Psock forks off a new protocol thread, which distributes data into individual streams and controls the protocol behaviour. The two threads communicate asynchronously by message passing over the Parallel Socket Control Interface (PSCIF). The psock architecture is shown in Figure 6.2.
The protocol thread uses a control socket (1) to exchange control messages with the peer - to negotiate the number of individual transfers, to announce port numbers or to agree on a parallel transfer driver.
The protocol thread listens for events on data sockets (2), reads block headers from them and prepares the parallel transfer schedule table. The library thread uses this table to control multiplexing and demultiplexing (3) of data to and from the data sockets.
6.3.2 Parallel Transfer Schedule Table
The parallel transfer schedule table consists of two circular buffers, one for sending and one for receiving data. Each buffer item includes the number of the individual transfer that should be used next for sending or receiving data and the number of bytes that can be sent or received. It also has two pointers indicating which item should be used next. If this item is empty (no individual transfer is available for sending or receiving), the library thread must wait until the table is filled by the protocol thread. A table for three individual transfers may look as follows:
| Read next data from this stream/bytes | 2/1448 | - | 0/1448 |
|---|---|---|---|
| Write next data to this stream/bytes | - | 0/1448 | 1/1448 |
Table 6.1: Parallel transfer schedule table
| Read pointer | 0 |
|---|---|
| Write pointer | 1 |
Table 6.2: Pointers to schedule table
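The table and its pointers can be sketched as a small data structure. The following is an illustrative model only: the class and method names are hypothetical and do not reflect the psock source code, and the locking needed between the two threads is omitted.

```python
from typing import List, Optional, Tuple

# One table item: (stream number, byte count); None marks an empty item.
Slot = Optional[Tuple[int, int]]


class ScheduleRing:
    """One circular buffer of the schedule table (sending or receiving)."""

    def __init__(self, size: int) -> None:
        self.slots: List[Slot] = [None] * size
        self.pointer = 0  # item to be used next by the library thread

    def fill(self, index: int, stream: int, nbytes: int) -> None:
        """Protocol-thread side: announce that `stream` can move `nbytes`."""
        self.slots[index % len(self.slots)] = (stream, nbytes)

    def take(self) -> Slot:
        """Library-thread side: consume the next item, or None if it is empty.

        A None result means the caller must wait for the protocol thread.
        """
        item = self.slots[self.pointer]
        if item is None:
            return None
        self.slots[self.pointer] = None
        self.pointer = (self.pointer + 1) % len(self.slots)
        return item


def example_tables() -> Tuple[ScheduleRing, ScheduleRing]:
    """Rebuild the state shown in Tables 6.1 and 6.2."""
    read_ring, write_ring = ScheduleRing(3), ScheduleRing(3)
    read_ring.fill(0, stream=2, nbytes=1448)
    read_ring.fill(2, stream=0, nbytes=1448)
    write_ring.fill(1, stream=0, nbytes=1448)
    write_ring.fill(2, stream=1, nbytes=1448)
    write_ring.pointer = 1
    return read_ring, write_ring
```

With this state, the first `take()` on the read buffer yields stream 2 with 1448 bytes, and the second returns `None` because item 1 is still empty, which is the situation in which the library thread has to wait.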
6.3.3 Round-Robin Driver
The method of distributing data into individual streams is determined by a parallel transfer driver. We implemented two drivers. A round-robin driver divides data into equally-sized blocks and sends them over individual transfers in a round-robin fashion. Throughput of the parallel transfer is then $c = n \times \min_i c_i$, where $c_i$ is the throughput of the $i$-th individual transfer and $n$ is the number of individual transfers.
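Because equally-sized blocks are sent in strict rotation, every stream carries the same amount of data and the slowest stream paces all the others. A minimal sketch of the resulting throughput formula (the function name is ours, chosen for illustration):

```python
from typing import Sequence


def round_robin_throughput(rates: Sequence[float]) -> float:
    """Aggregate throughput n * min(c_i) of a round-robin parallel transfer.

    `rates` holds the throughput c_i of each individual transfer,
    all in the same unit (for example Mbps).
    """
    if not rates:
        raise ValueError("at least one individual transfer is required")
    return len(rates) * min(rates)


if __name__ == "__main__":
    # Individual throughputs of 142 and 567 Mbps, as in Section 6.3.5.
    print(round_robin_throughput([142, 567]))
```

For the two paths evaluated in Section 6.3.5 this gives 284 Mbps, regardless of how fast the second path is.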
6.3.4 Poll-All Driver
The poll-all driver uses the poll() system call on all sockets of individual transfers. Data are distributed into individual transfers as they become available for sending. The sender can therefore utilise the different and varying available bandwidth of each individual transfer. Data blocks may not arrive in order. If a block arrives so early that it cannot fit in the table, its number is added to the farsched_list structure linked to the item in the table where it should fit later.
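The effect of this demand-driven distribution can be illustrated with a small simulation. It is a simplified model, not the psock implementation: each stream is assumed to have a fixed rate and to become ready for the next block as soon as the previous one has been sent, which stands in for a socket being reported writable by poll().

```python
import heapq
from typing import List, Sequence, Tuple


def poll_all_throughput(rates_mbps: Sequence[float],
                        block_bits: int = 1448 * 8,
                        n_blocks: int = 100_000) -> Tuple[float, List[int]]:
    """Simulate handing each block to whichever stream is ready first.

    Returns the aggregate throughput in Mbps and the number of blocks
    carried by each stream.
    """
    # Heap of (time at which the stream is ready again, stream index).
    ready = [(0.0, i) for i in range(len(rates_mbps))]
    heapq.heapify(ready)
    per_stream = [0] * len(rates_mbps)
    finish = 0.0
    for _ in range(n_blocks):
        t, i = heapq.heappop(ready)
        done = t + block_bits / (rates_mbps[i] * 1e6)
        per_stream[i] += 1
        finish = max(finish, done)
        heapq.heappush(ready, (done, i))
    return n_blocks * block_bits / finish / 1e6, per_stream


if __name__ == "__main__":
    throughput, blocks = poll_all_throughput([142, 567])
    print(throughput, blocks)
```

In this model the faster stream simply carries proportionally more blocks, so the aggregate throughput approaches the sum of the individual rates instead of being limited by the slowest stream.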
6.3.5 Psock Evaluation
We tested the psock library in a number of scenarios. Two interesting cases are presented next.
Round-robin and poll-all over two distinct physical paths
We used two paths between the sender and receiver located about 260 km apart in two cities. One path had a bandwidth of 155 Mbps while the other path had a bandwidth of 620 Mbps. The setup is illustrated in Figure 6.3.
We measured TCP throughput using iperf with socket buffer sizes set slightly above the bandwidth×RTT product.
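The buffer sizing rule can be sketched as follows. The 10 ms RTT and the 10 % headroom are assumed values for illustration, since the measured RTT is not stated here; the socket options SO_SNDBUF and SO_RCVBUF are standard, but the operating system may clamp or adjust the requested size.

```python
import socket


def buffer_size_bytes(bandwidth_bps: float, rtt_s: float,
                      headroom: float = 1.1) -> int:
    """Socket buffer size slightly above the bandwidth x RTT product."""
    return round(bandwidth_bps * rtt_s / 8 * headroom)


def make_tuned_socket(size: int) -> socket.socket:
    """Create a TCP socket and request `size` bytes for both buffers.

    The kernel may limit the value according to its own configuration.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return sock


if __name__ == "__main__":
    # 620 Mbps path with an assumed RTT of 10 ms.
    size = buffer_size_bytes(620e6, 0.010)
    print(size)
    sock = make_tuned_socket(size)
    print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
    sock.close()
```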
- TCP throughput over each of the links was 142 Mbps and 567 Mbps, respectively, which was close to the nominal bandwidths of 155 Mbps and 620 Mbps.
- TCP throughput of a parallel transfer using the round-robin driver was 282.6 Mbps. As expected, this was nearly twice the throughput of the slower link, i.e., 2 × 142 Mbps = 284 Mbps.
- TCP throughput of a parallel transfer using the poll-all driver was 707.7 Mbps, which was very close to the total throughput of both links, i.e., 142 Mbps + 567 Mbps = 709 Mbps.
This test showed that the poll-all driver can better utilise the available bandwidth of disparate paths used for parallel transfers.
Round-robin and poll-all drivers and fluctuation of available bandwidth
We configured both links for the same bandwidth of 310 Mbps. We measured TCP throughput of a parallel transfer with each driver for 15 seconds. During the middle third of each interval (time offsets from 5 to 10 seconds), we added another TCP stream as cross-traffic to link 1. Measured throughput is shown in Figure 6.4. At the bottom, the figure also shows the throughput of the cross-traffic streams. It is apparent that the poll-all driver was able to maintain a significantly higher throughput of the parallel transfer than the round-robin driver by utilising the available bandwidth of the less loaded link, with only a small effect on the cross-traffic streams.
The advantages of our psock library over other implementations of parallel transfers are:
- no changes to the TCP stack in the operating system kernel are needed (it can be the standard TCP or some kind of "fast TCP"),
- easy adaptation of existing network applications,
- ability to use different available bandwidth of individual connections,
- possibility of experimenting with different algorithms for distributing data into individual transfers.
6.4 Development of Firmware for a Hardware Monitoring Adapter
CESNET, Masaryk University and Technical University in Brno have been developing firmware for network monitoring with COMBO cards for the past several years. New research projects and activities are emerging that can benefit from the use of programmable hardware, such as the LOBSTER project or the JRA1 activity of the GN2 project. As the hardware development team currently has no spare capacity, we decided to create another development team to work on the new projects. Two PCs with COMBO cards and development software will be installed in the Department of Telecommunications at the Faculty of Electrical Engineering of the Czech Technical University. These computers will be used for student work in courses. Independently of the cooperation with the Department of Telecommunications, we started to work on hardware anonymisation for the LOBSTER project.
6.4.1 Hardware Anonymisation of Packet Headers
In passive monitoring we directly observe real traffic rather than test packets injected by ourselves. Captured traces of real user traffic are very useful for networking research, ranging from security analysis to testing routing algorithms. However, we must ensure privacy of the data sent by users. Therefore, it is very important to remove all sensitive information from captured packets, while preserving the original traffic dynamics.
As part of the LOBSTER project we are working on hardware anonymisation of packet headers. The anonymisation is done in two layers. The first layer is implemented in hardware of the monitoring adapter while the second runs as software on the host PC. Hardware anonymisation is faster and prevents sensitive information from getting to the host PC (it is removed by the monitoring adapter). Software anonymisation is slower, but can operate on higher layer protocols, such as HTTP or SNMP. CESNET is responsible for the first layer - hardware anonymisation.
As anonymisation is a kind of data transformation, we designed and implemented a universal Transformation Unit (TU) and added this unit to the SCAMPI firmware. The structure of the TU unit is shown in Figure 6.5. It is a soft core implementing a specialised instruction set. For each incoming packet a sequence of instructions is executed.
The TU unit can currently transform the following packet header fields:
- Source and destination IP address
- Source and destination TCP or UDP port
- Any 16-bit or 8-bit field whose offset is specified relative to any of the above fields
Therefore, the TU unit can effectively transform any field in the IP, TCP, UDP or ICMP headers. The following transformations can be applied to the above fields:
- Reset to zero
- Set to a constant
- Set to a pseudo-random number
- XOR with a constant
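The four transformations can be illustrated in software as follows. This is only a model of the operations, not the TU firmware or its instruction set; the function names and the sample addresses are hypothetical. The offsets 12 and 16 are the standard positions of the source and destination address in an IPv4 header, and the header checksum is deliberately not recomputed.

```python
import random
from typing import Optional


def transform_field(packet: bytearray, offset: int, width: int, op: str,
                    const: int = 0,
                    rng: Optional[random.Random] = None) -> None:
    """Apply one transformation in place to `width` bytes at `offset`.

    op is one of "zero", "const", "random" or "xor".
    """
    mask = (1 << (8 * width)) - 1
    old = int.from_bytes(packet[offset:offset + width], "big")
    if op == "zero":
        new = 0
    elif op == "const":
        new = const & mask
    elif op == "random":
        new = (rng or random).getrandbits(8 * width)
    elif op == "xor":
        new = old ^ (const & mask)
    else:
        raise ValueError(f"unknown transformation: {op}")
    packet[offset:offset + width] = new.to_bytes(width, "big")


def sample_ipv4_header() -> bytearray:
    """Minimal 20-byte IPv4 header with documentation-range addresses."""
    header = bytearray(20)
    header[0] = 0x45                             # version 4, header length 5 words
    header[12:16] = bytes([192, 0, 2, 1])        # source address
    header[16:20] = bytes([198, 51, 100, 7])     # destination address
    return header


if __name__ == "__main__":
    pkt = sample_ipv4_header()
    transform_field(pkt, 12, 4, "zero")               # reset source address
    transform_field(pkt, 16, 4, "xor", 0xFFFFFFFF)    # XOR destination address
    print(pkt.hex())
```

Fields of 16 or 8 bits, such as the TCP or UDP ports that follow the IP header, can be handled by the same function with `width` set to 2 or 1 and the appropriate offset.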
We successfully tested the functionality of the TU unit on the COMBO card with real traffic. We are currently extending the set of possible transformations and implementing other functions of the TU unit. In particular, we are working on a prefix-preserving IP address mapping.
![[Figure]](infrastruktura.gif)
![[Figure]](tu.gif)