6   Performance monitoring and optimisation

This research activity investigates theoretical and practical aspects of end-to-end performance in high-speed long-distance networks. We concentrate particularly on monitoring their performance, studying the protocol behaviour and optimising their operation. This activity is associated with the SCAMPI and LOBSTER research projects and with the GN2 project JRA1 activity.

In the following sections we summarize the most important results achieved in 2004. More information about our publications, software and conducted experiments can be found on our web page.

6.1   Scheduling, processing and presenting the performance monitoring tests

When data traffic passes through a network, it experiences various performance characteristics, such as throughput, delay, packet loss rate and jitter. Many performance monitoring tools have been developed to measure these characteristics or to check the current state of the network in order to verify that the required performance can be achieved. Performance monitoring tools are indispensable for locating faults and points that degrade performance, and for observing trends in network operation.

Different characteristics require different measurement methods. Also, different monitoring goals require different result processing and presentation. Tests can be started on demand or scheduled regularly. Measured values need to be appropriately aggregated. Consequently, a comprehensive view of various performance characteristics requires a set of different tools; this implies a need for an extensible framework for scheduling individual tests and for processing and presenting their results.

As part of our participation in the GN2 project JRA1 activity (Performance measurement and management) we developed a pilot version of the test scheduling and result processing framework. It allows us to acquire experience with various tools and observe their behaviour when used simultaneously between specified points in our network.

6.1.1   Test specification

Complete information about which tests should be run, as well as the measured and processed results, is stored in a MySQL database; its structure is illustrated in Figure 6.1.

The test_type table includes one record for each test type. The tests table includes one record for each test instance which can run between certain end points. The attributes table includes parameters of test instances, such as the IP addresses of end points. Measured values are stored in the mvalues table and aggregated according to instructions in the aggregations table. The database also includes information on graphs presenting the results and constraints about which tests cannot run at the same time because they would influence each other.

[Figure]

Figure 6.1: Database structure

6.1.2   Test scheduling and data aggregation

The test scheduler is implemented by a script which reads the tests table and executes tests at specified times. The script checks whether particular tests can be started in parallel or whether they must not overlap. Tests at remote sites are executed through NRPE (Nagios Remote Plugin Executor). The remote site runs another script which wraps the particular test command (e.g., iperf) and translates its output into a suitable format; a child process started by the scheduler then processes this output and writes it to the database.
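The overlap check described above can be pictured with a minimal sketch in C. It assumes that the constraints are available as pairs of test identifiers that must not run at the same time; the structure and function names are hypothetical and do not come from the actual scheduler script.

```c
#include <stddef.h>

/* Hypothetical constraint record: tests test_a and test_b must not overlap. */
struct constraint {
    int test_a;
    int test_b;
};

/*
 * Return 1 if `candidate` may be started while the tests in running[]
 * are in progress, or 0 if any constraint forbids the overlap.
 */
int can_start(int candidate, const int *running, size_t nrunning,
              const struct constraint *c, size_t nc)
{
    for (size_t i = 0; i < nrunning; i++) {
        for (size_t j = 0; j < nc; j++) {
            if ((c[j].test_a == candidate && c[j].test_b == running[i]) ||
                (c[j].test_b == candidate && c[j].test_a == running[i]))
                return 0;   /* conflicting test is already running */
        }
    }
    return 1;
}
```

A scheduler built this way would postpone a test for which can_start() returns 0 until the conflicting test has finished.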

Aggregation is performed by another script (usually specific to each test type), executed approximately every 10 minutes. The aggregation intervals are specified in the aggregations table and are independent of the interval at which the script is executed. The script aggregates and then deletes individual measurements older than the current timestamp minus the deadline field.
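The aggregation rule can be illustrated by the following sketch. It is not the actual aggregation script: the in-memory structure standing in for rows of the mvalues table, the function name and the use of an arithmetic mean are all assumptions made for illustration.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical in-memory stand-in for one row of the mvalues table. */
struct mvalue {
    time_t ts;     /* time of the measurement */
    double value;  /* measured value */
    int    done;   /* set to 1 once aggregated (the row may then be deleted) */
};

/*
 * Average all not yet aggregated measurements that fall into the interval
 * [start, start + interval) and are older than now - deadline.
 * Returns the number of aggregated measurements; if it is non-zero,
 * their mean is stored in *mean.
 */
size_t aggregate_interval(struct mvalue *rows, size_t n, time_t start,
                          time_t interval, time_t now, time_t deadline,
                          double *mean)
{
    double sum = 0.0;
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (rows[i].done)
            continue;
        if (rows[i].ts >= now - deadline)
            continue;                       /* too recent, keep for later */
        if (rows[i].ts < start || rows[i].ts >= start + interval)
            continue;                       /* belongs to another interval */
        sum += rows[i].value;
        rows[i].done = 1;
        count++;
    }
    if (count > 0)
        *mean = sum / (double)count;
    return count;
}
```

Because the selection depends only on timestamps, the function can be called at any time, which matches the statement that the aggregation intervals are independent of how often the script runs.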

6.1.3   Test result presentation

All measured results from all tests are available in a uniform web-based user interface illustrated in Figure 6.2. The user can click on any graph to display it at a larger resolution, as illustrated in Figure 6.3. A set of graphs for different time scales is generated automatically.

[Figure]

Figure 6.2: Test result presentation

[Figure]

Figure 6.3: Enlarged graph of throughput measured by bwctl

Users first select the type of test. Currently we use three types of tests: bwctl (a wrapper around iperf) for active throughput measurement, owamp for active one-way delay measurement, and reading of router interface byte counters via SNMP for instantaneous link load measurement. We plan to add further tests as needed. The system finds all test instances of the selected type automatically.

In the next step, users select a particular test instance by specifying its end points. The system automatically finds all performance characteristics measured by this test instance. Afterwards, users choose one or more performance characteristics to be plotted in graphs using different colors and line styles.

Finally, users can choose whether values exceeding the 95th percentile should be omitted from graphs. This allows the graph space to be used to display most values in finer detail; otherwise, a few excessive values might compress most values into a small portion of the graph space. The user can also plot several test instances in the same set of graphs for easy comparison.
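The effect of omitting values above the 95th percentile can be sketched as follows. The nearest-rank definition of the percentile and both function names are assumptions chosen for illustration; the presentation scripts may compute the threshold differently.

```c
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort() on doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile (pct in 1..100) of v[0..n-1].
 * Returns 0.0 if n == 0 or if memory allocation fails.
 */
double percentile(const double *v, size_t n, unsigned pct)
{
    if (n == 0)
        return 0.0;
    double *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return 0.0;
    memcpy(tmp, v, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_double);

    size_t rank = (pct * n + 99) / 100;     /* ceil(pct * n / 100) */
    if (rank < 1)
        rank = 1;
    if (rank > n)
        rank = n;
    double result = tmp[rank - 1];
    free(tmp);
    return result;
}

/* Copy values not exceeding `limit` into out[] (capacity n); return how many were kept. */
size_t clip_above(const double *v, size_t n, double limit, double *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] <= limit)
            out[kept++] = v[i];
    return kept;
}
```

With twenty samples of which one is an extreme outlier, percentile(v, 20, 95) returns the 19th smallest value, and clip_above() then drops only the outlier, so the vertical axis can be scaled to the remaining values.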

6.2   Congestion control monitoring and optimisation

Most network traffic is currently carried by TCP, which provides reliable data transfer. TCP uses several congestion control mechanisms whose task is to adjust the sending rate to the currently available bandwidth and to exploit that bandwidth as fully as possible. TCP was designed in the early days of the Internet, when network lines were slow. Some of its mechanisms do not provide optimum performance on today's high-speed long-distance networks, where a large volume of outstanding data (sent by the sender but not yet acknowledged by the receiver) is in flight.
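The volume of outstanding data needed to keep a path full is given by the bandwidth-delay product, i.e. the path bandwidth multiplied by the round-trip time. A small helper, written here only as an illustration (the function name is hypothetical), makes the scale of the problem concrete:

```c
/*
 * Bandwidth-delay product in bytes: the amount of unacknowledged data
 * that must be in flight to fill a path of the given bandwidth and RTT.
 */
unsigned long long bdp_bytes(unsigned long long bits_per_sec, unsigned rtt_ms)
{
    return bits_per_sec / 8 * rtt_ms / 1000;
}
```

For a 1 Gbit/s path with a round-trip time of 100 ms, bdp_bytes(1000000000ULL, 100) gives 12,500,000 bytes, which is far more than the 65,535 bytes that a TCP window can express without window scaling.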

Many performance problems are caused by improperly set parameters of the TCP congestion control mechanisms. In order to optimise these parameters, we need a monitoring system providing real-time information about TCP internal runtime variables. We also need a tool to configure these parameters.

6.2.1   Bulk utility

We developed a utility called bulk for active throughput measurement (similar to the well-known iperf tool) which also allows synchronous monitoring of TCP runtime variables. Bulk uses the standard setsockopt() and getsockopt() system calls to set and get parameters of TCP connections, in particular the TCP_INFO socket option. The utility includes mnemonics for many known socket options; however, their numeric codes can also be used, so the tool also works with socket options newly added in later operating system kernel versions.
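To show the kind of system call involved, the following sketch reads a few runtime variables of a TCP socket through the TCP_INFO option. It is not taken from the bulk source code; it assumes Linux with glibc, and the function name is hypothetical.

```c
#define _DEFAULT_SOURCE 1      /* glibc: expose struct tcp_info in strict -std modes */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

/*
 * Print selected TCP runtime variables of a socket (Linux-specific).
 * Returns 0 on success, or -1 if getsockopt() fails (errno is set).
 */
int print_tcp_info(int sockfd)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);

    if (getsockopt(sockfd, IPPROTO_TCP, TCP_INFO, &info, &len) < 0)
        return -1;

    /* tcpi_rtt is reported in microseconds, tcpi_snd_cwnd in segments. */
    printf("rtt: %u us, rcv_ssthresh: %u, snd_cwnd: %u\n",
           (unsigned)info.tcpi_rtt, (unsigned)info.tcpi_rcv_ssthresh,
           (unsigned)info.tcpi_snd_cwnd);
    return 0;
}
```

Calling such a function periodically on the socket used for the data transfer gives a time series of the variables, which is the idea behind monitoring synchronously with the throughput measurement.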

For instance, we may want to measure the performance using ten parallel data streams and a 10 MB sender socket buffer. Suppose we also want to check that the initial connection handshake (during which the TCP window scaling factor is also agreed upon) went properly, and to observe the TCP runtime variables rcv_ssthresh and rtt and the actual size of the receiver socket buffer. We can then use the following command:

./bulk -v -m -c50 -sSO_RCVBUF,10000000 -b10000000\
  -gwscales%Scales:\ ,rcv_ssthresh%Recv\ thresh:\ ,rtt%RTT:\
  -gSO_RCVBUF%Real\ RCVBUF\ size:\ >outfile_r.txt

The output may look as follows:

...
[5] Scales: 16 Recv thresh: 81919 RTT: 10000 Real RCVBUF size: 131070
[5] Scales: 16 Recv thresh: 81919 RTT: 10000 Real RCVBUF size: 131070
...

6.2.2   AIMD patch

We also developed the AIMD patch for the Linux operating system (available for both 2.4 and 2.6 kernels). This patch allows us to set the aggressiveness and responsiveness of AIMD (Additive Increase Multiplicative Decrease), the primary TCP congestion control mechanism. Standard TCP uses AIMD(1, 0.5): the sender congestion window (cwnd) is increased by one segment (MSS) per RTT and decreased to 0.5 of its current value when packet loss is detected. These settings are normally fixed in a TCP implementation and do not allow the available bandwidth to be fully utilized in high-speed long-distance networks with a high volume of outstanding data.

This patch also permits switching on/off and monitoring the CWV (Congestion Window Validation) and CWR (Congestion Window Reduction) activity. All these options can be configured and monitored individually for each socket connection. This is a significant advantage over similar existing tools, which can only operate on all socket connections at once. The patch implements two new socket options for setsockopt() and getsockopt(), TCP_AIMD and TCP_COUNTERS; these are also supported by the bulk tool.

As an example, if we want to modify TCP congestion control just for the connections of the current application to AIMD(2, 0.75), we can use the following code fragment:

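struct tcp_aimd aimd;            /* defined by the AIMD patch */
struct tcp_counters counters;    /* for TCP_COUNTERS; not used in this fragment */
socklen_t wsize = sizeof(aimd);  /* value-result length for getsockopt() */

aimd.slope=200;   /* additive increase of 2 (apparently in hundredths) */
aimd.ratio=75;    /* multiplicative decrease to 0.75 (apparently in hundredths) */
aimd.cwven=0;     /* CWV setting */
aimd.tqcwr=0;     /* CWR setting */

if (setsockopt(sockfd, SOL_TCP, TCP_AIMD, &aimd, sizeof(aimd)) < 0)
    perror("setsockopt(TCP_AIMD)");
if (getsockopt(sockfd, SOL_TCP, TCP_AIMD, &aimd, &wsize) < 0)
    perror("getsockopt(TCP_AIMD)");

printf("AIMD members: Slope: %d, Decr: %d, CWV: %d, CWR: %d\n",
       aimd.slope, aimd.ratio, aimd.cwven, aimd.tqcwr);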

The sender congestion window will then develop as illustrated in Figure 6.4. The volume of transferred data is given by the area under the cwnd curve. Of course, filling the available bandwidth more aggressively may affect other connections and may increase packet loss, thus influencing the throughput of all connections. For optimum results, the AIMD parameters need to be adjusted to the particular network conditions.

[Figure]

Figure 6.4: Congestion window development for different AIMD parameters
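The qualitative behaviour shown in the figure can be reproduced with a very simple model. The sketch below is an idealisation written for illustration only: it assumes that a loss occurs exactly when cwnd would exceed a fixed capacity, it ignores slow start and timeouts, and the function name is hypothetical.

```c
#include <stddef.h>

/*
 * Idealised AIMD(a, b) model. cwnd (in segments) grows by `a` per RTT and
 * is multiplied by `b` whenever it would exceed `capacity`.
 * Stores cwnd for each of `rtts` round trips in trace[] and returns the
 * total number of segments sent, i.e. the area under the cwnd curve.
 */
double aimd_trace(double a, double b, double capacity, double start,
                  double *trace, size_t rtts)
{
    double cwnd = start;
    double total = 0.0;

    for (size_t i = 0; i < rtts; i++) {
        trace[i] = cwnd;
        total += cwnd;
        cwnd += a;                  /* additive increase */
        if (cwnd > capacity)
            cwnd *= b;              /* multiplicative decrease after loss */
    }
    return total;
}
```

With a capacity of 10 segments, a starting window of 5 and 8 round trips, this model sends 57 segments with AIMD(1, 0.5) and about 63.5 segments with AIMD(2, 0.75), which illustrates why a larger area under the curve means more transferred data.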

6.3   Parallel scp and parallel socket library

One way to achieve higher throughput for transfers of large data volumes is to use several connections in parallel. This allows us not only to utilize several parallel physical paths, but also to better utilize the bandwidth available on a single high-speed network path. The latter case is an alternative to modifying the AIMD parameters. The advantage is that we do not need to modify the operating system kernel; all changes can be made within the application.

We developed two implementations of parallel transfers. The first implementation is pscp, a parallel version of the well-known scp program for secure remote file copying. The operation of pscp with two parallel connections is illustrated in Figure 6.5. Each connection uses one instance of the underlying ssh program, which does not need to be modified. An important feature is that we can use the standard sshd daemon on the server side, because modifying it would require root access.

[Figure]

Figure 6.5: pscp (parallel scp) operation

Results of performance measurements from copying a 100 MB file over two network paths through the GÉANT2 network are summarized in Table 6.1. PC1 was ezmp2.switch.ch in Switzerland and PC2 was tcp4-ge.uninett.no in Norway. Both sending and receiving socket buffers were large enough not to limit communication. We tried 1, 2, 5, and 10 parallel connections. The table indicates the number of seconds needed to copy the file. We can see that parallel communication generally reduced the time needed to copy the file. In certain cases the performance stopped improving or even dropped with a higher number of parallel connections. There are several possible reasons: the sending or receiving machine may have been overloaded, or the maximum throughput achievable with the current background traffic had already been reached and more streams just increased the processing overhead.

Connections   Cesnet->PC1   PC1->Cesnet   Cesnet->PC2   PC2->Cesnet
 1            21.6          28.5          43.9          36.7
 2            12.1          14.2          41.2          25.8
 5            12.9           9.2          34.3          25.8
10            14.8           9.4          32.8          28.0

Table 6.1: Duration of file copying using pscp

We also developed the first release of psock, a parallel version of a standard socket library for network applications. The advantage of the psock library is that parallel transfers can be used with any application after only small modifications. The library also permits the use of different algorithms for distributing data among parallel streams, thus adjusting the transfer to the conditions of individual connections and allowing experiments with different techniques. We are currently enhancing psock and measuring its performance.
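As an illustration of what distributing data among parallel streams can mean, the sketch below shows plain round-robin striping of fixed-size blocks. This is only one conceivable algorithm and is not taken from the psock library; both function names are hypothetical.

```c
#include <stddef.h>

/* Index of the stream that carries the block containing byte `offset`. */
size_t stream_for_offset(size_t offset, size_t block_size, size_t nstreams)
{
    return (offset / block_size) % nstreams;
}

/* Number of bytes of a `total`-byte transfer that are sent over stream `s`. */
size_t bytes_on_stream(size_t total, size_t block_size, size_t nstreams, size_t s)
{
    size_t full_blocks = total / block_size;   /* complete blocks */
    size_t tail = total % block_size;          /* size of the last, partial block */
    size_t bytes = (full_blocks / nstreams) * block_size;

    if (s < full_blocks % nstreams)
        bytes += block_size;                   /* one extra complete block */
    if (tail > 0 && s == full_blocks % nstreams)
        bytes += tail;                         /* the partial block */
    return bytes;
}
```

Round-robin striping gives every stream nearly the same share regardless of its speed; an algorithm that adapts to the conditions of individual connections would instead hand the next block to whichever stream is ready to send.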

6.4   PERT

PERT (Performance Enhancement and Response Team) is an emerging international initiative attempting to create a technical and organisational structure to assist users in solving network performance problems.

At present a pilot PERT project is underway. The European NRENs take turns in weekly shifts to work on open performance problem cases. The knowledge acquired during problem investigation is stored in a database for further reference. PERT day-to-day operation is documented in an electronic diary.

Typical problems investigated by PERT are sudden drops in throughput during communication between two points in European NRENs, increased packet loss, or strongly asymmetric performance (particularly throughput) between some points. The PERT case database is illustrated in Figure 6.6.

[Figure]

Figure 6.6: PERT case database

6.5   Time synchronisation

Network monitoring requires assigning precise timestamps to all observed events, such as sending a packet, receiving a packet, reading registers of network devices, etc. These timestamps can then be used to compute certain important communication characteristics, such as one-way delay, round-trip time, jitter, etc.
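How such characteristics follow from timestamps can be shown with a short sketch. It assumes that send and receive timestamps come from synchronised clocks and are expressed in nanoseconds; the record layout, the function names and the particular jitter definition (mean absolute difference of consecutive one-way delays) are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-packet record with timestamps from synchronised clocks. */
struct pkt_times {
    int64_t sent_ns;   /* timestamp taken at the sender */
    int64_t recv_ns;   /* timestamp taken at the receiver */
};

/* One-way delay of a single packet in nanoseconds. */
int64_t one_way_delay_ns(const struct pkt_times *p)
{
    return p->recv_ns - p->sent_ns;
}

/* Mean absolute difference between the delays of consecutive packets. */
int64_t mean_jitter_ns(const struct pkt_times *p, size_t n)
{
    if (n < 2)
        return 0;

    int64_t sum = 0;
    for (size_t i = 1; i < n; i++) {
        int64_t d = one_way_delay_ns(&p[i]) - one_way_delay_ns(&p[i - 1]);
        sum += d < 0 ? -d : d;
    }
    return sum / (int64_t)(n - 1);
}
```

Any offset between the sender and receiver clocks adds directly to every computed one-way delay, which is why the clocks at both measurement points need to be synchronised.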

Timestamps should comply with two requirements: sufficient accuracy and sufficient resolution.

When monitoring high-speed networks, the required accuracy may reach several microseconds, and the resolution should be no coarser than tens of nanoseconds. To meet these requirements, the system clocks at measurement points are usually synchronised using GPS receivers.

Several computers on CESNET premises must have their clocks synchronised. We therefore developed and installed a distribution unit which sends the signal from one GPS receiver to eight computers over standard twisted-pair cabling. We plan to install a second unit in January 2005. Installation at some other locations of the CESNET2 network requires either a long cable from the GPS receiver to the computer or an optical coupler inserted in the connection for safety purposes. For these locations we ordered customised RS-232 to RS-422 converters suitable for outdoor installation.

6.6   Remote access to network generator/analyser

In cooperation with the Optical networks CESNET research activity we experimented with long-distance access to a 10 Gbps network generator/analyser (Spirent AX/4000) at the physical layer (L1). Such generators/analysers are useful but expensive devices, and experts would welcome the possibility of sharing access to them over a remote connection. The experiment configuration is illustrated in Figure 6.7.

[Figure]

Figure 6.7: Experimental remote access to network generator/analyser

We verified that using this device over a distance of 210 km was possible, and we expect that this distance could be extended up to 252 km in another configuration. However, the equipment (optical amplifiers and filters) needed to access the device was expensive and would render remote access uneconomical. We plan to do more experiments with remote access at higher layers (L2 and L3), which should be less expensive but would probably limit the accessible functionality of the generator/analyser.

6.7   Future work

In our future work we plan to concentrate on the following topics. Firstly, we want to continue researching the congestion control mechanisms in high-speed long-distance networks. Secondly, we plan to continue our work in performance monitoring. We want to install a pilot version of a performance monitoring system in the CESNET2 network and to contribute to the JRA1 activity of the GN2 project by integrating the SCAMPI passive monitoring platform with the emerging JRA1 infrastructure. Next, we plan to work on low-level user data anonymisation within the LOBSTER project. Finally, we plan to contribute to the PERT initiative by setting up a pilot PERT in CESNET.
