11 End-to-end performance
This project investigates theoretical and practical aspects of end-to-end performance to provide high throughput and other qualitative communication characteristics required by applications communicating over wide-area high-speed networks.
Our results are presented on the project web pages. These include all published papers, technical reports, presented talks, experimental data and developed software. This chapter gives an overview of selected project results from 2003 and of some interesting technical problems.
11.1 Transferring large data volumes over large-scale high-speed networks
The Internet has been a large-scale network spanning long distances almost since its origin. However, two new characteristics have changed the Internet only recently. First, it has become a truly high-speed network with backbone links operating at 10 Gbps or even higher speeds. Second, researchers in fields such as physics or astronomy have started to transfer large volumes of data, from terabytes to petabytes.
Once these three characteristics (long distances, high speeds and large data volumes) came together, it became clear that the communication protocols used so far, particularly the reliable TCP transport protocol carrying over 95 % of Internet traffic, as well as the data processing mechanisms on connected computers, no longer suffice to provide the required communication qualities. They need considerable improvement in order to achieve high throughput and other qualitative characteristics, such as low delay fluctuation, required by current applications. This work usually falls within the end-to-end performance field.
In 2003, several papers on the transfer of large data volumes in large-scale networks were presented at both domestic and international conferences. Some interesting technical details are presented here.
11.2 End station configuration
We have found that at speeds approaching some 100-300 Mbps, most performance problems result from suboptimal configuration of end stations. At higher speeds above some 300 Mbps, modifying the characteristics of communication protocols is usually necessary. In this section, we shall mention the most important end station configuration details which influence the throughput achievable.
11.2.1 Socket buffers
Socket buffers on both sides of a connection (sender and receiver) limit the TCP window of outstanding data, which must fit in the smaller of the two buffers, and are therefore a critical factor in the achievable throughput. The window size limits the volume of data that can be transferred during one round-trip time (RTT) interval. The default socket buffer size in most operating systems ranges from 16 kB to 64 kB. Such a small window, combined with an RTT on the order of tens or hundreds of milliseconds, which is common in long-distance communication, limits the throughput to a few Mbps or a few tens of Mbps regardless of the bandwidth available in the network.
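The relation between window size, RTT and throughput described above can be illustrated with a short calculation. This is a minimal sketch; the function name and the 100 ms RTT are illustrative assumptions, not values measured in the project.

```python
def window_limited_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput when at most one window is sent per RTT."""
    return window_bytes * 8 / rtt_seconds / 1e6


# Buffer sizes mentioned in the text, at an assumed 100 ms round-trip time.
for window in (16 * 1024, 64 * 1024, 2 * 1024 * 1024):
    rate = window_limited_throughput_mbps(window, 0.100)
    print(f"{window // 1024:5d} kB window -> {rate:8.2f} Mbps")
```

With these inputs the default-sized windows stay in the range of a few Mbps, while a 2 MB window raises the bound to well above 100 Mbps.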
Socket buffer sizes can be adjusted either for all new connections opened in an operating system, or individually for each socket opened within an application. Some operating systems, such as Linux, provide a sort of autoconfiguration and window moderation which adjusts the buffer size according to current requirements and available memory. For example, in the Linux operating system, one can use the following commands to set the default receiver and sender socket buffer sizes to 2 MB for all new connections (the three values are the minimum, default and maximum size in bytes):
sysctl -w net/ipv4/tcp_rmem="4096 2097152 16777216"
sysctl -w net/ipv4/tcp_wmem="4096 2097152 16777216"
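The per-socket alternative mentioned above can be sketched with the standard socket API. This is only an illustration, assuming a 2 MB request; the operating system may cap or otherwise adjust the requested size, so the sketch reads the granted values back instead of assuming them.

```python
import socket

REQUESTED_BYTES = 2 * 1024 * 1024  # assumed request of 2 MB per direction

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Buffers should be requested before the connection is established so that
# the window negotiated at connection setup can take them into account.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, REQUESTED_BYTES)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, REQUESTED_BYTES)

# The kernel reports what it actually granted, which may differ.
granted_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
granted_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"requested {REQUESTED_BYTES}, granted send={granted_snd} recv={granted_rcv}")

sock.close()
```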
An example of socket buffer size influencing the throughput achievable between the CESNET2 (CZ) and UNINETT (NO) networks is shown in Figure. However, there are more details involved. Linux further modifies the requested buffer sizes according to values of some kernel variables, resulting in a TCP window limit different from the specified socket buffer size. Linux also includes several other TCP implementation specifics influencing performance. We described some of them in [UbC03] and [UbC03a]. A more detailed technical report describing more Linux internals, whose understanding is useful for the end-host performance tuning, is being prepared.
Unfortunately, we can also get into trouble by setting the socket buffers too large and allowing the window to grow too much. Big windows can fill up router queues, which together with traffic fluctuations increases the probability of queue overflow and packet loss. As a result, congestion control will react by reducing the data sending rate. We can try to predict this phenomenon by observing the relation between current throughput and window size, or by monitoring the RTT fluctuations. For example, we can see in Figure that the RTT of a monitored connection reached several multiples of the basic RTT measured on an unloaded network, which was about 40 ms. At some point, a packet was lost and throughput was reduced. Consequently, the RTT stabilised again. However, it turns out that in highly multiplexed backbone circuits with complex traffic dynamics, it is very difficult to determine conclusively that the RTT growth and fluctuations were really caused by filling up the router queues.
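The idea of watching RTT growth relative to the basic RTT can be sketched as follows. The sample values, the function names and the threshold factor are hypothetical and chosen only for illustration; as noted above, such an indicator is not conclusive on multiplexed backbone circuits.

```python
def queueing_delay_ms(rtt_samples_ms, base_rtt_ms):
    """Estimated queueing delay: the part of each RTT above the unloaded RTT."""
    return [max(0.0, rtt - base_rtt_ms) for rtt in rtt_samples_ms]


def queue_buildup_suspected(rtt_samples_ms, base_rtt_ms, factor=2.0):
    """True if any sample exceeds `factor` times the base RTT (assumed threshold)."""
    return any(rtt > factor * base_rtt_ms for rtt in rtt_samples_ms)


# Hypothetical RTT samples in milliseconds on a path with a 40 ms base RTT.
samples = [41, 45, 80, 130, 160, 42]
print(queueing_delay_ms(samples, 40))
print(queue_buildup_suspected(samples, 40))
```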
An important Linux networking component that requires proper configuration is a network adapter transmission queue (txqueue). Each network adapter has its own txqueue. Packets from all connections transmitting through a network adapter come to its txqueue before they are moved to the adapter and sent to the network. We discussed the txqueue behaviour in more detail in [UbC03a]. As a rule of thumb, the ifconfig command setting the txqueue to 1000 packets for a Gigabit Ethernet adapter can be used.
11.2.2 Application tuning
Throughput can also be limited by the application. For example, we noticed low throughput while copying files over a network using the well-known scp utility. Socket buffers, txqueue and other networking components in the operating system were configured properly. Processor load on both end stations was low.
We found that the problem was caused by the way the ssh protocol (used by the scp utility) handles its data: it is split into 32 kB blocks which are acknowledged at the application level; the default maximum number of outstanding blocks is four. Thus, the ssh protocol creates its own application window with a default maximum size of 128 kB above the TCP window. The size of this window can be set in the source code of the ssh distribution. The influence of increased ssh window on the throughput is shown in Figure. Of course, increasing the data rate to be ciphered and deciphered increases the processor load as well.
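The effect of the application-level window can be estimated with a small calculation: the transfer is bounded by the smaller of the ssh window and the TCP window. This is a sketch under stated assumptions; the 40 ms RTT and the 2 MB TCP window are illustrative inputs, and only the 32 kB block size and the default of four outstanding blocks come from the text.

```python
BLOCK_BYTES = 32 * 1024        # ssh block size from the text
DEFAULT_OUTSTANDING_BLOCKS = 4  # default number of outstanding blocks


def scp_throughput_limit_mbps(rtt_seconds, tcp_window_bytes,
                              blocks=DEFAULT_OUTSTANDING_BLOCKS):
    """Throughput bound given both the ssh window and the TCP window."""
    app_window_bytes = BLOCK_BYTES * blocks
    effective_window = min(app_window_bytes, tcp_window_bytes)
    return effective_window * 8 / rtt_seconds / 1e6


tcp_window = 2 * 1024 * 1024  # assumed properly tuned TCP window
print(scp_throughput_limit_mbps(0.040, tcp_window))             # default ssh window
print(scp_throughput_limit_mbps(0.040, tcp_window, blocks=64))  # enlarged ssh window
```

With the default setting the 128 kB ssh window is the limiting factor even though the TCP window is much larger.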
11.3 PERT
PERT (Performance Enhancement and Response Team) is an emerging international initiative which attempts to create a technical and organisational framework to help users resolve performance problems of their networking applications. To some extent, PERT should enhance performance just as CERT improves security.
CESNET takes an active role in PERT preparation, presently within the TF-NGN Géant activity. In the second half of 2004, PERT should become a part of the proposed GN2 project. Our experience with PERT preparation has become a part of the D8.1 deliverable "Multi-domain monitoring and PERT" of the GN2 project.
We identified two groups of people interested in the PERT activities and willing to become pilot PERT users. The first group are the GRID researchers (particularly people from the Masaryk University); the other group are people taking care of the streaming video data transfer over the Internet.
We started to build the PERT web pages which should include three parts:
- PERT mission and operation
- frequently asked questions (FAQ), optionally updated by users
- database of known performance problems with user interface for case submission.
We proposed a structure for performance problem descriptions and created a trial database using MySQL and PHP4 scripts. After gaining more experience and considering the requirements, we concluded that a new database version based on the Request Tracker (RT) will be needed. We identified the following requirements and motivations leading to our decision to use the RT:
- RT including the LDAP authentication is already being used in CESNET.
- The database must include problem solution tracking (included in RT).
- The database must be accepted by the network operation people, who are accustomed to using the RT. We regard this as an important factor.
- User interface must be tailored for PERT purposes (in contrast to the generic RT user interface which includes several elements unnecessary or useless for PERT) and must be properly localised (in contrast to current mixture of English and Czech).
- The system must be distributed. The proposed structure includes one database for the backbone network (GN2) and one database for each NREN with optional case escalation. Detailed solution will require further in-depth analysis. One problem is already known: RT currently does not support any distributed structure, but we anticipate that this can be solved after some research and development.
- People developing the CESNET RT should be involved in this project. As the GRID research team has some more requests on adding functionality, we plan to include RT development in the CESNET activities for 2004 as well as in the GN2 project.
We propose that during escalation, each case will be investigated first to determine the likely problem area and subsequently forwarded to the person responsible for resolving problems in that area. The following potential problem areas have been identified:
- Unix (TCP window tuning, etc.)
- Windows (driver problems, etc.)
- PC hardware (interrupts, component selection, etc.)
- local or remote LAN (switches, DHCP, DNS, etc.)
- local or remote metropolitan network
- local or remote NREN
- GN2 or another global network
Perhaps the most difficult task will be finding and training the right "front-line" people accepting cases and identifying problem areas, as well as people responsible for individual problem areas.
Another task critically important for the success of PERT is the availability of a good performance monitoring system. Requirements for such a system are currently being specified based on experience from many individual performance measurements. The system should also be developed in the GN2 project framework.
11.4 Data link bandwidth estimation
Available bandwidth along a certain network path, i.e. the part of the installed bandwidth not currently used by existing traffic, is a very important dynamic network characteristic. It suggests what throughput can be expected for additional applications, whether any network segment is overloaded or failing, or whether a network upgrade may be necessary.
Available bandwidth measurement tools, such as iperf, try to completely fill all remaining bandwidth by sending data as fast as possible and measuring the achieved throughput. Obviously, this method affects the existing traffic significantly and may be used only for a short time.
In contrast, the free capacity estimation tools send only several carefully scheduled packets and try to estimate the bandwidth available by analysing the sending and receiving times of testing packets.
11.4.1 Classification of bandwidth estimation tools
As the prospect of estimating the available bandwidth without stressing the existing traffic appears attractive, we have decided to investigate if these tools can also be used in large high-speed networks. Previous studies were mostly limited to lower speeds or simple network topologies. The bandwidth estimation tools can be classified according to the following criteria:
- whether it can determine the bandwidth of the bottleneck or of all links along a path
- whether the installed or free bandwidth is reported
- whether it is based on observing the RTT changes of single testing packets or on observing delay dispersion of a set of testing packets
- whether installation on the sender, receiver or on both sides is required.
We classified several known tools representing different approaches in Table.
| Tool | Every link vs. bottleneck | Installed vs. free bw | Method | Location |
| Clink | bottleneck | installed bw | RTT | sender |
| Sprobe | bottleneck | installed bw | dispersion | sender + receiver |
| Pchar | every link | installed bw | RTT | sender |
| Pathchar | every link | installed bw | RTT | sender |
| Pathrate | bottleneck | installed bw | dispersion | sender + receiver |
| Pathload | bottleneck | free bw | dispersion* | sender + receiver |
| ABwE | bottleneck | free bw | dispersion | sender + receiver |
* relative one-way delay
Table 11.1: Classification of bandwidth estimation tools
The pathload tool reports IP-level available bandwidth, whereas the ABwE tool reports free bandwidth normalized for the TCP protocol.
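The dispersion-based method in the classification above can be illustrated by the generic packet-pair idea: two packets sent back to back are spread out by the bottleneck link, and the gap between their arrivals reflects its capacity. The sketch below is a simplified illustration with hypothetical measurements; it is not the algorithm of any particular tool listed in the table.

```python
from statistics import median


def packet_pair_estimate_mbps(packet_bytes, arrival_gaps_s):
    """Bottleneck capacity estimate from the median gap between paired packets."""
    gap = median(arrival_gaps_s)  # the median damps outliers caused by cross traffic
    return packet_bytes * 8 / gap / 1e6


# Hypothetical arrival gaps (seconds) for pairs of 1500-byte packets;
# one pair was delayed by cross traffic.
gaps = [12e-6, 12.5e-6, 11.8e-6, 30e-6, 12e-6]
print(packet_pair_estimate_mbps(1500, gaps))
```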
11.4.2 Observation summary
We summarized our observations on behaviour of bandwidth estimation tools in the CESNET technical report 25/2003. We shall mention several interesting findings here.
The pathload tool, as distributed, can estimate bandwidth up to some 120 Mbps. After tuning some of its internal constants, we managed to make it work at some 800 Mbps. However, tests on our testbed with traffic bandwidth generated by a packet stream showed that pathload could provide only very coarse estimates in this range. When accuracy of 100 Mbps was requested, all results fitted in; however, when accuracy of 10 Mbps was requested, most results were out of range.
Within another experiment, a set of several bandwidth measurement and estimation tools was deployed for a period of one month on two paths over the Géant network, consisting of more than 10 routers and OC-48 or Gigabit Ethernet links. Every hour, one set of traffic measurements and estimations by each tool took place:
- TCP iperf with various socket buffer sizes
- parallel TCP iperf with five data streams
- UDP iperf
- Pathload
- ABwE
- TCP iperf with socket buffer size adjusted according to the ABwE results
A sample of measured results in one four-day period is shown in Figure.
One can see that the values produced by different tools vary significantly, and concluding which value is close to the real available bandwidth is difficult. We can assume that parallel TCP iperf or UDP iperf are more likely to fill the available bandwidth, but they also put more stress on the existing traffic and so can report results higher than the bandwidth really available. The pathload command is very unreliable and often systematically underestimates the available bandwidth. A more detailed discussion of our observations can be found in an internal project report [UKr03].
11.5 Computer network simulations for congestion control research
Computer network simulation and emulation allows researchers to conduct experiments on models of computer networks in order to evaluate protocol behaviour and compare alternatives under defined and repeatable conditions, which would not be possible on real networks with unpredictable traffic dynamics. The most widely known network simulator is the ns2. Our experience with using ns2 for congestion control research as well as our additions and enhancements to this simulator have been published in the CESNET technical report 26/2003. In this section, a summary of some of our findings and recommendations for use of ns2 follows.
The ns2 is a freely available discrete-event object-oriented network simulator which provides a framework for building a network model, input data specification, output data analysis and result presentation. Source code is also available which allows users to add new features to the simulator, such as support for new communication protocols, monitoring tools, etc.
In real networks, four components make up the end-to-end packet delay. The ns2 tool simulates all these delay components except for the processing delay:
- Serialisation delay - time needed to put the packet on network link
- Propagation delay - time needed for the energy representing a single bit to propagate along network links, bounded by the speed of light
- Queueing delay - time that the packet waits in network node queues to be served
- Processing delay - time needed to process a packet in network nodes
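The delay components listed above can be combined in a small calculation. This is a sketch under stated assumptions: the signal speed in fibre is taken as roughly two thirds of the speed of light, queueing delay is approximated by the time needed to drain the bytes already queued, and processing delay defaults to zero, mirroring its omission in ns2. All function names and inputs are illustrative.

```python
SIGNAL_SPEED_M_PER_S = 2.0e8  # assumed propagation speed in optical fibre


def serialisation_delay_s(packet_bytes, link_bps):
    """Time needed to put the packet on the link."""
    return packet_bytes * 8 / link_bps


def propagation_delay_s(distance_m, speed_m_per_s=SIGNAL_SPEED_M_PER_S):
    """Time needed for a bit to travel along the link."""
    return distance_m / speed_m_per_s


def queueing_delay_s(queued_bytes, link_bps):
    """Time spent waiting for the bytes already in the queue to be sent."""
    return queued_bytes * 8 / link_bps


def one_way_delay_s(packet_bytes, link_bps, distance_m, queued_bytes,
                    processing_s=0.0):
    """Sum of the four delay components for a single link."""
    return (serialisation_delay_s(packet_bytes, link_bps)
            + propagation_delay_s(distance_m)
            + queueing_delay_s(queued_bytes, link_bps)
            + processing_s)


# A 1500-byte packet on a 1 Gbps link of 1000 km with 100 packets queued ahead.
print(one_way_delay_s(1500, 1e9, 1_000_000, 100 * 1500))
```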
11.5.1 Installation and simulation scripts
The ns2 tool is implemented in C++ and Tcl and should run on any POSIX-like operating system (tested on FreeBSD, Linux, SunOS and Solaris) and on Microsoft Windows. The ns2 uses several other software packages (Tcl/Tk, xgraph, etc.) which can be installed either separately or together with ns2 from the "ns-allinone" package. Some of these packages are mandatory, while others are optional, such as nam-1 for animation of a simulation run.
Once the ns2 is installed, a simulation task is specified by a simulation script written in Tcl. This script describes the network topology (nodes and their interconnection), communication protocols (e.g., TCP) and events (scheduling of data streams to be sent). Lengths of packet queues attached to links and the maximum size of the TCP window can also be specified. Creating the simulation scripts is a complex task which requires understanding of the ns2 object classes and Tcl programming.
11.5.2 TCP in ns2
There are two flavours of TCP in ns2. The first is a one-way TCP which uses objects of different classes on the sender and receiver sides. For the sender side, several classes are available for TCP: Tahoe, Reno, Newreno, Vegas and Sack or Fack, supporting selective acknowledgements. For the receiver side, three classes are available for TCP receiver: without delayed acknowledgements, with delayed acknowledgements and with selective acknowledgements. Subclasses can be derived from these supplied classes to implement modifications to the standard TCP congestion control. The second flavour is a two-way TCP which uses objects of the same class on both the sender and the receiver sides. One-way TCP is used more frequently than the two-way TCP which implements only the Reno congestion control and is considered under development.
TCP in the ns2 differs from real TCP implementations in several aspects that need to be considered during simulations, such as absence of flow control or sender blocking calls. It also does not include any throughput indication needed for almost any simulation. Our observations of TCP in ns2 have been published in the project report [UbK03].
11.5.3 Example of simulation using ns2
One of the network topologies frequently used in simulations is shown in Figure. Hosts connected to router R1 send data to hosts connected to router R2. The sum of data rates produced by source hosts is usually bigger than throughput of the link between router R1 and router R2, making it a bottleneck link. This link has also a specified non-zero packet loss rate and one-way delay while the links between hosts and routers usually are lossless and have fixed one-way delay and throughput.
The following steps must be taken:
1. Create an object for the ns2 simulator.
2. Create objects for network nodes, links and queues attached to links and specify their parameters, thus creating the network topology.
3. Create objects for the TCP sender and TCP receiver and specify their maximum window sizes.
4. Create objects for the sending and receiving applications and attach them to the TCP sender and TCP receiver objects, respectively.
5. Schedule events, such as start and end times of data streams and when the simulation should stop.
6. Start the simulation.
An example simulation script implementing the previous steps (referred to by corresponding numbers in comments) on the given network topology can look as follows:
# 1. Create an object of the ns2 simulator
set ns [new Simulator]
$ns color 0 Red
$ns color 1 Blue
proc finish {} {
    exit 0
}
# 2. Create objects for network nodes, links and queues attached to links
# and specify their parameters, thus creating the network topology
set pc1 [$ns node]
set pc2 [$ns node]
set r1 [$ns node]
set r2 [$ns node]
set em [new ErrorModel]
# Set link characteristics
$ns duplex-link $pc1 $r1 90Mb 20ms DropTail
$ns duplex-link $r1 $r2 50Mb 100ms DropTail
$ns duplex-link $r2 $pc2 90Mb 20ms DropTail
$ns queue-limit $pc1 $r1 6000000
$ns queue-limit $r1 $r2 300000
$ns duplex-link-op $pc1 $r1 orient right
$ns duplex-link-op $r1 $r2 orient right
$ns duplex-link-op $r2 $pc2 orient right
$em unit pkt
$em ranvar [new RandomVariable/Uniform]
$em set rate_ 0.0001
set streams 5
set segsize 1500
for {set i 0} {$i < $streams} {incr i} {
    # 3. Create objects for the TCP sender and receiver and specify maximum
    # window sizes
    set tcpz($i) [new Agent/TCP/Reno]
    set tcpc($i) [new Agent/TCPSink]
    $ns attach-agent $pc1 $tcpz($i)
    $ns attach-agent $pc2 $tcpc($i)
    $tcpz($i) set fid_ 0
    $tcpc($i) set fid_ 1
    $ns connect $tcpz($i) $tcpc($i)
    $tcpc($i) listen
    $tcpz($i) set window_ 500
    $tcpz($i) set segsize_ $segsize
    # 4. Create objects for the sending and receiving application and
    # attach them to objects for the TCP sender and receiver, respectively
    set snd($i) [new Application/FTP]
    set rcv($i) [new Application/TCPCNT]
    $snd($i) attach-agent $tcpz($i)
    $rcv($i) attach-agent $tcpc($i)
}
set null [new Agent/Null]
$em drop-target $null
$ns lossmodel $em $r1 $r2
# 5. Schedule events, such as the start and end times of data streams
# and when the simulation is to stop
# Note: TIME (the simulation duration in seconds) is not set in this
# listing and must be defined before this point
for {set i 0} {$i < $streams} {incr i} {
    $ns at 0 "$snd($i) start"
}
$ns at 0 "$rcv(0) settimer 0.1"
$ns at 0 "$tcpc(0) settimer 0.1"
for {set i 0} {$i < $streams} {incr i} {
    $ns at $TIME "$snd($i) stop"
}
$ns at $TIME "$rcv(0) stop"
$ns at $TIME "finish"
# 6. Start simulation
$ns run
11.5.4 Memory requirements
The volume of memory required by the ns2 for a simulation depends on the number of packets within the simulated network and on the number of packet headers maintained for each packet. In fast long-distance networks, which are often a subject of current research in congestion control, the number of packets within the network can be some tens or hundreds of thousands and the volume of memory required can grow to several gigabytes. The memory requirements can be lowered by first removing all packet headers and then adding only the required headers. For example, the following commands can be added at the beginning of a simulation script:
remove-all-packet-headers
add-packet-header TCP IP
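The number of packets held within a simulated network can be estimated from the bandwidth-delay product of the path. The sketch below uses assumed path parameters (10 Gbps, 200 ms RTT, 1500-byte packets) purely for illustration; the memory needed per packet depends on which headers are enabled and is not estimated here.

```python
def packets_in_flight(bandwidth_bps, rtt_seconds, packet_bytes=1500):
    """Number of full-sized packets that fit into one bandwidth-delay product."""
    return int(bandwidth_bps * rtt_seconds / (packet_bytes * 8))


# Assumed fast long-distance path: 10 Gbps with a 200 ms round-trip time.
print(packets_in_flight(10e9, 0.2))
```

For these inputs the result is in the range of hundreds of thousands of packets, which is consistent with the order of magnitude given in the text.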
11.5.5 Scripts for batch processing
To evaluate the congestion control mechanisms under various network conditions, a set of simulations of a selected network topology must be run in which the characteristics of the bottleneck link are varied. These include the link bandwidth, packet loss rate and one-way delay. We may also wish to experiment with different packet sizes and numbers of parallel streams, as well as with the test duration and the time granularity used for computing the resulting characteristics, such as the achieved throughput.
We added logging of TCP connection characteristics and created a set of scripts to simplify the use of ns2 for simulation of common experimental scenarios with various link characteristics and protocol parameters. The inter-relations of individual scripts are illustrated in Figure:
The sequence of script actions can be described as follows:
- Script sim1.tcl describes the network topology and simulation task
- Script simrun runs a simulation with some parameters:
  - Script simrun creates the scripts simtemp.tcl and simtemp.gpl
  - Script simrun calls the ns2 for the simtemp.tcl script
  - Ns2 creates the output file sim1.out
  - Script simrun calls the gnuplot for the simtemp.gpl script and sim1.out file
  - Gnuplot creates diagrams in PNG format
- Script simbatch repeatedly calls the simrun script with different parameters
- Script simbatchg can create the simbatch script according to specified criteria
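The kind of parameter sweep that a batch script iterates over can be sketched as a Cartesian product of bottleneck characteristics. This is a hypothetical illustration only: the parameter names and values are assumptions, and the command-line interface of the simrun script is not described here, so the sketch stops at generating the parameter combinations.

```python
from itertools import product


def parameter_grid(bandwidths_mbps, loss_rates, delays_ms):
    """All combinations of bottleneck bandwidth, loss rate and one-way delay."""
    return [
        {"bandwidth_mbps": bw, "loss_rate": loss, "delay_ms": delay}
        for bw, loss, delay in product(bandwidths_mbps, loss_rates, delays_ms)
    ]


# Hypothetical values for the bottleneck link characteristics.
grid = parameter_grid([100, 1000], [0.0, 0.0001], [10, 100])
for params in grid:
    print(params)  # one simulation run per combination
```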
11.5.6 Throughput measurement
To monitor the throughput at the application level, we created a new class Application/TCPCNT; to monitor throughput at the TCP level, we modified the class Agent/TCPSink. A description of these enhancements can be found in [UbK03].
11.5.7 Reaction to a change of available bandwidth
In order to study responses of a congestion control mechanism to increased or decreased available bandwidth, we created a sender-side application class Application/TCPFTP which generates periodic bursts of packets. To start the application, the following commands in the simulation script can be used:
set snd [new Application/TCPFTP]
$snd set interval_ n
$snd set burstsize_ m
$snd start
where n is the period in seconds and m is the number of MSS-length packets to be sent in each period. The application must be attached to the TCP sender - see the example simulation scripts. To stop the application, the following command in the simulation script can be used:
$snd stop
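The average load offered by such a periodic burst source follows directly from the period n and the burst size m. The short calculation below is a sketch that assumes an MSS of 1460 bytes; the function name and the example values are illustrative.

```python
def average_offered_rate_mbps(interval_s, burst_packets, mss_bytes=1460):
    """Average rate of a source sending `burst_packets` MSS-sized packets per period."""
    return burst_packets * mss_bytes * 8 / interval_s / 1e6


# Example: bursts of 1000 packets once per second.
print(average_offered_rate_mbps(1.0, 1000))
```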
11.5.8 Adjusting the AIMD parameters
In the original ns2 TCP, the congestion control parameters within the slow start as well as congestion avoidance phases are fixed. The latter is based on AIMD(1, 0.5). To be able to experiment with recent proposals of Fast TCP, changing the AIMD parameters should be possible. We have modified certain ns2 classes so as to be able to adjust both the slow start and congestion avoidance parameters. A detailed description of these enhancements can be found in [UbK03].
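The meaning of the AIMD parameters can be illustrated with a simplified model of the congestion avoidance phase. The sketch assumes the common reading of AIMD(a, b): the window grows by a segments per round trip and is reduced by the fraction b on a loss. It is an illustration of the principle, not the modified ns2 code.

```python
def aimd_trace(a, b, rounds, loss_rounds, initial_cwnd=1.0):
    """Congestion window (in segments) after each round trip under AIMD(a, b)."""
    cwnd = initial_cwnd
    trace = []
    for rnd in range(rounds):
        if rnd in loss_rounds:
            cwnd = max(1.0, cwnd * (1 - b))  # multiplicative decrease on loss
        else:
            cwnd += a                        # additive increase per round trip
        trace.append(cwnd)
    return trace


# Standard AIMD(1, 0.5) with a single loss in the fourth round trip.
print(aimd_trace(1, 0.5, rounds=6, loss_rounds={3}))
```

Changing a and b in the call shows how a more aggressive increase or a gentler decrease alters the window evolution.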
11.5.9 Asynchronous monitoring of TCP characteristics
The ns2 can synchronously monitor the TCP characteristics (cwnd, ssthresh, ...), recording them whenever any of them changes. In some cases, asynchronous monitoring (recording the values of all characteristics at given time intervals) may bring clearer results. Therefore, we modified the ns2 to allow asynchronous monitoring as well. A detailed description can also be found in [UbK03].
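The difference between the two monitoring modes can be shown by converting a change-driven log into fixed-interval samples. The following sketch assumes a log of (time, value) pairs sorted by time and holds the last known value at each sampling instant; the data and the function name are hypothetical and unrelated to the actual ns2 modification.

```python
def sample_at_intervals(events, interval, end_time):
    """Resample a change-driven log of (time, value) pairs at fixed intervals."""
    samples = []
    index = 0
    current = None  # value is unknown until the first event is seen
    steps = int(round(end_time / interval))
    for k in range(steps + 1):
        t = k * interval
        # Consume all changes that happened up to the sampling instant.
        while index < len(events) and events[index][0] <= t:
            current = events[index][1]
            index += 1
        samples.append((t, current))
    return samples


# Hypothetical cwnd changes logged synchronously (time in seconds, cwnd in segments).
cwnd_events = [(0.0, 1), (0.05, 2), (0.12, 4), (0.31, 2)]
for t, value in sample_at_intervals(cwnd_events, 0.1, 0.4):
    print(f"{t:.1f} s: cwnd={value}")
```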
11.5.10 Difficulties we ran into
We encountered several symptoms of unexpected behaviour and ran into some problems when using the ns2:
- TCP sometimes stops sending data for a few seconds
- slow start occurs in TCP Reno after the router queue fills up
- throughput diagrams show fluctuations at fine timescales
- non-numeric artefacts appear in the simulation log.
These phenomena are presented together with explanations for some of them in [UbK03].
At the present time, the ns2 simulator is used for research in congestion control for long-distance high-speed networks. A paper on this topic is being prepared.
11.6 Developed software
In 2003 we created the following software packages:
11.6.1 Evaluation of bandwidth measurement and estimation tools
A set of scripts used for evaluation of bandwidth measurement and estimation tools. The obtained results were presented in the CESNET technical report 25/2003.
11.6.2 Analysis of time and geographical characteristics of network traffic
A set of tools for analysing time and geographical characteristics of network traffic from netflow records. These tools were used to analyse the CESNET international traffic. A technical report on this topic is being prepared.
11.6.3 Linux kernel monitoring
A patch for configuring and monitoring certain events in the Linux kernel that influence the performance of TCP bulk transfers. In particular, it allows setting the AIMD speed as well as enabling, disabling and monitoring the CWV and CWR mechanisms. This patch is being used for congestion control research; a paper on this topic is being prepared.
11.6.4 NIST Net deterministic patch
A patch that provides deterministic packet loss and queue length for a popular emulation package NIST Net. Our enhancements will be described in a technical report on our experience with network emulation.
11.7 Other project activities
Together with the Optical networks and their development project we conducted an experiment using the Intel 10 Gigabit Ethernet PC adapters in order to evaluate the feasibility of providing a 10-Gigabit Ethernet connectivity up to the end stations. The results were presented in CESNET technical report 10/2003.
Building productive relationships with international partners leading to motivating proposals of further research activities within several planned 6th Framework Programme projects is also regarded as an important project result.
11.8 Planned activities
In 2004 we plan to concentrate on three research areas. The first area is congestion control in long-distance high-speed networks. We managed to gain a lot of experience in this field and we are working on several papers and technical reports on this topic. The second area is performance monitoring. Our intention is to implement the results of the SCAMPI project which is developing a programmable monitoring platform for the high-speed Internet. The third area is the Performance Enhancement and Response Team.