5   Network and traffic monitoring

5.1   Introduction

The starting points for monitoring system design depend on the current state of the art and on trends in networking, network services and applications. The characteristics of current networks are undergoing the most significant changes in the following two fields:

This continuous progress must be systematically reflected in the development of monitoring methods and of tools for primary data retrieval, processing and visualization, because the foremost goal of activity in this area is the ability to analyze and explain events, not merely to identify them.

5.2   Network Infrastructure Monitoring

The plan for network infrastructure monitoring in 2004 was to start by analysing the systems we had previously developed and operated, and to follow up the results with the first designs and tests of components that should later form the core of a new system called G3. While designing and verifying the functionality of particular tools, we concentrated especially on the following parameters:

Large-scale and continuous monitoring
The system should be able to provide large-scale as well as continuous infrastructure measurement in a selected area. This requirement implies using relatively common technologies for primary data retrieval while searching for specific, highly efficient ways to implement them and to process the data further.
Recording the dynamics of events and processes
Recording the dynamics with at least some statistical probability while providing large-scale monitoring is a necessary condition for qualified analysis of network behaviour. We need to capture the dynamics with an accuracy corresponding to the communication characteristics of the services and applications currently in use (the hard limits are, of course, determined by the measurement method used).
Convergence to human understandable information
This objective concerns two areas. Firstly, we need to bridge the natural gap between the human perception of the network infrastructure and the logical structure of active network devices on the one hand, and their technologically defined structure on the other hand, which is given by the specific (and/or available) measurement methods and may differ from vendor to vendor. Secondly, we need to provide aggregated information. As mentioned above, the infrastructure is becoming more and more virtual and complex, and it cannot all be grasped at the level of primary information (within the scope of measurement). Aggregation, as a result of processing multiple primary values (even across multiple instances of devices or their components), may help to provide an overall view while keeping the detailed information.
Automated adapting to device reconfiguration
We want to ensure that the reconfiguration of a device is reflected in the measurement part of the system immediately or after a relatively short delay (given by configuration), without any need for user action. This "discovery" functionality is built into our current system as well, but it is driven by a configured fixed time step. It should be made more flexible in the new system, where it should also be driven by the results of the actual measurement step.

5.2.1   Development of the G3 System

G3 should become our network infrastructure monitoring system based on standard measurement methods (mainly SNMP in this case), with non-standard measurement timing and specific data processing to satisfy the objectives described above. It may become the successor of the GTDMS-II monitoring system, which is used in the Czech Republic NREN, but first of all it should help us gain new ideas and points of view on network infrastructure and on the importance of various types of information about it. Although its primary purpose is the measurement of large-scale networks, the system should be relatively small and should allow fast and easy installation for ad-hoc measurements, without any specific requirements for hardware resources or software packages.

The plan for 2004 was to begin with the basic design of the fundamental parts of the core components. Each idea was experimentally verified using an ad-hoc tool when possible. First, the measurement core was built and tested. The new system needs very efficient and flexible, yet reliable, mechanisms of SNMP data retrieval. We started with a small, relatively unified and version-independent API to the standard SNMP mechanisms (snmp-get, snmp-walk and bulk requests when available) in order to find a set of suitable measurement strategies. We continued with the elementary architecture and process management. Currently, the measurement core is functional in its basic form and appears stable. Its data acquisition appears to be much faster than that of the currently used GTDMS-II system.

We had several ideas about object identification at the data processing layer. It should be independent of the native indexing (mainly SNMP indexing, which is relatively dynamic). We implemented a strategy in which identifiers are derived from the measured values (optionally processed in a way given by configuration) of selected key items that are significant from the human point of view (interface descriptions, interface IP addresses and similar). This should allow us to follow "travelling" items with the same meaning (in the human view) across a network device or even among several devices. The design and implementation allow us to change the set of key items without affecting the measurement core. Regardless of the SNMP interface indexes, we must be able to select all interface instances and the corresponding time windows for any exclusive identification given by any combination of descriptive item values, in order to construct both technology-dependent time series (e.g., "POS 2/1", "GE 4/0") and purpose-dependent ones (e.g., "GEANT connection", "University XYZ").
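
The identification strategy described above can be illustrated with a minimal sketch. It is not the G3 implementation: the key item names, the normalization step and the use of a SHA-1 digest are all assumptions chosen for illustration. The point it shows is that two measurements of the same logical interface yield the same identifier even when the SNMP interface index differs.

```python
import hashlib

# Hypothetical set of key items; in the described design this set is configurable.
KEY_ITEMS = ("ifDescr", "ifAlias", "ipAddress")

def normalize(value):
    """Illustrative processing step: collapse whitespace and lower-case the value."""
    return " ".join(str(value).split()).lower()

def object_id(measured, key_items=KEY_ITEMS):
    """Derive an identifier from key item values only, ignoring the SNMP index."""
    parts = [f"{k}={normalize(measured.get(k, ''))}" for k in key_items]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()[:16]

# The same logical interface before and after a reboot changed its ifIndex.
before = {"ifIndex": 7, "ifDescr": "POS 2/1",
          "ifAlias": "GEANT connection", "ipAddress": "192.0.2.1"}
after = {"ifIndex": 12, "ifDescr": "POS 2/1",
         "ifAlias": "GEANT connection", "ipAddress": "192.0.2.1"}
same_object = object_id(before) == object_id(after)
```

Because the identifier depends only on the chosen key items, changing `KEY_ITEMS` changes how objects are grouped without touching the code that collects the measurements.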

To be able to measure short-duration peaks effectively, we designed and tested a mechanism in which the measurement time step strategy can be configured freely. It may be any combination of constants and of parameters implying pseudo-random or random time step generation. The length of the sequence is unlimited. This mechanism does not limit the recording of the dynamics of network behaviour (there are, of course, always some limits given by the basic measurement method) and, in addition, it allows us to keep the frequency of measurement within acceptable (non-destructive) limits.
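
A configurable time step strategy of this kind might be sketched as follows. The configuration format is hypothetical (it is not taken from G3): each element of the strategy is either a constant number of seconds or a `(low, high)` tuple meaning "draw the step uniformly from this range", and the sequence is repeated cyclically.

```python
import itertools
import random

def time_steps(strategy, rng=None):
    """Yield measurement time steps (in seconds) indefinitely.

    Each element of `strategy` is either a constant or a (low, high) tuple
    from which a step is drawn uniformly at random. Format is an assumption.
    """
    if rng is None:
        rng = random.Random()
    for element in itertools.cycle(strategy):
        if isinstance(element, tuple):
            low, high = element
            yield rng.uniform(low, high)
        else:
            yield float(element)

# Example: a long step, a short random step, a medium step, a very short random step.
strategy = [60, (5, 15), 30, (1, 3)]
steps = list(itertools.islice(time_steps(strategy, random.Random(1)), 8))
```

Mixing long constant steps with occasional short random ones keeps the average polling frequency low while still giving some probability of sampling inside a short peak.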

Long-term experimental measurement of basic items on a small number of devices confirmed that the measured average values match those measured by other systems. There is, however, a significant difference in the envelope curves, which indicate the network traffic dynamics; this is thanks to the timing mechanism described above. Reaching optimal and safe configuration values, as well as finding the limits of the time parameters, will probably be a subject of further research.
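
Why the averages can agree while the envelopes differ may be seen in a small synthetic example. The traffic values below are invented, and fixed sampling intervals are used only to keep the arithmetic simple; this is not a reproduction of the measurements reported here.

```python
def rates(counter, sample_times):
    """Average rate between consecutive samples of a cumulative counter."""
    return [
        (counter[t1] - counter[t0]) / (t1 - t0)
        for t0, t1 in zip(sample_times, sample_times[1:])
    ]

# Synthetic traffic: 100 units/s for one minute with a 5-second burst of 1000 units/s.
per_second = [100] * 60
per_second[20:25] = [1000] * 5

# Cumulative counter as a device would expose it; counter[t] is the value at second t.
counter = [0]
for value in per_second:
    counter.append(counter[-1] + value)

coarse = rates(counter, [0, 60])                # one sample per minute
fine = rates(counter, list(range(0, 61, 5)))    # one sample every 5 seconds

coarse_mean, coarse_peak = sum(coarse) / len(coarse), max(coarse)
fine_mean, fine_peak = sum(fine) / len(fine), max(fine)
```

Both sampling schemes report the same mean rate, but only the finer one exposes the burst in its peak value, which is what the envelope curve reflects.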

Another feature which had to be designed and tested is the controlled strategy of "best value" selection. Several measurement alternatives often exist for obtaining a specific piece of information. After certain kinds of processing (summaries, limits, algorithm), a group of several particular items may produce a single one. Typical examples are the 32-bit and 64-bit counters with the same meaning, but many others exist. Generally, we are interested in the "best" values; on the other hand, keeping the original sources and observing all the source items as well as the result may be useful under some conditions. We therefore implemented a mechanism which enables both. Making items "virtual" is an effective strategy for providing overall summary views and may be useful in the future as networks become more complex, virtual and abstract.
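
A minimal sketch of such a selection for the 32-bit and 64-bit counter case follows. The function and the shape of its result are hypothetical; only the object names `ifInOctets` and `ifHCInOctets` come from the standard interfaces MIB. The "virtual" item carries the chosen value while the original sources remain available.

```python
def best_value(sources, preference):
    """Return the first available source by preference, keeping all originals."""
    for name in preference:
        if sources.get(name) is not None:
            return {"value": sources[name], "chosen": name, "sources": dict(sources)}
    return {"value": None, "chosen": None, "sources": dict(sources)}

# Prefer the 64-bit counter; fall back to the 32-bit one when it is unavailable.
PREFERENCE = ("ifHCInOctets", "ifInOctets")

true_octets = 5_000_000_000
with_hc = best_value(
    {"ifHCInOctets": true_octets, "ifInOctets": true_octets % 2**32},  # 32-bit counter has wrapped
    PREFERENCE,
)
without_hc = best_value({"ifHCInOctets": None, "ifInOctets": 123456}, PREFERENCE)
```

The preference order is plain data, so other groups of alternative items could reuse the same function with a different tuple.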

We started to work on one of the most important parts of the new user interface, the navigation scheme. Its mechanism must be complementary to the object identification at the data processing layer described above. The tool we are currently building and testing allows us to configure and modify the structure of a view (template) interactively. The template can contain any combination of key items in any hierarchy. The native (i.e., independent of template content) attribute of the real navigation tree is aggregation. This means that every navigation object holds information about all real measured objects which have an identical descriptive value (even one resulting from some future processing, e.g., substitution) within the scope of a particular template item. For example, it makes it possible to display a single navigation object (and later a single summarized result) for all interfaces that had a particular IP address configured in the requested time window, regardless of the address "travelling" across different interfaces of the device or (more often) of interface SNMP index changes caused by reconfiguration or booting. Navigation aggregation is not limited to the data of a single device, so one can reach, e.g., a single aggregated view of an interface with a specific description even if it has moved from one device to another, or a summarized view over the whole device.
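
The aggregating navigation tree can be sketched as a nested grouping driven by a template. The data layout and field names below are assumptions made for illustration and do not describe the actual tool. Each navigation node collects every measured object sharing the same value of the template item at that level.

```python
def build_navigation(objects, template):
    """Group measured objects into a nested tree following the template order."""
    tree = {}
    for obj in objects:
        level = tree
        for item in template:
            value = obj.get(item, "(unknown)")
            node = level.setdefault(value, {"members": [], "children": {}})
            node["members"].append(obj)   # aggregation: node holds all matching objects
            level = node["children"]
    return tree

# Hypothetical measured objects: one interface whose index changed and which
# later moved to another device, plus one unrelated interface.
objects = [
    {"device": "r1", "ifIndex": 7, "ip": "192.0.2.1", "descr": "GEANT connection"},
    {"device": "r1", "ifIndex": 12, "ip": "192.0.2.1", "descr": "GEANT connection"},
    {"device": "r2", "ifIndex": 3, "ip": "192.0.2.1", "descr": "GEANT connection"},
    {"device": "r2", "ifIndex": 4, "ip": "198.51.100.1", "descr": "University XYZ"},
]

by_purpose = build_navigation(objects, ["descr", "ip"])
by_device = build_navigation(objects, ["device", "descr"])
```

Here `by_purpose["GEANT connection"]` gathers all three instances of the connection regardless of device or index, while `by_device` presents the same objects under a per-device hierarchy simply by changing the template.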

Although the system in its current state can provide continuous measurements, we must point out that the set of items which can be measured is minimal, and that measurement is provided primarily as a prerequisite for further development of the system and for long-term stabilization of its partially tested components.

5.3   Traffic Monitoring

Traffic monitoring is concerned with developing tools for efficient processing of a specific kind of elementary information about network traffic, the flows. The massive growth of traffic in current networks leads to distributed systems with efficient classification, filtering and intelligent storage. We would like to offer an overall long-term view as well as analysis of particular network interactions for both IPv4 and IPv6 traffic.
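
Filtering and aggregation of flow records can be illustrated generically as follows. This sketch does not describe FTAS internals: the record fields, the function name and the sample data are hypothetical, and the addresses come from ranges reserved for documentation.

```python
import ipaddress
from collections import defaultdict

def aggregate(flows, key_fields, predicate=lambda flow: True):
    """Sum bytes, packets and flow counts per key for flows passing the filter."""
    totals = defaultdict(lambda: {"bytes": 0, "packets": 0, "flows": 0})
    for flow in flows:
        if not predicate(flow):
            continue
        key = tuple(flow[field] for field in key_fields)
        totals[key]["bytes"] += flow["bytes"]
        totals[key]["packets"] += flow["packets"]
        totals[key]["flows"] += 1
    return dict(totals)

def is_ipv6(flow):
    """Filter: keep only flows whose source address is IPv6."""
    return ipaddress.ip_address(flow["src"]).version == 6

flows = [
    {"src": "192.0.2.10", "dst": "198.51.100.5", "proto": 6, "bytes": 1500, "packets": 3},
    {"src": "192.0.2.10", "dst": "198.51.100.5", "proto": 6, "bytes": 500, "packets": 1},
    {"src": "2001:db8::1", "dst": "2001:db8::2", "proto": 17, "bytes": 200, "packets": 2},
    {"src": "192.0.2.11", "dst": "198.51.100.5", "proto": 17, "bytes": 100, "packets": 1},
]

per_destination = aggregate(flows, ("dst",))
ipv6_per_protocol = aggregate(flows, ("proto",), is_ipv6)
```

Aggregating early by a small key reduces what has to be stored for long-term views, while the unfiltered records remain necessary for analysing individual interactions.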

5.3.1   FTAS System

In 2004 we focused on implementing and experimentally operating FTAS (Flow Based Traffic Analysis System). Its design notes and internal architecture are described in Technical reports 14/2004 and 15/2004. We implemented the system and started to operate it. Our preliminary experience shows that it is stable and functional. Currently, two separate instances of FTAS are operating. The first one consists of seven multi-purpose collectors distributed among six servers and monitors the CESNET2 backbone traffic. The second one consists of two multi-purpose collectors running on a single server. It is used as a test bed for optimizing the parameters of specific long-term statistics which run on the primary instance, and as an information base for knowledge of the traffic structure in this type of network. We are analyzing our experience with FTAS and preparing for the next steps of its development. It seems there are two areas for possible improvement. We would like to shorten the response times (especially for summary, non-filtered, aggregated requests over relatively long time intervals), decrease the interactive work required, and allow all actions to be run by single off-line requests. This added user interface functionality will be a part of system development in 2005, which corresponds to the long-term research plan.

Within the scope of our plan, other sources were added to the existing FTAS infrastructure in 2004. The most important action was the installation of a dedicated collector host for input processing of flows from the primary border router between the CESNET2 backbone and the global Internet. We also reconfigured the system several times to achieve a better balanced load among the available resources. This means that the processing of incoming as well as redistributed flow streams moved from some collector hosts to others. We regard it as a success that all these configuration changes were made through the administrative interface only, without crashing any component of the system and without interrupting the data processing.

In autumn, almost all available capacity of this activity was consumed by the practical use of FTAS. We observed a significant wave of DoS and DDoS attacks from and to nodes with Microsoft-based operating systems. With the help of FTAS we were able to analyze and explain all incoming requests as well as validate or dismiss complaints from our internal sources. These conditions were also optimal for testing system stability under heavy load and for optimizing aggregation parameters.
