Conducting a Network Performance Analysis 101

When it comes to networks, in general, you want as much bandwidth as you can possibly get, which usually means as much as you can afford. Bandwidth is one of those things that you really can't have too much of; buy the best hardware (routers, switches, network adapters, and even cables) that your budget allows for. Networks are a great illustration of the classic "weakest link" principle: the slowest link in the chain determines the speed of the entire network!

To that end, before moving into a new building, flipping production traffic over to a newly commissioned circuit, or performing a large datacenter capacity upgrade (any change that will significantly affect load on a network), it's a good idea to ensure the network can handle what will be asked of it. The data obtained from a Network Performance Analysis (NPA) establishes a network's baseline characteristics, reveals its limits, and confirms that the network delivers the bandwidth you are paying for!

Important Network Measurements

In general, the following are often considered important measures of a network’s performance:

  • Bandwidth, commonly measured in bits/second, is the theoretical maximum raw rate at which information can move through a communications channel
  • Throughput is the actual maximum observed rate at which information is transferred across a network or processing system
  • Latency is the delay between the sender transmitting information and the receiver decoding it; it is mainly a function of the signal's travel time plus processing time at any nodes the information traverses
  • Jitter is the variation in packet delay at the receiving end of the information
  • Error rate is the number of lost or corrupted bits, expressed as a percentage or fraction of the total sent

Bandwidth

Bandwidth is the overall "size" of the pipe, usually measured in Mb/s (megabits per second; 1 Mb = 1/8 MB), and it often has separate upstream (upload) and downstream (download) figures when referencing connections to other networks (internet connections). Common connections range anywhere from 20-100 Mb/s down and 5-100 Mb/s up, with faster speeds available for businesses. Providers sometimes allow "bursting", a temporary push past the regular bandwidth cap for a short period, for example, to let a video buffer initially or to handle other brief periods of high traffic.

The maximum bandwidth of an internal network is measured in the same unit (Mb/s) and is limited first and foremost by hardware and cable ratings. Typical LAN speeds are 100 Mb/s or 1000 Mb/s (gigabit), with the latter becoming the standard even in the home, while business users are slowly making 10-gigabit (10,000 Mb/s) wired connections the norm. Many 10-gigabit implementations, and most anything faster, require fiber optics, which in theory is limited only by the speed of light and your budget for hardware, though fiber itself continues to fall in price.
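
To make the megabit vs. megabyte arithmetic concrete, here is a minimal Python sketch (the file size and link speeds are illustrative) that estimates the best-case transfer time over a link of a given rating:

```python
# Quick arithmetic for the Mb/s vs. MB/s distinction above: a link's rating
# in megabits per second divided by 8 gives megabytes per second, which in
# turn gives a rough best-case transfer time for a file of known size.

def transfer_time_seconds(file_size_mb: float, link_mbps: float) -> float:
    """Best-case time to move a file (in megabytes) over a link (in megabits/s)."""
    link_mb_per_s = link_mbps / 8  # 8 bits per byte
    return file_size_mb / link_mb_per_s

if __name__ == "__main__":
    # A 700 MB file over a 100 Mb/s link: 100 Mb/s = 12.5 MB/s -> 56 seconds.
    print(f"{transfer_time_seconds(700, 100):.0f} s on 100 Mb/s")
    # The same file over gigabit: ~5.6 seconds.
    print(f"{transfer_time_seconds(700, 1000):.1f} s on 1000 Mb/s")
```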

Throughput

Throughput is a measure of the actual, observed transfer rate, as opposed to the maximum theoretical transfer rate as limited by the network bandwidth. Throughput is instead limited by the slowest device in the entire transfer chain. If a file being transferred must cross an 11 Mbit wireless link, for example, the actual speed of that wireless link caps the transfer's maximum throughput.

For wired networks, throughput is often limited by the speed of the hardware reading the data to send (often the read rate of physical hard drives) rather than by saturation of the connection itself. With modern links, throughput saturation in practice happens only when many transfers take place at once.
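
The weakest-link rule can be stated in a couple of lines of Python; the chain below (hop names and speeds are illustrative, reusing the 11 Mbit wireless example) shows the bottleneck dictating end-to-end throughput:

```python
# The "weakest link" rule from the paragraph above: the end-to-end
# throughput of a transfer chain is bounded by its slowest hop. The
# names and speeds here are purely illustrative.

chain_mbps = {
    "source disk read": 800,  # a spinning disk can be the real bottleneck
    "wireless hop": 11,       # the 11 Mbit link from the example above
    "wired LAN": 1000,
}

bottleneck = min(chain_mbps, key=chain_mbps.get)
print(f"Max throughput: {chain_mbps[bottleneck]} Mb/s, limited by the {bottleneck}")
```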

Latency

Latency, often measured in milliseconds, is the time it takes for a packet to reach its intended destination. It is an important measurement for expressing the expected responsiveness of a network connection. Some services, like loading a web page, are relatively insensitive to latency; others, such as real-time gaming or VoIP calling, suffer noticeable delays in game play and conversation, respectively, when latency is high.

Most of a given latency measurement can be attributed to the approximate physical distance between connection points, plus delays introduced by equipment processing the connections. The fastest possible connection between two points (and therefore the minimum possible latency) is determined by the speed of light, which even at 299,792.458 km/s (about 186,000 miles/s) has a noticeable effect over long distances. For example, a packet round trip between NYC and LA typically takes about 67 ms over real-world paths, and a round trip between Earth and the Moon takes roughly 2,580 ms (2.58 seconds) even at light speed. This chart shows common network latencies between well-known geographical locations.
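
As a quick sanity check on those figures, this small Python sketch (the distances are approximate great-circle values) computes the light-speed floor; note how much lower the theoretical NYC-to-LA minimum is than typical measured latencies, the difference being fiber routing and equipment delays:

```python
# Back-of-the-envelope propagation delay from the figures above. Real paths
# (and fiber's refractive index) make measured latencies noticeably higher
# than these theoretical floors.

SPEED_OF_LIGHT_KM_S = 299_792.458

def one_way_delay_ms(distance_km: float) -> float:
    """Minimum one-way propagation time in milliseconds, in a vacuum."""
    return distance_km / SPEED_OF_LIGHT_KM_S * 1000

print(f"NYC-LA (~3,940 km): {one_way_delay_ms(3_940):.1f} ms one way")
print(f"Earth-Moon (~384,400 km): {one_way_delay_ms(384_400) * 2 / 1000:.2f} s round trip")
```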

Jitter

Jitter is most often discussed when measuring the performance of VoIP and/or streaming video systems, and it refers to variations in the latency of a connection. Like latency, jitter is usually measured in milliseconds.

RFC 4689 defines jitter as "the absolute value of the difference between the forwarding delay of two consecutive received packets belonging to the same stream". These delayed and/or out-of-order packets are usually the result of transient loads and/or different connection paths being used by the underlying transmission protocol itself. Jitter manifests as "choppy", less-than-real-time voice and/or video communication, buffer underruns, and generally poor performance.

Error rate

Error rate refers to packets that don't make it to their intended destination intact, or at all. Packet loss occurs for many reasons, and it is much more common across the open internet than within a LAN: once packets leave networks under your control, you have no control over the hardware paths they take, and therefore no control over the reliability of delivery.

Error rate is usually measured over time, and most protocols can tolerate a small amount of error without noticeable degradation in function. However, more than a small percentage of packets being regularly lost indicates a problem that needs to be addressed.

How To Test A Network

Now that we know what to look for when assessing a network's performance, and the signs and effects of each category, how do we actually measure them? You're probably eager to get started, so let's answer the question, "How do I test my network?"

Most of the measurements and methods are rather straightforward, and the best information comes not from a single test but from testing multiple times and averaging the results. Once you have established baselines, a monitoring system that can alert you when metrics fall outside those baselines is a great idea. There are many systems available; if you'd like to learn more about monitoring systems, leave a comment below, and if there's interest, we'll do a follow-up post!

Begin by measuring bandwidth

Measuring bandwidth is rather straightforward. There are many services available that can assist with this, but to do it yourself you'll need a server at each endpoint you are trying to measure between. Each server must be able to serve traffic fast enough to saturate the link; verify this by first testing the servers locally, with all other variables removed. Once that is verified, testing is as simple as initiating a transfer between the two endpoints and measuring the throughput, i.e. the time required to transfer a given amount of data. Average the results of several runs and compare them with the bandwidth you are paying for (or your hardware's rating), and you have determined your network's available bandwidth, or throughput.
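
As an illustration of the procedure, here is a minimal Python sketch of a two-endpoint throughput test. The port, payload size, and transfer total are arbitrary choices, and a purpose-built tool such as iperf does all of this far more rigorously:

```python
# A minimal, single-purpose throughput test between two hosts you control.
# Run with "server" on one endpoint and "client <server_ip>" on the other.
# Remember to average several runs, per the advice above.

import socket, sys, time

PORT = 5201                      # arbitrary test port
PAYLOAD = b"\x00" * 65536        # 64 KiB chunks
TOTAL_BYTES = 100 * 1024 * 1024  # send 100 MiB per run

def server():
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            received = 0
            while chunk := conn.recv(65536):
                received += len(chunk)
        print(f"received {received / 1e6:.1f} MB")

def client(host):
    start = time.monotonic()
    with socket.create_connection((host, PORT)) as sock:
        sent = 0
        while sent < TOTAL_BYTES:
            sock.sendall(PAYLOAD)
            sent += len(PAYLOAD)
    elapsed = time.monotonic() - start
    print(f"throughput: {sent * 8 / elapsed / 1e6:.1f} Mb/s over {elapsed:.1f} s")

if __name__ == "__main__":
    server() if sys.argv[1] == "server" else client(sys.argv[2])
```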

Measure latency

A useful tool for measuring a connection's latency is "ping". Ping sends a packet to the destination, waits for a response, and reports the round-trip time, giving the connection latency in ms; it will even average the results of several pings for you. It is recommended to test latency multiple times (not just a single series of pings), at different times of day, and on both an idle network and a busy one, to get an idea of how network load affects latencies; as noted earlier, processing load on equipment can slow down responsiveness and increase latency. If the measured latency is much greater than the theoretical latency (see the linked chart), this could indicate a problem that should be investigated. Small delays are to be expected, more so if you are traversing the open internet; large delays could mean defective or under-performing hardware and should be investigated.
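
If you'd rather script this than eyeball ping output, the following Python sketch approximates latency using TCP connection setup time (one round trip, and no root privileges needed, unlike raw ICMP). The host, port, and sample count are illustrative:

```python
# Latency sampling via TCP connect time, a rough stand-in for ping.
# Each successful connect() represents roughly one network round trip.

import socket, statistics, time

def tcp_rtt_ms(host: str, port: int = 443, samples: int = 10) -> list[float]:
    rtts = []
    for _ in range(samples):
        start = time.monotonic()
        # Note: includes DNS resolution time on each call; resolve the
        # name once beforehand if you want pure connection latency.
        with socket.create_connection((host, port), timeout=2):
            pass  # connection established; we only wanted the timing
        rtts.append((time.monotonic() - start) * 1000)
        time.sleep(0.2)  # small gap between probes
    return rtts

if __name__ == "__main__":
    rtts = tcp_rtt_ms("example.com")
    print(f"min/avg/max = {min(rtts):.1f}/{statistics.mean(rtts):.1f}/{max(rtts):.1f} ms")
```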

Measure Jitter across the network

Jitter is usually introduced by the network devices themselves, and it can be slightly more complicated to measure than simple throughput or latency. Packets being buffered, queued, and switched around a network all pick up small delays that add up, and these are measured as the jitter the end connections experience. There are three common ways to measure jitter, all built on the same underlying comparison of send and receive timestamps on pairs of packets:

  1. Transmit and receive time of the first packet in the pair
  2. Transmit and receive time of the second packet in the pair

The three methods are inter-arrival histograms, capturing and post-processing (e.g. reviewing packet captures), and tools that measure jitter in real time. For in-depth jitter measurement information, reference RFC 4689, or see this article, which is a great write-up on jitter.
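
As a minimal post-processing sketch, the following Python applies the RFC 4689 definition to send/receive timestamp pairs; the timestamps here are made-up values, standing in for what you would pull from a packet capture:

```python
# RFC 4689-style jitter: the absolute difference between the forwarding
# delays of consecutive packets belonging to the same stream.

def jitter_ms(send_times, recv_times):
    """Per-pair jitter from matching send/receive timestamps (ms)."""
    delays = [r - s for s, r in zip(send_times, recv_times)]
    return [abs(delays[i] - delays[i - 1]) for i in range(1, len(delays))]

# Example: four packets sent 20 ms apart, with variable delay on arrival.
send = [0.0, 20.0, 40.0, 60.0]   # transmit timestamps (ms)
recv = [15.0, 38.0, 55.0, 81.0]  # receive timestamps (ms)

# Forwarding delays are 15, 18, 15, 21 ms -> jitter [3.0, 3.0, 6.0] ms
print(jitter_ms(send, recv))
```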

Test for transmission errors

Error rate measurement is something that can and should be observed during the rest of the testing steps; it is monitored by regularly checking switch, router, and individual interface error counts. Direct testing of error rate involves resetting (or noting the current value of) the error counters at the sending and receiving endpoints, as well as on any network equipment between those endpoints that you control.
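
On a Linux endpoint, one way to snapshot those counters is to read /proc/net/dev, which exposes per-interface error and drop counts; the sketch below parses it. On switches and routers you would read the equivalent counters via SNMP or the device CLI instead:

```python
# Snapshot per-interface error/drop counters from /proc/net/dev (Linux).
# Take one snapshot before testing, another after, and compare the two.

def read_error_counters(path="/proc/net/dev"):
    counters = {}
    with open(path) as f:
        for line in f.readlines()[2:]:  # skip the two header lines
            iface, data = line.split(":", 1)
            fields = data.split()
            # Standard /proc/net/dev column layout: rx errs/drop are
            # fields 2 and 3, tx errs/drop are fields 10 and 11.
            counters[iface.strip()] = {
                "rx_errs": int(fields[2]), "rx_drop": int(fields[3]),
                "tx_errs": int(fields[10]), "tx_drop": int(fields[11]),
            }
    return counters

if __name__ == "__main__":
    for iface, c in read_error_counters().items():
        print(iface, c)
```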

Network Emulators

A network emulator (simulator) can be helpful when testing the intricacies of the large, complex networks that exist today. Wikipedia has a great list of simulators that can serve this purpose well. Many allow recording and replaying production network traffic to observe the behavior of a new circuit under "production"-type loads. Observe the network closely while replaying or generating traffic with the simulator, during transfers, or during regular utilization, and watch the error counts for any increase. Any packet loss within a network you control end to end (i.e. not crossing the internet) should be investigated: anything from an overloaded or faulty switch, to a faulty switch port, to a bad cable can cause errors, and such errors are not normal under any circumstances.

When measuring error rate across the public internet, where you do not control all the hardware along the path, some errors can be expected and must be tolerated; indeed, most equipment and protocols designed to traverse the internet account for this with everything from buffers to error detection and retransmission algorithms. If the measured error rate is consistent and affects your ability to perform a task, options include changing carriers or, where the traffic is important enough, building private circuits between sites; both options can be costly, with costs climbing steeply the farther apart the sites are. If neither is an option, try working with your carrier's support team to improve the situation.

Post Testing

Once testing has been completed and any discovered problems addressed, you should have baseline measurements for bandwidth and throughput, latency, jitter, and the packet loss and/or errors common to the different segments of your network. Monitoring the network for anomalies and large deviations from these established baselines will ensure that your network continues to operate at peak performance. The specifics of configuring monitoring are beyond the scope of this article, but suffice it to say that once you've tuned your alerting thresholds to minimize noise without silencing real issues, you can rest soundly knowing that should any sign of a problem arise, your monitoring system will quickly let you know.


Links & Helpful resources:
https://technet.microsoft.com/en-us/library/cc938654.aspx
https://technet.microsoft.com/en-us/library/cc938652.aspx
http://staffwww.dcs.shef.ac.uk/people/J.Winkler/NetworkPerformanceAnalysis/lectureNotesStudent.pdf
