Introduction
Performance and intermittent connectivity problems are among the most difficult network issues to troubleshoot. Exhaustive coverage of these topics would fill a large book. This technical note takes a troubleshooting-oriented approach to isolating such problems involving Ethernet (10Mbps, 100Mbps, and 1000Mbps) hubs (repeaters), switches, and routers.
This article does not tell you in detail how to isolate the cause of the problem from the entire network down to LAN devices. Rather, it assumes that you have some reason to suspect the LAN devices, or that you hope that the LAN devices can help you isolate the problem. See "Narrowing the geographical scope of the problem" below for some general guidance in this area.
Focus on dropped packets
Network devices rarely forward packets so slowly as to cause severe performance problems. Rather, severe performance problems in LANs usually involve dropped packets, which cause end nodes to time out and retransmit those packets. Each retransmission usually results in a delay on the order of a second or more. Tens or hundreds of such delays result in a network slowdown that end users will notice and complain about.
Similarly, connections will be lost if too many keep-alive packets are dropped.
Note that while the network may be perceived as running slowly, the LAN devices are usually running at full speed. That is, the LAN devices are forwarding packets at a rapid rate. The slowdown comes either from packets dropped because of excessive collisions or simply from the delays caused by the high collision rates that naturally occur on a shared-media network.
One key cause of dropped packets is the design of the network topology itself. For example, if twenty 10Mbps clients all try to send data to a 10Mbps server, all connected via a switch, packets can be dropped through no fault of the network device.
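To make the arithmetic of this example concrete, here is a minimal Python sketch. The function name `oversubscription_ratio` is hypothetical, and the calculation assumes the worst case in which every client transmits at full line rate at the same moment.

```python
def oversubscription_ratio(client_count, client_mbps, uplink_mbps):
    """Worst-case offered load divided by the capacity of the shared link."""
    if uplink_mbps <= 0:
        raise ValueError("uplink_mbps must be positive")
    return (client_count * client_mbps) / uplink_mbps


# Twenty 10Mbps clients all sending to a single 10Mbps server port.
ratio = oversubscription_ratio(client_count=20, client_mbps=10, uplink_mbps=10)
print(f"Worst-case offered load is {ratio:.0f}x the server link capacity")
```

Any ratio well above 1 means the switch must buffer and, eventually, drop the excess traffic, however well the device itself is working.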
Narrowing the geographical scope of the problem
If the performance problem includes one or more WAN links or a firewall, you should first investigate those parts of the network. WANs and firewalls are much more likely to be the source of the problem than your LANs.
When isolating the performance problem within a particular LAN, you should first try to determine whether the problem is limited to a certain portion of the LAN or a certain path through the LAN. You have probably already done this if you suspect the network as the cause. If you have not narrowed the range of the problem, you may be able to do so by timing data operations (for example, file transfers) across different portions of the network.
Another important tool is ping. You should be able to execute several thousand successful (that is, no timeouts) ping commands on a healthy network.
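A repeated ping test of this kind could be scripted roughly as follows. This is only a sketch: it assumes a Linux-style `ping` command (`-c` for the packet count, `-W` for the reply timeout in seconds), and the helper names `ping_once` and `count_ping_timeouts` are hypothetical. The flags differ on other operating systems.

```python
import subprocess


def ping_once(host, timeout_s=1):
    """Send one echo request; return True if a reply arrived in time.

    Assumes a Linux-style ping (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def count_ping_timeouts(host, attempts, probe=ping_once):
    """Run `attempts` probes against `host` and count the failures."""
    return sum(1 for _ in range(attempts) if not probe(host))


# Example (placeholder address): count_ping_timeouts("192.0.2.10", 5000)
```

On a healthy network the returned count should be zero. The `probe` parameter exists only so that the counting logic can be exercised without sending real traffic.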
Finding the drops
Once you suspect a particular network device or a small number of network devices, use network management or the device's Web or console interface to get its error counters (statistics). Then, look for drops.
Drops may be very clearly indicated by counters with names such as: Drops Tx, Drop Rx, Frames Dropped.
Drops may also be indicated more indirectly as a media-specific fault such as the following Ethernet errors: FCS error, CRC error, Alignment Rx, Runt Rx, Short Event, Giant Rx, Too Long Rx, Late Collision Tx, Excessive Collision Tx, Late Events, Excessive Deferrals Tx, Babble error, Loss of Carrier.
When one of these errors occurs, a hub, switch, or routing switch will drop the packet involved. It is the responsibility of the source end node's transport layer (for example, TCP) to re-send the packet.
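Once the counters have been read from a device, picking out the ones that point to drops can be automated. The sketch below is illustrative only: it assumes the counters are already available as a name-to-value dictionary, the function name `drop_indicators` is hypothetical, and the keyword list is derived from the counter names mentioned above. Actual counter names vary by device.

```python
# Keywords taken from the counter names listed in this article.
DROP_KEYWORDS = (
    "drop", "fcs", "crc", "alignment", "runt", "short", "giant",
    "too long", "late", "excessive", "babble", "carrier",
)


def drop_indicators(counters):
    """Return the nonzero counters whose names suggest dropped packets."""
    return {
        name: value
        for name, value in counters.items()
        if value > 0 and any(key in name.lower() for key in DROP_KEYWORDS)
    }


# Hypothetical counter snapshot from one port.
sample = {
    "Rx Packets": 120000,
    "CRC error": 3,
    "Late Collision Tx": 0,
    "Frames Dropped": 17,
}
print(drop_indicators(sample))
```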
How many errors are too many?
Data link errors such as CRC errors, alignment errors, and
runts will occur on healthy networks. How do you distinguish
between a reasonable number of these errors and too many? A
rule of thumb is one error in 5,000. For example, on average,
for every 5,000 packets received you should have no more than
one receive error (CRC, alignment, runt, short, giant, or too
long). And on average, for every 5,000 packets transmitted,
you should have no more than one transmit error (late collision,
excessive collision, late event, excessive deferral, or loss
of carrier). At higher rates of errors, users will probably
perceive the network's performance as being poor.
One data link error in 5,000 does not necessarily indicate
a perfectly-performing network. Rather, it indicates a network
where the errors are probably not causing serious performance
problems that are apparent to the users.
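The rule of thumb can be written as a small check. This is a minimal sketch with a hypothetical function name. It applies the one-in-5,000 ratio to whichever pair of totals you pass in: receive errors against packets received, or transmit errors against packets transmitted.

```python
def exceeds_error_threshold(error_count, packet_count, packets_per_error=5000):
    """True if errors occur more often than one per `packets_per_error` packets."""
    if packet_count <= 0:
        return False  # no traffic, so no meaningful error rate
    # Integer comparison avoids floating-point rounding at the boundary.
    return error_count * packets_per_error > packet_count


# 310 receive errors in 1,200,000 packets is about one error in 3,871.
print(exceeds_error_threshold(310, 1_200_000))   # True
# 200 receive errors in 1,200,000 packets is one error in 6,000.
print(exceeds_error_threshold(200, 1_200_000))   # False
```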
Other link-level indications of bad performance
Ethernet also has some conditions that are normal unless they
happen too often. Collisions, jabbers, and fragments are good
examples. It is normal to have collisions, but they should not
occur in large numbers relative to the total number of transmitted
packets. Large numbers of collisions, jabbers, or fragments
will result in network slowdowns. Unfortunately, it is difficult
to define "too often" or "large numbers."
The device's LEDs or event log may indicate link-level problems
such as auto-partition or lost link. Link loss is normal during
device configuration changes. So, a few losses of link are
acceptable. Many losses of link may indicate faulty wiring,
bad NICs, bad transceivers, or an end node which has been
powered off.
Non-Ethernet links will have their own types of errors. The
Fault Finder capability in Hewlett-Packard ProCurve devices
may already be reporting one of these errors through the devices'
Web interface or event log.
Network device buffer problems
Buffer problems are typically the result of a network topology
which is not suited to the traffic patterns on the network.
For example, using a 10Mbps backbone to interconnect switches
will frequently cause congestion (and buffer problems) on all
but the smallest networks. To resolve this problem, switch-to-switch
and switch-to-server connections should be faster (e.g., 100Mbps)
than the connections to the clients.
A LAN device may indicate a drop through a report of a system-related
problem, such as: Packet Buffer Misses, Message Buffer Misses,
Buffer error, Lack Of Resource error.
These typically represent a dropped packet.
One or two occasional drops will not result in a noticeable
performance problem or a failed connection. But if you find
drops occurring more often than once per minute on a particular
link or cable, you may have isolated the location of the problem.
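Because the counters are cumulative, the drop rate has to be derived from two readings taken some time apart. The following sketch shows one way to do that. The function name is hypothetical, and it assumes the counter was neither reset nor wrapped between the two readings.

```python
def drops_per_minute(earlier_count, later_count, elapsed_seconds):
    """Average drop rate between two readings of a cumulative drop counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if later_count < earlier_count:
        raise ValueError("counter went backwards (reset or wrap)")
    return (later_count - earlier_count) * 60 / elapsed_seconds


# Counter read 40, then 55 five minutes (300 seconds) later.
rate = drops_per_minute(40, 55, 300)
print(f"{rate:.1f} drops per minute")  # 3.0 drops per minute
if rate > 1:
    print("More than one drop per minute: this link is worth a closer look")
```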
Eliminating the dropped packets
Once you have found the location of the dropped packets, you
have isolated the problem and are halfway to resolving it.
This article does not cover finding the root cause and solution.
Generally speaking, your next step is to fix the cause of
the dropped packets. This can involve the network design,
faulty cables, faulty transceivers, faulty NICs (network interface
cards), or configuration problems such as full/half duplex
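mismatches. For example, the following configuration will
cause severe network problems: an end node manually set to a
fixed speed and full duplex, connected to a hub, switch, or
router port that is left to auto-negotiate.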
The hub, switch, or router will correctly sense (not auto-negotiate)
the 10Mbps or 100Mbps speed. Since the end node was configured
for a specific speed and duplex state, and therefore does not
negotiate, the hub, switch, or router will choose the communication
mode specified by the 802.3u standard, namely half-duplex.
With one device running at half-duplex and the device on
the other end of the connection at full-duplex, the connection
will work reasonably well at low levels of traffic. At high
levels of traffic the full-duplex device (end node, in this
case) will experience an abnormally high level of CRC or alignment
errors. The end users usually describe this situation as,
"Performance seems to be approximately 1 Mbps!". Often, end
nodes will drop connections to their servers.
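The fallback behavior described above can be modeled in a few lines to check whether a given pair of port settings will end up mismatched. This is a simplified sketch with hypothetical function names. It assumes that a port with a fixed setting does not negotiate, that an auto-negotiating port facing such a port falls back to half duplex, and that two auto-negotiating ports both support full duplex.

```python
def resolve_duplex(local, peer):
    """Duplex mode that `local` ends up using when connected to `peer`.

    Each argument is "auto", "half", or "full".
    """
    if local != "auto":
        return local   # a fixed setting is used exactly as configured
    if peer == "auto":
        return "full"  # both sides negotiate (assumes both support full duplex)
    return "half"      # peer does not negotiate: fall back to half duplex


def has_duplex_mismatch(side_a, side_b):
    """True if the two ends of the link settle on different duplex modes."""
    return resolve_duplex(side_a, side_b) != resolve_duplex(side_b, side_a)


# Switch port on auto, end node fixed at full duplex: the case described above.
print(has_duplex_mismatch("auto", "full"))  # True
# Both ends on auto.
print(has_duplex_mismatch("auto", "auto"))  # False
```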
For errors reported by the ProCurve Fault Finder, you should
look at the online help that will suggest likely (though not
exhaustive) root causes. Here are some examples.
| Counter | Possible Root Cause |
| Bad CRC or Alignment | Half/full duplex mismatch or faulty driver, NIC or transceiver or faulty cable |
| Giant | Problem driver or NIC |
| Collision | Usually, too much traffic for Ethernet to handle. In rare cases, can be caused by bad cables, NICs, or transceivers |
| Giant or Runt | Faulty NIC, NIC driver, or transceiver |
| Auto-partition | Loop in network or jabber, faulty NIC, NIC driver, transceiver, or cable |
| Frame Dropped, Drop Tx, Drop Rx, Buffer Overflow | High traffic or network design problems |
| Jabber | Bad cable, NIC, or transceiver |
Other information sources
Be sure to refer to the Troubleshooting section of your product manuals as a valuable source of information. Also, please look at the FAQs and white papers on the ProCurve Networking by HP Web site.