Should network hardware be set to "autonegotiate" speeds or fixed speeds?
Solution 1:
I have yet to see a problem with auto-negotiation of network speeds that isn't caused by either (a) a manually configured speed/duplex on one end of the link with auto on the other, or (b) a failing component of the link (cable, port, etc.).
This depends on the admin, but my experience has shown me that if you manually specify link speeds and duplex settings, then you are bound to run into speed or duplex mismatches. Why? Because it is nearly impossible to document the various connections between switches and servers and then follow that documentation when making changes. Most failures I have seen are because of (a), and you only get into that situation when you start manually setting speed/duplex.
As mentioned in the Cisco documentation:
If you disable autonegotiation, it hides link drops and other physical layer problems. Only disable autonegotiation to end-devices, such as older Gigabit NICs that do not support Gigabit autonegotiation. Do not disable autonegotiation between switches unless absolutely required, as physical layer problems can go undetected and result in spanning tree loops.
Unless you are prepared to set up a change management system for network changes that requires verification of speed/duplex (and don't forget flow control), or you are willing to deal with the occasional mismatches that come from manually specifying these settings on all network devices, stick with the default configuration of auto/auto.
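If you do go the change-management route, the "verification" step can be as simple as a host-side spot check. Here's a minimal sketch (assuming a Linux server with ethtool installed; the interface names and expected values are placeholders for whatever your documentation says):

```python
#!/usr/bin/env python3
"""Spot-check negotiated link settings on a Linux host.

Assumes ethtool is installed; the interfaces and expected values below are
placeholders -- fill them in from your own documentation."""
import re
import subprocess

EXPECTED = {
    "eth0": {"Speed": "1000Mb/s", "Duplex": "Full", "Auto-negotiation": "on"},
}

def link_settings(iface):
    """Parse the Speed/Duplex/Auto-negotiation lines from `ethtool <iface>`."""
    out = subprocess.run(["ethtool", iface],
                         capture_output=True, text=True, check=True).stdout
    return dict(re.findall(r"^\s*(Speed|Duplex|Auto-negotiation):\s*(\S+)",
                           out, re.MULTILINE))

for iface, want in EXPECTED.items():
    got = link_settings(iface)
    for key, value in want.items():
        if got.get(key) != value:
            print(f"{iface}: {key} is {got.get(key)!r}, expected {value!r}")
```

Run something like this from cron or your config-management tool and a drift to half duplex shows up as a one-line complaint instead of a mystery slowdown.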
In the future, consider monitoring the errors on the switch ports with MRTG so you can spot these issues before you have a problem.
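If you don't have MRTG graphing those counters yet, a throwaway poller along the same lines is easy to sketch. This one assumes the net-snmp command-line tools (and their standard MIBs) are installed; the switch name, community string, and ifIndex are placeholders for your environment:

```python
#!/usr/bin/env python3
"""Poll a switch port's input-error counter over SNMP -- the same data MRTG
would graph. Host, community string and ifIndex are placeholders."""
import subprocess
import time

SWITCH = "switch1.example.com"   # placeholder
COMMUNITY = "public"             # placeholder read-only community
IF_INDEX = 10                    # placeholder ifIndex of the port to watch

def snmp_counter(oid):
    """Fetch a single counter value with snmpget (-Oqv prints just the value)."""
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", SWITCH, f"{oid}.{IF_INDEX}"],
        capture_output=True, text=True, check=True).stdout
    return int(out.strip())

before = snmp_counter("IF-MIB::ifInErrors")
time.sleep(300)                  # sample interval, like an MRTG 5-minute poll
after = snmp_counter("IF-MIB::ifInErrors")
if after > before:
    print(f"ifIndex {IF_INDEX}: {after - before} new input errors in 5 minutes")
```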
Edit: I do see a lot of people referencing negotiation failures on old equipment. Yes, this was an issue a long time ago, when the standards were being created and not all devices followed them. Are your NICs and switches less than 10 years old? If so, this won't be an issue.
Solution 2:
Autonegotiation problems are very common; I've had numerous issues over the years with various types of hardware.
In my opinion, if the setup is static (e.g. a server rack) and you don't think there will be changes, it is a good idea to set the speeds and duplexes manually, as long as it is well documented so that future problems can be averted.
EDIT:
Just to clarify, I am not advocating manual speeds on your entire network; I would say that 95% of the time auto/auto is the way to go. I'm just saying I've had problems with duplex/speed, and there are small portions of my network (e.g. one of our server racks) that have mostly manual settings. We operate a very tightly controlled LAN, with unused ports shut down and MAC filters on most of the ports, so keeping track of the speeds is not very difficult.
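For what it's worth, if you do hard-set a rack like this, one way to keep "well documented" honest is to drive the settings from the documentation itself. A rough sketch, assuming Linux servers with ethtool and root access; the interface names and values are made up, and the switch side has to be hard-set to match or you'll create exactly the mismatch Solution 1 warns about:

```python
#!/usr/bin/env python3
"""Apply a documented speed/duplex baseline with ethtool, so the documentation
and the applied config live in one place. All values are examples only; the
switch port must be hard-set to the same values. Note that gigabit copper
generally still requires autoneg, so hard-set values are typically 10/100."""
import subprocess

# The documented baseline for this rack (hypothetical values).
DOCUMENTED = {
    "eth0": {"speed": "100", "duplex": "full", "autoneg": "off"},
}

for iface, s in DOCUMENTED.items():
    cmd = ["ethtool", "-s", iface,
           "speed", s["speed"], "duplex", s["duplex"], "autoneg", s["autoneg"]]
    print("applying:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```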
Solution 3:
I believe that if autonegotiation was working for an hour, a day, or a month, and then for some reason "something happens" and setting the link to a fixed speed "fixes it", there is a problem that's not being solved but circumvented. I see setting the link to a fixed speed as a temporary workaround until the real problem gets corrected.
Solution 4:
So, the troubleshooting steps (assume you stop after each one and wait for the issue to reappear):
- Check the logs on the switch to see if it tells you why it's using 100M (a server-side analogue, pulling link messages from the kernel log, is sketched after this list).
- If you're still running it, turn off that extremely evil "Windows load balancing" bullshit that Joel is pushing all the time -- the way it works is by breaking the switch's forwarding cache, forcing it to process every packet in software. Your switch is designed to forward packets in hardware, and has only enough CPU to figure out what physical path an unknown traffic flow has to take (in -> asic -> out) and program the hardware to do it (read: a calculator has a better CPU than your switch, so don't do stupid things that make your switch's CPU work harder). Windows load balancing makes your switch make that decision and reinstall the hardware cache for every packet. That may not fix this particular problem, but it bugs me from the podcasts... sorry.
- Make sure the config matches on both sides -- sounds like you've done that
- Google for autoneg bugs on your switch -- unless you built it yourself, you're not the only one trying to run autoneg on whatever it is you're using
- Replace the cable with rated Cat5e or better -- ideally a cable you know works, like the one your workstation is plugged into. Don't use Cat5 or some crap somebody made; use one with actual molded ends, straight out of a package.
- Move the port -- Put the server on a different port on the same switch
- Change out the NIC -- use a different batch ordered at a different time
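As the server-side analogue of checking the switch logs (the first step above), you can pull the NIC's link up/down messages out of the kernel log; most drivers record the negotiated speed and duplex each time the link comes up. A minimal sketch, assuming a Linux host where dmesg is readable (may need root) and an interface called eth0:

```python
#!/usr/bin/env python3
"""Pull link up/down messages for one interface out of the kernel log.
The interface name is a placeholder."""
import subprocess

IFACE = "eth0"   # placeholder

log = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
for line in log.splitlines():
    # Typical driver messages look like "eth0: Link is Up - 1000 Mbps Full Duplex"
    if IFACE in line and "link" in line.lower():
        print(line)
```

If the link is flapping or renegotiating down to 100M, the timestamps here give you something to line up against the switch's own logs.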
At this point, you've eliminated the configuration, the physical ports you're plugged into, and the cabling between them. If it's still happening, some other causes may be:
- Cable routing -- be careful of EM interference from your AC power cables; route power and data down different sides of the rack.
- Cooling -- Make sure your environmental temp isn't something like 90 degrees and your NIC cards aren't dropping into some kind of "dear god let me just forward this one packet please" mode. I've heard but not seen that Cisco routers stop doing fast-switching and forward packets via CPU when they're overheating, for example.
- Replace the switch with something that doesn't suck -- check how much bandwidth your hosts are talking per second in aggregate, and then look at the rated backplane capacity of your switch (a rough back-of-the-envelope check is sketched below). 7 hosts out of the potential 48 all transmitting 1.0G is enough to stop a Cisco 3750, for example. Also be very careful about the cheapo also-ran network vendors: D-Link, Linksys, Dell, Intel, and HP. Nobody who takes networking seriously uses those guys, and not because "nobody was ever fired for using Cisco", but because "people remember that Intel switch that had 20/48 ports fail over 2 years" or "I used to use ProCurve exclusively and railed about how evil Cisco was, until I actually used Cisco, at which point I stopped buying anything less". Cisco is considered a mid-range network vendor, so what does that tell you about the guys below Cisco...? :-)
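The back-of-the-envelope check mentioned above is just arithmetic. Both numbers below are placeholders, and vendors count their "switching capacity" figures in different ways, so look up the number published for your exact model and stack configuration:

```python
#!/usr/bin/env python3
"""Compare aggregate offered load against a switch's rated capacity.
The per-host rates and the capacity figure are placeholders -- measure your
real traffic and use the vendor's published number for your model."""

host_rates_gbps = [1.0] * 7        # e.g. seven hosts each pushing line-rate gigabit
rated_capacity_gbps = 32.0         # placeholder vendor figure

offered = sum(host_rates_gbps)
print(f"aggregate offered load: {offered:.1f} Gbps vs rated {rated_capacity_gbps} Gbps")
if offered > 0.5 * rated_capacity_gbps:
    print("getting close -- leave headroom for bursts and both traffic directions")
```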
Background/why my answer is the most awesome: I work as a network/systems engineer in the financial industry, and here's my experience with our small-ish global network (15 branch offices, 8 datacenters):
All our LAN ports are autoneg, because we control the equipment on both ends and have some kind of access to both sides---which may be as simple as getting on the phone to someone and having them check settings. In three years, I've only ever had one of our internal ports fail to autonegotiate, and that was because of a bad cable---the problem went away after replacing the cable.
We had way more problems where predecessors had hardcoded 100/full on their NICs and didn't document that fact. We reset everything to auto/auto at the next maintenance window and haven't had any issues since.
In the couple of places where we've got a copper handoff from a carrier for our WAN? You should pretty much expect a copper WAN/Internet connection to suck, all the time---in part because you've got no idea what's on the other side. Some ancient Extreme switch that happens to have buggy firmware for autoneg but does MPLS tagging? Some $5 media converter because your ISP's $200k Ciena edge device is simply too awesome to provide Ethernet over twisted pair? Decide in advance how that's going to be handled and stick to it, then expect some twit inside the carrier to change it at 10pm on a Saturday because the agreed-upon config was never documented and they have some policy to follow.
Seriously, though, get a fiber handoff from your ISP.
Solution 5:
The network that I'm responsible for (along with a few other guys) is made up of ~40 servers, 1000+ workstations (spread across a rather large campus), and ~1000 WAPs, also spread across a large area, with varying types and ages of network equipment.
As dimitri.p said, when something suddenly stops autonegotiating properly, it's usually an indication of another problem. Setting the port manually is akin to putting a bandaid on someone who got stabbed in the gut - it might stop the bleeding, but there's sure to be damage underneath.
My usual checklist:
- did anything change on the machine? drivers? OS- or BIOS-level settings? Perhaps autoneg was disabled in the OS?
- have you swapped out the patch cables and verified the cable runs (if it's a longer run than one rack)?
- have you tested to see if the switch port is bad or failing?
- could the NIC be going bad? (a quick counter-sampling sketch follows this list)
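One quick way to work on the last two questions is to sample the NIC's error counters twice and see what moved. A minimal sketch, assuming a Linux host; the interface name and sample interval are placeholders:

```python
#!/usr/bin/env python3
"""Sample a NIC's error counters twice and report anything that increased --
a quick way to tell a quietly failing cable/port/NIC from a one-off event."""
import time
from pathlib import Path

IFACE = "eth0"          # placeholder
INTERVAL = 60           # seconds between samples
COUNTERS = ["rx_errors", "tx_errors", "rx_crc_errors", "collisions"]

def read_counters():
    stats = Path(f"/sys/class/net/{IFACE}/statistics")
    return {c: int((stats / c).read_text()) for c in COUNTERS}

before = read_counters()
time.sleep(INTERVAL)
after = read_counters()
for name in COUNTERS:
    delta = after[name] - before[name]
    if delta:
        # Rising CRC errors tend to point at cabling or the port; collisions on
        # a supposedly full-duplex link point at a duplex mismatch.
        print(f"{name} increased by {delta} in {INTERVAL}s")
```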
As a rule, we never disable autoneg on servers (or anything else in the data center) unless all other possible causes have been eliminated -- we've moved switch ports, changed cables, tested the NIC, etc. -- and there's no other choice. In that case, it gets documented to death. This happens very rarely, and usually with appliances where we can't get access to check BIOS and OS settings.
The workstations and APs, on the other hand, are a different story. Failed autoneg is a classic sign of a bad cable run, and many times we have to manually set speed and duplex until the summer running-new-cables-in-the-walls season comes around.