I’ve had an interesting few months doing WAN circuit turn-ups for a new Data Centre. I dealt with three major carriers, and each experience was worse than the last. I’m not sure why I held such high expectations, but I was surprised by their hopeless inefficiency in delivering what should have been a standard product. In this post I’ll examine the problems I saw and their root causes.
In all three situations, a 1Gbps Layer-2 Ethernet circuit was ordered with a copper Ethernet handoff from a rack-installed NID/NTU/whatever-you-call-it-yourself. Let’s look at the five issues I hit whilst troubleshooting.
Link up at both ends – No CDP received
There was a lot of blaming the end-customer on this one (“Are you sure that CDP is enabled?”) and a huge amount of frustration. The carrier would send an email to confirm that ‘they had tested’, provide no actionable details of their troubleshooting, then close the ticket. This went on for days, bouncing between the annoyingly named ‘deliver’ and ‘assure’ teams. The ‘deliver’ team felt they had delivered the circuit and the ‘assure’ team assured us that the circuit wasn’t live and they couldn’t help us.
We kicked up an almighty stink and got an excellent engineer on a conference call who was able to solve the problem within hours. The problem was that one end of their circuit had a switch between their MPLS PE and their NID, and the other didn’t. The provisioned circuit assumed no switch in the path, so the carrier switch added a transport VLAN tag at ingress that the egress customer router didn’t understand. Similarly, the carrier switch saw my VLAN tags as an unknown transport tag and rejected my frames.
Diagnosis: Circuit not provisioned to expect dot1q on one end of circuit.
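To make the tagging mismatch a little more concrete, here is a minimal Python sketch of what an extra 802.1Q tag does to a frame. It is only an illustration: the helper names, MAC addresses and VLAN ID 100 are made up for the example, and it says nothing about how the carrier's equipment is actually implemented. The point is simply that a device which isn't expecting a tag reads 0x8100 where it expected the real EtherType.

```python
import struct

TPID_DOT1Q = 0x8100      # 802.1Q Tag Protocol Identifier
ETHERTYPE_IPV4 = 0x0800  # EtherType for IPv4


def build_frame(dst, src, ethertype, payload, vlan_id=None):
    """Build an Ethernet frame (without FCS), optionally with one 802.1Q tag."""
    header = dst + src
    if vlan_id is not None:
        # Tag = TPID (16 bits) + TCI (3 bits priority, 1 bit DEI, 12 bits VLAN ID).
        # Priority and DEI are left at zero in this sketch.
        header += struct.pack("!HH", TPID_DOT1Q, vlan_id & 0x0FFF)
    header += struct.pack("!H", ethertype)
    return header + payload


def type_field_as_untagged(frame):
    """Read bytes 12-13 the way a device expecting untagged frames would."""
    return struct.unpack("!H", frame[12:14])[0]


def parse_dot1q(frame):
    """Return (vlan_id, ethertype); vlan_id is None for an untagged frame."""
    first = struct.unpack("!H", frame[12:14])[0]
    if first == TPID_DOT1Q:
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        return tci & 0x0FFF, ethertype
    return None, first


# Hypothetical addresses and VLAN ID, purely for illustration.
dst = b"\xff" * 6
src = bytes.fromhex("020000000001")
untagged = build_frame(dst, src, ETHERTYPE_IPV4, b"\x00" * 46)
tagged = build_frame(dst, src, ETHERTYPE_IPV4, b"\x00" * 46, vlan_id=100)

print(hex(type_field_as_untagged(untagged)))
print(hex(type_field_as_untagged(tagged)))
print(parse_dot1q(tagged))
```

The tagged frame is four bytes longer, and only a device that knows to look for the tag recovers the VLAN ID and the original EtherType.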
Link stays down after flap
In this case the carrier handed off a circuit that was passing CDP frames. When I shut/un-shut the port on my switch the corresponding NID port stayed down until the carrier shut and un-shut the NID port. I’m suspicious of a faulty hardware diagnosis generally, but in this case a NID swap fixed it. I later found out that the carrier manually stages the configuration on the NID before shipping to site, so it may well have been a provisioning issue… we’ll never know.
Diagnosis: Faulty NID
1Gbps Link negotiating to 100Mbps
I also saw issues negotiating the correct speed/duplex settings. In this case I couldn’t get beyond 100/full when auto-negotiating on a 1G port. After a lot of ‘try all the options’ troubleshooting, it turned out that two of the four pairs in the Ethernet cable were dead, which allowed a maximum auto-negotiated rate of 100Mbps. I can’t blame the carrier for everything; I should have caught this one sooner.
Diagnosis: Damaged Cat5e cable prevented 1Gbps, but working twisted pairs allowed auto-negotiation to 100Mbps full.
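The reason a half-dead cable still links at 100Mbps is that 10BASE-T and 100BASE-TX only use the pairs on pins 1-2 and 3-6, whereas 1000BASE-T needs all four pairs. A rough Python sketch of that logic follows. It is a simplification rather than a model of any real PHY, and it assumes the two dead pairs were 4-5 and 7-8, which the story above doesn't actually confirm.

```python
# Twisted pairs identified by their RJ45 pin numbers.
ALL_PAIRS = {(1, 2), (3, 6), (4, 5), (7, 8)}

# Pairs each copper Ethernet speed (in Mbps) needs in order to link.
REQUIRED_PAIRS = {
    10: {(1, 2), (3, 6)},    # 10BASE-T
    100: {(1, 2), (3, 6)},   # 100BASE-TX
    1000: ALL_PAIRS,         # 1000BASE-T uses all four pairs
}


def best_speed(working_pairs):
    """Return the fastest speed whose required pairs all work, or None."""
    working = set(working_pairs)
    usable = [speed for speed, needed in REQUIRED_PAIRS.items() if needed <= working]
    return max(usable) if usable else None


# Assumption for illustration: pairs 4-5 and 7-8 are the dead ones.
damaged_cable = ALL_PAIRS - {(4, 5), (7, 8)}

print(best_speed(ALL_PAIRS))
print(best_speed(damaged_cable))
```

A cable tester that checks all four pairs would have shown this in minutes, which is why it's worth running one before opening a ticket.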
CDP frames passing, but I couldn’t ping
This one stumped me for a while. I was trunking two VLANs across the circuit, and trying to ping from SVI to SVI on 3750Xs. No dice. Once I converted my port to a layer-3 port I could ping without issues.
After checking that my SVI and VLAN configuration was okay, I had a hunch and re-added the VLAN/SVI configuration, but this time configured one VLAN as the native VLAN. Praise the god of ping… it worked for the native VLAN.
Diagnosis: The carrier hadn’t configured the link to accept dot1q tagged frames from the customer.
Two days later I received reports of network backups failing, but I couldn’t find an obvious network issue. As soon as I heard a report from another user who could only fetch small files, I immediately knew what the issue was. A quick ping with a packet size of 1500 bytes showed full loss, but a ping of 1496 bytes worked just fine.
Diagnosis: MTU issues caused by carrier not raising MTU to cater for VLAN tag.
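The 1500 versus 1496 result falls straight out of the frame-size arithmetic. The sketch below assumes the carrier was enforcing the classic 1518-byte maximum for untagged Ethernet frames; that limit is my inference from the symptoms, not something the carrier confirmed, and the function names are invented for the example.

```python
ETH_HEADER = 14  # destination MAC + source MAC + EtherType
DOT1Q_TAG = 4    # one 802.1Q VLAN tag
FCS = 4          # frame check sequence

# Assumption: the carrier only accepts the classic untagged maximum frame size.
CARRIER_MAX_FRAME = 1518


def frame_size(ip_packet_bytes, tagged):
    """On-the-wire frame size for an IP packet of the given length."""
    tag = DOT1Q_TAG if tagged else 0
    return ETH_HEADER + tag + ip_packet_bytes + FCS


def passes(ip_packet_bytes, tagged, max_frame=CARRIER_MAX_FRAME):
    """True if the resulting frame fits within the carrier's limit."""
    return frame_size(ip_packet_bytes, tagged) <= max_frame


def largest_ip_packet(tagged, max_frame=CARRIER_MAX_FRAME):
    """Largest IP packet that still fits within the carrier's limit."""
    tag = DOT1Q_TAG if tagged else 0
    return max_frame - ETH_HEADER - tag - FCS


print(frame_size(1500, tagged=True))   # 14 + 4 + 1500 + 4
print(passes(1500, tagged=True))
print(passes(1496, tagged=True))
print(largest_ip_packet(tagged=True))
```

With one VLAN tag, a full 1500-byte packet becomes a 1522-byte frame, four bytes over the assumed limit, while 1496 bytes fits exactly. That matches the ping results above.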
These problems can’t be new or unique to me. I’d love to see more folks publish their failures so that we could avoid the situation where we all solve the same set of problems in parallel. While I wait for nirvana, I hope these examples of failure help speed you through the circuit acceptance process and on to more meaningful work.