One of my readers encountered an interesting problem when upgrading a data center fabric to 100 Gbps leaf-to-spine links:
- They installed new fiber cables and SFPs;
- Everything looked great… until someone started complaining about application performance problems;
- Nothing else had changed, so the culprit must have been the network upgrade;
- A closer look at the monitoring data revealed CRC errors on every leaf switch. Obviously, something was badly wrong with the whole batch of SFPs.
Fortunately, my reader took a closer look at the data before requesting a wholesale replacement… and spotted an interesting pattern:
- All leaf switches apart from one had input CRC errors on one of the uplinks but not on the other. What if there was a problem with the spine switch?
- The spine switch on the other side of the “faulty” uplinks had input CRC errors on the link toward the one leaf switch with no CRC errors, and output CRC errors on all the other links (a sketch of how one could correlate those counters follows this list).
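This kind of counter correlation is easy to automate. Here's a minimal Python sketch, simplified to a single spine and a single uplink per leaf; the switch names, port names, counter values, and cabling map are all made up for illustration. The underlying heuristic is explained in the rest of this post: receive CRC errors mirrored by transmit CRC errors on the peer port are propagated corruption, while receive errors facing a clean peer point at the cable itself.

```python
# Hypothetical counter snapshot: per switch, per port, (rx_crc, tx_crc) errors.
# All names and numbers are invented to mirror the pattern described above.
counters = {
    "leaf1":  {"uplink1": (0, 0)},        # the one leaf with no CRC errors
    "leaf2":  {"uplink1": (4200, 0)},
    "leaf3":  {"uplink1": (3900, 0)},
    "spine1": {"eth1": (8100, 0),         # input errors on one link...
               "eth2": (0, 4200),         # ...output errors everywhere else
               "eth3": (0, 3900)},
}

# Cabling map: (switch, port) -> (switch, port) on the other end of the link
peer = {
    ("leaf1", "uplink1"): ("spine1", "eth1"),
    ("leaf2", "uplink1"): ("spine1", "eth2"),
    ("leaf3", "uplink1"): ("spine1", "eth3"),
}
peer.update({far: near for near, far in peer.items()})  # make it bidirectional

# Receive errors matched by transmit (stomped) errors on the peer port are
# propagated corruption; receive errors with a clean peer point at the cable.
for (sw, port), (peer_sw, peer_port) in peer.items():
    rx_errors = counters[sw][port][0]
    peer_tx_errors = counters[peer_sw][peer_port][1]
    if rx_errors and not peer_tx_errors:
        print(f"suspect cable: {sw}:{port} <-> {peer_sw}:{peer_port}")
```

Run against this toy snapshot, the loop flags exactly one link: the one between spine1 and the error-free leaf switch.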
Anyone who knows Ethernet fundamentals should ask a naive question at this point: “Isn’t the CRC checked when packets are received? How could you get a CRC error on the transmit side?” Fortunately, my reader also understood how cut-through switching works and quickly identified what was really going on:
- The one leaf switch uplink with no CRC errors had a bad cable that corrupted frames in the leaf-to-spine direction, resulting in input CRC errors on the spine switch but none in the opposite direction;
- The spine switch was configured for cut-through switching, so it propagated the corrupted packets and stomped their CRC on the egress ports to prevent downstream devices from accepting them… incrementing its transmit CRC error counters at the same time (see the sketch after this list);
- Downstream leaf switches received the corrupted packets originally sent over the faulty link and incremented their receive CRC error counters.
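And here's the promised sketch of that mechanism: a toy model of FCS checking and CRC stomping in Python. It uses zlib.crc32 as a stand-in for the Ethernet FCS calculation and assumes the stomp inverts the CRC computed over the frame as forwarded (exact conventions differ between ASICs); it's an illustration, not a vendor implementation.

```python
import zlib

def fcs(frame: bytes) -> int:
    # Ethernet's frame check sequence is a CRC-32; zlib.crc32 stands in here.
    return zlib.crc32(frame) & 0xFFFFFFFF

def stomp(frame: bytes) -> int:
    # Assumed stomping convention: invert the CRC computed over the frame
    # as forwarded, guaranteeing every downstream receiver sees a mismatch.
    return fcs(frame) ^ 0xFFFFFFFF

original = bytes(64)                    # toy frame contents
carried_fcs = fcs(original)             # FCS appended by the sender

corrupted = b"\x01" + original[1:]      # a single bit flipped by the bad cable

# Spine ingress: the CRC check fails -> input CRC error counter increments
assert fcs(corrupted) != carried_fcs

# A cut-through spine is already transmitting the frame when it discovers
# the error; all it can do is stomp the FCS on egress -> output CRC error
stomped_fcs = stomp(corrupted)

# Downstream leaf ingress: the stomped FCS can never match the frame ->
# input CRC error counters increment on the downstream leaf switches
assert fcs(corrupted) != stomped_fcs
print("corrupted frame propagated, stomped, and rejected downstream")
```

Stomping (instead of silently recomputing a fresh FCS) is the only sane choice: the frame contents are already damaged, and a recalculated FCS would make downstream devices accept corrupted data.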
They replaced the faulty cable, the errors disappeared, and life was good. As for the original performance problem… maybe it wasn’t the network ;)