Network Troubleshooting Guidelines

It all started with an interesting discussion of weird MLAG bugs during our last Building Next-Generation Data Center online course. The discussion almost devolved into “when in doubt, reload” yammering when Mark Horsfield stepped in, saying “while that may be true, make sure to check and collect these things before reloading.”

I loved what he wrote so much that I asked him to turn it into a blog post… and he made it even better by expanding it into generic network troubleshooting guidelines. Enjoy!


This is a guest blog post by Mark Horsfield, CCIE#52702, Technical Support Engineer at Cumulus Networks.

Tech Debt is real

Operations teams are tasked with supporting production environments that comprise a wide array of technologies, old and new. Technical debt is real, and it applies to all of us.

A lot of moving parts means plenty of opportunities for things to stop working (Murphy’s Law: "Anything that can go wrong will go wrong").

So what can we do?

Approach a problem with an open mind, distill the “signal from the noise”, and test your hypothesis. A well-defined problem description is essential to moving toward a final resolution.

When a sports team wins the championship (pick your favorite sport), it is not by accident. Behind every team is a coach who leads by experience, with a game plan. Plans are built on communication between team members.

Troubleshooting is a lot like storytelling: it is rewarding when you finally unravel the tangled mess. Start by outlining your observations in a list; there will be gaps between the starting point (current state) and the ending point (final resolution).

Define the Problem

A problem definition begins with a comparative analysis, using what, when, where, and to what extent.

What is not working

  • What is the observed behavior?
  • How is this different from the expected behavior?

When does it happen

  • When was the first occurrence? Are there multiple occurrences?
  • Any pattern? Any clear trigger event?
  • Skim log files for the same time period (see the sketch after this list).
  • Does it happen continuously or intermittently?
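
To make that log skim concrete, here is a minimal Python sketch that pulls syslog-style lines falling inside the incident window. The log path, timestamp format, and window boundaries are assumptions; adjust them for your platform.

```python
from datetime import datetime

# Hypothetical incident window and log location; adjust for your environment.
LOG_FILE = "/var/log/syslog"
WINDOW_START = datetime(2024, 5, 14, 9, 55)
WINDOW_END = datetime(2024, 5, 14, 10, 15)

with open(LOG_FILE) as logfile:
    for line in logfile:
        # Classic syslog prefix: "May 14 10:02:17 hostname daemon[pid]: message"
        try:
            stamp = datetime.strptime(line[:15], "%b %d %H:%M:%S").replace(
                year=WINDOW_START.year
            )
        except ValueError:
            continue  # skip lines without a leading timestamp
        if WINDOW_START <= stamp <= WINDOW_END:
            print(line.rstrip())
```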

How is the problem cleared

  • Is some action taken by a user?
  • Does it clear after some time?
  • Does some impacting event (a crash or reboot) reset the environment?

Where in the life-cycle

  • Where in the object life-cycle does the problem-state show itself? 
  • Day-1 issues are usually misconfigurations, design problems, or bugs (hardware- or software-related).
  • Day-2 issues are usually seen after some “change” has taken place. This might be a link flap, or a memory leak over a long time.

The system might have been deployed without ever measuring its performance (especially in failure scenarios).

  • Should it work as-is, as described?
  • Is there evidence to support the advertised performance numbers?

Marketing is well known for omitting subtle but important details. You might find that a hardware or software limitation exists (which might or might not have a workaround).

Extent

How many objects show the problem-state, out of how many total objects?

  • Are there objects that _could_ show the problem but do not show it right now? Compare objects that are working against those in the problem-state.
  • Certain features will influence the forwarding pipeline that a packet would follow through a network device.

How many occurrences of the problem were seen on each object?

  • A link that is flapping would usually show a similar number of up/down transitions on each side (see the sketch after this list).
  • An interface configured with a sub-optimal MTU might cause fragmentation in a single direction, especially if there are two exit nodes on the network (traffic could follow an asymmetric forwarding path in/out of the network).
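
As a quick illustration of the flap-count comparison, here is a minimal sketch that reads the Linux carrier-transition counter for a couple of interfaces. The interface names are assumptions; run the same check on both ends of the link and compare the numbers.

```python
from pathlib import Path

# Hypothetical interface names; substitute the ports on each end of the link.
INTERFACES = ["swp1", "swp2"]

for ifname in INTERFACES:
    counter = Path(f"/sys/class/net/{ifname}/carrier_changes")
    if counter.exists():
        # A healthy point-to-point link should show a similar count on each side.
        print(f"{ifname}: {counter.read_text().strip()} carrier transitions")
    else:
        print(f"{ifname}: interface not found")
```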

Narrow the Scope

Determine the appropriate method to isolate the problem to a direction, a single object, and then a singular component (hardware or software related). 

Split the Difference

Imagine you are troubleshooting a connectivity problem with VM hosts that reside in a VXLAN segment:

  • VXLAN is a network virtualization overlay technology consisting of an underlay network (between the ingress and egress VTEP devices) and an overlay network (the VM hosts at the outer edge of the network).
  • In this type of situation, I start by verifying underlay reachability (see the sketch after this list). If that fails, it would be a waste of time to investigate the overlay network.
  • If the underlay network is working, move on to the overlay network.
  • Verify connectivity between each VTEP and its locally connected host.
  • In which direction is the packet loss happening: the forward path or the return path? Look at interface counters, ACL counters, and aggregate traffic statistics; tcpdump and ERSPAN can help isolate the direction of packet loss.
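
Here is a minimal sketch of that underlay-first workflow: ping the remote VTEP loopback(s) over the underlay, and only if that succeeds, test the overlay hosts. The addresses are made-up examples, and plain ICMP reachability is only a first-pass check.

```python
import subprocess

# Hypothetical addresses; replace with your VTEP loopbacks and VM host IPs.
UNDERLAY_TARGETS = ["10.0.0.2"]      # remote VTEP loopback(s)
OVERLAY_TARGETS = ["192.168.10.20"]  # VM host(s) in the VXLAN segment

def reachable(address: str) -> bool:
    """Send three ICMP echo requests and report overall success."""
    result = subprocess.run(
        ["ping", "-c", "3", "-W", "1", address],
        capture_output=True,
    )
    return result.returncode == 0

if all(reachable(addr) for addr in UNDERLAY_TARGETS):
    print("Underlay OK, moving on to the overlay")
    for addr in OVERLAY_TARGETS:
        print(f"overlay {addr}: {'OK' if reachable(addr) else 'FAIL'}")
else:
    print("Underlay is broken; no point debugging the overlay yet")
```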

Is the problem specific to data-plane traffic, or to control-plane or management-plane traffic?

  • To help determine whether hardware is mis-programmed, you can insert special flags (the record-route option) into an ICMP packet to force each router along the path to punt the packet to its CPU. Not all network vendors act on it (punt to the CPU), though; see the sketch after this list.
  • Is the problem specific to a type of traffic? IPv4 or IPv6? Unicast or BUM traffic? TCP, UDP or ICMP traffic?
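
The record-route probe could look something like this sketch using scapy (run it with root privileges). The target address is a made-up example, and as noted above, not every device along the path will fill in or honor the option.

```python
from scapy.all import IP, ICMP, IPOption_RR, sr1

TARGET = "192.0.2.1"  # hypothetical destination

# ICMP echo request carrying a record-route option with nine empty slots.
probe = IP(dst=TARGET, options=[IPOption_RR(routers=["0.0.0.0"] * 9)]) / ICMP()
reply = sr1(probe, timeout=2, verbose=False)

if reply is None:
    print("No reply: the probe was dropped or the target is unreachable")
else:
    # Print whatever IP options (recorded hops) came back in the echo reply.
    for option in reply[IP].options:
        print(option.summary())
```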

Bottom Up

Physical layer issues fall into this category. Imagine a link is fully inserted but fails to pass traffic:

  • Check that the port is not administratively disabled (yes, we overlook the easy answers).
  • What is the hardware state? The switch ASIC must first recognize and program the link.
  • If the ASIC has correctly programmed the type of the link (and recognized the transceiver), then what is the software state?
  • Assuming optical fiber is in play, what are the light levels at each termination point? The transmit signal on one side corresponds to the receive signal at the remote end. If the signal is too weak, is there a patch panel or any intermediate transponder equipment? Check the signal at each point where the cable is terminated (see the sketch after this list).
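
On a Linux-based switch, a quick way to check those light levels is to read the transceiver diagnostics with "ethtool -m", as in the sketch below. The interface name is an assumption, and the exact field names vary between modules, so this simply prints every line that mentions power.

```python
import subprocess

INTERFACE = "swp1"  # hypothetical interface name

# Dump the transceiver EEPROM / DOM data and keep the optical power readings.
output = subprocess.run(
    ["ethtool", "-m", INTERFACE],
    capture_output=True, text=True, check=False,
).stdout

for line in output.splitlines():
    if "power" in line.lower():
        print(line.strip())
```

Compare the transmit power reported on one end against the receive power on the other; a large gap points at dirty connectors, a bad patch, or a failing optic.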

Top Down

Performance issues fall into this category. 

Use packet captures to help tell the story of how the system is actually working.

Often we make assumptions, which are sometimes false:

  • The environment changed since the last release.
  • A new variable was introduced into the environment.
  • We’re operating on information that was not carefully validated.

Use a traffic generator if necessary; iperf, nuttcp, and mz are just some of the open-source tools. Be careful: some of them are better suited to particular traffic characteristics. Get involved in the community and help make the tools better.
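
As an example of putting one of these tools to work, a short iperf3 run can be wrapped in a few lines of Python to record the measured throughput. This is a minimal sketch; it assumes iperf3 is installed and a server is already listening on the (made-up) target address, started with "iperf3 -s".

```python
import json
import subprocess

SERVER = "192.0.2.10"  # hypothetical iperf3 server address

# Run a five-second TCP test and ask iperf3 for JSON output.
result = subprocess.run(
    ["iperf3", "-c", SERVER, "-t", "5", "-J"],
    capture_output=True, text=True, check=False,
)
report = json.loads(result.stdout)

sent = report["end"]["sum_sent"]["bits_per_second"] / 1e9
received = report["end"]["sum_received"]["bits_per_second"] / 1e9
print(f"sent: {sent:.2f} Gbit/s, received: {received:.2f} Gbit/s")
```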

OODA Loop

I found the OODA loop to be insightful; Russ White wrote a series of blog posts about it. Go and read them all.

Focus on what you can see

From a comment on a blog post Matt Schmitz wrote on Tips for Working with Vendor Support:

Focus on data points that rely on the presence of something rather than the absence of something, because there are usually multiple reasons why something might be missing. If something is missing, you can usually look at the same problem from a different viewpoint to find something present that is out of place or not right, and triage from there.

Verify the Hypothesis

Do not skip the verification process — you are here because there is a complex problem in front of you. Problems have a tendency to return if you do not reveal the underlying cause. 

A hypothesis is similar to storytelling, where you expect to find a headline and supporting evidence. 

  • At each step of the process, document the steps taken and the outcome.
  • Keep asking yourself, “does this finding / deviation explain the problem-state?” If not, then rule it out as a possible cause. Move to the next item on the list. 
  • Assumptions can be dangerous if not verified. Multicast traffic may be treated differently than unicast traffic at certain points in the forwarding path. 
  • Start by testing the most probable cause. If that would require a considerable amount of resources (time or money), such as sending a field engineer to a remote unmanned site, then try to eliminate the low-hanging fruit first (something that can be tested quickly to further bolster your hypothesis or rule it out).
  • Keep an open mind when you approach a problem — be willing to broaden your search.

Environment

It’s best to troubleshoot the actual problem-state in a live environment if at all possible. However:

  • Sometimes it is not possible to leave the system in the problem-state for a long time, and it must be recovered to a normal, working state. 
  • Hopefully you gleaned enough data points from the problem-state to attempt a lab re-create. 
  • Often you do not need a scaled setup, and it can be reduced to a small number of devices (physical or virtualized lab environment). 
  • Traffic generators can introduce new problems. Testing with a unidirectional traffic flow is a different situation from most production traffic flows, which are bidirectional.

The Power of a Team

By working closely with others to solve a problem, all of us benefit in many ways. Your teammates can

  • spot gaps or flaws in your story;
  • show you a new way to approach a problem (saving you time);
  • improve the fix to make it more efficient.

Successful people are good communicators. Surround yourself with people who exhibit the characteristics you wish to learn.

3 comments:

  1. Some challenges from my experience sparked some thoughts:

    - what is the network baseline: which "anomalies" are "normal" in the network, so that you are not misled by other issues unrelated to the main problem

    - coincidence of a few issues at the same time, which makes the problem visible only once per week/month/year (notice: issue vs. problem)

    - limitations of black-box troubleshooting (places where the vendor's debug commands fail to report, e.g., packet drops, or where enabling a debug command affects the timing so the problem is no longer seen)

    - the problem of reproducibility in the vendor's lab

    - etc, etc

    I wonder what troubleshooting approach is used by people who have been troubleshooting complex systems for more than 15-25 years.

    Do they always use this model for non-trivial cases?



  2. Reminds me of my generic Troubleshooting 101 https://bsdosx.blogspot.com/2009/11/troubleshooting-101.html and 10 laws of Networking https://bsdosx.blogspot.com/2009/02/10-laws-of-networking-partial.html, which are essentially about seeking truth and partitioning fault domains recursively to hone in on variables... observability and telemetry + logs are a prerequisite, of course!
  3. All great points that need to be supported by a solid monitoring, alerting and incident management framework.