OSPF Neighbors Stuck in EXSTART

This problem is rare but tantalizing enough to warrant mentioning: OSPF neighbors are forever stuck in the EXSTART state (occasionally going DOWN and back to EXSTART).

I stumbled across it by accident in my lab and, luckily, had seen it before, so I knew immediately what it was.

The moment you start suspecting that something might be wrong with the OSPF adjacencies and use the debug ip ospf adj command, the problem becomes obvious. The Database Description (DBD) packet contains an Interface MTU field; if the value received from the neighbor is higher than the IP MTU configured on the inbound interface, the DBD packet is rejected (section 10.6 of RFC 2328). The router with the lower MTU complains that “Nbr x.x.x.x has larger interface MTU”; the other router moans about protocol violations (“First DBD and we are not SLAVE”).
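The check itself is simple enough to sketch in a few lines. The Python below is an illustrative model only, not router code; the function name `accepts_dbd` and the sample MTU values (1500 and 1514) are assumptions made for the example.

```python
def accepts_dbd(local_ip_mtu: int, neighbor_interface_mtu: int) -> bool:
    """Model of the RFC 2328 section 10.6 check on a received DBD packet.

    The packet is rejected when the Interface MTU advertised by the
    neighbor is larger than the IP MTU of the receiving interface.
    """
    return neighbor_interface_mtu <= local_ip_mtu


# Hypothetical pair of routers with mismatched MTUs.
r1_ip_mtu = 1500
r2_ip_mtu = 1514

r1_accepts = accepts_dbd(r1_ip_mtu, r2_ip_mtu)  # R1 sees a larger MTU
r2_accepts = accepts_dbd(r2_ip_mtu, r1_ip_mtu)  # R2 sees a smaller MTU

print(f"R1 accepts R2's DBD: {r1_accepts}")
print(f"R2 accepts R1's DBD: {r2_accepts}")
```

Only the router with the lower MTU rejects the packet, which is consistent with the asymmetric debug messages: one side complains about the MTU, the other about protocol violations.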

As always, there are two ways to solve this problem:

  • The correct one: fix the MTU issues;
  • The other one: disable MTU checks with the ip ospf mtu-ignore interface configuration command (which might be OK if the hardware can receive oversized packets and the router is not using fixed-size input buffers).
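Why the second option is only sometimes safe can be illustrated with a small model. This is a sketch under stated assumptions, not a description of IOS internals: `dbd_check_passes` and `can_receive` are hypothetical helpers, and `max_receivable` stands in for whatever the hardware and input buffers can actually accept.

```python
def dbd_check_passes(local_ip_mtu: int, neighbor_mtu: int,
                     mtu_ignore: bool = False) -> bool:
    """MTU check on a received DBD packet; mtu_ignore skips it."""
    return mtu_ignore or neighbor_mtu <= local_ip_mtu


def can_receive(packet_size: int, max_receivable: int) -> bool:
    """Whether the receiving side can physically take the packet."""
    return packet_size <= max_receivable


# Assumed example values: local IP MTU 1500, neighbor advertises 1514.
local_ip_mtu, neighbor_mtu = 1500, 1514

# Skipping the check lets the adjacency negotiation proceed...
check_ok = dbd_check_passes(local_ip_mtu, neighbor_mtu, mtu_ignore=True)

# ...but a full-sized packet from the neighbor still has to fit.
fits_large_buffers = can_receive(neighbor_mtu, max_receivable=1600)
fits_fixed_buffers = can_receive(neighbor_mtu, max_receivable=1500)

print(check_ok, fits_large_buffers, fits_fixed_buffers)
```

In the fixed-buffer case the check is bypassed but the neighbor's largest packets would still be lost, which is why fixing the MTU remains the cleaner option.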

16 comments:

  1. I have got an interesting one.

    A few years ago I got called to troubleshoot an OSPF EXSTART problem. Both routers were connected over an international Frame Relay PVC. Both sides initially had an MTU of 1500 bytes set on their interfaces, but OSPF got stuck in EXSTART. I knew about the OSPF MTU mismatch issue back then, but this didn't seem to be it because the MTU matched on both ends. However, since it was an international Frame Relay PVC, I asked how the PVC was built. It actually went through three providers, and the provider in the middle had the PVC MTU set to 1100 bytes for some reason; that was the culprit. The fix, as it turned out, was to lower the interface IP MTU on the customer routers (IP MTU = 1024), because the OSPF mtu-ignore option didn't solve it (the middle Frame Relay provider dropped the oversized frames at layer 2). It was an unusual problem, so I wanted to pass it along. Nowadays Frame Relay is going away, so we may never encounter a problem like this again.
    Replies
    1. Thanks for sharing this.
  2. The place I've seen this several times is when running OSPF between an SVI on a 3550 switch and a router, which have different default MTUs.
  3. What is the best way (not "ip ospf mtu-ignore") to resolve an MTU mismatch between a 3550 SVI and a router's physical or BVI interface?

    Without affecting other switch ports?

    I know about "system mtu routing ..." on 3550, but it is system-wide.

    Consider that the router has a BVI interface (which also produces a different MTU) and the switch has an SVI.

    Router:

    bridge 1 protocol ieee
    bridge 1 route ip
    bridge irb

    interface GigabitEthernet0/0
    description trunk to 3750
    no ip address
    !
    interface GigabitEthernet0/0.1
    encapsulation dot1Q 100
    bridge-group 1

    interface BVI1
    ip address 10.1.1.2 255.255.255.0

    router ospf 1
    network 10.1.1.0 0.0.0.255 area 0

    BVI1 is up, line protocol is up
    MTU is 1514 bytes
  4. According to this discussion, you can only set system-wide MTU on 3550, not per interface.

    Once I get my hands on a Catalyst switch (and have time to spare), I'll run a few tests.
  5. Thank you.
    So should I set "system mtu routing 1514" on the 3750 to match the bvi's mtu and forget about it?

    Any negative consequences?

    What about other routers on the same L2 segment with regular routed interfaces? They currently have "ip ospf mtu-ignore" :)

    The bvi interface would not take mtu settings.

    Thanks,
    Vladimir
  6. You should set the system MTU to 1500, not 1514 (unless I'm gravely mistaken, the MTU specifies the payload size, not the layer-2 frame size).

    There SHOULD be no negative impact, unless the workstations in your LAN use jumbo frames (and let's assume that the switches are not MPLS PE routers :).

    As for the BVI interface: I can set the MTU and IP MTU on a BVI interface on a router (using 12.4(15)T1), but as I said in a previous comment, you cannot set a per-interface MTU on a Cat3550 at all.
  7. Google got me here with the magic words mtu + ospf while I was looking for some info on this topic for a post on my new blog. I basically wrote the same (in Spanish), but added something that I found pretty interesting: lowering the MTU back or removing ip ospf mtu-ignore to see what would happen. Only the latter would bring us back to the issue. The MTU would become an issue again only when the adjacency is rebuilt... just my two cents.
  8. Yeah, I've got a strange issue.

    If the MTU is set to 1500 or lower, full adjacency is achieved; with anything higher it stays in 2-WAY. Has anyone got any ideas on that?

    The setup is: Juniper -> Foundry -> SmartEdge

    The settings on the Juniper and SmartEdge are as follows:

    Juniper
    metric 65535;
    retransmit-interval 5;
    transit-delay 1;
    hello-interval 10;
    dead-interval 40;

    SmartEdge:

    transmit-delay 1
    router-priority 0
    hello-interval 10
    router-dead-interval 40
    cost 65534

    The only difference I can see is the metric cost, but then why would it work with 1500 but not anything larger?
  9. I would suspect the box in the middle is dropping jumbo frames. See also

    http://blog.ioshints.info/2009/11/ip-ospf-mtu-ignore-is-dangerous-command.html
  10. Funny enough I'm experiencing this issue right now on a Gigabit Ethernet link between two 7609s. Looks like the MTU on the transport network is wrong and the carrier is looking at it now.

    New technology, same old problems. :)
  11. Hi Robin, I'm experiencing it right now. I have two routers between two 7609s and sometimes OSPF goes down. How did you resolve the issue?
  12. I am having an issue with OSPF. We have HP, Cisco, and H3C in our Area 0. Router priorities are set: remote sites are priority 0 and the main sites are 250 and lower (to specify the DR). However, we are still getting some strange intermittent adjacency losses.
    This started with an existing network that I am trying to fix. Originally no priorities were set anywhere and all Area 0 routers were at priority 1 (the default). I fixed that and the problem became MORE common; it had been happening once or twice every 3 months.
    I then discovered that the NTP server config on all the network equipment was inconsistent. So I fixed that and pointed all devices to the appropriate NTP servers (one of which was the loopback on our core router, which had an IP that already existed on the BDR as the router ID). Finally, yesterday, for the first time in 10 days, there were no OSPF adjacency change messages in the logs.
    All devices have identical MTU, Hello, Dead, and Carrier delay timers.

    My questions are:
    What effect did NTP have on OSPF? Could all the issues have been resolved by finding that duplicate IP in Area 0? Has anyone else seen issues with this type of mixed environment (HP, Cisco, H3C)?
    Replies
    1. Duplicate IPs (particularly if they're used for Router ID) could be the root cause of your problems.
    2. I agree; that is why I am going through the configs of all the devices on the network very carefully. I didn't build or design this network, but I can certainly make it work better and redesign what I can to improve on the original design.