OSPF Neighbors Stuck in EXSTART
This problem is rare but tantalizing enough to warrant mentioning: OSPF neighbors are forever stuck in the EXSTART state (occasionally going DOWN and back to EXSTART).
The moment you start suspecting that something might be wrong with the OSPF adjacencies and use the debug ip ospf adj command, the problem becomes obvious: the Database Description (DBD) packet contains an Interface MTU field, and if the value received from the neighbor is higher than the IP MTU configured on the inbound interface, the DBD packet is rejected (section 10.6 of RFC 2328). The router with the lower MTU complains that “Nbr x.x.x.x has larger interface MTU”; the other router moans about protocol violations (First DBD and we are not SLAVE).
As always, there are two ways to solve this problem:
- The correct one: fix the MTU issues;
- The other one: disable MTU checks with the ip ospf mtu-ignore interface configuration command (which might be OK if the hardware can receive oversized packets and the router is not using fixed-size input buffers).
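The check itself is simple enough to model in a few lines. What follows is a minimal Python sketch of the rule as described above, not actual router code: the Interface class, its field names and the accept_dbd function are hypothetical names invented for illustration, and the 1500/1514 values are just sample numbers.

```python
from dataclasses import dataclass


@dataclass
class Interface:
    """Hypothetical model of the receiving interface (not a real router API)."""
    ip_mtu: int               # largest IP datagram accepted without fragmentation
    mtu_ignore: bool = False  # models the effect of 'ip ospf mtu-ignore'


def accept_dbd(received_mtu: int, iface: Interface) -> bool:
    """Return True if a DBD packet advertising received_mtu passes the MTU check."""
    if iface.mtu_ignore:
        # The check is skipped entirely; the packet is processed regardless.
        return True
    # The DBD packet is rejected when the neighbor advertises a larger MTU.
    return received_mtu <= iface.ip_mtu


low = Interface(ip_mtu=1500)
high = Interface(ip_mtu=1514)

print(accept_dbd(1514, low))    # router with the lower MTU rejects the DBD
print(accept_dbd(1500, high))   # router with the higher MTU accepts it
print(accept_dbd(1514, Interface(ip_mtu=1500, mtu_ignore=True)))
```

Note that the check is asymmetric: only the router with the lower MTU rejects packets, which matches the two different complaints seen in the debugging output.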
A few years ago I got called to troubleshoot an OSPF Exstart problem. Both routers were connected over an international Frame Relay PVC. Both sides initially had an MTU of 1500 bytes set on their interfaces, but OSPF got stuck in Exstart. I knew about the OSPF MTU mismatch issue back then, but this one didn't seem to be it because the MTU size matched on both ends. However, I was told it was an international Frame Relay PVC, so I asked how the PVC was built. It actually went through three providers, and the provider in the middle had the PVC MTU set at 1100 bytes for some reason; that was the culprit. The fix, as it turned out, was to lower the interface IP MTU on the customer routers (IP MTU = 1024), because ip ospf mtu-ignore didn't solve it (the middle Frame Relay provider dropped the oversized frames at layer 2). It was an unusual problem, so I wanted to pass it along. Nowadays Frame Relay is going away, so we may never encounter a problem like this one again.
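The reasoning in this story can be sketched as simple arithmetic: the usable packet size over a multi-segment layer-2 path is bounded by the smallest segment MTU, no matter what the end routers advertise to each other. The Python below is only an illustration under stated assumptions: the function names are made up, the segment values are the ones quoted above, and layer-2 header overhead is ignored.

```python
def path_mtu(segment_mtus):
    """Smallest MTU along the path; larger packets are dropped somewhere."""
    return min(segment_mtus)


def survives(packet_size, segment_mtus):
    """True if a packet of packet_size bytes fits through every segment."""
    return packet_size <= path_mtu(segment_mtus)


# Three providers; the middle one had the PVC MTU set to 1100 bytes.
# (Assumption: all values are directly comparable, header overhead ignored.)
segments = [1500, 1100, 1500]

print(path_mtu(segments))        # bottleneck of the whole PVC
print(survives(1500, segments))  # full-size packet with interface MTU 1500
print(survives(1024, segments))  # packet limited by IP MTU = 1024
```

Disabling the MTU check changes only what the receiving router is willing to accept; it does nothing for packets that are dropped in transit, which is why lowering the IP MTU was the only fix that worked here.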
Without affecting other switch ports?
I know about "system mtu routing ..." on 3550, but it is system-wide.
Consider that the router has a BVI interface (which also produces a different MTU) and the switch has an SVI.
Router:
bridge irb
bridge 1 protocol ieee
bridge 1 route ip
!
interface GigabitEthernet0/0
 description trunk to 3750
 no ip address
!
interface GigabitEthernet0/0.1
 encapsulation dot1Q 100
 bridge-group 1
!
interface BVI1
 ip address 10.1.1.2 255.255.255.0
!
router ospf 1
 network 10.1.1.0 0.0.0.255 area 0
BVI1 is up, line protocol is up
MTU is 1514 bytes
Once I get my hands on a Catalyst switch (and have time to spare), I'll run a few tests.
So should I set "system mtu routing 1514" on the 3750 to match the bvi's mtu and forget about it?
Any negative consequences?
What about other routers on the same L2 segment with regular routed interfaces? They currently have "ip ospf mtu-ignore" :)
The BVI interface would not take MTU settings.
Thanks,
Vladimir
There SHOULD be no negative impact, unless the workstations in your LAN use jumbo frames (and let's assume that the switches are not MPLS PE routers :).
As for the BVI interface: I can set the MTU and IP MTU on a BVI interface on a router (using 12.4(15)T1), but as I said in a previous comment, you cannot set per-interface MTU on a Cat3550 at all.
If the MTU is set to 1500 or lower, full adjacency is achieved; with anything higher it stays in 2-Way. Anyone got any ideas on that?
The setup is: Juniper -> Foundry -> SmartEdge
The settings on the Juniper and SmartEdge are as follows:
Juniper
metric 65535;
retransmit-interval 5;
transit-delay 1;
hello-interval 10;
dead-interval 40;
SmartEdge:
transmit-delay 1
router-priority 0
hello-interval 10
router-dead-interval 40
cost 65534
The only difference I can see is the metric cost, but then why would it work with 1500 but not anything larger?
http://blog.ioshints.info/2009/11/ip-ospf-mtu-ignore-is-dangerous-command.html
New technology, same old problems. :)
This started with an existing network that I am trying to fix. Originally no priorities were set anywhere and all Area 0 routers were set to priority 1 (default). I fixed that and the problem became MORE common - it had been happening once or twice every 3 months.
I then discovered that the NTP server config on all the network equipment was inconsistent. So I fixed that and pointed all devices to the appropriate NTP servers (one of which was the loopback on our core router, which had an IP that already existed on the BDR as the router ID). Finally, yesterday, for the first time in 10 days, there were no OSPF adjacency change messages in the logs.
All devices have identical MTU, Hello, Dead, and Carrier delay timers.
My questions are:
What effect did NTP have on OSPF? Could all the issues have been resolved by finding that duplicate IP in Area 0? Has anyone else seen issues with this type of mixed environment (HP, Cisco, H3C)?