QoS in Large-Scale DMVPN Networks
Got this question a few days ago:
I have a large DMVPN network (~1000 sites) using a variety of DSL, cable modem, and wireless connections. In all of these cases the bandwidth is extremely dissimilar and even varies over time. How can I handle this in a scalable way?
Hub-to-spoke QoS implementations in DMVPN networks usually use one of the following options:
Per-spoke class with ACL-based classification. If you want to implement per-spoke QoS on the hub router, you need a huge policy-map with a class for each spoke. However, as all DMVPN traffic exits the hub site through a single tunnel interface, ACL-based classification is the only mechanism you can use (QPPB won’t work – IOS supports only 100 QoS groups). Clearly not scalable and a nightmare to manage.
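Just to illustrate the point, the hub-side configuration would have to look something like this (spoke names, subnets and shaping rates are made up):

ip access-list extended SPOKE-001-SUBNETS
 permit ip any 10.1.1.0 0.0.0.255
!
class-map match-all SPOKE-001
 match access-group name SPOKE-001-SUBNETS
class-map match-all SPOKE-002
 match access-group name SPOKE-002-SUBNETS
!
policy-map HUB-TO-SPOKES
 class SPOKE-001
  shape average 2000000
 class SPOKE-002
  shape average 512000
 ! ... repeated for every one of the ~1000 spokes
!
interface Tunnel0
 service-policy output HUB-TO-SPOKES

Keeping a thousand ACLs and classes in sync with the spoke sites is bad enough; many platforms also limit the number of classes you can configure in a single policy-map.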
Per-tunnel QoS. The name of this feature (described in my New DMVPN features in IOS release 15.x webinar) is highly misleading. It allows you to implement per-spoke QoS (maybe the developers had IPsec tunnels in mind), with the service policy being selected by the spoke router. You configure an NHRP group on the spoke router, the name of the group is sent to the hub router in the registration request, and the hub router applies the policy-map associated with that NHRP group to all traffic sent to that spoke.
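In practice the plumbing looks roughly like this (group and policy names, as well as the shaping rate, are made up):

Spoke router:

interface Tunnel0
 ip nhrp group SPOKE-2MBPS

Hub router:

policy-map SHAPE-2MBPS
 class class-default
  shape average 2000000
!
interface Tunnel0
 ip nhrp map group SPOKE-2MBPS service-policy output SHAPE-2MBPS

Every spoke that registers with the SPOKE-2MBPS group name gets its hub-to-spoke traffic shaped to 2 Mbps.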
The answer to the reader's question (there's a configuration sketch after the list):
- Define numerous policy maps on the hub router (each one with a different set of traffic shaping parameters matching the spoke bandwidth) and associate each one of them with a different NHRP group;
- Measure the actual hub-to-spoke bandwidth;
- Select the closest QoS policy and configure the corresponding NHRP group on the tunnel interface of the spoke router;
- The NHRP group name will be sent to the hub router in the next registration request and the hub router will start using the new QoS policy for that spoke.
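Putting it all together, a hub with three bandwidth tiers and a spoke that measured roughly 2 Mbps of hub-to-spoke throughput might look like this (tier values and names are made up):

Hub router:

policy-map SHAPE-512K
 class class-default
  shape average 512000
policy-map SHAPE-2M
 class class-default
  shape average 2000000
policy-map SHAPE-10M
 class class-default
  shape average 10000000
!
interface Tunnel0
 ip nhrp map group GRP-512K service-policy output SHAPE-512K
 ip nhrp map group GRP-2M service-policy output SHAPE-2M
 ip nhrp map group GRP-10M service-policy output SHAPE-10M

Spoke router:

interface Tunnel0
 ip nhrp group GRP-2M

Changing the bandwidth tier is then a one-line change on the spoke; the hub picks up the new group name with the next NHRP registration.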
Disclaimer: I have not used this myself and don't personally know anybody who does. My opinion is that it does look good on paper, though.
The "emulated teleengines" concept is interesting, but works only if you have well-behaving traffic. A single UDP flood and the remote site is dead.
- App QoS (distributed/coordinated) (~Packeteer)
- WAN optimisation (compression/protocol optimisation) (~Riverbed)
- Automatic WAN link selection (haven't seen anybody do this before)
The "emulated teleengines" should work with any traffic, as long as that traffic *originates* at a site with an actual appliance.
Cheers,
Krzysztof
#2 - Yeah, ASR Almighty is a great box, isn't it (well, it might be eventually :-P )
#3 - Remote ingress shaping: thanks for reminding me, I need to fix the article.
Automatic WAN link selection: OER on Cisco routers (caveat: I have no real-life experience).
"... as long as that traffic ORIGINATES at a site with an actual appliance" <-- BINGO!
This feature, as I understood it, was designed to take care of coordinated control of traffic flows originating at large/fast sites (HO/DC/whatever), without the need to have appliances at small crappy offices.
> OER on Cisco routers
Cool, thank you - didn't know about it!
> BINGO!
Typical deployment would see appliances at all sites with significant uplink bandwidths. Small remote sites would often have something like ADSL with relatively small uplink pipes, so it may well not matter at all.
We're implementing a dual-hub dual-DMVPN design with per-tunnel QoS. The aggregate bandwidth of all spoke sites is greater than that available to either hub and is close to the aggregate bandwidth of both hubs. We're hoping EIGRP's per-flow load balancing will help even things out in that regard. The WAN provider gives us a BGP handoff to their publicly addressed MPLS cloud (with two classes of service).
The per-tunnel policy is being set up as advised: the child policy is the same for all sites, and there's one parent policy per remote-site bandwidth (only two). The child policy has three classes plus class-default: EF gets priority bandwidth 70%; CS3 and AF31 get bandwidth 5%; an ACL of management-station IPs gets precedence 3 set (mapping that traffic to the highest-priority vendor queue); class-default gets fair-queue. Prior to this implementation, the child policy was applied directly as a service policy to the WAN serial interface.
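In config terms it's roughly this (class/policy/group names and the shaping rate are changed for illustration; second bandwidth tier omitted):

class-map match-any VOICE
 match dscp ef
class-map match-any SIGNALING
 match dscp cs3 af31
class-map match-all MGMT
 match access-group name MGMT-STATIONS
!
policy-map CHILD
 class VOICE
  priority percent 70
 class SIGNALING
  bandwidth percent 5
 class MGMT
  set ip precedence 3
 class class-default
  fair-queue
!
policy-map PARENT-SMALL
 class class-default
  shape average 2000000
  service-policy CHILD
!
interface Tunnel1
 ip nhrp map group REMOTE-SMALL service-policy output PARENT-SMALL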
The question: before per-tunnel QoS, all EF traffic got up to 70% of the outbound bandwidth. However, once per-tunnel QoS is applied, the only QoS you can apply to the serial interface is a class-default shaper. OK, so within each tunnel the EF traffic is prioritized, but what happens when the total outbound load exceeds the serial interface bandwidth? Yes, the tunnels will prioritize the EF traffic, but if there's a ton of overall traffic, how does the serial interface know to prioritize EF on delivery to the WAN? If one or two really loud tunnels exceed the outbound bandwidth, will EF traffic from other tunnels be dropped, since shaping is per tunnel and not per interface?
Or to rephrase: is there one set of queues for the tunnel interface, which is then shaped for each dynamic tunnel, or is a separate set of queues created for each dynamic tunnel? The former would always make sure EF is prioritized at the serial interface, but the latter could result in dropped priority packets.
Or am I over thinking it again?
A quick lab test would provide a definitive answer, don't you think so?
What's got me is this from Cisco's per-tunnel docs:
QoS on a physical interface is limited only to the class default shaper on the physical interface. No other QoS configurations on the physical interface are supported when two separate QoS policies are applied to the physical and tunnel interfaces.
Addition of a QoS policy with a class default shaper on a physical interface is not supported when multiple QoS policies are utilized.
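If the first statement is the operative one, the most you could still hang off the physical interface is something like this (the shaping rate is made up):

policy-map WAN-SHAPER
 class class-default
  shape average 45000000
!
interface Serial0/1/0
 service-policy output WAN-SHAPER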
Maybe if I source the tunnel from a loopback instead of the WAN interface. I'll post my lab results when I get something definitive.
"service-policy with queueing features on sessions is not allowed in conjunction with interface based"
And yes, that is the complete error message, not a truncated paste on my part.
So, how do I ensure EIGRP packets sourced from the tunnel interface get prioritized on the serial interface?
A lot of ink is spent making sure your hub router can handle all the IPsec traffic, but very little is spent on the impact of GTS. In my lab testing, both in GNS3 and in a live lab, if your hub egress is congested and you have only two tunnels configured for traffic shaping, the hub router goes to 100% CPU and stops sending IGP packets across the tunnel. In my GNS3 case this happened with a 7200 router, and in the lab it happened with a 2800. More than two GTS instances use up all the CPU.
I've hit the same limitation described by Unknown 27 July, 2012 17:13, although I'm trying to group a series of spoke sites and apply a shaper to the group. Got that same message, dang.
I posted a cisco support forum entry here if anyone has any other solution/workaround to this issue:
https://supportforums.cisco.com/message/3801831