Distributed firewalls: how badly do you want to fail?

Have you ever been asked to use a layer-2 Data Center Interconnect to implement distributed active-active firewalls, supposedly solving all the L3 issues and the problems of asymmetrical traffic flows across stateful firewalls? Don’t be surprised; for example, a year ago I was stupid enough (or maybe just blinded by the L2 glitter) to draw the following diagram illustrating a sample use of VPLS services:

The solution looks ideal: both WAN routers would advertise the same IP prefix to the outside world, attract customer traffic, and pass it through the nearest firewall. The inside routers would take care of proper traffic distribution (inbound traffic might need to traverse the DCI link), and the return traffic would follow the shortest path toward the WAN (or Internet) cloud. The active-active firewalls would exchange flow information, solving the asymmetrical flow problems.

Now ask yourself: what happens if the DCI link fails? Some of the inbound traffic will arrive at the wrong edge router and get dropped, and the firewalls will go into split-brain mode. You’ll clearly experience problems in both data centers.

Usually we use pairs of devices (routers and firewalls in this particular example) in redundant configurations to increase the overall system availability. I am no expert in high availability calculations, but one of the hidden assumptions in designs where devices have to exchange state information (active-active firewalls, for example) is that the non-redundant component (the link between the devices) has to be as reliable as the devices themselves.
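
As a back-of-the-envelope illustration (with made-up numbers): if each firewall is 99.9% available on its own, a redundant pair that depends on nothing else would be roughly 1 - (1 - 0.999)^2 ≈ 99.9999% available. Add a state-sync link with 99.5% availability that the pair cannot live without, and the whole construct drops to about 0.995 x 0.999999 ≈ 99.5%; the overall availability is capped by the weakest non-redundant component.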

In a stretched subnet design the weakest link of the whole system is the data center interconnect; in most cases, stretched subnets would decrease the overall availability of the system.

A reliable layer-3 solution is not much easier to design. A while ago I was involved in a redesign of a global network. The customer had a very knowledgeable networking team (it was a total pleasure working with them), and we tried really hard to find a redundant data center design that would allow them to advertise a single L3 prefix from both data centers. We even reached the point where we had a working design that would survive all sorts of failures, but it got too complex for the customer (and they were absolutely right to reject it and fall back to simpler options).

Unless you believe in the miracles of TCP-based anycasting, it seems the best option you have for implementing distributed data centers is still the time-proven design used by web content providers with a good track record (Google, for example): DNS-based load balancing between data centers, combined with advertising data-center-specific prefixes (plus a summary prefix as a backup) into BGP.
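
For illustration only (hypothetical addressing and AS numbers, nothing like a complete design): assume DC1 hosts 192.0.2.0/25, DC2 hosts 192.0.2.128/25, and 192.0.2.0/24 is the summary covering both. DNS hands out addresses from the DC-specific ranges, and the DC1 WAN router advertises something along these lines:

  ! DC1 WAN router - made-up prefixes and AS numbers
  ip route 192.0.2.0 255.255.255.128 Null0
  ip route 192.0.2.0 255.255.255.0 Null0
  !
  router bgp 65001
   ! DC1-specific prefix - wins on longest-prefix match
   network 192.0.2.0 mask 255.255.255.128
   ! summary prefix - used only when a more specific prefix disappears
   network 192.0.2.0 mask 255.255.255.0
   neighbor 198.51.100.1 remote-as 64500

DC2 would advertise the mirror image with 192.0.2.128/25. If one data center loses its Internet connectivity, its specific prefix disappears from the global routing table and the traffic for it follows the summary toward the other data center.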

You have to move the firewalls deeper into the data center if you want to avoid session loss following an Internet link failure.

More information

The Data Center Interconnects webinar (register here) describes numerous design and implementation aspects of L2 and L3 data center interconnects. It covers load balancing and external routing for L3 interconnects as well as numerous L2 DCI technologies, including VPLS, A-VPLS, OTV, TRILL, BGP MPLS-based MAC VPN, and EtherIP between load balancers (F5).

33 comments:

  1. Hey Ivan,

    Just a few words of appreciation for you... is there no Saturday or Sunday for you? Don't you take holidays? When do you actually rest... Wish I had the same dedication :(

  2. NotSoAnonymous 19 April, 2011 00:18

    Love this part --> "....but it got too complex for the customer (and they were absolutely right to reject it and fall back to simpler options)"

  3. Regarding your drawing, I have a question. If you designed this scenario in a redundant way (active-standby, with asymmetrical traffic) and the firewalls lost the connection between them for some reason, both would think they are the active one, right? I think we'd have the same problem described here, just from the opposite direction.
    Please correct me if I'm wrong.

    So, besides your Google-style solution, is there no other alternative?

    Thanks

  4. Ivan,
    To reduce the risk of the data center interconnect, what I have done on multiple occasions within the past year is recommend a strategy where the client has a primary fast data center interconnect (usually dark fiber) between their two sites and, as a backup to that, an L2TPv3 pseudowire over what is usually the client's existing L3 WAN (a rough configuration sketch follows at the end of this comment). To avoid bridging each site's STP domain, I would use either REP (Cisco) or RRPP (HP) as the ring protocol responsible for the data center "ring", with each link terminated either on ME3400s (Cisco) or on an IRF pair of A-series HP switches.

    With REP, you can only break the ring with STP in one location, and you have to use a StackWise pair at the other end with Flex Links to both of the local "ring devices". One of the Cisco Live sessions had some nice designs suggesting exactly that, and it works very well.

    With an IRF pair, those problems are taken care of as the device is a single entity, and thus only runs the ring protocol over the datacenter interconnect (the private fiber and the pseudowire).

    The end result is that the risks of a dual active, split-brain scenario are very much reduced, and the ring protocol ensures fast convergence and STP domain isolation.
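
    For illustration, a minimal sketch of such a backup pseudowire, assuming Cisco IOS, L2TPv3 endpoints on loopback interfaces, and a made-up peer address and VC ID (not the actual configuration from these deployments):

      pseudowire-class DCI-BACKUP
       encapsulation l2tpv3
       ip local interface Loopback0
      !
      interface GigabitEthernet0/1
       description Backup leg of the DC ring, carried over the L3 WAN
       xconnect 192.0.2.2 100 encapsulation l2tpv3 pw-class DCI-BACKUP

    The ring protocol (REP or RRPP) would presumably treat the pseudowire as just another ring link and keep the topology loop-free by blocking one port.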

  5. And here is the diagram that should have accompanied that explanation...

  6. Ivan Pepelnjak 19 April, 2011 13:12

    In your scenario, both firewalls will become active.

    Alternatives depend on how reliable you want your design to be and how reliable you think your DCI link is (see also the comment from mnantel)

  7. Can you tell me which Cisco Live session discussed these designs?

  8. BRKSPG-2205, titled Deploying and Designing with the Resilient Ethernet Protocol. PDF page 69 describes such solutions.

  9. Hi Ivan,
    > Unless you believe in the miracles of TCP-based anycasting
    What problems do you expect there? Rare session failures and renewed TCP handshakes... anything else?

  10. Ivan Pepelnjak 20 April, 2011 10:29

    It probably depends on how far apart your data centers are and whether they connect to the same ISP or different ISPs.

    If the difference in paths needed to reach them (as seen from outside) is big enough, anycasting should work like a dream ... but then you'd probably need a different DCI design anyway as your DCI link won't have infinite bandwidth.

  11. We do perfectly well with 70 ms between DCs and IPsec over the Internet for the interconnect. No L2, unique subnets with reliable routing, and simple troubleshooting.
    ISPs must be different anyway for redundancy. Load balancing is a matter of playing with communities (a hypothetical example follows at the end of this comment).
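
    A hypothetical example of such community-based tuning (the community value 64500:80 and its meaning are made up; every ISP documents its own scheme): tag the prefixes you want depreferred on one uplink with the ISP's "prepend toward peers" community:

      ip prefix-list DC2-PREFIX permit 192.0.2.128/25
      !
      route-map TO-ISP-A permit 10
       match ip prefix-list DC2-PREFIX
       set community 64500:80 additive
      route-map TO-ISP-A permit 20
      !
      router bgp 65010
       neighbor 198.51.100.1 remote-as 64500
       neighbor 198.51.100.1 send-community
       neighbor 198.51.100.1 route-map TO-ISP-A out

    The empty permit 20 entry lets everything else through unchanged.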

  12. Ronald Bartels 20 April, 2011 13:17

    I would suggest my solution but don't want to risk the holy cow battle all over again...

  13. Ivan Pepelnjak 20 April, 2011 13:38

    Sounds reasonable. Anycasting or two different prefixes?

  14. Ivan Pepelnjak 20 April, 2011 13:39

    By all means, please do. I promise there will be no bovine certification discussions :-E

  15. Single prefix. It's more than reasonable when you compare pricing, stability, and simplicity against the solutions vendors push out of their clouds.

  16. Ivan Pepelnjak 20 April, 2011 14:21

    I would love to hear more. If you're willing to share anything (off-the-record and totally anonymously, of course), please get in touch with me (the contact link at the top of the page, for example).

  17. Ronald Bartels 20 April, 2011 21:27

    Well firstly anyone who uses the word pseudowire deserves to become a bovine sacrifice.

    Secondly, I think you did a Hitchhiker's. The answer is 42 but what is the question?

  18. Hi Evan,

    To be honest, I don't see a problem that can't be solved in the architecture in your first diagram. If a split-brain situation occurs, then propagate the external IP prefix to only one WAN router (based on quorum reachability).

    What am I overlooking here?

    Best regards,

  19. My apologies, made a typo, s/Evan/Ivan

  20. Ivan Pepelnjak 21 April, 2011 07:23

    What would you use instead of "pseudowire"? That's how the technology used to implement P2P Ethernet links across MPLS or IP clouds happens to be called. Not my choice.

  21. Ivan Pepelnjak 21 April, 2011 07:26

    Assuming you could reliably detect that the DCI link is down (which was one of the trickier parts of the L3 design I mentioned in the post), the DCI link failure would effectively cut off one DC from the Internet.

    Also, assuming your right DC is the "chosen victim", how does it know that the DCI link failed (and it should cut itself off from the Internet) and not the left DC (in which case it should not do so)?

  22. Hi Ivan,
    Thanks for the conversation yesterday. I still think anycast is a much better solution for lots of designs today. Even though you say it might not work in some situations, I can find very few of them.
    Even if a weird ISP gets equal paths via different peers (a very uncommon situation), it would be unbelievably weird of them to configure eBGP load sharing, and even weirder to balance per packet rather than per flow. The true miracle is things working without human intervention; it's the architect's/engineer's job to find that 1% of problem cases and fix them.
    People do not rely on anycast because they were taught in their childhood "TCP anycast doesn't work, forget it!" Well, it does work in fact, and it saves huge amounts of money and nerves.
    The solution is not for vendors or integrators getting money for selling boxes and complicated designs. It's for internal staff willing to sleep well and get paid for doing nothing.

    Anycast is not for clusters. Who cares? They are back-ends. You can choose a DCI with AToM and pay for it, or create your own with L2TPv3+IPsec+Internet to synchronize them. Traffic will go through the same ISP equipment with the same delay/jitter/packet loss. These L2 links must be kept as few as possible, or avoided entirely by configuring primary/secondary options on the front-ends.

    HTTP can work with anycast, and it's the major application companies need today, even the global monsters. I bet there are so few TCP applications that can't work with this technique that it should be brought back to humankind.

    If there's anyone with a large company network to design, I'd love to participate and implement anycast. Really. 'Cos it works! =)

  23. There are, of course, more things to consider than how to deal with a DCI failure, but let's focus on this part for now.

    What I would do is create an IP SLA probe on both WAN routers that monitors one (or more) device(s) that can only be reached via the DCI (for example, the core router in the other data center); a rough configuration sketch follows at the end of this comment.

    If the DCI fails, both WAN routers notice it. The question is then what should happen, and that depends on which services are available in each data center. Depending on that answer you have multiple options: you can designate one data center to take over all services by routing the IP prefix to one location, or you can split the IP prefix in two (or more) and give each data center the part it can handle, based on back-end availability and expected load.

    Am I missing something?

    Best regards,
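
    A minimal IOS-style sketch of the tracking idea described above (hypothetical addresses; an illustration of the mechanism rather than a tested design): the probe targets a core-router address that is reachable only across the DCI, and the advertised prefix is anchored to a static route that disappears when the probe fails:

      ip sla 10
       icmp-echo 10.1.2.1 source-interface GigabitEthernet0/2
       frequency 5
      ip sla schedule 10 life forever start-time now
      !
      track 10 ip sla 10 reachability
      !
      ! the public prefix is anchored only while the DCI probe succeeds
      ip route 192.0.2.0 255.255.255.0 Null0 track 10
      !
      router bgp 65001
       network 192.0.2.0 mask 255.255.255.0

    When the track object goes down, the Null0 anchor route is removed, the network statement no longer has a matching route, and that router stops advertising the prefix; presumably only the data center designated to give up its services would tie its advertisement to the track object.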

  24. Keep in mind, some networks use systems that change BGP routes throughout the day as part of traffic engineering. While it's not likely that the same route would be picked to flop around across many providers, it's certainly reasonable to imagine that my route to your netblock might change 3-4 times a day. Those resetting TCP sessions may not be as rare as you think.

    Also, I guess it depends on what you consider acceptable. To me, having user TCP connections fail (and seeing an error page) during normal operation even only once a day is unacceptable. We should be striving for 0 errors during normal operation. Errors should only happen during failures.

  25. Ivan Pepelnjak 27 April, 2011 09:39

    As long as your BGP core is in front of your firewalls, you won't see TCP session failures at all. They only happen if changes in BGP routing cause the sessions to go through a different set of firewalls.

  26. Derick Winkworth 03 May, 2011 07:35

    Or don't do synchronized firewalls at all. I know a lot of people swear by them, but my experience has been that firewall clustering can cause more problems than it resolves. Usually, when it has been useful, it was because of a planned outage of one firewall or the other.

    Some firewall vendors make active/active synchronization as easy as clicking a button and waiting a few minutes for the flow data to synchronize. If we know there will be an outage, we sync, perform the maintenance, and then un-sync.

  27. Stretched firewall clusters make sense IF:

    The DCI fibers are confirmed to take different physical paths.

    The chosen firewall supports LAG and actively participates in the chosen dynamic routing protocol.

    VRFs are introduced in the DC.

    With that in mind, there will be "lots" of savings for the same type of service level.

  28. "The active-active firewalls would exchange flow information, solving the asymmetrical flow problems" - Does the state needs to be in sync before the return traffic when asymetric hots the other firewall on its way out. What happens if the return packet arrives first and state sync happens a msec latter?

    Replies
    1. Never deep-dived into those details, but based on the discussion I had with one of our security gurus, it seems the "wrong" box just forwards the packet to the "flow owner" ... and the synchronized state is needed primarily in failure scenarios.

  29. So when is a DCI a "DCI"?

    I am considering a design for a customer whose DCI is 4 LAGed pairs of customer-owned dark fiber between two buildings that are 100 ft from each other (I know... not much for DR, but it's what they have to work with). We could tag separate VLANs across the DCI to interconnect the various pairs of devices (e.g., one for HSRP/IBGP on the Internet edge routers, another for state sync on the firewalls).

    Would you still frown on this architecture if the data centers were on opposite ends of the same building?

    Opposite ends of the same data center?

    Opposite ends of the same rack?

    You can see the crux of my question is becoming "What precise factors make the DCI such a weak link that it should be avoided as a cluster link?"

    Is it:
    a - High Latency?
    b - Probability of being mechanically disturbed by the ditch diggers?
    c - Probability of being logically disturbed by malfunctioning service provider equipment?
    d - Using the same link to connect multiple "clusters" (i.e., tagging a VLAN for each cluster link instead of using a separate cable for each cluster link, as would normally be the case if they were in the same data center)? I DO hate the idea that losing this DCI would require every pair of devices to simultaneously recalculate mastership.

    In my case, criteria A and C are not a concern due to the short distance and the private nature of the customer-owned fiber; however, B and D are still factors to be weighed.

    Thoughts?

    Replies
    1. The answer is very simple: a single L2 network (VLAN, for example) is a single failure domain and as reliable as its weakest link.

      You want to have a single failure domain? Go ahead, create a VLAN. You want to have resilience? Do a proper L3 design.

  30. Hi Ivan,

    Regarding the line that reads "A reliable layer-3 solution is not much easier to design" --- is the use of the word "not" perhaps a typo? I ask just because I'm confused, since the first sentence seems incongruent with the remainder of that paragraph. The rest of the paragraph appears to convey that your client decided to fall back to an L3 DCI in light of the complexity of the stretched L2 design you devised.

    Please forgive my question if this isn't a typo; I ask only to clarify my understanding of your post.

    Thanks,
    Scott

    Replies
    1. It's not a typo; an L3 solution advertising the same servers over multiple links is hard (particularly when firewalls and load balancers enter the picture). See this blog post for more details:

      http://blog.ipspace.net/2013/01/redundant-data-center-internet.html

      Hope this helps,
      Ivan



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.