Wednesday, February 27, 2013

Cisco CSS: Load balancing from the inside too

Disclaimer: I'm not a Cisco expert.  Years ago, then-webmaster Karl Matthias convinced me that I was almost smart enough to barely understand this gear.  Turns out he was right.

The skinny

We use a Cisco CSS to load-balance client requests to multiple servers.  For years we couldn't load-balance requests from our inside network, only from the outside.


The setup

If you read the Cisco docs, the predominant use case for a load balancer appears to be a single CSS with a single server group serving all the content.  However, like most shops, we have multiple server groups serving different content.  For example, we have three servers for www.eclipse.org, two for Bugzilla, three for Git, three for wiki.eclipse.org, and so on.




Here, the load balancer acts as the gateway -- all inside servers use RFC 1918 private IPs, with 172.16.0.1 as their default gateway.  A single /24 subnet is used: 172.16.0.X.  The CSS has multiple "virtual" IPs: the real, Internet-routable IP addresses that represent the services.

For Internet clients, this setup works beautifully.  When you consider that a CSS is nothing more than a heavy-duty spoofing device, you can easily follow the flow of traffic from a client, say 108.10.50.81:

  • Client 108.10.50.81 sends SYN packet to www.eclipse.org, which is 198.41.30.199
  • The CSS immediately responds with a SYN-ACK, which is ACK'ed by the client, thus completing the three-way handshake
  • Meanwhile, the CSS spoofs the connection to one of the real servers in the group.  It crafts a new SYN packet --  Source: 108.10.50.81  Dest: 172.16.0.7 (it happened to pick that one)
  • The real server responds with a SYN-ACK: Source: 172.16.0.7 Dest: 108.10.50.81.  Since the Destination is remote, the packet is sent to the Default Gateway, which is the CSS (172.16.0.1); the short sketch after this list illustrates that routing decision.
  • The CSS simply discards the SYN-ACK since it has already established a socket with the real client. It ACKs the real server and completes the three-way handshake on the backend.
  • Everyone is happy, and traffic is free to flow from the client to the real server.
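
If you want to see that last routing decision in isolation, here is a tiny Python sketch.  It is purely illustrative (neither the CSS nor the server runs anything like this); it simply applies the same on-subnet test that the real server's IP stack does, using the addresses from the example above:

    # Illustration only: how the real server decides where to send its SYN-ACK.
    # Addresses and the /24 mask are the ones from the example above.
    from ipaddress import ip_address, ip_network

    local_subnet = ip_network("172.16.0.0/24")   # the real server's subnet
    default_gateway = ip_address("172.16.0.1")   # the CSS's inside address

    def next_hop(destination):
        """Where does the server's IP stack send a packet for this destination?"""
        if ip_address(destination) in local_subnet:
            return "directly to " + destination + " (same subnet)"
        return "to the default gateway " + str(default_gateway) + " (remote destination)"

    # The outside client is nowhere near 172.16.0.0/24, so the SYN-ACK is handed
    # to the gateway, which is the CSS: exactly what we want.
    print(next_hop("108.10.50.81"))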


The problem

Problems arise when a server on the "inside" becomes a client to a load-balanced service, also on the inside.  It simply doesn't work.  Years ago, the Cisco experts (not Karl) told me that's just how Cisco devices work -- the load balancer is not meant to be accessed from the inside network.  The Cisco forums offered no particular guidance either, other than to essentially "NAT" the inside servers as clients.  That solution works, but I didn't find it particularly pretty.

We originally worked around the issue with hosts file entries, then with internal DNS.  Since all our servers share a backend network connection, server-to-server connections would flow over it.  It worked, but it was error-prone and confusing.  If one load-balanced node died or was taken offline, we'd need to remember to update DNS.


Why it doesn't work

The years passed and I didn't spend much time thinking about it, but as our services grew in number, size and traffic volume, the problems became more frequent and annoying.

Understanding the root cause of the problem was key to developing a solution, and that understanding came about almost by accident while I was explaining the setup to a bunch of Linux students.  A light bulb went on.

The following day, I spent a bit of time with tcpdump and webalizer, and I noticed that the internal "client" trying to reach an internal service through the CSS was eventually receiving two SYN-ACK packets.  The client, understandably confused, would RST the connection, leading to failure.  Bingo.
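
If you want to reproduce that observation yourself, here is a rough sketch of the same check in Python using scapy.  To be clear, this is not what I actually ran (I used plain tcpdump); the scapy approach and the details below are my own assumptions, and it needs root privileges on the inside "client":

    # Rough sketch (assumption: scapy is installed; run as root on the inside client).
    # Counts SYN-ACKs per client-side socket; seeing two for the same socket means
    # two different boxes answered the same SYN (here: the CSS and the real server).
    from collections import Counter
    from scapy.all import sniff, IP, TCP

    seen = Counter()
    SYNACK_FILTER = "tcp[tcpflags] & (tcp-syn|tcp-ack) == (tcp-syn|tcp-ack)"

    def watch(pkt):
        # Key on the client side of the connection; the two SYN-ACKs arrive from
        # different sources (the virtual IP and the real server's private IP).
        key = (pkt[IP].dst, pkt[TCP].dport)
        seen[key] += 1
        if seen[key] > 1:
            print("Duplicate SYN-ACK for %s:%d, this one from %s" % (key[0], key[1], pkt[IP].src))

    sniff(filter=SYNACK_FILTER, prn=watch, store=False)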

Following the flow of traffic from the inside "client", the problem becomes apparent.  Say the Bugzilla server 172.16.0.15 wants to talk to www.eclipse.org, using the virtual IP:


  • Internal Bugzilla server (the client) 172.16.0.15 sends SYN packet to www.eclipse.org, which, through the magic of DNS, resolves to 198.41.30.199. Remember, that IP is the CSS.
  • The CSS immediately responds with a SYN-ACK, which is ACK'ed by the Bugzilla server (the client), thus completing the three-way handshake.  So far, so good.
  • Meanwhile, the CSS spoofs the connection to one of the real servers in the group.  It crafts a new SYN packet --  Source: 172.16.0.15 (the bugzilla "client")  Dest: 172.16.0.6 (one of the www nodes)
  • The real server responds with a SYN-ACK: Source: 172.16.0.6 Dest: 172.16.0.15.  Are you seeing this?  Unlike earlier, this time the Destination IP is not remote, so the packet is not sent to the Default Gateway.  The SYN-ACK is delivered directly to the Destination (the short snippet after this list shows that same check in code).
  • Two things happen: 1) the "client" receives a second SYN-ACK -- one from the CSS (the spoofed connection) and now one from the real server; 2) the CSS never "sees" the response, and the CSS must see all the traffic for load balancing to work.
  • Bugzilla server (the client), confused by the two SYN-ACKs, issues a connection RST and the connection fails.
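
Here is the same kind of on-subnet test as before, this time for the internal case.  Again, it is purely illustrative, using the addresses from the walkthrough; it just shows in a couple of lines why the CSS gets bypassed:

    # The Bugzilla "client" and the www node share 172.16.0.0/24, so the
    # www node's SYN-ACK is delivered directly and the CSS never sees it.
    from ipaddress import ip_address, ip_network

    www_subnet = ip_network("172.16.0.0/24")
    bugzilla_client = ip_address("172.16.0.15")

    print(bugzilla_client in www_subnet)   # True: the reply bypasses the default gateway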


The solution

For internal load balancing to work, the CSS must see all the traffic coming in and out.  The easiest solution here is to isolate each content group of servers on its own subnet.  Consider this:

The changes may be hard to spot:

  1. No changes to the "www" servers.  They remain on the 172.16.0 subnet, with a 24-bit mask.
  2. Bugzilla server IP addresses change from 172.16.0.X to 172.16.1.X.  Still with a 24-bit subnet mask, they are now on a different IP subnet than the www servers.  Physically, no wiring or VLAN changes are needed.  The Default Gateway changes to 172.16.1.1.
  3. On the CSS, a new IP address is assigned to the inside circuit: 172.16.1.1.  It will be the Default Gateway address for the Bugzilla group.
  4. Service rules for the Bugzilla servers are updated to reflect their new IP addresses.

Clients on the outside don't see a thing -- they are still happily talking to the CSS via virtual IPs 198.41.30.X.  On the inside, however, Bugzilla and "www" can now talk to each other using their load-balanced virtual IPs 198.41.30.X, since the CSS must be used to route all traffic between them.  If one node fails, the CSS continues to use the remaining nodes, and service remains functional for inside clients too.
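
A quick sanity check of that reasoning, with the same kind of illustrative snippet as before (the .15 host address is simply carried over from the earlier example):

    # After the renumbering, the Bugzilla group lives on 172.16.1.0/24, so a
    # www node's reply to a Bugzilla "client" is off-subnet and must go through
    # the default gateway (the CSS), which therefore sees all the traffic again.
    from ipaddress import ip_address, ip_network

    www_subnet = ip_network("172.16.0.0/24")      # unchanged
    bugzilla_client = ip_address("172.16.1.15")   # was 172.16.0.15 before the change

    print(bugzilla_client in www_subnet)   # False: the reply goes via 172.16.1.1 (the CSS)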

Friday, February 15, 2013

Big Server Move Reason #4: Big Savings

This is the final part of a blog series on why Eclipse.org moved to a new datacenter.

See Also: Reason #1: Bigger Pipe
See Also: Reason #2: Big Power
See Also: Reason #3: Big Cooling

Reason #4: Big Savings

The new colo facility was eager to have our business.  Very eager.  They kept sweetening the pot until it was practically impossible to say 'no'.  The end result of this move: more bandwidth, more AC power, better cooling, more cabinet space, and a lower monthly bill for the Foundation.

What's not to like?

Thursday, February 14, 2013

Big Server Move Reason #3: Big Cooling

This is part of a blog series on why Eclipse.org moved to a new datacenter.

See Also: Reason #1: Bigger Pipe
See Also: Reason #2: Big Power

Reason #3: Big Cooling

The by-product of consumed electricity is heat -- lots of it.  We felt we had outgrown our previous location since our cabinet temperatures were very high, even though the cabinets themselves still had vacancies.  In the last six months, we replaced no fewer than eight failed hard drives, all in relatively young servers.  Not an efficient use of our time.

The new facility has cabinets that are not only deeper, but also equipped with large chimneys which are ducted into the facility's air return (the ceiling).  Hot exhaust air is literally sucked out of our cabinets, drawing cool air from the perforated floor tiles in front.  Wayne's blog post has some neat pictures.

The result is a set of hot chimneys, cool servers and remarkably uniform temperatures inside the cabinets. 

Next up: Reason #4: Big Savings

Wednesday, February 13, 2013

Big Server Move Reason #2: Big Power

See Also: Reason #1: Bigger Pipe

This is part of a blog series on why Eclipse.org moved to a new datacenter.

Reason #2: Big Power

Today, small servers can deliver greater computational power than the bigger servers of only a few years ago.  But they are power-hungry: 700W to 1000W power supplies in small 1U servers mean that server cabinets now require plenty of power distribution units (PDUs).

Since efficiency drops as AC current increases, it became clear that North America's standard 120V AC power is an inefficient holdover from yesteryear (the rest of the world figured this out long ago).  As DC power is simply not there yet for colocation, 208V 3-phase AC was the way to go.  Not only do servers consume a bit less power at that higher voltage, but our new PDU bars also have built-in current monitors that display instantaneous power usage on each phase and can be graphed remotely thanks to SNMP.
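
For the curious, polling those PDU readings is simple enough to script.  Here is a rough sketch using Python's pysnmp; the hostname, community string and OID are placeholders, since the real OID depends on the PDU vendor's MIB:

    # Rough sketch: read one per-phase current value from a PDU over SNMP.
    # The hostname, community and OID below are placeholders, not our real values.
    from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, getCmd)

    error_indication, error_status, error_index, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData("public", mpModel=1),               # SNMPv2c, placeholder community
        UdpTransportTarget(("pdu-a.example.org", 161)),   # placeholder PDU hostname
        ContextData(),
        ObjectType(ObjectIdentity("1.3.6.1.4.1.318.1.1.12.2.3.1.1.2.1")),  # placeholder OID
    ))

    if error_indication or error_status:
        print("SNMP query failed:", error_indication or error_status)
    else:
        for oid, value in var_binds:
            print(oid, "=", value)   # e.g. the current draw on one phase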

Additionally, the new facility provides us with A+B redundant power circuits, each of which we must keep under 40% of capacity.  That way, even the total loss of one circuit leaves the surviving circuit within acceptable load (below 80%).

Next up: Reason #3: Big Cooling

Big Server Move Reason #1: Bigger Pipe

As you may know, last weekend we moved all our servers into a new Data Centre.  Since any move is disruptive, I thought I'd start a short series to outline the reasons why we did what we did.

Although I'm numbering the reasons, in reality they are in no particular order.


Reason #1: Bigger Pipe

The new facility offered us more bandwidth -- 60 Mbps more.  From 140 Mbps to today's 200 Mbps, the jump is substantial.


We monitor our download throughput from a server at the OSU OSL, in Portland, Oregon.  In the picture below, it's clear that this week, even at the busiest of times, our download speed has improved.  You can also see the flatline above Week 06, where our download speed went to absolute zero while we moved.



On the yearly graph, you can see how download performance has improved since last year.  In August, we lost the OSU OSL server, so monitoring flatlined until it was brought back into service.


Next up: Reason #2: Big Power

Monday, February 04, 2013

Eclipse.org is moving next Saturday, Feb. 9 2013!

As you may have heard, we're picking up the entire Eclipse.org server infrastructure and moving it to a new data centre, 20 minutes down the road from the existing location.

The new facility offers better cooling, more bandwidth, more rack space and more AC power for a lower monthly bill to the Foundation.  How can we possibly say no?

This means that on the morning (Eastern time) of Saturday, February 9th, you'll be greeted with a "we're moving" page to remind you that our servers are sitting in a truck somewhere.  We're currently planning everything involved so that downtime will be kept to a minimum.

We appreciate your understanding and your patience while we move to a newer, modern facility that will allow Eclipse to continue its growth... for years to come!