Wednesday, March 20, 2013

Poll results: How do you prepare your EclipseCon talks

The results are in! In yesterday's poll I asked how you prepared for your talk(s) at EclipseCon. Here were your answers:



30 i-prepare-my-talks-on-the-plane-to-econ

27 oh-crap-ill-be-back-later-theres-something-i-must-go-do

22 i-prepared-my-talk-in-2009-maybe-i-should-update-it

19 i-prepare-my-talks-while-i-drive-to-econ

16 i-prepare-my-talk-when-i-get-there



124 I-assumed-Denis-was-preparing-my-talks-for-me

If you think I am in some way qualified to prepare your talks, the attendees are in for a painful experience.


56 like-Eric-I-prepare-mine-on-the-toilet


That is way too much information.  I'd be leery of printed handouts at those talks.



For some reason, there were more answers on the topic of beer than any other.   I am shocked!

9 this.is.my.first.eclipsecon.ever.its.going.to.be.great---they.say.you.buy.beers

Congratulations -- you are in for a great week. Catch me at the bar and I'll buy you a beer.  You'll need to be fast since I am not at the bar very often.


6 please.set.up.a.better.404.page.for.eclipsecon


In this age of Drupal-enabled websites, this is out of my control.

 

5 ive.been.to.many.eclipsecons---why-have-you-never-bought-me-a-beer

I am sorry.  I will try harder this year.




5 /webmaster.ive.been.to.all.the.eclipsecons---and.I.know.you.buy.beers


I know who you are.


5 /webmaster.ive.been.to.a.couple.of.eclipsecons---I-prefer-to-run-than-drink-beer


I will run with you -- to the beer store!


 


Thanks to everyone who participated!  I look forward to seeing everyone again at EclipseCon 2013!

Tuesday, March 19, 2013

EclipseCon Poll: How are you preparing your talk(s) for EclipseCon

It's time for another Webmaster Whacky poll, where failure is the only sign of success.  In less than a week we'll all be at our favourite hangout, EclipseCon 2013.  This time I ask a simple question:

How do you prepare for your talk(s) at EclipseCon?  Cast your votes by clicking the links below:


http://eclipsecon.org/webmaster-i-prepare-my-talks-on-the-plane-to-econ

http://eclipsecon.org/webmaster-i-prepare-my-talks-while-i-drive-to-econ

http://eclipsecon.org/webmaster-i-prepared-my-talk-in-2009-maybe-i-should-update-it

http://eclipsecon.org/webmaster-i-prepare-my-talk-when-i-get-there

http://eclipsecon.org/webmaster-oh-crap-ill-be-back-later-theres-something-i-must-go-do

http://eclipsecon.org/webmaster-what-no-beer?


I'll tally up the results tomorrow afternoon.  Remember -- no ballot stuffing!!

Wednesday, February 27, 2013

Cisco CSS: Load balancing from the inside too

Disclaimer: I'm not a Cisco expert.  Years ago, then-webmaster Karl Matthias convinced me that I was almost smart enough to barely understand this gear.  Turns out he was right.

The skinny

We use a Cisco CSS to load-balance client requests to multiple servers.  For years we couldn't load-balance requests from our inside network, only from the outside.


The setup

If you read the Cisco docs, the predominant use case for a load balancer appears to be a single CSS with a single server group serving all the content. However, like most shops, we have multiple server groups serving different content.  For example, we have three servers for www.eclipse.org, two for Bugzilla, three for Git, three for wiki.eclipse.org, and so on.




Here, the load balancer acts as the gateway -- all inside servers use RFC 1918 private IPs and use 172.16.0.1 as their default gateway.  A single /24 subnet is used: 172.16.0.X. The CSS has multiple "virtual" IPs: the real, Internet-routable IP addresses that represent the services.

For Internet clients, this setup works beautifully.  When you consider that a CSS is nothing more than a heavy-duty spoofing device, you can easily follow the flow of traffic from a client, say 108.10.50.81:

  • Client 108.10.50.81 sends SYN packet to www.eclipse.org, which is 198.41.30.199
  • The CSS immediately responds with a SYN-ACK, which is ACK'ed by the client, thus completing the three-way handshake
  • Meanwhile, the CSS spoofs the connection to one of the real servers in the group.  It crafts a new SYN packet --  Source: 108.10.50.81  Dest: 172.16.0.7 (it happened to pick that one)
  • The real server responds with a SYN-ACK: Source: 172.16.0.7 Dest: 108.10.50.81.  Since the Destination is remote, the packet is sent to the Default Gateway, which is the CSS (172.16.0.1)
  • The CSS simply discards the SYN-ACK since it has already established a socket with the real client. It ACKs the real server and completes the three-way handshake on the backend.
  • Everyone is happy, and traffic is free to flow from the client to the real server.
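To make the handshake sequence above concrete, here is a toy Python model of the spoofing behaviour as the steps describe it. It is a sketch, not how the CSS is actually implemented: the `Packet` and `SpoofingBalancer` names are hypothetical, round-robin server selection is an assumption for illustration, and connections are keyed by client IP only for brevity, where real gear tracks full TCP 4-tuples.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Packet:
    flags: str  # e.g. "SYN", "SYN-ACK", "ACK"
    src: str
    dst: str


class SpoofingBalancer:
    """Toy model of the client-side and server-side handshakes (hypothetical)."""

    def __init__(self, vip, real_servers):
        self.vip = vip
        self._servers = cycle(real_servers)  # assumed round-robin selection
        self.backend = {}  # client IP -> real server chosen for that client

    def on_client_syn(self, syn):
        """Answer the client at once, then open a spoofed backend connection."""
        if syn.flags != "SYN" or syn.dst != self.vip:
            raise ValueError("not a SYN for this virtual IP")
        reply = Packet("SYN-ACK", self.vip, syn.src)
        real = next(self._servers)
        self.backend[syn.src] = real
        # The backend SYN keeps the client's address as its source.
        return reply, Packet("SYN", syn.src, real)

    def on_server_synack(self, synack):
        """Swallow the real server's SYN-ACK and ACK it on the client's behalf.

        This only works if the SYN-ACK is routed back through the balancer.
        """
        if self.backend.get(synack.dst) != synack.src:
            raise ValueError("unexpected SYN-ACK")
        return Packet("ACK", synack.dst, synack.src)
```

Feeding it a SYN from 108.10.50.81 to 198.41.30.199, with a server list that starts at 172.16.0.7, yields a SYN-ACK from the virtual IP back to the client and a backend SYN from 108.10.50.81 to 172.16.0.7, matching the steps above.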


The problem

Problems arise when a server on the "inside" becomes a client to a load-balanced service, also on the inside. For some reason, it just doesn't work.  Years ago, the Cisco experts (not Karl) just told me it was how Cisco devices worked -- the load balancer is not meant to be accessed from the inside network.  The Cisco forums provided no particular guidance, other than to essentially "NAT" the inside servers as clients.  While that solution works, I didn't find it particularly pretty.

We originally worked around the issue with hosts file entries, and later with internal DNS.  Since all our servers share a backend network connection, server-to-server connections would flow over it.  It worked, but it was error-prone and confusing: if one load-balanced node died or was taken offline, we'd need to remember to update DNS.


Why it doesn't work

The years passed and I didn't spend much time thinking about it, but as our services grew in number, size and traffic volume, the problems became more frequent and annoying.

Understanding the root cause of the problem was key to developing a solution, and that understanding came haphazardly, while I was explaining the problem to a bunch of Linux students.  A light bulb went on.

The following day, I spent a bit of time with tcpdump and webalizer, and noticed that the internal "client" trying to reach an internal server through the CSS was eventually receiving two SYN-ACK packets.  The client, understandably confused, would RST the connection, leading to failure.  Bingo.

Following the flow of traffic from the inside "client", the problem becomes apparent.  Say Bugzilla server 172.16.0.15 wants to talk to server www.eclipse.org, using the virtual IP:


  • Internal Bugzilla server (the client) 172.16.0.15 sends SYN packet to www.eclipse.org, which, through the magic of DNS, resolves to 198.41.30.199. Remember, that IP is the CSS.
  • The CSS immediately responds with a SYN-ACK, which is ACK'ed by the bugzilla server (the client), thus completing the three-way handshake.  So far, so good.
  • Meanwhile, the CSS spoofs the connection to one of the real servers in the group.  It crafts a new SYN packet --  Source: 172.16.0.15 (the bugzilla "client")  Dest: 172.16.0.6 (one of the www nodes)
  • The real server responds with a SYN-ACK: Source: 172.16.0.6 Dest: 172.16.0.15.  Are you seeing this?  Unlike earlier, this time the Destination IP is not remote, so the packet is not sent to the Default Gateway.  The SYN-ACK is sent directly to the Destination.
  • Two things happen:  1) The "client" receives a second SYN-ACK -- one from the CSS (the spoofed connection), and now one from the real server.   2) The CSS is not "seeing" the response.  The CSS must "see" all the traffic.
  • Bugzilla server (the client), confused by the two SYN-ACKs, issues a connection RST and the connection fails.
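The decisive step is the routing choice the real server makes when it sends its SYN-ACK. A host delivers directly to any destination on its own subnet and hands everything else to its default gateway. The Python sketch below shows why the reply to an Internet client passes back through the CSS while the reply to the Bugzilla server does not; the `next_hop` helper is hypothetical, and it ignores any extra static routes a real host might carry.

```python
import ipaddress


def next_hop(src_ip, prefixlen, dst_ip, gateway):
    """Return where a host sends a packet: the destination itself if it is
    on the host's own subnet, otherwise the default gateway."""
    local_net = ipaddress.ip_interface(f"{src_ip}/{prefixlen}").network
    if ipaddress.ip_address(dst_ip) in local_net:
        return dst_ip  # on-link: delivered directly, bypassing the gateway
    return gateway  # off-link: handed to the default gateway (the CSS)


GATEWAY = "172.16.0.1"  # the CSS

# www node 172.16.0.6 answering an Internet client: goes back through the CSS.
print(next_hop("172.16.0.6", 24, "108.10.50.81", GATEWAY))  # 172.16.0.1
# The same node answering the Bugzilla "client" 172.16.0.15: skips the CSS.
print(next_hop("172.16.0.6", 24, "172.16.0.15", GATEWAY))  # 172.16.0.15
```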


The solution

For internal load-balancing to work, the CSS must see all the traffic coming in and out.  The easiest solution here is to isolate each content group's servers in their own subnet.  Consider this:

The changes may be hard to spot:

  1. No changes to the "www" servers.  They remain on the 172.16.0 subnet, with a 24-bit mask.
  2. Bugzilla server IP addresses change from 172.16.0.X to 172.16.1.X.  Also with a 24-bit subnet mask, they are now on a different IP subnet than the www servers.  Physically, no wiring or VLAN changes are needed.  The Default Gateway changes to 172.16.1.1.
  3. On the CSS, a new IP address is assigned to the inside circuit: 172.16.1.1.  It will be the Default Gateway address for the Bugzilla group.
  4. Service rules for the Bugzilla servers are updated to reflect their new IP addresses.
Clients on the outside don't see a thing -- they are still happily talking to the CSS via virtual IP 198.41.30.X. However, on the inside, Bugzilla and "www" can now talk to each other using their load-balanced virtual IP 198.41.30.X since the CSS must be used to route all traffic between them.  If one node fails, the CSS continues to use the remaining nodes, and service remains functional for inside clients too.
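One way to double-check an addressing plan like this before renumbering is a small script that confirms each content group sits in its own subnet and that no two groups overlap. This is a sketch: the `GROUPS` layout and the `check_plan` name are hypothetical, the third www address (172.16.0.8) and the Bugzilla addresses are made up for illustration, and the script only checks addresses; it cannot see the CSS configuration itself.

```python
import ipaddress
from itertools import combinations

# Hypothetical addressing plan modelled on the www / Bugzilla example.
GROUPS = {
    "www": {"gateway": "172.16.0.1/24",
            "servers": ["172.16.0.6", "172.16.0.7", "172.16.0.8"]},
    "bugzilla": {"gateway": "172.16.1.1/24",
                 "servers": ["172.16.1.15", "172.16.1.16"]},
}


def check_plan(groups):
    """Return a list of problems. An empty list means every group sits in
    its own subnet, so inter-group traffic has to go through the CSS."""
    problems = []
    nets = {}
    for name, group in groups.items():
        # The gateway address plus prefix defines the group's subnet.
        net = ipaddress.ip_interface(group["gateway"]).network
        nets[name] = net
        for server in group["servers"]:
            if ipaddress.ip_address(server) not in net:
                problems.append(f"{name}: {server} is outside {net}")
    for (a, net_a), (b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} and {b} share address space ({net_a}, {net_b})")
    return problems


print(check_plan(GROUPS))  # []
```

Running `check_plan` on the old layout, where Bugzilla also lived in 172.16.0.0/24 behind 172.16.0.1, reports the overlap that let replies bypass the CSS.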

Friday, February 15, 2013

Big Server Move Reason #4: Big Savings

This is the final part of a blog series on why Eclipse.org moved to a new datacenter.

See Also: Reason #1: Bigger Pipe
See Also: Reason #2: Big Power
See Also: Reason #3: Big Cooling

Reason #4: Big Savings

The new colo facility was eager to have our business.  Very eager.  They kept sweetening the pot until it was practically impossible to say 'no'.  The end result of this move: more bandwidth, more AC power, better cooling, more cabinet space, and a lower monthly bill for the Foundation.

What's there not to like?

Thursday, February 14, 2013

Big Server Move Reason #3: Big Cooling

This is part of a blog series on why Eclipse.org moved to a new datacenter.

See Also: Reason #1: Bigger Pipe
See Also: Reason #2: Big Power

Reason #3: Big Cooling

The by-product of consumed electricity is heat -- lots of it.  We felt we had outgrown our previous location since our cabinet temperatures were very high, even though the cabinets themselves still had vacancies.  In the last six months, we replaced no fewer than eight failed hard drives, all in relatively young servers.  Not an efficient use of our time.

The new facility has cabinets that are not only deeper, but also equipped with large chimneys which are ducted into the facility's air return (the ceiling).  Hot exhaust air is literally sucked out of our cabinets, drawing cool air from the perforated floor tiles in front.  Wayne's blog post has some neat pictures.

The result is a set of hot chimneys, cool servers and remarkably uniform temperatures inside the cabinets. 

Next up: Reason #4: Big Savings

Wednesday, February 13, 2013

Big Server Move Reason #2: Big Power

See Also: Reason #1: Bigger Pipe

This is part of a blog series on why Eclipse.org moved to a new datacenter.

Reason #2: Big Power

Today, small servers can deliver greater computational power than the bigger servers of only a few years ago.  But they are power-hungry: 700W to 1000W power supplies in small 1U servers mean server cabinets now require plenty of power distribution units (PDUs).

Since efficiency drops as AC current increases, it became clear that North America's standard 120V AC power was an inefficient use of yesteryear's technology (the rest of the world figured this out long ago).  As DC power is simply not there yet for colocation, 208V 3-phase AC was the way to go.  Not only do servers consume a bit less power at that higher voltage, but our new PDU bars also have built-in current monitors that display instant power usage on each phase, which can be graphed remotely thanks to SNMP.
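The efficiency argument comes down to current: for the same power, a higher voltage means less current, and resistive (I²R) losses in wiring grow with the square of the current. Here is a back-of-the-envelope Python sketch, assuming each server's power supply draws its full rating, a power factor of 1, and servers fed line-to-line at 208V. It covers wiring losses only, not the power-supply efficiency gain mentioned above, and the `amps` helper is hypothetical.

```python
def amps(watts, volts, power_factor=1.0):
    """Current for a single-phase load: I = P / (V * PF)."""
    return watts / (volts * power_factor)


for psu_watts in (700, 1000):
    i_120 = amps(psu_watts, 120)
    i_208 = amps(psu_watts, 208)
    print(f"{psu_watts} W: {i_120:.1f} A at 120 V, {i_208:.1f} A at 208 V")
# 700 W: 5.8 A at 120 V, 3.4 A at 208 V
# 1000 W: 8.3 A at 120 V, 4.8 A at 208 V

# Same load, same wiring: I^2 * R losses scale by (120 / 208) ** 2.
print(f"relative wiring loss at 208 V: {(120 / 208) ** 2:.2f}")  # 0.33
```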

Additionally, the new facility is providing us with A+B redundant power circuits that we must keep under 40% capacity.  This allows for the total loss of one circuit while still remaining within acceptable loads (below 80%) on the remaining circuit.
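The 40% rule is simple arithmetic: if one circuit fails, its load shifts to the surviving circuit, so two circuits each kept under 40% can never push the survivor to 80%. A minimal Python sketch, assuming identically rated A and B circuits and dual-corded servers whose entire load moves to the remaining feed (the function names are hypothetical):

```python
def surviving_load(a_pct, b_pct):
    """Load on the remaining circuit, in percent of its rating, after the
    other circuit fails and its share of the load moves over."""
    return a_pct + b_pct


def meets_policy(a_pct, b_pct, per_circuit_limit=40, failover_limit=80):
    """Both circuits under the per-circuit limit, survivor under the failover limit."""
    return (a_pct < per_circuit_limit and b_pct < per_circuit_limit
            and surviving_load(a_pct, b_pct) < failover_limit)


print(surviving_load(39, 39), meets_policy(39, 39))  # 78 True
print(surviving_load(50, 35), meets_policy(50, 35))  # 85 False
```

Because each circuit is held strictly under 40%, the failover check can never fail on its own; it is kept explicit to mirror the two limits stated above.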

Next up: Reason #3: Big Cooling

Big Server Move Reason #1: Bigger Pipe

As you may know, last weekend we moved all our servers into a new Data Centre.  Since any move is disruptive, I thought I'd start a short series to outline the reasons why we did what we did.

Although I'm numbering the reasons, in reality they are in no particular order.


Reason #1: Bigger Pipe

The new facility was offering us more bandwidth -- 60 Mbps more.  Going from 140 Mbps to today's 200 Mbps is a substantial jump.


We monitor our download throughput from a server at the OSU OSL, in Portland, Oregon.  In the picture below, it's clear that this week, even at the busiest of times our download speed has improved.  You can also see the flatline above Week 06, where our download speed went to absolute zero while we moved.
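The post does not say how the OSU OSL probe works. One simple approach, sketched here in Python under that assumption, is to time a full download of a test file at regular intervals and graph the result. The `probe` and `mbps` helpers are hypothetical, and you would point `probe` at a large static file you are allowed to fetch repeatedly.

```python
import time
import urllib.request


def mbps(num_bytes, seconds):
    """Convert bytes transferred in a given time to megabits per second."""
    return num_bytes * 8 / seconds / 1_000_000


def probe(url, chunk_size=64 * 1024):
    """Download url once and return the average throughput in Mbps."""
    total = 0
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=60) as response:
        # Read in chunks so large files never sit in memory all at once.
        while chunk := response.read(chunk_size):
            total += len(chunk)
    elapsed = time.monotonic() - start
    return mbps(total, elapsed)
```

For example, `probe("https://download.example.org/testfile.bin")` (a placeholder URL) returns one sample; running it from cron and feeding the samples to a graphing tool gives weekly and yearly graphs like the ones described in this post.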



On the yearly graph, you can see how download performance has improved since last year.  In August, we lost the OSU OSL server, so monitoring flatlined while it was brought back into service.


Next up: Reason #2: Big Power

Monday, February 04, 2013

Eclipse.org is moving next Saturday, Feb. 9 2013!

As you may have heard, we're picking up the entire Eclipse.org server infrastructure and moving it to a new data centre, 20 minutes down the road from the existing location.

The new facility offers better cooling, more bandwidth, more rack space and more AC power for a lower monthly bill to the Foundation.  How can we possibly say no?

This means that on the morning of Saturday, February 9th (Eastern Time), you'll be greeted with a "we're moving" page to remind you that our servers are sitting in a truck somewhere.  We're currently planning everything involved so that downtime will be kept to a minimum.

We appreciate your understanding and your patience while we move to a newer, modern facility that will allow Eclipse to continue its growth... for years to come!

Tuesday, October 30, 2012

EclipseCon Europe was amazing

Last week I was at EclipseCon Europe, and now that the dust has finally settled, I thought I'd share some highlights of the event.


Technical Track
I'm a server guy, not a developer, so I can't comment much on the track.  However, as I tend to talk to just about anyone while I'm there, the feedback I got was very positive.  I heard the talks were great, and most people seemed to appreciate being introduced to different technologies that they can investigate for use in their Eclipse products or work environment.

For a server guy like myself, there wasn't as much compelling content as there was in North America in March with the ALM track.  But isn't that why there are two conferences to attend?


Networking
Most of the attendees I have spoken to got real value out of connecting with the actual developers and integrators of Eclipse-based technologies.  Something about "getting knowledge from the source", if ever there was a pun  :)


As webmaster@eclipse.org
It's always great to sit down with committers and community members to discuss issues around the Eclipse.org infra, be it Sonar with Mickael Istria, Gyrex with Gunnar Wagankn Wagenkenc -- Gunnar, Community with Lars Vogel (I'm not old!), Hudson with Ed Willink, or just lending a helping hand to those people who don't normally get webmaster support in their timezone  :)


Wifi
Wifi seems to be a recurring issue.  As someone pointed out, the Forum's wifi is likely more than adequate for 362 days out of the year.  For the remaining three, our atypical population seems to destroy even the best of intentions.  I did work closely with the on-site staff, and I'll be recommending some improvements for next year.  We'll see how it pans out.

One thing is for sure -- 1992 called and wants those 10Mbps hubs from the Power-Up Lounge back  :-)


Newcomers
The number of first-time EclipseCon attendees is always staggering.  However, I was even more impressed to meet many EclipseCon newcomers at the Nestor bar after the day's schedule, despite their being immersed in a sea of unknown faces.  It was great to meet everyone.


Fun
Okay, let's face it: EclipseCon is always a fun time.  This year, we had the Nestor bar with its usual cast of characters; the band, who rocked us until they were out of songs to play; the Plug-Ins trio ... you simply had to be there; the circus, who dazzled us with their flame- and object-throwing acts; the contests, none of which a civilian like myself could understand...
... and then there was the Scout crew with their cool Legos!  LEGOS!


Yet another amazing conference in the history books.  Props to Ralph, Anne and their respective crews for putting on such a great event!

Tuesday, October 02, 2012

Optimizing www.eclipse.org over the years

Since I took over the controls of the eclipse.org servers in October 2004 (hey, that was eight years ago yesterday!), our main website, www.eclipse.org, has been optimized along the way to support the tremendous and steady growth in traffic.  I thought I'd share some of those optimizations:

1. Before the Eclipse Foundation, www.eclipse.org was one single server.  There was no scaling it, and there was no fault tolerance.


2. With new hardware, my first iteration of the "improved" www.eclipse.org consisted of four servers: one NFS/MySQL backend and three front-end nodes running Apache.  These front-end nodes also served everything else at eclipse.org, including CVS, Bugzilla, email and all our other websites.  A Cisco load balancer was used to direct traffic, and site files were served directly from NFS.


3. Years later, new hardware brought a second iteration: segregation of the services.  Having Bugzilla, dev, wiki, CVS and www on the same set of servers was not efficient, and thanks to virtualization, www.eclipse.org was hosted on three virtual servers.  Data was still served from NFS.


4. Since NFS was introducing some I/O latency, we began publishing the website files to the local disks of the servers.


5. We enabled mod_gzip and mod_deflate, and added aggressive cache headers to reduce payload and reduce round trips.


6. With new hardware again, NFS and MySQL were separated on the backend.  This gave MySQL much-needed breathing room.


7. We shortened the local directory path from a long /path/to/the/www.eclipse.org/website/html to a much shorter /site/.  This reduces stat() calls on the filesystem and reduces I/O overhead.


8. At the same time as #7, we also set Apache's AllowOverride to None so that Apache no longer looks for a .htaccess file in each and every directory leading up to the requested file, cutting I/O calls drastically.
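Items 7 and 8 both come down to how many directories Apache touches per request. With overrides enabled, Apache looks for a .htaccess file in every directory from / down to the one holding the requested file, so the path length directly sets the number of lookups. A rough Python sketch, assuming overrides are enabled all the way from / (the `htaccess_candidates` helper and the index.php file name are hypothetical, and the long path is the illustrative one from item 7):

```python
from pathlib import PurePosixPath


def htaccess_candidates(requested_file):
    """Return the .htaccess paths Apache would check, top-down, for a file."""
    directory = PurePosixPath(requested_file).parent
    # .parents runs from the nearest parent up to "/"; reverse it to go top-down.
    dirs = list(directory.parents)[::-1] + [directory]
    return [str(d / ".htaccess") for d in dirs]


long_path = "/path/to/the/www.eclipse.org/website/html/index.php"
short_path = "/site/index.php"
print(len(htaccess_candidates(long_path)))  # 7
print(htaccess_candidates(short_path))  # ['/.htaccess', '/site/.htaccess']
```

Shortening the path cuts seven lookups to two; setting AllowOverride to None, as in item 8, skips them altogether.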

What's next?  SPDY?  Varnish? More hardware?  Magic?   With any luck, yes to all!