WonderProxy Blog

June 22, 2011

Increasing Max-Connection time for VPN

Filed under: Uncategorized — Paul Reinheimer @ 4:13 pm

A few weeks ago we launched a new VPN Service aimed at workers on the go. Since launch we’ve improved the service twice (without touching the cost). We’ve adjusted the service so that everyone has access to both a North American server and a European one, and we’ve increased the max connection time from 4 to 12 hours.

The original four-hour limit stems from how our customers were originally using the VPN service: to test their websites, including Flash and Silverlight content. For that use case, a four-hour limit was reasonable.

Since launching the VPN service on its own, we’ve received an email as well as a comment or two from friends about that limit. We weren’t sold on increasing it (the limit has other repercussions) until our friend Helgi at orchestra.io pointed out:

But here is my main problem, the VPN disconnected and everything I had running automagically connected on an unprotected network and thus negating the point of a VPN – This could happen while I go to the toilet or nip out for lunch/coffee and thus exposing my data longer than I’d want (e.g. until I come back).

Leaving our customers with an insecure connection, especially one where they might not be in a position to notice the change, is definitely a problem. Accordingly, we’ve now raised the maximum connection time to 12 hours.


April 26, 2011

Monitoring the WonderProxy Network

Filed under: Uncategorized — Will Roberts @ 12:53 pm

It’s important to know the state of your network and to be quickly informed when something goes wrong. There are a number of independent pieces on each machine that ensure that our proxies work correctly, and we need to be able to make sure that they’re all working correctly. We currently use three tools to monitor our network: Cacti, Nagios, and Smokeping.


We’ve been using Cacti for quite a while; it provides us with historical information about our proxies: CPU, load, RAM, and bandwidth. This data lets us see how well each proxy is handling customer load, and lets us plan upgrades when necessary. Squid isn’t a very CPU-intensive process, so even when our proxies are shuttling data at 1MB/s in and out, the CPU isn’t particularly taxed.

Nagios provides near real-time updates about the status of our network; it is the main source of status updates for our proxies. The information provided by Nagios is used to update the network status bar on our main page. We believe that our customers should be able to easily determine if something’s not working on their end or our end, so our main page is updated every minute with the current status of our proxies.

Our checking is currently centralized on one machine, so whether it can reach a proxy and properly authenticate may unfortunately not line up with what our customers see. It does, however, easily catch the cases where a machine is down or unreachable due to network issues at our host. There are also times when our monitoring machine decides it’s not happy and stops resolving hostnames, which really makes our front page look sad. This doesn’t happen often (four times total, I believe), and since the machine is identical to one of our proxies with the same host, I’m at a bit of a loss as to what’s going wrong. If it happens again and I’m able to isolate the problem, you can bet there will be a post about it!

When we first started using Nagios, I installed the Debian package nagios-statd-server on all the proxies, to be periodically interrogated by the client on the monitoring machine. This worked fine for the most part, but every now and then the server process on a proxy would hang and stop answering requests. As I upgraded our proxies to Debian Squeeze the problem got worse, and not being familiar with Python I didn’t really have any idea what could be going wrong, so I filed a bug against the Debian package since I couldn’t find an upstream.

Having the monitoring process regularly hang on our proxies wasn’t really sustainable, so I went searching for another client/server program for Nagios and ended up with nagios-nrpe-server, which directly executes standard Nagios plugins. It has worked flawlessly since we switched many months ago. This appears to be the preferred method of retrieving server statuses, so I’m not sure how I stumbled across nagios-statd-server in the first place.
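For reference, nagios-nrpe-server maps short check names to standard plugin invocations in its configuration file. Here is a minimal sketch; the thresholds and plugin paths are illustrative examples, not our actual configuration, and on Debian these lines would live in /etc/nagios/nrpe.cfg:

```shell
# Write a couple of illustrative NRPE command definitions. The thresholds
# and plugin paths are examples only; on a real Debian proxy these lines
# would go in /etc/nagios/nrpe.cfg rather than a local file.
cat > nrpe_local.cfg <<'EOF'
command[check_load]=/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
EOF

# The monitoring host then requests a check by name, e.g.:
#   check_nrpe -H proxy.example.com -c check_load
```

The monitoring machine never runs arbitrary commands; it can only invoke the commands the proxy's config names, which is part of why this setup is the usual one.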

Smokeping provides us with a visualization of the latency between our monitoring host and our proxies around the world. We expect latency to grow with the distance from the monitor to the proxy, but it doesn’t follow that packet loss should grow with it. Smokeping shows us which proxies are likely experiencing network issues, based on packet loss and spikes in latency. Unfortunately, some of our hosts have a base level of packet loss, so the graphs aren’t always clear cut.

For most of our proxies, the graph above would be a relatively flat green line, but every now and again one of them has a bad day and we get colors!

March 30, 2011

Obtaining an Extended Validation SSL Certificate

Filed under: Uncategorized — Paul Reinheimer @ 4:41 pm

We decided to obtain an Extended Validation (EV) SSL certificate for WonderProxy and to run our website entirely over it (no standard http:// pages, just https:// for everything). Despite plenty of regular SSL experience, the process was rather foreign to us. We chose to obtain the certificate through GoDaddy for cost reasons.


  1. Register with GoDaddy and purchase an EV certificate token
  2. Flip over to their Certificate system, use the token to initiate a request
  3. Do the fun bits with OpenSSL to generate a Certificate Signing Request
  4. Hand that data off to GoDaddy

    Now this is the part where I thought the extra fees I was paying for the certificate would come into play, with GoDaddy’s team leaping into action to research my request. Not so much. What actually happens is that your own highly paid lawyers or accountants leap into action, and bill you by the minute.

  5. Receive instructions from GoDaddy detailing the steps your lawyer or registered accountant needs to follow. You need either a legal or an accounting opinion about the validity of your company and its registration. The opinion letter has eight key elements:
    1. Your corporation is a valid, active, legal entity.
    2. You conduct business under this corporate name, and it is duly registered with the appropriate government agency
    3. The person signing & submitting the request is authorized to do so on behalf of the company
    4. The person approving the request is also authorized to do so (these were both me, it’s a small company)
    5. The company has a physical place of business, and the letter states that address
    6. The company has a phone number, and the letter states that number
    7. The company has an active bank account
    8. The company owns the domain in question

    Item 7 there caused us a few issues. Because the official Quebec registrar was closed, we hadn’t obtained a Quebec registration: we were registered federally and had a provincial tax number, just not an official enterprise number. Without that enterprise number we were unable to open a bank account (or verify our PayPal account), so several things were delayed, all for the want of a number.

  6. Submit opinion letter to GoDaddy
  7. Fill out a few forms from GoDaddy confirming the request, including the signer and approver, file with GoDaddy
  8. GoDaddy phones the lawyer who issued the opinion letter (using a phone number from some sort of lawyer registry; in the US this would be the Bar) to confirm the information and that they did in fact issue the opinion letter
  9. GoDaddy phones the signer and possibly the approver (I was both people, so there was only one phone call) to confirm the details on their forms
  10. An internal GoDaddy “Audit” department reviews the data (this isn’t the person you deal with while completing the steps)
  11. Certificate Issued

The total cost was probably ~$400 in professional services and GoDaddy fees. Our goal, clearly, is for that cost to be outweighed by the level of trust and security the average user places in an EV certificate. Now that we’re offering dedicated VPN plans, protecting our users’ privacy from start to finish is even more important.

March 21, 2011

Usability Testing

Filed under: Uncategorized — Paul Reinheimer @ 1:21 pm

Several weeks ago, while attending the fantastic Webstock conference, I also attended a full-day tutorial on usability testing by Christine Perfetti of Perfetti Media. The tutorial was fantastic: I’ve been interested in usability for years, and my shelf holds several books on the subject (Designing Web Usability, Prioritizing Web Usability, and Don’t Make Me Think), but I still learned a lot.

During the tutorial we performed actual usability tests on our fellow attendees, using our own websites. One of the tasks I assigned to my victims/volunteers was simple (or so I thought):

You work for a small company with a total of 5 developers and testers and require access to seven servers, mostly in the US but one or two in Europe would be great. Which plan will meet your needs?

Having developed the site, this is a question I can answer in seconds. It took the testers, however, several frustrating tries to actually find the information; generally they found it by exhausting all other options. The problem, I discovered, was the text of the link to the details page:

The testers read that link as a “checkout” link, rather than a way to get more information. Thinking about it more critically, it was rather silly. The sequence of links you follow to actually purchase reads: Sign Up -> Details -> Buy Now. The intermediate step seems like a step in the wrong direction. Placing Sign Up on the front page is a great call to action, but it’s not what the link does, and it hides critical details from enquiring minds.

Accordingly, we’ve now changed the text of the link to “Service Plans” (and tragically lost the snowman which has been with us since the start). This provides a much more sensible series of links “Service Plans” -> Details -> Buy Now. The Details button itself is still quite ugly, but that’s a problem for a future post.

I don’t expect this to have a radical and immediate effect on sales, but I hope I’ve made the site less confusing for prospective clients. I’d highly recommend usability testing (and Christine in particular as a trainer/facilitator) to any developer seeking to improve their site.

February 18, 2011

The Problem With Time

Filed under: Uncategorized — Will Roberts @ 5:15 pm

A frequent issue we have with our hosts is that the proxies end up with the incorrect time. This makes billing unreliable and can make tracking down issues difficult, since timestamps in logs will be wrong. The best part is that the problem providers rarely seem to care. Depending on the type of VPS we have with a provider, there are two solutions.

The first option is to bug the provider. If they’re running an NTP daemon then all of the nodes will inherit the correct time and everyone wins. Most of the time we can’t seem to get them to do this though.

The second option is to run our own NTP daemon, which we’re fine with, but this sometimes still requires help from the provider. If our VPS is based on OpenVZ or Virtuozzo, the provider must set an option that allows our node to maintain its own wallclock. If our VPS is based on Xen, however, we can do it ourselves like so:

echo 1 > /proc/sys/xen/independent_wallclock

To make the change permanent, add xen.independent_wallclock = 1 to /etc/sysctl.conf.

We use SNMP to query the time on our machines via a custom OID that returns the current UTC seconds since the epoch. We then compare that number to the same value from our monitoring machine, which is synced to the global NTP servers. Differences of up to 5 seconds either way are considered acceptable, though obviously we’d prefer it were always dead on. We consider an offset of 5-60 seconds a warning state, and anything beyond 60 seconds an error state.
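The comparison itself boils down to an absolute difference and two thresholds. A sketch of the classification logic (the function name is ours; in practice the two epoch values would come from an snmpget against the proxy's custom OID and from date +%s on the monitoring host):

```shell
# Classify a clock offset (in seconds) against the thresholds described
# above: within 5s is acceptable, 5-60s is a warning, beyond 60s an error.
classify_offset() {
  diff=$1
  # Take the absolute value; the offset can be in either direction.
  [ "$diff" -lt 0 ] && diff=$((-diff))
  if [ "$diff" -le 5 ]; then echo OK
  elif [ "$diff" -le 60 ]; then echo WARNING
  else echo CRITICAL
  fi
}

classify_offset 3      # prints OK
classify_offset -42    # prints WARNING
classify_offset 300    # prints CRITICAL
```

The real check simply subtracts the proxy's reported epoch seconds from the monitor's and feeds the difference through logic like this.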

If the time is properly synced and we get a complaint that something wasn’t working for a customer at a given time, I know exactly where to look in the logs on that machine. When the time is wrong, I have to do some mental math to figure out when that proxy thought the error occurred, and if it communicates with another machine whose time is also off, I have to do another calculation to figure out where those logs would be. Each machine in the chain with incorrect time makes it harder to track down what went wrong.

February 9, 2011

Miles per Millisecond, a Look at the WonderProxy Network

Filed under: Uncategorized — Paul Reinheimer @ 2:33 pm

Update: This post has been updated to account for pings being round trip times while distances are only one way. The original post failed to account for this in the last two tables, thanks to Steve for pointing this out.

Update 2: You can now view updated data, with ping times between cities, at our new WonderNetwork site.

Running a global network of servers for GeoIP application testing leaves you with a lot of servers and some interesting questions, and occasionally an interesting combination of the two. I found myself asking whether we could compare ping times to physical distances, to see how efficient the internet is, and to confirm my suspicion that transferring data between Sydney, Australia and Fremont, California would be faster, mile for mile, than transferring between Boston and Fremont. My reasoning was that Australia → United States is one long cable, whereas within the US traffic is switched frequently, which is slower.

First we wrote a script to ping every city in our network from every other city in our network (Boston → New York and New York → Boston are both executed). This took something like ten hours (we’ve since smartened up and now execute the tests in parallel). Our results look something like this:

Ping between cities

Baltimore Brisbane Dallas Fremont Milan Moscow New York Paris Sydney Zurich
Baltimore - 239.38 35.64 79.68 104.70 141.15 174.71 97.71 251.07 114.11
Brisbane 238.32 - 221.60 174.38 350.12 357.21 245.89 335.79 33.45 346.92
Dallas 34.53 220.14 - 44.00 140.00 168.00 36.80 130.94 221.72 127.74
Fremont 79.39 176.07 44.57 - 167.85 214.59 78.43 176.22 184.91 167.16
Milan 104.84 339.54 139.86 170.73 - 67.08 112.17 21.61 344.70 11.17
Moscow 140.82 366.42 169.04 209.00 67.07 - 131.29 56.74 387.09 60.29
New York 174.61 246.87 39.46 77.49 111.60 131.34 - 78.33 319.33 101.03
Paris 97.75 337.47 131.79 175.19 21.29 56.13 77.31 - 348.84 108.06
Sydney 262.50 42.98 222.25 191.96 345.21 376.57 261.24 354.16 - 358.75
Zurich 102.18 346.71 127.59 173.61 11.25 60.32 100.24 108.25 345.05 -

This table shows us what we expected. Sydney is far from most of our other servers, so its ping times are high. Our Fremont server is incredibly well connected through a top-tier provider, so it has good routes. The Baltimore → New York ping may raise your eyebrows; it certainly caught our attention. A quick look at the traceroute shows:

traceroute to newyork.wonderproxy.com (, 30 hops max, 60 byte packets
 1 (  0.551 ms  0.557 ms  0.604 ms
 2  4xe-pc400.vcore1-dc1.balt.gandi.net (  0.606 ms  0.661 ms  0.693 ms
 3  xe3-4-core4-d.paris.gandi.net (  97.342 ms  97.325 ms  97.272 ms
 4  p251-core3-d.paris.gandi.net (  123.541 ms  123.547 ms  123.528 ms
 5  linx.ge1-0.cr01.lhr01.mzima.net (  119.101 ms  119.085 ms  119.017 ms
 6  te0-5.cr1.nyc2.us.packetexchange.net (  181.881 ms  176.377 ms  176.349 ms

Our provider in Baltimore seems to be routing all of their traffic through their central datacenter in Paris, which is rather sub-optimal (we’ve opened a ticket).
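Collecting a full mesh like the table above amounts to enumerating every ordered pair of cities and running the ping from the source side. A sketch, with an abbreviated and purely illustrative city list (the real script would ssh to each source host in turn):

```shell
# Enumerate ordered (source, target) pairs; n cities yield n*(n-1) jobs.
CITIES="baltimore dallas fremont newyork"
for src in $CITIES; do
  for dst in $CITIES; do
    [ "$src" = "$dst" ] && continue
    echo "$src $dst"
    # In the real run each pair becomes something like:
    #   ssh $src.wonderproxy.com "ping -c 10 $dst.wonderproxy.com" &
    # with the trailing & supplying the parallelism mentioned above.
  done
done
```

With four cities this emits twelve jobs; with our full network the pair count is what made the original serial run take ten hours.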

Next we needed to determine how far apart these cities are, that being the other half of the equation. We used the city centre where more specific location information wasn’t available, generated approximate latitude and longitude for every server in our network from publicly available sources, then used the Excel calculation script from Movable Type Scripts to determine distances, and came up with this:

Distance between cities

Baltimore Brisbane Dallas Fremont Milan Moscow New York Paris Sydney Zurich
Baltimore - 9553 1216 2448 4211 4853 171 3819 9845 4122
Brisbane 9553 - 8375 7138 10162 8795 9688 10349 457 10143
Dallas 1216 8375 - 1466 5355 5788 1376 4956 8638 5255
Fremont 2448 7138 1466 - 5983 5914 2564 5596 7481 5857
Milan 4211 10162 5355 5983 - 1429 4040 400 10348 137
Moscow 4853 8795 5788 5914 1429 - 4694 1554 9060 1371
New York 171 9688 1376 2564 4040 4694 - 3648 9993 3952
Paris 3819 10349 4956 5596 400 1554 3648 - 10600 304
Sydney 9845 457 8638 7481 10348 9060 9993 10600 - 10355
Zurich 4122 10143 5255 5857 137 1371 3952 304 10355 -
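The Movable Type script computes great-circle distances with the haversine formula; here is a sketch in awk, using the same approximate city-centre coordinates assumption as the table:

```shell
# Great-circle distance in statute miles via the haversine formula
# (the method behind the Movable Type calculator mentioned above).
haversine() {
  awk -v la1="$1" -v lo1="$2" -v la2="$3" -v lo2="$4" 'BEGIN {
    r = 3.14159265358979 / 180            # degrees to radians
    dlat = (la2 - la1) * r / 2
    dlon = (lo2 - lo1) * r / 2
    a = sin(dlat)^2 + cos(la1 * r) * cos(la2 * r) * sin(dlon)^2
    printf "%.0f\n", 2 * 3959 * atan2(sqrt(a), sqrt(1 - a))  # 3959 mi earth radius
  }'
}

haversine 48.8566 2.3522 47.3769 8.5417   # Paris -> Zurich
```

With approximate city-centre coordinates this puts Paris → Zurich at roughly 303 miles, within a mile or two of the 304 in the table.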

Finally, we merged the two tables, did some math and came up with this:

Miles per Millisecond

  Baltimore Brisbane Dallas Fremont Milan Moscow New York Paris Sydney Zurich
Baltimore - 79.81 68.24 61.44 80.44 68.76 1.96 78.17 78.43 72.25
Brisbane 80.17 - 75.59 81.87 58.05 49.24 78.80 61.64 27.33 58.48
Dallas 70.42 76.09 - 66.63 76.50 68.90 74.78 75.70 77.92 82.28
Fremont 61.67 81.08 65.78 - 71.29 55.12 65.38 63.51 80.92 70.08
Milan 80.33 59.86 76.58 70.09 - 42.61 72.03 37.02 60.04 24.52
Moscow 68.92 48.00 68.48 56.59 42.61 - 71.51 54.78 46.81 45.48
New York 1.96 78.49 69.75 66.18 72.40 71.48 - 93.14 62.59 78.24
Paris 78.14 61.33 75.21 63.88 37.58 55.37 94.38 - 60.77 5.63
Sydney 75.01 21.26 77.73 77.94 59.95 48.12 76.50 59.86 - 57.73
Zurich 80.68 58.51 82.37 67.47 24.36 45.46 78.85 5.62 60.02 -

Here things start to look a bit better for several of the connections. In the first chart, which looked at raw ping times, Sydney fared poorly. Here, accounting for the extreme distance between Sydney and the majority of our network, we see that, mile for mile, it’s actually doing quite well. Other links, like Paris → Milan, that looked quite good previously are now exposed as rather inefficient.

While we’re examining the efficiency of our network, this is the chart we’ll use. Simple ping times tell a story (we use Smokeping to monitor consistency and packet loss), but not the whole story. From this we get a more realistic view of how our connections are performing.
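The "some math" in the merge is a single division, remembering that ping is a round trip while distance is one way. A sketch using the Baltimore → Brisbane figures from the tables above:

```shell
# Miles per millisecond: one-way distance divided by half the round-trip ping.
miles_per_ms() {
  awk -v miles="$1" -v rtt="$2" 'BEGIN { printf "%.2f\n", miles / (rtt / 2) }'
}

miles_per_ms 9553 239.38   # Baltimore -> Brisbane: prints 79.81, matching the table
```

Forgetting the division by two is exactly the error the update at the top of this post corrects.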

One last experiment

Networks are fast, but just how fast? In a vacuum, light travels roughly 186,000 miles/second, or 186 miles/millisecond. Light travels more slowly through fiber optics, on average about 35% slower, which gives us ~120.9 miles/millisecond. Let’s look at the speed of pings across our network as a percentage of that theoretical maximum:

Network speed as a percentage of the speed of light

  Baltimore Brisbane Dallas Fremont Milan Moscow New York Paris Sydney Zurich
Baltimore - 66.02 56.44 50.82 66.53 56.88 1.62 64.66 64.87 59.76
Brisbane 66.31 - 62.52 67.72 48.01 40.73 65.18 50.98 22.60 48.37
Dallas 58.25 62.93 - 55.12 63.27 56.99 61.85 62.61 64.45 68.05
Fremont 51.01 67.06 54.41 - 58.96 45.59 54.08 52.53 66.93 57.96
Milan 66.44 49.51 63.34 57.97 - 35.24 59.58 30.62 49.66 20.28
Moscow 57.01 39.71 56.64 46.81 35.25 - 59.15 45.31 38.72 37.62
New York 1.62 64.92 57.69 54.74 59.88 59.12 - 77.04 51.77 64.71
Paris 64.63 50.73 62.21 52.84 31.08 45.80 78.06 - 50.27 4.65
Sydney 62.04 17.59 64.29 64.47 49.59 39.80 63.28 49.51 - 47.75
Zurich 66.74 48.40 68.13 55.81 20.15 37.60 65.22 4.65 49.65 -

For this comparison to be fair, cables would need to run directly between cities, which is clearly not the case. We still think it’s interesting: hitting 68% of the theoretical best speed is quite remarkable.
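The percentage table is just the previous table divided by that ~120.9 miles/millisecond ceiling. A sketch, again with the Baltimore → Brisbane figure:

```shell
# Express a link's miles-per-millisecond figure as a percentage of the
# ~120.9 miles/ms theoretical ceiling for light travelling through fiber.
pct_of_light() {
  awk -v mpms="$1" 'BEGIN { printf "%.2f\n", mpms / 120.9 * 100 }'
}

pct_of_light 79.81   # Baltimore -> Brisbane: prints 66.01
```

The table shows 66.02 for that cell because it was computed from unrounded inputs, so the last digit differs.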

January 28, 2011

Xen, OpenVZ, Virtuozzo, and You!

Filed under: Uncategorized — Will Roberts @ 11:32 pm

There are three main virtualization technologies out there being used to provide VPSs: Xen, OpenVZ, and Virtuozzo. Of our 38 proxies, 12 are on Xen, 12 are on OpenVZ, 10 are on Virtuozzo, 3 are physical machines, and we have a single VMware machine.

Physical machines give us absolute control over all the settings. The only limitation on the software we can run is what’s available in the Debian repositories and what I’m willing to compile. The other benefit is that no one else has access to the box, so there’s no concern about hostname and resolver settings being changed, which can be an issue with our VPSs. The hostname of a box isn’t overly important, except when I’ve got a terminal open and need to know where I am on our network (sorry, I don’t actually know where vz2542 is). We run our own DNS resolver on each box so that it gets responses similar to other boxes in the geographic area; our resolv.conf is fairly simple and points at the local resolver. If dedicated servers weren’t so expensive, we’d use them everywhere.

Xen is our preferred virtualization technology, mainly because it also allows us to run an Openswan-based IPSec VPN. Xen also allows us to opt in to an independent wallclock, so if the host’s clock isn’t properly synced we can fix ours without their help. When a Xen VPS is rebooted, some files are automatically overwritten to ensure the VPS will work properly; in our case this actually changes our configuration and is very undesirable. Thankfully we can set the immutable bit, just as on a physical machine, to prevent the change without any side effects.

We can use chattr to set the immutable bit so that files aren’t accidentally modified:

chattr +i /etc/hostname /etc/resolv.conf

OpenVZ doesn’t currently allow IPSec VPNs using Openswan, though it appears that will change in the future; there’s no telling how long the change will take to show up on production systems. As with Xen, we can use the immutable bit to protect files from undesired modification without side effects. Unlike with Xen, we do need to ask the provider for help if our clock is wrong. OpenVZ tends to be popular among VPS providers because it is free; most customers don’t care which technology is being used and will buy based on price.

Virtuozzo similarly doesn’t allow IPSec VPNs, and for the longest time I didn’t realize there was an actual difference between OpenVZ and Virtuozzo (oops!). The more unfortunate “feature”, which I found only after I’d already made the change on all our proxies, is that if you have the immutable bit set on certain important files (like we do), the VPS will refuse to boot! Thankfully only one proxy was taken offline by this mistake, and even more fortunately, Virtuozzo provides a fairly decent VPS management panel that lets you reboot into a recovery mode and “fix” the problem. The control panel can be accessed by going to https://hostname.example.com:4643/vz/cp and entering the root username/password. I haven’t had to use it for anything other than fixing my mess with the immutable bit, but it’s nice to know it’s there.

Since we’re incredibly geographically sensitive, we don’t always get to choose our hosting providers based on their virtualization technology. When we do have the choice, we strongly prefer Xen over the competition. Between OpenVZ and Virtuozzo it generally comes down to other criteria like cost.

January 20, 2011

Improving Site Performance

Filed under: Uncategorized — Paul Reinheimer @ 9:48 am

Our site hasn’t really been our focus over the past months; instead I’ve been concentrating on acquiring new network locations, while Will has been improving our server setup and maintenance architecture (we’ve blogged previously about Setting up Proxy Servers and Managing 30+ Servers). More recently we’ve taken a harder look at how the site performs, both for us and for our users, and found it lacking.

Examining our main page’s performance with XHGui quickly revealed that a considerable amount of time was being spent generating the information displayed in the footer (server list, server country list, and proxied traffic). This data was shuffled off to APC’s user storage mechanism, removing the work from the average page load entirely. Google Webmaster Tools still reported a startlingly high average page-load time:

Google Webmaster Tools analysis of site performance

This was quite surprising, as the site seemed pretty snappy overall. Further investigation showed that server-specific pages loaded much more slowly (3-5 seconds!). Since our goal is to provide proxies for GeoIP testing, having server-specific pages load slowly is sub-optimal. Looking at the pages with YSlow and Page Speed revealed that the real culprit was the embedded Google Map. Switching to a static map greatly reduced page-load time (to ~800ms). This also reduced functionality, as the map is no longer dynamic, but we plan to switch to a combined static & dynamic system in the future.

If you’re interested in front end performance, High Performance Web Sites and Even Faster Web Sites are invaluable.

Reading through the suggestions from YSlow a bit more closely, then diving into the Apache documentation, I also found a few quick gains by configuring our web server to do a bit more work for us:

  ExpiresActive On
  ExpiresByType image/png "access plus 1 month"
  ExpiresByType text/css "access plus 1 month"
  ExpiresByType image/jpeg "access plus 1 month"
  <Location />
    SetOutputFilter DEFLATE
    BrowserMatch ^Mozilla/4 gzip-only-text/html
    BrowserMatch ^Mozilla/4\.0[678] no-gzip
    BrowserMatch \bMSI[E] !no-gzip !gzip-only-text/html
    SetEnvIfNoCase Request_URI .(?:gif|jpe?g|png)$ no-gzip dont-vary
  </Location>

YSlow’s default ruleset will tell you to turn off ETags. If you’re running a single web server, this is bad advice; you may want to select the Small Site or Blog ruleset to get the most out of the tool. Moving forward we may re-organize our JavaScript to make expiry rules for it easy (we can’t set a distant expiry on all JavaScript files, as our live status bar relies on one of them); for now we’ll leave them as is. We’re happy with our new scores:

Screenshot showing our A grade with YSlow

YSlow - A Grade

Screenshot showing our grade of 93 within Page Speed

Page Speed - 93/100

Having sorted out the low-hanging fruit on the front end, I looked at the account pages and the administrative tools we use. Performance there was abysmal, with some pages taking at least 10 seconds to load. The worst-performing pages were those displaying usage statistics, the very worst being the ones that display aggregate statistics for all users. Our usage table, built from the Squid logs, has nearly a million rows; despite being indexed, that’s still a lot of data to aggregate and sum.

With an eye toward improving performance, I decided to build some summary tables. The first aggregates usage by user, by server, by day. This summary table is roughly 1/23rd the size of the original usage table, which makes sense since it rolls the 24 hourly reports up into one row. It’s considerably quicker to query, and I started rolling it out to various portions of the admin section immediately.

Table indicating much higher performance using summary tables

While these numbers are still rather pathetic, remember that these are admin actions, not forward-facing pages. Optimizing further here would be folly; the time is much better spent on outward-facing pages read by users and search engines alike. The improvement simply makes managing the system speedier.

Knowing that the summary table is going to be useful, we need to keep it up to date. To accomplish this we run the following query after every log rotation (a process described in our post Squid log parsing for proxy billing). Luckily I’ve got friends at Percona (authors of the MySQL Performance Blog) who gave me a hand crafting it:

INSERT INTO sum_usage_daily
SELECT
	`user_id`,
	`server_id`,
	date(`timestamp`) AS `date`,
	sum(`bytes`) AS `bytesSum`
FROM `usage`  -- the Squid-log usage table; table and column names illustrative
WHERE `server_id` IS NOT NULL
	AND timestamp BETWEEN date_sub(date(NOW()), INTERVAL 2 DAY) AND date(NOW())
GROUP BY `user_id`, `server_id`, date(`timestamp`)
ON DUPLICATE KEY UPDATE `bytesSum` = VALUES(`bytesSum`);

Note: ON DUPLICATE KEY UPDATE had numerous bugs prior to MySQL 5.0.38, ensure you’re on a recent version before mimicking this query with your own data.

This query looks at the past two days of traffic, inserting new records or updating existing ones where they exist. The AND timestamp BETWEEN date_sub(date(NOW()), INTERVAL 2 DAY) AND date(NOW()) clause ensures we’re looking at full days rather than the last 48 hours (the latter would produce incorrect summaries for whole days). This keeps the summary table up to date throughout the day and ensures that yesterday’s data is correct as well.

My only regret was changing some of the column names in the summary table. While “date” represents the contents of the column better than “timestamp”, it did mean that column references in code had to be changed rather than just switching the table reference. Other than that the conversion has been quite quick and painless.

Having reined in the performance of the site, it’s time to look at adding new site features, and a few new products. More on those later.

January 14, 2011

HOWTO: Managing 30+ Servers

Filed under: Uncategorized — Will Roberts @ 2:19 pm

When we started out we only had a handful of servers, so I was doing each setup by hand and manually applying each change to each server’s configuration. That meant I was spending an hour or more on each new setup, then probably 30-45 minutes per change depending on its complexity. Setup time doesn’t really have a scalability issue, though it does mean I can’t be doing something else at the same time. The bigger issue is rolling out a change to all our active servers: a 5-minute change suddenly becomes a 2.5-hour chore when you’ve got 30 servers.

After about 15 servers we reached a tipping point, and I realized I was going to need a more automated mechanism for setting up servers and rolling out changes. I don’t know all the ins and outs of Bourne shell scripting, but I’ve managed to create some pretty creative scripts over time, so that’s where I started. Pushing trivial updates to existing machines is now a matter of running a script once per server, and since we know all our hostnames we can just loop over the hosts, running it on each in turn. I’ve toyed with the idea of running the scripts in parallel (there shouldn’t be an issue), but for the moment I’ve left them serial so I can see the result for each box in turn.


SERVERS=`mysql --host=$MYSQL_HOST -u oursupersecretuser -s --skip-column-names -e "$QUERY" wonder_proxy`

export HOST_STATUS=/home/lilypad/billing/host_status.txt

for i in $SERVERS; do
  # If a master SSH control socket already exists for this host, do nothing.
  if [ -S /home/lilypad/.ssh/master-wproxy@$i:22 ]; then
    echo -n
  # Otherwise start a master tunnel, unless the host is known to be down.
  elif [ `grep -c "$i 2" $HOST_STATUS 2> /dev/null` -eq 0 ]; then
    nohup ssh -MNf $i > /dev/null 2> /dev/null
  fi
  $SCRIPT $i $*
done

The script above is the basic loop I use to run my other scripts on our machines. The MYSQL_HOST variable lets us migrate more easily from one box to another, which has already happened once (and was an absolute pain the first time). The custom query allows this script to be called by other scripts to select only certain portions of our network. Once we have the list of hosts, we ensure that each host has an active SSH master tunnel, or attempt to start one if the host isn’t known to be down. Then the script is executed with the hostname and any extra arguments.

The scripts are all fairly simple, and I tend to reuse/mangle them for other uses as needed, but here’s an example:


scp /data/proxy-setup/ipsec/etc/ipsec.conf $1:~/
ssh $1 sudo cp ipsec.conf /etc/
ssh $1 rm ipsec.conf
ssh $1 sudo /etc/init.d/ipsec restart

Pretty simple, but it’s nice not to repeat those four commands 37 times when I make a tiny change. To push that tiny change I’d just run:

./run_all_vpn.sh ./push_ipsec_conf.sh

The ssh command we use in the first script allows multiple SSH sessions to flow over the same TCP connection. This avoids the cost of the TCP handshake, and of the SSH key exchange, on every run. The flags, as explained by the man page:

-M    Places the ssh client into “master” mode for connection sharing. Multiple -M options places ssh into “master” mode with
      confirmation required before slave connections are accepted. Refer to the description of ControlMaster in ssh_config(5)
      for details.

-N    Do not execute a remote command. This is useful for just forwarding ports (protocol version 2 only).

-f    Requests ssh to go to background just before command execution. This is useful if ssh is going to ask for passwords or
      passphrases, but the user wants it in the background.
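The same multiplexing can also be made the default in the client config, so plain ssh invocations share connections automatically. We haven’t published our exact config, so the fragment below is a hypothetical equivalent: %r, %h and %p expand to the remote user, host and port, matching the ~/.ssh/master-wproxy@host:22 socket path checked in the loop above, and ControlMaster auto creates a master when none exists.

```
# Hypothetical ~/.ssh/config fragment (not our actual config)
Host *
    User wproxy
    ControlMaster auto
    ControlPath ~/.ssh/master-%r@%h:%p
```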

January 11, 2011

HOWTO: Speedy Server Setup

Filed under: Uncategorized — Will Roberts @ 2:06 pm

We tend to expand in bursts, so it’s helpful if I can be configuring multiple servers at once instead of dedicating an hour to one server, then another hour to the next. The most difficult part is removing all the unneeded packages from the boxes; installing and configuring the packages we do want is barely a quarter of the current setup script. Since we deal with so many different hosting providers, producing an image they could create the server from isn’t exactly convenient; it’s been easier to take what they give us and work from there.

The first thing we need to do is set up SSH key access to the new machine so that the rest of the install doesn’t require someone entering passwords. There might be a simpler way, but this is what we’ve got at the moment:

cat /home/lilypad/.ssh/id_rsa.pub | ssh root@$HOST "tee /dev/null > foo; mkdir .ssh 2> /dev/null; chmod 700 .ssh; chmod 600 foo; mv foo .ssh/authorized_keys"

So we pipe the SSH key over the SSH connection, write it to a temporary file, create the .ssh directory, and then move the file into place as authorized_keys. At this point we have easy SSH access to the machine, and we actually maintain active SSH master tunnels to all the machines on the network to reduce the connection lag when running scripts. More on how we do that in my next post.
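To make the plumbing in that one-liner visible, here’s a hypothetical local re-run of the same pipeline, with the ssh hop replaced by a subshell and a fake placeholder key:

```shell
#!/bin/sh
# Hypothetical local demo of the key-install pipeline; the subshell
# plays the role of the remote shell that ssh would start.
mkdir -p /tmp/fakehome
echo "ssh-rsa AAAA...placeholder... lilypad" > /tmp/fakekey.pub

cat /tmp/fakekey.pub | ( cd /tmp/fakehome && \
  tee /dev/null > foo; mkdir .ssh 2> /dev/null; \
  chmod 700 .ssh; chmod 600 foo; mv foo .ssh/authorized_keys )

# the key now sits in the scratch "home" with the right permissions
cat /tmp/fakehome/.ssh/authorized_keys
```

Note that `tee /dev/null > foo` works because tee copies stdin to its stdout as well as to the files listed, and that stdout is redirected into foo.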

Our first step on the new machine is to remove any software we explicitly know we don’t want or that will cause issues for our configuration. Things like Apache get nuked so that they don’t collide with the ports on which we run Squid. Then we update all the software on the box to the newest available versions in Debian 5 (a few of our boxes still start as Debian 4), and then make the transition to Debian 6. At this point we still don’t have any of “our” packages installed, so we start removing unneeded packages with a fairly simple set of rules:

  1. If the package is on our whitelist of known needed packages, leave it.
  2. If the package is on our blacklist of known unneeded packages, remove it.
  3. If removing the package will only remove it and no other packages, remove it.
  4. Ask!

Here’s the part of the script that handles those rules. The packages and packages-blacklist files are just lists of package names.

for i in `dpkg -l | sed -n s/"ii  \([^ ]*\).*"/"\\1"/p`
do
  # Rule 1: whitelisted packages stay
  grep "^$i$" setup/packages > /dev/null
  if [ $? -eq 0 ]; then
    echo KEEPING: $i
    continue
  fi

  # Rule 2: blacklisted packages go
  grep "^$i$" setup/packages-blacklist > /dev/null
  if [ $? -eq 0 ]; then
    echo PURGING: $i
    apt-get -y purge $i
    continue
  fi

  # Never silently remove kernel-related packages
  echo $i | grep -v linux > /dev/null
  if [ $? -ne 0 ]; then
    echo ASKING: $i
    apt-get purge $i
    continue
  fi

  # Rule 3: a removal that takes nothing else with it is safe
  if [ `apt-get -s -qq remove $i | grep ^Remv | wc -l` -eq 1 ]; then
    echo PURGING: $i
    apt-get -y purge $i
    continue
  fi

  # Rule 4: anything left requires a human decision
  echo ASKING: $i
  apt-get purge $i
done
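
The real loop needs a live dpkg and apt to run, but the decision logic itself can be exercised in isolation. Here’s a hypothetical stand-alone sketch, with tiny stand-in files for setup/packages and setup/packages-blacklist; rule 3’s dependency test is represented by a CHECK-DEPS fall-through, since it needs apt:

```shell
#!/bin/sh
# Hypothetical stand-alone version of the decision rules, using
# stand-in list files rather than the real whitelist/blacklist.
printf 'openssh-server\nsquid\n' > keep.txt
printf 'apache2\nexim4\n'        > purge.txt

decide() {
  if grep -q "^$1\$" keep.txt; then
    echo KEEP              # rule 1: whitelisted
  elif grep -q "^$1\$" purge.txt; then
    echo PURGE             # rule 2: blacklisted
  elif echo "$1" | grep -q linux; then
    echo ASK               # kernel packages always get a human
  else
    echo CHECK-DEPS        # rules 3/4 need apt's dependency view
  fi
}

decide squid             # -> KEEP
decide apache2           # -> PURGE
decide linux-image-686   # -> ASK
decide vim               # -> CHECK-DEPS
```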

At this point it’s fairly rare that I get asked whether a package should be removed, since I update the lists any time a new package is encountered. Once that’s done we copy in our custom config files for each package and restart the affected programs as needed. The install runs unattended, takes anywhere from 30 to 60 minutes depending on download speeds and the power of the machine, and I can run several machines at once with little trouble.
