Tuesday, March 30, 2004

distcc optimizations

and how to compile kdelibs from scratch in six minutes


If you don't already know about distcc, I recommend that you check it out. distcc is a tool that sits between make and gcc and sends compile jobs to other computers when they are free, distributing the compile and dramatically decreasing build times. Best of all, it is very easy to set up.

This, of course, leads to the fantastic idea that anyone can create their own little cluster or farm (as it is often referred to) out of their extra old computers that they have sitting about.

Before getting started: alongside distcc there is another tool called ccache, a caching pre-processor for C/C++ compilers, that I won't be discussing here. For all of the tests it was turned off so that distcc's performance could be measured properly, but developers should know about this tool and use it together with distcc for the best results and shortest compile times. There is a link to the homepage at the end of this article.

Farm Groundwork and Setup


As is the normal circle of life for computers in a corporate environment, I was recently lucky enough to go through a whole stack of computers before they were recycled. From the initial lot of forty or so computers I ended up with twelve desktop computers that ranged from 500MHz to 866MHz. The main limit on my choice was that I only had room in my cube for fifteen computers. With that in mind I chose the computers with the best CPUs. Much of the RAM was evened out so that almost all of the final twelve have 256MB. Fast computers with bad components had the bad parts swapped out for good components from the slower machines. Each computer was set up to boot from the CD-ROM and not output errors when booting if there wasn't a keyboard/mouse/monitor. They were also set to turn on when connected to power.

Having enough network administration experience to know better, I labeled all of the computers and the power and network cords that were attached to them. I even found different colored cable for the different areas of my cube. The first label specified the CPU speed and RAM size so that later, when I was given faster computers, finding the slowest machine would be easy. The second label on each machine was the name of the machine, which was one of the many female characters from Shakespeare's plays. On the server side a DHCP server was set up to match each computer with its name and IP for easy diagnosis of problems down the line.

For the operating system I used distccKNOPPIX. distccKNOPPIX is a very small Linux distribution that is 40MB in size and resides on a CD. It does little more than boot, get the machine online and then start the distcc daemon. Because it doesn't use the hard disk at all, preparation of the computers required little more than testing to make sure that they all booted off the CD and could get an IP.

Initially, all twelve computers (plus the build master) were plugged into a hub and switch that I had borrowed from a friend. The build master is a 2.7GHz Linux box with two network cards. The first network card pointed to the Internet and the second card pointed to the build network. This was done to reduce the network latency as much as possible by removing other network traffic. More on this later though.

A note on power and noise: the computers all have on-board components. Any unnecessary PCI cards that were found in the machines were removed. Because nothing is installed on the hard disks, they were set to spin down shortly after the machines are turned on. (I debated just unplugging the hard disks, but wanted to leave the option of installation open for later.) After booting up and after the first compile, when gcc is read off the CD, the CD-ROM also spins down. With no extra components and no spinning CD-ROM or hard disk drives, the noise and heat level in my cube really didn't change any that I could notice (there were of course jokes galore by everyone about saunas and jet planes when I was setting up the system).

Optimizations


Since first obtaining the computers, I have tweaked the system quite a bit. My initial builds of kdelibs with distcc took around 45 minutes, which I was very happy with, but as time went by I discovered a number of ways to improve the compile speed. Today it takes six minutes from start to finish to compile all of kdelibs using seventeen 450MHz-866MHz computers and two 3GHz machines.

localhost/1


Depending on how large and how fast your farm is, the speed of your network and the capability of your build box, playing around with the localhost entry in the host list can be well worth your while. Try putting your localhost machine first in the list, in the middle and at the end. Normally you want to run twice as many jobs as you have processors. But if you have enough machines to feed, running 2 jobs on localhost can actually increase your build times. Setting localhost to 1 or 0 jobs can decrease the build time significantly, even though the build master's CPU might be idle part of the time. The obvious reason is that the machine can be more responsive to all of the build boxes requesting data.
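To make that experiment less tedious, here is a minimal Python sketch that prints host lists with localhost first, in the middle and at the end. It assumes the usual distcc host syntax of a name followed by /LIMIT for the job limit. The machine names and the host_lists helper are made up for illustration, and treating "0 jobs" as simply leaving localhost out of the list is my assumption.

```python
def host_lists(farm_hosts, local_jobs=1):
    """Return candidate host lists with localhost first, in the middle and last.

    farm_hosts is a list of distcc host specs (hypothetical names here).
    local_jobs is the job limit for localhost; 0 leaves localhost out.
    """
    local = f"localhost/{local_jobs}"
    positions = {
        "first": 0,
        "middle": len(farm_hosts) // 2,
        "last": len(farm_hosts),
    }
    variants = {}
    for name, index in positions.items():
        hosts = list(farm_hosts)
        if local_jobs > 0:
            hosts.insert(index, local)
        variants[name] = " ".join(hosts)
    return variants


if __name__ == "__main__":
    # Hypothetical farm; substitute the names of your own build boxes.
    farm = ["juliet", "ophelia", "portia", "viola"]
    for name, value in host_lists(farm, local_jobs=1).items():
        print(f"{name}: DISTCC_HOSTS='{value}'")
```

Time a full build with each variant (and with local_jobs set to 0, 1 and 2) and keep whichever is fastest on your own farm.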

Network


Because distcc has to transfer the data to the different machines before they can compile it, any time spent transferring the file is lost time. If, on average, 1/3 of the compile time for one file (send/compile/receive) is spent sending and receiving, it isn't the same as if the extra computer were only 2/3 as fast megahertz-wise, but it will decrease your farm performance. It is much cheaper to upgrade the network than it is to upgrade all of the computers, so here are some places where network bottlenecks occur and can be easily fixed:

Other network traffic. Are the boxes on a common network where all other traffic can spill over onto them? Reducing traffic not related to building can help improve throughput. Putting them on a different subnet and different switches than the normal home/office network can help.

Network compression. On networks slower than 100Mbps where there is no option of upgrading, it might be worthwhile to turn on LZO compression for data transfers. Although the effective network speed may increase, watch out for the CPU usage on the servers.

Enabling compression makes the distcc client and server use more CPU time, but less network traffic. The compression ratio is typically 4:1 for source and 2:1 for object code.
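As a rough way to see whether compression is worth trying, the sketch below estimates the time one job spends on the wire using the typical ratios quoted above (4:1 for source, 2:1 for object code). The file sizes are invented for illustration, the transfer_seconds helper is hypothetical, and the model ignores latency, protocol overhead and the extra CPU time compression costs, so treat the numbers as a back-of-the-envelope guide only.

```python
def transfer_seconds(source_kb, object_kb, link_mbps, compressed=False):
    """Estimate wire time for one job: preprocessed source out, object code back.

    Sizes are in kilobytes (1000 bytes), link speed in megabits per second.
    The 4:1 and 2:1 ratios are the typical figures quoted for LZO compression.
    """
    if compressed:
        source_kb = source_kb / 4.0
        object_kb = object_kb / 2.0
    kilobits = (source_kb + object_kb) * 8
    return kilobits / (link_mbps * 1000.0)


if __name__ == "__main__":
    # Made-up sizes: 800KB of preprocessed source, 100KB of object code.
    for mbps in (10, 100):
        plain = transfer_seconds(800, 100, mbps)
        packed = transfer_seconds(800, 100, mbps, compressed=True)
        print(f"{mbps}Mbps: {plain:.3f}s plain, {packed:.3f}s compressed")
```

In distcc itself, compression is switched on per host by appending ,lzo to the host entry. Whether the saved transfer time outweighs the extra CPU work is something only a timed build on your own farm can tell you.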

The network interconnect. Using several 10Mbps hubs chained together is probably the worst scenario. Using a single switch that is 100Mbps or better can dramatically decrease the transfer time and in turn lead to much faster compile times. Bang for your buck, this might be the best improvement that can be made.

The main box. All traffic for the compiling comes from and is sent to the main box. Using 1000Mbps card(s) on the build master can help reduce this bottleneck. Most networking companies sell switches that have one or two 1000Mbps ports with the rest being 100Mbps. Another possibility, if your switch supports it, is to use multiple NICs as one interface (called trunking). And because the build machines are independent of each other and only communicate with the main host, rather than chaining switches you can install multiple NICs in the main box, each one connected to a dedicated switch.

Depending on the build system, networking could have a bigger effect on compile times than it should. When using automake, the system enters a directory, builds all of the files in that directory and then continues building recursively through the directories. Each time it enters a directory it begins to build as many files as it can. In that moment there is a huge spike of network traffic (and collisions), slowing down the overall delivery speed. Compounding that problem, because all of the files are pulled from the hard disk at the same time, preprocessed and sent out, the computer is taxed to the fullest, often to the point that the total compile is slower than if it were handling one file at a time. Which brings us to...

Unsermake


Make doesn't know about any directory dependencies, only the order in which to build them. The simplest and best example of where this can show up is in the test directory for a project. Typically one will have a library/application and then build (let's say) thirty test applications. Each test application is in its own directory and contains one file. Make would build each directory, one at a time in a linear order. Unsermake (which replaces automake) realizes they have no interdependencies and, assuming there are thirty boxes in the farm, compiling could be sped up by a factor of thirty! There are in fact a number of new tools that you can replace automake with, including SCons. Unsermake is simply the one that I have become most familiar with, but they all have the same feature of decoupling the directory dependency and should give similar results.

Even in the best case scenario for automake, every other computer on the farm is guaranteed to be idle while the last file in the directory finishes its build. Using automake, one is quick to discover that the builds can't scale much beyond a handful of computers because the returns diminish dramatically and the extra boxes sit idle most of the time.
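The difference between the two approaches can be sketched with a toy scheduler in Python. This is only a model of the idea, not how make or Unsermake is actually implemented: every job is assumed to be independent, all boxes are equally fast, and network and preprocessing costs are ignored. The function names are made up for the example.

```python
import heapq


def makespan(job_times, workers):
    """Greedy scheduling: each job goes to whichever worker frees up first."""
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for job_time in job_times:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + job_time)
    return max(free_at)


def per_directory_build(directories, workers):
    """automake-style: finish every file in a directory before entering the next."""
    return sum(makespan(files, workers) for files in directories)


def flat_build(directories, workers):
    """Unsermake-style: all files from all directories share one job queue."""
    all_files = [job_time for files in directories for job_time in files]
    return makespan(all_files, workers)


if __name__ == "__main__":
    # Thirty test directories with one file each, one time unit per file.
    tests = [[1.0] for _ in range(30)]
    print("per directory:", per_directory_build(tests, workers=30))
    print("flat queue:   ", flat_build(tests, workers=30))
```

With thirty one-file directories and thirty boxes, the per-directory build takes thirty time units while the flat queue takes one, which is the factor of thirty from the test directory example.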

There is yet another hidden benefit of using Unsermake, which was touched upon in the last section. As each machine finishes its build, Unsermake will give it the next file off the stack. The boxes will almost never finish at the same time. So rather than having a spike in network traffic that is more than the system can handle, there is a continuous stream at the top speed of the entire system. Rather than trying to read and preprocess thirty jobs at once, the master box only has to do it for one. With only one job to do, the master box will read it faster and preprocess it faster, and it will be transferred faster to the machine that builds it. On the small scale it doesn't matter much, but add that up over hundreds of files and you will see a very nice payoff from just this one small thing.

In most of the cases I have tried, using Unsermake has cut the compile time in half. Just to say that again... in Half!


For a good example of all of the problems that Unsermake takes care of, check out this image of distccmon-gnome when compiling Qt with automake. On the right hand side, one can see the red and rust colors where too much was trying to happen at once and everything was slower because of it. On the left, one can see many idle machines as they wait for the others to finish before starting a new directory.

More Machines


Of course, if you are only building with two computers, adding a third will speed it up. But what about the nineteenth or twentieth? If a project contained directories with, at most, five files in them, then (using automake) fifteen of twenty computers would sit idle! If Unsermake is used they won't: as many computers as there are files could be added and (to a point) each addition would bring a benefit. Rather than trying to always have top of the line processors, one could get slower machines at half the price and double the number. Then one wouldn't have to worry that at most five would be used when build time came around. Initially, when I used 450MHz boxes with 800MHz boxes, the 450MHz boxes actually slowed down the system while the faster boxes waited for their jobs to finish. But after switching to Unsermake the 800MHz boxes were no longer held back, and adding the 450MHz boxes improved the system as a whole, just as originally expected. Because of Unsermake, I was able to add three low-end boxes to the farm.

BinUtils


During the 2003 KDE Contributors Conference, Hewlett Packard provided laptops for the developers to use. The laptops were integrated into the already running Teambuilder compilation farm running on the computers contributed by the Polytechnic University of Upper Austria. With 52 computers attached, compilation speed increased hugely, and in comparison the linker was quite slow. A little bit of work later, a patched binutils was created which dramatically increases the speed of the linker. In the binutils 2.14.90 changelog it is number 7, "Fix ELF weak symbol handling". Not only those who use distcc, but everyone can take advantage of this speed improvement. As of this writing, the current version of linux-binutils is 2.15.90. Make sure you are up to date.

Another place where the new binutils can really help is when distcc is used to cross compile. If, for example, a 200MHz MIPS box uses twenty P4 boxes with a cross compiler on them, having a faster linker on the really slow box would improve the total compile time more than adding additional machines.

Memory


Although it might seem obvious, making sure that the build boxes have enough memory is important. Watching a few of the boxes build with one job each, the memory usage didn't reach 60MB. Having the build boxes contain only 128MB should suffice, but the build master box should contain much more, and even more when doing only local builds. In one test, I left only 256MB in it and found usage over 200MB most of the time. It did go over 256MB and started swapping a few times, so at minimum 512MB is recommended. Presuming a developer will also be running X11, a desktop environment and other applications, 1GB of memory isn't out of the question. Once the build computer starts to swap, any gains that might have been had are lost.

Linux Kernel


Just a last note about the kernel. I haven't run any tests using Linux 2.6 vs 2.4 yet, but this might be an area of improvement. From everything else out there that shows the dramatic bandwidth increase going from 2.4 to 2.6, I wouldn't be surprised if it also improved compile times. If you have a system with 2.4 and 2.6 and are using distcc, let me know if it makes much of a difference and I'll make mention of it here.

Could you go back to just one box?


No way! Once you start playing around with distcc you'll find build times shrinking. Most developers probably have one really good computer and several really old computers that have been rebuilt twenty times. Your eye twinkles as you think how fast things would be if you could get a rack of dual Xeons. Coming back to reality, you realize there are several sensible requirements for your build farm:

  • Small form factor: if there are going to be several of these boxes, they can't take up much space under the desk.

  • Cheap, duh.

  • Low power: running twenty P4s uses a lot of electricity.



Some options appear:

  • 1U rack mounted computers. Way, way too expensive for what is needed (unless you are a corporation, in which case, get that rack!).

  • A lot of old computers bought off eBay or just collected from friends. You will probably get full desktops, which aren't exactly small, and by the time you buy enough machines you will have spent the same as buying a low-end box that is just as fast as all of the old computers combined, while the old computers use five times the electricity. Oh, and don't forget the aggravation of working with a bunch of old computers that are all different.

  • A bunch of those cool cubes. But there is the problem: they are cool, and so they go for top dollar for what is generally a VIA 1GHz machine.



Those options don't look too good. Maybe there is something better. Micro-ATX is a form factor that is smaller and thinner than the typical desktop, the sacrifice being that you only get three PCI slots. These machines are typically not sold as something "cool" or for servers, but to people who want a computer that doesn't take up much space, and they are surprisingly cheap. Putting an AMD XP in the box keeps electricity costs down, and if you pick the sweet spot in the market it won't cost you an arm and a leg. Adding it up (case, motherboard, CPU and fan, 256MB RAM), my total costs including shipping were only $240. Considering that a full blown computer with the same CPU can be found for ten times as much, this is very cheap and perfect for what we want to use it for. This box gives me the same performance as five old computers, doesn't take up much space, and doesn't use much electricity.

Picturing yourself getting two or three of these (for still less than $1K) is actually reasonable now! If someone sold these together as a bundle the price could go even lower (and I would pick one up, hint hint). In my particular situation, choosing the middle of the road CPU permits me, in six months, to sell off the CPU and RAM and get the new middle of the road parts for very little difference in cost.

The only item I left off the price was the CD-ROM drive. I have half a dozen old CD-ROM drives lying around and, because it won't be used other than at boot time, a new CD-ROM drive isn't needed. But if you don't have a spare already, finding one for only $20 shouldn't be a problem. Also, many machines these days can net boot, so if you have enough of them, it might be worthwhile to set that up.

Conclusion


distcc is a fantastic program which enables developers to cut down on waiting. Although you can just stick more machines on the network and utilize them, if you spend some time testing you might be able to get a whole lot better performance.

If there are more than three boxes in the farm, take a look at Unsermake or similar programs and see if you can use one of them in your projects. The effort to get it to work with your project might pay you back big time. If you are using automake, you want to have a faster, smaller farm.

The idea of building small, cheap, dedicated build boxes is a very real alternative. Investigate your options. Perhaps the bare bones computer is so cheap that it costs the same as the difference in electricity for running five PIIs for a few months. Of course you could have both the new bare bones box and those PIIs....

Links


scons - http://www.scons.org/
distcc - http://distcc.samba.org/
ccache - http://ccache.samba.org/
distccKNOPPIX - http://opendoorsoftware.com/cgi/http.pl?cookies=1&p=distccKNOPPIX
Unsermake - http://www.kde.me.uk/index.php?page=unsermake
