Firmware build time optimization

Discussion in 'Tomato Firmware' started by RMerlin, Apr 3, 2013.

  1. RMerlin

    RMerlin Network Guru Member

    As a follow-up to the discussion first started in another thread, this is about finding ways to make the firmware build more quickly. Granted, mostly of interest to us FW developers, but anything that makes our life simpler means we can spend more time working on stuff :)

    Note that while my experiments are done on a distant relative of Tomato, our build systems should still be similar enough for things to be shareable between both projects.

    I decided to do some experiments regarding parallel building to greatly speed up build times on modern multi-core systems. The global "build everything with -j8" doesn't work too well, as some components don't like this.

    I decided to go on a per-module basis instead.

    What I did first was define a new variable in release/src-rt/platform.mak (not sure if Tomato also uses this file):

    export PARALLEL_BUILD := -j6
    Then I dug through both src-rt/Makefile and src-rt/router/Makefile, and modified any "make" rule that I wanted to multithread by appending that variable. For example, in src-rt/Makefile, my kernel rule now looks like this:

            @$(MAKE) -C router kernel $(PARALLEL_BUILD)
            @[ ! -e $(KERNEL_BINARY) ] || ls -l $(KERNEL_BINARY)
    Applying the same logic to src-rt/router/Makefile, I made a few select modules compile with that variable appended (in my case, samba-3.5.8 is the biggest winner, as it takes forever to build, and is used by Asus's AiCloud).

    This system has two benefits:

    1) Makes it easy to adjust the number of threads to suit one's machine (this could also be automated to use the number of CPU cores)
    2) We can select which module we want to compile with multithreaded jobs

    Any comments? While I agree that ideally, it would be best to have parallel jobs at a higher level (i.e. compiling multiple modules at once), this might be tricky due to all the dependencies involved.

    I'll try a full build of my FW with and without these tweaks, and post the results once it's done. A few limited tests (compiling just one specific module at a time) have already shown great promise:

    Before:
    real    5m41.871s
    user    5m12.264s
    sys     0m25.462s

    After (parallel):
    real    1m30.575s
    user    7m29.244s
    sys     0m34.170s

    Samba 3.5.8, before:
    real    6m58.109s
    user    6m4.687s
    sys     0m30.382s

    Samba 3.5.8, after (parallel):
    real    2m12.301s
    user    9m0.378s
    sys     0m33.510s

    That's 8 minutes of build time saved just there :)
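
    The saving can be checked with a little arithmetic. The following is only a sketch, assuming the timings are in the `XmY.YYYs` format printed by the shell's `time` builtin. The helper name `to_seconds` is made up for this illustration, and the first module is left unnamed because the post does not name it.

```shell
#!/bin/sh
# Convert a `time`-style value such as "5m41.871s" to seconds.
# to_seconds is a hypothetical helper, not part of any build system.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# "real" times from the two tests above (before and after parallel building).
first_before=$(to_seconds 5m41.871s)
first_after=$(to_seconds 1m30.575s)
samba_before=$(to_seconds 6m58.109s)
samba_after=$(to_seconds 2m12.301s)

# Total wall-clock time saved across both modules, in seconds.
saved=$(awk -v a="$first_before" -v b="$first_after" \
            -v c="$samba_before" -v d="$samba_after" \
            'BEGIN { printf "%.3f\n", (a - b) + (c - d) }')
echo "saved: ${saved}s"
```

    Run as written, this reports 537.104 seconds saved, i.e. just under nine minutes of wall-clock time across the two modules.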
  2. phuque99

    phuque99 LI Guru Member

    Shouldn't the -j job count be scaled to the number of CPUs in your build machine? If I only had a 4-core CPU, -j8 would leave 4 additional jobs that the CPU has to multitask.
  3. shibby20

    shibby20 Network Guru Member

    As I see it, we are able to use $(PARALLEL_BUILD) for most of the objects. We can't use it for openssl (compilation will crash), and there is no need to use it with sqlite/snmpd (even with -jX they will compile on only 1 core).

    I'm compiling my All-In-One image now, and we will see how much time it takes.


    looks pretty nice ;)
  4. shibby20

    shibby20 Network Guru Member

    r2z compilation time
  5. RMerlin

    RMerlin Network Guru Member

    Seems correct. According to the man page, "-j" without an argument won't limit the number of parallel jobs, though it doesn't say how many jobs it will then use.
  6. koitsu

    koitsu Network Guru Member

    It'll use as many as it needs; this is unrelated/separate from how many cores/CPUs/etc. the system has. For example, if the underlying Makefile semantics (dependencies/objects/targets) has 193 things to build/process, then make -j will fork off 193 processes all at once. I'm sure you can imagine the effects this can have on a system. The rule of thumb is to not use make -j but use make -jX and set X to the number of physical CPU cores you have.

    In general, parallel building across all *IX systems has been spotty at best. There are many, many caveats to it. I'll tell folks about the most common one:

    Use of make -jX with a failing build will often obfuscate/make a mess (output-wise) of where the build fails, often sending people on a wild goose chase looking at the wrong program/Makefile/part of the build tree. We deal with this problem on FreeBSD constantly; people use make -j4 then complain that some library/program/etc. fails ("Error 1", etc.) when in fact the build is failing deeper/within some other parallel build phase. At this point the user is told to stop using -jX and make clean (which has been proven on TomatoUSB to not always clean up everything correctly) then rebuild using make so that the real location of the problem can be seen.

    Consider yourself warned. Parallel building is a very, very unreliable/spotty thing.
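
    The "rebuild without -jX to find the real error" advice can be scripted. The following is only a sketch: `build_with_fallback` is a hypothetical wrapper, the default of 4 jobs is arbitrary, and it does not run make clean between the two attempts, which the advice above would also call for.

```shell
#!/bin/sh
# Run a build command in parallel first; if it fails, rerun it serially so
# that the last error printed is the one that actually stopped the build.
# build_with_fallback is a hypothetical wrapper; JOBS defaults to 4 purely
# for illustration.
build_with_fallback() {
    njobs="${JOBS:-4}"
    if "$@" "-j${njobs}"; then
        return 0
    fi
    echo "parallel build failed; retrying with -j1 for readable errors" >&2
    "$@" -j1
}

# Usage (not run here): build_with_fallback make -C router kernel
```

    The serial rerun costs time, but its output is not interleaved, so the failing directory and target are the last thing on screen.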

    The only build infrastructure I've seen work better (reliably) with parallel building is CMake (which is used by things like MySQL and MariaDB as a build framework at this point in time).
  7. RMerlin

    RMerlin Network Guru Member

    Then having a specific value is definitely desirable rather than letting it run wild.

    You are correct in pointing out that this optimization does indeed make debugging more difficult. Something to keep in mind: there are definitely cases where you don't want to use parallel building. But in many cases, this will be a great time saver.

    Personally, I would mostly parallelize bits that don't really get modified by us. Mostly static stuff like the kernel (except when working on new patches) or Samba, for instance. But heavily worked bits like httpd or rc should be kept single-threaded - not that much to be gained there with a parallel build anyway.
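
    The selective approach amounts to an allow-list of modules. A minimal sketch of that idea in shell follows; `jobs_flag_for` and the exact module list are illustrative assumptions, and in the real tree the flag is appended to individual rules in the Makefiles rather than chosen by a script.

```shell
#!/bin/sh
# Print the make jobs flag to use for a given module: the parallel flag for
# modules on an allow-list, nothing (a serial build) for everything else.
# jobs_flag_for and the module list are illustrative assumptions.
PARALLEL_BUILD="${PARALLEL_BUILD:--j6}"

jobs_flag_for() {
    case "$1" in
        kernel|samba3|samba-3.5.8) printf '%s\n' "$PARALLEL_BUILD" ;;
        *)                         printf '\n' ;;
    esac
}

# Show the command that would be used for a few modules.
for module in kernel samba3 httpd rc; do
    echo "make -C $module $(jobs_flag_for "$module")"
done
```

    Setting PARALLEL_BUILD to an empty-looking value such as -j1 turns the whole thing back into a serial build without touching the list.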

    I will post a timing benchmark of my own build tests later today once this second rebuild finishes, along with a list of what I have set to parallel building here.
  8. RMerlin

    RMerlin Network Guru Member

    i7 860 (o/c at 3 GHz), using a VirtualBox VM with 6 virtual CPUs allocated (out of my 8).

    Here are my build time results for Asuswrt-Merlin (RT-N66U), with conservative (i.e. limited subset of modules) parallel building:

    Without parallel building:
    real    27m40.635s
    user    22m18.404s
    sys     2m9.040s

    With parallel building:
    real    17m27.652s
    user    28m13.162s
    sys     2m22.233s

    This is what I got parallelized (I still need to ensure that they all do really build with it):
                $(MAKE) -C $(LINUXDIR) vmlinux CC=$(KERNELCC) LD=$(KERNELLD) $(PARALLEL_BUILD); \
                $(MAKE) -C $(LINUXDIR) modules CC=$(KERNELCC) LD=$(KERNELLD) $(PARALLEL_BUILD); \
            $(MAKE) -C $@ $(PARALLEL_BUILD)    # e2fsprogs
            -@$(MAKE) -C samba3 $(PARALLEL_BUILD)
            -@$(MAKE) -C samba-3.5.8 $(PARALLEL_BUILD)
            @$(MAKE) -C ffmpeg all $(PARALLEL_BUILD)
            $(MAKE) -C lighttpd-1.4.29 $(PARALLEL_BUILD)
    Probably more could also be added, but not sure it would yield any major improvement in build time.
  9. mstombs

    mstombs Network Guru Member

    Can you also apply parallel processing higher up in the chain and build multiple packages at the same time?
    I've also seen mention of compiler caching tools, but like make clean and dependency tracking, they are difficult to trust!
  10. koitsu

    koitsu Network Guru Member

    This introduces "dependency hell" into the Makefile structure, and becomes basically impossible to manage. Make is not the equivalent of a package manager. :)

    If you're talking about things like ccache -- avoid them like the plague. They make troubleshooting even more difficult/impossible than if make -jX is used. I've even seen FreeBSD folks try to use both and then continually (multiple times a year) complain on the mailing lists about "random breakage" that happens intermittently for them, where every time removing ccache from the picture solved the issue.

    Remember folks: KISS principle should be applied heavily when it comes to building something as key and critical as a firmware on an embedded device. Don't f*** around.

    TomatoUSB's build environment really is not very clean/prepared for any of this stuff in a sane, manageable way. OpenWRT, on the other hand, has their ducks in a row.
  11. RMerlin

    RMerlin Network Guru Member

    That's why I like my current selective approach. Can easily be toggled on/off, and you can avoid parallelizing stuff that can potentially have trouble dealing with multiple build jobs.

    Tomato (and Asuswrt)'s build system is a beast that probably none of us devs currently around can claim to fully understand. I've been playing with this code for a year now, and there are still dark corners of the filesystem I have never really dared to investigate, for fear that the simple act of being looked at might cause them to break apart.
  12. RMerlin

    RMerlin Network Guru Member

    I brought my build down to 15m14s now with a bunch of additional parallelized builds :)

    I also made the build system automatically detect the correct number of threads to use:

    export PARALLEL_BUILD := -j`grep -c '^processor' /proc/cpuinfo`
  13. lancethepants

    lancethepants Network Guru Member

    export PARALLEL_BUILD := -j`nproc`
    Maybe just cleaner looking.

    Probably not an issue for most, but if you're running on OpenVZ or Xen, sometimes it doesn't return the actual number of CPUs your instance is really allotted.
  14. RMerlin

    RMerlin Network Guru Member

    That command doesn't seem to be always available. While my Ubuntu 12.04 dev machine has it, I just tested a customer's CentOS 5 VM, and it doesn't have this command, despite having coreutils installed (the package that provides it under Ubuntu). Probably safer to keep using cpuinfo.
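
    Both methods can be combined so that the build does not depend on any single tool being installed. A minimal sketch, assuming a POSIX shell; `detect_jobs` is a hypothetical helper name, and the `getconf _NPROCESSORS_ONLN` step is an extra fallback not mentioned in this thread.

```shell
#!/bin/sh
# Print the number of online logical CPUs, trying several sources in turn.
# detect_jobs is a hypothetical helper name.
detect_jobs() {
    if command -v nproc >/dev/null 2>&1; then
        n=$(nproc)
    elif n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) && [ -n "$n" ]; then
        :   # getconf already stored its answer in n
    elif [ -r /proc/cpuinfo ]; then
        n=$(grep -c '^processor' /proc/cpuinfo)
    else
        n=1
    fi
    # Guard against empty, non-numeric or zero results.
    case "$n" in
        ''|*[!0-9]*|0) n=1 ;;
    esac
    echo "$n"
}

echo "PARALLEL_BUILD=-j$(detect_jobs)"
```

    Since variables given on the make command line override assignments in the makefiles, the result could be passed in as `make PARALLEL_BUILD=-j$(detect_jobs)` without editing platform.mak.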
  15. koitsu

    koitsu Network Guru Member

    I wouldn't recommend trying to "automate" the number, particularly because of systems with HT (hyperthreading) -- they'll show, for example, 8 logical processors (what the idiotic marketing world calls "threads" -- thanks for adding confusion, marketing bastards!) but actually only have 4 physical. On these systems, make -j4 performs better than make -j8.
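
    For anyone who wants to compare the two settings on their own machine, physical cores can be told apart from logical processors by counting unique (physical id, core id) pairs. The following is only a sketch: `count_physical_cores` is a hypothetical helper, and it assumes the x86 Linux layout of /proc/cpuinfo. Those fields may be missing on other architectures or inside some VMs, in which case it prints 0.

```shell
#!/bin/sh
# Count physical cores by counting unique (physical id, core id) pairs in
# cpuinfo-formatted text read from stdin.
# count_physical_cores is a hypothetical helper name.
count_physical_cores() {
    awk -F': *' '
        /^physical id/ { phys = $2 }
        /^core id/     { seen[phys "," $2] = 1 }
        END            { n = 0; for (k in seen) n++; print n }
    '
}

# Usage on a Linux host: count_physical_cores < /proc/cpuinfo
```

    Timing the same module with -j set to each count is the only reliable way to tell which value is faster for a given build.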
  16. RMerlin

    RMerlin Network Guru Member

    I do get better performance here using both physical and virtual threads. HT works much better today than it did when first introduced on the Pentium 4. Keep in mind that while compiling, there are a lot of "dead" times, unlike if you are doing video encoding.
  17. RMerlin

    RMerlin Network Guru Member

    So far my testers are reporting various random build failures in Samba3 and in other modules. I might try to limit this to fewer modules (the kernel is one that I would trust to build cleanly there).

    But I definitely won't enable parallel building by default, now that the feedback shows how random the issues can be. I don't have any problem myself with the same build, so I suspect timing issues can occur with some modules.
  18. mstombs

    mstombs Network Guru Member

    Is there any potential benefit to using a newer toolchain for the userspace apps?
    I have seen a different compiler used for the kernel/modules elsewhere. There must be some new features in the Entware toolchain (which I guess can't be used directly due to library location etc.), but of course there is lots of potential to break things.
  19. RMerlin

    RMerlin Network Guru Member

    WL500G does use a newer toolchain, however it would require a good amount of work to upgrade. In the end I'm not sure you'd gain anything worth those hours of work.
  20. koitsu

    koitsu Network Guru Member

    Entware's toolchain contains a newer/improved/different version of uClibc, for starters. That's why you see stuff like this (and this IS NOT due to "library location" -- this is intentional, using RPATH/rpath capability):

    root@gw:/tmp/home/root# /opt/bin/objdump -x /opt/bin/vim | grep RPATH
      RPATH                /opt/lib
    root@gw:/tmp/home/root# /opt/bin/ldd /opt/bin/vim
   => /opt/lib/ (0x2aac0000)
   => /opt/lib/ (0x2aadc000)
   => /opt/lib/ (0x2ab2f000)
   => /opt/lib/ (0x2ab50000)
   => /opt/lib/ (0x2aaa8000)
    root@gw:/tmp/home/root# ls -l /opt/lib/libgcc* /opt/lib/libc.* /opt/lib/ld-uC*
    -rwxr-xr-x    1 root    root        31696 Mar 11 01:46 /opt/lib/
    lrwxrwxrwx    1 root    root            19 Apr  1 08:58 /opt/lib/ ->
    lrwxrwxrwx    1 root    root            19 Apr  1 08:58 /opt/lib/ ->
    -rw-r--r--    1 root    root        78848 Mar 11 01:46 /opt/lib/
    Now compare this to what's in the "base system" (TomatoUSB itself):

    root@gw:/tmp/home/root# ls -l /lib/libgcc* /lib/libc.* /lib/ld-uC*
    -r-xr-xr-x    1 root    root        26800 Mar 27 05:13 /lib/
    -r-xr-xr-x    1 root    root        392044 Mar 27 05:13 /lib/
    -r-xr-xr-x    1 root    root        66152 Mar 27 05:13 /lib/
    They're different -- very very different. They are not interchangeable. Now you understand why the RPATH stuff in Entware binaries is set to what it is -- if it wasn't set, it'd try to use the libraries in the path, which would be /lib, and that would result in serious breakage (missing symbols, and probably random SIGBUS crashes).

    All that said -- yes, I am of the opinion the Entware toolchain is a significant improvement over the "base" Tomato toolchain stuff. The WL500G guys really do a good job, as do the OpenWRT guys. The only folks worse than us are the DD-WRT guys, whose toolchain/build model is thrown together like a pile of rubbish.
  21. RMerlin

    RMerlin Network Guru Member

    The toolchain is pretty much in the same boat as the kernel: it's all tied to the SDK used by Broadcom for these ageing SoCs. You can fiddle with them, but the return on your invested time will be quite low: many hours of work, for probably no noticeable performance or stability improvements. For every esoteric bug you might fix, there's a good chance of introducing a new equally esoteric bug.

    The best time to move forward would be when supporting newer SoCs, which would then use a more up-to-date SDK as well. That would probably imply a major break from the legacy support as well. Otherwise, expect pretty much a full-time job just trying to maintain everything.