Using FPU emulator

Discussion in 'Tomato Firmware' started by ryzhov_al, Apr 23, 2014.

  1. ryzhov_al

    ryzhov_al Addicted to LI Member

    A little theory first.

    As you may know, every router has an FPU on board. To be more precise, it's not a real FPU, it's an FPU emulator, which means every FPU instruction is «atomized» on the fly into simple arithmetic operations that can be executed on the CPU. I don't know how this emulator became a standard part of MIPS SoCs; maybe it's part of some spec like MIPS R3000.

    So every program can be compiled to use the FPU or not: hard-float and soft-float code, respectively.

    I found that some floating-point operations run twice as fast when the FPU emulator is not used. I know we can't recompile the whole Tomato code as soft-float because of the proprietary Broadcom binaries, but we can use soft-float code for any user-space software.

    You can run your own tests on Tomato using either of the two Entware feeds:

    * a hard-float feed (which has been available all along):

    wget -O - | sh

    * a soft-float feed (which was created for the DD-WRT and Zyxel Keenetic guys, who use soft-float code for their kernels):

    wget -O - | sh

    There are no barriers to using the soft-float repo on hard-float based firmware (including Tomato), but not vice versa: on soft-float firmware you'll get "Bus error" when trying to use the hard-float repo.
    Last edited: Apr 23, 2014
  2. mstombs

    mstombs Network Guru Member

    Interesting, thanks. If my maths is correct you get a 500% boost on double precision on the N14U! I recall the Raspberry Pi folk got a boost going to hard-float because their Broadcom CPU had real hard-float.

    I wonder why MIPS hardware emulation is so poor. Perhaps the GCC optimiser is much better than it used to be, or it doesn't properly optimize for the hardware?

    I also wonder why a router needs double-precision arithmetic. For cryptography, maybe? C defaults to double precision for real variables, so it may be possible to optimize further by dropping down to scaled integers on embedded devices. Real PCs have had real hard float since the 486 (before that we had to buy 80287/80387 co-processors), so there is no benefit to standard library devs there.
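
    To make "scaled integers" concrete, here is a minimal Python sketch of Q16.16 fixed-point arithmetic. The format choice and the helper names (to_fixed, fx_mul, fx_div) are assumptions for illustration only, not something taken from Tomato or Entware.

```python
# Q16.16 fixed point: a real number x is stored as the integer round(x * 2**16).
# All helper names here are hypothetical, chosen for illustration.
FRAC_BITS = 16
ONE = 1 << FRAC_BITS


def to_fixed(x):
    """Convert a Python float to its Q16.16 integer representation."""
    return int(round(x * ONE))


def to_float(q):
    """Convert a Q16.16 integer back to a float (for display only)."""
    return q / ONE


def fx_mul(a, b):
    # The raw product carries 32 fractional bits; shift back down to 16.
    # The shift rounds toward negative infinity.
    return (a * b) >> FRAC_BITS


def fx_div(a, b):
    # Pre-scale the dividend so the quotient keeps 16 fractional bits.
    # Floor division also rounds toward negative infinity.
    return (a << FRAC_BITS) // b


if __name__ == "__main__":
    a = to_fixed(3.25)
    b = to_fixed(1.5)
    print(to_float(a + b))          # addition is plain integer addition
    print(to_float(fx_mul(a, b)))   # multiplication needs one shift
    print(to_float(fx_div(a, b)))   # division loses a little precision
```

    Only integer add, multiply, shift and divide are used, which is the whole point on a CPU without a real FPU. The price is a fixed range and a fixed precision of 2^-16.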

    RPi folk also replaced some libc functions (memcpy, memset, etc.) with hand-optimized assembler. I wonder if there's any mileage for our routers there?
  3. lancethepants

    lancethepants Network Guru Member

    So let me see if I understand correctly. I had to read this through a few times. Let me know if I'm right.

    Generally hard-float > soft-float.
    But in the case of our routers, our 'hard-float' is really a 'soft-float' emulated in hardware (poorly, or at least not as efficiently as it could be).
    So letting GCC figure out the 'soft-float' stuff is more efficient than our routers' fake 'hard-float'.

    So you're saying this new soft-float version of Entware is preferable and likely faster than the hard-float version we've been using so far?
    Does this also mean that Entware was not usable on DD-WRT soft-float firmware, but this new soft-float Entware is?

    Interesting that DD-WRT uses soft-float for their kernel. I'm guessing this is because they have access to the Broadcom source that we do not?
    This may explain why I've seen them sometimes use two toolchains for one firmware in DD-WRT (presumably before they had the Broadcom source).
    I'm guessing they were using a hard-float chain for the kernel, and then a soft-float one for user-space binaries.

    edit: let me know if I've erred in any of this please.
    Last edited: Apr 23, 2014
  4. ryzhov_al

    ryzhov_al Addicted to LI Member

    Not 500. 50 percent, which means floating-point calculations run twice as fast.

    Yep, GCC does this job better than the onboard FPU emulation. I suspect that's because of context switching between the CPU and the FPU, which takes too much time and lowers overall FP performance.

    Crypto, encoding, compression maybe. Real-life tests are needed. As an example, we could compare the performance of:
    • libvorbisidec — a fixed-point implementation of libvorbis,
    • libvorbis on hard-float,
    • libvorbis on soft-float.
    They are all present in Entware.

    You decide. On a hard-float based kernel you can use either of the two Entware feeds.

    Yes. DD-WRT switched to a soft-float kernel about a year ago, and this step made Optware and Entware incompatible with their builds. A strange decision, isn't it?
  5. mstombs

    mstombs Network Guru Member

    If "50% means benchmark was ran twice faster on soft-float code", what does "18%" mean? I assumed it took 18% of the time (double-precision division on the N14).

    I agree real-world tests are needed, in conditions where the CPU is the limit. I'm still unsure why even crypto needs floating point (it should all be exact integer arithmetic). I'm sure lossy JPEG/MPEG codecs use float/double arithmetic, but I wouldn't do that on a router!
  6. ryzhov_al

    ryzhov_al Addicted to LI Member

    Please open the link from the first post to clear up any misunderstandings; the raw numbers are on the first sheet.

    I think that may be the wrong direction to take. All the well-known vendors put an FPU emulator on their SoCs, including those designed especially for routers. I don't think that's done just for fun.
  7. mstombs

    mstombs Network Guru Member

    OK, I now see the calcs: you are reporting % improvement, so the bigger the better. Float mul on the N66 is 58% better with soft float, i.e. hard float takes 2.4 times as long as soft float. But double-precision division must be avoided whatever the mode!
    No wonder MIPS are losing out to ARM!
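
    The percentages can be turned into time ratios with one line of arithmetic. A minimal Python sketch, assuming the spreadsheet reports improvement as the share of hard-float run time saved by soft float, i.e. (t_hard - t_soft) / t_hard. That definition is an assumption, chosen because it reproduces the "50% is twice as fast" and "58% is 2.4 times" figures quoted in this thread. The function name time_ratio is made up for illustration.

```python
def time_ratio(improvement_percent):
    """Return t_hard / t_soft for a given percentage of run time saved.

    Assumes improvement = (t_hard - t_soft) / t_hard * 100, which is a guess
    at how the benchmark sheet defines its percentages.
    """
    saved = improvement_percent / 100.0
    if not 0 <= saved < 1:
        raise ValueError("improvement must be in the range [0, 100)")
    return 1.0 / (1.0 - saved)


if __name__ == "__main__":
    for pct in (18, 50, 58):
        print(f"{pct}% improvement -> hard float takes {time_ratio(pct):.2f}x as long")
```

    Under this reading a small percentage is only a small gain, not a small fraction of the original run time.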
  8. RMerlin

    RMerlin Network Guru Member

    I would expect crypto code to use integer maths, but just out of curiosity, have you tried an openssl benchmark (openssl speed) using both methods?
  9. Monk E. Boy

    Monk E. Boy Network Guru Member

    Al, you're always doing the coolest stuff.

    If I had to guess, I'd say the FPU emulation is done so they can share code with open source projects without rewrites and the testing it requires.

    I wouldn't necessarily assume that ARM includes a full FPU in every CPU. Up until ARMv7 they were doing the same FPU emulation. Even so it's kind of rash to think that all ARM CPUs from now on will include an FPU, because most ARM CPUs are tailored by the vendor for their product, and some vendors may decide to not include an FPU for power/heat/cost savings. ARM isn't Intel, it's far more customized, quite literally tailor made for the embedded market.
  10. lancethepants

    lancethepants Network Guru Member

    Here are my unscientifically produced numbers. Ran 'openssl speed' once for soft float and hard float. Each time traffic should have been nearly the same when run (almost nothing).
    Ran on an Asus RT-N16.

    Soft Float
    OpenSSL 1.0.1g 7 Apr 2014
    built on: Thu Apr 24 18:32:50 MDT 2014
    options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
    compiler: mipsel-linux-gcc -fPIC -DOPENSSL_PIC -DZLIB_SHARED -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -DTERMIO -O3 -mtune=mips32 -mips32 -fomit-frame-pointer -Wall -DSHA1_ASM -DSHA256_ASM -DAES_ASM
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    md2                  0.00         0.00         0.00         0.00         0.00
    mdc2               584.24k      694.21k      722.16k      726.67k      734.55k
    md4               1874.27k     6614.89k    19694.79k    38464.98k    55371.12k
    md5               1768.31k     5820.47k    15489.72k    26387.59k    32814.42k
    hmac(md5)         1843.85k     6165.20k    15748.90k    26661.02k    33067.38k
    sha1              1452.11k     4464.63k    10526.17k    15860.36k    18874.59k
    rmd160            1315.37k     3714.36k     8271.45k    11849.29k    13403.10k
    rc4              25043.54k    27192.87k    28237.82k    28571.98k    28445.35k
    des cbc           3933.69k     4054.42k     4118.36k     4156.48k     4192.86k
    des ede3          1437.56k     1484.04k     1492.25k     1486.51k     1498.20k
    idea cbc          6268.35k     6675.16k     6860.03k     6917.50k     6861.14k
    seed cbc          5081.87k     5380.21k     5490.77k     5505.28k     5526.18k
    rc2 cbc           3631.47k     3642.28k     3767.85k     3712.77k     3751.22k
    rc5-32/12 cbc        0.00         0.00         0.00         0.00         0.00
    blowfish cbc      8809.23k     9629.74k     9919.62k    10038.99k    10080.56k
    cast cbc          7226.11k     7739.35k     7924.91k     7931.25k     8029.81k
    aes-128 cbc       8324.01k     9071.96k     9281.15k     9314.33k     9451.04k
    aes-192 cbc       7283.63k     7756.67k     7928.01k     7966.79k     7994.73k
    aes-256 cbc       6281.32k     6778.66k     6952.96k     6966.61k     7102.52k
    camellia-128 cbc     6285.68k     6832.04k     7021.80k     7068.66k     7087.86k
    camellia-192 cbc     4970.44k     5316.48k     5468.95k     5503.83k     5495.24k
    camellia-256 cbc     4971.77k     5310.34k     5473.85k     5466.16k     5500.74k
    sha256            1428.26k     3420.51k     6230.50k     8041.72k     8637.82k
    sha512             461.34k     1839.30k     3007.36k     4332.41k     4990.18k
    whirlpool          290.16k      583.72k      953.92k     1120.26k     1202.42k
    aes-128 ige       8112.64k     8855.88k     9184.17k     9204.09k     9260.52k
    aes-192 ige       7058.85k     7516.77k     7916.61k     7968.64k     7990.65k
    aes-256 ige       6203.51k     6662.04k     6966.49k     6877.29k     6886.74k
    ghash             9709.76k    10603.62k    10822.74k    10956.46k    10954.42k
                      sign    verify    sign/s verify/s
    rsa  512 bits 0.004262s 0.000375s    234.6   2663.7
    rsa 1024 bits 0.020892s 0.001074s     47.9    931.0
    rsa 2048 bits 0.126026s 0.003454s      7.9    289.5
    rsa 4096 bits 0.816154s 0.011937s      1.2     83.8
                      sign    verify    sign/s verify/s
    dsa  512 bits 0.003873s 0.004248s    258.2    235.4
    dsa 1024 bits 0.010670s 0.012605s     93.7     79.3
    dsa 2048 bits 0.033986s 0.040656s     29.4     24.6
                                  sign    verify    sign/s verify/s
     160 bit ecdsa (secp160r1)   0.0029s   0.0119s    347.0     84.2
     192 bit ecdsa (nistp192)   0.0031s   0.0132s    322.9     76.0
     224 bit ecdsa (nistp224)   0.0038s   0.0171s    262.4     58.6
     256 bit ecdsa (nistp256)   0.0042s   0.0184s    238.5     54.5
     384 bit ecdsa (nistp384)   0.0087s   0.0442s    115.4     22.6
     521 bit ecdsa (nistp521)   0.0215s   0.1124s     46.4      8.9
     163 bit ecdsa (nistk163)   0.0059s   0.0286s    168.1     35.0
     233 bit ecdsa (nistk233)   0.0126s   0.0535s     79.5     18.7
     283 bit ecdsa (nistk283)   0.0193s   0.1002s     51.9     10.0
     409 bit ecdsa (nistk409)   0.0488s   0.2312s     20.5      4.3
     571 bit ecdsa (nistk571)   0.1179s   0.5458s      8.5      1.8
     163 bit ecdsa (nistb163)   0.0059s   0.0309s    170.3     32.4
     233 bit ecdsa (nistb233)   0.0126s   0.0591s     79.5     16.9
     283 bit ecdsa (nistb283)   0.0193s   0.1144s     51.8      8.7
     409 bit ecdsa (nistb409)   0.0488s   0.2642s     20.5      3.8
     571 bit ecdsa (nistb571)   0.1170s   0.6300s      8.5      1.6
                                  op      op/s
     160 bit ecdh (secp160r1)   0.0099s    100.6
     192 bit ecdh (nistp192)   0.0111s     89.7
     224 bit ecdh (nistp224)   0.0143s     69.8
     256 bit ecdh (nistp256)   0.0154s     64.9
     384 bit ecdh (nistp384)   0.0366s     27.4
     521 bit ecdh (nistp521)   0.0944s     10.6
     163 bit ecdh (nistk163)   0.0141s     71.0
     233 bit ecdh (nistk233)   0.0266s     37.6
     283 bit ecdh (nistk283)   0.0506s     19.8
     409 bit ecdh (nistk409)   0.1152s      8.7
     571 bit ecdh (nistk571)   0.2727s      3.7
     163 bit ecdh (nistb163)   0.0152s     65.7
     233 bit ecdh (nistb233)   0.0293s     34.1
     283 bit ecdh (nistb283)   0.0560s     17.9
     409 bit ecdh (nistb409)   0.1312s      7.6
     571 bit ecdh (nistb571)   0.3156s      3.2
    Hard Float
    OpenSSL 1.0.1g 7 Apr 2014
    built on: Thu Apr 24 20:11:07 MDT 2014
    options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
    compiler: mipsel-linux-gcc -fPIC -DOPENSSL_PIC -DZLIB_SHARED -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -DTERMIO -O3 -mtune=mips32 -mips32 -fomit-frame-pointer -Wall -DSHA1_ASM -DSHA256_ASM -DAES_ASM
    The 'numbers' are in 1000s of bytes per second processed.
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    md2                  0.00         0.00         0.00         0.00         0.00
    mdc2               567.18k      685.34k      723.41k      730.89k      736.73k
    md4               2105.09k     7293.92k    21040.62k    40181.96k    54202.83k
    md5               1440.36k     4900.01k    13663.28k    25054.78k    32781.65k
    hmac(md5)         1831.77k     6107.59k    15903.06k    26491.76k    33159.91k
    sha1              1362.45k     4204.97k    10085.97k    15672.01k    18701.40k
    rmd160            1206.47k     3554.06k     8117.54k    11670.54k    13497.69k
    rc4              24932.54k    27186.52k    27941.13k    28570.62k    28669.28k
    des cbc           3964.01k     4080.02k     4105.10k     4124.24k     4123.13k
    des ede3          1457.54k     1482.33k     1493.73k     1470.68k     1477.83k
    idea cbc          6236.94k     6707.54k     6800.11k     6882.30k     6862.17k
    seed cbc          5046.29k     5370.96k     5475.08k     5482.47k     5505.02k
    rc2 cbc           3627.96k     3715.79k     3741.70k     3778.83k     3739.47k
    rc5-32/12 cbc        0.00         0.00         0.00         0.00         0.00
    blowfish cbc      8874.75k     9699.24k     9938.09k    10021.93k    10007.89k
    cast cbc          7252.67k     7754.65k     7890.75k     8044.24k     8111.19k
    aes-128 cbc       8084.09k     8361.51k     9017.17k     9050.66k     9068.54k
    aes-192 cbc       6883.57k     7426.16k     7757.57k     7777.64k     7737.49k
    aes-256 cbc       6125.40k     6398.79k     6865.83k     6607.89k     6599.87k
    camellia-128 cbc     6282.34k     6875.01k     6986.77k     7037.46k     7154.90k
    camellia-192 cbc     4995.31k     5325.25k     5431.95k     5472.79k     5494.10k
    camellia-256 cbc     4987.13k     5355.19k     5496.18k     5456.70k     5455.87k
    sha256            1431.99k     3421.62k     6333.32k     7907.67k     8579.75k
    sha512             437.80k     1770.33k     2957.14k     4293.29k     4936.97k
    whirlpool          291.01k      589.14k      950.52k     1124.35k     1187.84k
    aes-128 ige       7780.79k     8336.23k     9321.95k     9046.50k     8869.21k
    aes-192 ige       6848.40k     7368.26k     7679.49k     7415.47k     7714.13k
    aes-256 ige       5854.35k     6358.45k     6442.27k     6712.35k     6442.01k
    ghash             9617.40k    10570.10k    10827.86k    10913.06k    11023.70k
                      sign    verify    sign/s verify/s
    rsa  512 bits 0.004276s 0.000384s    233.9   2603.5
    rsa 1024 bits 0.020970s 0.001074s     47.7    931.4
    rsa 2048 bits 0.126154s 0.003447s      7.9    290.1
    rsa 4096 bits 0.815385s 0.011892s      1.2     84.1
                      sign    verify    sign/s verify/s
    dsa  512 bits 0.003905s 0.004307s    256.1    232.2
    dsa 1024 bits 0.010628s 0.012586s     94.1     79.5
    dsa 2048 bits 0.033707s 0.040121s     29.7     24.9
                                  sign    verify    sign/s verify/s
     160 bit ecdsa (secp160r1)   0.0029s   0.0120s    345.5     83.1
     192 bit ecdsa (nistp192)   0.0031s   0.0130s    323.6     77.2
     224 bit ecdsa (nistp224)   0.0038s   0.0169s    261.6     59.3
     256 bit ecdsa (nistp256)   0.0042s   0.0186s    238.4     53.7
     384 bit ecdsa (nistp384)   0.0086s   0.0431s    115.7     23.2
     521 bit ecdsa (nistp521)   0.0214s   0.1120s     46.7      8.9
     163 bit ecdsa (nistk163)   0.0059s   0.0286s    169.7     35.0
     233 bit ecdsa (nistk233)   0.0127s   0.0541s     78.9     18.5
     283 bit ecdsa (nistk283)   0.0192s   0.1005s     52.1     10.0
     409 bit ecdsa (nistk409)   0.0484s   0.2326s     20.6      4.3
     571 bit ecdsa (nistk571)   0.1176s   0.5447s      8.5      1.8
     163 bit ecdsa (nistb163)   0.0059s   0.0310s    168.7     32.3
     233 bit ecdsa (nistb233)   0.0125s   0.0588s     79.7     17.0
     283 bit ecdsa (nistb283)   0.0194s   0.1139s     51.6      8.8
     409 bit ecdsa (nistb409)   0.0485s   0.2650s     20.6      3.8
     571 bit ecdsa (nistb571)   0.1176s   0.6256s      8.5      1.6
                                  op      op/s
     160 bit ecdh (secp160r1)   0.0103s     97.0
     192 bit ecdh (nistp192)   0.0110s     91.0
     224 bit ecdh (nistp224)   0.0145s     68.9
     256 bit ecdh (nistp256)   0.0158s     63.3
     384 bit ecdh (nistp384)   0.0357s     28.0
     521 bit ecdh (nistp521)   0.0934s     10.7
     163 bit ecdh (nistk163)   0.0138s     72.3
     233 bit ecdh (nistk233)   0.0264s     37.8
     283 bit ecdh (nistk283)   0.0503s     19.9
     409 bit ecdh (nistk409)   0.1146s      8.7
     571 bit ecdh (nistk571)   0.2724s      3.7
     163 bit ecdh (nistb163)   0.0153s     65.5
     233 bit ecdh (nistb233)   0.0291s     34.4
     283 bit ecdh (nistb283)   0.0564s     17.7
     409 bit ecdh (nistb409)   0.1312s      7.6
     571 bit ecdh (nistb571)   0.3125s      3.2
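
    To see how far apart the two runs really are, the throughput columns can be compared row by row. A small Python sketch using a few 8192-byte figures copied from the two tables above. The helper names are made up, and since each build was only run once, differences of a few percent may simply be noise.

```python
# 8192-byte column, in 1000s of bytes per second, copied from the two runs above.
SOFT = {"md5": 32814.42, "sha1": 18874.59, "aes-128 cbc": 9451.04, "aes-256 cbc": 7102.52}
HARD = {"md5": 32781.65, "sha1": 18701.40, "aes-128 cbc": 9068.54, "aes-256 cbc": 6599.87}


def percent_faster(soft, hard):
    """Throughput gain of the soft-float build relative to the hard-float build."""
    return (soft - hard) / hard * 100.0


def compare(soft_table, hard_table):
    """Map each cipher or digest name to its soft-vs-hard gain in percent."""
    return {name: percent_faster(soft_table[name], hard_table[name])
            for name in soft_table}


if __name__ == "__main__":
    # Largest gain first.
    for name, pct in sorted(compare(SOFT, HARD).items(), key=lambda kv: -kv[1]):
        print(f"{name:12s} {pct:+6.2f}%")
```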
  11. RMerlin

    RMerlin Network Guru Member

    Interesting results on some of these, such as the AES cryptos.

    I would have to see what it looks like on my end, since I use more advanced ASM optimizations in my openssl source tree than Tomato does (I backported a portion of the 1.0.2 ASM code for AES).
  12. ryzhov_al

    ryzhov_al Addicted to LI Member

    Nothing interesting in OpenSSL benchmarks except SHA1.
  13. gatorback

    gatorback LI Guru Member

    Is there a command that will indicate whether or not FPU emulation was included in the firmware build?
    Constructive suggestions are appreciated: thank you.

    I tried: nvram show | grep CONFIG_MIPS_FPU_EMU

    however this was not helpful.
  14. Monk E. Boy

    Monk E. Boy Network Guru Member

    All builds will include FPU emulation. Tomato has no control over what Broadcom does in their closed-source drivers, so Tomato must support FPU emulation in the firmware. Whether the other userspace code uses FPU emulation is the question.
    Last edited: Aug 25, 2015
  15. Superhai

    Superhai Serious Server Member

    FPU emulation is usually invisible to the code if it is properly implemented. Usually it "catches" the use of the instruction set if the instructions are not implemented, and jumps to some alternative code. The reason is so you can use the same software on CPUs both with and without the extended instructions.
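
    The catch-and-jump idea can be modelled in a few lines. This is a toy Python sketch only: ToyCPU, fpu_emulator and the instruction names are all invented for illustration, and no real CPU or kernel is implemented this way.

```python
class ToyCPU:
    """Toy machine: integer ops are 'native', floating-point ops are not."""

    NATIVE = {
        "add": lambda a, b: a + b,
        "mul": lambda a, b: a * b,
    }

    def __init__(self, emulator=None):
        self.emulator = emulator  # handler for unimplemented instructions
        self.traps = 0            # how often the slow path was taken

    def execute(self, op, a, b):
        if op in self.NATIVE:
            return self.NATIVE[op](a, b)
        # Unimplemented instruction: "trap" and hand over to the emulator.
        self.traps += 1
        if self.emulator is None:
            raise RuntimeError(f"illegal instruction: {op}")
        return self.emulator(op, a, b)


def fpu_emulator(op, a, b):
    """Alternative code that stands in for the missing FPU instructions."""
    if op == "fadd":
        return float(a) + float(b)
    if op == "fmul":
        return float(a) * float(b)
    raise RuntimeError(f"illegal instruction: {op}")


if __name__ == "__main__":
    cpu = ToyCPU(emulator=fpu_emulator)
    print(cpu.execute("add", 1, 2))
    print(cpu.execute("fadd", 1.5, 2.25))
    print("traps taken:", cpu.traps)
```

    The program issuing "fadd" never notices the difference, but every such instruction pays for the detour through the handler. Soft-float code avoids that detour by never issuing the instruction in the first place.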

    There is no "magic" in creating floating-point arithmetic using integer operations, but different CPUs will handle them differently, and hence the gain will also depend on the model in your router (or computer). GCC surely must have built-in ways to handle floating-point arithmetic when told not to use certain instruction sets of the CPU it is compiling for.
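
    As an illustration of the "no magic" point, here is a Python sketch that multiplies two IEEE 754 single-precision values using only integer operations on their bit patterns. It is not GCC's code (libgcc ships its own routines for this, such as __mulsf3); it handles only normal numbers with round-to-nearest-even, and the function names are made up.

```python
import struct


def f32_bits(x):
    """Bit pattern of x rounded to IEEE 754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b):
    """Float value of a 32-bit single-precision bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def soft_mul(a_bits, b_bits):
    """Multiply two normal float32 bit patterns with integer ops only."""
    sign = (a_bits ^ b_bits) & 0x80000000
    ea = (a_bits >> 23) & 0xFF
    eb = (b_bits >> 23) & 0xFF
    if ea in (0, 255) or eb in (0, 255):
        raise ValueError("zero, subnormal, inf and NaN are not handled here")
    ma = (a_bits & 0x7FFFFF) | 0x800000   # restore the implicit leading 1
    mb = (b_bits & 0x7FFFFF) | 0x800000
    prod = ma * mb                         # 48-bit product in [2**46, 2**48)
    exp = ea + eb - 127                    # exponents add, remove one bias
    shift = 23
    if prod & (1 << 47):                   # product >= 2.0: renormalize
        shift = 24
        exp += 1
    mant = prod >> shift                   # 24-bit significand
    rem = prod & ((1 << shift) - 1)        # discarded low bits
    half = 1 << (shift - 1)
    if rem > half or (rem == half and (mant & 1)):
        mant += 1                          # round to nearest, ties to even
        if mant == 1 << 24:                # rounding carried out of 24 bits
            mant >>= 1
            exp += 1
    if not 0 < exp < 255:
        raise OverflowError("result is outside the normal float32 range")
    return sign | (exp << 23) | (mant & 0x7FFFFF)


if __name__ == "__main__":
    r = soft_mul(f32_bits(1.5), f32_bits(2.0))
    print(hex(r), bits_f32(r))
```

    One floating-point multiply turns into an integer multiply plus a handful of shifts, masks and compares, so how fast it runs depends heavily on the integer core underneath.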

    Compiler optimizations are usually able to infer shortcuts in calculations, and they get better with each version. What they can't tell is whether you will run the code on a proper FPU or on an emulated one, so when compiling it is always better to tell the compiler if you know this beforehand for your specific scenario.