A little theory first. As you may know, every router has an FPU on board. More precisely, it's not a real FPU but an FPU emulator, which means every FPU instruction is broken down on the fly into simple arithmetic operations that the CPU can execute. I don't know how this emulator became a standard part of MIPS SoCs; maybe it's part of some spec like MIPS R3000. So every program can be compiled to use the FPU or not: hard-float and soft-float code respectively. I found that some floating-point operations run twice as fast when the FPU emulator is not used. I know we can't recompile the whole Tomato code base as soft-float because of the proprietary Broadcom binaries, but we can use soft-float code for any user-space software. You can run your own tests on Tomato using either of the two Entware feeds:
* a hard-float feed (which has been available all along): wget -O - http://entware.wl500g.info/binaries/entware/installer/entware_install.sh | sh
* a soft-float feed (created for the DD-WRT and Zyxel Keenetic folks, who use soft-float code for their kernels): wget -O - http://entware.wl500g.info/binaries/mipselsf/installer/entware_install.sh | sh
There are no barriers to using the soft-float repo on hard-float based firmware (including Tomato), but not vice versa: on soft-float firmware you'll get "Bus error" when trying to use the hard-float repo.

Interesting, thanks. If my maths is correct you get a 500% boost on double precision on the N14U! I recall the Raspberry Pi folk got a boost going to hard-float because their Broadcom CPU had a real hardware FPU. I wonder why MIPS hardware emulation is so poor; perhaps the GCC optimiser is much better than it used to be, or the emulation doesn't properly optimize for the hardware? I also wonder why a router needs double precision arithmetic, for cryptography maybe? C defaults to double precision for real variables, so it may be possible to optimize further by dropping down to scaled integers on embedded devices. Real PCs have had real hard float since the 486 (before that we had to buy 80287/80387 co-processors), so there can be no benefit to standard library devs there. The RPi folk also replaced some libc functions such as memcpy/memset with hand-optimized assembler; I wonder if there's any mileage in that for our routers?
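To make the "scaled integers" idea concrete, here is a minimal sketch of Q16.16 fixed-point arithmetic. The names (FRAC_BITS, to_fixed, fixed_mul, fixed_div) are made up for illustration and do not come from any router code base; Python integers never overflow, so a C version would need a 64-bit intermediate for the multiply.

```python
# Q16.16 fixed point: a real value x is stored as the integer round(x * 2**16).
# All names here are illustrative assumptions, not taken from any firmware.
FRAC_BITS = 16
ONE = 1 << FRAC_BITS


def to_fixed(x: float) -> int:
    """Convert a float to Q16.16 (only needed at the edges of the program)."""
    return int(round(x * ONE))


def to_float(q: int) -> float:
    """Convert Q16.16 back to a float for display."""
    return q / ONE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two Q16.16 values using integer operations only."""
    # The raw product carries 32 fractional bits, so shift 16 of them away.
    return (a * b) >> FRAC_BITS


def fixed_div(a: int, b: int) -> int:
    """Divide two Q16.16 values using integer operations only."""
    # Pre-scale the dividend so the quotient keeps 16 fractional bits.
    return (a << FRAC_BITS) // b


if __name__ == "__main__":
    a, b = to_fixed(3.25), to_fixed(1.5)
    print(to_float(fixed_mul(a, b)))  # exact here: 4.875
    print(to_float(fixed_div(a, b)))  # close to 2.1667, truncated to 16 fractional bits
```

The trade-off is range and precision: Q16.16 resolves steps of 1/65536 and nothing finer, which is why this only pays off where the value range is known in advance.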

So let me see if I understand correctly. I had to read this through a few times, so let me know if I'm right. Generally hard-float > soft-float. But in the case of our routers, our 'hard-float' is really a 'soft-float' emulated in hardware (poorly, or at least not as efficiently as it could be). So letting GCC figure out the 'soft-float' stuff is more efficient than our router's fake 'hard-float'. So you're saying this new soft-float version of Entware is preferable and likely faster than the hard-float version we've been using so far? Does this also mean that Entware was not usable on DD-WRT soft-float firmware, but this new soft-float Entware is? Interesting that DD-WRT uses soft-float for their kernel. I'm guessing this is because they have access to the Broadcom source that we do not? This may explain why I've seen them sometimes use two toolchains for one firmware in DD-WRT (presumably before they had the Broadcom source). I'm guessing they were using a hard-float chain for the kernel, and then a soft-float one for user-space binaries. Edit: please let me know if I've erred in any of this.

Not 500. 50 percent, which means floating-point calculations run twice as fast. Yep, GCC does this job better than the onboard FPU emulation. I suspect that's because of context switching between CPU and FPU, which takes too much time and lowers overall FP performance. Crypto, encoding, compression maybe. Real-life tests are needed. As an example, we could compare the performance of: libvorbisidec (a fixed-point implementation of libvorbis), libvorbis on hard-float, and libvorbis on soft-float. They are all present in Entware. Right. You decide. On a hard-float based kernel you can use either of the two Entware feeds. Yes. DD-WRT switched to a soft-float kernel about a year ago, and that step made Optware and Entware incompatible with their builds. A strange decision, isn't it?

If "50%" means the benchmark ran twice as fast on soft-float code, what does "18%" mean? I assumed it took 18% of the time (double precision division on the N14). I agree real-world tests are needed, under conditions where the CPU is the limit. I'm still unsure why even crypto needs floating point (it should all be exact integer arithmetic). I'm sure lossy JPEG/MPEG use float/double arithmetic, but I wouldn't do that on a router!

Please open the link from the first post to clear up any misunderstandings; the raw numbers are on the first sheet. I think that conclusion may be heading in the wrong direction. All the well-known vendors put an FPU emulator on their SoCs, especially the ones designed for routers. I doubt it's done just for fun.

OK, I now see the calcs: you are reporting % improvement, so bigger is better. Float mul on the N66 is 58% better with soft-float, i.e. hard-float takes 2.4 times as long as soft-float does. But double precision division must be avoided in either mode! No wonder MIPS is losing out to ARM!
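For anyone else tripping over the percentages, here is the arithmetic as a small sketch. It assumes the figures are computed as (t_hard - t_soft) / t_hard, which is inferred from "50% means twice as fast" and "58% means 2.4 times the time" in this thread and not confirmed against the spreadsheet formulas; the function names are made up.

```python
def improvement_pct(t_hard: float, t_soft: float) -> float:
    """Percent of the hard-float run time saved by the soft-float build.

    Assumed definition, inferred from the numbers quoted in the thread.
    """
    return (t_hard - t_soft) / t_hard * 100.0


def time_ratio(pct: float) -> float:
    """How many times longer hard-float takes than soft-float."""
    return 1.0 / (1.0 - pct / 100.0)


if __name__ == "__main__":
    for pct in (18, 50, 58):
        print(f"{pct}% improvement -> hard-float takes {time_ratio(pct):.2f}x as long")
    # 18% -> 1.22x, 50% -> 2.00x, 58% -> 2.38x
```

So under that reading "18%" is a modest gain (hard-float takes about 1.2 times as long), not "18% of the time".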

I would expect crypto code to use integer maths, but just out of curiosity, have you tried an openssl benchmark (openssl speed) with both methods?

Al, you're always doing the coolest stuff. If I had to guess, I'd say the FPU emulation is done so they can share code with open source projects without rewrites and the testing it requires. I wouldn't necessarily assume that ARM includes a full FPU in every CPU. Up until ARMv7 they were doing the same FPU emulation. Even so it's kind of rash to think that all ARM CPUs from now on will include an FPU, because most ARM CPUs are tailored by the vendor for their product, and some vendors may decide to not include an FPU for power/heat/cost savings. ARM isn't Intel, it's far more customized, quite literally tailor made for the embedded market.

Here are my unscientifically produced numbers. Ran 'openssl speed' once for soft float and hard float. Each time traffic should have been nearly the same when run (almost nothing). Ran on an Asus RT-N16.

Soft Float Code:

OpenSSL 1.0.1g 7 Apr 2014
built on: Thu Apr 24 18:32:50 MDT 2014
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
compiler: mipsel-linux-gcc -fPIC -DOPENSSL_PIC -DZLIB_SHARED -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -DTERMIO -O3 -mtune=mips32 -mips32 -fomit-frame-pointer -Wall -DSHA1_ASM -DSHA256_ASM -DAES_ASM
The 'numbers' are in 1000s of bytes per second processed.
type                16 bytes   64 bytes  256 bytes 1024 bytes 8192 bytes
md2                     0.00       0.00       0.00       0.00       0.00
mdc2                 584.24k    694.21k    722.16k    726.67k    734.55k
md4                 1874.27k   6614.89k  19694.79k  38464.98k  55371.12k
md5                 1768.31k   5820.47k  15489.72k  26387.59k  32814.42k
hmac(md5)           1843.85k   6165.20k  15748.90k  26661.02k  33067.38k
sha1                1452.11k   4464.63k  10526.17k  15860.36k  18874.59k
rmd160              1315.37k   3714.36k   8271.45k  11849.29k  13403.10k
rc4                25043.54k  27192.87k  28237.82k  28571.98k  28445.35k
des cbc             3933.69k   4054.42k   4118.36k   4156.48k   4192.86k
des ede3            1437.56k   1484.04k   1492.25k   1486.51k   1498.20k
idea cbc            6268.35k   6675.16k   6860.03k   6917.50k   6861.14k
seed cbc            5081.87k   5380.21k   5490.77k   5505.28k   5526.18k
rc2 cbc             3631.47k   3642.28k   3767.85k   3712.77k   3751.22k
rc5-32/12 cbc           0.00       0.00       0.00       0.00       0.00
blowfish cbc        8809.23k   9629.74k   9919.62k  10038.99k  10080.56k
cast cbc            7226.11k   7739.35k   7924.91k   7931.25k   8029.81k
aes-128 cbc         8324.01k   9071.96k   9281.15k   9314.33k   9451.04k
aes-192 cbc         7283.63k   7756.67k   7928.01k   7966.79k   7994.73k
aes-256 cbc         6281.32k   6778.66k   6952.96k   6966.61k   7102.52k
camellia-128 cbc    6285.68k   6832.04k   7021.80k   7068.66k   7087.86k
camellia-192 cbc    4970.44k   5316.48k   5468.95k   5503.83k   5495.24k
camellia-256 cbc    4971.77k   5310.34k   5473.85k   5466.16k   5500.74k
sha256              1428.26k   3420.51k   6230.50k   8041.72k   8637.82k
sha512               461.34k   1839.30k   3007.36k   4332.41k   4990.18k
whirlpool            290.16k    583.72k    953.92k   1120.26k   1202.42k
aes-128 ige         8112.64k   8855.88k   9184.17k   9204.09k   9260.52k
aes-192 ige         7058.85k   7516.77k   7916.61k   7968.64k   7990.65k
aes-256 ige         6203.51k   6662.04k   6966.49k   6877.29k   6886.74k
ghash               9709.76k  10603.62k  10822.74k  10956.46k  10954.42k
                     sign     verify   sign/s  verify/s
rsa 512 bits    0.004262s  0.000375s    234.6    2663.7
rsa 1024 bits   0.020892s  0.001074s     47.9     931.0
rsa 2048 bits   0.126026s  0.003454s      7.9     289.5
rsa 4096 bits   0.816154s  0.011937s      1.2      83.8
                     sign     verify   sign/s  verify/s
dsa 512 bits    0.003873s  0.004248s    258.2     235.4
dsa 1024 bits   0.010670s  0.012605s     93.7      79.3
dsa 2048 bits   0.033986s  0.040656s     29.4      24.6
                                sign    verify   sign/s  verify/s
160 bit ecdsa (secp160r1)    0.0029s   0.0119s    347.0      84.2
192 bit ecdsa (nistp192)     0.0031s   0.0132s    322.9      76.0
224 bit ecdsa (nistp224)     0.0038s   0.0171s    262.4      58.6
256 bit ecdsa (nistp256)     0.0042s   0.0184s    238.5      54.5
384 bit ecdsa (nistp384)     0.0087s   0.0442s    115.4      22.6
521 bit ecdsa (nistp521)     0.0215s   0.1124s     46.4       8.9
163 bit ecdsa (nistk163)     0.0059s   0.0286s    168.1      35.0
233 bit ecdsa (nistk233)     0.0126s   0.0535s     79.5      18.7
283 bit ecdsa (nistk283)     0.0193s   0.1002s     51.9      10.0
409 bit ecdsa (nistk409)     0.0488s   0.2312s     20.5       4.3
571 bit ecdsa (nistk571)     0.1179s   0.5458s      8.5       1.8
163 bit ecdsa (nistb163)     0.0059s   0.0309s    170.3      32.4
233 bit ecdsa (nistb233)     0.0126s   0.0591s     79.5      16.9
283 bit ecdsa (nistb283)     0.0193s   0.1144s     51.8       8.7
409 bit ecdsa (nistb409)     0.0488s   0.2642s     20.5       3.8
571 bit ecdsa (nistb571)     0.1170s   0.6300s      8.5       1.6
                                  op     op/s
160 bit ecdh (secp160r1)     0.0099s    100.6
192 bit ecdh (nistp192)      0.0111s     89.7
224 bit ecdh (nistp224)      0.0143s     69.8
256 bit ecdh (nistp256)      0.0154s     64.9
384 bit ecdh (nistp384)      0.0366s     27.4
521 bit ecdh (nistp521)      0.0944s     10.6
163 bit ecdh (nistk163)      0.0141s     71.0
233 bit ecdh (nistk233)      0.0266s     37.6
283 bit ecdh (nistk283)      0.0506s     19.8
409 bit ecdh (nistk409)      0.1152s      8.7
571 bit ecdh (nistk571)      0.2727s      3.7
163 bit ecdh (nistb163)      0.0152s     65.7
233 bit ecdh (nistb233)      0.0293s     34.1
283 bit ecdh (nistb283)      0.0560s     17.9
409 bit ecdh (nistb409)      0.1312s      7.6
571 bit ecdh (nistb571)      0.3156s      3.2

Hard Float Code:

OpenSSL 1.0.1g 7 Apr 2014
built on: Thu Apr 24 20:11:07 MDT 2014
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
compiler: mipsel-linux-gcc -fPIC -DOPENSSL_PIC -DZLIB_SHARED -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -DTERMIO -O3 -mtune=mips32 -mips32 -fomit-frame-pointer -Wall -DSHA1_ASM -DSHA256_ASM -DAES_ASM
The 'numbers' are in 1000s of bytes per second processed.
type                16 bytes   64 bytes  256 bytes 1024 bytes 8192 bytes
md2                     0.00       0.00       0.00       0.00       0.00
mdc2                 567.18k    685.34k    723.41k    730.89k    736.73k
md4                 2105.09k   7293.92k  21040.62k  40181.96k  54202.83k
md5                 1440.36k   4900.01k  13663.28k  25054.78k  32781.65k
hmac(md5)           1831.77k   6107.59k  15903.06k  26491.76k  33159.91k
sha1                1362.45k   4204.97k  10085.97k  15672.01k  18701.40k
rmd160              1206.47k   3554.06k   8117.54k  11670.54k  13497.69k
rc4                24932.54k  27186.52k  27941.13k  28570.62k  28669.28k
des cbc             3964.01k   4080.02k   4105.10k   4124.24k   4123.13k
des ede3            1457.54k   1482.33k   1493.73k   1470.68k   1477.83k
idea cbc            6236.94k   6707.54k   6800.11k   6882.30k   6862.17k
seed cbc            5046.29k   5370.96k   5475.08k   5482.47k   5505.02k
rc2 cbc             3627.96k   3715.79k   3741.70k   3778.83k   3739.47k
rc5-32/12 cbc           0.00       0.00       0.00       0.00       0.00
blowfish cbc        8874.75k   9699.24k   9938.09k  10021.93k  10007.89k
cast cbc            7252.67k   7754.65k   7890.75k   8044.24k   8111.19k
aes-128 cbc         8084.09k   8361.51k   9017.17k   9050.66k   9068.54k
aes-192 cbc         6883.57k   7426.16k   7757.57k   7777.64k   7737.49k
aes-256 cbc         6125.40k   6398.79k   6865.83k   6607.89k   6599.87k
camellia-128 cbc    6282.34k   6875.01k   6986.77k   7037.46k   7154.90k
camellia-192 cbc    4995.31k   5325.25k   5431.95k   5472.79k   5494.10k
camellia-256 cbc    4987.13k   5355.19k   5496.18k   5456.70k   5455.87k
sha256              1431.99k   3421.62k   6333.32k   7907.67k   8579.75k
sha512               437.80k   1770.33k   2957.14k   4293.29k   4936.97k
whirlpool            291.01k    589.14k    950.52k   1124.35k   1187.84k
aes-128 ige         7780.79k   8336.23k   9321.95k   9046.50k   8869.21k
aes-192 ige         6848.40k   7368.26k   7679.49k   7415.47k   7714.13k
aes-256 ige         5854.35k   6358.45k   6442.27k   6712.35k   6442.01k
ghash               9617.40k  10570.10k  10827.86k  10913.06k  11023.70k
                     sign     verify   sign/s  verify/s
rsa 512 bits    0.004276s  0.000384s    233.9    2603.5
rsa 1024 bits   0.020970s  0.001074s     47.7     931.4
rsa 2048 bits   0.126154s  0.003447s      7.9     290.1
rsa 4096 bits   0.815385s  0.011892s      1.2      84.1
                     sign     verify   sign/s  verify/s
dsa 512 bits    0.003905s  0.004307s    256.1     232.2
dsa 1024 bits   0.010628s  0.012586s     94.1      79.5
dsa 2048 bits   0.033707s  0.040121s     29.7      24.9
                                sign    verify   sign/s  verify/s
160 bit ecdsa (secp160r1)    0.0029s   0.0120s    345.5      83.1
192 bit ecdsa (nistp192)     0.0031s   0.0130s    323.6      77.2
224 bit ecdsa (nistp224)     0.0038s   0.0169s    261.6      59.3
256 bit ecdsa (nistp256)     0.0042s   0.0186s    238.4      53.7
384 bit ecdsa (nistp384)     0.0086s   0.0431s    115.7      23.2
521 bit ecdsa (nistp521)     0.0214s   0.1120s     46.7       8.9
163 bit ecdsa (nistk163)     0.0059s   0.0286s    169.7      35.0
233 bit ecdsa (nistk233)     0.0127s   0.0541s     78.9      18.5
283 bit ecdsa (nistk283)     0.0192s   0.1005s     52.1      10.0
409 bit ecdsa (nistk409)     0.0484s   0.2326s     20.6       4.3
571 bit ecdsa (nistk571)     0.1176s   0.5447s      8.5       1.8
163 bit ecdsa (nistb163)     0.0059s   0.0310s    168.7      32.3
233 bit ecdsa (nistb233)     0.0125s   0.0588s     79.7      17.0
283 bit ecdsa (nistb283)     0.0194s   0.1139s     51.6       8.8
409 bit ecdsa (nistb409)     0.0485s   0.2650s     20.6       3.8
571 bit ecdsa (nistb571)     0.1176s   0.6256s      8.5       1.6
                                  op     op/s
160 bit ecdh (secp160r1)     0.0103s     97.0
192 bit ecdh (nistp192)      0.0110s     91.0
224 bit ecdh (nistp224)      0.0145s     68.9
256 bit ecdh (nistp256)      0.0158s     63.3
384 bit ecdh (nistp384)      0.0357s     28.0
521 bit ecdh (nistp521)      0.0934s     10.7
163 bit ecdh (nistk163)      0.0138s     72.3
233 bit ecdh (nistk233)      0.0264s     37.8
283 bit ecdh (nistk283)      0.0503s     19.9
409 bit ecdh (nistk409)      0.1146s      8.7
571 bit ecdh (nistk571)      0.2724s      3.7
163 bit ecdh (nistb163)      0.0153s     65.5
233 bit ecdh (nistb233)      0.0291s     34.4
283 bit ecdh (nistb283)      0.0564s     17.7
409 bit ecdh (nistb409)      0.1312s      7.6
571 bit ecdh (nistb571)      0.3125s      3.2
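Eyeballing two dumps like these is error-prone, so here is a rough sketch of how the throughput rows could be compared. It assumes each saved 'openssl speed' output has one table row per line (as openssl prints it), and the helper names parse_throughput and percent_change are invented for this example. The two sample rows are copied from the results above.

```python
import re

# A throughput cell looks like "6281.32k" (or "0.00" for disabled ciphers).
VALUE = re.compile(r"\d+\.\d+k?")


def parse_throughput(text: str) -> dict:
    """Map cipher name -> five throughput figures (in 1000s of bytes/s)."""
    table = {}
    for line in text.splitlines():
        tokens = line.split()
        # Throughput rows end in exactly five numeric cells; the rsa/dsa/ecdsa/ecdh
        # rows end in timings such as "0.0029s" and are skipped.
        if len(tokens) >= 6 and all(VALUE.fullmatch(t) for t in tokens[-5:]):
            name = " ".join(tokens[:-5])
            table[name] = [float(t.rstrip("k")) for t in tokens[-5:]]
    return table


def percent_change(soft: dict, hard: dict) -> dict:
    """Per-cell change of soft-float relative to hard-float; positive = soft is faster."""
    return {
        name: [(s - h) / h * 100.0 if h else 0.0 for s, h in zip(soft[name], hard[name])]
        for name in soft
        if name in hard
    }


SOFT_SAMPLE = """\
md5 1768.31k 5820.47k 15489.72k 26387.59k 32814.42k
aes-256 cbc 6281.32k 6778.66k 6952.96k 6966.61k 7102.52k
rsa 512 bits 0.004262s 0.000375s 234.6 2663.7
"""
HARD_SAMPLE = """\
md5 1440.36k 4900.01k 13663.28k 25054.78k 32781.65k
aes-256 cbc 6125.40k 6398.79k 6865.83k 6607.89k 6599.87k
rsa 512 bits 0.004276s 0.000384s 233.9 2603.5
"""

if __name__ == "__main__":
    diff = percent_change(parse_throughput(SOFT_SAMPLE), parse_throughput(HARD_SAMPLE))
    for name, cells in diff.items():
        print(name, " ".join(f"{c:+.1f}%" for c in cells))
```

On those two rows the soft-float build comes out roughly 23% ahead for md5 at 16 bytes and about 8% ahead for aes-256 cbc at 8192 bytes, but with a single run per build some of that may simply be noise.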

Interesting results on some of these, such as the AES ciphers. I would have to see what it looks like on my end, since I use more advanced ASM optimizations in my openssl source tree than Tomato does (I backported a portion of the 1.0.2 ASM code for AES).

Is there a command that will indicate whether or not FPU emulation was included in the firmware build? Constructive suggestions are appreciated: thank you. I tried: nvram show | grep CONFIG_MIPS_FPU_EMU, but this was not helpful.

All builds will include FPU emulation. Tomato has no control over what Broadcom does in their closed-source drivers, so Tomato must support FPU emulation in the firmware. The question is whether the other userspace code uses FPU emulation or not.
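A related check that can be scripted is whether the kernel sees a hardware FPU at all. The sketch below rests on an assumption worth verifying on your own kernel: as far as I know, MIPS Linux appends an "FPU Vx.y" suffix to the "cpu model" line of /proc/cpuinfo only when a hardware FPU is detected. It does not detect the emulator itself, and the function name has_hardware_fpu and the sample lines in the comments are made up for illustration.

```python
def has_hardware_fpu(cpuinfo_text: str):
    """Return True/False based on the 'cpu model' line, or None if there is no such line.

    Assumption: MIPS kernels print e.g. 'cpu model : <name> V4.9  FPU V0.0'
    when a hardware FPU is present and omit the FPU part otherwise.
    """
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "cpu model":
            return "FPU" in value
    return None  # not a MIPS-style cpuinfo (e.g. run on a PC)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("hardware FPU reported:", has_hardware_fpu(f.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

If this reports False on a router that still runs hard-float binaries without errors, the floating-point instructions are being handled by emulation.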

FPU emulation is usually invisible to the code if it is properly implemented. Usually it "catches" the use of floating-point instructions that the CPU does not implement and jumps to some alternative code. The reason is so you can use the same software on CPUs both with and without the extended instructions. There is no "magic" in building floating-point arithmetic out of integer operations, but different CPUs will handle them differently, and hence the gain will also depend on the model in your router (or computer). GCC surely has built-in routines for handling floating-point arithmetic when told not to use certain instruction sets of the CPU it is compiling for. Compiler optimizations can usually infer shortcuts in calculations, and they get better with each version. What the compiler can't tell is whether you will run the code on a proper FPU or on an emulated one, so if you know this beforehand for your specific scenario, it is always better to tell the compiler.
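To show there really is no magic, here is a simplified sketch of a single-precision multiply done with integer operations only. It is not GCC's actual soft-float routine: it handles only normal, finite inputs and results (no zeros, subnormals, infinities or NaNs), and the function names are invented for this example. The struct module is used purely to move bits in and out of Python floats.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x rounded to IEEE 754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """Single-precision value represented by the 32-bit pattern b."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def soft_mul_bits(a: int, b: int) -> int:
    """Multiply two single-precision bit patterns using only integer operations."""
    sign = (a ^ b) & 0x80000000
    ea, eb = (a >> 23) & 0xFF, (b >> 23) & 0xFF
    if not (0 < ea < 255 and 0 < eb < 255):
        raise ValueError("sketch handles normal, finite inputs only")
    # Restore the implicit leading 1 of each 24-bit significand.
    ma = (a & 0x7FFFFF) | 0x800000
    mb = (b & 0x7FFFFF) | 0x800000
    prod = ma * mb            # exact 48-bit product, in [2**46, 2**48)
    exp = ea + eb - 127       # add exponents, remove one copy of the bias
    shift = 23
    if prod & (1 << 47):      # product is in [2, 4): renormalise
        shift = 24
        exp += 1
    mant = prod >> shift
    rem = prod & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    # Round to nearest, ties to even.
    if rem > half or (rem == half and (mant & 1)):
        mant += 1
        if mant == 1 << 24:   # rounding carried out of the significand
            mant >>= 1
            exp += 1
    if not 0 < exp < 255:
        raise OverflowError("result outside the normal range")
    return sign | (exp << 23) | (mant & 0x7FFFFF)


def soft_mul(x: float, y: float) -> float:
    """Convenience wrapper working on Python floats rounded to single precision."""
    return bits_f32(soft_mul_bits(f32_bits(x), f32_bits(y)))


if __name__ == "__main__":
    print(soft_mul(1.5, 2.5))   # 3.75, exactly representable
    print(soft_mul(0.1, 3.0))   # single-precision product of the rounded inputs
```

Even this stripped-down version needs a handful of shifts, masks and branches per multiply, which gives a feel for the work a soft-float library routine does on each operation.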