OpenSSL assembler acceleration

Discussion in 'Tomato Firmware' started by ryzhov_al, Oct 21, 2012.

  1. ryzhov_al

    ryzhov_al Networkin' Nut Member

    Toastman, shibby20, a very interesting patch landed in OpenWRT's trunk last month. In my tests, for example, it gives up to a 90% speedup for the SHA-1 algorithm, which is used for torrent hash checking.

    In Entware, sustained torrent download speed is now ~4 MB/s and hash checking speed is ~8 MB/s. Not only torrents clients will get an advantage, but all OpenSSL dependent software too: OpenVPN etc.
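
    To make the hash-checking workload concrete, here is a minimal Python sketch of per-piece SHA-1 verification in the style torrent clients use. The piece length, the sample payload, and all helper names are assumptions for illustration; none of this is taken from Transmission or any other client.

```python
import hashlib


def split_pieces(data: bytes, piece_length: int) -> list[bytes]:
    """Cut the payload into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + piece_length] for i in range(0, len(data), piece_length)]


def piece_hashes(data: bytes, piece_length: int) -> list[bytes]:
    """Compute the 20-byte SHA-1 digest of every piece."""
    return [hashlib.sha1(piece).digest() for piece in split_pieces(data, piece_length)]


def find_bad_pieces(data: bytes, piece_length: int, expected: list[bytes]) -> list[int]:
    """Return the indexes of pieces whose SHA-1 does not match the expected list."""
    actual = piece_hashes(data, piece_length)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Hypothetical 16 KiB payload and 4 KiB pieces, purely for demonstration.
PIECE_LENGTH = 4096
payload = bytes(range(256)) * 64
expected = piece_hashes(payload, PIECE_LENGTH)

# Flip one byte inside the second piece to simulate a corrupted download.
corrupted = bytearray(payload)
corrupted[5000] ^= 0xFF

print(find_bad_pieces(payload, PIECE_LENGTH, expected))
print(find_bad_pieces(bytes(corrupted), PIECE_LENGTH, expected))
```

    Every downloaded byte passes through SHA-1 this way, which is why a faster SHA-1 shows up directly in hash checking speed.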

    Andy Padavan's RT-N56U firmware and Eric Sauvageau's RT-N66U firmware now include this patch, thanks to my pestering. :))

    Would you like to try it?
     
  2. koitsu

    koitsu Network Guru Member

    Reviewed the patch/diff (for the RT-N66U firmware link you provided) -- this looks excellent, truly.

    For firmware authors here who wonder what this all does: instead of using a pure 100% C-based SHA-1 algorithm, Andy Polyakov has written a replacement for the SHA-1 algorithm API functions that is pure MIPS32 and MIPS64 assembly. Perl is used as a "shim" to generate the necessary blocks of assembly code. The speedup on MIPS32 is roughly 30-75%, and over 300% on MIPS64. (Most routers I've used so far are MIPS32.)

    This is some very, very nice work. As a fellow assembly programmer (though I don't do MIPS), this really makes me happy to see. I always love seeing people still using assembly to make the most out of a system, especially embedded.

    I have a couple questions for you though, Ryzhov:

    1. What OpenSSL version should this be applied to? I imagine this matters greatly, since OpenSSL is notorious for changing API semantics and function arguments (as well as types) even between minor versions.

    2. How much testing has been done with these patches on MIPS32 systems? For example, are these patches integrated into stock/base OpenSSL? If not, why not? Is there a mailing list thread on the OpenSSL lists about getting these integrated? (I imagine the NetBSD folks would be very interested too) I ask because it seems something like this should really be done upstream (within OpenSSL natively) so everyone can benefit. Just curious!

    As always, thank you for your work, and keeping on top of things!
     
  3. ryzhov_al

    ryzhov_al Networkin' Nut Member

    For 1.0.1c; Sauvageau also backported it to 1.0.0.

    I don't know about the OpenSSL devs' tests, but it has been in OpenWRT's trunk for a month with no negative consequences at all, at least for now.

    Yes, I think sooner or later it will be part of the official code; see the OpenSSL mailing list for details.
     
  4. shibby20

    shibby20 Network Guru Member

    thank you. I will apply this patch soon.
     
  5. mstombs

    mstombs Network Guru Member

    Great stuff - I didn't know the C compiler's optimization was bad enough to leave that much room for improvement! I understand AES in Tomato is already in asm? I'm not too interested in Transmission, but does this benefit all HTTPS traffic and OpenVPN? What about PPPoE? I assume Broadcom already does this sort of thing in the wireless drivers? But they do have files such as aes.c and sha1.c in the bcmcrypto folder.

    I have also seen asm mentioned re the Raspberry Pi (Broadcom SoC, but ARM not MIPS): they have put optimized versions of memcpy and memset into the C libraries (not uClibc).
     
  6. lefty

    lefty Networkin' Nut Member

    As stated in the first post:

    "Not only torrents clients will get an advantage, but all OpenSSL dependent software too: OpenVPN etc."

    Also, as I said before about the Broadcom wireless driver: it's a binary with no source, so no one but Broadcom could tell you if it already does this sort of thing in the wireless driver...
     
  7. kthaddock

    kthaddock Network Guru Member

    @Lefty
    Are you sure about the new Broadcom ver 6.0 wifi drivers? I'm not!
     
  8. lefty

    lefty Networkin' Nut Member

    Which are you not sure about? Broadcom wireless drivers being closed source? Please feel free to point me to a source link. Otherwise no one can tell you what's in it but Broadcom.
     
  9. PBandJ

    PBandJ Networkin' Nut Member

    I suggest you read more about timing attacks (if you aren't familiar with the topic). In crypto, faster doesn't always mean better. It might make a security system completely broken.
    I'm adding a quote here from Wikipedia to whet your appetite. The emphasis is mine:
    I think it will be wise to sit this one out until some crypto experts, such as the OpenSSL devs, vet it and upstream it.
     
  10. kthaddock

    kthaddock Network Guru Member

     
  11. lefty

    lefty Networkin' Nut Member

    @kthaddock - stop pointing me to CFE sources and point me to the wireless driver sources. No offense, but stick to things you know, not things you think you know. The Broadcom wireless driver is closed source, period, and only Broadcom can change that. The CFE sources are available, but that has nothing to do with the wireless drivers. So as I said, please feel free to point me to the Broadcom wireless driver source, not the CFE sources.
     
  12. mstombs

    mstombs Network Guru Member

    Sorry, been there before - Broadcom wireless drivers are binary objects that get compiled into kernel modules and userspace config tools; they are not standalone entities. They and other modules for switch drivers, fast NAT, etc. are a restriction on any 3rd party firmware since they effectively lock the reported kernel version. We assume that manufacturers such as Asus get access to more non-open-source materials in the Broadcom SDK, clearly including the CFE, but it might still only be access to developers who can make changes for them?
    I think it would be a clear abuse of GPL if they linked in OpenSSL routines at router firmware compile time, but I'm not sure why they provide source to bcmcrypto at all! I only mentioned wireless since it's clear that wireless encryption is a big requirement; I don't think there is a big difference between hardware encryption and heavily optimized hardware-specific code these days.

    I am interested in possible improvements in overall performance in real-life testing and can't really see any downsides: such asm code is heavily exercised and has a clear spec in reproducing the interfaces of the C code it replaces, so you would find out quickly if it didn't work!
     
  13. PBandJ

    PBandJ Networkin' Nut Member

    Like I wrote above, faster isn't necessarily better when it comes to cryptographic code. It may result in security being completely broken.
     
  14. koitsu

    koitsu Network Guru Member

    I want a reference in that Wikipedia article, specifically the "login leaked information" part. In fact I'm going to edit it to request citation references. (Edit: Just throwing this in there: I went to high school and personally knew Paul Kocher. Yep, namedroppin'!).

    I believe (since I'm an old UNIX bastard myself, started 1990 or so) what the page is describing about crypt(3) is that the implementation was intentionally slow to keep people from brute-forcing passwords. I remember reading something about that long ago, but it's been ages. It was determined classic DES crypt was insecure anyway, which is why we've moved to MD5 -- and more recently, SHA512 (at least on FreeBSD).
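
    A rough illustration of that kind of deliberate slowness, using PBKDF2 from Python's standard library rather than crypt(3) itself (so this is a swapped-in technique, and the password, salt, and iteration count are made-up values):

```python
import hashlib
import hmac


def stretch_password(password: bytes, salt: bytes, iterations: int) -> bytes:
    """Derive a 32-byte key; the cost grows linearly with `iterations`."""
    return hashlib.pbkdf2_hmac("sha256", password, salt, iterations)


SALT = b"example-salt"   # hypothetical; a real salt is random and per-user
ITERATIONS = 100_000     # hypothetical work factor, tuned to taste

stored = stretch_password(b"hunter2", SALT, ITERATIONS)

# A legitimate login pays the cost once; a brute-forcer pays it per guess.
right = stretch_password(b"hunter2", SALT, ITERATIONS)
wrong = stretch_password(b"hunter3", SALT, ITERATIONS)
print(hmac.compare_digest(right, stored))
print(hmac.compare_digest(wrong, stored))
```

    The slowness here is a property of the password check only; the hash primitive underneath is still wanted as fast as possible.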

    So I think what you're quoting may be extremely out of context and you're a bit out on a limb with the claim. However, that said, I completely agree that getting an official sign-off from the official OpenSSL folks would definitely be the way to go. The OpenBSD folks should be able to tackle this one for sure, as I imagine they do have someone on their team who does/speaks MIPS.
     
  15. RMerlin

    RMerlin Network Guru Member

    You might want a system call that validates a login to intentionally slow things down. But when it comes to the code that actually does the cryptographic heavy lifting and calculations, you want as much performance as you can. That code will make a big difference in raw throughput for things that have to hash/encrypt/decrypt actual streams (like the OpenVPN example given above).
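
    For anyone who wants a feel for raw hashing throughput, a minimal Python sketch along these lines can measure it. It assumes a CPython build whose hashlib is backed by OpenSSL (usually the case, but not guaranteed), the function name is made up, and the figures are not comparable to numbers measured on a router.

```python
import hashlib
import time


def sha1_throughput_mb_s(block_size: int, duration: float = 0.2) -> float:
    """Feed fixed-size blocks to SHA-1 for about `duration` seconds, return MB/s."""
    block = bytes(block_size)          # zero-filled buffer of the requested size
    digest = hashlib.sha1()
    processed = 0
    elapsed = 0.0
    start = time.perf_counter()
    while elapsed < duration:
        digest.update(block)
        processed += block_size
        elapsed = time.perf_counter() - start
    return processed / elapsed / 1e6


for size in (64, 1024, 8192):
    print(f"{size:5d}-byte blocks: {sha1_throughput_mb_s(size):8.1f} MB/s")
```

    Small blocks are dominated by per-call overhead, so the larger sizes give a better picture of what bulk traffic sees.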

    Otherwise, why not write the IPSEC crypto code in BASIC or Pascal, and insert a few sleep() calls between each block...
     
  16. RMerlin

    RMerlin Network Guru Member

    Check at the end of the Github commit ryzhov_al posted for my own backport, I listed the OpenSSL commits on which I based my backport, in case you want to do your own backport (I'm not sure which OpenSSL version you are using in Tomato).
     
  17. PBandJ

    PBandJ Networkin' Nut Member

    I have to admit I didn't make a very good argument for my claim, and that Wikipedia article, especially the bit about brute-forcing login kinda sunk my argument. However, if you'll lend me your ear for a second I'll try to elaborate. Timing attacks are a type of active attack on cryptographic systems (i.e. an attacker interacts with the system he's trying to attack, as opposed to eavesdropping) that use what is called side-channel info. In this type of attack, the attacker is trying to exploit some weakness in the implementation of the cryptographic system that manifests itself as some actions taking longer (or shorter) amount of time to complete.
    For example: A crypto system will reject ill-formed (i.e. fake) messages/packets. But the process of examining a packet is a complex, multi-stage process. Now if we implement a crypto system to return an error message immediately when we run into the first sign it is fake, an attacker can measure the difference in how long it took the system to reject a message. That will leak information regarding which stage of the test failed. He can now brute-force this check till he clears it, and move on to identify the next one. This can lead to a crypto system being completely broken.
    Here's one example, more recent than the famous DES system: http://codahale.com/a-lesson-in-timing-attacks/
    Here's another example, for a tool exploiting vulnerability to timing-attack to obtain valid usernames: http://pentestmonkey.net/tools/timing-attack-checker.

    So the solution to these types of problems is to write the system in such a way that it goes through all stages of validation even if a message/packet fails the first one. You intentionally make your system run slower (= do redundant work) in order to make it safer. It's not the same as slowing down logins against brute-forcing (i.e. making it take practically forever to scan through a sizable number of credentials). That's the point I was trying to get to (and failed) in my previous post.
    Additionally, you need to write the code in such a way that the compiler/optimizer will not optimize away the redundant work.
    The take-home point here is: When it comes to cryptography, faster isn't necessarily better. A faster implementation might render the system vulnerable.
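
    As a rough sketch of the difference, here are two validators in Python. The packet layout and the three checks are invented for illustration, and the count of checks run stands in for elapsed time; Python itself makes no constant-time guarantees, so this only shows the structure of the idea.

```python
from typing import Callable, List, Tuple

Check = Callable[[bytes], bool]


def has_magic(packet: bytes) -> bool:
    return packet[:2] == b"TM"                     # hypothetical 2-byte header


def has_valid_length(packet: bytes) -> bool:
    return len(packet) == 8                        # hypothetical fixed size


def has_valid_checksum(packet: bytes) -> bool:
    return len(packet) > 0 and sum(packet[:-1]) % 256 == packet[-1]


CHECKS: List[Check] = [has_magic, has_valid_length, has_valid_checksum]


def validate_early_exit(packet: bytes, checks: List[Check]) -> Tuple[bool, int]:
    """Stop at the first failure: the work done reveals which stage failed."""
    ran = 0
    for check in checks:
        ran += 1
        if not check(packet):
            return False, ran
    return True, ran


def validate_all_stages(packet: bytes, checks: List[Check]) -> Tuple[bool, int]:
    """Always run every stage, so good and bad packets do the same work."""
    ok = True
    ran = 0
    for check in checks:
        ran += 1
        ok = check(packet) & ok                    # '&' does not short-circuit
    return ok, ran


def make_packet(magic: bytes, payload: bytes) -> bytes:
    body = magic + payload
    return body + bytes([sum(body) % 256])         # append a 1-byte checksum


GOOD = make_packet(b"TM", b"\x01\x02\x03\x04\x05")
BAD_MAGIC = make_packet(b"XX", b"\x01\x02\x03\x04\x05")
BAD_SUM = GOOD[:-1] + bytes([(GOOD[-1] + 1) % 256])

for name, packet in (("good", GOOD), ("bad magic", BAD_MAGIC), ("bad sum", BAD_SUM)):
    print(name, validate_early_exit(packet, CHECKS), validate_all_stages(packet, CHECKS))
```

    With the early-exit version the number of checks run differs between the "bad magic" and "bad sum" packets, which is exactly the kind of side information an attacker measures.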

    This leads me to the conclusion that it would be wise to sit this one out till the experts decide whether this optimization effort leaks side-info.
    I hope I made my point clear(er) this time around.
     
  18. Claus Andersen

    Claus Andersen Serious Server Member

    You are crying wolf without enough information. Code optimizations which make existing code run more efficiently are always welcome. Optimizations are not shortcuts. A timing attack is not relevant at the code level but at the protocol level. If a protocol requires a specific timing it should be respected by both the optimized and non-optimized code. The flaws you bring to the table would be present in both versions. If we were that scared of timing then we should only use one compiler, in one specific version, to ensure that it always generates the same code!

    What you are talking about is shortcuts. If optimizations are done by taking shortcuts you have a much more difficult decision. A notable example of attempting optimization shortcuts is Google's SSL false start. Shortcuts might work sometimes - but that is not the question in this case.

    In this case it is clearly stated that the optimization has been done by hand coding critical paths. It is a much more trivial task to confirm that the assembler works in the same manner as the previous C code. Those are "simple" optimizations of critical code paths. In this case it is clearly stated that the optimization is converting the SHA-1 algorithm from C to assembler.

    As we cannot all be experts it all comes down to trust. I think it is rather clear cut in this case: if a sufficient number of qualified people like koitsu review the code and say it looks solid, it will be enough for me. Others like you might choose to wait until OpenSSL comes down the mountain.

    I choose to speak up as I feel that you cry wolf without enough of a reason. I fully support your choice to wait until OpenSSL includes it in the mainline (if ever) - however leaving your statements unanswered might scare others. Based on what actually is FUD I think it is harsh to recommend to sit it out. Tomato is a mod community and is the perfect place to test out such features. Real life testing might actually make it easier to get included into mainline.

    OpenSSL is the king and they have done a great job. I do not claim to be much of an expert but I do however claim that your arguments are red herrings. For others who read this and might be in doubt have a look at this to get a more balanced view on the imperfections of OpenSSL.

    Your basic argument is that you do not trust koitsu enough to make a judgement. Neither do I. But if a couple of other verifiable and knowledgeable sources step up I am fine. You choose to wait for OpenSSL - that's fine. But please do not wrap it in pseudo arguments of timing attacks. For the uninitiated it might give credibility to your argument but not to me. This is a trust issue - nothing more. And a valid point - most def.

    Yes - this might be dangerous. So is crossing the road. But so far it actually looks pretty good. A speedup of SHA-1 of 30-75% is a huge accomplishment - kudos!

    Kind Regards,
    Claus
     
  19. mstombs

    mstombs Network Guru Member

    Sorry, I do not agree with your argument in this specific case, but the choice is yours! Maybe the C compiler builds in more checks and better handles error streams? Is there a risk that carefully crafted messages could lead to responses that reveal hidden keys that would allow the easier decoding of communications? Or that classic Microsoft buffer over-run exploit that could allow an attacker to take over the machine?

    Note there's a huge amount of code in Tomato and the mods that is not 'reviewed upstream', heck some people even risk using potentially password stealing trojan infected Chinese rip-off binary only firmwares!

    There are documented attempts to get back-doors into the Linux kernel via changes and we haven't seen a lot of back-ported patches to the relatively old Tomato kernel for quite a while. (You are not paranoid, there really is someone out-there after you!)

    Note you will only see a benefit from this patch if your router CPU is the bottleneck, so maybe Transmission users are the biggest beneficiaries, but if it works as claimed I can see all mods switching to it.
     
  20. PBandJ

    PBandJ Networkin' Nut Member

    @mstombs: Maybe, from a home user perspective this is all good and well but there are other Tomato users out there that might view things differently. Let's, for the sake of argument, say that Toastman agrees with you and incorporates this patch into his build, which is also the basis of EasyTomato - a tomato fork built and advocated by a non-profit whose goals are to make cost-effective solutions for disaster areas (think Haiti earthquake, for example). They don't care much for Transmission performance. I imagine they would care about OpenVPN's security, as well as many others that use this wonderful firmware in their businesses.

    Quickly adopting this patch (and potentially throwing caution to the wind) only makes the prize of finding a vulnerability greater.
     
  21. PBandJ

    PBandJ Networkin' Nut Member

    You couldn't be more wrong. Read the first example I gave, http://codahale.com/a-lesson-in-timing-attacks/, and you'll see for yourself that timing attacks stem from bad implementation. In that example, all it takes is using a trivial (and standard) implementation of byte-string comparison to potentially break a security system.
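
    A minimal Python sketch of that kind of comparison bug, with a made-up secret value and hypothetical helper names. The count of bytes compared stands in for elapsed time; `hmac.compare_digest` is the standard-library function intended for comparisons like this.

```python
import hmac
from typing import Tuple


def leaky_equals(a: bytes, b: bytes) -> Tuple[bool, int]:
    """Trivial comparison: returns at the first mismatch.

    The second value is how many bytes were compared, which grows with the
    length of the correct prefix in the guess.
    """
    if len(a) != len(b):
        return False, 0
    compared = 0
    for x, y in zip(a, b):
        compared += 1
        if x != y:
            return False, compared
    return True, compared


def constant_time_equals(a: bytes, b: bytes) -> bool:
    """Comparison designed not to reveal where the inputs differ."""
    return hmac.compare_digest(a, b)


SECRET = b"s3cr3t-mac-value"           # hypothetical 16-byte MAC
WRONG_FIRST = b"X3cr3t-mac-value"      # differs in the first byte
WRONG_LAST = b"s3cr3t-mac-valuX"       # differs only in the last byte

print(leaky_equals(SECRET, WRONG_FIRST))
print(leaky_equals(SECRET, WRONG_LAST))
print(constant_time_equals(SECRET, WRONG_LAST), constant_time_equals(SECRET, SECRET))
```

    An attacker who can observe the amount of work done can confirm a guess one byte at a time instead of guessing the whole value at once.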

    I'm not arguing that the optimized code produces the wrong results, just that because of the optimization it may become vulnerable to a timing attack.
    If the speedup has such variance, it might leak side info, opening it up for a timing attack, right?

    koitsu is a very capable guy. No argument here. But is he an expert on the subject matter? Are you, koitsu? And what about the person(s) responsible for the optimization? Are they just extremely familiar with the MIPS architecture, or are they also experts on the subject matter?

    That will make it easier to conclude that it is correct, not that it is safe.
     
  22. PBandJ

    PBandJ Networkin' Nut Member

    I've looked a bit at the optimization. It looks like some kernel was replaced with hand-optimized assembly. Something to do with faster reads from memory when the data isn't aligned (I wonder if that can't be achieved by using pragmas).
    The reason for the different performance gains is, probably, due to use of different block sizes (relative size of prolog/epilog parts compared to kernel?)
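
    A back-of-the-envelope model of that guess, in Python. The cycle counts are invented purely to show the shape of the effect and are not measurements of the patch: a fixed per-call cost (prolog/epilog) is shared by both versions, and only the per-byte kernel cost is halved.

```python
def cycles(block_size: int, per_call_overhead: float, cycles_per_byte: float) -> float:
    """Total cost of hashing one block: fixed overhead plus per-byte kernel work."""
    return per_call_overhead + block_size * cycles_per_byte


def speedup(block_size: int, overhead: float, old_cpb: float, new_cpb: float) -> float:
    """Ratio of old cost to new cost for a given block size."""
    return cycles(block_size, overhead, old_cpb) / cycles(block_size, overhead, new_cpb)


# Hypothetical numbers, chosen only for illustration.
OVERHEAD = 1000.0   # cycles per call, identical before and after the patch
OLD_CPB = 20.0      # cycles per byte in the C kernel
NEW_CPB = 10.0      # cycles per byte in the assembly kernel

for size in (16, 64, 256, 1024, 8192):
    gain = (speedup(size, OVERHEAD, OLD_CPB, NEW_CPB) - 1) * 100
    print(f"{size:5d} bytes: {gain:5.1f}% faster")
```

    Under these assumptions the gain is small for short blocks, where the fixed overhead dominates, and approaches the ratio of the two kernel costs for long blocks.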

    It seems good, you have my blessing (fwiw).
     
  23. koitsu

    koitsu Network Guru Member

    Regarding my capabilities: I don't do MIPS. The few platforms I still do assembly on are 65xxx (yep that "ancient processor"), PIC16, and 80386/80486 (as in "ancient x86" -- I have no familiarity with things like MMX, SSE, etc.). But all that's neither here nor there -- bottom line is that I don't do MIPS, so I can't review the code. Just as important, I have very little experience with the OpenSSL code (aside from some of the API functions and my experience there is extremely limited -- I just remember hating how often they'd break the API between minor versions, grr), and I have little-to-no experience with cryptography or cryptographic systems (specifically how to design them or the nuances surrounding them). I *have* written code that uses OpenSSL (I was one of the few people who worked on the first-generation SSL code in ircd-hybrid), but that's not the same thing as understanding cryptography.

    The bottom line is that the OpenSSL folks should review the code -- and ryzhov_al already said in this post (see last line) that they are.

    Folks who are concerned about the implications of using native assembly code instead of C code for some of the OpenSSL functions (externally exposed as API functions or internal functions, doesn't matter) should probably voice their concerns on the openssl-dev mailing list.

    But as I said in this post (see last paragraph), I'm of the opinion we should just wait for the OpenSSL folks to sign off on this. It may take months, or even a year.
     
  24. shibby20

    shibby20 Network Guru Member

    openssl updated to 1.0.1c

    WL500gp v1 before:
    after (with patch):
    will be included in my v105 release. Best Regards.
     
  25. ryzhov_al

    ryzhov_al Networkin' Nut Member

  26. joew1

    joew1 Serious Server Member

    shibby20, do you know what happened with the forum in the past several days? It worked for me until the morning of the 14th (yesterday), then shortly after I found out that my ID had completely disappeared along with the post I started. There must have been a crash or something. Please let me know, and if you know what the recovery plan is, I would appreciate that as well.

    joew1 (aka SteveF, my old and disappeared ID)
     
  27. kthaddock

    kthaddock Network Guru Member

    http://www.linksysinfo.org/index.php?threads/to-message-bords-mods.64783/#post-217920
     
  28. M_ars

    M_ars LI Guru Member

    shibby, what command gives those openssl speed outputs?
     
  29. koitsu

    koitsu Network Guru Member

  30. shibby20

    shibby20 Network Guru Member

    openssl speed aes
    openssl speed sha1

    But the speed command is not compiled into current Tomato. I will enable this feature in the next release.
     
  31. rafwes

    rafwes Serious Server Member

    shibby, any chance of seeing a tinc binary bundled in the next releases that could also profit from these improvements?
     
  32. shibby20

    shibby20 Network Guru Member

    @rafwes - another VPN? What for? In Tomato you have OpenVPN and PPTP, and it's ready to make an IPsec VPN tunnel. My Optware package repo also has the new n2n VPN tunnel daemon.

    Another test on Netgear WNR3500L v2
    :)
     
  33. ryzhov_al

    ryzhov_al Networkin' Nut Member

    BTW, tinc with asm-optimized OpenSSL is in the Entware repo.
     
  34. rafwes

    rafwes Serious Server Member

    @shibby: tinc is not just another point-to-point VPN. It creates a VPN mesh between clients without the need to set up tunnels/routes for each and every connection. It is not only a huge improvement for secure peer communications, but also offers great NAT traversal mechanisms, allowing connections between clients behind NAT, which no other VPN can do.

    @ryzhov_al: Since most of my devices lack USB ports, this is not an option actually.
     
  35. shibby20

    shibby20 Network Guru Member

    and finally RT-N66U.

    new v105 soon.
     
  36. apnar

    apnar Network Guru Member

    Another big tinc fan here. Looking forward to the speed improvements with OpenSSL patch here.

    @shibby: In my custom version of openssl I'd added both speed & version. I don't know if 'version' has been added since but if not it's a very small but useful add if you're going to go in and add 'speed' anyway. Also, I think I had to remove the "no-engine no-engines" from the openssl make parameters to allow tinc to use it dynamically.

    I'd love to see tinc added, even if there isn't a GUI for it anytime soon. It's a great VPN technology and fairly different from anything included now. Not to mention it'd save me from having to keep my own modified source ;)

    Edit: I'll note that tinc is very small, only 140k when compiled dynamically:

    Code:
    root@router_home:/tmp/home/root# du -h /usr/sbin/tincd
    140.0K    /usr/sbin/tincd
    
    The only dependencies are openssl, lzo, and zlib.
     
  37. shibby20

    shibby20 Network Guru Member

    OK, I will look. Now you can download v105 :D
     
