NetKernel News Volume 3 Issue 28
June 22nd 2012
What's new this week?
Catch up on last week's news here
Repository Updates
The following update is available in the NKEE and NKSE repositories...
- kernel-1.25.1
- Fixed a potential race condition with very heavily loaded asynchronous requests to a non-thread-safe endpoint (see below)
Kernel Update
While testing a new technology this week (which we'll be revealing soon), we discovered an extremely marginal potential concurrency race condition in the kernel.
The situation could arise with a very heavy load to a non-thread-safe endpoint, but only if the endpoint was called synchronously from a parent endpoint which itself had been called asynchronously. So, not only was it a very rare condition, it also depended on a pretty unusual pattern.
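For anyone trying to picture that call shape, here is a rough plain-Java analogy. It is not NetKernel code and does not reproduce the bug: the class and method names are made up for illustration, and the ReentrantLock is only a stand-in for whatever the kernel does to keep a non-thread-safe endpoint single-threaded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

class AsyncParentSyncChild {

    // Hypothetical non-thread-safe endpoint: a plain counter with no atomics.
    static final class NonThreadSafeEndpoint {
        private int handled = 0;

        int handle() {
            return ++handled;
        }
    }

    // The parent makes a synchronous call into the endpoint.
    // The lock (an assumption of this sketch) serializes access to it.
    static int parent(NonThreadSafeEndpoint endpoint, ReentrantLock gate) {
        gate.lock();
        try {
            return endpoint.handle();
        } finally {
            gate.unlock();
        }
    }

    // Issue every parent call asynchronously, then wait for all responses.
    public static int run(int requests, int threads) {
        NonThreadSafeEndpoint endpoint = new NonThreadSafeEndpoint();
        ReentrantLock gate = new ReentrantLock();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Integer>> pending = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                pending.add(CompletableFuture.supplyAsync(() -> parent(endpoint, gate), pool));
            }
            int highest = 0;
            for (CompletableFuture<Integer> response : pending) {
                highest = Math.max(highest, response.join());
            }
            // Equals the request count only if no update to the counter was lost.
            return highest;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(100_000, 8));
    }
}
```

The point of the sketch is only the shape: asynchronous on the outside, synchronous on the inside, with a single-threaded endpoint at the bottom.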
However, this particular pattern was present in our testing configuration, and once in every million or so requests we'd see an NPE thrown on a response to a request. To compound the rarity factor, it did not appear at all when we ran the tests on a regular developer machine, but only on a test machine with a large number of cores.
I won't bore you with the details, but finding a bug like this is non-trivial. It took the best part of three days of investigation to find the exact condition and trace through the parallel executions to see how one sub-request's thread could manage to trip up the initial asynchronous requesting thread.
Ordinarily the quick fix for such asynchronous contention is to use synchronization, either of methods or objects. However, we're pretty obsessive about performance and the kernel is designed to have zero synchronization in the core scheduler and request state machine. There were any number of workarounds that would eliminate the potential problem - indeed we fixed the bug within an hour - but we were not satisfied by this and weren't prepared to compromise on the ideal.
Ultimately, after capturing and inspecting the state transitions of the two threads when the collision happened (which was a bit like going fishing with no bait on the line), we eventually identified that the release from one state change needed a flag to subordinate the initial request thread while the sub-request entered the locked endpoint once it was freed. A one-liner solution.
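To give a flavour of what a state change without synchronization can look like in general, here is a minimal plain-Java sketch using a compare-and-set on an atomic reference. This is an assumption-laden illustration, not the kernel's state machine and not the actual fix: the EndpointGate class, its two states and the raceToEnter helper are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

class EndpointGate {

    enum State { FREE, BUSY }

    private final AtomicReference<State> state = new AtomicReference<>(State.FREE);

    // Atomically claim the endpoint; only one caller can move FREE -> BUSY.
    public boolean tryEnter() {
        return state.compareAndSet(State.FREE, State.BUSY);
    }

    // Atomically release the endpoint; returns false if it was not held.
    public boolean release() {
        return state.compareAndSet(State.BUSY, State.FREE);
    }

    // Start several threads together and count how many manage to enter.
    public static int raceToEnter(int threads) {
        EndpointGate gate = new EndpointGate();
        AtomicInteger winners = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        List<Thread> racers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread racer = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (gate.tryEnter()) {
                    winners.incrementAndGet();
                }
            });
            racers.add(racer);
            racer.start();
        }
        start.countDown();
        try {
            for (Thread racer : racers) {
                racer.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for racers", e);
        }
        return winners.get();
    }

    public static void main(String[] args) {
        System.out.println(raceToEnter(16)); // prints 1: nobody releases, so only one thread gets in
    }
}
```

No thread ever blocks on a monitor here; a loser simply sees the compare-and-set fail and has to decide what to do next, which is exactly where the subtle ordering problems tend to hide.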
When you've been working up in the ROC-domain with logical requests, not physical threads, for as long as we have, you take things for granted. You forget that asynchronous code is hard. Really hard. I can't even imagine trying to get the most out of multi-core servers by writing to the metal. And it's not just about the language features - the problems don't go away in functional languages.
ROC: step away from the metal, step away from the APIs... run faster, more safely
XUnit Tip
One thing that's always satisfying is when a change that you make to round out features, such as last week's update to the declarative request to support request headers in the same way as NKF, yields an unforeseen dividend...
It turned out that we had to create an extremely heavy, massively parallel and long-running test process to be able to trap the kernel bug. But such a huge process came at a cost.
The unit-test engine starts an initial request that fans out the sub-requests. The problem was that by default NK tracks dependencies and holds onto this state. State means memory - so we got into an unfortunate situation where, when testing on a 256MB JVM, we'd run out of heap by holding this useless (in the case of the testing) dependency state.
The answer of course is to tell NK to forget dependencies, which in XUnit you can do, thanks to the update to declarative requests, just by setting the forget-dependencies header like this...
<test>
  <request>
    <identifier>active:asyncBusyThrash</identifier>
    <header name="forget-dependencies">true</header>
  </request>
</test>
So if you have tests that need to run in a tight memory footprint, and where the dependency model is irrelevant, then set this header.
Push Twitter Notification Solution
After discovering the joys of push-email - Mr Butterfield took it upon himself to create a push-twitter notification service. In an event almost as rare as the kernel bug, Tony has blogged the details...
http://durablescope.blogspot.co.uk/2012/06/scratching-itch.html
Meanwhile, knowing that every Friday I play the last-minute game of Russian roulette and have no idea what I'm going to put in the newsletter, but also knowing that Tony had provided me with a free ride this week by delivering some content, in another blatant attempt to steal my thunder, Mr Geudens also provides his commentary on Tony's blog here...
http://practical-netkernel.blogspot.be/2012/06/great-minds-do-not-think-alike.html
...and so, with this cunning bit of contextual scheduling I am relieved of writing anything more this week and we narrowly avoided a state contention overlap.
Have a great weekend.
Comments
Please feel free to comment on the NetKernel Forum
Follow on Twitter:
@pjr1060 for day-to-day NK/ROC updates
@netkernel for announcements
@tab1060 for the hard-core stuff
To subscribe for news and alerts
Join the NetKernel Portal to get news, announcements and extra features.