NetKernel News Volume 3 Issue 7

December 9th 2011

What's new this week?

Repository Updates
Tom's Book - Tom's Blog
Lessons From the Cloud: Check Thread Limits

Catch up on last week's news here

Repository Updates

The following NetKernel 5.1.1 updates are available in the NKSE and NKEE repositories...

layer0 1.77.1
- added extra methods to NKFException to allow reading of additional fields
- fix to HDSXPath to make // search from current node not root

Tom's Book - Tom's Blog

Here's the latest news from Tom. More progress on the book. Plus Tom is now writing a weekly "practical tip" on his blog - this week he offers tips on how a colonic irrigation will impress your boss...

Good progress on the book. Appendixes C and D are getting some content. The latest build is available from:

http://www.netkernelbook.org/serving/pdf/practical_netkernel_nk5.pdf.

As always, your feedback and input are highly appreciated. You can send them to tom(dot)geudens(at)hush(dot)ai.

You can find this week's blog entry at:

http://practical-netkernel.blogspot.com/2011/12/eighty-eighty.html

The focus is on the why of colon eighty eighty (and how to change it).

Lessons From the Cloud: Check Thread Limits

Earlier this week we received a request for assistance from Sven Wallage of Edge Technologies BV.

Edge provide hard-core, highly-scaled, ultra-reliable telecoms solutions. Their solutions underpin a large majority of the Dutch telecoms market with high-profile customers like KPN and Tele2.

Edge has built their products on NetKernel for over six years, so we take it very seriously when we get a request for help...

The Service

First some background. One of Edge's products is a service to manage, monitor and control DSL modems for point-of-sale terminals. These DSL modems are mission critical components in the real-world bricks and mortar commerce of thousands of retail stores. So no pressure then...

One of the interesting properties of a large distributed system like this is that it can potentially start to demonstrate flocking behaviour. If, for any reason, the cloud-based phone-home service slows or goes away, the modems back off, but then retry. If the service goes down for maintenance or an update, the modems can come home to roost en masse, leading to a post-boot transient load profile.

A month or so ago, we offered some tuning advice on how to configure their transports, throttles and threads to provide a profile-shaping transformation function: smoothing bursts into progressive and steady request rates to the ROC domain.

One of the powerful features of ROC is that non-functional operational characteristics are also directly controllable - you can model and implement load shaping anywhere within an ROC solution. ROC can do this since, within the ROC domain, requests are not tied to physical threads - they are "logical" and so can be queued, managed and buffered to impedance match the load shape to the capabilities of the underlying physical capacity.
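As a rough illustration of the load-shaping idea, here is a minimal sketch in plain Java. It is not NetKernel's throttle and says nothing about how Edge configured theirs; the class name BurstSmoother and the numbers are assumptions made up for the example. It simply shows a burst of logical requests being buffered in a queue and drained by a small, fixed number of physical threads.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only: a burst of "logical" requests is queued
// and executed by a fixed number of physical worker threads.
class BurstSmoother
{
    // Pushes `requests` tasks through `workers` threads and returns the
    // highest number of tasks that were ever running at the same time.
    static int peakConcurrency(int requests, int workers) throws InterruptedException
    {
        // Core size == max size, so the pool never grows beyond `workers`.
        // The queue is sized to hold the whole burst, so nothing is rejected.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            workers, workers, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(requests));

        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        for (int i = 0; i < requests; i++)
        {   pool.execute(() ->
            {   int now = active.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try
                {   Thread.sleep(5);    // stand-in for real request processing
                }
                catch (InterruptedException e)
                {   Thread.currentThread().interrupt();
                }
                active.decrementAndGet();
            });
        }

        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return peak.get();
    }

    public static void main(String[] args) throws Exception
    {
        // A burst of 200 requests is smoothed onto at most 4 threads.
        System.out.println("Peak concurrency: " + peakConcurrency(200, 4));
    }
}
```

However large the burst, the number of threads doing work at any instant never exceeds the worker count; the queue absorbs the rest.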

One further item of note that will become relevant is the "cloud factor" - the "phone-home" service is deployed on AWS cloud servers. More on which later...

...So, back to the story, the balanced front-facing architecture went into production and all seemed well...

Until earlier this week. The production machines would start to fail in "weird" ways that looked like physical resource starvation. Which is when we got the shout...

When is a disk full but 50% empty?

Working with Edge we ssh'd in to one of the failing nodes. After some diagnostics and sleuthing, it became apparent that NetKernel was getting a Java exception whenever it tried to create a file.

But looking at the Ubuntu virtual server's disks it was reporting 49% free on a 15G partition - plenty of space?

When you don't know what's happening it's a good idea to isolate the variables. This "disk" was an AWS block storage device - we had absolutely no idea how it worked or how it was implemented. But it was very worrying indeed that it reported plenty of space but nothing could be written.

Sven decided to do a clean build: he took an off-the-peg Redhat distro (blessed by AWS), installed the NK solution and fired it up. He also created a new clean block partition with 30G of capacity. The system went into production and all looked fine.

One day later Sven was back in touch. The new disk was doing the same thing!

More sleuthing and Sven discovered the cause. The disk had lots of capacity, but the way that the Edge solution works is to write small files for each modem, fairly frequently: reporting, logs, change and update sets to be pushed out, etc. The reason nothing could be written to the disk was that it had run out of inodes. Lots of space, but too many files. Sven sorted this out with some remedial clean-up and some tuning tweaks.
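On Linux the quickest check for this condition is "df -i", which reports inode usage rather than byte usage. For completeness, here is a small Java sketch of the same kind of sanity check. It is an assumption-laden illustration, not something Edge or NetKernel ships: the class name SmallFileCensus is made up, and because the standard Java library does not expose inode counts, it counts regular files under a directory and prints that next to the usable space.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical diagnostic: lots of free bytes and a huge file count
// together hint that inodes, not space, may be the scarce resource.
class SmallFileCensus
{
    // Counts regular files (not directories) beneath `root`, recursively.
    static long countRegularFiles(Path root) throws IOException
    {
        try (Stream<Path> paths = Files.walk(root))
        {   return paths.filter(Files::isRegularFile).count();
        }
    }

    public static void main(String[] args) throws IOException
    {
        // Directory to inspect; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        FileStore store = Files.getFileStore(root);
        long freeMb = store.getUsableSpace() / (1024 * 1024);
        System.out.println(root.toAbsolutePath() + ": "
            + countRegularFiles(root) + " regular files, "
            + freeMb + "MB usable space");
    }
}
```

Note that walking a very large tree is itself slow, and unreadable directories will make the walk fail, so treat this as a one-off probe rather than something to run continuously.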

We thought we'd nailed it this time, the updates went into production, all was looking good...

When is an Out of Memory Exception not an Out of Memory Exception?

Yesterday Sven was back again, "File writing is good but we're now seeing this exception..."

WARNING: EXCEPTION
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:614)

This was bizarre. The same system had been running with the same memory capacity and had shown no memory related issues.

Even stranger, Sven had taken a precaution to boost the base size of the AWS instances to give them 8GB memory. He'd then also given the NK instances a 4GB Java heap. So now each system actually had more headroom than before. How on earth could the memory be running out?

By this stage, we were beginning to feel like the Dutch boy with his finger in the dyke.

We dug around all over NetKernel: its memory charts and the detailed system report. We took a live heap dump using our handy heap dump script. We analysed the heap dump with the Eclipse retrospective heap analyser. There were no memory leaks. Everything looked fine. In fact NetKernel was using just 125MB of the 4GB of heap it had available.

So now we started to think laterally. The memory exception was always appearing on Thread construction.

After some discussion we understood more about the architecture of the service - Edge have a custom transport that allows the modems to call in and establish a push-based protocol to the modem. This caused quite a large number of threads to be spun up. But the threads are mostly idle.

With this insight we were able to determine that there was no reason that this was particularly related to NetKernel. It was starting to look like Java or the Operating System.

To test this hypothesis we used this simple Java thread tester...

public class ThreadTest
{
    // Number of threads successfully started so far
    private static int COUNT = 0;

    public static void main(String[] aArguments) throws Exception
    {
        try
        {   // keep spawning new threads until the JVM or OS refuses
            while (true)
            {   new TestThread().start();
                COUNT++;    // count only threads that actually started
            }
        }
        catch (OutOfMemoryError e)
        {   System.out.println("Created "+COUNT+" threads before memory/OS limit reached");
            System.exit(-1);
        }
    }

    static class TestThread extends Thread
    {
        public void run()
        {
            try
            {   // Do nothing, just sleep so the thread stays alive
                sleep(1000000000);
            }
            catch (InterruptedException e)
            {   /* Do nothing */
            }
        }
    }
}

You run this in a standalone JVM with the command "java ThreadTest" and it spins up threads until it runs out of memory.

On both my development box and Sven's we comfortably managed 32000+ threads. But running it on the production cloud server it reported just over 1000 threads. That's no way to run a server! We had established that it was definitely the underlying platform.

We spent a lot of fruitless time trying to understand how much native memory each thread uses for its stack and trying to tune it with the -Xss switch. Nothing we did made any difference to the ThreadTest.

After much digging around we finally discovered the salient fact. Redhat sets a default maximum number of user processes of 1024. On Linux each Java thread is a native thread, and the kernel counts it against that per-user process limit. Java was being stopped by the OS from creating a new thread.

Yes, did you get that? Java reports the operating system's refusal to create a thread as an OutOfMemoryError!!!... Speechless...
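Since the error type alone is misleading, it can help to look at the message and at the JVM's own numbers before assuming the heap is to blame. The following is a minimal sketch of that idea, not a NetKernel facility: the class name OomDiagnosis and its two categories are invented for the example, and the only string it relies on is the message text seen in the stack trace above.

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper: tells a thread-creation failure apart from other
// OutOfMemoryErrors and prints heap and thread figures for context.
class OomDiagnosis
{
    enum Cause { NATIVE_THREAD_LIMIT, HEAP_OR_OTHER }

    // Classifies the error using the message text from the stack trace.
    static Cause classify(OutOfMemoryError e)
    {
        String msg = e.getMessage();
        return (msg != null && msg.contains("unable to create new native thread"))
            ? Cause.NATIVE_THREAD_LIMIT
            : Cause.HEAP_OR_OTHER;
    }

    // One-line summary of heap use and live thread count in this JVM.
    static String snapshot()
    {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        return "heap " + usedMb + "MB of " + maxMb + "MB, live threads " + threads;
    }

    public static void main(String[] args)
    {
        OutOfMemoryError sample = new OutOfMemoryError("unable to create new native thread");
        System.out.println(classify(sample) + " - " + snapshot());
    }
}
```

If the classification is NATIVE_THREAD_LIMIT while the heap is mostly empty and the live thread count sits near a suspiciously round number, the operating system's limits are a better place to look than the heap.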

Removing Redhat's Process/Thread Limit

After this it was relatively plain sailing. If you run Redhat (note we never saw this before because Ubuntu server doesn't set a limit on user threads - it's a server, of course it needs threads!), you need to edit /etc/security/limits.conf and set the nproc parameter to be a large number. You may also need to edit /etc/security/limits.d/90-nproc.conf and comment out the line related to nproc.
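As a sketch of what those edits might look like: the user name "netkernel" and the value 32768 below are purely illustrative assumptions, so substitute the account that actually runs your JVM and a limit that suits your workload. The exact contents of 90-nproc.conf may also differ between releases.

```
# /etc/security/limits.conf
# <domain>    <type>  <item>  <value>
netkernel     soft    nproc   32768
netkernel     hard    nproc   32768

# /etc/security/limits.d/90-nproc.conf
# Comment out the default per-user cap, for example:
# *           soft    nproc   1024
```

The new limits only apply to sessions started after the change, so log in again and restart the JVM. Running "ulimit -u" in the shell that launches Java shows the limit actually in force.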

Moral

The moral of this story? The cloud is neat. You can take an off-the-shelf operating system and spin it up in seconds. But just like an off-the-shelf suit - you may end up regretting that the jacket sleeves are too short.

Take care. Question everything. Isolate the variables. Test the OS/Java stack combo. Don't take Java exceptions at face value.

Have a great weekend,

Comments

Please feel free to comment on the NetKernel Forum

Follow on Twitter:

@pjr1060 for day-to-day NK/ROC updates
@netkernel for announcements
@tab1060 for the hard-core stuff


NetKernel, ROC, Resource Oriented Computing are registered trademarks of 1060 Research