|
NetKernel News Volume 3 Issue 13
February 10th 2012
What's new this week?
Catch up on last week's news here
Repository Updates
No updates this week. But there's a new XRL to try (see below).
Service Notice
Our data center's routers will be undergoing a routine software update tomorrow (11th February) between 07:00 UTC and 11:00 UTC. This will result in an estimated 5-10 minute period of loss of service.
Tom Latest
Here's the latest from Tom. Please also don't forget that it would be really great if you can lend Tom some time to read/review the current draft of the Practical NetKernel Book.
TRL went async last week. Did you try it already? No? Well, there are no excuses left now : http://practical-netkernel.blogspot.com/2012/02/trl.html
As always your feedback, suggestions and entries for the challenges are welcomed at practical<dot>netkernel<at>gmail<dot>com.
*NEW* active:xrl2 goes async
Not to be outdone by the asynchronous tricks of the young pretender (trl), the active:xrl2 runtime has been rewritten.
It now supports a <xrl:async>true</xrl:async> flag - which causes the embedded request to be issued asynchronously. This feature means you can asynchronously fan-out any long lived requests while the main parts of the template carry on being built. Here's the relevant documentation...
Asynchronous behaviour
By default XRL operates completely synchronously. Includes and Evals are processed in document order one at a time. However Includes and Evals can be specified to be asynchronous, resolves are always synchronous but don't involve issuing subrequests.
All asynchronous requests within a document will be performed in parallel. However if an include fetches further nested asynchronous includes then each nested includes includes will be performed as a separate asynchronous group.
Regular includes and evals are made asynchronous by specifying the <xrl:async> tag. The value of the tag must be "true" for the request to performed asynchronously:
<xrl:identifier>meta:myServiceId</xrl:identifier>
<xrl:async>true</xrl:async>
</xrl:include>
Compact notation includes and evals require the use of an async attribute:
<xrl:include identifier="meta:slowInclude" async="true" />
<div xrl:eval="text" xrl:async="true">meta:slowEval</div>
</div>
Release Plan
XRL is a critical component since, as a bare minimum, all of the backend adminstration tools are composed with XRL. The changes have been rigourously tested and we're running it on our development and production servers (this wiki, the portal, our main web sites are using it) but, to err on the side of caution, we'll have a short period of beta-testing. So please give it a go - you can get the module here...
urn.org.netkernel.lang.xrl-1.9.14.jar
Install it by simply adding an entry to your etc/modules.xml (commenting out the old version).
Please let us know how you get on. Once we're satisfied that this has been kicked enough we'll release to the repos.
ROC and Data Structures
When I first started to think seriously about software, I was naive and would ask innocent questions of my more experienced colleagues. Question's like: "What's the best general purpose data structure to do such and such?"
I'd come from the world of Physics and my expectation was that there would be a body of evidence with world-relevant selection criteria based upon rigorous engineering evaluations. Like, for example, when you are required to choose a type of steel cable for a suspension bridge, or an op-amp for a circuit, or even just surface mount resistors - you have a body of principles, guidelines and best practice and this eliminates the need for trial and error. I was expecting something similar for data structures.
I was therefore somewhat surprised to discover that software seemed to be much more free and easy. It seemed like "you make it up as you go along" was an acceptable answer. I never could reconcile this and, for the most part over the years, I have bitten my tongue and avoided conflict.
Well not today. Today I'm coming off the fence.
Objects Objects Everywhere
Coleridge would have put it better but here's a verse from "Rime of the Ancient Coder" (which I did just make up)...
Data, Data every where, Miss it if you blink, Objects, Objects every where, Nor seldom stop to think.
This poor verse is an attempt to lightheartedly consider the reason for the standard practice I observed.
When the arrays and structs of C became encapsulated as the internal members within objects we were "liberated".
It no longer mattered what structure the data had - it was inside the object and, if you were obeying OO convention, the object's methods were how we interacted with it.
Of course at some point we are always required to touch the structure - in the return types of the methods. But the temptation (nay the express goal) of object oriented design is that a method itself returns another object. As that nice Prof. Feynman was told in his public lecture by the old lady: "Young man, surely its turtles [object's] all the way down".
It's easy to understand that when your language is object oriented and your paradigm is to create classes of objects within that language, then its easy to see why the structure in which you place your state is no longer regarded as a first order design criterion.
But in ROC we are forced to confront headlong the structure of data. Since in ROC we have taken a radical step. We have broken free of the "language-plane" and determined that transfer of state* is the overriding design consideration of a system. Therefore our choice of state bearing "representation-structure" becomes rather important.
*Actually minimisation both of computation and of state-transfer is our real systemmic goal. Since these cost energy. So lets state Rodgers' Razor: When two information systems, given the same inputs produce the same outputs, the one that requires less energy is better. This is followed by the conjecture: It is also the one that will take less energy (time, effort) to evolve and adapt to a change in the information system.
The Tyranny of the Bean
In the object oriented paradigm it seems like such a natural thing to do... you put your state inside a value object and you provide getters and setters to interact with that state. That was how it was. That was the natural thing to do... Until it became a problem. A big ugly problem...
Once upon a time when threads existed but processors were single cored - we had the illusion of parallel processing - but our systems actually ran step by step. So you could share objects and know that you could always synchronize on the object - knowing that the overall system would not be materially affected by single-track access to it.
But today the myth of parallelism is no longer a myth. Its real. To make full use of the cores in your CPU you must aim never to have to synchronize code. Synchronization is the death of scaling.
But all is not lost. To eliminate synchronization we just need to treat state, and state bearing objects, as immutable. If we share state and no-one can change it, then we don't have to synchronize when using it.
So now you see the problem: Getters good. Setters bad.
OK, go and look at a class library. It doesn't matter which. I bet its littered with mutable objects. Just look at any of the core java.* APIs. The overwhelming convention of the last 20-years has been that in designing objects they will have "state reading methods", but also a liberal mixture of state mutating methods and sometimes (often!) you even find read methods that cause state changes as a side effect just for good measure! (I'm pointing no fingers - but I've sure kissed a lot of toads).
Starting from Here
Of course we rarely start from a clean slate. We will generally have existing code and hence existing libraries of objects. And of course nobody is saying you can't use objects nor that any system should not be able to incorporate existing investments. But clearly we need a few guidelines that allow legacy objects to "behave nicely" in the highly concurrent, stateless world of the resource oriented domain.
The Rule of Immutability
The most basic rule of thumb is that when writing any endpoint you must never make calls on mutating methods of any representation that you may be processing. Every representation is highly likely to be shared. You have absolutely no right to modify its state in the physical domain of code. State changes can only be allowed to happen "in-band" via requests in the logical ROC domain.
Here's how to think of it. When an endpoint is invoked it does not receive any transferred state (it is something of a fundamental misconception in REST that "state is transferred" - it is not, in a Resource Oriented system it is context which is transferred). Rather, an endpoint is provided with a context in which the state it is informed about becomes accessible.
Practically speaking, it is wise to envisage that all state is static and it sits in the central state space (aka the cache**) and that your execution is just a context in which you have been given an object reference to that state. So now you see why it would be unforgiveably rude to mutate the state.
**I dislike the word cache. Its such a dumb and overloaded word. What NetKernel really has is a common state space. This state is the realization of representations of abstract resources. So my preferred term for this state space would be "The Realization". It constitutes the measured approximation of reality in which the system is currently executing. "The Realization" is managed in such a way that the state it contains is qualitatively the best ongoing approximation to the real world that it can maintain - which is what I mean when I constantly bang on about thermodynamics and minimising energy and entropy.
Pseudo-Immutable Interfaces
The rule of immutability is hard to police and, given the twenty years of free-living mutable OO-modelling we've enjoyed, hard to break the habit of. So there is a simple way to permit the use of legacy object models in the ROC domain. Add a level of indirection.
Provide a wrapper class which first and foremost implements an immutable interface. (Due to the pragmatics of existing object models, you will also probably inevitably have to provide a separate mutable interface, exposing the dangerous mutating methods. But by keeping the two separate you ensure that a developer is required to make a deliberate decision to move from immutable to mutable access).
And now you have a much simpler way of writing your endpoints. Instead of asking for the raw object type. You always ask for the immutable interface. (You can always provide transreptors that will automatically get you from raw to immutable forms*).
*Of course this is not truly immutable, its a form of pseudo-immutability and its just a way to keep everyone honest. Kind of like agreeing to drive on the left (or right, if you're one of those funny lot who drive on the wrong side).
Example - IXDAReadOnly
Here's a concrete example. A long time ago, in what now feels like another life, we realised that org.w3c.dom.Document was an absolutely terrible representation. It has the classic mix of mutating and readable methods. It requires state management to navigate its structure - even when the structure can be expressed in vector form (XPath). It has even been known to have non-thread-safe read methods etc etc.
So we wrote a wrapper object model to make DOM more suitable for ROC. Its called IXDA and can be found in the xml-core module - its a semi-declarative approach to API design (more on which later).
The most important observation is that it separates access into two interfaces. There is a primary pseudo-immutable interface called IXDAReadOnly - its methods look like this...
package org.netkernel.xml.xda; import java.io.*; public interface IXDAReadOnly { String eval(String aTargetXPath) throws XPathLocationException; boolean isTrue(String aTargetXPath) throws XPathLocationException; String getText(String aTargetXPath, boolean aTrim) throws XPathLocationException; IXDAReadOnlyIterator readOnlyIterator(String aTargetXPath) throws XPathLocationException; void serialize(Writer aWriter, String aTargetXPath, boolean indent) throws XPathLocationException, IOException; void serialize(Writer aWriter, boolean indent) throws IOException; IXDA getClone(); }
And, for pragmatic reasons there is a mutable interface that looks something like this...
package org.netkernel.xml.xda; public interface IXDA extends IXDAReadOnly { void append(IXDAReadOnly aSource, String aSourceXPath, String aTargetXPath) throws XPathLocationException, XDOIncompatibilityException; void insertBefore(IXDAReadOnly aSource, String aSourceXPath, String aTargetXPath) throws XPathLocationException, XDOIncompatibilityException; void insertAfter(IXDAReadOnly aSource, String aSourceXPath, String aTargetXPath) throws XPathLocationException, XDOIncompatibilityException; void applyNS(String aTargetXPath, String prefix, String uri) throws XPathLocationException, XDOIncompatibilityException; void removeNS(String aTargetXPath, String prefix) throws XPathLocationException, XDOIncompatibilityException; //etc etc etc }
Any endpoint that wants a quick and easy semi-declarative way to interact with XML resources can request an IXDAReadOnly representation. They are then able to safely access the state. If they subsequently need to construct a new XML data representation they have the ability to immutably clone the original and perform modifications to the clone. Or indeed create a new empty instance and immutably "pour in" parts of the original.
This is rather an extreme example - a complete abstracted model for interacting with and manipulating XML resources. But the principles will hold for your own domain specific object models. Provide a pseudo-immutable interface. Write endpoints to always request the immutable interface. Provide transreptors if necessary.
Sometimes (honestly, quite often) you will have to provide a wormhole in your wrapper to get out the underlying POJO. When you do this it is always a good idea to make the developer sign in blood that they have taken on the responsibility for not mutating the state. We do this by the simple expediency of ending our method names with "ReadOnly". So you would provide an interface like this...
CustomerRecord getCustomerRecordReadOnly();
Meaning: ok, you can have the underlying CustomerRecord - but you absolutely promise not to change its state. This is READ ONLY!
In summary, follow these basic guidelines and you'll suddenly find you don't need synchronization. Your solution will scale. And you know what, you won't need to resort to learning that next magic bullet language that everyone is raving about.
Starting from There
So now lets imagine a pristine world with no legacy. We are immediately faced with some requirements. Our data must be in a form which is easily consumable by many possible endpoints. In ROC - computation is kept separate from state. State is transferred to code to create new state (No its not. I told you nothing is transferred but context - but if it helps to think of it that way then go ahead).
We also know that inherently, there are, relatively speaking, small sets of core information. But that most systems involve the composition of that core state (and combination and transformation of those compositions, and of those compositions, and of those compositions.. ad infinitum). It is the nature of the real world that it necessitates lots of compositions (some would say: mash-spit-ups).
We intrinsically know that it is impossible to perfectly model the world - and, it is infinitely "more impossible" to model the future state of the world. (Think of the map maker who set out to create the perfect map. It would be 1:1 scale! He ran out of paper and got into an infinite recursion trying to put himself on the map).
Because we cannot model the world with perfect fidelity we have to anticipate compromise. We have to expect that our representations won't always have what we need - but maybe if we can combine and transform ones we do have we can get what we need. In fact the real world is often satisfied by just such emergent combinations.
The ability to discover and evolve emergent state is an innate goal and satisfiable expectation of ROC. (It is, for the record, also why OO systems are brittle and eventually break and have to be thrown away. Inheritance does not a composite make.) It is also the basis by which we allow an information system to provide serendipitous emergent information - stuff no-one anticipated ever thinking would be important. It is new information that leads to new opportunities which leads to new value which ultimately leads to your company being more successful than its competitors. (Every business no matter what it does is an information system).
So, let's take a stab at answering the impossible question: What is the best data structure?
What is the best data structure?
First of all some ground work. In ROC we are first and foremost conceiving of resources as abstract sets of information. Nothing says that those sets cannot be represented in countless concrete representations. I am not saying one form is better than another - there is no such thing - they are the same resource. What we are aiming for is some engineering principles. Some guidelines that when we need to build something we can use and avoid the pit-fall of trial and error.
Our first requirement is that whatever we choose we need it to be composable. That is we need to be able take one or more representation and relatively easily combine them into a new representation. So what we're aiming to get an engineering handle on is, what does "relatively cheaply" mean and how do we know it when we've found it?
What are our choices?
At the highest level, the possible data structures are as follows: atom, list, array, tree, graph (and map - to be discussed later). Each can be used to represent a set.
It is pretty easy to show that any of these data structures can provide the basis for a general purpose computing system.
For example - assembly language is dedicated to computation using atoms of state. (by atom we mean bits, bytes, integers, floats and even strings). But we all know that its pretty tricky to do composition with atoms without inventing higher order data structures. Actually its completely possible - as is shown with the Turing tape. Its just a bit "fiddly".
Equally, we know that the next highest structure, the list, can also provide a perfectly adequate structure. LISP.
If you've done any engineering mathematics you'll probably have used MatLab. This takes multi-dimensional array processing as the basis for an effective computational model.
OK - how do we find Goldilocks? Who wants to try some porridge...
Atom
Too small. Too fiddly. Move on...
Wait for it! How will we know when we've found porridge that's just right? First we'd better think about what our taste in porridge is...
Our Taste in Porridge
Taste is subjective. What follows is led from experience and from hard won knowledge of balancing trade-offs. There are no rules. Anything is possible. This is my subjective perspective...
A function (endpoint) is a mapping from one set to another.
S1 ⇒ S2
Or usually we write
S2 = F(S1)
Now think about an evolving business. It demands that S1 must evolve - new information is added to the set all the time. How do we preserve the fact that in order to tolerate change F must still return S2 (the layer that uses S2 expects that S2 is stable for their purposes) and we really really don't ever want to have to recode F.
So what we desire in our data structure as a set representation is this:
S2 = F(S1')
We desire that S2 is stable under change of S1 without needing to recode F.
Lists do not have this property since F would need to know the original length of the list. What happens if we append at the front rather than the tail. Or in the middle! So to keep S2 constant we are forced to modify F to F' to work the same with the new S1'.
Just as for Lists, so too for Arrays. They have the same problem of insertion and extension as Lists and so F must be changed to preserve stability of S2 under change to S1'.
So jumping ahead, what about graphs? Graphs are very easy to combine - they have the excellent property of being trivial to create set-unions (as Brian Sletten is fond of saying: "they can be slapped together"). Intersections can be computed relatively cheaply too.
But graphs do come with their own overhead.
For a start they do not have an origin. They demand that F must necessarily search to start its computation to find the S2 we desire. This means that while they can tolerate S1 changing to S1' - the day-to-day overhead of working with a graph is somewhat expensive computationally.
Furthermore, since even in a directed graph we can have edges that form loops that can return to the starting point - this means we are required to hold state as we traverse the data structure. Think of it like this - you will waste less time if you roll out a ball of string with you as you explore a maze. The necessity for the string is your "state overhead" for exploring the graph. (In ROC we are trying to minimize state - aka energy - remember Rodgers' Razor?)
But looking at the system in the large as an engineering problem, I would also argue that graphs are intellectually quite challenging for average developers. They're hard to think about and so hard to code against - which makes it error prone. They are intellectually expensive for the users and programmers of a system.
We intuitively know that writing code is easier when iterating over lists. Not walking graphs.
But graphs certainly have a role to play - graphs are powerful and indeed necessary when the outcome of the computation is not rigidly defined. That is, we are desiring to discover based upon a set of values and or relationships (known edges in the graph) the subset of the graph that satisfies some set of properties. This is the raison d'etre of the semantic web movement.
Shake My Tree
So what about the 80-90% case of stable repeatable systems with known resources?
Step forward Goldilocks - the Tree. A tree has the wonderful property that provided S1 evolves to S1' with an enlarging operation (adding branches, leaves) then in fact S1 remains a perfect subset of S1'. That is S1 is undisturbed by growth and evolution. Subsets are immutable under change.
Furthermore, computationally F can be written to use simple linear vector notation in order for a developer to apply computational focus to the state (subset) that they care about. These vectors are called XPaths. In general an XPath that satisfies S1 will also satisfy S1'.
Finally, conceptually XPaths allow coders to move from "tree thinking" to "list iteration thinking" - which is natural to programming languages and is least error prone for developers.
In addition traversing a tree does not require any additional holding of state. The current position in the tree is sufficient state to unambiguously navigate down (and up, via parent relationships). You're never lost in a tree.
It follows that Trees have a very valuable property. They allow S1 to evolve and grow without requiring any change to F. That is
S2 = F(S1') : if S1, S1' are trees
notice we are not saying S2 has to be representated in a tree - a resource can have any representational structure. All I'm saying is that to guarantee S2 is stable without the desire to modify F and without knowing how the world might change, then my experience is that choosing tree structures turns out to be a good engineering decision.
Dimensionality
So the next question is - can you convey the resource information you need in a tree? Well of course that's a stupid question - if an atom works so does a tree. But is it the right sort of thing? Does it feel "roomy" enough?
A much better and precise way of saying this is: Does it have the right dimensions? Too few and you're restricted and processing and evolution are hard. Too many and you're too flexible, growth is easy but processing and human conception becomes hard.
According to the Hausdorff measure of dimension: lists have dimension 1 (too few), graphs have dimension N (which is down to your choice of graph and is often too many). While trees are in the goldilocks zone. They're about 2.x dimensional (cauliflower is 2.3, broccoli 2.6).
Which in short, is me saying: They seem to be roomy enough - but not too roomy. Not too hot, not too cold - just right.
NetKernel's use of Trees
This discussion is just a public exhalation of something we we have known for a very long time. This is why the module.xml is a tree structure. Its not to make it hard to write and ugly (counter to some popular opinion). Its because we cannot anticipate the future. We need a reliable evolvable data structure to represent the spacial architecture of the ROC domain.
We also use tree structure as the representation of choice for configuration state. That is state that is generally repeatedly used by an endpoint rather than which changes each request. Using trees for configuration means that the ROC tool set has remained incredibly stable over 5 generations of physical embodiment. Requests that "ran" on the first prototype NK back in 1999 will still give the same resources today.
The module.xml is a tree structure. The fact it's XML has the elegant benefit that we can allow configuration state to be expressed in-line too. It is reliable, evolvable, stable and tolerates ongoing and progressive change. So configuration state can appear in situ (no matter what new tool we invent) and, increasingly and very importantly for the future, it allows anyone to put their own extensible branches in (called metadata). Metadata driven architecture is already apparent in the RESTOverlay (Which has (apparently!) been regarded, as an impressive feat of engineering - not really, the fact we are using composite trees meant it took just 2 days work - start to finish), and is seen in the metadata driven nCoDE tools... but, we've barely scratched the surface of what's to come.
HDS
Our engineering need for the tree is seen in the only true native resource object model in NK. In layer0 there is the HDS object model. IHDS is immutable out of the box. It has the very nice property that it can be used to represent an atom, a list, a tree, a forest (not discussed but obviously if trees are good - then forests (multiple trees in a set) are good for similar reasons).
When you see XML configuration - we actually transrept it to HDS and then, if necessary, into a localized and specialized low-entropy POJO (cached so one time hit with optimal lookup).
In addition HDS also has the powerful ability to be inverted. That is, if we know the value of a node but we need to locate it in the tree. The HDSInversion converts the tree to a node-value ⇒ tree-node map. This provides O(log) time entry into the tree for the most efficient execution of a computation F on the contextualized tree when a value is already known.
The Bigger Picture
You know that I've got a bee in my bonnet. I'm pretty sure that ROC is about as efficient a computational model as you can have. If that's the case, then it ought to be present in other systems. Systems where the imperative (survival) demands that any extraneous energy use would be severely punished (death and annihilation).
I told you that I had a recent excursion into the world of biology. Well similarly I went looking in other places too.
The Language Instinct
For the longest time I've known that our predominant form of written communication is in tree structures (we call them books or documents, or even web-pages). This isn't coincidence. Its got to be because our brains have a resonance for tree structured information.
So armed with a pre-conviction that I would find things I recognised, I started wondering about language. Human language. The primary medium of exchange of information between humans.
I bought a copy of "The Language Instinct" by Steven Pinker.
It's a beautiful book. Its the sort of stuff that should be standard knowledge, taught as soon as possible in school to every kid. Today's syllabus - Maths: this is how we count. Language Principles: this is how we all communicate. English (or your native language) - this is an instance of Language Principles. (I felt let down that no-one had told me this stuff earlier).
Guess what?...
- First revelation, the current standard model of the mind is called the "Representational Theory of Mind". It is discussed on pages 73-82. It is very very familiar. Its premise is that we (the human mind) resolve identifiers to abstract resources and reify these to representational state. (I am not kidding). I expressed it more succinctly since it is a little less concrete than our own world view - not surprising as we have full control of the internals of such an ROC system - the social scientist has to measure by indirect means and is prevented by ethics from taking people's heads apart bit by bit. But nevertheless it is uncannily familiar. (So now I need to go and talk to Social Scientists as well as Biologists - I need to get out more, twice as often).
- Second revelation. Every sentence you ever uttered (wrote, spoke) is a tree structure.
- Third revelation. Our brains have an abstract tree structured grammar model - this accepts infinitely complex variability in sentence structure to be "transrepted" to a normalized universal human information model. This theory (reinforced by repeated scientific evidence) by Chomsky is as important in its way to the human race as anything in the 20th Century. (Why didn't I know this before?)
- Fourth revelation. Every word you ever spoke is also a tree.
- Fifth revelation. These properties are universal - they appear in every language of every people of the world, and, for example, they also appear in sign-languages.
No matter what the language, both the sentences and the words are the input (and outputs) of our language processing system(s). They are both infinitely variable and yet, they (we) have the extraordinary property that when we use them, the same abstract information is proven to be conveyed from one information system (you) to another (me).
This is one heck of a strong piece of engineering best practice to underwrite the case for tree structured information architectures.
Back to ROC
Remember I said I went exploring language armed with pre-convictions. Remember I keep saying that REST paths are ok - that the NK simple grammar can parse them. But if you are building something to last and evolve you should use active identifiers and the active grammar?
Guess what? The active URI is a tree structured identifier. The active grammar is a tree structure normalizer so that abstract identifiers are decoupled from the evolving labels (identifiers) we give to information resources.
Do I have to draw a 1:1 map? Is it not evident we are standing on the shoulders of giants. Or maybe, like the map maker, we've been recursively standing on our own shoulders all along.
Have a great weekend, my legs are getting numb...
Comments
Please feel free to comment on the NetKernel Forum
Follow on Twitter:
@pjr1060 for day-to-day NK/ROC updates
@netkernel for announcements
@tab1060 for the hard-core stuff
To subscribe for news and alerts
Join the NetKernel Portal to get news, announcements and extra features.