WiNK edit

wiki /NetKernel /News /1 /24 /April_16th_2010

NetKernel News Volume 1 Issue 24

April 16th 2010

What's new this week?

NetKernel Standard/Enterprise Edition 4.1.0.
Si Valley ROC talk.
NetKernel Protocol status check.
Do it yourself twitter infrastructure.

NetKernel 4.1.0

We're pleased to announce that two new distributions of NetKernel are available...

NetKernel Standard Edition 4.1.0 is available here: http://download.netkernel.org/nkse/

NetKernel Enterprise Edition 4.1.0 preview 4 is available here: https://cs.1060research.com/csp/download/ (registration required)

Repository Status

We will maintain the NKSE 4.0.x repository with significant bug fixes and/or critical security updates for the next six months.

We recommend that for development and to stay abreast of the latest updates and new features that you transition to use a 4.1.x distribution.

Common Updates

Both NKSE and NKEE 4.1.0 incorporate all of the last 6 months of package updates and reset the baseline with their own 4.1.0 repositories.

The distributions are backwards compatible and your applications can be deployed straight to a 4.1.0 distro.

H2 Migration

One potential area to take note for application compatibility is the H2 database.

The H2 embedded database is used in the NK system in a few places. Notably it's the default persistence behind pds: and is also the engine used for the apposite client. Over the last year H2 has been transitioning to the 1.2.x series which is based on their new "Page Store" filestructure offering better integrity and performance. Recent releases have even removed support for their older 1.1.x format entirely.

Since this component plays an important role in the system, we want to ensure we can track ongoing fixes and enhancements. It has therefore made sense for the NK 4.1.0 builds to update to 1.2.x and its Page Store format. Therefore you'll find that the h2.db module ships the very latest 1.2.133 build of H2.

 From the regular NK system point of view this is not news and you

won't see any difference. However, if you've used H2 as your DB engine for your applications developed on NKSE 4.0.x you'll probably want to convert them to the new page store format so that you can seamlessly keep working with the NKSE 4.1.0 db packages. Here's what you need to do...

1. Download a copy of H2 1.2.128 (this is their last build that supported the old format)...

http://code.google.com/p/h2database/downloads/list

2. Temporarily place the 1.2.128 jar in the urn.org.netkernel.h2.db module's lib/ directory and rename 1.2.133 from jar to some other extension (so its not in the classpath). 3. Change your application's JDBC URL connection and add the following at the end ";PAGE_STORE=true" (note semicolon). 4. Run your app as normal. The first time that H2 tries to connect to your DB it will automatically convert it to the new format. (It takes a backup of the older format files in a zip if needed). You can tell the new format as it has the file extension ".h2.db". 5. You're now done. You can revert out the 1.2.128 version from the db.h2 module and use the latest version we ship.

If you've got business critical data and need any help with this process just let us know and we'll help out.

NetKernel Enterprise Edition 4.1.0 Preview 4

We are still in preview mode with Enterprise Edition, but rapidly closing in on the completed feature set. This preview 4 release includes the common NKSE 4.1.0 infrastructure. It also includes the latest builds of the enhanced NKEE tools and libraries. This release features a new nkee-architecture module...

nkee-architecture: Is a library of Enterprise architectural components that provide control and ease-of-implementation of a number of advanced architectural patterns. Included in this first release is the Profile Overlay, which can be used to transparently wrap a space and provides runtime statistical profiling of all requests into the space. Also included is a Virtual Host Endpoint, which allows hostname-based routing into isolated application address spaces enabling full virtual hosting on NKEE. The endpoint is general purpose and could be used to provide arbitrary grammar based routing into isolated spaces, not just hostname routing.

A number of other NKEE architectural components are in the works and will be released via nkee-architecture package updates in the coming weeks.

Si Valley / Bay Area NetKernel Talk/Meet-up

I'm going to be in Silicon Valley / SFO Bay Area from the 16th May (volcanoes permitting). I'd like to use the opportunity to give a NetKernel/ROC talk. Anyone have any ideas for locations or, even better, could help host ;-) Thinking this would be the evening of either Monday 17th or Tuesday 18th May. If you know anyone in the area, please put the word out.

Would also be happy to go out for beers!

NKP Status Report

The NetKernel protocol (NKP) development is moving on nicely. The client-side now presents the logical endpoints from the server side. That is, the set of endpoints present in the host fulcrum space of the server side (downstream NetKernel) are logically mapped over the wire into the address space on the client-side (upstream NetKernel).

So a request on the client side can be resolved locally (using the remote side's grammars), before being shipped over the wire if it is resolvable - this is very efficient since it means that network cost is only incurred when you know that the resource will be reified (computed) by the server. So both network and computation costs are deferred to just in time.

This starts to show a little of how NKP offers extra dimensions beyond HTTP/REST. In HTTP, client-side resolution only goes as far as the DNS stage and the URL->resource resolution is done server side. The reason being that in HTTP/REST, resolution is not a part of the abstraction and so there is no resource that offers resolution metadata. In NetKernel (locally) and NKP (distributed), the resolution phase of the resource oriented system is itself resource oriented, and treats metadata as a resource too!

You might worry that NKP's client-side resolution implies that the client-side requires pre-knowledge of all possible sets of resources, but that's not the case. NKP means you can take appropriate engineering decisions about pre- or post- resolution/reification. So if you wish you can fallback to the identical pattern as HTTP, where the client-side would fulfil the same role as DNS and only decide which cluster node to direct a request to. These choices of system balance become an application-level architectural variable you have control of.

As far as work remaining on NKP, there are still some low-level metadata consistency hooks to introduce, but to first-order the current state of development implements a practical 90% coverage and could already be used to implement many distributed architectures.

DIY Twitter

The NKP work got me thinking. What would be a good way to demonstrate the technology?

As you might have noticed by the increased content in this newsletter, I've recently forced myself, against my nature/cultural inclinations (reserved Brit), to be more vocal. To that end, I've been trying to increase my twitter use (http://twitter.com/pjr1060).

Twitter is one of those things who's success is due to its innate simplicity. Its resource model is easy for end-users to understand and use. The hidden complexity is in the engineering needed to ensure that the backend scales to deliver the immediacy that each user demands.

As an exercise in ROC modelling and NKP architectural partitioning, I thought I'd talk through what an NK implementation of the twitter backend might look like [pjr comment: What follows is fairly long and has been deliberately left as a stream of consciousness - if the details don't interest you, the thought processes still might be useful as insight into the way I go about an ROC design]...

Firstly lets consider the resource model. It consists of 3 core sets of first-order resources:

(A) mytweets/[userid]  - the merged stream of [userid]'s tweets and all
of the tweets of [userid]'s friends.
(B) tweets/[userid] - the stream of tweets by [userid].
(C) friends/[userid] - the list of friends of [userid].

We can see, at least for the sake of our first consideration of the model, that (B) and (C) are atomic resources, ie they are not composite resources. We also see that (A) is a composite resource which consists of the transformation of (B) and (C) (the sorted superset of (B)'s belonging to the (C)'s).

There are other first order resources, for example

(S) search/[identifier] - the set of all tweets matching [identifier].
(T) time/[range] - the set of all tweets within a given time [range].

For now it'll be apparent that the general architectural principles concerned with the (A-C) will be applicable to these other sets.

OK, that's a rough outline of the read-side channels of the resource model. On the write side we have status updates and adding/removing friends but lets not worry about defining these for now.

Lets step away from the resource model and think in terms of the system engineering, computational cost, the quality-of-service requirements etc etc.

What's our most expensive cost? Probably the the persistence mechanism, the database. This is for a number of reasons. Firstly atomic persistence demands consistency, which means its either hard, or expensive, or both to distribute the stored data, so access to the database is a natural bottleneck. Secondly, high-end database vendors have non-linear pricing models, they know that once you lay down data in their DB engines, the bigger you get the more they make. So for performance and economic reasons our imperative for this architecture is to find a function min([DB requests]).

What do we know that allows us to look for such a minimum use of the DB? Well fortunately we know that all social networks have the property that Read >> Write. Think about a dinner party, if everyone spoke at the same time (so that write>=read) there'd be chaos. We also know that your sent email folder is way way smaller than your received folders (unless you're a spam bot).

These empirical observations suggest that we're onto a good thing. If our server had infinite local memory we could envisage an ideal solution that would require only one read of the db for every write. Of course we don't have that, so we need to work out a good dynamic equilibrium by using the architectural variables that ROC gives us.

OK, lets get concrete and consider how we compute A(x), where "A(x)" is shorthand notation for the full resource: myteets/[x].

Lets create an endpoint with a grammar something like...

<grammar>res:/mytweets/
  <group id="userid">
    <regex type="anything" />
  </group>
</grammar>

Which we use to bind requests to an endpoint, lets call the physical endpoint code: EP-A

EP-A has to implement a merged view of all tweets of the user and their friends. Before we can implement EP-A we have to implement an endpoint for the friends of userid C(x), and to cut a long story short, we're going to need an endpoint that can give us any given user's tweets B(x), let's quickly consider B(x)...

We'd create a single endpoint EP-B with a grammar something like...

<grammar>res:/tweets/
  <group id="userid">
    <regex type="anything" />
  </group>
</grammar>

EP-B in pseudo code, and assuming our persistence is an RDBMS, does something like this...

userid=request.getArgumentValue("userid") //Get the userid argument
sql="SELECT * FROM tweets WHERE userid='$userid' ORDER DESC LIMIT 10"
rep=doDBQuery(sql)
//associate a virtual dependency with the resource...
attachGoldenThread("twitterGT:"+userid);
return rep

Don't get hung up on the code - it really doesn't matter and for sure we'd change this as we go on. This is just an illustration, the important thing is that EP-B reifies a resource representation for B(userid), the set of tweets of [userid]. It is cacheable and remains valid for as long as twitterGT:userid is valid. So, in an ideal perfect server, if [userid] never made another tweet we'd never have to hit the DB again (read=write=1). We'll come back to the golden thread stuff.

OK lets assume we can do something very similar for C(userid), the friends of [userid]. Lets also assume that C(userid) includes [userid], ie you are your own friend. So returning to EP-A we can now pseudo code up the endpoint...

userid=request.getArgumentValue("userid")
friends=SOURCE C(userid)
rep=null
for friend in friends
(
    rep+=SOURCE B(friend.id) //Source tweets of friend
)
sort(rep)   //Sort in time descending order
filter(rep, N)  //Include only first N tweets.
return rep

So rep is the representation of A(userid). Notice that it is cacheable since it is identified uniquely, and it depends upon each of the member resources in the set C(userid) because we have sourced each B(x). [pjr comment: Bear in mind that almost all B(x) will be cache hits - but see below for discussion of why this is a somewhat naive first model and would really need a differential resource approach].

The result is that this A(userid) is now computed and will be repeatedly served from cache for as long as C(userid) is unchanged and, because B(x) has golden thread twitterGT:x, as long as all the twitterGT:X are valid. (We're still on target for our ideal limit of DB read=write).

Since twitter clients use polling its clear that this is going to be very efficient, since in steady-state we don't have to compute anything, just send the resource. Better yet, if the client is well implemented and it sends an ETAG of their local cache, we can just send a 304 not modified.

OK lets consider a change of state in the system. Lets still imagine that we have a perfect infinite server. Our user issues a new tweet by issuing a state transfer request to this resource ...

(D) SINK status/<span>[</span>userid]

This is a write channel. Its independent of any of the other channels we've talked about so far. So lets assume we have a grammar...

<grammar>res:/status/
  <group id="userid">
    <regex type="anything" />
  </group>
</grammar>

...bound to some endpoint implementation: EP-D. In pseudo-code that might look like...

userid=request.getArgumentValue("userid")
primary=context.sourcePrimary(String.class) //Source the incoming
transferred state as a String.
sql=INSERT INTO tweets VALUES ('$userid', '$primary');
updateDB(sql);
cutGoldenThread("twitterGT:"+userid);
return null;

Again, don't worry about the code, its just detail. The key thing is that state is received, our resource persistence mechanism is updated.

Before we return, notice that we cut the golden thread twitterGT:userid. With this step we atomically expire all dependent resources. So A(userid) and B(userid) are both instantaneously out-of-date. Anyone requesting A(userid) will force the recomputation. More importantly anyone who is a friend "f" of [userid] will also see their A(f) expire since it depends on B(userid) which depends on the golden thread we just cut.

If you're following you'll see that we're still tracking our read=write objective. Since the first re-request of A(userid) (or B(userid)) will create a new cacheable representation depending on the golden thread. So the update cost is only one database read which covers everyone in the system who is a friend of [userid]. Even a super friend (SF) like Stephen Fry (SF!) would need no more reads.

OK you should be able to see the general principles - you can probably also see that to implement a simple twitter system in a single NK server is going to be of the order of 100 lines of code.

But we don't have ideal servers. Not least our engineering limit is that the HTTP client/server protocol is going to exceed the NIC capacity of a single server. Or we're going to have so much state at any given time that it exceeds one server's local memory capacity. Or the next dominant compute cost, A(x)'s merging and sorting the friend feeds, will start to exceed the CPU capacity.

So how would we scale this architecture across a cluster? We break the potentially infinite sets A(x), B(x), C(x) down into subsets and place those subsets on their own physical implementation.

So we might have a dozen identical front-end servers all of which apparently serve A,B,C. (we'd have a network load balancer in front to do IP affinity / load balancing across this tier).

Each front end would no longer have an instance of the EP-A, EP-B, EP-C endpoints, instead we'd have an NKP client connection to each of 3 implementing servers.

A-Server - serves user personal tweet stream A(x)
B-Server - serves B(x) tweets of x.
C-Server - serves C(x) friends of x.

If we wanted to, we could subdivide A-Servers into A("A-C"), A("D-F")...A("W-Z"). Where each A would handle only userid's in the first-lettered subset specified, and so on for B's and C's. We could have load-balancing front-ends for the A,B,C's too so that the front-end servers only have to connect to 3 consistent places.

We could keep adding servers with ever finer subset responsibility, in the limit you would have one server per user id! Sounds mad, but imagine instead of twitter we were subdividing data analysis for the Higgs Boson and each resource representation took serious compute power.

OK you should be seeing the picture. Just one more thing to consider. The tiered computed resources percolate up to the edge of the cluster. Each one will be locally cached in its local server. The NKP protocol ensures that dependency relations are preserved upwards - so the front-end servers will depend on the same logical golden thread consistency. But, slightly subtly, the golden threads for the A's, B's and C's are not necessarily tied together because they were implemented locally at the leaf nodes of the request tree.

We can tie this up by implementing a simple golden thread expiry write channel on every server in the cluster. A request to this channel would simply indicate the identity (g) of the golden thread that is expired, its implementation would then call cutGoldenThread(g) - all local state on that cluster node depending on g would be atomically expired.

When a user tweet happens, all we'd need to do is call the expiry channel with the userid for each node in the cluster responsible for that [userid] subset (eg A("A-C") etc). We don't care if there is any state there, or how it may be used in composite resources etc. Effectively we can apply a very fine scalpel to expire the minimal subset of dependent state in the system.

We could ensure that our physical update to the DB is transactional, so that we can do the cluster expiry as lazily as we like, knowing that as the distributed state is expired, the DB is locked on our resource. Only when we've finished notifying the cluster would we commit the DB update, at which point any requests for our resource issued during the expiry phase, would get access to the DB to start rebuilding the resource representations.

Given that this is just a social network and any change propagation within a minute is probably sufficient, we don't have to be super transactional and can allow dirty reads. So we don't need to lock the database. We can also use timed dependent expiries for all our representations so that they are valid within a given last use period or'd with their dependencies.

I've run out of stamina to explain the detail of how we might implement the search resource sets, but basically you'd have a resource that contained a list of all tweets matching a given search. It would be very very expensive to query the DB for searches in real time so you'd probably maintain a secondary index system too. However even accessing this in realtime would get costly, so instead, each time a tweet status update came in you'd request another resource containing the map of all current search terms (this will be relatively quite small). You'd split the tweet down into its constituent words, if they were in the search map you'd add this tweet to the search item in the DB and expire the search's golden thread, next time the search is requested it will regenerate the search resource since it's no longer cached.

There are lots of necessarily missing details in this discussion. On reflection, splitting the sets as described, illustrates the point about clustered Golden threads, but it is not necessarily the best partitioning. It might make more sense to have all the partial subset A's, B's and C's endpoints implemented on each clustered server S and responsible for all resources of the identified user subsets (that way you probably minimize the use of the distributed GT pattern - since the bulk of the collection of resources for a given user will be on the same server instance).

You'd also quickly find the limits of the naive A(x) implementation, imagine the cost (even with cache hits) for a superfriend with a million followers. You'd have to bite the bullet and recognise that our single read objective for the DB is too limiting. Instead you'd have to introduce a differential resource model DeltaC(x) the set of friends of x who have recent tweets. Which would require that the write channel would update a secondary table holding this delta list. A(x) would source the delta list and only merge that. Although even here with second-order differential resources you'd still quickly find local equilibriums and so cache the majority of the resource sets for the majority of the time.

The implementation details aside, it should be clear that playing and tuning the system with some real data would quickly allow us to tweak the architecture to get a good balance and introduce any necessary 2nd order resources, constraints and caching parameters.

Hopefully this discussion starts to show that with ROC you can play with architectural and engineering variables to find a pragmatic dynamic equilibria. In this case, steady state interludes are statistically quite frequent, (a property of read>>write system) so the solution will often discover regions of the resource space that approach the ideal minimum of read=write.

One final thought. Going back to last week's post about demonstrable value of ROC being greatest when requiring system changes...

Now that twitter is going commercial, and will soon start inserting advertisements, their resource model will include demographic profile resource sets of us poor proletariat. To implement this change in the system above, all we'd need to do is change (C) to a dynamic composite resource set, the union of friends and instantaneous advertizers targeting you. In effect recode EP-C. Nothing else needs to change, especially not the resource identifiers and architectural relations. More sophisticated targeting might do content analysis on the A(x) but again this remains internal to EP-A.

Linked Data Reprise

I wrote the twitter discussion earlier in the week. I just reviewed it before sending and realize I've done it again with linked-resources (see Newsletter Vol 1 Issue 22). In the discussion I've just assumed that linked data (composite linked-resources) are a given.

My implicit assumption is that each of the A's,B's and C's would mostly consist of resource references - so for example the B's would actually be a list of resource references to individual tweets (which would have a unique identifier in the set tweet/x (and need an endpoint to reify them etc)). To reify the final output resource, the linked resources would be evaluated and included into the composite whole. In this case, since we'd probably want XML outputs, I'd probably do that with XRL.

Something of a monster newsletter this week. By the time you've read to this point you'll either be asleep or it'll be next Friday and there'll be another one to fall asleep over.

Have a great weekend!

Comments

Please feel free to comment on the NetKernel Forum

Follow on Twitter:

@pjr1060 for day-to-day NK/ROC updates
@netkernel for announcements
@tab1060 for the hard-core stuff

To subscribe for news and alerts

Join the NetKernel Portal to get news, announcements and extra features.

NetKernel will ROC your world

Download now

NetKernel, ROC, Resource Oriented Computing are registered trademarks of 1060 Research