Friday, January 23, 2009

Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (4)

This is part 4 of a multi-post sequence on this topic which began here. This part discusses some implementation issues and techniques.

Implementation, Briefly

The only implementations I know much about were the Locus ones. I use the plural deliberately, since two different organizations were used over time.

Initially, the problem was approached in the most straightforward way conceivable: Start at the root of the source tree and crawl through every line of code in the OS. Everywhere you see an assumption that some resource is only on one node, replace it with code that doesn't assume that, participates in cross-system consistency and recovery, and so on.

This is a massive undertaking, requiring a huge number of changes scattered throughout the source tree. It's a mess. And it's massively complicated. I am in awe that they actually got it to work. As you might imagine, after a few of their many ports, they seriously began looking for a better way, and found it in an analogy to the Unix/Linux vnode interface.

Vnode is what enables distributed file systems to easily plug into Unix/Linux. Its basic idea is simple: Anytime you want to do anything at all to a file, you funnel the request through vnode; you don't do it any other way. Vnode itself is a simple switch, with an effect that's roughly like this:

  If   <the vnode ID shows this is a native file system file>
  Then <just do it: call the native file system code>
  Else <ship the request to the implementer, and return its result>

If the implementer is on another computer, as it is with many distributed file system implementations, this means that you ship the request off to the other computer, where it's done and the result passed back to the requestor.
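To make the switch concrete, here is a minimal Python sketch of the idea. It is not the real vnode interface: the names Vnode, vnode_call, native_fs, remote_fs, and the IMPLEMENTERS registry are all hypothetical, and the "remote" implementer is just a local function standing in for a request shipped to another computer.

```python
from dataclasses import dataclass
from typing import Callable, Dict

NATIVE = "native"  # hypothetical tag marking a native file system file


@dataclass
class Vnode:
    fs_type: str  # which file system implements this file
    path: str


def native_fs(op: str, path: str) -> str:
    # Stand-in for the native file system code.
    return f"native {op} {path}"


def remote_fs(op: str, path: str) -> str:
    # Stand-in for shipping the request to another computer and
    # passing its result back to the requestor.
    return f"remote {op} {path}"


# Hypothetical registry of non-native implementers, keyed by file system type.
IMPLEMENTERS: Dict[str, Callable[[str, str], str]] = {"dfs": remote_fs}


def vnode_call(vn: Vnode, op: str) -> str:
    """Every file operation funnels through this one switch."""
    if vn.fs_type == NATIVE:
        return native_fs(op, vn.path)  # just do it locally
    return IMPLEMENTERS[vn.fs_type](op, vn.path)  # ship it to the implementer
```

The point of the sketch is that callers only ever use vnode_call; they never need to know which branch ran.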

What later implementations of the Locus line of code did was create a vproc interface analogous to vnode, but used for all manipulations of processes. If a process is local, just do it; otherwise, ship the request to wherever the process lives, do it there, and return the result. This neatly consolidates a whole lot of what has to be done into one place, enabling natural reuse and far greater consistency. It is definitely a win.
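The same pattern applied to processes might look roughly like the following Python sketch. This is an illustration under assumptions, not the Locus code: the Node class, the shared location table, and vproc_signal are hypothetical names, and a direct method call on another Node object stands in for a cross-node message.

```python
class Node:
    def __init__(self, name, location):
        self.name = name
        # Shared dict of pid -> owning Node; a stand-in for whatever
        # cluster-wide lookup a real implementation would use.
        self.location = location
        self.procs = {}  # pid -> list of signals delivered locally

    def spawn(self, pid):
        # Create a process on this node and record where it lives.
        self.procs[pid] = []
        self.location[pid] = self

    def vproc_signal(self, pid, sig):
        """All process manipulation funnels through here."""
        owner = self.location[pid]
        if owner is self:
            # Local process: just do it.
            self.procs[pid].append(sig)
            return f"{self.name} delivered {sig} to {pid}"
        # Remote process: ship the request to the owning node,
        # do it there, and return the result.
        return owner.vproc_signal(pid, sig)


location = {}
a = Node("A", location)
b = Node("B", location)
b.spawn(42)
result = a.vproc_signal(42, "SIGTERM")  # issued on A, performed on B
```

The caller on node A uses exactly the same call whether process 42 is local or remote, which is where the reuse and consistency come from.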

Unfortunately, the rest of the kernel isn't written to use the vproc interface. So you still have to crawl through the OS and convert any code that directly manipulates a process into code that manipulates it through vproc. This is still a pain, but a much lesser one, and you get some better structure out of it. (I believe vproc got into the Linux kernel at some point.) However, this alone doesn't do the job for IO, and it doesn't create a single namespace for sockets, shared segments, and so on; all that has to be done independently. But those aspects are generally more localized than process manipulation, so you have fixed a significant problem.

Other implementations, like the Virtual Iron one and, I believe, the one by ScaleMP, take a different tack. They concentrate on shared memory first, implementing a distributed shared memory model of some sort (exactly what kind, I'm not sure). That gets you the function of shared memory across nodes, which is a necessity for any full kernel SSI implementation and does appear in Locus, too. But it doesn't cover everything, of course.

Enough with all the background. The next post gets to the ultimate point: If it's so great, why hasn't it taken over the world?

4 comments:

Eyston said...

This blog is fantastic -- your writing is both informative and entertaining. I look forward to more.

Anonymous said...

DragonflyBSD, a fork of FreeBSD, is an open source single system image project.

Greg Pfister said...

@huey - Thanks! There will be more, I promise.

@Anonymous - Also thanks! For the pointer to DragonflyBSD. I wish them all the best, but I don't think they'll like my next post.

-- Greg

Anonymous said...

great post
