The fundamental concept behind Swarm is that we should “move the computation, not the data”.
The Swarm prototype is a simple stack-based language, akin to a primitive version of the Java bytecode interpreter. I wanted the proof of concept to be quick to implement, while demonstrating that the concept could work for a popular runtime like the JVM or Microsoft’s CLR.
Update (Sept 17th 09): Swarm is now implemented as a Scala library, so you program in normal Scala, rather than a custom stack-based library as with the prototype described here. It uses the Scala 2.8 Continuations plugin to achieve this. See end of blog post for further information.
The prototype is implemented in Scala, and I will use snippets of Scala code below, but a knowledge of Scala won't be required to understand the rest of this article. I chose Scala because I wanted to learn it, and because its rich semantics tend to make coding easier and faster than in Java (my normal language of choice).
As with the JVM, there are three places to store data in the Swarm VM: the stack, a local variable array, and the store. The stack is used for intermediate values in computations; data here tends to be very short-lived. In the prototype it is implemented as a List[Any]. The local variable array is for data that is used within a block of code (it's implemented as a Map[Int, Any]).
The “store” is somewhat analogous to the JVM heap. It is used for long-term storage of data; indeed, in an actual implementation it might be persistent and/or transactional, but in the prototype it is in-memory. The store contains “objects”, each of which is a list of key-value pairs. The values may be references to other objects. The store is implemented as a Map[Int, Map[String, Any]].
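To make this concrete, here is a minimal sketch of the three storage areas using the representations just described. The example objects (a node holding 42 with a reference to a node holding 7) are invented for illustration; ObjectRef is a hypothetical name for the reference type.

```scala
case class ObjectRef(id: Int) // a value that points at another stored object

// The three storage areas, with the prototype's representations:
var stack: List[Any]                  = Nil       // intermediate values
var locals: Map[Int, Any]             = Map.empty // the local variable array
var store: Map[Int, Map[String, Any]] = Map.empty // object id -> key/value pairs

// Put two objects in the store; object 0 references object 1.
store = store + (0 -> Map("value" -> 42, "left" -> ObjectRef(1)))
store = store + (1 -> Map("value" -> 7))

// Follow the reference from object 0 to object 1.
val leftRef   = store(0)("left").asInstanceOf[ObjectRef]
val leftValue = store(leftRef.id)("value") // 7
```

Note that a value in an object can be a plain number or string just as easily as a reference, which is what lets the store hold arbitrary object graphs.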
The store is the key to how Swarm is distributed. A store can be spread across multiple computers, and objects in a store can point to objects on remote computers. While each computer holds a different set of objects, every computer contains the same computer program (or must have access to it).
When code running on a computer wants to read or write an object stored locally, it can do so just as an ordinary program would: a very efficient operation. But what if the code needs to access a remote object? In this case, rather than moving the remote object across the network to the local computer, we freeze the computation and move it to the remote computer, where it resumes execution, now able to access the object directly.
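The decision at the heart of this scheme can be sketched as follows. Everything here is illustrative: the names are hypothetical, and the modulo partitioning in ownerOf is a placeholder standing in for however a real Swarm cluster tracks which node owns which object.

```scala
case class NodeId(n: Int)
val here = NodeId(0) // the node this code happens to be running on

// Placeholder: pretend ownership is determined by a simple partitioning.
def ownerOf(objectId: Int): NodeId = NodeId(objectId % 3)

sealed trait Outcome
case class ReadLocally(objectId: Int)      extends Outcome
case class MoveComputationTo(node: NodeId) extends Outcome

def access(objectId: Int): Outcome = {
  val owner = ownerOf(objectId)
  if (owner == here) ReadLocally(objectId) // cheap: an ordinary memory access
  else MoveComputationTo(owner)            // expensive: freeze, serialise, send
}

val local  = access(3) // 3 % 3 == 0, so node 0 owns it: read in place
val remote = access(4) // node 1 owns it: the computation moves, not the data
```

The important asymmetry is that the local branch costs essentially nothing, while the remote branch costs a network round trip, which is why object placement matters so much below.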
The frozen computation is called a “continuation”. A continuation is akin to using the “Save Game” feature in a computer game. It allows you to resume the game from the same place in the future, or even to email the saved game to a friend and have them continue where you left off.
Continuations are supported by some programming languages, such as Haskell and Ruby, but not by more common languages like Java and Python (at least, not directly).
In practice, the continuation must contain the current stack, the current local variables, and the program counter. These three things represent the “state” of a Swarm computer program. If you can transmit these three things to another computer that also has access to the same computer program, then it can pick up where you left off, and the author of the computer program doesn’t need to know or care about this little bit of magic.
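A sketch of that state triple, assuming hypothetical field names (the prototype's actual representation may differ):

```scala
// The three things a Swarm continuation must capture. Serialising these,
// and sending them to a node that holds the same program, is enough for
// that node to pick up execution exactly where this one stopped.
case class Continuation(
  stack: List[Any],      // intermediate values in mid-computation
  locals: Map[Int, Any], // the local variable array
  pc: Int                // program counter: index of the next instruction
)

// Freeze a computation mid-flight...
val frozen = Continuation(stack = List(3, 4), locals = Map(0 -> "x"), pc = 17)

// ...and "resume" it elsewhere by reading the state back out.
val resumeAt = frozen.pc // 17
```

Because the program itself is already present on every node, none of the code needs to travel, only this small bundle of state.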
Of course, sending this state across the network, even though it will typically be quite a small amount of data (probably fitting in a single IP packet), will still introduce a significant delay relative to the normal speed of code execution. For this reason, Swarm must try to ensure that objects are distributed among the computers in the cluster in such a way that over-the-network traffic is minimized.
Deciding where objects should go
Our goal is to minimise the number of times continuations must be packaged up and sent across the network. We can do this by ensuring that objects which are “tightly coupled”, meaning that the references from one to the other are frequently followed by the code, reside on the same computer.
For now we use a very simple approach to this. When an object is created, we attempt to put it on the same computer as the last object accessed. This is easy, because this is the computer that will be executing the code. The only situation under which we don’t do this is if the current server has too many objects relative to the other servers, in which case we create the object on the least-full server. This helps to balance objects across the servers.
Of course, this is a very simple approach; in a real-world implementation we'd want much more sophisticated load-balancing that monitored actual usage and moved objects as necessary.
To test these ideas I wrote a simulation in Scala. My goal was to create something (to quote Einstein) as simple as possible, but not simpler.
First I created a simple stack-based language. It supports primitive control commands, Goto and If. It also supports mathematical operations like addition, random-number generation, and greater-than comparison. Lastly, it supports commands to create, modify, and retrieve objects in the store. Recall that each object is a list of key-value pairs, where the key is a string, and the value can be a number, string, or a reference to another object. At the time of writing the language doesn't support subroutines, but these could be added with relative ease. You can find the definitions of these commands in Instruction.scala.
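A miniature of what such an instruction set and interpreter loop might look like (the real definitions live in Instruction.scala; the names, encodings, and instruction subset here are illustrative only, and the store-manipulation commands are omitted):

```scala
sealed trait Instruction
case class Push(v: Int)      extends Instruction // push a constant
case object Add              extends Instruction // pop b, pop a, push a + b
case object Gt               extends Instruction // pop b, pop a, push a > b
case class If(target: Int)   extends Instruction // pop a Boolean; jump if true
case class Goto(target: Int) extends Instruction // unconditional jump
case object Halt             extends Instruction

def run(prog: Vector[Instruction]): List[Any] = {
  var pc = 0
  var stack: List[Any] = Nil
  def popInt(): Int = { val v = stack.head.asInstanceOf[Int]; stack = stack.tail; v }
  var running = true
  while (running) {
    prog(pc) match {
      case Push(v) => stack = v :: stack; pc += 1
      case Add     => val b = popInt(); val a = popInt(); stack = (a + b) :: stack; pc += 1
      case Gt      => val b = popInt(); val a = popInt(); stack = (a > b) :: stack; pc += 1
      case If(t) =>
        val c = stack.head.asInstanceOf[Boolean]; stack = stack.tail
        pc = if (c) t else pc + 1
      case Goto(t) => pc = t
      case Halt    => running = false
    }
  }
  stack
}

// Computing 2 + 3 leaves 5 on the stack.
val result = run(Vector(Push(2), Push(3), Add, Halt))
```

Note that the entire execution state is the (pc, stack) pair threaded through the loop, which is exactly what makes the continuation trick above cheap.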
I used immutable data structures for the state of the execution: a List for the stack (Scala's Stack implementation is buggy – yay Scala!), and an IntMap for the local variables.
The simulated Store
Next I implemented the Store class. This is a map of integers to objects. Each object is itself a map of strings to arbitrary values, which could include integers, strings, ObjectRef objects, among others.
Overseeing the various Stores is a Cluster. Its role is to tell us where new objects should be created, via its loadBalance() function. It will generally create the object on the current computer (aka “node”), but will sometimes dictate that it be created on another node to ensure objects are reasonably well distributed.
The load-balancing algorithm is simple: no store is permitted to have less than half the free space of the least-full store. If creating the object locally would violate this rule, the object is created on the least-full store instead.
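A sketch of that rule, under assumed names (the actual Cluster.loadBalance() signature may differ), with tiny capacities like those in the simulation:

```scala
case class Node(name: String, capacity: Int, used: Int) {
  def free: Int = capacity - used
}

// An object is created on the current node unless that node has less than
// half the free space of the least-full node (the one with the MOST free
// space), in which case it is created on the least-full node instead.
def loadBalance(current: Node, all: Seq[Node]): Node = {
  val leastFull = all.maxBy(_.free)
  if (current.free * 2 < leastFull.free) leastFull else current
}

val nodes = Seq(Node("a", 20, 18), Node("b", 20, 5), Node("c", 20, 10))

val overloaded = loadBalance(nodes(0), nodes) // "a" has 2 free < 15/2: use "b"
val balanced   = loadBalance(nodes(2), nodes) // "c" has 10 free >= 15/2: stay put
```

Keeping the threshold at half, rather than insisting all nodes be exactly balanced, is what lets most allocations stay local to the executing code.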
A test program
Next I needed to write a simple program to see how the Swarm prototype performs. I needed something simple, but which would hopefully be representative of the way real-world software creates and uses objects.
I decided to implement a simple binary-search tree, and then insert random numbers into it. It took me a while, and was rather tedious, but eventually I got something working. You can see the end result in Test2.scala. Initially, line numbers had to be specified explicitly; this was a royal PITA, so I later implemented a label mechanism, and a “compilation” pre-processing stage (implemented in Interpreter.scala) that assigned line numbers to the labels.
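For readers who don't want to wade through Test2.scala's bytecode, here is a sketch of what that program does, written directly against the store representation rather than in the stack language. The names (Ref, heap, alloc, insert) are invented for this sketch; each tree node is a stored object with a "value" field and optional "left"/"right" references.

```scala
case class Ref(id: Int)

var heap = Map.empty[Int, Map[String, Any]] // the simulated store
var nextId = 0

def alloc(fields: Map[String, Any]): Ref = {
  val r = Ref(nextId)
  nextId += 1
  heap = heap + (r.id -> fields)
  r
}

def insert(root: Ref, v: Int): Unit = {
  val node = heap(root.id)
  val key = if (v < node("value").asInstanceOf[Int]) "left" else "right"
  node.get(key) match {
    case Some(child: Ref) => insert(child, v) // descend through the reference
    case _ =>
      val leaf = alloc(Map("value" -> v))     // new leaf object in the store
      heap = heap + (root.id -> (heap(root.id) + (key -> leaf)))
  }
}

val root = alloc(Map("value" -> 50))
Seq(30, 70, 60).foreach(insert(root, _))
// heap now holds four objects: 50 at the root, 30 left, 70 right, 60 under 70
```

In the distributed setting, each insert walks the tree by following references, so every hop onto an object owned by another node is one shipped continuation: exactly what the experiment below measures.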
I created 5 virtual computers, each with a capacity of 20 objects. Of course, this is a ludicrously small number of objects (in the real world, a node would be able to handle millions, even billions of objects), but I wanted to stress the system.
Here is a graph of the resultant tree, the color indicates which of the computers the object is stored on – as you can see the root of the tree was stored on the pink computer.
In this experiment, I show the percentage of instructions that result in a continuation being transmitted between computers as the size of the tree is increased:
As can be seen, it increases rapidly before leveling out around 3.25%.
Even the simplistic load-balancing approach employed in the simulation seems to be reasonably effective; only 3.5% of instructions required some communication between nodes. A number of factors conspired to make this a difficult task:
- The insertion of data at random points in the tree makes it more difficult for the algorithm to group branches of the tree together on the same computer
- Each node had very limited capacity
Continuous re-allocation of objects between computers in response to actual usage is key. There will need to be a trade-off between the efficiency of the re-balancing process and its effectiveness. Near-term work on the simulation is likely to focus on this.
I guesstimate that turning this concept into an actual working piece of software would be a 5 man-year effort. Among the major questions/challenges are:
Can we work with an existing virtual machine, or do we need a new language?
Ideally we would be able to make Swarm support a widely-used bytecode language such as the Java Virtual Machine, Microsoft’s Common Language Runtime, or the LLVM compiler infrastructure. Swarm’s simple bytecode is a simplification of the Java Virtual Machine for this purpose. However, it is possible that we will uncover reasons to need a new language for this. That would be unfortunate, as it would create a significant barrier to entry.
What about persistence, transactions, and replication?
If we want Swarm to have the potential to replace databases, then it will need to be able to store objects persistently (i.e. on disk). It will also need to support transactions (so that we can be sure that the stored data is always in a consistent state). We will also need a solution to ensure that it is fault-tolerant, meaning some kind of replication scheme.
Right now Swarm is still more of a thought-experiment for me, as I am still very much focussed on my day job. That being said, I have had some conversations with a number of people who have expressed an interest in providing funding to make Swarm a reality. As I mentioned before, it would be a big task, probably requiring around 5 very smart people working full-time for at least a year. The potential, however, is huge – just imagine being able to write massively scalable applications without having to think about scalability and persistence!
You can find the Scala source code in a public git repository here. The code is not that well commented, and at the time of writing it will spit out a Graphviz file to produce the tree diagram above. It's not really intended for public consumption, but it may be interesting for people to browse.
Update 2 (17th Sept 09): I’ve done a lot more work on this, see here.