In a previous post I said that one of my reasons to develop a RELOAD implementation was to implement VIPR. Another reason is to be able to develop software that is resistant to developer bugs.
The Internet is global, which means that a service should be available 24/7; a web page saying that a service is down for maintenance is just plain incompetence, and companies doing this should be publicly shamed for it. There are multiple ways to build systems that keep working through software updates, operating system bugs and other disasters, and I think that Google did a lot of good in this domain by showing that it can be done without spending millions of dollars on hardware – I advocated a Google-like architecture long before Google even existed, and it was a relief when Google started to publicly disclose some of the things they were doing, because then I could end my explanations with "and BTW, this is exactly what Google is doing!" Herd mentality always works.
Now imagine a service that is already doing all the right things – Amdahl's law is the Law of the Land, the sysadmin motto ("if it ain't broke, don't fix it") is cause for immediate dismissal, programmers cannot commit without unit tests and peer reviews, and so on. The problem is that even when using the best practices (and by best I certainly do not mean the most expensive) a whole system can still go down because of a single programmer's bug – and we have certainly had some very public examples of that recently.
Bugs are not something programmers should be ashamed of – instead they should be thought of as a probability, and this probability integrated into the design of a service. One of the best proofs that bugs are not preventable mistakes but unavoidable limitations of software is the so-called "software bit rot", i.e. the spontaneous degradation of software over time. Software does not really degrade over time; it is just that the environment it runs in changes over time – faster processors, larger memory and so on. Basically this means that even if we can prove that a program is perfect now, unless we can predict the future it is not possible to guarantee that it will be without bugs in the future.
So, as always, if there is no solution to a problem then just change the problem. Bugs have a more or less constant probability of happening, i.e. X bugs for Y lines of code. I am not aware of any research in this domain (pointers welcome!) but intuitively I would say that two independent developers starting from the same set of specifications will come up with two different programs and, more importantly to my point, with two different sets of bugs. If the probability of a major bug in any implementation is X, the probability that the same environment triggers the same bug in two completely different implementations should be X^2 (assuming that developers are not biased towards certain classes of bugs). In other words, if we can run two or more implementations of the same specifications in parallel and choose the one that is working correctly at a specific time, then we increase the reliability of our service by multiple orders of magnitude (one of my old posts used the same reasoning applied to ISPs; just replace "service down" with "someone will die because I cannot reach 911").
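To put numbers on this, here is a toy calculation – the 1% bug probability is an arbitrary assumption of mine, and real implementations are never perfectly independent:

```java
public class Redundancy {
    public static void main(String[] args) {
        double x = 0.01; // assumed probability that one implementation hits a fatal bug
        for (int n = 1; n <= 4; n++) {
            // with n independent implementations, the service is down only
            // if all of them fail on the same input at the same time
            System.out.printf("n=%d -> failure probability %.0e%n", n, Math.pow(x, n));
        }
    }
}
```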
Obviously running multiple implementations of the same specifications at the same time is really expensive and, as far as I know, is done only when a bug can threaten lives, as in a nuclear power plant. If we can solve the economics of this problem, then we should be able to offer far better services to end-users.
Now back to RELOAD. RELOAD is a standard API for peer-to-peer services. There is nothing really new in RELOAD – it uses technologies, like Chord, that are well-known and, for some people, probably close to obsolescence. The most important point, at least in my opinion, is that for the first time we have a standard API for these services, with multiple implementations that can interoperate with each other. Having a distributed database is great for reliability purposes, but having a distributed database that runs on different implementations, as RELOAD permits, is far better.
But RELOAD is not limited to a distributed database. With the help of ReDIR, we can have multiple implementations registering a common service, so if one implementation fails, we can still count on the other implementations of the same service, as in the sketch below.
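Here is a minimal sketch of what that failover could look like; the `ServiceRegistry` interface and its methods are hypothetical placeholders for illustration, not the actual libreload-java or ReDIR API:

```java
import java.util.List;

interface ServiceRegistry {
    // ReDIR-style registration: several nodes, possibly running different
    // implementations, register under the same service name
    void register(String serviceName, byte[] nodeId);
    List<byte[]> lookup(String serviceName);
}

class FailoverClient {
    static byte[] call(ServiceRegistry registry, String serviceName, byte[] request)
            throws Exception {
        // each provider may run a different implementation of the same
        // specification; if one of them hits a bug, try its competitor
        for (byte[] node : registry.lookup(serviceName)) {
            try {
                return sendRequest(node, request);
            } catch (Exception bug) {
                // this implementation failed us; fall through to the next one
            }
        }
        throw new Exception("all implementations of " + serviceName + " failed");
    }

    // transport deliberately left out of the sketch
    static byte[] sendRequest(byte[] node, byte[] request) throws Exception {
        throw new UnsupportedOperationException("not part of this sketch");
    }
}
```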
As for the economic side of having multiple implementations: instead of having only my own implementation, developed in-house, I can swap it with similar code developed by other implementers. In the end, if I am able to do this with 3 or 4 other implementers, all of us will have an incredibly more resilient service without spending a lot more money (this is the reason why I am so against having THE ONE open or free software implementation of a standard. The more implementations are developed, the better the users will be served. Implementation anarchy is a good thing, because it fosters evolution).
OK, I admit that was a long introduction for version 0.6.0 of the libreload-java package and the associated new Internet draft about access control policy scripts, especially because access control policy scripts paradoxically do not improve the reliability of a service based on RELOAD, but reduce it. The reason is that access control policies are ordinarily implemented natively, so a bug is local to an implementation. This draft permits distributing a new access control policy, but as the same code will be shared between all the different implementations, this creates a single point of failure. This is why access control policy scripts should be considered only a temporary solution, until the policy can be implemented natively.

The script copyright should prevent developers from copying it, but we all know how that works, so to help developers do the right thing and not be tempted to look at the scripts when implementing the native version in their code, the source code in the XML configuration file is now obfuscated by compressing it and converting it to a base64 string.
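The obfuscation step itself is trivial; here is a minimal sketch in Java, assuming zlib-framed DEFLATE (the Java default), which may differ from the exact format the draft specifies:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class PolicyScriptCodec {
    // compress the policy script, then encode it as base64 for embedding
    // in the XML configuration file
    static String obfuscate(String script) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(buffer)) {
            deflater.write(script.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    // the reverse path, as an implementation loading the script would do it
    static String deobfuscate(String encoded) throws IOException {
        byte[] compressed = Base64.getDecoder().decode(encoded);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InflaterInputStream inflater =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = inflater.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```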
The draft also introduces an idea that, in my opinion, could be worth developing: the signer of a configuration file should play an active role in verifying that all implementations in the overlay follow the rules set by the configuration file. For example, the signer can send incorrectly signed configuration files to random peers, and add them to the blacklist in a future configuration file if they do not reject them.
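A minimal sketch of what that audit loop could look like on the signer's side; all names here are hypothetical, as the draft only describes the idea:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

class ConfigurationAuditor {
    private final Random random = new Random();
    private final Set<String> toBlacklist = new HashSet<>();

    // send a deliberately mis-signed configuration file to one random peer;
    // a compliant peer must reject it, a non-compliant one gets flagged for
    // the blacklist of the next legitimate configuration file
    void audit(List<String> peers, byte[] badlySignedConfig) {
        String victim = peers.get(random.nextInt(peers.size()));
        if (pushConfig(victim, badlySignedConfig)) {
            toBlacklist.add(victim);
        }
    }

    Set<String> pendingBlacklist() {
        return toBlacklist;
    }

    // transport deliberately left out of the sketch; returns true if the
    // peer accepted the configuration file
    boolean pushConfig(String peer, byte[] config) {
        throw new UnsupportedOperationException("not part of this sketch");
    }
}
```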