What the Internet is doing with your packets

When working with Internet transports, whether analyzing an existing one or designing a new one, it is important to keep in mind that the Internet can do a lot of strange things to your packets.

Disclaimer: I did a similar presentation at my former employer but I did not keep anything, notes or slides, when I left so this blog entry is a re-creation from memory.

For this discussion, we will reason about a very simplified Internet that still has all the properties of the real thing. Our first simplification is that there are only 10 endpoints in this network, numbered from 0 to 9 (the real Internet has 6e57 possible addresses if IPv4, IPv6 and NATs are taken into account). Our second simplification is that only 26 different messages, from a to z, can be sent between these endpoints (the real Internet can exchange 10e3612 different messages). With these limitations, we can now use a simple notation based on regular expressions (regex) to express sending a message from endpoint 0 to endpoint 9:

messages -->  regex

Here “messages” is a list of messages that are sent by endpoint 0 one after the other, and “regex” is a description of what endpoint 9 receives from endpoint 0.

So the question we are trying to answer here is “what regular expression can fully describe what endpoint 9 can receive if endpoint 0 sends message a, then message b to it?”

A first description could look like this:

ab --> ab

Meaning that if endpoint 0 sends message a then message b, then endpoint 9 will receive first message a and then message b.

Obviously that is not true: not only is there no guarantee that a message sent will be received, but dropping messages is one of the fundamental mechanisms used by the Internet to ask the sender to reduce its sending rate. So a second description can take this into account:

ab --> a?b?

But there is also no guarantee that when sending two messages in succession they will arrive in the same order. The reason for this is that two messages can take different paths through routers, and so the first message can be delayed enough to arrive after the second one. Let’s try a better description:

ab --> (a|b|ab|ba)?

But if the Internet can drop a message, the Internet can also duplicate it. This is a rare condition that can happen for different technical reasons, but the fact is that one should be ready to receive multiple identical messages:

ab --> (a|b)*

Now that seems to cover everything: Messages can be delayed, dropped or duplicated.

The reality is that the complete answer to our question is this:

ab --> (a|b|.)*

What it means is that, in addition to messing with the messages sent by endpoint 0, the Internet can make endpoint 9 receive messages from endpoint 0 that endpoint 0 never sent.

This is one of the most often forgotten rules when analyzing or designing a transport: it is really easy to fake a message as originating from another endpoint.
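This progression of descriptions can be checked mechanically. The sketch below (the class name is mine) uses Java's java.util.regex to show that the final description accepts receptions that the earlier, too-restrictive ones reject:

```java
import java.util.regex.Pattern;

public class PacketRegexDemo {
    public static void main(String[] args) {
        // The complete description: messages can be delayed, dropped,
        // duplicated, or never have been sent at all.
        Pattern internet = Pattern.compile("(a|b|.)*");

        // All of these receptions are possible when endpoint 0 sends "ab":
        String[] receptions = { "ab", "ba", "a", "", "aabb", "zz" };
        for (String r : receptions) {
            System.out.println("\"" + r + "\" possible: "
                    + internet.matcher(r).matches());   // true for every one
        }

        // The intermediate description is too restrictive: it cannot
        // describe receiving "zz", messages that were never sent.
        System.out.println("(a|b)* allows \"zz\": "
                + Pattern.compile("(a|b)*").matcher("zz").matches()); // false
    }
}
```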

Nothing in the Internet Protocol can be used to detect any of these conditions, so it is up to the transport built on top of it to take care of them. Acknowledgments, timeouts and retransmissions can take care of lost messages; sequence numbers and buffering can take care of delays and duplications; and some kind of cryptographic signature can detect fake messages.
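To illustrate the last two mechanisms, here is a toy receiver (all names are mine; this is a sketch, not code from any real transport): a sequence number discards duplicates, and an HMAC over a shared key, standing in for the cryptographic signature, rejects fake messages:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class ToyReceiver {
    private final Mac mac;
    private final Set<Long> delivered = new HashSet<>();

    public ToyReceiver(byte[] sharedKey) throws Exception {
        mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(sharedKey, "HmacSHA256"));
    }

    // Returns the payload if the message is authentic and new, null otherwise.
    public String accept(long seq, String payload, byte[] tag) {
        byte[] expected = mac.doFinal((seq + ":" + payload)
                .getBytes(StandardCharsets.UTF_8));
        if (!MessageDigest.isEqual(expected, tag)) {
            return null;            // fake: the signature does not verify
        }
        if (!delivered.add(seq)) {
            return null;            // duplicate: already delivered once
        }
        return payload;
    }

    public static void main(String[] args) throws Exception {
        byte[] key = "shared-secret".getBytes(StandardCharsets.UTF_8);
        ToyReceiver receiver = new ToyReceiver(key);

        // The sender computes the same HMAC over "seq:payload".
        Mac sender = Mac.getInstance("HmacSHA256");
        sender.init(new SecretKeySpec(key, "HmacSHA256"));
        byte[] tag = sender.doFinal("0:a".getBytes(StandardCharsets.UTF_8));

        System.out.println(receiver.accept(0, "a", tag));          // a
        System.out.println(receiver.accept(0, "a", tag));          // null (duplicate)
        System.out.println(receiver.accept(1, "b", new byte[32])); // null (fake)
    }
}
```

Delays would additionally need a reordering buffer keyed on the same sequence numbers; it is omitted here to keep the sketch short.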

RELOAD: Access control policy script tester

There was some interest at the last IETF meeting in my presentation of the Access Control Policy script, so I spent some time improving this. The first step was to release a new version of the I-D. To encourage adoption of this draft, I also released a library that permits developing and testing new access control scripts without having a complete implementation of RELOAD. It was released as a Debian package named libreload-java in my Debian/Ubuntu repository. This package is also a reference implementation for the I-D but, as it contains a subset of the RELOAD library I am currently developing, it can also be considered an early release of this library. As always, it is released under the Affero GPL 3 license, and the source code is available.

To explain how to develop a new access control script, I will now go through the example which is also in the libreload-java package:

import static org.implementers.net.reload.AccessControlPolicyTester.kind;
import static org.implementers.net.reload.DataModel.Standard.*;

These two lines simplify the code by importing the kind() static method and the available Data Models.

private static final long KIND_ID = 4026531841L;

Here we declare a constant defining a new Kind-Id in the private range.

AccessControlPolicyTester tester = new AccessControlPolicyTester("myusername",
    kind(KIND_ID, SINGLE,
        "String.prototype['bytes'] = function() {\n" +
        "  var bytes = [];\n" +
        "  for (var i = 0; i < this.length; i++) {\n" +
        "    bytes[i] = this.charCodeAt(i);\n" +
        "  }\n" +
        "  return bytes;\n" +
        "};\n" +
        "return resource.equalsHash(signature.user_name.bytes());\n"));

This code creates a new instance of the tester by passing a user name (that will be used to generate an X509 certificate) and a new kind. The new kind uses the Kind-ID declared previously, uses the SINGLE Data Model and will use an Access Control Policy as described in the ECMAScript code. Note that more calls to the kind() method can be added to the constructor of the tester.

Map<Long, DataModel<?>> kinds = new HashMap<Long, DataModel<?>>();

This code declares a variable that will contain the data items meant to be stored. The key of the map is a long containing a Kind-Id. The value is an instance of the DataModel.Single, DataModel.Array or DataModel.Dictionary class, if the Data Model is respectively SINGLE, ARRAY or DICTIONARY. In our case we use the SINGLE Data Model, meaning that only one item can be stored at a specific Resource-ID for this specific Kind-ID.

kinds.put(KIND_ID, new DataModel.Single(KIND_ID, 0L,
    new Value.NotExists(System.currentTimeMillis(), 60, tester.signer(), tester.privateKey())));

This piece of code does multiple things, so let’s decompose:

First it retrieves the signer (an Identity instance that consists of an X509 certificate, a user name and a Node-ID) and the private key that was used for the certificate of the signer. These two objects are automatically created by the constructor of the tester object.

Then it creates a new Value instance with a store_time equal to the current time, a lifetime of one minute, an exists value of FALSE and the identity and private key just retrieved. This is the object that we want to store (it is more or less equivalent to StoredData).

Then a new instance of DataModel.Single is created with our new Kind-ID, a generation_counter of 0 and the value to store (it is more or less equivalent to StoreKindData).

Finally this object is stored in the kinds map, with the same Kind-ID as key.

tester.store(tester.signer().userName().getBytes(UTF_8), 0, kinds);

In this code, we finally try to store the items that we just created. The first parameter is the Resource-Name, which in our case is the user name (converted to a byte array). The second parameter is the replica number, 0 for the original replica. The third parameter is the set of items. The method returns false if any item fails the access control policy verification, and true if all the items pass it.

DataModel.Array and DataModel.Dictionary work in the same way, except that DataModel.Array is a sparse array implementing java.util.List and DataModel.Dictionary implements java.util.Map.

The program can be compiled and executed with the following commands:

$ javac -classpath /usr/lib/reload/reload.jar Main.java
$ java -classpath /usr/lib/reload/reload.jar:. Main

In addition to the jar file and the example, the package contains the Javadoc for all the classes and methods accessible to the tester, and a copy of the I-D.

Update 2011/05/14:

The command lines above do not work. I uploaded a new package (0.1.1) so the following command lines should work now:

$ javac -classpath /usr/lib/reload/reload.jar AcpTest.java
$ java -classpath /usr/lib/reload/reload.jar:. AcpTest

Application developers and DNS TTL

During the technical plenary of the 73rd IETF meeting in Minneapolis, MN, Dave Thaler made the interesting point that most DNS resolver APIs do not return the TTL of the resolved resource, e.g. an IP address. At the time he was proposing a modification of the existing APIs, and that has kept me thinking ever since.

The problem is that programmers generally think that resolving a domain name and then storing the resulting IP address(es) is a good idea, as this technique can yield better responsiveness for an application. But doing that without a mechanism to know when the result of the resolution becomes invalid creates an even bigger problem. During my time at 8×8, I worked on the problem of designing a scalable Internet service, and my design focused on having the scalability driven by the client (in other words: the whole load-balancer idea is evil). Such techniques are very cheap and effective, but only if the client strictly obeys the TTL value received in the DNS response (I still have two pending patent applications about these techniques). Another well-known problem is documented in RFC 4472 section 8.2, as keeping an IP address for too long prevents renumbering an IPv6 network; but there are plenty of other cases.

So the idea of passing the TTL together with the result of the DNS query seems like a good one, until you realize that what developers now have to do is implement a DNS cache in their application, and all evidence shows that this is not a simple task. As can be seen from the number of security vulnerabilities found over the years, even people who do read the RFCs seem to have a hard time doing it right. The Internet could probably do without another wave of incorrectly implemented DNS caches.

So in my opinion, adding the TTL to the API is not the solution; it would just trade one problem for another. The correct solution is to do the resolution each time the resource is needed and not store the result at all. If performance is impacted too much (after scientifically measuring it, we are between professionals here), then using an external DNS cache will fix the problem. The DNS cache can be in your network (for example, two DNS caches per data center), on each server (dnscache from the djbdns suite is easy to install and configure and has a good security track record), or even directly in your application (for example, dnsjava contains such a cache).
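The resolve-every-time pattern is a one-liner with the standard JDK API (the class and method names below are mine). One caveat: the JVM maintains its own internal resolver cache, controlled by the networkaddress.cache.ttl security property, which must also be configured sensibly for the external cache to be consulted:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class FreshResolver {
    // Resolve the name every time it is needed instead of storing the
    // addresses: the DNS cache sitting below the stub resolver honors the
    // TTL, so the application never holds a stale address. Note that the
    // JVM's own cache (networkaddress.cache.ttl) must be kept short too.
    public static InetAddress[] resolve(String host) throws UnknownHostException {
        return InetAddress.getAllByName(host);
    }

    public static void main(String[] args) throws UnknownHostException {
        // Look the name up again for each connection attempt.
        for (int attempt = 0; attempt < 2; attempt++) {
            for (InetAddress address : resolve("localhost")) {
                System.out.println(address.getHostAddress());
            }
        }
    }
}
```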

The end to end argument

Scott Brim posted today on the IETF mailing-list one of the best defenses of the end-to-end argument:

> On page 9, you state, based on a citation from RFC 4924:
> “We believe that providing end-to-end transparency […]
> is key to the success of the Internet.” I think this
> statement needs elaboration. End-to-end transparency
> is not a reason for the success of the Internet.

I invoke Feynman and the “philosophy of ignorance”. The reason you want e2e transparency is because you do not know what it might enable, and we want that. We _want_ to have uncertainty about what the future of the Internet is. We do not know what advantages or restrictions our decisions will bring in the future. The richness of the Internet experience has come about because we have given end users the capability to develop new ways of using it, and somehow managed to have got out of the way, so far.

Feynman said (among other things — search for it):

Our responsibility is to do what we can, learn what we can, improve the solutions, and pass them on. It is our responsibility to leave the people of the future a free hand. In the impetuous youth of humanity, we can make grave errors that will stunt our growth for a long time. This we will do if we say we have the answers now, so young and ignorant as we are. If we suppress all discussion, all criticism, proclaiming “This is the answer, my friends; man is saved!” we will doom humanity for a long time to the chains of authority, confined to the limits of our present imagination. It has been done so many times before.

I am a long-time advocate of the end-to-end argument and I have written and talked about it in the past, so many thanks, Scott, for this.