What the Internet is doing with your packets

When working with Internet transports, whether analyzing an existing one or designing a new one, it is important to keep in mind that the Internet can do a lot of strange things to your packets.

Disclaimer: I did a similar presentation at my former employer, but I did not keep anything, notes or slides, when I left, so this blog entry is a re-creation from memory.

For this discussion, we will reason about a very simplified Internet that still has all the properties of the real thing. Our first simplification will be that there are only 10 endpoints in this network, numbered from 0 to 9 (the real Internet has 6e57 possible addresses if IPv4, IPv6 and NATs are taken into account). Our second simplification will be that we can send only 26 different messages between these endpoints, from a to z (the real Internet can exchange 10e3612 different messages). With these limitations, we can now use a simple notation based on regular expressions (regex) to express sending a message from endpoint 0 to endpoint 9:

messages -->  regex

Here “messages” is a list of messages that are sent by endpoint 0 one after the other, and “regex” is a description of what endpoint 9 receives from endpoint 0.

So the question we are trying to answer here is “what regular expression can fully describe what endpoint 9 can receive if endpoint 0 sends message a, then message b to it?”

A first description could look like this:

ab --> ab

Meaning that if endpoint 0 sends message a then message b, then endpoint 9 will receive first message a and then message b.

Obviously that is not true: not only is there no guarantee that a message sent will be received, but dropping messages is one of the fundamental mechanisms used by the Internet to ask the sender to reduce its sending rate. So a second description can take this into account:

ab --> a?b?

But there is also no guarantee that when sending two messages in succession they will arrive in the same order. The reason for this is that two messages can take different paths through routers, and so the first message can be delayed enough to arrive after the second one. Let’s try a better description:

ab --> (a|b|ab|ba)?

But if the Internet can drop a message, the Internet can also duplicate it. This is a rare condition that can happen for different technical reasons, but the fact is that one should be ready to receive multiple identical messages:

ab --> (a|b)*

Now that seems to cover everything: Messages can be delayed, dropped or duplicated.

The reality is that the complete answer to our question is this:

ab --> (a|b|.)*

What it means is that, in addition to messing with the messages sent by endpoint 0, the Internet can make endpoint 9 receive messages from endpoint 0 that endpoint 0 never sent.

This is one of the most often forgotten rules when analyzing or designing a transport: it is really easy to fake a message as originating from another endpoint.

Nothing in the Internet Protocol can be used to detect any of these conditions, so it is up to the transport built on top of it to take care of them. Acknowledgments, timeouts and retransmissions can take care of lost messages; sequence numbers and buffering can take care of delays and duplications; and some kind of cryptographic signature can help detect fake messages.
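As a rough illustration of the receiving side, here is a minimal Java sketch (the names and structure are mine, not any real transport): each message carries a sequence number and a keyed signature, duplicates and forged messages are dropped, and late arrivals are buffered until they can be delivered in order. Acknowledgments and retransmissions, which handle losses, are only hinted at in a comment.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical packet: a sequence number, a payload and an HMAC computed
// by the sender with a key shared by both endpoints.
class Packet {
    final long seq;
    final String payload;
    final byte[] mac;
    Packet(long seq, String payload, byte[] mac) {
        this.seq = seq; this.payload = payload; this.mac = mac;
    }
}

class Receiver {
    private final SecretKeySpec key;
    private long nextExpected = 0;                                    // next in-order sequence number
    private final SortedMap<Long, String> pending = new TreeMap<>();  // out-of-order buffer

    Receiver(byte[] sharedKey) {
        this.key = new SecretKeySpec(sharedKey, "HmacSHA256");
    }

    // Returns the payloads that can be delivered, in order, after this packet.
    List<String> receive(Packet p) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(key);
        byte[] expected = mac.doFinal((p.seq + ":" + p.payload).getBytes(StandardCharsets.UTF_8));
        if (!MessageDigest.isEqual(expected, p.mac)) return List.of();             // forged: drop
        if (p.seq < nextExpected || pending.containsKey(p.seq)) return List.of();   // duplicate: drop
        pending.put(p.seq, p.payload);                                 // buffer, possibly out of order
        List<String> deliverable = new ArrayList<>();
        while (pending.containsKey(nextExpected)) {                    // deliver any contiguous run
            deliverable.add(pending.remove(nextExpected));
            nextExpected++;
        }
        // A real transport would also send an acknowledgment here, so that the
        // sender can retransmit anything that was lost.
        return deliverable;
    }
}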

Improving standard compliance with transclusion

Inserting fragments of a standard specification – IETF RFC or other – as comments in the source code that implements it seems to be a simple way to ensure good conformance. Unfortunately doing so can create legal issues if not done carefully.

These days I am spending a lot of time implementing IETF (and other SDO) protocols for the Nephelion Project. I am no stranger to network protocol implementation, as this is mostly what I have been doing for more than 25 years, but this time the very specific code that is needed for this project is required to be as close as possible to the standard. So I am constantly referring to the text inside the various RFCs to verify that my code is conformant. Obviously copying the text fragments as comments would greatly simplify the development and, in the end, make the translation between the English text that is used to describe the protocol and the programming language I use to implement it a lot more faithful.

At this point I should insert the usual IANAL but, at least to my understanding, that is something that is simply not possible. My intent is to someday release this code under a Free Software license, but even if that were not the case, I believe that all software should be built with the goal of licensing it in the future, be that license a commercial one or a FOSS one. The issue here is that the RFCs are copyrighted and that modifying them is simply not permitted by the IETF Trust – rightly so, in my opinion, as a standard that anybody can freely modify is not much of a standard. But publishing my code under a FOSS license would give everyone the right to modify it (under the terms of the license), and that would also apply to the RFC fragments inserted in the source code.

So the solution I use to keep the specification and the implementation as close as possible, while not having to worry about code licensing, is transclusion. Here's an example of a comment in the source code for the UDP module:

% @transclude file:///home/petithug/rsync/ietf/rfc/rfc768.txt#line=48,51

The syntax follows the Javadoc (and Pldoc, and Scaladoc) conventions. The @transclude tag indicates that the text referenced by the URL must be inserted in the source code, but only when it is displayed in a text editor. Here's what the same code looks like when loaded in VIM (the fragment from RFC 768 is reproduced here under fair use):

% @transclude file:///home/petithug/rsync/ietf/rfc/rfc768.txt#line=48,51
% {@transcluded
% Source Port is an optional field, when meaningful, it indicates the port
% of the sending process, and may be assumed to be the port to which a
% reply should be addressed in the absence of any other information. If
% not used, a value of zero is inserted.
% @@@}

(I chose this example because, until a few days ago, I did not even know that using a UDP source port of 0 was conformant).

The @transcluded inline tag is dynamically generated by a VIM plugin, but this tag will never appear anywhere other than in the VIM buffer, even after saving the file to disk. The fragment syntax is from RFC 5147, and permits selecting the lines that must be copied (an RFC never changes, so hardcoding the line numbers in the source code cannot break in the future).
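For readers who want to adapt the idea outside VIM, here is a rough Java sketch of the expansion step (purely illustrative: the real tool is a VIM plugin that uses curl, while this sketch only understands local files and the #line=<from>,<to> fragment):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical re-implementation of the @transclude expansion: parse the tag,
// read the referenced file and splice the selected lines below the tag,
// wrapped in the {@transcluded ... @@@} markers shown above.
class Transclude {
    private static final Pattern TAG =
        Pattern.compile("@transclude\\s+file://(\\S+?)#line=(\\d+),(\\d+)");

    static List<String> expand(String commentLine) throws Exception {
        Matcher m = TAG.matcher(commentLine);
        if (!m.find()) return List.of(commentLine);
        List<String> file = Files.readAllLines(Path.of(m.group(1)));
        int from = Integer.parseInt(m.group(2));   // RFC 5147 counts positions between lines,
        int to = Integer.parseInt(m.group(3));     // starting at zero
        List<String> out = new ArrayList<>();
        out.add(commentLine);
        out.add("% {@transcluded");
        for (String line : file.subList(from, Math.min(to, file.size()))) {
            out.add("% " + line);
        }
        out.add("% @@@}");
        return out;
    }
}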

The plugin can be installed from my Debian repository with the usual “apt-get install vim-transclusion”. The plugin is kind of rough for now: only the #line=<from>,<to> syntax is supported, hardcoding the full path is not very friendly, curl is required, etc… But that is still a huge improvement over having to keep specification and implementation separate.

An Undecidable Problem in SIP

A few years back, one of my colleagues at 8×8 made an interesting suggestion. We were discussing at the time a recurring problem of SIP call loops in the Packet8 service, and his suggestion was to write a program that would analyze all the various forwarding rules installed in the system and simply remove those that were the cause of the loops. I wish I remembered what I responded at the time, and that I had the insight to say that writing such a program is, well, impossible, but that's probably not what happened.

Now "impossible" is a very strong word, and I must admit that I have spent most of my career in computers thinking that nothing was impossible to code and that, worst case, I just needed a better computer. It just happens that there is a whole class of problems that are impossible to code – not just difficult to code, nor such that the best possible code would take forever to return an answer without using a quantum computer, but such that it is impossible to write a program that always returns correct answers for them. The SIP call loop problem is one of them.

To make sense of this we will need to define a lot of concepts, so let's start with a SIP call. SIP is, for better or for worse, the major session establishment protocol for VoIP. A SIP client (e.g. a phone or a PSTN gateway) establishes a call more or less the way a Web browser contacts a web site, with the difference that in the case of SIP the relationship with the server lasts for the duration of the call. One of the (deeply broken) features of SIP is that there may exist intermediate network elements called SIP Proxies that can, during the establishment of a call, redirect the call to a new destination. In this case the server, instead of answering the call itself, creates a new connection to a different destination, which can itself create a new connection, and so on. This mechanism obviously can create loops, especially when the forwarding rules in these servers are under the control of the end-user – which was the problem we encountered at 8×8. There are many mechanisms a SIP proxy can use to decide if, when, and where to redirect a call, but for the sake of simplicity we will consider only a subset of all these possibilities, and assume that all the SIP proxies involved use the Call Processing Language (CPL), a standard XML-based language designed to permit end-users to control, among other things, how a SIP proxy forwards calls on their behalf. Jive's users can think of a CPL script as equivalent to the dialplan editor, but for just one user and with the additional constraint that it is not possible to create loops inside a CPL script.

Here is an example of a CPL script taken from the standard (RFC 3880) that forwards incoming calls to the user's voicemail if she does not answer the call on her computer:

<cpl xmlns="urn:ietf:params:xml:ns:cpl" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:ietf:params:xml:ns:cpl cpl.xsd">
  <incoming>
    <location url="sip:jones@jonespc.example.com">
      <proxy>
        <redirection>
          <redirect/>
        </redirection>
        <default>
          <location url="sip:jones@voicemail.example.com">
            <proxy/>
          </location>
        </default>
      </proxy>
    </location>
  </incoming>
</cpl>

One interesting feature of CPL is that it is extensible, so even if only a subset of all the capabilities of a standard compliant SIP proxy can be implemented using baseline CPL, it is possible to add extensions to be able to use any legal feature of the SIP standard. As an example, the following CPL script (also taken from RFC 3880) rejects calls from specific callers by using such an extension:

<cpl xmlns="urn:ietf:params:xml:ns:cpl" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:ietf:params:xml:ns:cpl cpl.xsd">
  <incoming>
    <address-switch field="origin" subfield="user" xmlns:re="http://www.example.com/regex">
      <address re:regex="(.*.smith|.*.jones)">
        <reject status="reject" reason="I don't want to talk to Smiths or Joneses"/>
      </address>
    </address-switch>
  </incoming>
</cpl>

For the rest of this discussion, we will consider that we can only use CPL extensions that are legal in a standard SIP proxy.

Now that we have established the scope of our problem, we can state the question formally: if we consider a VoIP system made only of endpoints (phones) and SIP proxies that run CPL scripts, and if we have complete knowledge of all the (legal) CPL extensions in use, is it possible to write a program that takes as input all these scripts and an initial call destination (known as a SIP URI) and that can tell us whether the resulting call will loop or not?

That seems simple enough: try to build a directed graph of all the forwarding rules, and if the graph is not acyclic, then the call is looping.

Let’s say that my former colleague took up the challenge and wrote this program. Using Java, he would have written something like this:

boolean isLooping(Map<URI, Document> configuration, URI initial) {
  // some clever code here...
  }

The "configuration" parameter carries the whole configuration of our SIP system as a set of mappings between an Address-Of-Record (AOR, the identifier for a user) and the CPL script that is run for this user. The "initial" parameter represents the initial destination for a call (which may contain the AOR of a user). The Java function returns true if a call to this user will loop, or false if the call is guaranteed to reach someone or something, other than the originator of the call, without looping.

Now let’s change the subject for a little bit and talk about something called the Cyclic Tag System (CTS). CTS is a mechanism to create new bit strings from an initial bit string and an ordered list of bit strings called productions, using the following rules:

1. Remove the leftmost bit of the current bit string.
2. If the removed bit was equal to 1, then concatenate the current production to the current bit string.
3. Advance to the next production in the list (restarting at the first production after the last one) and continue at rule (1), unless the current bit string is empty.

The following example from Wikipedia uses an initial bit string of 11001 and a list of productions of { 010, 000, 1111 }. Following the rules above, we generate the following bit strings:

11001
1001010
001010000
01010000
1010000
010000000
10000000
...

Simple enough.
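To make it concrete, here is a small Java rendition of these rules (mine, for illustration); running it prints exactly the trace above:

import java.util.List;

// Cyclic Tag System: the production pointer advances at every step,
// whether or not the removed bit caused an append.
class CyclicTagSystem {
    public static void main(String[] args) {
        String word = "11001";
        List<String> productions = List.of("010", "000", "1111");
        int step = 0;
        while (!word.isEmpty() && step < 7) {   // stop after a few steps for the demo
            System.out.println(word);
            char removed = word.charAt(0);
            word = word.substring(1);
            if (removed == '1') {
                word = word + productions.get(step % productions.size());
            }
            step++;
        }
    }
}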

Now, let's say that we want to implement CTS in SIP. The CPL language is not powerful enough for that, but we can define a very simple extension (still conforming to the SIP standard) that permits copying a character substring into a parameter for the next destination of a call, e.g.:

<location m:url="sip:c.org;p={destination.string:2:10}" />

Here we extract the "string" parameter from the original destination and copy ten characters, starting at the second position, into the new destination. With this new extension, we can now write three CPL scripts that implement the CTS example above:

Script attached to user `sip:p1@jive.com`:

<address-switch field="destination">
  <address contains=";bitstring=1">
    <location m:url="sip:p2@jive.com;bitstring={destination.bitstring:1}010" />
  </address>

  <otherwise>
    <address-switch field="destination">
      <address contains=";bitstring=0">
        <location m:url="sip:p2@jive.com;bitstring={destination.bitstring:1}" />
      </address>

      <otherwise>
        <reject />
      </otherwise>
    </address-switch>
  </otherwise>
</address-switch>

Script attached to user `sip:p2@jive.com`:


<address-switch field="destination">
  <address contains=";bitstring=1">
    <location m:url="sip:p3@jive.com;bitstring={destination.bitstring:1}000" />
  </address>

  <otherwise>
    <address-switch field="destination">
      <address contains=";bitstring=0">
        <location m:url="sip:p3@jive.com;bitstring={destination.bitstring:1}" />
      </address>

      <otherwise>
        <reject />
      </otherwise>
    </address-switch>
  </otherwise>
</address-switch>

Script attached to user `sip:p3@jive.com`:

<address-switch field="destination">
  <address contains=";bitstring=1">
    <location m:url="sip:p1@jive.com;bitstring={destination.bitstring:1}1111" />
  </address>

  <otherwise>
    <address-switch field="destination">
      <address contains=";bitstring=0">
        <location m:url="sip:p1@jive.com;bitstring={destination.bitstring:1}" />
      </address>

      <otherwise>
        <reject />
      </otherwise>
    </address-switch>
  </otherwise>
</address-switch>

In our Java function isLooping(), these three scripts would go in the first parameter (indexed by the AOR of each user) and the initial parameter would contain "`sip:p1@jive.com;bitstring=11001`".

It is trivial to write a program that can generate the CPL scripts needed to implement any list of productions. There is no need to even write a program; a simple XSLT stylesheet can do the job.
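Just to make that claim concrete, here is a sketch of such a generator written in plain Java instead of XSLT (the element and attribute names follow the hand-written fragments above, including the hypothetical substring extension):

import java.util.List;

// Generates, for each production, the CPL script of one user (p1 ... pN),
// each script forwarding to the next user in the ring as in the example above.
class CplGenerator {
    static String script(int index, int count, String production, String domain) {
        String next = "sip:p" + (index % count + 1) + "@" + domain;
        return ""
            + "<address-switch field=\"destination\">\n"
            + "  <address contains=\";bitstring=1\">\n"
            + "    <location m:url=\"" + next + ";bitstring={destination.bitstring:1}" + production + "\" />\n"
            + "  </address>\n"
            + "  <otherwise>\n"
            + "    <address-switch field=\"destination\">\n"
            + "      <address contains=\";bitstring=0\">\n"
            + "        <location m:url=\"" + next + ";bitstring={destination.bitstring:1}\" />\n"
            + "      </address>\n"
            + "      <otherwise>\n"
            + "        <reject />\n"
            + "      </otherwise>\n"
            + "    </address-switch>\n"
            + "  </otherwise>\n"
            + "</address-switch>\n";
    }

    public static void main(String[] args) {
        List<String> productions = List.of("010", "000", "1111");
        for (int i = 1; i <= productions.size(); i++) {
            System.out.println("Script attached to user sip:p" + i + "@jive.com:");
            System.out.println(script(i, productions.size(), productions.get(i - 1), "jive.com"));
        }
    }
}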

The reason we implemented CTS in SIP is that in 2000, Matthew Cook published a proof that CTS is Turing-Complete. Turing-Complete is a fancy expression meaning computationally equivalent to a computer, which basically means that any computation that can be done on a computer can also be done by the CTS system – obviously multiple orders of magnitude slower than on a computer, but both a computer (or a quantum computer, or a Turing machine, or lambda calculus, or any other Turing-Complete system) and the CTS system can compute exactly the same things. Because only a Turing-Complete system can simulate another Turing-Complete system, by showing that we can implement any CTS production list in SIP we just proved that SIP is equivalent to a computer. In turn that means that it is possible to convert any existing program into a list of CPL scripts (with our small extension) and an initial SIP URI. So knowing this, and knowing that Java bytecode is also Turing-Complete (it is trivial to implement CTS in Java), we can now write this new function:


Map<URI, Document> convert(Runnable runnable) {
  // complex, but definitively implementable
  }

The function takes a Java class (i.e. a list of bytecodes) as a parameter and returns a list of { SIP URI, CPL script } mappings as a result. The class to convert implements Runnable so we have a unique entry point into it (i.e. its run() method), an entry point that by convention becomes the SIP URI of the first mapping returned by the function.

Now that we have these two basic functions, let's write a complete Java class that uses them:

class Paradox implements Runnable {
  boolean isLooping(Map<URI, Document> configuration, URI initial) {
    // some clever code here
    }

  Map<URI, Document> convert(Runnable runnable) {
    // complex, but definitively implementable
    }

  public void run() {
    Map<URI, Document> program = convert(this);
    if (!isLooping(program, program.keySet().iterator().next())) {
      while (true);
      }
    }
  }

What we are doing here is converting the whole class into a SIP proxy configuration, and passing it to the code that can predict whether this configuration will loop or not. Then we do something a bit tricky with the result. Because the configuration is a faithful translation of this very program, the prediction also applies to the program itself: if isLooping() returns false (no loop), the program enters the "while (true);" statement and loops forever, contradicting the answer; if isLooping() returns true (it loops), the program exits immediately, which again contradicts the answer.

Both cases are impossible, which can have only one explanation: the isLooping() implementation is unable to determine whether this program loops or not. Thus, we have shown that it is possible to write a Java program whose looping behavior isLooping() cannot correctly determine. And because we previously proved that SIP with CPL (plus our extension) is Turing-Complete, we know that if it is possible to write such a program in Java, it is theoretically possible to build the equivalent SIP configuration. Because of this, we now know that it is impossible to write a program that can reliably predict whether a SIP configuration loops or not. More precisely, for any implementation of isLooping(), it is always possible to find an instance of the "configuration" and "initial" parameters for which this implementation of isLooping() will return an incorrect response.

Now that we have the answer to our question, let's have a better look at what really happens in a VoIP system. A SIP call never really loops forever (well, unless the PSTN is involved, but that's a different story), because there is a mechanism to prevent that. Each call contains a counter (Max-Forwards) that is decremented each time it traverses a SIP proxy, and the call ends when this counter reaches zero. Max-Forwards is a little like a CPU quota: it does not improve the quality of a program; it just prevents it from making things worse. Putting aside the fact that we are in the business of establishing communications between people, not finding fancy ways to prevent them, the Turing-Completeness of SIP still gets in the way: for the same reasons as before, it is also impossible to write a program that will reliably predict whether a call will fail because the Max-Forwards value reaches zero.

It is a sad state of affairs that the only way to reliably predict a possible failure is to let the failure happen, but the point of this article is that this is not really the fault of the programmer; it is just a consequence of the limitations of computation in this universe.

Note that at least some SIP systems are not necessarily Turing-Complete – here we had to add an extension to our very limited CPL-based system to make it Turing-Complete. But it is far easier to prove that something is Turing-Complete than to prove that it is not, so even if I could not find a way to prove that a pure CPL system is Turing-Complete, that does not prove it is not. Worse, we saw that there is not much difference between a Turing-Complete system and one that is not, so even a small and seemingly irrelevant modification to a system can make it Turing-Complete. So basically writing a program that tries to reach definitive conclusions about the behavior of a moderately complex system – like SIP – is an exercise in futility.

Many thanks to Matt Ryan for his review of this article.

Keeping work and personal computers separated

I try as much as possible to keep my personal stuff separated from work stuff. Even if both California and Utah laws are clear that what I develop on my own time and my own hardware is mine (as long as it is not related to my day job – that's the big difference with French law, where I own what I develop on my own time even if it is related to my employer's business or done on its computers), that did not prevent one of my former employers from trying to claim ownership of what was rightfully mine. Because it is very expensive to get justice in the USA, keeping things as separated as possible from the beginning seems like a good idea.

The best way to do that is simply to not work during one’s free time on anything that could have a potential business value – these days, I spend a lot of time learning about cryptography, control system engineering and concurrent systems validation. But keeping things separated still creates some issues, like having to carry two laptops when traveling. I did this twice for IETF meetings, and it is really no fun.

The solution I finally found was to run my personal laptop as an encrypted hard drive in a virtual machine on the company laptop. My employer provided me with a MacBook, which has nice hardware but whose OS is not very good: I had to put a reminder in my calendar to reboot it each week if I did not want to see it regularly crash or freeze. Mac OS X is a lot like Windows, except that you are not ashamed to show it to your friends. Anyway, here's how to run your own personal computer on your employer's laptop:

First you need a portable hard drive, preferably one that does not require a power supply. I use the My Passport Ultra 500GB with the AmazonBasics Hard Carrying Case. The next step is to install and configure VirtualBox on the laptop. You will need to install the Oracle VM VirtualBox Extension Pack if, like me, you need to use in your personal computer a USB device that is connected to the employer's laptop (in my case, a smart-card dongle that contains the TLS private key used to connect to my servers). The next step is to change the owner of the hard drive (unfortunately you will have to do that each time you plug in the hard drive):

sudo chown <user> /dev/disk2

After this you can create a raw vmdk file that references the external hard drive:

cd "VirtualBox VMs"
VBoxManage internalcommands createrawvmdk -filename ExternalDisk.vmdk -rawdisk /dev/disk2

After this, you just have to create a VM in VirtualBox that uses this vmdk file. I installed Debian sid with encryption, which took the better part of a day as the whole disk has to be encrypted sector by sector. I also installed gogo6 so I could have an IPv6 connection in places that still live in the dark ages. Debian contains the correct packages (apt-get install virtualbox-guest-utils) so the X server in the personal computer will adapt its display size automatically to the size of the laptop screen.

To restore the data from my desktop, I configured VirtualBox on it too, so I could also run the personal computer on it. Then, thanks to the same Debian packages, I was able to mount my backups as a shared folder and restore all my data in far less time than an scp command would take.

And after all of this I had a secure and convenient way to handle my personal emails without having to carry two laptops.

On the design of the STUN and TURN URI formats

The first goal of this post is to write down my reasoning for the formats I am promoting for the future STUN and TURN URIs, mostly because I keep forgetting it and have to reconstruct it from scratch each time I have this discussion with other people (and, sadly, also with myself). But this post can also be of interest if you are confused about what TURN and STUN are, and how they can be used.

Let's start with STUN (RFC 5389): it is important to immediately separate the STUN protocol from the STUN usages. The STUN protocol covers how bits are organized on the wire and how STUN packets are sent, received and retransmitted – details that are not terribly important for this discussion, except for how they contribute to the confusion. The really interesting part is the list of STUN usages, which is the list of different things that can be done with STUN. At the time this post is written there are 4 different STUN usages, which always involve a STUN client and a STUN server:

  • NAT Discovery, specified in RFC 5389, which is used to find under which IP address and port a STUN client is visible to a STUN server. If the STUN client is inside a NAT and the STUN server is on the Internet, then the NAT Discovery Usage permits finding the IP address of the NAT.
  • NAT Behavior Discovery, specified in RFC 5780, which is used to find what type of NAT separates a STUN client from a STUN server. It is a bad idea to use this information for anything other than collecting debugging data, which is why this RFC is experimental and why we will not discuss it further.
  • Connectivity Check, specified in RFC 5245 (aka ICE), which is used to find out whether a STUN server can be reached by a STUN client.
  • Keep-alive, specified in RFC 5626, which is used to a) detect whether a STUN server can still be reached by a STUN client, b) detect whether the NAT/Firewall IP address or port changed and c) keep the NAT/Firewall open.

STUN is defined to be used over UDP, TCP or TLS. STUN cannot yet be used over DTLS (i.e. TLS over UDP), or over any more recent transport like SCTP or DCCP. One fundamental point to understand for this discussion is that the choice of the transport used by STUN depends only on the application needing it. If, for instance, the NAT Discovery Usage is used for RTP, only STUN over UDP can be of use to this application; STUN over TCP cannot help at all. So the choice of the transport is not left to the user of the application or to the administrators of the STUN server – it is purely a consequence of what the application is trying to achieve.

TURN (RFC 5766) is an application layer tunneling protocol. Although TURN has absolutely nothing to do with any of the usages described above, it shares the same protocol as STUN – same bits on the wire, same way packets are sent, received and retransmitted. This is the first reason for the confusion between STUN and TURN, the second being that, to save a round-trip, the TURN Allocate transaction returns the exact same information that the STUN NAT Discovery Usage returns. In spite of these similarities with STUN, the job of the TURN protocol is completely different: it is to carry application data between the TURN client and the TURN peer, through the TURN server. These application data can be anything, e.g. RTP packets. They can even be STUN packets, in which case the TURN client can also be a STUN client and the TURN peer (not the TURN server) can also be a STUN server.

As with STUN, TURN is defined to be used over UDP, TCP or TLS between the TURN client and the TURN server. But this is the transport used for the tunnel itself, and the transport used inside the tunnel (i.e. for our RTP or STUN packets) can be different. RFC 5766 defines only UDP allocations (an allocation is what the inside transport is called in the specification), but RFC 6062 extends TURN by adding support for TCP allocations, although with the limitation that a TCP allocation cannot be used over a UDP transport (i.e. a UDP tunnel cannot carry TCP inside).

The very important point here is that the application does not care which transport is used for the TURN tunnel – it can be any tunnel transport that can carry the inside transport that the application needs to use with the peer. So if the application needs UDP to send STUN or RTP to the peer, it does not matter whether the tunnel transport is UDP, TCP or TLS.

On the other hand, which tunnel transports are available can matter to the provider of the TURN server. Unlike STUN servers, TURN servers use real resources (ports, bandwidth, CPU), so the administrators of these TURN servers may want to be able to balance the load, fail over to other servers, and so on. One of the other things that an administrator may want to manage is the priority between the different tunnel transports that a TURN client can use, and this is exactly what RFC 5928 provides.

But before going into RFC 5928, let's have a look at the way the DNS interacts with STUN and TURN. TURN servers, and STUN servers for the first two STUN usages listed above (NAT Discovery and NAT Behavior Discovery), are generally deployed on fixed public Internet addresses, so it is useful to use the DNS to associate a name with them (in an A or AAAA record). Because more than one instance of these servers is generally required to run a service, SRV records can be used to distribute the load between servers, to manage fail-over and to assign a port to the servers. What RFC 5928 adds to this is the definition of a NAPTR record to select the transport.

Under RFC 5928, when an application wants to use a TURN server it has to provide two sets of information. The first set contains the list of tunnel transports that the application implements. The second set, which is probably stored in the configuration of the application, contains the name of the domain for the TURN server, an optional port, an optional transport and an optional secure flag. The algorithm in RFC 5928 takes these two sets of information and spits out an ordered list of IP address, port and tunnel transport combinations that the TURN client can try in order to establish the tunnel. As soon as the tunnel is established, the TURN client can request a TCP or a UDP allocation to send and receive packets, depending, as explained above, on the purpose of the application.

Because there is no point in having the STUN server administrators choose the transport, there is no need to define something equivalent to RFC 5928 for STUN.

The TURN URI as currently designed carries all the information that goes into the second set passed to the RFC 5928 algorithm. The URI "turn:example.org" fills the host parameter with "example.org" and leaves the secure flag, the transport and the port undefined. The URI "turns:[2001:DB8::1]:2345;transport=TCP" sets the host to the IPv6 address 2001:DB8::1, the secure flag on, the port to 2345 and the transport to TCP.
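To make the mapping concrete, here is a rough Java sketch of that parsing (a hand-rolled illustration, not the grammar from the specification):

import java.util.Optional;

// Illustrative parser of turn:/turns: URIs into the four RFC 5928 inputs:
// host, optional port, optional transport and the secure flag.
class TurnUri {
    final boolean secure;
    final String host;
    final Optional<Integer> port;
    final Optional<String> transport;

    private TurnUri(boolean secure, String host, Optional<Integer> port, Optional<String> transport) {
        this.secure = secure; this.host = host; this.port = port; this.transport = transport;
    }

    static TurnUri parse(String uri) {
        boolean secure = uri.startsWith("turns:");
        String rest = uri.substring(uri.indexOf(':') + 1);

        Optional<String> transport = Optional.empty();
        int semi = rest.indexOf(';');
        if (semi >= 0) {
            String param = rest.substring(semi + 1);
            if (param.startsWith("transport=")) transport = Optional.of(param.substring(10));
            rest = rest.substring(0, semi);
        }

        String host = rest;
        Optional<Integer> port = Optional.empty();
        int bracket = rest.lastIndexOf(']');   // present only for an IPv6 literal such as [2001:DB8::1]
        int colon = rest.lastIndexOf(':');
        if (colon > bracket) {
            host = rest.substring(0, colon);
            port = Optional.of(Integer.parseInt(rest.substring(colon + 1)));
        }
        return new TurnUri(secure, host, port, transport);
    }
}

// parse("turn:example.org")                        -> host example.org, everything else undefined
// parse("turns:[2001:DB8::1]:2345;transport=TCP")  -> secure, host [2001:DB8::1], port 2345, transport TCP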

Let's now put the TURN URI back in the WebRTC context, which is the reason it is needed in the first place. The TURN URI is passed from the Web server to the browser in the Javascript code. In normal operations, the TURN URI will probably be something like "turns:example.org", meaning that the tunnel transport will be negotiated between the capabilities of the browser and what the administrators of the TURN servers in the example.org domain prefer. But the administrators of the Web server may want, for debugging reasons, to use a specific server and port, e.g. "turn:[2001:DB8::1]:1234". They may also want to force a specific transport, knowing that the other transports have an unfixed bug, by using something like "turn:example.org;transport=UDP". This flexibility is even more useful knowing that, even with the cooperation of the DNS administrators, it will take some time for new DNS records to propagate. So in this context, it makes sense for the TURN URI to have a transport parameter.

On the other hand, a transport parameter on a STUN URI would make no sense, because the transport used by STUN is dictated by the application. If the UDP transport has a bug in the STUN servers, switching to a TCP transport cannot help an application that is trying to send RTP packets.

One of the alternative formats that was proposed for the TURN and STUN URIs was to drop the "s" suffix in the "turns" and "stuns" schemes and to consolidate it inside a ";proto=" parameter. With this alternative format, "turns:[2001:DB8::1]:2345;transport=TCP" becomes "turn:[2001:DB8::1]:2345;proto=TLS". But because, as demonstrated previously, the STUN URI does not need a transport parameter, there is no way to remove the "s" suffix and convert it into a ";proto=" parameter. One way would be to convert "stuns:example.org" to "stun:example.org;secure", but one can ask how this is better than the original STUN URI.

For all these reasons, and because it would look strange for STUN to use the "s" suffix but not TURN, I think that the right format is to allow the "turns" and "stuns" schemes, and to use the ";transport=" parameter only for TURN URIs.

Updated 09/12/2012: Added a bit more text about the interaction between STUN/TURN and the DNS.

A configuration and enrollment service for RELOAD implementers

As people quickly discover, implementing RELOAD is not an easy task – the specification is complex, covers multiple network layers, is extensible and flexible, and the fact that security is mandatory creates even more challenges at debug time. This is why new implementations generally focus on having a minimum set of features working between nodes using the same software.

Making two different RELOAD implementations interoperate requires a lot more work, mostly because connecting to a RELOAD overlay is not as simple as providing an IP address and port to connect to. Because of the extensibility of RELOAD, all the nodes in an overlay must use the same set of parameters, parameters that are collected and distributed in an XML document that needs to be cryptographically signed. In addition to this, all nodes must communicate over (D)TLS links, using both client and server certificates signed by a CA that is local to the overlay. The configuration file and certificates must be distributed to each node, and when two or more implementations want to participate in the same overlay, ad hoc methods to provision these elements are no longer adequate. The standard way to do that is through a configuration and enrollment server, but unfortunately that is probably the part of the RELOAD specification that most implementers would assign the lowest priority, thus creating a higher barrier to interoperability testing than one would expect.

This is why, during the last RELOAD interoperability testing event in Paris, I volunteered to provide configuration and enrollment servers as a service to RELOAD implementers, so they do not have to worry about this part. I already had my own configuration and enrollment servers, but I had to rewrite them from scratch because of two additional requirements: they had to work with any set of parameters, even some that my own implementation of RELOAD does not support yet, and it must be possible to host servers for multiple overlays on the same physical server (virtual server). A first set of servers is now deployed and in use by the participants of the last RELOAD interoperability event, so it is now time to open it to a larger set of participants.

First, what this service is not: it is not for hosting commercial services, and it is not meant to showcase implementations. The service is free for RELOAD implementers (up to 5 overlays per implementer) for the explicit purpose of letting other implementers connect to your RELOAD implementation, which means that you are supposed to provision a username/password for any other implementer on request, on a reciprocity basis. Contact me directly if you are interested in a usage that does not fit this description.

The enrollment for the service is simple: send me an email containing the X.500 name that will be used to provision your servers. Here's an example to provision a fictional overlay named "my-overlay-reload.implementers.org":

C=US, ST=California, L=Saratoga, O=Impedance Mismatch, LLC, OU=R&D,
CN=my-overlay-reload.implementers.org

The C=, ST=, L=, O= and OU= components should describe your organization (not you). The CN= component contains the name requested for your overlay. Note that the "-reload.implementers.org" part is mandatory, but you can choose whatever name you want before this suffix, as long as it is not already taken, it follows the DNS label rules and it does not contain a dot (wildcard certificates do not support sub-subdomains).

With this information I will provision the following:

  • The DNS RR as described in the RELOAD draft.
  • A configuration server.
  • An enrollment server, with its CA certificate.
  • A secure Operation, Administration and Management (OAM) server.

The DNS server will permit retrieving the IP addresses and ports that can be used to connect to the configuration server. If we reuse our example above, the following command will retrieve the DNS name and port:

$ host -t SRV _reload-config._tcp.my-overlay-reload.implementers.org
_reload-config._tcp.my-overlay-reload.implementers.org has SRV record 40 0 443
my-overlay-reload.implementers.org.

Note that the example uses the new service name and well-known URL that were agreed on at the Vancouver meeting, but the current name (p2psip-enroll) will be supported until the updated specification is published.

The DNS name can then be resolved (the IPv6 address is functional):

$ host my-overlay-reload.implementers.org
my-overlay-reload.implementers.org has address 173.246.102.69
my-overlay-reload.implementers.org has IPv6 address
2604:3400:dc1:41:216:3eff:fe5b:8240

Then the configuration file can be retrieved by following the rules listed in the specification:

$ curl --resolve my-overlay-reload.implementers.org:443:173.246.102.69 \
https://my-overlay-reload.implementers.org/.well-known/reload-config

The returned configuration file will contain a root-cert element containing the CA certificate that was created for this overlay, and will be signed by a configuration signer that is maintained by the configuration server. Basically the configuration server will automatically renew the configuration signer and re-sign the configuration file every 30 days, or sooner if you upload a new configuration file (more on this later). Note that, to ensure that there is no lapse in the rollover of signer certificates, the configuration file must be retrieved periodically (the expiration attribute contains the expiration date of the signer certificate, so retrieving the configuration document one or two days before this date will guarantee that any configuration file can be used to validate the next one in the sequence). This feature frees implementers from developing their own signing tools (a future version will permit implementers to maintain their own signer and to upload a signed configuration file).
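A client could automate that periodic retrieval with something along these lines (a rough sketch under several assumptions: it skips the SRV resolution step shown earlier, expects the configuration element to be serialized without a namespace prefix, and simply refreshes two days before the date in the expiration attribute):

import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.OffsetDateTime;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Rough sketch: fetch the configuration document from the well-known URL,
// read the expiration attribute of the configuration element, and decide
// when the next retrieval should happen.
class ConfigRefresher {
    public static void main(String[] args) throws Exception {
        String url = "https://my-overlay-reload.implementers.org/.well-known/reload-config";
        HttpResponse<byte[]> response = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder(URI.create(url)).build(),
            HttpResponse.BodyHandlers.ofByteArray());

        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(response.body()));
        Element configuration = (Element) doc.getElementsByTagName("configuration").item(0);
        OffsetDateTime expiration = OffsetDateTime.parse(configuration.getAttribute("expiration"));

        System.out.println("Fetch the next configuration before " + expiration.minus(Duration.ofDays(2)));
    }
}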

The configuration file also contains an enrollment-server element, pointing to the enrollment server itself, which can be used to create certificates as described in the specification. The enrollment server requires a valid username/password to create a certificate and, in any case, the default configuration document returned is filled with only the minimum parameters required, so it is useless as-is to run a real overlay. Modifying the configuration document and managing the users that can request a certificate (and so join the overlay) is the responsibility of the OAM server.

Because the OAM server uses a client certificate for authentication, it uses a different domain name than the configuration and enrollment servers. The domain name uses the "-oam-reload.implementers.org" suffix, and a separate CA is used to create the client certificate, so a user of the overlay cannot use its certificate to change the configuration (it would be a good idea to define a new X.509 extended key usage purpose for RELOAD to test for this).

The OAM server uses a RESTful API to manage the configuration and enrollment servers (well, as RESTful as possible, because the API is in fact auto-generated from a JMX API, and I did not find a better solution than mapping a JMX operation to a POST – but more on this in a future blog entry). Here are the commands to add a new user, change a user's password, list the users and remove a user:

$ curl --cert client.crt --key client.key --data "name=myname&password=mypassword" \
https://my-overlay-oam-reload.implementers.org/type=Enrollment/addUser
$ curl --cert client.crt --key client.key --data "name=myname&password=mypassword" \
https://my-overlay-oam-reload.implementers.org/type=Enrollment/modifyUser
$ curl --cert client.crt --key client.key https://my-overlay-oam-reload.implementers.org/type=Enrollment/Users
$ curl --cert client.crt --key client.key --data "name=myname" \
https://my-overlay-oam-reload.implementers.org/type=Enrollment/removeUser

The password is stored as a bcrypt hash, so it is safe as long as you do not use weak passwords.

The last step is to modify the configuration, probably to add a bootstrap element. Currently the OAM server manages what is called a naked configuration, which is a configuration document stripped of all signatures. The current naked configuration can be retrieved with the following command:

$ curl --cert client.crt --key client.key https://my-overlay-oam-reload.implementers.org/type=Configuration/NakedConfiguration > config.relo

The file can then be freely modified with the following constraints:

  • The file must be valid XML and must conform to the schema in the specification (including the use of namespaces).
  • The sequence attribute value must be increased by exactly one, modulo 65535.
  • The instance-name attribute must not be modified.
  • The expiration attribute must not be modified.
  • A node-id-length element must not be added.
  • The root-cert element must not be removed or modified, and a new one must not be added.
  • The enrollment-server element must not be removed or modified, and a new one must not be added.
  • The configuration-signer element must not be removed or modified, and a new one must not be added.
  • A shared-secret element must not be added.
  • A self-signed-permitted element with a value of “true” must not be added.
  • A kind-signer element must not be added.
  • A kind-signature or signature element must not be added.
  • Additional configuration elements must not be added.

Then the file can be uploaded into the configuration server:

$ curl --cert client.crt --key client.key -T config.relo https://my-overlay-oam-reload.implementers.org/type=Configuration/NakedConfiguration

If there is a problem in the configuration, an HTTP 4xx error should be returned, hopefully with a text explaining the problem (please send me an email if you think that the text is not helpful, or if a 5xx error is returned).

Bufferbloat, Cablemodem and SDR/SDN

Yesterday I attended a session at the IETF meeting in Vancouver that will probably be remembered as a key moment in the history of the Internet. In it Van Jacobson gave a fantastic talk on CoDel and on the bufferbloat problem. At the end of the talk, Van presented some deployment issues, the second one being that in a computer CoDel should be deployed closer to the device driver. I wondered if Van Jacobson's own netchannels could not be a nice solution to this problem, but I did not get the courage to go to the microphone and ask.

I did not think much of the first deployment issue at the time. Here the problem is that although CoDel is now implemented in Linux kernel 3.5, and so can easily be deployed in home NATs/routers, the right place to install CoDel would be inside the cablemodem (or equivalent). Unfortunately this is not a place that can be easily modified, as it is fully under the control of whoever built the cablemodem.

Then later in the day I attended the Technical Plenary, and the technical talk was about Software Defined Networking (SDN). I must admit that I had never heard of SDN before, but the name itself immediately made me think of SDR, Software Defined Radio – I worked on a project involving SDR a few years back, and I still have the two USRP1s that I used for prototyping. My intuition seems right on, although the first talk, by Cisco, left me confused (I will summarize it as "this is horribly complicated, buy our stuff"). The second talk, by a researcher, was better although a little bit creepy (SDN in my home network and the controller outside?). With the third and last talk, by Google, I was convinced that SDN was in fact SDR for networks.

Now all these talks got somehow processed during my sleep, and I woke up with an idea. Why not use the same hardware that is used for SDR, but for implementing SDN? For example one could design a USRP1 daughterboard that would permit connecting it to the RJ45 connector from my cable provider. Then it is simply a programming problem – i.e. implementing DOCSIS 3.0, but this time with CoDel inside. And that would also open up a lot of possibilities, like being able to run tcpdump on the cable side of the modem.

One can even dream of additional daughterboards for different wired connections – Ethernet, USB, HDMI, SATA, powerline and so on. That would be an exciting project to work on.

NAT64 discovery

Last week I volunteered to review draft-ietf-behave-nat64-discovery-heuristic, an IETF draft that describes how an application can discover a NAT64 prefix that can be used to synthesize IPv6 addresses for embedded IPv4 addresses that cannot be automatically synthesized by a DNS64 server (look here for a quick overview of NAT64/DNS64).

I am not a DNS or IPv6 expert, so I had to do a little bit of research before starting to understand that draft, and it looked interesting enough that I decided to write an implementation, which is probably the best way to find problems in a draft (and seeing how often I find bugs in published RFCs, that should be a mandatory step, but that's another discussion). I installed a PC with the ecdysis Linux Live CD, and configured it to use a /96 subnet of my /64 IPv6 subnet. After this I just had to add a route on my development computer to be able to use NAT64. I did not want to change my DNS configuration, so I forced the nameserver in the commands I used. With that configuration I was able to retrieve a synthesized IPv6 address for a server that does not have an IPv6 address, then ping6 it:

$ host -t AAAA server.implementers.org 192.168.2.133
server.implementers.org has IPv6 address 2001:470:1f05:616:1:0:4537:e15b

$ ping6 2001:470:1f05:616:1:0:4537:e15b
PING 2001:470:1f05:616:1:0:4537:e15b(2001:470:1f05:616:1:0:4537:e15b) 56 data bytes
64 bytes from 2001:470:1f05:616:1:0:4537:e15b: icmp_seq=1 ttl=49 time=49.4 ms

As said above, the goal of NAT64 discovery is to find the list of IPv6 prefixes. The nat64disc package, which can be found at the usual place in my Debian/Ubuntu repository, contains one command, nat64disc, that can be used to find the list of prefixes:

$ nat64disc -d ipv4only.implementers.org -n 192.168.2.133 -l
Prefix: 2001:470:1f05:616:1:0:0:0/96 (connectivity check: nat64.implementers.org.)

When the draft is published, the discovery mechanism will use by default the domain "ipv4only.arpa.", but this zone is not populated yet, so I added the necessary records to ipv4only.implementers.org so the tool can be used immediately. This domain name must be passed with the -d option on the command line.

As explained above, I did not want to modify my DNS configuration, so I have to force the address of the nameserver (i.e. the DNS64 server) on the command line, with the -n option. Interestingly this triggered a bug in Java: when forcing the nameserver, the resolver sends an ANY request, which is not processed by DNS64. People interested in the workaround can look at the source code, as usual (note that there is another workaround in the code, also related to a resolver bug, a bug that prevents using IPv6 addresses in /etc/resolv.conf).

I also provisioned a connectivity server for my prefix, as shown in the result. If the tool finds a connectivity server associated with a prefix, it will use it to check the connectivity and remove the prefix from the list of prefixes if the check fails.

The tool can also be used to synthesize an IPv6 address:

$ nat64disc -d ipv4only.implementers.org -n 192.168.2.133 69.55.225.91
69.55.225.91 ==> 2001:470:1f05:616:1:0:4537:e15b

and to verify that an IPv6 address is synthetic:

$ nat64disc -d ipv4only.implementers.org -n 192.168.2.133 2001:470:1f05:616:1:0:4537:e15b
2001:470:1f05:616:1:0:4537:e15b is synthetic
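
For the curious, the synthesis itself is nothing more than embedding the 32 bits of the IPv4 address in the last 32 bits of the /96 prefix. Here is a minimal Java sketch (assuming the /96 prefix discovered above; the other prefix lengths defined for NAT64 place the bits differently):

import java.net.InetAddress;
import java.net.UnknownHostException;

// Minimal sketch of NAT64 synthesis for a /96 prefix: the IPv4 address
// simply becomes the last 32 bits of the IPv6 address.
class Nat64Synthesis {
    static InetAddress synthesize(String prefix96, String ipv4) throws UnknownHostException {
        byte[] v6 = InetAddress.getByName(prefix96).getAddress();  // 16 bytes, last 4 still zero
        byte[] v4 = InetAddress.getByName(ipv4).getAddress();      // 4 bytes
        System.arraycopy(v4, 0, v6, 12, 4);                        // embed IPv4 in bits 96..127
        return InetAddress.getByAddress(v6);
    }

    public static void main(String[] args) throws Exception {
        // 69.55.225.91 is 0x45 0x37 0xe1 0x5b, hence the 4537:e15b suffix seen above.
        System.out.println(synthesize("2001:470:1f05:616:1:0:0:0", "69.55.225.91").getHostAddress());
    }
}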

The tool does not process DNSSEC records yet, and I will probably not spend time on this (unless, obviously, someone pays me to do it).

RELOAD: Interoperability

In a previous post I said that one of my reasons to develop a RELOAD implementation was to implement VIPR. Another reason is to be able to develop software that is resistant to developer bugs.

The Internet is global, which means that a service should be available 24/7; having a web page saying that a service is down for maintenance is just plain incompetence, and companies doing this should be publicly shamed for it. There are multiple ways to build systems that keep working through software updates, operating system bugs and other disasters, and I think that Google did a lot of good in this domain by showing that it could be done in a way that does not require spending millions of dollars on hardware – I advocated a Google-like architecture a long time before Google even existed, and it was a relief when Google started to publicly disclose some of the things they were doing, because then I could end my explanations with "and BTW, this is exactly what Google is doing!". Herd mentality always works.

Now imagine a service that is already doing all the right things – Amdahl's law is the Law of the Land, the sysadmin motto ("if it ain't broke, don't fix it") is cause for immediate dismissal, programmers cannot commit without unit tests and peer reviews, and so on. The problem is that even when using the best practices (and by best I certainly do not mean the most expensive) a whole system can still go down because of a single programmer's bug – and we certainly had some very public examples of that recently.

Bugs are not something programmers should be ashamed of – instead they should be thought of as a probability, and this probability integrated into the design of a service. One of the best proofs that bugs are not preventable mistakes but unavoidable limitations of software is the so-called "software bit rot", i.e. the apparent spontaneous degradation of software over time. Software does not really degrade over time; it is just that the environment under which it runs changes over time – faster processors, larger memory and so on. Basically this means that even if we can prove that a piece of software is perfect now, unless it is possible to predict the future it is not possible to guarantee that a program will be without bugs in the future.

So, as always, if there is no solution to a problem then just change the problem. Bugs have a more or less constant probability of happening, i.e. X bugs per Y lines of code. I am not aware of any research in this domain (pointers welcome!) but intuitively I would say that two independent developers starting from the same set of specifications will come up with two different pieces of software and, more important to my point, with two different sets of bugs. If the probability of a major bug in any implementation is X, the probability that the same environment triggers the same bug in two completely different implementations should be X^2 (assuming that developers are not biased towards certain classes of bugs). In other words, if we can run two or more implementations of the same specifications in parallel and are able to choose the one that is working correctly at a specific time, then we increase the reliability of our service by multiple orders of magnitude (one of my old posts used the same reasoning applied to ISPs – just replace "service down" by "someone will die because I cannot reach 911").

Obviously running multiple implementations of the same specifications at the same time is really expensive and, as far as I know, is done only when a bug can threaten lives, as in a nuclear power plant. If we can solve the economics of this problem, then we should be able to offer far better services to end-users.

Now back to RELOAD. RELOAD is a standard API for peer-to-peer services. There is nothing really new in RELOAD – it uses technologies, like Chord, that are well-known and, for some people, probably close to obsolescence. The most important point, at least in my opinion, is that for the first time we have a standard API to use these services across multiple implementations that can inter-operate with each other. Having a distributed database is great for reliability purposes, but having a distributed database that runs on different implementations, as RELOAD permits, is far better.

But RELOAD is not limited to a distributed database. With the help of ReDIR, we can have multiple implementations registering a common service, so if one implementation fails, we can still count on the other implementations of the same service.

As for the economic side of having multiple implementations: instead of having only mine, developed in-house, I can swap it with similar code developed by other implementers. In the end, if I am able to do this with 3 or 4 other implementers, all of us will have an incredibly more resilient service without spending a lot more money (this is the reason why I am so against having THE ONE open or free software implementation of a standard: the more implementations are developed, the better the users will be served. Implementation anarchy is a good thing, because it fosters evolution).

OK, I admit that was a long introduction for version 0.6.0 of the libreload-java package and the associated new Internet-Draft about access control policy scripts, especially because access control policy scripts paradoxically do not improve the reliability of a service based on RELOAD, but reduce it. The reason for this is that an access control policy is ordinarily implemented natively, so a bug stays local to an implementation. This draft permits distributing a new access control policy, but as the same code will be shared between all the different implementations, this creates a single point of failure. This is why access control policy scripts should be considered only a temporary solution, until the policy can be implemented natively. The script copyright should prevent developers from copying it, but we all know how that works, so to help developers do the right thing and not be tempted to look at the scripts when implementing the native version in their code, the source code in the XML configuration file is now obfuscated by compressing it and converting it to a base64 string.
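The obfuscation itself is nothing fancy. Here is a sketch of the encoding side (DEFLATE plus the standard Base64 encoder; the exact encoding used by libreload-java may differ in its details):

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DeflaterOutputStream;

// Sketch of the "obfuscation": compress the script source and encode it as a
// base64 string before embedding it in the XML configuration file.
class ScriptObfuscator {
    static String obfuscate(String script) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(buffer)) {
            deflater.write(script.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(obfuscate("return true;"));
    }
}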

The draft also introduces an idea that, in my opinion, could be worth developing, which is that the signer of a configuration file should play an active role in verifying that all implementations in the overlay follow the rules set by the configuration file. For example the signer can send incorrectly signed configuration files to random peers, and add them to a blacklist in a future configuration file if they do not reject it.

RELOAD: VIPR Support

One of my reasons for developing an implementation of RELOAD is to be able to develop an implementation of VIPR, a mechanism currently developed by the IETF to automatically and securely share SIP routes between VoIP providers. VIPR uses RELOAD as a giant distributed database where VoIP providers store the phone numbers of their customers.

RELOAD in its latest incarnation has all the features needed to implement VIPR, with two exceptions: the access control policy and the quota mechanism.

The access control policy in VIPR is similar to the standard USER-NODE-MATCH policy, but there are enough differences to mandate the implementation of a new policy. The preferred solution is to implement this policy natively, but a temporary solution could be to use the extension I designed to add new policies in the RELOAD configuration file. A future version of my implementation will implement this policy natively, but meanwhile the following script can be used:


var equals = function(a, b) {
  if (a.length !== b.length) return false;
  for (var i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
};
var length = configuration.node_id_length;
return equals(entry.key.slice(0, length),
  entry.value.slice(4, length + 4))
    && equals(entry.key.slice(0, length), signature.node_id);

The quota mechanism in VIPR is interesting. Basically it says that a VoIP provider must contribute a number of RELOAD servers to the distributed database that is proportional to the number of phone numbers it plans to register. Because this quota mechanism is useful for usages other than VIPR, it now has its own Internet-Draft, separate from the VIPR drafts, with the goal of publishing it as an IETF standard. New quota mechanisms are not very frequently needed (AFAIK, this is the first quota mechanism created outside the RELOAD document) so it does not make sense to develop another API to write quota scripts. This means that this quota mechanism will have to be coded natively by RELOAD implementers (the VIPR configuration document should contain a <mandatory-extension> element to be sure that only servers implementing this extension will join the overlay).

Version 0.5.0 of the libreload-java package, which was released a few minutes ago, not only permits using the script listed above but also implements the new quota mechanism, making it suitable for implementing a VIPR server.

Update 07/05/2011: The VIPR access control policy is natively implemented in libreload-java 0.6.0.