# What? Imagine an Internet where every packet is cryptographically protected from source to destination against espionage and forgery, getting an IP address is as simple as generating a cryptographic key, core routers move data without a single memory look up, and denial of service is a term read about in history books. Finally, becoming an ISP is no longer confined to the mighty telecoms, anyone can do it by running some wires or turning on a wireless device. This is the vision of cjdns. # Why? The Internet is built on protocols which largely date back to the late 80's or earlier. At a time when it was a network of anarchistic academics and scholars showing the ITU that open standards matter, it was absolutely enough. Over time the network has gotten bigger and the users have found new needs. In the age when packet inspection is universal and security breaches are commonplace, cryptographic integrity and confidentiality are becoming more of a requirement. The US government recognized this requirement and has been helping through [IPsec] and [DNSSEC] efforts. Another issue is how are we going to route packets in a world where the global routing table is simply too large for any one router to hold it all? Despite the heroic efforts of core network engineers, the growth of the global routing table seems an unstoppable march. Cisco router company has proposed a plan called Locator/Identifier Separation Protocol, or [LISP] which aims to solve this by re-aggregating the routing table without forcing people to change their precious IP addresses. A different view of this problem is IP address allocation, currently it is done by a central organization which assigns IP addresses in such a way as to make the routing table as small as possible. Unfortunately this creates a bar of entry to the ISP sphere because aspiring network operators must register with the central organization and apply for an allocation of IP addresses while demonstrating that they will not be wasted. It is always easier to show that you need IP addresses if you already have a network. Denial of service, an attempt to prevent legitimate users from accessing a service[1], is likewise a new problem in the expanding network. To my knowledge, there is no general purpose solution to denial of service attacks. Solutions to packet flood based denial of service often revolve around hosting a service on many computers so that they can handle an enormous amount of traffic. Finally, the existing protocols are difficult to use, we cannot reasonably assign blame to anyone for this, many of these protocols are over thirty years old and demonstrate a level of craftsmanship which I can only hope to one day achieve. However, thirty years takes its toll on the best of us and as the Internet grew and became more complex, the administration interface of the typical router has grown a thicket of knobs, buttons and switches to match the proliferation of use cases and failure modes. As a result, network operation has become a science where students receive degrees and certificates for knowing the meanings of the plethora of knobs and switches, it has also become, like the tuning of the race car, an art, passed from master to apprentice and shared on mailing lists. Suffice to say, the bar of entry into the ISP realm is too high. Users, particularly in the ad-hoc wireless network arena, have observed the high bar of entry into traditional routing and have developed a menu of alternative, self-configuring protocols such as [OLSR], [HSLS], and [BATMAN]. ## So the problems are already solved? Not every problem listed has an existing solution and of the ones which do, many of the solutions are based on incompatible technology. For example: OLSR was not designed to operate with IPsec and LISP. Even where the solutions exist and are ready for deployment, they still require mass technology adoption and they don't offer existing ISPs significant immediate gains. The mismatch is rather absurd. On the one hand there are scholars, engineers, hardware and software designers with combined over 1000 years of experience. There are mathematical formulae, proofs, papers and specification documents; written, circulated, peer reviewed and written again. On the other hand you have a single volunteer developer, a clean slate, and an attitude that nothing is impossible. How can this be anything short of lunacy? *In revolutionary times, the old book only weighs you down.* cjdns is built on the idea that the ISPs and hosting providers which exist now will never upgrade, not to LISP, not to DNSSEC, not even to IPv6 in any meaningful way. Building new systems to be compatible with old systems is catering to the audience you will never have. Asking existing ISPs to upgrade for the common good is asking them to take a risk with no immediate benefit. cjdns is about throwing out the book and redefining the specifications in a way that will be fast, secure, and most importantly, *easy* for the next generation of ISPs to deploy and use. ## What is this denial of service? Usage of a service can be interrupted by sending a flood of unwanted packets to a host from a large number of infected "zombie" machines. This, known as [DDoS], is a problem which worsens every year as the upload speed of all infected nodes on the Internet grows in proportion to the download speed of any given link. Being infected with a virus and participating in DDoS, though not a picnic, is not an emergency for the owner of the infected machine. Nor is it an emergency for their ISP. DDoS is always "their problem"... Until it strikes in your network. Sadly, a common response from a datacenter is to stop carrying the controversial content, making DDoS an effective censorship tool and encouraging the practice. Another form of denial of service which is even more insidious is intimidation by threat of faux court action. This form of denial of service is especially effective since most people get their IP addresses from their ISP, when their ISP is threatened, they need to make a judgment call as to the validity of the claim and they often act as judge and jury, disconnecting a customer in order to avoid conflict. Those who have their own IP addresses assigned to them, are able to essentially be their own ISP and to peer with a multitude of other ISPs making it very difficult to threaten anyone but them. ## What is the routing table and why does it keep getting bigger? A more technical issue with the Internet, and one of which many people are unaware, is address space deaggregation. Every computer connected to the Internet needs an address, a number which uniquely identifies it and which is attached to every piece of data which is to be sent to that computer. At every stop along its path through the Internet, a packet (unit of data) has its address field examined by a router so it can decide which wire that packet should be sent down. Routers have an easier time if addresses are in big blocks so that a router can look quickly at the first numbers in the address and know, for example, that it is destined for somewhere in China, not exact but enough to know which wire to send it though. People naturally want as many addresses as they can possibly get and they want them in the smallest blocks possible, this is so they can then control (or buy and sell) these small blocks independently. The smaller the blocks of addresses which are announced, the larger the routing tables become and the more work the Internet's core routers must do in order to send a packet in the right direction. There have been attempts to aggregate addresses back in to groups but nonetheless, the number of small announcements in the global routing table has grown every year. To address the growing routing tables, Cisco has proposed a new protocol called Locator/Identifier Separation Protocol or LISP. The idea of LISP is to separate the addresses which people use from the addresses which routers use, like a lower level version of DNS. LISP allows the edge ISPs and users to see the Internet as they want it, deaggregated into small pieces for political reasons, and the routers to see the Internet as they want it, centralized hierarchical pyramid of addresses emanating from some arbitrary center point. This design works well in existing routers since they are designed to get packets with a universally unique address and look that address up in a table. This is advertised as a design feature but LISP is limited in its vision, if one must look up the "real location" of a server before forwarding a packet, why not simply look up the fastest path? ## I don't care, it's their problem. Each of these problems is a tragedy of the commons problem. The users of virus infected computers are incentivized to save money rather than purchasing a product or service to rid their computer of the infection. While there are solutions such as egress filtering which decrease the problem, ISPs are incentivized to implement as little security as possible because they are not directly affected. Denial of service, whether by packet flooding or by faux legal action, benefits the attacker who is able to hide the truth or victimize service providers as well as organizations who make it their business to provide DoS related services. The victim who all too often publishes information for no other reason than satisfaction of telling the truth, is the only party harmed by this type of attack, and he is the least able to prevent it. Address space deaggregation benefits the edge ISPs who gain more flexibility in how their network is organized at the cost of the core ISPs whose only defense is the "we will not route that" nuclear option which would no doubt bring about a revolt from the edge ISPs. Each of these problems hurts everyone, DDoS forces ISPs to over provision their lines, denial of service through faux legal action increases the cost of running a community website or ISP since each accusation must be reviewed and its validity assessed, and address deaggregation means everyone must pay more to have their packets routed through increasingly high power routers, and difficulty of operating a router and getting a block of IP addresses hurts competition in the ISP sphere, thus increasing prices and impeding progress. # How? cjdns is made of three major components which are woven together. There is a switch, a router, and a CryptoAuth module. With total disregard for the OSI layers, each module is inherently dependent on both of the others. The router cannot function without routing in a small world which is made possible by the switch, the switch is blind and dumb without the router to command it, and without the router and switch, the CryptoAuth has nothing to protect. ## The Switch *He doesn't think about his actions; they flow from the core of his being.* The switch design is unlike an IP or Ethernet router in that it need not have knowledge of the globally unique endpoints in the network. Like ATM switching, the packet header is altered at every hop and the reverse path can be derived at the end point or at any point along the path but unlike ATM, the switch does not need to store active connections and there is no connection setup. ### Definitions * Interface: A point-to-point link to another cjdns switch. This may be emulated by Ethernet frames, UDP packets or other means. * Self Interface: A special Interface in each switch, packets sent for this interface are intended for the node which this switch is a part of. Upon reaching the ultimate hop in its path, a packet is sent through the Self Interface so it can be handled by the next layer in the node. * Director: A binary codeword of arbitrary size which when received by the switch will direct it to send the packet down a given Interface. * Route Label: An ordered set of Directors that describe a path through the network. * Encoding Scheme: The method by which a switch converts one of its internal Interface ID (EG: array index) to a Director and converts a Director back to its internal representation. Encoding schemes may be either fixed width or variable width but in the case of variable width, the width must be self-describing as the Directors are concatenated in the Route Label without any kind of boundary markers. * Encoding Form: A single representation form for encoding of a Director. For a given Encoding Form, there is only one possible way to represent a Director for a given Interface. A variable width Encoding Scheme will have multiple Encoding Forms while a fixed width Encoding Scheme will have only one. * Director Prefix: For switches which implement variable width encoding, the least significant bits of the Director is called the Director Prefix and is used to determine the width of the Director. **NOTE** Because the Route Label is read from least significant bit to most significant bit, the Director Prefix is actually the bits furthest to the *right* of the Director. ### Operation When a packet comes in to the switch, the switch uses its Encoding Scheme to read the least significant bits of the Route Label in order to determine the Director and thus the Interface to send the packet down. The Route Label is shifted to the right by the number of bits in the Director, effectively *removing* the Director and exposing the Director belonging to the next switch in the path. Before sending the packet, the switch uses its Encoding Scheme to craft a Director representing the Interface which the packet came *from*, does a bitwise reversal of this Director and places it in the empty space at the left of the Route Label which was exposed by the previous bit shift. In this way, the switches build a mirror image of the return Label allowing the endpoint, or any hop along the path, to derive the return path by simple bitwise reversal without any knowledge of the Encoding Schemes used by other nodes. #### Example Supposing Alice wanted to send a packet to Fred via Bob, Charlie, Dave and Elinor, she would send a message to her switch with the first Director instructing her switch to send down its interface to Bob, the second Director instructing Bob's switch to send to Charlie and so on. NOTE: Spaces between bits are for illustration only, Route Labels have no boundary markers between Directors and switches cannot know how many bits a Director uses unless that Director is their own. Alice's original Route Label, before entering her switch: 0000000000000000000000000 0001 101011 011010 100101101 10111 0100011 ^^^-- unused space --^^^^ ^^^^^^^-- Alice's Director Interface(Alice->Bob). Route Label when it reaches Bob: 1000000 0000000000000000000000000 0001 101011 011010 100101101 10111 ^^^^^^^ ^^^^^-- Bob's Director for Interface(Bob->Charlie). ^-- Alice's Director for her Self Interface (reversed) Route Label when it reaches Charlie: 11001 1000000 0000000000000000000000000 0001 101011 011010 100101101 ^^^^^ ^^^^^^^ ^^^^^^^^^-- Charlie's Director for Interface(Charlie->Dave). ^^^ ^-- Alice's Director for her Self Interface (reversed) ^-- Bob's Director for Interface(Bob->Alice), bit-reversed. Route Label when it reaches Dave: 110110011 11001 1000000 0000000000000000000000000 0001 101011 011010 ^^^^^^^^^ ^^^^^ ^^^^^^^ ^^^^^^-- Dave's Director for Interface(Dave->Elinor). ^^^^^^ ^^^ ^-- Alice's Director for her Self Interface (reversed) ^^^ ^-- Bob's Director for Interface(Bob->Alice) (reversed). ^-- Charlie Director for Interface(Charlie->Bob) (reversed). Supposing Dave cannot forward the packet and needs to send an error, he does not know where Charlie's Director ends and Bob's begins so he can't re-order them but because they are bit reversed, he can reverse the order by bit reversing the entire Route Label. Route Label after bit reversal: 010110 110101 1000 0000000000000000000000000 0000001 10011 110011011 ^^^^^^^ ^^^^^ ^^^^^^^^^-- Charlie's Director for Interface(Charlie->Bob). ^^^^ ^-- Bob's Director for Interface(Bob->Alice). ^ Alice's Director for her Self Interface. Dave can then send the packet back to Charlie who need not know what it is in order to forward it correctly on to Bob and then to Alice. If the packet had reached Fred, he would be able to use the same technique of reversing the Route Label in order to determine its origin. ### Route Label Manipulations Despite a Route Label's relative opacity, there are still certain functions which can be carried out on Labels. #### Splice The Splice operation takes a Route Label from pointA to pointB and concatenates it with a Label from pointB to pointC yielding a Route Label for a route from pointA to pointC. Splicing is done by XORing the second part with `1` and shifting it left by the log base 2 of the first part, then XORing the result with the first part. Given: ``` routeAB = 0000000000000000000000000000000000000000000001011101110101011001 routeBC = 0000000000000000000000000000000000000000000000000000110101010100 XOR 1 0000000000000000000000000000000000000000000000000000000000000001 equals 0000000000000000000000000000000000000000000000000000110101010101 shift left by the log base 2 of routeAB 0000000000000000000000000000000000110101010101000000000000000000 XOR routeAB 0000000000000000000000000000000000000000000001011101110101011001 equals routeAC 0000000000000000000000000000000000110101010100011101110101011001 ^-- Note the overlap bit ``` The log base 2 represents the index of the first set bit, starting from the right. This means that shifting by the log base 2 leaves 1 bit of overlap, this along with the XORing of the second part (`routeBC`) against `1` causes the highest bit in the first part to be overwritten. #### Unsplice The inverse of the splice operation converts a full Route Label to a representation which would be useful to a node along the path. That is to say from routeAC and routeAB it derives routeBC. ``` routeAC = 0000000000000000000000000000000000110101010100011101110101011001 routeAB = 0000000000000000000000000000000000000000000001011101110101011001 routeBC = routeAC shifted left by the log base 2 of routeAB routeBC 0000000000000000000000000000000000000000000000000000110101010100 ``` #### RoutesThrough Given two routes, it is possible to determine whether one route is an extension of another one, this is similar to the reverse of the the splicing routine. To determine whether routeAC "routes through" the node at the end of by routeAB, one simply removes the most significant 1 in routeAB, trims down the left side of routeAC so that both routes are the same length, and compares the results. ``` routeAC = 0000000000000000000000000000000000110101010100011101110101011001 routeAB = 0000000000000000000000000000000000000000000001011101110101011001 Trimmed to the same length: routeAC' 0000000000000000000000000000000000000000000000011101110101011001 routeAB' 0000000000000000000000000000000000000000000000011101110101011001 ``` Equality indicates routeAC routes through the node at the end of routeAB. ### Encoding Schemes Because the conditions under which the switches operate may differ dramatically, the actual Encoding Scheme is left as an implementation detail. Some switches may be devices with 15 physical ports wherein a 4 bit fixed width Encoding Scheme would be wise, other switches may be in wireless networks where the number of reachable cjdns nodes, and thus the number of Interfaces may grow and shrink. These devices may prefer a variable width encoding in order to save Label space without sacrificing expandability. There are however a few limitations placed upon encoding methods in order to allow the above manipulations. #### Self Interface Director In order for Route Labels to be able to be spliced together, the most significant bit in a Label must always be `1` so that we know where it ends. Since every route ends with a Self Interface, the Director which represents the Self Interface must always be encoded as `1` with 3 or more leading zeros. Furthermore, a node must never send a packet with a Route Label for which the highest 3 bits are not zero. This is important so that the reverse route data applied by routers along the path is not mistaken for additional forward route. Finally, a switch should alias any Director ending with `0001` to the Self Interface Director. For example: a switch using a 6 bit fixed width Encoding Scheme would have to alias `010001`, `100001` and `110001` to the Self Interface Director `000001`. The reason for this is because an unsuspecting router might splice a valid path which uses every bit of the label space such that the most significant three bits of the entire Route Label are `0001`. In this case the bits added by the first switch will be right up against the Self Interface Director and could be mistaken for another forward Director. #### Variable Width Encoding If a switch implementation uses variable width encoding, it's obvious that some Interfaces will get short Directors while other Interfaces will be stuck with the longer Directors. Less obvious is what happens when a packet coming from a "small" Interface is sent to a "big" Interface. The switch shifts off the big Director leaving a big empty space on the left. To put only a small Director in that space would not work because there would be unaccounted for space in the reverse route which would cause a failure at the next switch in the path. In order to make it work, small Interfaces need to be able to be represented using big Directors. For any size, all smaller sizes need to be able to be represented. In the other direction, a packet coming from a big Interface with only a small next Director must be dropped (and an error sent back) because the return route simply cannot be represented. To prevent this problem, when a node on the far end of a big Interface inquires about a node behind a small Interface, the first Director in the Route Label must be converted by the router to the big form before the packet is sent. A still more subtle problem with variable width encoding is that a Label for the route between nodeA to nodeB may be different depending on how one reached nodeA. Since this would be a severe hindrance to efficient routing algorithms using path inference, a node is required to *describe* (through inter-router communication) how it encodes numbers. #### Encoding Scheme Definition In order to give routers the ability to chain paths from a graph of linked nodes while still preserving Variable Width Encoding, a node is required to describe its Encoding Scheme to other nodes and adhere to a few rules to make possible the conversion of a Director to a wider bit width by other nodes. The Encoding Scheme Definition consists of an array of representations of the Encoding Forms allowed in that Encoding Scheme, each Encoding Form representation having the following three fields: * **prefixLen**: The number of bits in the Director Prefix. This is represented on the wire as a 5 bit integer. * **bitCount**: The number of bits in the Director, not including the Director Prefix. This is represented on the wire as a 5 bit integer. * **prefix**: The Director Prefix used in this Encoding Form, this is represented on the wire using as many bits as are given in **prefixLen**. This value must be the same as the prefix which will be seen in the Director in the label but since it is opaque, no specific byte order is defined. The Encoding Scheme Definition is represented on the wire as a concatenation of Encoding Forms. Each form is represented in the *reverse* order given above with **prefix** furthest to the left, followed by **bitCount** then **prefixLen** furthest to the right. It is sent over the wire in *little endian* byte order allowing the buffer to be read and written forward while the data is unpacked from least significant bit to most significant bit. #### Conversion between Encoding Forms Just telling other nodes how numbers are encoded is not enough to allow inference of paths. As routers chain together the Route Labels for paths between different nodes, they need to *alter* the Encoding Form for Directors in the Route Labels to avoid crafting a Route Label for which the reverse path is not expressible. The two types of alteration which need to be allowed are *extension* and *truncation*. Given a route between nodeX and nodeY, a node must be able to *extend* the first Director in that route by changing it from one of the forms advertised by nodeX to another. This involves changing the Director Prefix and then extending the non-prefix section of the first Director by adding zero bits to the left side. *Truncation* is much the opposite, one would convert the Director from a wider form to a narrower by changing the Director Prefix and truncating zero bits from the left side. In order to remain compatible with these optimizations, a switch must disregard a change in the number of most significant zero bits in its Director, in the case of multi-byte Directors this may influence byte order decisions. Obviously a switch implementor must design their Encoding Scheme so that a Director is unambiguous and cannot be confused with the least significant bits of the data but since this is a requirement for a successful switch implementation, it goes without saying. ### Wire format #### Switch Header Two switches exchange messages prefixed by a Switch Header, eventually wrapped in CryptoAuth sessions (called “outer” or “point-to-point” CryptoAuth sessions). The Switch Header layout is the following: 1 2 3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 0 | | - Route Label - 4 | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 8 | Congest |S| V |labelShift | Penalty | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ The `Route Label` field unsurprisingly holds the 8 byte `Route Label`, the `Type` field indicates the type of packet. The Switch Header may be followed by three different types of packets: * If the first four bytes are `0xffffffff`, then the Switch Header is followed by a Switch Control Packet (see below). * If these first four bytes after the Header are a big-endian 0, 1, 2, or 3, the Switch Header is followed by a CryptoAuth handshake packet. This 0, 1, 2, or 3 is the session state of the CryptoAuth session. The CryptoAuth handshake packet contains the Session Handle (that should be used by the recipient to prefix CryptoAuth data packets sent after that), eventually followed by a Switch Data Header. * Otherwise, these first four bytes are a Session Handle, which the recipient of these packet chose earlier to identify the emitted. They are followed by a CryptoAuth Data Packet, containing a Switch Data Header. #### Switch Control Packet TODO See [this parser](https://github.com/cjdelisle/cjdnsctrl) for more information about the CTRL packet. #### Switch Data Header The layout of the Switch Data Header is the following: 1 2 3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 0 | ver | unusd | unused | Content Type | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ The `ver`sion should be `1`. A `Content Type` set to 256 indicates this header is followed by a Route Header. ## The Router A router has 3 functions, it periodically searches for things, responds to searches, and forwards packets. When a router responds to a search, it responds with nodes which it thinks will get closer to the destination. The responses MUST NOT have addresses which are, in address space distance, further from the responding node than the search target, and they MUST NOT have routes which begin with the same interface as the route to the querying node. These two simple rules provide that no search will ever go in circles and no route will ever go down an interface, only to be bounced back. While the second rule can only be enforced by the honor system, querying nodes MUST double check the first rule. The node doing the searching adds the newly discovered nodes to their routing table and to the search, then continues the search by asking them. Upon receiving a search response containing one's own address, a node SHOULD purge all entries from its table whose routes begin with that route. This will control the proliferation of redundant routes. The "address space distance" between any two given addresses is defined as the result of the two addresses XOR'd against one another, rotated 64 bits, then interpreted as a big endian integer. The so called "XOR metric" was pioneered in the work on [Kademlia] DHT system and is used to forward a packet to someone who probably knows the whole route to the destination. The 64 bit rotation of the result is used to improve performance where the first bits of the address is fixed to avoid collisions in the IPv6 space. Adding nodes to the routing table from search responses is done by splicing the route to the node which was asked with the route to the node in the response, yielding a route to the final destination through them. Routers choose the node to forward a packet to in a similar way to how they answer search queries. They select nodes from their routing table except in this case the selection contains only one node. The packet is sent through the CryptoAuth session corresponding to this node and the label for getting to it is applied to the packet before sending to the switch. The "search target" for forwarding a packet is the IPv6 destination address of the packet. ### Format of router messages Router messages are sent as Data Packets whose content-type is 256 (Route Header). The payload of these inter-router messages is [bEncoded][bEncode] dictionaries. Routers send search queries which have a key called "q", and replies which don't. Routers SHOULD reply to a message with a "q" entry but MUST NOT reply if there is none, lest they reply to a reply. All messages have a transaction id number, a sort of cookie made of a bencoded string containing arbitrary bytes which must be reflected back in the reply. The most common query is a find node or "fn" query. "fn" queries have a field called "tar" for the target address which the node is looking for. Responses to "fn" queries have a field called "n" which is a binary string containing the 32 byte public keys and 8 byte switch labels for the responses. Example fn query in JSON: { "q": "fn", "tar": "abcdefghhijklmno", "txid": "12345" } Same query bEncoded as the routers use: d1:q2:fn3:tar16:abcdefghhijklmno4:txid5:12345e Example fn reply in JSON: NOTE: this reply only shows 2 nodes returned and is for illustration purposes in most cases the number would be an implementation specific constant around 8. { "n": "cdefghijklmnopqrstuvwxyzabcdefghi1234567qponmlkjihgzyxwvutsrstuvwxyzabcde2345678" "txid": "12345" } Same reply bEncoded d1:n80:cdefghijklmnopqrstuvwxyzabcdefghi1234567qponmlkjihgzyxwvutsrstuvwxyzabcde23456784:txid5:12345e The nodes in an fn reply are ordered from worst to best so the best answer is the last entry in the reply. ### Keys of a router message's dictionary Possible keys in a router message include: * `q` (ascii string): the query type. May be: * absent * `gp`: get peers, expects a reply with a list of nodes (`n` key) and their versions (`pn` key) which are direct peers of the recipient. If the target address `tar` is provided, the replied nodes should be the closest to this address. * `fn`: find node, expects a reply with a list of nodes (`n` key) and their versions (`pn` key) which are close to the provided target address `tar`. * `pn`: ping node * `ei` (integer): the encoding index. References one of the forms of the encoding scheme `es`. This is needed when crafting the label to send a packet through the node which has returned this message. The following hop must be encoded in this form or a greater one otherwise the return path cannot be represented. * `es` (byte string): the encoding scheme. See the section above about its format. * `n` (byte string): list of nodes. See below for its serialization. * `np` (byte string): list of nodes' protocol version. The first byte is the number of bytes taken by each version in the list (always 1 for now), followed by the versions themselves, encoded in big endian. * `p` (integer): protocol version. Matches the cjdns version. * `tar` (byte string): the target address. If provided, contains an address the sender wants to reach. * `txid` (byte string): transaction id. Opaque value chosen by senders of queries, which must be used in responses. The serialization of each item of the list of nodes `n` is: 1 2 3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 0 | | + + 4 | | + + 8 | | + + 12 | | + Public Key + 16 | | + + 20 | | + + 24 | | + + 28 | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 32 | | + Route Label + 36 | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ## The CryptoAuth The CryptoAuth is a mechanism for wrapping interfaces, you supply it with an interface and optionally a key, and it gives you a new interface which allows you to send packets to someone who has that key. Like the rest of cjdns, it is designed to function with best effort data transit. The CryptoAuth handshake is based on piggybacking headers on top of regular data packets and while the traffic in handshake packets is encrypted and authenticated, it is not secure against replay attacks and has no forward secrecy if the private key is compromised. The CryptoAuth header adds takes 120 bytes of overhead to the packet, causing a fluctuating MTU. There are 4 types of CryptoAuth header: 1. Connect To Me - Used to start a session without knowing the other node's key. 2. Hello Packet - The first message in beginning a session. 3. Key Packet - The second message in a session. 4. Data Packet - A traffic packet with Poly1305 authentication. All CryptoAuth headers are 120 bytes long except for the Data Packet header which is 20 bytes. The first 4 bytes of any CryptoAuth header is a big endian number which is used to determine its type, this is the so-called "Session State" number. If it is the inverse of zero, it is a Connect To Me header, if it is zero, it is a Hello Packet, if one or two, it is a Hello Packet or repeated Hello Packet, if it is three or four, it is a Key Packet or repeated Key Packet. If it is any number larger than four, it is a Data Packet. Handshake packet structure (Connect To Me, Hello, Key): 1 2 3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 0 | Session State | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4 | | + + 8 | Auth Challenge | + + 12 | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 16 | | + + 20 | | + + 24 | | + Random Nonce + 28 | | + + 32 | | + + 36 | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 40 | | + + 44 | | + + 48 | | + + 52 | | + Permanent Public Key + 56 | | + + 60 | | + + 64 | | + + 68 | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 72 | | + + 76 | | + Poly1305 Authenticator + 80 | | + + 84 | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 88 | | + + 92 | | + + 96 | | + + 100 | | + Encrypted/Authenticated Temporary Public Key + 104 | | + + 108 | | + + 112 | | + + 116 | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | + Variable Length Encrypted/Authenticated Content + | | Data packet structure: 1 2 3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 0 | Nonce | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4 | | + + 8 | Poly 1305 Authenticator | + + 12 | | + + 16 | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 20 | | + Variable Length Data + 24 | | ### 1) Connect To Me Packet If "Session State" is equal to the bitwise complement of zero, the sender is requesting that the recipient begins a connection with him, this is done in cases when the initiator of the connection does not know the key for the recipient. If the entire header is not present the recipient MUST drop the packet silently, the only field which is read in the packet is the "Permanent Public Key" field, all others SHOULD be ignored, specifically, content MUST not be passed on because it cannot be authenticated. The recipient of such a packet SHOULD send back a "hello" packet if there is no established connection. If there is already a connection over the interface, the recipient SHOULD NOT respond but MAY allow the connection to time out faster. ### 2) Hello Packet If the "Session State" field is equal to zero or one, the packet is a Hello Packet or a repeated Hello Packet. If no connection is present, one MAY be established and the recipient MAY send a Key Packet in response but it is RECOMMENDED that he wait until he has data to send first. A node who has sent a Hello Packet, has gotten no response and now wishes to send more data MUST send that data as more (repeat) Hello Packets. The temporary public key and the content are encrypted and authenticated using the permanent public keys of the two nodes and "Random Nonce" in the header. The content and temporary key is encrypted and authenticated using crypto_box_curve25519poly1305xsalsa20_afternm() function, using the shared secret computed as described in the Authentication field's section. ### 3) Key Packet If the "Session State" field is equal to two or three, the packet is a Key Packet. Key Packets are responses to Hello Packets and like Hello Packets, they contain a temporary public key encrypted and authenticated along with the data. Once a node receives a Key Packet it may begin sending data packets. A node who has received a Hello Packet, sent a Key Packet, gotten no further response, and now wishes to send more data MUST send that data as more (repeat) key packets. The content and temporary key is encrypted and authenticated using crypto_box_curve25519poly1305xsalsa20_afternm() function, using the shared secret computed using crypto_box_curve25519poly1305xsalsa20_beforenm() with one peer's temporary public key and the other peer's permanent secret key. It can be decrypted using crypto_box_curve25519poly1305xsalsa20_open_afternm(), with the shared secret computed, which can be computed the same way. ### 4) Data Packet The Data Packet is the default data packet. The first 4 bytes are used as the nonce (see next paragraph), in this case it is a 24 byte nonce and crypto_box_curve25519poly1305xsalsa20_afternm() is used to encrypt and decrypt the data, using the shared secret computed using crypto_box_curve25519poly1305xsalsa20_beforenm() with one peer's temporary public key and the other peer's temporary secret key (both roles are symmetrical and produce the same shared secret). As crypto_box_curve25519poly1305xsalsa20_afternm() requires a 24-byte nonce, the 4-byte nonce is copied in a 24-byte array that is passed to the primitive. The peer which sent the Key packet writes it in the first four bytes of the array; and the other peer to the next four bytes. This distinction is made so that packets from one peer can not be sent back to this peer to make it believe it is from the other peer. A peer should always send its nonce in increasing big-endian order, otherwise they will be dropped by the Replay Protector of the receiver if they are too much out of order. #### ReplayProtector The replay protector is a feature cjdns implementations provide. It is however not part of the protocol itself. ##### Simplest replay protector A simple replay protector would be to drop any data packet whose nonce is lower or equal to the highest packet nonce received so far. This way, a packet can never be sent twice to a node, preventing [replay attacks](https://en.wikipedia.org/wiki/Replay_attack). ##### Replay protector with sliding window This section describes the replay protector used by the “official” cjdns implementation. It is not mandatory to implement it to support the protocol. Whenever a node receives a packet, it compares its nonce to the highest nonce received so far (big-endian). If it is greater, everything is fine, the incoming packet can not be a replay. Otherwise, the packet is a late packet, meaning we already received a packet newer than it. It may be a replay, but it may also be that the channel has a non-uniform latency. The slotted replay protector allows for late packets to be received, assuming they are not “too late” and are not replays. To do that, it stores a 64-bit bitfield, with a bit for each of the 64 nonces before the highest nonce received so far. Every time a packet is received with a nonce between `highestnonce - 64` (excluded) and `highestnonce` (included), it subtracts the packet's nonce to the highest nonce, giving a number n, and looks at the n-th bit of the bitfield. If it is one, the packet is a replay and is dropped. If it is zero, the packet passes, and the bit is set to one. And whenever a packet with a nonce higher than the current highest nonce, the bitfield is shifted by the difference between so two. This provides a secure protection against replay attacks and accidentally duplicated packets EG: from 802.11 noise; but does not discard packets that arrive slightly out of order. This solution is limited in that packets which are more then 64 "slots" out of order will be discarded. In some cases, this could be a benefit since in best effort networking, never is often better than late. ### Authentication field: This field allows a node to connect using a password or other shared secret, the AuthType field specifies how the secret should be used to connect. 1 2 3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 0 | AuthType | | +-+-+-+-+-+-+-+-+ AuthType Specific + 4 | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 8 | Unused | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ The "AuthType Specific" field is specific to the authentication type. #### AuthType Zero AuthType Zero is no authentication at all. If the AuthType is set to zero, all AuthType Specific fields are disregarded and SHOULD be set to random numbers. This AuthType is the one used in Key packets and for inner (end-to-end) cryptoauth sessions. #### AuthType One AuthType One is a SHA-256 based authentication method. 1 2 3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 0 | Auth Type | | +-+-+-+-+-+-+-+-+ Hash Code + 4 | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 8 | Unused | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ With AuthType One, the shared secret (password) is hashed once and the result is appended to the 32 byte output from scalar multiplication of the curve25519 keys these 64 bytes are hashed again with SHA-256 to make the symmetric key to be used for the session. It is also hashed a second time and the result copied over the first 8 bytes of the authentication header before the AuthType field is set. The effect being that the "Hash Code" field contains bytes 2 through 8 the hash of the hash of the password (counting indexes from 1). This is used as a sort of username so that the other end knows which password to try using in the handshake. AuthType Two is preferred to this method because it may be harder to crack (does not leak bytes of the value computing from the password). ##### AuthType Two AuthType Two is similar to AuthType One, except that bytes 2 to 8 of the Hash Code are bytes 2 to 8 of the SHA-256 hash of a login, which is known by the received of the packet to be associated with the password used for making the symmetric secret. ## Pulling It All Together The journey of a packet begins at the user interface device (TUN or similar). The user sends an IPv6 packet which comes in to the TUN device and enters the engine, it is checked to make sure its source and destination addresses are valid and then a router lookup is made on the destination address. cjdns addresses are the first 16 bytes of the SHA-512 of the SHA-512 of the public key. All addresses must begin with the byte `0xFC` otherwise they are invalid, generating a key is done by brute force key generation until the result of the double SHA-512 begins with `0xFC`. After the router lookup, the node compares the destination address to the address of the next router, if they are the same, the inner layer of encryption is skipped. Assuming they are different, the IPv6 header is copied to a safe place and a CryptoAuth session is selected for the destination address, or created if there is none, and the packet content is passed through it. The IPv6 header is re-applied on top of the CryptoAuth header for the content, the packet length field in the IPv6 header is notably *not* altered to reflect the headers which are now under it. The packet is now ready to send to the selected router. For sending the packet to the router, a CryptoAuth session is selected for the router's address and the packet, from IPv6 header down, is passed through it. A switch header is applied to the resulting encrypted structure and it is sent down to the switch for routing. The switch takes the packet and sends it to a network module which uses yet another CryptoAuth session to encipher and authenticate the packet from the switch header down. The resulting data is packaged in a network packet and sent to the switch at the next node. Upon receiving the packet, the next node sends the packet through its CryptoAuth session thus revealing the switch header and it sends the packet to its switch. The switch most likely will send the packet out to another endpoint as per the dictate of the packet label but may send it to its router, eventually the node for which the packet is destined will receive it. The router, upon receiving the packet will examine it to see if it appears to be a CryptoAuth Connect To Me packet, Hello packet, or Key packet. If it is one of these, it will insert the IPv6 address, as derived from the public key in the header, into a hashtable so it can be looked up by the switch label. Otherwise it will do a lookup. If the Address cannot be found in its hashtable, it will try asking the router if it knows of a node by that label and if all fails, the packet will be dropped. From the IPv6 address, it will lookup the CryptoAuth session or create one if necessary, then pass the opaque data through the CryptoAuth session to get the decrypted IPv6 header. If the source address for the packet is the same as the double SHA-512 of the public key for the router from which it came, it's assumed to have no inner layer of encryption and it is written to the TUN device as it is. If its source address is different, it is passed back through a CryptoAuth session as selected based on the source IPv6 address. The IPv6 header is then moved up to meet the content (into the place where the CryptoAuth header had been) and the final packet is written out to the TUN device. ![illustration](https://github.com/cjdelisle/cjdns/raw/master/doc/CjdnsModules.png) [OLSR]: http://tools.ietf.org/html/rfc3626 [HSLS]: https://en.wikipedia.org/wiki/Hazy_Sighted_Link_State_Routing_Protocol [BATMAN]: https://en.wikipedia.org/wiki/B.A.T.M.A.N. [1]: http://www.cert.org/tech_tips/denial_of_service.html [2]: http://www.verisigninc.com/assets/whitepaper-ddos-threat-forrester.pdf "DDoS: A Threat You Can’t Afford To Ignore" [LISP]: http://lisp.cisco.com/ [Bitcoin]: http://www.bitcoin.org/ "BitCoin: a decentralized electronic cash system using peer-to-peer networking, digital signatures and cryptographic proof to enable irreversible payments between parties without relying on trust." [Namecoin]: http://dot-bit.org/Main_Page "Namecoin: a peer-to-peer generic name/value datastore system based on Bitcoin technology (a decentralized cryptocurrency)." [IPsec]: https://en.wikipedia.org/wiki/IPsec "IPsec: a protocol suite for securing Internet Protocol (IP) communications by authenticating and encrypting each IP packet of a communication session. IPsec also includes protocols for establishing mutual authentication between agents at the beginning of the session and negotiation of cryptographic keys to be used during the session." [DNSSEC]: https://en.wikipedia.org/wiki/Domain_Name_System_Security_Extensions "A suite of Internet Engineering Task Force (IETF) specifications for securing certain kinds of information provided by the Domain Name System (DNS) as used on Internet Protocol (IP) networks. It is a set of extensions to DNS which provide to DNS clients (resolvers) origin authentication of DNS data, authenticated denial of existence, and data integrity, but not availability or confidentiality." [DNS]: https://en.wikipedia.org/wiki/Domain_Name_System "A hierarchical distributed naming system for computers, services, or any resource connected to the Internet or a private network. It associates various information with domain names assigned to each of the participating entities. Most importantly, it translates domain names meaningful to humans into the numerical identifiers associated with networking equipment for the purpose of locating and addressing these devices worldwide." [P2P]: https://en.wikipedia.org/wiki/Peer-to-peer "Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads among peers. Peers are equally privileged, equipotent participants in the application. They are said to form a peer-to-peer network of nodes." [Internet]: https://en.wikipedia.org/wiki/Internet "A global system of interconnected computer networks that use the standard Internet protocol suite (TCP/IP) to serve billions of users worldwide. It is a network of networks that consists of millions of private, public, academic, business, and government networks, of local to global scope, that are linked by a broad array of electronic, wireless and optical networking technologies." [DDoS]: https://en.wikipedia.org/wiki/Denial-of-service_attack "An attempt to make a computer or network resource unavailable to its intended users. Although the means to carry out, motives for, and targets of a DoS attack may vary, it generally consists of the concerted efforts of a person, or multiple people to prevent an Internet site or service from functioning efficiently or at all, temporarily or indefinitely." [bEncode]: https://en.wikipedia.org/wiki/Bencode "The encoding used by the peer-to-peer file sharing system BitTorrent for storing and transmitting loosely structured data." [DHT]: https://en.wikipedia.org/wiki/Distributed_hash_table "A class of a decentralized distributed system that provides a lookup service similar to a hash table; (key, value) pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures." [BitTorrent]: https://en.wikipedia.org/wiki/BitTorrent_(protocol) "A peer-to-peer file sharing protocol used for distributing large amounts of data over the Internet. " [Kademlia]: http://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia-lncs.pdf