libp2p specification

Overview

This repository contains the specifications for libp2p, a framework and suite of protocols for building peer-to-peer network applications. libp2p has several implementations, with more in development.

The main goal of this repository is to provide accurate reference documentation for the aspects of libp2p that are independent of language or implementation. This includes wire protocols, addressing conventions, and other "network level" concerns.

For user-facing documentation, please see https://docs.libp2p.io

In addition to describing the current state of libp2p, the specs repository serves as a coordination point and a venue to drive future developments in libp2p. For the short and long term roadmap see ROADMAP.md. To participate in the evolution of libp2p via the specs process, please see the Contributions section.

Status

The specifications for libp2p are currently incomplete, and we are working to address this by revising existing specs to ensure correctness and writing new specifications to detail currently unspecified parts of libp2p.

This document replaces an earlier RFC, which still contains much useful information and is helpful for understanding the libp2p design philosophy. It is available at _archive/README.md.

Specification Index

This index contains links to all the spec documents that are currently merged. If documents are moved to new locations within the repository, this index will be updated to reflect the new locations.

Specs Framework

These specs define processes for the specification framework itself, such as the expected lifecycle and document formatting.

Core Abstractions and Types

These specs define abstractions and data types that form the "core" of libp2p and are used throughout the system.

  • Addressing - Working with addresses in libp2p.
  • Connections and Upgrading - Establishing secure, multiplexed connections between peers, possibly over insecure, single stream transports.
  • Peer Ids and Keys - Public key types & encodings, peer id calculation, and message signing semantics.

Protocols

These specs define wire protocols that are used by libp2p for connectivity, security, multiplexing, and other purposes.

The protocols described below all use protocol buffers (aka protobuf) to define message schemas.

Existing protocols may use proto2 and may continue to do so; proto3 is recommended for new protocols, since it simplifies proto2 and removes some footguns. For context and a discussion around proto3 vs. proto2, see #465.

  • ping - Ping protocol
  • autonat - NAT detection
  • identify - Exchange keys and addresses with other peers
  • kademlia - The Kademlia Distributed Hash Table (DHT) subsystem
  • mdns - Local peer discovery with zero configuration using multicast DNS
  • mplex - The friendly stream multiplexer
  • yamux - Yet Another Multiplexer
  • noise - The libp2p Noise handshake
  • plaintext - An insecure transport for non-production usage
  • pnet - Private networking in libp2p using pre-shared keys
  • pubsub - PubSub interface for libp2p
    • gossipsub - An extensible baseline PubSub protocol
      • episub - Proximity Aware Epidemic PubSub for libp2p
  • relay - Circuit Switching for libp2p (similar to TURN)
    • dcutr - Direct Connection Upgrade through Relay protocol
  • rendezvous - Rendezvous Protocol for generalized peer discovery
  • secio - SECIO, a transport security protocol for libp2p
  • tls - The libp2p TLS Handshake (TLS 1.3+)
  • quic - The libp2p QUIC Handshake
  • webrtc - The libp2p WebRTC transports
  • WebTransport - Using WebTransport in libp2p

Contributions

Thanks for your interest in improving libp2p! We welcome contributions from all interested parties. Please take a look at the Spec Lifecycle document to get a feel for how the process works, and open an issue if there's work you'd like to discuss.

For discussions about libp2p that aren't specific to a particular spec, or if you feel an issue isn't the appropriate place for your topic, please join our discussion forum and post a new topic in the contributor's section.

libp2p specification

This document presents libp2p, a modularized and extensible network stack that overcomes the networking challenges faced when building peer-to-peer applications. libp2p is used by IPFS as its networking library.

Authors:

Reviewers:

  • N/A

Abstract

This document describes the IPFS network protocol. The network layer provides point-to-point transports (reliable and unreliable) between any two IPFS nodes in the network.

This document defines the spec implemented in libp2p.

Status of this spec

Organization of this document

This RFC is organized by chapters described on the Table of contents section. Each of the chapters can be found in its own file.

Table of contents

Other specs that haven't made it into the main document

Contribute

Please contribute! Dive into the issues!

Please be aware that all interactions related to this project are subject to the IPFS Code of Conduct.

License

CC-BY-SA 3.0 License © Protocol Labs Inc.

1 Introduction

While developing IPFS, the InterPlanetary FileSystem, we came to learn about several challenges imposed by having to run a distributed file system on top of heterogeneous devices, with different network setups and capabilities. During this process, we had to revisit the whole network stack and elaborate solutions to overcome the obstacles imposed by design decisions of the several layers and protocols, without breaking compatibility or recreating technologies.

In order to build this library, we focused on tackling problems independently, creating less complex solutions with powerful abstractions that, when composed, can offer an environment for a peer-to-peer application to work successfully.

⚠️ Warning: parts of this document are incomplete and out of date. Please see this issue, and look for deprecation notices throughout. ⚠️

1.1 Motivation

libp2p is the result of our collective experience of building a distributed system, in that it puts responsibility on developers to decide how they want an app to interoperate with others in the network, and favors configuration and extensibility instead of making assumptions about the network setup.

In essence, a peer using libp2p should be able to communicate with another peer using a variety of different transports, including connection relay, and talk over different protocols, negotiated on demand.

1.2 Goals

Our goals for the libp2p specification and its implementations are:

  • Enable the use of various:
    • transports: TCP, UDP, SCTP, UDT, uTP, QUIC, SSH, etc.
    • authenticated transports: TLS, DTLS, CurveCP, SSH
  • Make efficient use of sockets (connection reuse)
  • Enable communications between peers to be multiplexed over one socket (avoiding handshake overhead)
  • Enable multiprotocols and respective versions to be used between peers, using a negotiation process
  • Be backwards compatible
  • Work in current systems
  • Use the full capabilities of current network technologies
  • Have NAT traversal
  • Enable connections to be relayed
  • Enable encrypted channels
  • Make efficient use of underlying transports (e.g. native stream muxing, native auth, etc.)

2 An analysis of the state of the art in network stacks

This section presents to the reader an analysis of the available protocols and architectures for network stacks. The goal is to provide the foundations from which to infer the conclusions and understand why libp2p has the requirements and architecture that it has.

2.1 The client-server model

The client-server model indicates that both parties at the ends of the channel have different roles, that they support different services and/or have different capabilities, or in other words, that they speak different protocols.

Building client-server applications has been the natural tendency for a number of reasons:

  • The bandwidth inside a data center is considerably higher than that available for clients to connect to each other.
  • Data center resources are considerably cheaper, due to efficient usage and bulk stocking.
  • It makes it easier for the developer and system admin to have fine grained control over the application.
  • It reduces the number of heterogeneous systems to be handled (although the number is still considerable).
  • Systems like NAT make it really hard for client machines to find and talk with each other, forcing a developer to perform very clever hacks to traverse these obstacles.
  • Protocols started to be designed with the assumption that a developer will create a client-server application from the start.

We even learned how to hide all the complexity of a distributed system behind gateways on the Internet, using protocols that were designed to perform a point-to-point operation, such as HTTP, making it opaque for the application to see and understand the cascade of service calls made for each request.

libp2p offers a move towards dialer-listener interactions, away from the client-server model, where it is not implicit which of the entities, dialer or listener, has which capabilities or is enabled to perform which actions. Setting up a connection between two applications today is a multilayered problem to solve, and these connections should not have a purpose bias; instead, they should support several other protocols working on top of the established connection. In a client-server model, a server sending data without a prior request from the client is known as a push model, which typically adds more complexity; in a dialer-listener model, by comparison, both entities can perform requests independently.

2.2 Categorizing the network stack protocols by solutions

Before diving into the libp2p protocols, it is important to understand the large diversity of protocols already in wide use and deployment that help maintain today's simple abstractions. For example, when one thinks about an HTTP connection, one might naively just think that HTTP/TCP/IP are the main protocols involved, but in reality many more protocols participate, depending on the usage, the networks involved, and so on. Protocols like DNS, DHCP(v6), ARP, NDISC, OSPF, Ethernet, 802.11 (Wi-Fi) and many others get involved. Looking inside ISPs' own networks would reveal dozens more.

Additionally, it's worth noting that the traditional 7-layer OSI model characterization does not fit libp2p. Instead, we categorize protocols based on their role, i.e. the problem they solve. The upper layers of the OSI model are geared towards point-to-point links between applications, whereas the libp2p protocols speak more towards various sizes of networks, with various properties, under various different security models. Different libp2p protocols can have the same role (in the OSI model, this would be "address the same layer"), meaning that multiple protocols can run simultaneously, all addressing one role (instead of one-protocol-per-layer in traditional OSI stacking). For example, bootstrap lists, mDNS, DHT discovery, and PEX are all forms of the role "Peer Discovery"; they can coexist and even synergize.

2.2.1 Establishing the physical link

  • Ethernet
  • Wi-Fi
  • Bluetooth
  • USB

2.2.2 Addressing a machine or process

  • IPv4
  • IPv6
  • Hidden addressing, like SDP

2.2.3 Discovering other peers or services

  • ARP
  • NDISC
  • DHCP(v6)
  • DNS
  • Onion

2.2.4 Routing messages through the network

  • RIP(1, 2)
  • OSPF
  • BGP
  • PPP
  • Tor
  • I2P
  • cjdns

2.2.5 Transport

  • TCP
  • UDP
  • UDT
  • QUIC
  • WebRTC data channel

2.2.6 Agreed semantics for applications to talk to each other

  • RMI
  • Remoting
  • RPC
  • HTTP

2.3 Current shortcomings

Although we currently have a panoply of protocols available for our services to communicate, the abundance and variety of solutions creates its own problems. It is currently difficult for an application to be able to support and be available through several transports (e.g. the lack of TCP/UDP stack in browser applications).

There is also no 'presence linking', meaning that there isn't a notion for a peer to announce itself in several transports, so that other peers can guarantee that it is always the same peer.

3 Requirements and considerations

3.1 Transport agnostic

libp2p is transport agnostic, so it can run over any transport protocol. It does not even depend on IP; it may run on top of NDN, XIA, and other new Internet architectures.

In order to reason about possible transports, libp2p uses multiaddr, a self-describing addressing format. This makes it possible for libp2p to treat addresses opaquely everywhere in the system, and have support for various transport protocols in the network layer. The actual format of addresses in libp2p is ipfs-addr, a multiaddr that ends with an IPFS node id. For example, these are all valid ipfs-addrs:

# IPFS over TCP over IPv6 (typical TCP)
/ip6/fe80::8823:6dff:fee7:f172/tcp/4001/ipfs/QmYJyUMAcXEw1b5bFfbBbzYu5wyyjLMRHXGUkCXpag74Fu

# IPFS over uTP over UDP over IPv4 (UDP-shimmed transport)
/ip4/162.246.145.218/udp/4001/utp/ipfs/QmYJyUMAcXEw1b5bFfbBbzYu5wyyjLMRHXGUkCXpag74Fu

# IPFS over IPv6 (unreliable)
/ip6/fe80::8823:6dff:fee7:f172/ipfs/QmYJyUMAcXEw1b5bFfbBbzYu5wyyjLMRHXGUkCXpag74Fu

# IPFS over TCP over IPv4 over TCP over IPv4 (proxy)
/ip4/162.246.145.218/tcp/7650/ip4/192.168.0.1/tcp/4001/ipfs/QmYJyUMAcXEw1b5bFfbBbzYu5wyyjLMRHXGUkCXpag74Fu

# IPFS over Ethernet (no IP)
/ether/ac:fd:ec:0b:7c:fe/ipfs/QmYJyUMAcXEw1b5bFfbBbzYu5wyyjLMRHXGUkCXpag74Fu
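
As a hedged illustration (assuming the go-multiaddr package; other languages have equivalent implementations), these self-describing addresses can be parsed and inspected generically, without knowing the transport in advance:

package main

import (
  "fmt"

  ma "github.com/multiformats/go-multiaddr"
)

func main() {
  // Parse one of the ipfs-addrs above. NewMultiaddr validates the
  // string against the multiaddr protocol table.
  addr, err := ma.NewMultiaddr(
    "/ip4/162.246.145.218/udp/4001/utp/ipfs/QmYJyUMAcXEw1b5bFfbBbzYu5wyyjLMRHXGUkCXpag74Fu")
  if err != nil {
    panic(err)
  }

  // Because the address is self-describing, each protocol's value can
  // be extracted by its code, whatever the overall stack looks like.
  ip, _ := addr.ValueForProtocol(ma.P_IP4)
  port, _ := addr.ValueForProtocol(ma.P_UDP)
  fmt.Println(ip, port) // 162.246.145.218 4001
}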

Note: At this time, no unreliable implementations exist. The protocol's interface for defining and using unreliable transport has not been defined. For more information on unreliable vs reliable transport, see here. In the context of WebRTC, CTRL+F "reliable" here.

3.2 Multi-multiplexing

The libp2p protocol is a collection of multiple protocols. In order to conserve resources, and to make connectivity easier, libp2p can perform all its operations through a single port, such as a TCP or UDP port, depending on the transports used. libp2p can multiplex its many protocols through point-to-point connections. This multiplexing is for both reliable streams and unreliable datagrams.

libp2p is pragmatic. It seeks to be usable in as many settings as possible, to be modular and flexible to fit various use cases, and to force as few choices as possible. Thus the libp2p network layer provides what we're loosely referring to as "multi-multiplexing":

  • can multiplex multiple listen network interfaces
  • can multiplex multiple transport protocols
  • can multiplex multiple connections per peer
  • can multiplex multiple client protocols
  • can multiplex multiple streams per protocol, per connection (SPDY, HTTP2, QUIC, SSH)
  • has flow control (backpressure, fairness)
  • encrypts each connection with a different ephemeral key

To give an example, imagine a single IPFS node that:

  • listens on a particular TCP/IP address
  • listens on a different TCP/IP address
  • listens on a SCTP/UDP/IP address
  • listens on a UDT/UDP/IP address
  • has multiple connections to another node X
  • has multiple connections to another node Y
  • has multiple streams open per connection
  • multiplexes streams over HTTP2 to node X
  • multiplexes streams over SSH to node Y
  • one protocol mounted on top of libp2p uses one stream per peer
  • one protocol mounted on top of libp2p uses multiple streams per peer

Not providing this level of flexibility makes it impossible to use libp2p in various platforms, use cases, or network setups. It is not important that all implementations support all choices; what is critical is that the spec is flexible enough to allow implementations to use precisely what they need. This ensures that complex user or application constraints do not rule out libp2p as an option.

3.3 Encryption

Communications on libp2p may be:

  • encrypted
  • signed (not encrypted)
  • clear (not encrypted, not signed)

We take both security and performance seriously. We recognize that encryption is not viable for some in-datacenter high performance use cases.

We recommend that:

  • implementations encrypt all communications by default
  • implementations are audited
  • users operate with encrypted communications only, unless unencrypted operation is absolutely necessary.

libp2p uses ciphersuites like TLS.

Note: We do not use TLS directly, because we do not want the CA system baggage. Most TLS implementations are very big. Since the libp2p model begins with keys, libp2p only needs to apply ciphers. This is a minimal portion of the whole TLS standard.

3.4 NAT traversal

Network Address Translation is ubiquitous in the Internet. Not only are most consumer devices behind many layers of NAT, but most data center nodes are often behind NAT for security or virtualization reasons. As we move into containerized deployments, this is getting worse. IPFS implementations SHOULD provide a way to traverse NATs, otherwise it is likely that operation will be affected. Even nodes meant to run with real IP addresses must implement NAT traversal techniques, as they may need to establish connections to peers behind NAT.

libp2p accomplishes full NAT traversal using an ICE-like protocol. It is not exactly ICE, as IPFS networks provide the possibility of relaying communications over the IPFS protocol itself, for coordinating hole-punching or even relaying communication.

It is recommended that implementations use one of the many NAT traversal libraries available, such as libnice, libwebrtc, or natty. However, NAT traversal must be interoperable.

3.5 Relay

Unfortunately, due to symmetric NATs, container and VM NATs, and other impossible-to-bypass NATs, libp2p MUST fall back to relaying communication to establish a full connectivity graph. To be complete, implementations MUST support relay, though it SHOULD be optional and able to be turned off by end users.

Connection relaying SHOULD be implemented as a transport, in order to be transparent to upper layers.

For an instantiation of relaying, see the p2p-circuit transport.

3.6 Enable several network topologies

Different systems have different requirements and with that comes different topologies. In the P2P literature we can find these topologies being enumerated as: unstructured, structured, hybrid and centralized.

Centralized topologies are the most common in web application infrastructure; they require a given service or services to be present at all times at a known static location, so that other services can access them. Unstructured networks represent a type of P2P network whose topology is completely random, or at least non-deterministic, while structured networks have an implicit way of organizing themselves. Hybrid networks are a mix of the last two.

With this in consideration, libp2p must be ready to perform different routing mechanisms and peer discovery, in order to build the routing tables that will enable services to propagate messages or to find each other.

3.7 Resource discovery

libp2p also solves the problem of resource discoverability inside a network through records. A record is a unit of data that can be digitally signed, timestamped and/or used with other methods to give it an ephemeral validity. These records hold pieces of information such as location or availability of resources present in the network. These resources can be data, storage, CPU cycles and other types of services.

libp2p must not put a constraint on the location of resources, but instead offer ways to find them easily in the network or use a side channel.

3.8 Messaging

Efficient messaging protocols offer ways to deliver content with minimum latency and/or support large and complex topologies for distribution. libp2p seeks to incorporate the developments made in Multicast and PubSub to fulfil these needs.

3.9 Naming

Networks change and applications need a way to use the network that is agnostic to its topology; naming solves this issue.

4 Architecture

⚠️ Warning: this section is incomplete, and parts of it are out of date. Please see this issue to track progress on improving it. ⚠️

libp2p was designed around the Unix Philosophy of creating small components that are easy to understand and test. These components should also be able to be swapped in order to accommodate different technologies or scenarios and also make it feasible to upgrade them over time.

Although different peers can support different protocols depending on their capabilities, any peer can act as a dialer and/or a listener for connections from other peers; once established, connections can be reused by both ends, removing the distinction between clients and servers.

The libp2p interface acts as a thin veneer over a multitude of subsystems that are required in order for peers to be able to communicate. These subsystems are allowed to be built on top of other subsystems as long as they respect the standardized interface. The main areas where these subsystems fit are:

  • Peer Routing - Mechanism to decide which peers to use for routing particular messages. This routing can be done recursively, iteratively or even in a broadcast/multicast mode.
  • Swarm - Handles everything that touches the 'opening a stream' part of libp2p, from protocol muxing, stream muxing, NAT traversal and connection relaying, while being multi-transport.
  • Distributed Record Store - A system to store and distribute records. Records are small entries used by other systems for signaling, establishing links, announcing peers or content, and so on. They have a similar role to DNS in the broader Internet.
  • Discovery - Finding or identifying other peers in the network.

Each of these subsystems exposes a well known interface (see chapter 6 for Interfaces) and may use each other in order to fulfill their goal. A global overview of the system is:

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                  libp2p                                         │
└─────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────┐┌─────────────────┐┌──────────────────────────┐┌───────────────┐
│   Peer Routing  ││      Swarm      ││ Distributed Record Store ││  Discovery    │
└─────────────────┘└─────────────────┘└──────────────────────────┘└───────────────┘

4.1 Peer Routing

A Peer Routing subsystem exposes an interface to identify which peers a message should be routed to in the DHT. It receives a key and must return one or more PeerInfo objects.

We present two examples of possible Peer Routing subsystems, the first based on the Kademlia DHT and the second based on mDNS. Nevertheless, other Peer Routing mechanisms can be implemented, as long as they fulfil the same expectation and interface.

┌──────────────────────────────────────────────────────────────┐
│       Peer Routing                                           │
│                                                              │
│┌──────────────┐┌────────────────┐┌──────────────────────────┐│
││ kad-routing  ││ mDNS-routing   ││ other-routing-mechanisms ││
││              ││                ││                          ││
││              ││                ││                          ││
│└──────────────┘└────────────────┘└──────────────────────────┘│
└──────────────────────────────────────────────────────────────┘

4.1.1 kad-routing

kad-routing implements the Kademlia Routing table, where each peer holds a set of k-buckets, each of them containing several PeerInfo objects from other peers in the network.

4.1.2 mDNS-routing

mDNS-routing uses mDNS probes to identify whether local area network peers have a given key or are simply present.

4.2 Swarm

4.2.1 Stream Muxer

The stream muxer must implement the interface offered by interface-stream-muxer.

4.2.2 Protocol Muxer

Protocol muxing is handled at the application level instead of the conventional way at the port level (where different services/protocols listen on different ports). This enables several protocols to be muxed over the same socket, saving the cost of doing NAT traversal for more than one port.

Protocol multiplexing is done through multistream, a protocol to negotiate different types of streams (protocols) using multicodec.

4.2.3 Transport

4.2.4 Crypto

4.2.5 Identify

Identify is one of the protocols mounted on top of Swarm, our Connection handler. However, it follows and respects the same pattern as any other protocol when it comes to mounting it on top of Swarm. Identify enables us to trade listenAddrs and observedAddrs between peers, which is crucial for the working of IPFS. Since every socket we open uses REUSEPORT, an observedAddr reported by another peer can enable a third peer to connect to us, since the port will already be open and redirected to us through the NAT.
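
As a rough sketch of the information traded (the field names below are illustrative assumptions; the authoritative protobuf schema lives in the identify spec):

package identify

// Info is illustrative only: the shape of the data exchanged by identify.
// The wire format is a protobuf message defined in the identify spec;
// these field names are assumptions for exposition, not normative.
type Info struct {
  PublicKey    []byte   // sender's public key, from which its peer id is derived
  ListenAddrs  [][]byte // multiaddrs the sender listens on, in binary multiaddr form
  ObservedAddr []byte   // the address the sender observed for the remote peer
  Protocols    []string // protocol ids the sender can speak
}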

4.2.6 Relay

See Circuit Relay.

4.3 Distributed Record Store

4.3.1 Record

Follows IPRS spec.

4.3.2 abstract-record-store

4.3.3 kad-record-store

4.3.4 mDNS-record-store

4.3.5 s3-record-store

4.4 Discovery

4.4.1 mDNS-discovery

mDNS-discovery is a Discovery Protocol that uses mDNS over local area networks with zero configuration. Local area network peers are very useful to peer-to-peer protocols, because of their low latency links.

The mDNS-discovery specification describes how to use mDNS to discover other peers.

mDNS-discovery is a standalone protocol and does not depend on any other libp2p protocol. mDNS-discovery can yield peers available in the local area network, without relying on other infrastructure. This is particularly useful in intranets, networks disconnected from the Internet backbone, and networks which temporarily lose links.

mDNS-discovery can be configured per-service (i.e. discover only peers participating in a specific protocol, like IPFS), and with private networks (discover peers belonging to a private network).

We are exploring ways to make mDNS-discovery beacons encrypted (so that other nodes in the local network cannot discern what service is being used), though the nature of mDNS will always reveal local IP addresses.

Privacy note: mDNS advertises in local area networks, which reveals IP addresses to listeners in the same local network. It is not recommended to use this with privacy-sensitive applications or oblivious routing protocols.

4.4.2 random-walk

Random-Walk is a Discovery Protocol for DHTs (and other protocols with routing tables). It makes random DHT queries in order to learn about a large number of peers quickly. This causes the DHT (or other protocols) to converge much faster, at the expense of a small load at the very beginning.

4.4.3 bootstrap-list

Bootstrap-List is a Discovery Protocol that uses local storage to cache the addresses of highly stable (and somewhat trusted) peers available in the network. This allows protocols to "find the rest of the network". This is essentially the same way that DNS bootstraps itself (though note that changing the DNS bootstrap list -- the "dot domain" addresses -- is not easy to do, by design).

  • The list should be stored in long-term local storage, whatever that means to the local node (e.g. to disk).
  • Protocols can ship a default list hardcoded or along with the standard code distribution (like DNS).
  • In most cases (and certainly in the case of IPFS) the bootstrap list should be user configurable, as users may wish to establish separate networks, or place their reliance and trust in specific nodes.
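
A minimal sketch of such a list, reusing example addresses that appear elsewhere in this document (the entries are illustrative, not a real default set):

package bootstrap

// DefaultPeers is an illustrative hardcoded bootstrap list. A real
// implementation would ship a default list with the code distribution,
// persist updates to long-term local storage (e.g. disk), and let the
// user replace it entirely to join a separate network.
var DefaultPeers = []string{
  "/ip4/104.131.131.82/udp/4001/udt/ipfs/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
  "/ip4/104.131.67.168/udp/1038/utp/ipfs/QmU184wLPg7afQjBjwUUFkeJ98Fp81GhHGurWvMqwvWEQN",
}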

4.5 Messaging

4.5.1 PubSub

See pubsub/ and pubsub/gossipsub/.

4.6 Naming

4.6.1 IPRS

IPRS spec

4.6.2 IPNS

5 Data structures

The network protocol deals with these data structures:

6 Interfaces

⚠️ Warning: this section is incomplete, and parts of it are out of date. Please see this issue to track progress on improving it. ⚠️

libp2p is a collection of several protocols working together to offer a common solid interface that can talk with any other network addressable process. This is made possible by shimming currently existing protocols and implementations into a set of explicit interfaces: Peer Routing, Discovery, Stream Muxing, Transports, Connections and so on.

6.1 libp2p

libp2p, the top module that provides an interface to all the other modules that make a libp2p instance, must offer an interface for dialing to a peer and plugging in all of the modules (e.g. which transports) we want to support. We present the libp2p interface and UX in section 6.6, after presenting every other module interface.

6.2 Peer Routing

A Peer Routing module offers a way for a libp2p Node to find the PeerInfo of another Node, so that it can dial that node. In its most pure form, a Peer Routing module should have an interface that takes a 'key', and returns a set of PeerInfos. See https://github.com/libp2p/interface-peer-routing for the interface and tests.

6.3 Swarm

Current interface available and updated at:

https://github.com/libp2p/js-libp2p-swarm#usage

6.3.1 Transport

https://github.com/libp2p/interface-transport

6.3.2 Connection

https://github.com/libp2p/interface-connection

6.3.3 Stream Muxing

https://github.com/libp2p/interface-stream-muxer

6.4 Distributed Record Store

https://github.com/libp2p/interface-record-store

6.5 Peer Discovery

A Peer Discovery module interface should return PeerInfo objects, as it finds new peers to be considered by our Peer Routing modules.

6.6 libp2p interface and UX

libp2p implementations should allow a node to be instantiated programmatically, or to use a previously compiled library with some of the protocol decisions already made, so that the user can reuse or expand it.

Constructing a libp2p instance programmatically

Example made with JavaScript, should be mapped to other languages:

var Libp2p = require('libp2p')

var node = new Libp2p()

// add a swarm instance
node.addSwarm(swarmInstance)

// add one or more Peer Routing mechanisms
node.addPeerRouting(peerRoutingInstance)

// add a Distributed Record Store
node.addDistributedRecordStore(distributedRecordStoreInstance)

Configuring libp2p is quite straightforward since most of the configuration comes from instantiating the several modules, one at a time.

Dialing and Listening for connections to/from a peer

Ideally, libp2p uses its own mechanisms (PeerRouting and Record Store) to find a way to dial to a given peer:

node.dial(PeerInfo)

To receive an incoming connection, specify one or more protocols to handle:

node.handleProtocol('<multicodec>', function (duplexStream) {

})

Finding a peer

Finding a peer is done through Peer Routing, so the interface is the same.

Storing and Retrieving Records

Like Finding a peer, Storing and Retrieving records is done through Record Store, so the interface is the same.

7 Properties

⚠️ Warning: this section is incomplete, and parts of it are out of date. Please see this issue to track progress on improving it. ⚠️

7.1 Communication Model - Streams

The Network layer handles all the problems of connecting to a peer, and exposes simple bidirectional streams. Users can both open a new stream (NewStream) and register a stream handler (SetStreamHandler). The user is then free to implement whatever wire messaging protocol she desires. This makes it easy to build peer-to-peer protocols, as the complexities of connectivity, multi-transport support, flow control, and so on, are handled.

To help capture the model, consider that:

  • NewStream is similar to making a Request in an HTTP client.
  • SetStreamHandler is similar to registering a URL handler in an HTTP server

So a protocol, such as a DHT, could:

node := p2p.NewNode(peerid)

// register a handler, here it is simply echoing everything.
node.SetStreamHandler("/helloworld", func (s Stream) {
  io.Copy(s, s)
})

// make a request.
buf1 := []byte("Hello World!")
buf2 := make([]byte, len(buf1))

stream, _ := node.NewStream("/helloworld", peerid) // open a new stream
stream.Write(buf1)  // write to the remote
stream.Read(buf2)   // read what was sent back
fmt.Println(buf2)   // print what was sent back

7.2 Ports - Constrained Entrypoints

In the Internet of 2015, we have a processing model where a program may be running without the ability to open multiple -- or even single -- network ports. Most hosts are behind NAT, whether of the household ISP variety or the new containerized data-center type. And some programs may even be running in browsers, with no ability to open sockets directly (sort of). This presents challenges to completely peer-to-peer networks that aspire to connect any hosts together -- whether they're running on a page in the browser, or in a container within a container.

IPFS only needs a single channel of communication with the rest of the network. This may be a single TCP or UDP port, or a single connection through WebSockets or WebRTC. In a sense, the role of the TCP/UDP network stack -- i.e. multiplexing applications and connections -- may now be forced to happen at the application level.

7.3 Transport Protocols

IPFS is transport-agnostic. It can run on any transport protocol. The ipfs-addr format (which is an IPFS-specific multiaddr) describes the transport. For example:

# ipv4 + tcp
/ip4/10.1.10.10/tcp/29087/ipfs/QmVcSqVEsvm5RR9mBLjwpb2XjFVn5bPdPL69mL8PH45pPC

# ipv6 + tcp
/ip6/2601:9:4f82:5fff:aefd:ecff:fe0b:7cfe/tcp/1031/ipfs/QmRzjtZsTqL1bMdoJDwsC6ZnDX1PW1vTiav1xewHYAPJNT

# ipv4 + udp + udt
/ip4/104.131.131.82/udp/4001/udt/ipfs/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ

# ipv4 + udp + utp
/ip4/104.131.67.168/udp/1038/utp/ipfs/QmU184wLPg7afQjBjwUUFkeJ98Fp81GhHGurWvMqwvWEQN

IPFS delegates the transport dialing to a multiaddr-based network package, such as go-multiaddr-net. It is advisable to build modules like this in other languages, and scope the implementation of other transport protocols.

Some of the transport protocols we will be using:

  • UTP
  • UDT
  • SCTP
  • WebRTC (SCTP, etc)
  • WebSockets
  • TCP Remy

7.4 Non-IP Networks

Efforts like NDN and XIA are new architectures for the Internet, which are closer to the model IPFS uses than what IP provides today. IPFS will be able to operate on top of these architectures trivially, as there are no assumptions made about the network stack in the protocol. Implementations will likely need to change, but changing implementations is vastly easier than changing protocols.

7.5 On the wire

We have the hard constraint of making IPFS work across any duplex stream (an outgoing and an incoming stream pair, any arbitrary connection) and work on any platform.

To make this work, IPFS has to solve a few problems:

7.5.1 Protocol-Multiplexing

Protocol Multiplexing means running multiple different protocols over the same stream. This could happen sequentially (one after the other), or concurrently (at the same time, with their messages interleaved). We achieve protocol multiplexing using three pieces:

7.5.2 multistream - self-describing protocol stream

multistream is a self-describing protocol stream format. It is extremely simple. Its goal is to define a way to add headers to protocols that describe the protocol itself. It is sort of like adding versions to a protocol, but extremely explicit.

For example:

/ipfs/QmVXZiejj3sXEmxuQxF2RjmFbEiE9w7T82xDn3uYNuhbFb/ipfs-dht/0.2.3
<dht-message>
<dht-message>
...

7.5.3 multistream-selector - self-describing protocol stream selector

multistream-select is a simple multistream protocol that allows listing and selecting other protocols. This means that Protomux has a list of registered protocols, listens for one, and then nests (or upgrades) the connection to speak the registered protocol. This takes direct advantage of multistream: it enables interleaving multiple protocols, as well as inspecting what protocols might be spoken by the remote endpoint.

For example:

/ipfs/QmdRKVhvzyATs3L6dosSb6w8hKuqfZK2SyPVqcYJ5VLYa2/multistream-select/0.3.0
/ipfs/QmVXZiejj3sXEmxuQxF2RjmFbEiE9w7T82xDn3uYNuhbFb/ipfs-dht/0.2.3
<dht-message>
<dht-message>
...

7.5.4 Stream Multiplexing

Stream Multiplexing is the process of multiplexing (or combining) many different streams into a single one. This is a complicated subject because it enables protocols to run concurrently over the same wire, and all sorts of notions regarding fairness, flow control, head-of-line blocking, etc. start affecting the protocols. In practice, stream multiplexing is well understood and there are many stream multiplexing protocols. To name a few:

  • HTTP/2
  • SPDY
  • QUIC
  • SSH

IPFS nodes are free to support whatever stream multiplexors they wish, on top of the default one. The default one is there to enable even the simplest of nodes to speak multiple protocols at once. The default multiplexor will be HTTP/2 (or maybe QUIC?), but implementations for it are sparse, so we are beginning with SPDY. We simply select which protocol to use with a multistream header.

For example:

/ipfs/QmdRKVhvzyATs3L6dosSb6w8hKuqfZK2SyPVqcYJ5VLYa2/multistream-select/0.3.0
/ipfs/Qmb4d8ZLuqnnVptqTxwqt3aFqgPYruAbfeksvRV1Ds8Gri/spdy/3
<spdy-header-opening-a-stream-0>
/ipfs/QmVXZiejj3sXEmxuQxF2RjmFbEiE9w7T82xDn3uYNuhbFb/ipfs-dht/0.2.3
<dht-message>
<dht-message>
<spdy-header-opening-a-stream-1>
/ipfs/QmVXZiejj3sXEmxuQxF2RjmFbEiE9w7T82xDn3uYNuhbFb/ipfs-bitswap/0.3.0
<bitswap-message>
<bitswap-message>
<spdy-header-selecting-stream-0>
<dht-message>
<dht-message>
<dht-message>
<dht-message>
<spdy-header-selecting-stream-1>
<bitswap-message>
<bitswap-message>
<bitswap-message>
<bitswap-message>
...

7.5.5 Portable Encodings

In order to be ubiquitous, we must use hyper-portable format encodings, those that are easy to use in various other platforms. Ideally these encodings are well-tested in the wild, and widely used. There may be cases where multiple encodings have to be supported (and hence we may need a multicodec self-describing encoding), but this has so far not been needed. For now, we use protobuf for all protocol messages exclusively, but other good candidates are capnp, bson, and ubjson.

7.5.6 Secure Communications

The wire protocol is -- of course -- wrapped with encryption. We use ciphersuites similar to TLS. This is explained further in requirements and considerations: encryption.

7.5.7 Protocol Multicodecs

Here, we present a table with the multicodecs defined for each IPFS protocol that has a wire component. This list may change over time and currently exists as a guide for implementation.

protocol        multicodec                    comment
secio           /secio/1.0.0
TLS             /tls/1.3.0                    not implemented
plaintext       /plaintext/1.0.0
spdy            /spdy/3.1.0
yamux           /yamux/1.0.0
multiplex       /mplex/6.7.0
identify        /ipfs/id/1.0.0
ping            /ipfs/ping/1.0.0
circuit-relay   /libp2p/relay/circuit/0.1.0   spec
diagnostics     /ipfs/diag/net/1.0.0
Kademlia DHT    /ipfs/kad/1.0.0
bitswap         /ipfs/bitswap/1.0.0

8 Implementations

A libp2p implementation should (recommended) follow a certain level of granularity when implementing different modules and functionalities, so that common interfaces are easy to expose, test and check for interoperability with other implementations.

This is the list of current modules available for libp2p:

  • libp2p (entry point)
  • Swarm
  • Peer Routing
    • libp2p-kad-routing
    • libp2p-mDNS-routing
  • Discovery
    • libp2p-mdns-discovery
    • libp2p-random-walk
    • libp2p-railing
  • Distributed Record Store
  • Generic
    • PeerInfo
    • PeerId
    • multihash
    • multiaddr
    • multistream
    • multicodec
    • ipld
    • repo

Current known implementations (or WIP) are:

8.1 Swarm

8.1.1 Swarm Dialer

The swarm dialer manages making a successful connection to a target peer, given a stream of addresses as inputs, and making sure to respect any and all rate limits imposed. To this end, we have designed the following logic for dialing:

DialPeer(peerID) {
	if PeerIsBeingDialed(peerID) {
		waitForDialToComplete(peerID)
		return BestConnToPeer(peerID)
	}
	
	StartDial(peerID)

	waitForDialToComplete(peerID)
	return BestConnToPeer(peerID)
}

	
StartDial(peerID) {
	addrs = getAddressStream(peerID)

	addrs.onNewAddr(function(addr) {
		if rateLimitCanDial(peerID, addr) {
			doDialAsync(peerID, addr)
		} else {
			rateLimitScheduleDial(peerID, addr)
		}
	})
}

// doDialAsync starts dialing to a specific address without blocking.
// when the dial returns, it releases rate limit tokens, and if it
// succeeded, will finalize the dial process.
doDialAsync(peerID, addr) {
	go transportDial(addr, function(conn, err) {
		rateLimitReleaseTokens(peerID, addr)

		if err != null {
			// handle error
		}

		dialSuccess(conn)
	})
}

// rateLimitReleaseTokens checks for any tokens the given dial
// took, and then for each of them, checks if any other dial is waiting
// for any of those tokens. If waiting dials are found, those dials are started
// immediately. Otherwise, the tokens are released to their pools.
rateLimitReleaseTokens(peerID, addr) {
	tokens = tokensForDial(peerID, addr)

	for token in tokens {
		dial = dialWaitingForToken(token)
		if dial != null {
			doDialAsync(dial.peer, dial.addr)
		} else {
			token.release()
		}
	}
	
}

IPRS - InterPlanetary Record System spec

Authors: Juan Benet

Reviewers:


The Spec for IPRS.

This spec defines IPRS (InterPlanetary Record System), a system for distributed record keeping meant to operate across networks that may span massive distances (>1 AU) or suffer long partitions (>1 Earth year). IPRS is meant to establish a common record-keeping layer to solve common problems. It is a well-layered protocol: it is agnostic to underlying replication and transport systems, and it supports a variety of different applications. IPRS is part of the InterPlanetary File System Project, and is general enough to be used in a variety of other systems.

Definitions

Records

A (distributed) record is a piece of data meant to be transmitted and stored in various computers across a network. It carries a value external to the record-keeping system, which clients of the record system use. For example: a namespace record may store the value of a name. All record systems include a notion of record validity, which allows users of the record system to verify whether the value of a record is correct and valid under the user's circumstances. For example, a record's validity may depend upon a cryptographic signature, a range of spacetime, et-cetera.

Record System

A (distributed) record system is a protocol which defines a method for crafting, serializing, distributing, verifying, and using records over computer networks. (e.g. the Domain Name System).

  • crafting - construction of a record (the process of calculating the values of a record)
  • serializing - formatting a record into a bitstring.
  • distributing - transportation of a record from one set of computers to another.
  • verifying - checking a record's values to ensure correctness and validity.

Validity Schemes

A validity scheme is a small sub-protocol that defines how a record's validity is to be calculated. Validity is the quality of a record being usable by a user at a particular set of circumstances (e.g. a range of time). It is distinct from correctness in that correctness governs whether the record was correctly constructed, and validity governs whether a record may still be used. All valid records must be correct. For simplicity, the process of checking correctness is included in the validity scheme.

For example, suppose Alice and Bob want to store records on a public bulletin board. To make sure their records are not tampered with, Alice and Bob decide they will include cryptographic signatures. This can ensure correctness. Further, they also agree to add new records every day, to detect whether their records are being replayed or censored. Thus, their validity scheme might be:

// note: "crypto" here stands for a hypothetical package whose key
// types expose Sign and Verify; it is not the standard library.
type Record struct {
  Value     []byte
  Expires   time.Time
  Signature []byte
}

func signablePart(r *Record) []byte {
  var sigbuf bytes.Buffer
  sigbuf.Write(r.Value)
  sigbuf.Write([]byte(r.Expires.Format(time.RFC3339)))
  return sigbuf.Bytes()
}

func MakeRecord(value []byte, authorKey crypto.PrivateKey) Record {
  rec := Record{}
  rec.Value = value

  // establish an expiration date, one day out
  rec.Expires = time.Now().Add(24 * time.Hour)

  // cryptographically sign the record
  rec.Signature = authorKey.Sign(signablePart(&rec))

  return rec
}

func VerifyRecord(rec Record, authorKey crypto.PublicKey) (ok bool) {

  // always check the signature first
  sigok := authorKey.Verify(rec.Signature, signablePart(&rec))
  if !sigok {
    return false // sig did not check out! forged record?
  }

  // check the expiration.
  if time.Now().After(rec.Expires) {
    return false // not valid anymore :(
  }

  // everything seems ok!
  return true
}
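
Under the same assumptions (the hypothetical crypto package above), using the scheme is a craft-then-verify exchange:

// Alice crafts a record and posts it to the bulletin board;
// Bob fetches it and checks it before trusting the value.
func example(alicePriv crypto.PrivateKey, alicePub crypto.PublicKey) {
  rec := MakeRecord([]byte("alice's value"), alicePriv)

  if VerifyRecord(rec, alicePub) {
    // correct (signed by Alice) and still valid (not yet expired):
    // safe to use rec.Value
  }
}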

Note that even in such a simple system, we already called out to two other systems Alice and Bob are subscribing to:

  • a Public Key Infrastructure (PKI) that lets Alice and Bob know each other's keys, and verify the validity of messages authored by each other.
  • a Time Infrastructure (TI) that lets Alice and Bob agree upon a common notion of time intervals and validity durations.

Both of these are large systems on their own, which impose constraints and security parameters on the record system. For example, if Alice and Bob think that NTP timestamps are a good TI, the validity of their records is dependent on their ability to establish an accurate NTP timestamp securely (i.e. they need secure access to shared clocks). Another TI might be to "use the last observed record", and this also is dependent on having a secure announcement channel.

IPRS is Validity Scheme Agnostic, meaning that it seeks to establish a common way to craft and distribute records for users of a system without necessarily tying them down to specific world-views (e.g. "NTP is a secure way to keep time", "The CA system is a secure PKI"), or forcing them to work around specific system choices that impose constraints unreasonable for their use case (e.g. "Real-Time Video Over TOR").

Merkle DAG and IPFS Objects

A merkle dag is a directed acyclic graph whose links are (a) hashes of the edge target, and (b) contained within the edge source. (syn. merkle tree, hash tree)

In this spec, the merkle dag (specific one) refers to the IPFS merkle dag. IPFS Object refers to objects in the merkle dag, which follow the IPFS merkledag format. (Read those specs)

Constraints

IPRS has the following hard constraints:

  • MUST be transport agnostic. (transport refers to how computers communicate).
  • MUST be replication agnostic. (replication refers to the protocol computers use to transfer and propagate whole records and other objects)
  • MUST be validity scheme agnostic. (validity scheme includes PKI, TI, and other "agreed upon" trusted infrastructure)
  • MUST be trustless: no trusted third parties are imposed by IPRS (though some may be adopted by a validity scheme. e.g. root CAs in the CA system PKI, or a blockchain in a blockchain TI). In most cases, users may have to trust each other (as they must trust the record value -- e.g. DNS), but in some cases there may be cryptographic schemes that enable full trustlessness.

It is easy to be agnostic to transport, replication, and validity scheme as long as users can expect to control or agree upon the programs or protocols used in concert with IPRS. Concretely, the user can select specific transports or validity schemes to suit the user's application constraints. It is the user's responsibility to ensure both record crafters and verifiers agree upon these selections.

Construction

The Objects

IPRS records are expressed as merkledag objects. This means that the records are linked authenticated data structures, and can be natively replicated over IPFS itself and other merkledag distribution systems.

The objects:

  • A Record object expresses a value, a validity scheme, and validity data.
  • A Signature object could be used to sign and authenticate a record.
  • An Encryption object could be used to encrypt a record.

Record Node {
  Scheme   Link // link to a validity scheme
  Value    Link // link to an object representing the value.
  Version  Data // record version number
  Validity Data // data needed to satisfy the validity scheme
}

To achieve good performance, record storage and transfer should bundle all the necessary objects and transmit them together. While "the record object" is only one of the dag objects, "the full record" means a bundle of all objects needed to fully represent, verify, and use the record. (This recommendation does not necessarily include data that records describe, for example an ipfs provider record (which signals to consumers that certain data is available) would not include the data itself as part of "the full record").

The Interface

The IPRS interface is below. It has a few types and functions. We use the Go language to express it, but this is language agnostic.

// Record is the base type. user can define other types that
// extend Record.
type Record struct {
  Scheme    Link // link to the validity scheme
  Signature Link // link to a cryptographic signature over the rest of record
  Value     Data // an opaque value
}

// Validator is a function that returns whether a record is valid.
// Users provide their own Validator implementations.
type Validator func(r *Record) (bool, error)

// Order is a function that sorts two records based on validity.
// This means that one record should be preferred over the other.
// there must be a total order. if return is 0, then a == b.
// Return value is -1, 0, 1.
type Order func(a, b *Record) int

// Marshal/Unmarshal specify a way to encode/decode the record
type Marshal func(r *Record) ([]byte, error)
type Unmarshal func(r *Record, data []byte) error

Interface Example

For example, Alice and Bob earlier could use the following interface:

type Record struct {
  Scheme    Link // link to the validity scheme
  Expires   Data // datetime at which record expires
  Value     Data // an opaque value
}


func Validator(r *Record) (bool, error) {
  authorKey := recordSigningKey(r)

  // always check the signature first
  sigok := authorKey.Verify(r.Signature, signablePart(r))
  if !sigok {
    return false, errors.New("invalid signature. forged record?")
  }

  // check the expiration.
  if r.Expires < time.Now() {
    return false, errors.New("record expired.")
  }

  return true, nil
}

func Order(a, b *Record) int {
  if a.Expires > b.Expires {
    return 1
  }
  if a.Expires < b.Expires {
    return -1
  }

  // only return 0 if records are the exact same record.
  // otherwise, if the ordering doesn't matter (in this case
  // because the expiry is the same) return one of them
  // deterministically. Comparing the hashes takes care of this.
  ra := a.Hash()
  rb := b.Hash()
  return bytes.Compare(ra, rb)
}

func Marshal(r *Record) ([]byte, error) {
  return recordProtobuf.Marshal(r)
}

func Unmarshal(r *Record, d []byte) (error) {
  return recordProtobuf.Unmarshal(r, d)
}

Example Record Types

For ease of use, IPRS implementations should include a set of common record types:

  • signed, valid within a datetime range
  • signed, expiring after a Time-To-Live
  • signed, based on ancestry (chain)
  • signed, with cryptographic freshness

Signed, valid within a datetime range

This record type uses digital signatures (and thus a PKI) and timestamps (and thus a TI). It establishes that a record is valid during a particular datetime range. 0 (beginning of time), and infinity (end of time) can express unbounded validity.

Signed, expiring after a Time-To-Live

This record type uses digital signatures (and thus a PKI) and TTLs (and thus a TI). It establishes that a record is valid for a certain amount of time after a particular event. For example, an event may be "upon receipt" to specify that a record is valid for a given amount of time after a processor first receives it. This is equivalent to the way DNS sets expiries.
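
A minimal sketch of this receipt-based variant, with illustrative types and names:

// TTLRecord is valid for TTL after the local "receipt" event, mirroring
// how DNS resource records expire relative to when a resolver caches them.
type TTLRecord struct {
  Value    []byte
  TTL      time.Duration
  Received time.Time // set by the receiving node, not by the crafter
}

func (r TTLRecord) Valid(now time.Time) bool {
  return now.Before(r.Received.Add(r.TTL))
}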

Signed, based on ancestry (chain)

This record type uses digital signatures (and thus a PKI) and merkle-links to other, previous records. It establishes that the "most recent" record (merkle-ordered) is the most valid. This functions similar to a git commit chain.

Signed, with cryptographic freshness

This record type uses digital signatures (and thus a PKI) and a cryptographic notion of freshness (and therefore a TI). It establishes that records are only valid if within some threshold of recent time. It is similar to a TTL.

Addressing in libp2p

How network addresses are encoded and used in libp2p

Lifecycle Stage   Maturity         Status   Latest Revision
3A                Recommendation   Active   r0, 2021-07-22

Authors: @yusefnapora

Interest Group: [@mxinden, @Stebalien, @raulk, @marten-seemann, @vyzo]

See the lifecycle document for context about the maturity level and spec status.

Table of Contents

Overview

libp2p makes a distinction between a peer's identity and its location. A peer's identity is stable, verifiable, and valid for the entire lifetime of the peer (whatever that may be for a given application). Peer identities are derived from public keys as described in the peer id spec.

On a particular network, at a specific point in time, a peer may have one or more locations, which can be represented using addresses. For example, I may be reachable via the global IPv4 address of 198.51.100.0 on TCP port 1234.

In a system that only supported TCP/IP or UDP over IP, we could easily write our addresses with the familiar <ip>:<port> notation and store them as tuples of address and port. However, libp2p was designed to be transport agnostic, which means that we can't assume that we'll even be using an IP-backed network at all.

To support a growing set of transport protocols without special-casing each addressing scheme, libp2p uses multiaddr to encode network addresses for all supported transport protocols, in a self-describing manner.

This document does not cover the address format itself (multiaddr), but rather how multiaddr is used in libp2p. For details on the former, visit the linked spec. For more information on other use cases, or to find links to multiaddr implementations in various languages, see the multiaddr repository.

multiaddr in libp2p

multiaddrs are used throughout libp2p for encoding network addresses. When addresses need to be shared or exchanged between processes, they are encoded in the binary representation of multiaddr.

When exchanging addresses, peers send a multiaddr containing both their network address and peer id, as described in the section on the p2p multiaddr.

multiaddr basics

A multiaddr is a sequence of instructions that can be traversed to some destination.

For example, the /ip4/198.51.100.0/tcp/1234 multiaddr starts with ip4, which is the lowest-level protocol that requires an address. The tcp protocol runs on top of ip4, so it comes next.

The multiaddr above consists of two components, the /ip4/198.51.100.0 component and the /tcp/1234 component. It's not possible to split either one further; /ip4 alone is an invalid multiaddr, because the ip4 protocol was defined to require a 32 bit address. Similarly, tcp requires a 16 bit port number.

Although we referred to /ip4/198.51.100.0 and /tcp/1234 as "components" of a larger TCP/IP address, each is actually a valid multiaddr according to the multiaddr spec. However, not every syntactically valid multiaddr is a functional description of a process in the network. As we've seen, even a simple TCP/IP connection requires composing two multiaddrs into one. See the section on composing multiaddrs for information on how multiaddrs can be combined, and the Transport multiaddrs section for the combinations that describe valid transport addresses.

The multiaddr protocol table contains all currently defined protocols and the length of their address components.
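
For instance, the Go implementation exposes this table programmatically (a sketch assuming the go-multiaddr package):

package main

import (
  "fmt"

  ma "github.com/multiformats/go-multiaddr"
)

func main() {
  // Look up an entry from the protocol table: tcp has a fixed
  // 16-bit address component (the port number).
  p := ma.ProtocolWithName("tcp")
  fmt.Println(p.Name, p.Code, p.Size) // tcp 6 16
}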

Composing multiaddrs

As shown above, protocol addresses can be composed within a multiaddr in a way that mirrors the composition of protocols within a networking stack.

The terms generally used to describe composition of multiaddrs are "encapsulation" and "decapsulation", and they essentially refer to adding and removing protocol components from a multiaddr, respectively.

Encapsulation

A protocol is said to be "encapsulated within" another protocol when data from an "inner" protocol is wrapped by another "outer" protocol, often by re-framing the data from the inner protocol into the type of packets, frames or datagrams used by the outer protocol.

Some examples of protocol encapsulation are HTTP requests encapsulated within TCP/IP streams, or TCP segments themselves encapsulated within IP datagrams.

The multiaddr format was designed so that addresses encapsulate each other in the same manner as the protocols that they describe. The result is an address that begins with the "outermost" layer of the network stack and works progressively "inward". For example, in the address /ip4/198.51.100.0/tcp/80/ws, the outermost protocol is IPv4, which encapsulates TCP streams, which in turn encapsulate WebSockets.

All multiaddr implementations provide a way to encapsulate two multiaddrs into a composite. For example, /ip4/198.51.100 can encapsulate /tcp/42 to become /ip4/198.51.100/tcp/42.

Decapsulation

Decapsulation takes a composite multiaddr and removes an "inner" multiaddr from it, returning the result.

For example, if we start with /ip4/198.51.100.0/tcp/1234/ws and decapsulate /ws, the result is /ip4/198.51.100.0/tcp/1234.

It's important to note that decapsulation returns the original multiaddr up to the last occurrence of the decapsulated multiaddr. This may remove more than just the decapsulated component itself if there are more protocols encapsulated within it. Using our example above, decapsulating either /tcp/1234/ws or /tcp/1234 from /ip4/198.51.100.0/tcp/1234/ws will result in /ip4/198.51.100.0. This is unsurprising if you consider the utility of the /ip4/198.51.100.0/ws address that would result from simply removing the tcp component.
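
To make encapsulation and decapsulation concrete, here is a minimal sketch using the Go multiaddr implementation (go-multiaddr); the package path and method names are assumed from that library, and error handling is elided for brevity:

package main

import (
	"fmt"

	ma "github.com/multiformats/go-multiaddr"
)

func main() {
	// Parse the component multiaddrs.
	ip, _ := ma.NewMultiaddr("/ip4/198.51.100.0")
	tcpWs, _ := ma.NewMultiaddr("/tcp/1234/ws")

	// Encapsulation wraps the inner protocols in the outer one.
	full := ip.Encapsulate(tcpWs)
	fmt.Println(full) // /ip4/198.51.100.0/tcp/1234/ws

	// Decapsulating /tcp/1234 also removes everything encapsulated
	// within it (here, the trailing /ws).
	tcp, _ := ma.NewMultiaddr("/tcp/1234")
	fmt.Println(full.Decapsulate(tcp)) // /ip4/198.51.100.0
}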

The p2p multiaddr

libp2p defines the p2p multiaddr protocol, whose address component is the peer id of a libp2p peer. The text representation of a p2p multiaddr looks like this:

/p2p/QmYyQSo1c1Ym7orWxLYvCrM2EmxFTANf8wXmmE7DWjhx5N

Where QmYyQSo1c1Ym7orWxLYvCrM2EmxFTANf8wXmmE7DWjhx5N is the string representation of a peer's peer ID derived from its public key.

By itself, a p2p address does not give you enough addressing information to locate a peer on the network; it is not a transport address. However, like the ws protocol for WebSockets, a p2p address can be encapsulated within another multiaddr.

For example, the above p2p address can be combined with the transport address on which the node is listening:

/ip4/198.51.100.0/tcp/1234/p2p/QmYyQSo1c1Ym7orWxLYvCrM2EmxFTANf8wXmmE7DWjhx5N

This combination of transport address plus p2p address is the format in which peers exchange addresses over the wire in the identify protocol and other core libp2p protocols.
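
As an illustration, a minimal sketch of splitting such an address back into its peer id and transport parts, assuming the go-libp2p core peer package API:

package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p/core/peer"
	ma "github.com/multiformats/go-multiaddr"
)

func main() {
	addr, _ := ma.NewMultiaddr(
		"/ip4/198.51.100.0/tcp/1234/p2p/QmYyQSo1c1Ym7orWxLYvCrM2EmxFTANf8wXmmE7DWjhx5N")

	// AddrInfoFromP2pAddr separates the trailing /p2p component
	// (the peer id) from the transport address that precedes it.
	info, err := peer.AddrInfoFromP2pAddr(addr)
	if err != nil {
		panic(err)
	}
	fmt.Println(info.ID)    // QmYyQSo1c1Ym7orWxLYvCrM2EmxFTANf8wXmmE7DWjhx5N
	fmt.Println(info.Addrs) // [/ip4/198.51.100.0/tcp/1234]
}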

Historical Note: the ipfs multiaddr Protocol

The p2p multiaddr protocol was originally named ipfs, and support for the ipfs string representation is gradually being phased out. Depending on the implementation in use, the component may still be printed as /ipfs/<peer-id> instead of /p2p/<peer-id> in its string representation. Both names resolve to the same protocol code, and they are equivalent in the binary form.

Transport multiaddrs

Because multiaddr is an open and extensible format, it's not possible to guarantee that any valid multiaddr is semantically meaningful or usable in a particular network. For example, the /tcp/42 multiaddr, while valid, is not useful on its own as a locator.

This section covers the types of multiaddr supported by libp2p transports. It's possible that this section will go out of date as new transport modules are developed, at which point pull-requests to update this document will be greatly appreciated.

IP and Name Resolution

Most libp2p transports use the IP protocol as a foundational layer, and as a result, most transport multiaddrs will begin with a component that represents an IPv4 or IPv6 address.

This may be an actual address, such as /ip4/198.51.100.0 or /ip6/fe80::883:a581:fff1:833, or it could be something that resolves to an IP address, like a domain name.

libp2p will attempt to resolve "name-based" addresses into IP addresses. The current multiaddr protocol table defines four resolvable or "name-based" protocols:

| protocol | description |
|----------|-------------|
| dns | Resolves DNS A and AAAA records into both IPv4 and IPv6 addresses. |
| dns4 | Resolves DNS A records into IPv4 addresses. |
| dns6 | Resolves DNS AAAA records into IPv6 addresses. |
| dnsaddr | Resolves multiaddrs from a special TXT record. |

When the /dns protocol is used, the lookup may result in both IPv4 and IPv6 addresses, in which case IPv6 will be preferred. To explicitly resolve to IPv4 or IPv6 addresses, use the /dns4 or /dns6 protocols, respectively.

Note that in some restricted environments, such as inside a web browser, libp2p may not have access to the resolved IP addresses at all, in which case the runtime will determine what IP version is used.

When a name-based multiaddr encapsulates another multiaddr, only the name-based component is affected by the lookup process. For example, if example.com resolves to 192.0.2.0, libp2p will resolve the address /dns4/example.com/tcp/42 to /ip4/192.0.2.0/tcp/42.

A libp2p-specific DNS-backed format, /dnsaddr, resolves addresses from a TXT record associated with the _dnsaddr subdomain of a given domain.

Note that this is different from dnslink, which uses TXT records to reference content addressed objects.

For example, resolving /dnsaddr/libp2p.io will perform a TXT lookup for _dnsaddr.libp2p.io. If the result contains entries of the form dnsaddr=<multiaddr>, the embedded multiaddrs will be parsed and used.

For example, asking the DNS server for the TXT records of one of the bootstrap nodes, am6.bootstrap.libp2p.io, returns the following records:

> dig +short _dnsaddr.am6.bootstrap.libp2p.io txt
"dnsaddr=/dns6/am6.bootstrap.libp2p.io/tcp/443/wss/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb"
"dnsaddr=/dns4/am6.bootstrap.libp2p.io/tcp/443/wss/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb"
"dnsaddr=/ip6/2604:1380:4602:5c00::3/tcp/4001/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb"
"dnsaddr=/ip4/147.75.87.27/tcp/4001/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb"
"dnsaddr=/ip6/2604:1380:4602:5c00::3/udp/4001/quic-v1/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb"
"dnsaddr=/ip4/147.75.87.27/udp/4001/quic-v1/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb"

The dnsaddr lookup serves a similar purpose to a standard A-record DNS lookup; however, there are differences that can be important for some use cases. The most significant is that the dnsaddr entry contains a full multiaddr, which may include a port number or other information that an A-record lacks, and it may even specify a non-IP transport. Also, there are cases in which the A-record already serves a useful purpose; using dnsaddr provides a second "namespace" for libp2p registrations.
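
For illustration, resolving a dnsaddr in Go might look like the following sketch, assuming the go-multiaddr-dns resolver API:

package main

import (
	"context"
	"fmt"

	ma "github.com/multiformats/go-multiaddr"
	madns "github.com/multiformats/go-multiaddr-dns"
)

func main() {
	// Resolving /dnsaddr/libp2p.io performs a TXT lookup for
	// _dnsaddr.libp2p.io and parses any dnsaddr=<multiaddr> entries.
	addr, _ := ma.NewMultiaddr("/dnsaddr/libp2p.io")

	resolved, err := madns.Resolve(context.Background(), addr)
	if err != nil {
		panic(err)
	}
	for _, a := range resolved {
		fmt.Println(a)
	}
}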

TCP

The libp2p TCP transport is supported in all implementations and can be used wherever TCP/IP sockets are accessible.

Addresses for the TCP transport are of the form <ip-multiaddr>/tcp/<tcp-port>, where <ip-multiaddr> is a multiaddr that resolves to an IP address, as described in the IP and Name Resolution section. The <tcp-port> argument must be a 16-bit unsigned integer.

WebSockets

WebSocket connections are encapsulated within TCP/IP sockets, and the WebSocket multiaddr format mirrors this arrangement.

A libp2p WebSocket multiaddr is of the form <tcp-multiaddr>/ws or <tcp-multiaddr>/wss (TLS-encrypted), where <tcp-multiaddr> is a valid multiaddr for the TCP transport, as described above.

QUIC

QUIC sessions are encapsulated within UDP datagrams, and the libp2p QUIC multiaddr format mirrors this arrangement.

A libp2p QUIC multiaddr is of the form <ip-multiaddr>/udp/<udp-port>/quic, where <ip-multiaddr> is a multiaddr that resolves to an IP address, as described in the IP and Name Resolution section. The <udp-port> argument must be a 16-bit unsigned integer in network byte order.

p2p-circuit Relay Addresses

The libp2p circuit relay protocol allows a libp2p peer A to communicate with another peer B via a third party C. This is useful for circumstances where A and B would be unable to communicate directly.

Once a connection to the relay is established, peers can accept incoming connections through the relay, using a p2p-circuit address.

Like the ws WebSocket multiaddr protocol, the p2p-circuit multiaddr does not carry any additional address information. Instead, it is composed with two other multiaddrs to describe a relay circuit.

A full p2p-circuit address that describes a relay circuit is of the form: <relay-multiaddr>/p2p-circuit/<destination-multiaddr>.

<relay-multiaddr> is the full address for the peer relaying the traffic (the "relay node").

The details of the transport connection between the relay node and the destination peer are usually not relevant to other peers in the network, so <destination-multiaddr> generally only contains the p2p address of the destination peer.

A full example would be:

/ip4/192.0.2.0/tcp/5002/p2p/QmdPU7PfRyKehdrP5A3WqmjyD6bhVpU1mLGKppa2FjGDjZ/p2p-circuit/p2p/QmVT6GYwjeeAF5TR485Yc58S3xRF5EFsZ5YAF4VcP3URHt

Here, the destination peer has the peer id QmVT6GYwjeeAF5TR485Yc58S3xRF5EFsZ5YAF4VcP3URHt and is reachable through a relay node with peer id QmdPU7PfRyKehdrP5A3WqmjyD6bhVpU1mLGKppa2FjGDjZ, which is listening on TCP port 5002 at the IPv4 address 192.0.2.0.

Identify v1.0.0

The identify protocol is used to exchange basic information with other peers in the network, including addresses, public keys, and capabilities.

| Lifecycle Stage | Maturity Level | Status | Latest Revision |
|-----------------|----------------|--------|-----------------|
| 3A | Recommendation | Active | r1, 2021-08-09 |

Authors: @vyzo

Interest Group: @yusefnapora, @tomaka, @richardschneider, @Stebalien, @bigs

See the lifecycle document for context about the maturity level and spec status.


Overview

There are two variations of the identify protocol, identify and identify/push.

identify

The identify protocol has the protocol id /ipfs/id/1.0.0, and it is used to query remote peers for their information.

The protocol works by opening a stream to the remote peer you want to query, using /ipfs/id/1.0.0 as the protocol id string. The peer being identified responds by returning an Identify message and closes the stream.

identify/push

The identify/push protocol has the protocol id /ipfs/id/push/1.0.0, and it is used to inform known peers about changes that occur at runtime.

When a peer's basic information changes, for example, because they've obtained a new public listen address, they can use identify/push to inform others about the new information.

The push variant works by opening a stream to each remote peer you want to update, using /ipfs/id/push/1.0.0 as the protocol id string. When the remote peer accepts the stream, the local peer will send an Identify message and close the stream.

Upon receiving the pushed Identify message, the remote peer should update their local metadata repository with the information from the message. Note that missing fields should be ignored, as peers may choose to send partial updates containing only the fields whose values have changed.

The Identify Message

syntax = "proto2";
message Identify {
  optional string protocolVersion = 5;
  optional string agentVersion = 6;
  optional bytes publicKey = 1;
  repeated bytes listenAddrs = 2;
  optional bytes observedAddr = 4;
  repeated string protocols = 3;
}

protocolVersion

The protocol version identifies the family of protocols used by the peer. The field is optional but recommended for debugging and statistics purposes.

Previous versions of this specification required connections to be closed on version mismatch. This requirement is revoked to allow interoperability between protocol families / networks.

Example value: /my-network/0.1.0.

agentVersion

This is a free-form string, identifying the implementation of the peer. The usual format is agent-name/version, where agent-name is the name of the program or library and version is its semantic version.

publicKey

This is the public key of the peer, marshalled in binary form as specified in peer-ids.

listenAddrs

These are the addresses on which the peer is listening as multi-addresses.

observedAddr

This is the connection source address of the stream-initiating peer as observed by the peer being identified; it is a multi-address. The initiator can use this address to infer the existence of a NAT and its public address.

For example, in the case of a TCP/IP transport the observed addresses will be of the form /ip4/x.x.x.x/tcp/xx. In the case of a circuit relay connection, the observed address will be of the form /p2p/QmRelay/p2p-circuit. In the case of onion transport, there is no observable source address.

protocols

This is a list of protocols supported by the peer.

A node should only advertise a protocol if it's willing to receive inbound streams on that protocol. This is relevant for asymmetrical protocols. For example, assume an asymmetrical request-response style protocol foo, where some peers (clients) only support initiating requests and others (servers) only support responding to them. A client should not advertise foo in its protocols list; otherwise other clients might initiate requests to it that it cannot answer.

Connection Establishment in libp2p

| Lifecycle Stage | Maturity | Status | Latest Revision |
|-----------------|----------|--------|-----------------|
| 1A | Working Draft | Active | r1, 2022-12-07 |

Authors: @yusefnapora

Interest Group: @JustMaier, @vasco-santos, @bigs, @mgoelzer

See the lifecycle document for context about the maturity level and spec status.


Overview

This document describes the process of establishing connections to new peers in libp2p and, if necessary, adding security and stream multiplexing capabilities to "raw" connections provided by transport protocols.

We also discuss opening new streams over an existing connection, and the protocol negotiation process that occurs to route traffic to the correct protocol handler.

This document does not cover the establishment of "transport level" connections, for example opening "raw" TCP sockets, as those semantics are specific to each transport.

What is covered here is the process that occurs after making the initial transport level connection, up to the point where "application level" streams are opened, and their protocols are identified and data is routed appropriately to handler functions.

Definitions

A connection is a reliable, bidirectional communication channel between two libp2p peers that provides security and the ability to open multiple logically independent streams.

Security in this context means that all communications (after an initial handshake) are encrypted, and that the identity of each peer is cryptographically verifiable by the other peer.

Streams are reliable, bidirectional channels that are multiplexed over a libp2p connection. They must support backpressure, which prevents receivers from being flooded by data from eager senders. They can also be "half closed", meaning that a stream can be closed for writing data but still open to receiving data and vice versa.

Support for multiple streams ensures that a single connection between peers can support a wide variety of interactions, each with their own protocol. This is especially helpful if connections are difficult to establish due to NAT traversal issues or other connectivity barriers.

Connections take place over an underlying transport, for example TCP sockets, websockets, or various protocols layered over UDP.

While some transport protocols like QUIC have "built in" security and stream multiplexing, others such as TCP need to have those capabilities layered on top of the "raw" transport connection.

When the base capabilities of security and stream multiplexing are not natively supported by the underlying transport protocol, a connection upgrade process occurs to augment the raw transport connection with the required features.

libp2p peers can both initiate connections to other peers and accept incoming connections. We use the term dial to refer to initiating outbound connections, and listen to refer to accepting inbound connections.

Protocol Negotiation

One of libp2p's core design goals is to be adaptable to many network environments, including those that don't yet exist. To provide this flexibility, the connection upgrade process supports multiple protocols for connection security and stream multiplexing and allows peers to select which to use for each connection.

The process of selecting protocols is called protocol negotiation. In addition to its role in the connection upgrade process, protocol negotiation is also used whenever a new stream is opened over an existing connection. This allows libp2p applications to route application-specific protocols to the correct handler functions.

Each protocol supported by a peer is identified using a unique string called a protocol id. While any string can be used, the conventional format is a path-like structure containing a short name and a version number, separated by / characters. For example: /yamux/1.0.0 identifies version 1.0.0 of the yamux stream multiplexing protocol. multistream-select itself has a protocol id of /multistream/1.0.0.

Including a version number in the protocol id simplifies the case where you want to concurrently support multiple versions of a protocol, perhaps a stable version and an in-development version. By default, libp2p will route each protocol id to its handler function using exact literal matching of the protocol id, so new versions will need to be registered separately. However, the handler function receives the protocol id negotiated for each new stream, so it's possible to register the same handler for multiple versions of a protocol and dynamically alter functionality based on the version in use for a given stream.

multistream-select

libp2p uses a protocol called multistream-select for protocol negotiation. Below we cover the basics of multistream-select and its use in libp2p. For more details, see the multistream-select repository.

Before engaging in the multistream-select negotiation process, it is assumed that the peers have already established a bidirectional communication channel, which may or may not have the security and multiplexing capabilities of a libp2p connection. If those capabilities are missing, multistream-select is used in the connection upgrade process to determine how to provide them.

Messages are sent encoded as UTF-8 byte strings, and they are always followed by a \n newline character. Each message is also prefixed with its length in bytes (including the newline), encoded as an unsigned variable-length integer according to the rules of the multiformats unsigned varint spec.

For example, the string "na" is sent as the following bytes (shown here in hex):

0x036e610a

The first byte is the varint-encoded length (0x03), followed by na (0x6e 0x61), then the newline (0x0a).
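
The framing is simple enough to sketch in a few lines of Go; note that for lengths this small, Go's standard uvarint encoding coincides with the multiformats unsigned-varint encoding:

package main

import (
	"encoding/binary"
	"fmt"
)

// encodeMessage frames a multistream-select message: the payload plus a
// trailing newline, prefixed with the total length as an unsigned varint.
func encodeMessage(msg string) []byte {
	payload := append([]byte(msg), '\n')
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, uint64(len(payload)))
	return append(buf[:n], payload...)
}

func main() {
	fmt.Printf("%x\n", encodeMessage("na")) // 036e610a
}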

The basic multistream-select interaction flow looks like this:

see multistream.plantuml for diagram source

Let's walk through the diagram above. The peer initiating the connection is called the Initiator, and the peer accepting the connection is the Responder.

The Initiator first opens a channel to the Responder. This channel could either be a new connection or a new stream multiplexed over an existing connection.

Next, both peers will send the multistream protocol id to establish that they want to use multistream-select. Both sides may send the initial multistream protocol id simultaneously, without waiting to receive data from the other side. If either side receives anything other than the multistream protocol id as the first message, they abort the negotiation process.

Once both peers have agreed to use multistream-select, the Initiator sends the protocol id for the protocol they would like to use. If the Responder supports that protocol, it will respond by echoing back the protocol id, which signals agreement. If the protocol is not supported, the Responder will respond with the string "na" to indicate that the requested protocol is Not Available.

If the peers agree on a protocol, multistream-select's job is done, and future traffic over the channel will adhere to the rules of the agreed-upon protocol.

If a peer receives a "na" response to a proposed protocol id, they can either try again with a different protocol id or close the channel.

Upgrading Connections

libp2p is designed to support a variety of transport protocols, including those that do not natively support the core libp2p capabilities of security and stream multiplexing. The process of layering capabilities onto "raw" transport connections is called "upgrading" the connection.

Because there are many valid ways to provide the libp2p capabilities, the connection upgrade process uses protocol negotiation to decide which specific protocols to use for each capability. The protocol negotiation process uses multistream-select as described in the Protocol Negotiation section.

When raw connections need both security and multiplexing, security is always established first, and the negotiation for stream multiplexing takes place over the encrypted channel.

Here's an example of the connection upgrade process:

see conn-upgrade.plantuml for diagram source

First, the peers both send the multistream protocol id to establish that they'll use multistream-select to negotiate protocols for the connection upgrade.

Next, the Initiator proposes the TLS protocol for encryption, but the Responder rejects the proposal as they don't support TLS.

The Initiator then proposes the Noise protocol, which is supported by the Responder. The Listener echoes back the protocol id for Noise to indicate agreement.

At this point the Noise protocol takes over, and the peers exchange the Noise handshake to establish a secure channel. If the Noise handshake fails, the connection establishment process aborts. If successful, the peers will use the secured channel for all future communications, including the remainder of the connection upgrade process.

Once security has been established, the peers negotiate which stream multiplexer to use. The negotiation process works in the same manner as before, with the dialing peer proposing a multiplexer by sending its protocol id, and the listening peer responding by either echoing back the supported id or sending "na" if the multiplexer is unsupported.

Once security and stream multiplexing are both established, the connection upgrade process is complete, and both peers are able to use the resulting libp2p connection to open new secure multiplexed streams.

Note: In the case where both peers initially act as initiators, e.g. during NAT hole punching, tie-breaking is done via the multistream-select simultaneous open protocol extension.

Inlining Muxer Negotiation

If both peers support it, it's possible to shortcut the muxer selection by moving it into the security handshake. Details are specified in the inlined muxer negotiation spec.

Opening New Streams Over a Connection

Once we've established a libp2p connection to another peer, new streams are multiplexed over the connection using the native facilities of the transport, or the stream multiplexer negotiated during the upgrade process if the transport lacks native multiplexing. Either peer can open a new stream to the other over an existing connection.

When a new stream is opened, a protocol is negotiated using multistream-select. The protocol negotiation process for new streams is very similar to the one used for upgrading connections. However, while the security and stream multiplexing modules for connection upgrades are typically libp2p framework components, the protocols negotiated for new streams can be easily defined by libp2p applications.

Streams are routed to application-defined handler functions based on their protocol id string. Incoming stream requests will propose a protocol id to use for the stream using multistream-select, and the peer accepting the stream request will determine if there are any registered handlers capable of handling the protocol. If no handlers are found, the peer will respond to the proposal with "na".

When registering protocol handlers, it's possible to use a custom predicate or "match function", which will receive incoming protocol ids and return a boolean indicating whether the handler supports the protocol. This allows more flexible behavior than exact literal matching, which is the default behavior if no match function is provided.
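
As a sketch of how this looks in practice, go-libp2p exposes both exact and predicate-based registration on its host interface; method names here are assumed from that implementation:

package main

import (
	"strings"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/protocol"
)

func main() {
	host, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	defer host.Close()

	handler := func(s network.Stream) {
		// The negotiated id is available via s.Protocol(), so one
		// handler can serve several protocol versions.
		defer s.Close()
	}

	// Default behavior: exact literal matching of the protocol id.
	host.SetStreamHandler("/my-app/1.0.0", handler)

	// Custom match function: accept any /my-app/1.x stream.
	host.SetStreamHandlerMatch("/my-app/1.0.0",
		func(id protocol.ID) bool {
			return strings.HasPrefix(string(id), "/my-app/1.")
		},
		handler)
}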

Practical Considerations

This section will go over a few aspects of connection establishment and state management that are worth considering when implementing libp2p.

Interoperability

Support for connection security protocols and stream multiplexers varies across libp2p implementations. To support the widest variety of peers, implementations should support a baseline "stack" of security and multiplexing protocols.

The recommended baseline security protocol is Noise, which is supported in all current libp2p implementations.

The recommended baseline stream multiplexer is yamux, which provides a very simple programmatic API and is supported in most libp2p implementations.

State Management

While the connection establishment process itself does not require any persistent state, some state management is useful to assist bootstrapping and maintain resource limits.

Peer Metadata Storage

It's recommended that libp2p implementations provide a persistent metadata storage interface that contains at minimum the peer id and last known valid addresses for each peer. This allows you to more easily "catch back up" and rejoin a dense network between invocations of your libp2p application without having to rely on a few bootstrap nodes and random DHT walks to build up a routing table.

Even during a single invocation of an application, you're likely to benefit from an in-memory metadata storage facility, which will allow you to cache addresses for connection resumption. Designing a storage interface which can be backed by memory or persistent storage will let you swap in whichever is appropriate for your use case and stage of development.

For examples, see go-libp2p-peerstore and js-peer-book.

Connection Limits

Maintaining a large number of persistent connections can cause issues with some network environments and can lead to resource exhaustion and erratic behavior.

It's highly recommended that libp2p implementations maintain an upper bound on the number of open connections. Doing so while still maintaining robust performance and connectivity will likely require implementing some kind of priority mechanism for selecting which connections are the most "expendable" when you're near the limit.

Resource allocation, measurement and enforcement policies are all an active area of discussion in the libp2p community, and implementations are free to develop whatever prioritization system makes sense.

Supported protocols

A libp2p node SHOULD scope its set of supported protocols to the underlying physical connection to a peer. It MAY support a protocol only on connections with certain properties, e.g. to limit the use of bandwidth-heavy protocols over a relayed or metered connection. A libp2p node MAY offer different sets of protocols to different peers. It MAY revoke or add the support for a protocol at any time, for example to only offer certain services after learning its NAT status on a connection. Therefore, libp2p nodes SHOULD NOT assume that the set of protocols on a connection is static.

Connection Lifecycle Events

The establishment of new connections and streams is likely to be a "cross-cutting concern" that's of interest to various parts of your application (or parts of libp2p) besides the protocol handlers that directly deal with the traffic.

For example, the persistent metadata component could automatically add peer ids and addresses to its registry whenever a new peer connects, or a DHT module could update its routing tables when a connection is terminated.

To support this, it's recommended that libp2p implementations support a notification or event delivery system that can inform interested parties about connection lifecycle events.

The full set of lifecycle events is not currently specified, but a recommended baseline would be:

| Event | Description |
|-------|-------------|
| Connected | A new connection has been opened |
| Disconnected | A connection has closed |
| OpenedStream | A new stream has opened over a connection |
| ClosedStream | A stream has closed |
| Listen | We've started listening on a new address |
| ListenClose | We've stopped listening on an address |

Hole punching

See hole punching document.

Future Work

A replacement for multistream-select is being discussed which proposes solutions for several inefficiencies and shortcomings in the current protocol negotiation and connection establishment process. The ideal outcome of that discussion will require many changes to this document, once the new multistream semantics are fully specified.

For connection management, there is currently a draft of a connection manager specification that may replace the current connmgr interface in go-libp2p and may also form the basis of other connection manager implementations. There is also a proposal for a more comprehensive resource management system, which would track and manage other finite resources as well as connections.

Also related to connection management, libp2p has recently added support for QUIC, a transport protocol layered on UDP that can resume sessions with much lower overhead than killing and re-establishing a TCP connection. As QUIC and other "connectionless" transports become more widespread, we want to take advantage of this behavior where possible and integrate lightweight session resumption into the connection manager.

Event delivery is also undergoing a refactoring in go-libp2p, with work on an in-process event bus in progress now that will augment (and perhaps eventually replace) the current notification system.

One of the near-term goals of the event bus refactor is to more easily respond to changes in the protocols supported by a remote peer. Those changes are communicated over the wire using the identify/push protocol. Using an event bus allows other, unrelated components of libp2p (for example, a DHT module) to respond to changes without tightly coupling components together with direct dependencies.

While the event bus refactoring is specific to go-libp2p, a future spec may standardize event types used to communicate information across key libp2p subsystems, and may possibly require libp2p implementations to provide an in-process event delivery system. If and when this occurs, this spec will be updated to incorporate the changes.

Ping

| Lifecycle Stage | Maturity | Status | Latest Revision |
|-----------------|----------|--------|-----------------|
| 3A | Recommendation | Active | r0, 2022-11-04 |

Authors: @marcopolo

Interest Group: @marcopolo, @mxinden, @marten-seemann


Protocol

The ping protocol is a simple liveness check that peers can use to test the connectivity and performance between two peers. The libp2p ping protocol is different from the ping command line utility (ICMP ping), as it requires an already established libp2p connection.

The dialing peer sends a 32-byte payload of random binary data on an open stream. The listening peer echoes the same 32-byte payload back to the dialing peer. The dialing peer then measures the RTT from when it wrote the bytes to when it received them.

The dialing peer MAY repeat the process by sending another payload with random bytes on the same stream, where the listening peer SHOULD loop and echo the next payload. The dialing peer SHOULD close the write operation of the stream after sending the last payload, and the listening peer SHOULD finish writing the echoed payload and then exit the loop and close the stream.

The dialing peer MUST NOT keep more than one outbound stream for the ping protocol per peer. The listening peer SHOULD accept at most two streams per peer since cross-stream behavior is non-linear and stream writes occur asynchronously. The listening peer may perceive the dialing peer closing and opening the wrong streams (for instance, closing stream B and opening stream A even though the dialing peer is opening stream B and closing stream A).

The protocol ID is /ipfs/ping/1.0.0.
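
A single round trip can be sketched transport-agnostically in Go; the stream here is any io.ReadWriter obtained by opening a stream negotiated with the protocol ID above:

package ping

import (
	"bytes"
	"crypto/rand"
	"errors"
	"io"
	"time"
)

// pingOnce writes one 32-byte random payload and waits for the peer
// to echo it back, returning the measured round-trip time.
func pingOnce(s io.ReadWriter) (time.Duration, error) {
	payload := make([]byte, 32)
	if _, err := rand.Read(payload); err != nil {
		return 0, err
	}
	start := time.Now()
	if _, err := s.Write(payload); err != nil {
		return 0, err
	}
	echo := make([]byte, 32)
	if _, err := io.ReadFull(s, echo); err != nil {
		return 0, err
	}
	if !bytes.Equal(echo, payload) {
		return 0, errors.New("ping: echoed payload does not match")
	}
	return time.Since(start), nil
}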

Diagram

Ping Protocol Diagram

Instructions to reproduce diagram

From the root, run: plantuml -tsvg ping/ping.md

@startuml
skinparam backgroundColor white

entity Client
entity Server

== /ipfs/ping/1.0.0 ==
loop until Client closes write
    Client -> Server: 32 random bytes
    Client <- Server: Same 32 random bytes
end
@enduml

Peer Ids and Keys

| Lifecycle Stage | Maturity Level | Status | Latest Revision |
|-----------------|----------------|--------|-----------------|
| 3A | Recommendation | Active | r2, 2021-04-30 |

Authors: @mgoelzer, @yusefnapora, @lidel

Interest Group: @raulk, @vyzo, @Stebalien

See the lifecycle document for context about maturity level and spec status.


Overview

libp2p uses cryptographic key pairs to sign messages and derive unique peer identities (or "peer ids").

This document describes the types of keys supported, how keys are serialized for transmission, and how peer ids are generated from the hash of serialized public keys.

Although private keys are not transmitted over the wire, the serialization format used to store keys on disk is also included as a reference for libp2p implementors who would like to import existing libp2p key pairs.

Key encodings and message signing semantics are covered below.

Keys

Libp2p encodes keys in a protobuf containing a key type and the encoded key (where the encoding depends on the type).

Specifically:

syntax = "proto2";

enum KeyType {
	RSA = 0;
	Ed25519 = 1;
	Secp256k1 = 2;
	ECDSA = 3;
}

message PublicKey {
	required KeyType Type = 1;
	required bytes Data = 2;
}

message PrivateKey {
	required KeyType Type = 1;
	required bytes Data = 2;
}

The PublicKey and PrivateKey messages contain a Data field with serialized keys, and a Type enum that specifies the type of key.

Each key type has its own serialization format within the Data field, described below.

Finally, libp2p places a stronger requirement on the protobuf encoder than the protobuf spec: encoding must be deterministic. To achieve this, libp2p imposes the following additional requirements:

  1. Fields must be minimally encoded. That is, varints must use the minimal representation (fewest bytes that can encode the given number).
  2. Fields must be encoded in tag order (i.e., key type, then the key data).
  3. All fields must be included.
  4. No additional fields may be defined.

Note that PrivateKey messages are never transmitted over the wire. Current libp2p implementations store private keys on disk as a serialized PrivateKey protobuf message. libp2p implementors who want to load existing keys can use the PrivateKey message definition to deserialize private key files.
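
As a brief example, the Go implementation's crypto package produces and consumes this protobuf envelope; this sketch assumes the go-libp2p core crypto API:

package main

import (
	"crypto/rand"
	"fmt"

	"github.com/libp2p/go-libp2p/core/crypto"
)

func main() {
	// Generate an Ed25519 key pair.
	_, pub, err := crypto.GenerateEd25519Key(rand.Reader)
	if err != nil {
		panic(err)
	}

	// MarshalPublicKey emits the PublicKey protobuf described above:
	// the wire bytes start with 0801 (KeyType Ed25519) followed by
	// the length-prefixed Data field.
	wire, err := crypto.MarshalPublicKey(pub)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%x\n", wire)

	// Round-trip the bytes back into a key object.
	if _, err := crypto.UnmarshalPublicKey(wire); err != nil {
		panic(err)
	}
}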

Where are keys used?

Keys are used in two places in libp2p. The first is for signing messages. Here are some examples of messages we sign:

  • IPNS records
  • PubSub messages
  • SECIO handshake

The second is for generating peer ids; this is discussed in the section below.

Key Types

Four key types are supported:

  • RSA
  • Ed25519
  • Secp256k1
  • ECDSA

Implementations MUST support Ed25519. Implementations SHOULD support RSA if they wish to interoperate with the mainline IPFS DHT and the default IPFS bootstrap nodes. Implementations MAY support Secp256k1 and ECDSA, but nodes using those keys may not be able to connect to all other nodes.

In all cases, implementations MAY allow the user to enable/disable specific key types via configuration. Note that disabling support for compulsory key types may hinder connectivity.

The following sections describe:

  1. How each key type is encoded into the libp2p key's Data field.
  2. How each key type creates and validates signatures.

Implementations may use whatever in-memory representation is convenient, provided the encodings described below are used at the "I/O boundary".

RSA

We encode the public key using the DER-encoded PKIX format.

We encode the private key as a PKCS1 key using ASN.1 DER.

To sign a message, we first hash it with SHA-256 and then sign it using the RSASSA-PKCS1-V1.5-SIGN method, as originally defined in RSA PKCS#1 v1.5.

Ed25519

Ed25519 specifies the exact format for keys and signatures, so we do not do much additional encoding, except as noted below.

We do not do any special additional encoding for Ed25519 public keys.

The encoding for Ed25519 private keys is a little unusual. There are two formats that we encourage implementors to support:

  • Preferred method is a simple concatenation: [private key bytes][public key bytes] (64 bytes)
  • Older versions of the libp2p code used the following format: [private key][public key][public key] (96 bytes). If you encounter this type of encoding, the proper way to process it is to compare the two public key strings (32 bytes each) and verify they are identical. If they are, then proceed as you would with the preferred method. If they do not match, reject or error out because the byte array is invalid.

Ed25519 signatures follow the normal Ed25519 standard.

Secp256k1

We use the standard Bitcoin EC encoding for Secp256k1 public and private keys.

To sign a message, we hash the message with SHA-256, then sign it using the standard Bitcoin EC signature algorithm (BIP0062), and then use standard Bitcoin encoding.

ECDSA

We encode the public key using ASN.1 DER.

We encode the private key using DER-encoded PKIX.

To sign a message, we hash the message with SHA-256, sign it with the standard ECDSA algorithm, and then encode the signature using DER-encoded ASN.1.

Test vectors

The following test vectors are hex-encoded bytes of the protobuf encoding described above. The provided public key corresponds to the private key. Implementations SHOULD check that they can produce the provided public key from the private key.

| Key | bytes |
|-----|-------|
| ECDSA private key | 08031279307702010104203E5B1FE9712E6C314942A750BD67485DE3C1EFE85B1BFB520AE8F9AE3DFA4A4CA00A06082A8648CE3D030107A14403420004DE3D300FA36AE0E8F5D530899D83ABAB44ABF3161F162A4BC901D8E6ECDA020E8B6D5F8DA30525E71D6851510C098E5C47C646A597FB4DCEC034E9F77C409E62 |
| ECDSA public key | 0803125b3059301306072a8648ce3d020106082a8648ce3d03010703420004de3d300fa36ae0e8f5d530899d83abab44abf3161f162a4bc901d8e6ecda020e8b6d5f8da30525e71d6851510c098e5c47c646a597fb4dcec034e9f77c409e62 |
| ED25519 private key | 080112407e0830617c4a7de83925dfb2694556b12936c477a0e1feb2e148ec9da60fee7d1ed1e8fae2c4a144b8be8fd4b47bf3d3b34b871c3cacf6010f0e42d474fce27e |
| ED25519 public key | 080112201ed1e8fae2c4a144b8be8fd4b47bf3d3b34b871c3cacf6010f0e42d474fce27e |
| secp256k1 private key | 0802122053DADF1D5A164D6B4ACDB15E24AA4C5B1D3461BDBD42ABEDB0A4404D56CED8FB |
| secp256k1 public key | 08021221037777e994e452c21604f91de093ce415f5432f701dd8cd1a7a6fea0e630bfca99 |
| rsa private key | 080012ae123082092a0201000282020100e1beab071d08200bde24eef00d049449b07770ff9910257b2d7d5dda242ce8f0e2f12e1af4b32d9efd2c090f66b0f29986dbb645dae9880089704a94e5066d594162ae6ee8892e6ec70701db0a6c445c04778eb3de1293aa1a23c3825b85c6620a2bc3f82f9b0c309bc0ab3aeb1873282bebd3da03c33e76c21e9beb172fd44c9e43be32e2c99827033cf8d0f0c606f4579326c930eb4e854395ad941256542c793902185153c474bed109d6ff5141ebf9cd256cf58893a37f83729f97e7cb435ec679d2e33901d27bb35aa0d7e20561da08885ef0abbf8e2fb48d6a5487047a9ecb1ad41fa7ed84f6e3e8ecd5d98b3982d2a901b4454991766da295ab78822add5612a2df83bcee814cf50973e80d7ef38111b1bd87da2ae92438a2c8cbcc70b31ee319939a3b9c761dbc13b5c086d6b64bf7ae7dacc14622375d92a8ff9af7eb962162bbddebf90acb32adb5e4e4029f1c96019949ecfbfeffd7ac1e3fbcc6b6168c34be3d5a2e5999fcbb39bba7adbca78eab09b9bc39f7fa4b93411f4cc175e70c0a083e96bfaefb04a9580b4753c1738a6a760ae1afd851a1a4bdad231cf56e9284d832483df215a46c1c21bdf0c6cfe951c18f1ee4078c79c13d63edb6e14feaeffabc90ad317e4875fe648101b0864097e998f0ca3025ef9638cd2b0caecd3770ab54a1d9c6ca959b0f5dcbc90caeefc4135baca6fd475224269bbe1b02030100010282020100a472ffa858efd8588ce59ee264b957452f3673acdf5631d7bfd5ba0ef59779c231b0bc838a8b14cae367b6d9ef572c03c7883b0a3c652f5c24c316b1ccfd979f13d0cd7da20c7d34d9ec32dfdc81ee7292167e706d705efde5b8f3edfcba41409e642f8897357df5d320d21c43b33600a7ae4e505db957c1afbc189d73f0b5d972d9aaaeeb232ca20eebd5de6fe7f29d01470354413cc9a0af1154b7af7c1029adcd67c74b4798afeb69e09f2cb387305e73a1b5f450202d54f0ef096fe1bde340219a1194d1ac9026e90b366cce0c59b239d10e4888f52ca1780824d39ae01a6b9f4dd6059191a7f12b2a3d8db3c2868cd4e5a5862b8b625a4197d52c6ac77710116ebd3ced81c4d91ad5fdfbed68312ebce7eea45c1833ca3acf7da2052820eacf5c6b07d086dabeb893391c71417fd8a4b1829ae2cf60d1749d0e25da19530d889461c21da3492a8dc6ccac7de83ac1c2185262c7473c8cc42f547cc9864b02a8073b6aa54a037d8c0de3914784e6205e83d97918b944f11b877b12084c0dd1d36592f8a4f8b8da5bb404c3d2c079b22b6ceabfbcb637c0dbe0201f0909d533f8bf308ada47aee641a012a494d31b54c974e58b87f140258258bb82f31692659db7aa07e17a5b2a0832c24e122d3a8babcc9ee74cbb07d3058bb85b15f6f6b2674aba9fd34367be9782d444335fbed31e3c4086c652597c27104938b47fa10282010100e9fdf843c1550070ca711cb8ff28411466198f0e212511c3186623890c0071bf6561219682fe7dbdfd81176eba7c4faba21614a20721e0fcd63768e6d925688ecc90992059ac89256e0524de90bf3d8a052ce6a9f6adafa712f3107a016e20c80255c9e37d8206d1bc327e06e66eb24288da866b55904fd8b59e6b2ab31bc5eab47e597093c63fab7872102d57b4c589c66077f534a61f5f65127459a33c91f6db61fc431b1ae90be92b4149a3255291baf94304e3efb77b1107b5a3bda911359c40a53c347ff9100baf8f36dc5cd991066b5bdc28b39ed644f404afe9213f4d31c9d4e40f3a5f5e3c39bebeb244e84137544e1a1839c1c8aaebf0c78a7fad590282010100f6fa1f1e6b803742d5490b7441152f500970f46feb0b73a6e4baba2aaf3c0e245ed852fc31d86a8e46eb48e90fac409989dfee45238f97e8f1f8e83a136488c1b04b8a7fb695f37b8616307ff8a8d63e8cfa0b4fb9b9167ffaebabf111aa5a4344afbabd002ae8961c38c02da76a9149abdde93eb389eb32595c29ba30d8283a7885218a5a9d33f7f01dbdf85f3aad016c071395491338ec318d39220e1c7bd69d3d6b520a13a30d745c102b827ad9984b0dd6aed73916ffa82a06c1c111e7047dcd2668f988a0570a71474992eecf416e068f029ec323d5d635fd24694fc9bf96973c255d26c772a95bf8b7f876547a5beabf86f06cd21b67994f944e7a5493028201010095b02fd30069e547426a8bea58e8a2816f33688dac6c6f6974415af8402244a22133baedf34ce499d7036f3f19b38eb00897c18949b0c5a25953c71aeeccfc8f6594173157cc854bd98f16dffe8f28ca13b77eb43a2730585c49fc3f608cd811bb54b03b84bddaa8ef910988567f783012266199667a546a18fd88271fbf63a45ae4fd4884706da8befb9117c0a4d73de5172f8640b1091ed8a4aea3ed4641463f5ff6a5e3401ad7d0c92811f87956d1fd5f9a1d15c7f3839a08698d9f35f9d966e5000f7cb2655d7b6c4adcd8a9d950ea5f61bb7c9a33c17508f9baa313eecfee4ae493249ebe05a5d7770bbd3551b2eeb752e3649e0636de08e3d672e66cb90282010100ad93e4c31072b063fc5ab5fe22afacece775c795d0efdf7c704cfc027bde0d626a7646fc905bb5a80117e3ca49059af14e0160089f9190065be9bfecf12c3b2145b211c8e89e42dd91c38e9aa23ca73697063564f6f6aa6590088a738722df056004d18d7bccac62b3bafef6172fc2a4b071ea37f31eff7a076bcab7dd144e51a9da8754219352aef2c73478971539fa41de4759285ea626fa3c72e7085be47d554d915bbb5149cb6ef835351f231043049cd941506a034bf2f8767f3e1e42ead92f91cb3d75549b57ef7d56ac39c2d80d67f6a2b4ca192974bfc5060e2dd171217971002193dba12e7e4133ab201f07500a90495a38610279b13a48d54f0c99028201003e3a1ac0c2b67d54ed5c4bbe04a7db99103659d33a4f9d35809e1f60c282e5988dddc964527f3b05e6cc890eab3dcb571d66debf3a5527704c87264b3954d7265f4e8d2c637dd89b491b9cf23f264801f804b90454d65af0c4c830d1aef76f597ef61b26ca857ecce9cb78d4f6c2218c00d2975d46c2b013fbf59b750c3b92d8d3ed9e6d1fd0ef1ec091a5c286a3fe2dead292f40f380065731e2079ebb9f2a7ef2c415ecbb488da98f3a12609ca1b6ec8c734032c8bd513292ff842c375d4acd1b02dfb206b24cd815f8e2f9d4af8e7dea0370b19c1b23cc531d78b40e06e1119ee2e08f6f31c6e2e8444c568d13c5d451a291ae0c9f1d4f27d23b3a00d60ad |
| rsa public key | 080012a60430820222300d06092a864886f70d01010105000382020f003082020a0282020100e1beab071d08200bde24eef00d049449b07770ff9910257b2d7d5dda242ce8f0e2f12e1af4b32d9efd2c090f66b0f29986dbb645dae9880089704a94e5066d594162ae6ee8892e6ec70701db0a6c445c04778eb3de1293aa1a23c3825b85c6620a2bc3f82f9b0c309bc0ab3aeb1873282bebd3da03c33e76c21e9beb172fd44c9e43be32e2c99827033cf8d0f0c606f4579326c930eb4e854395ad941256542c793902185153c474bed109d6ff5141ebf9cd256cf58893a37f83729f97e7cb435ec679d2e33901d27bb35aa0d7e20561da08885ef0abbf8e2fb48d6a5487047a9ecb1ad41fa7ed84f6e3e8ecd5d98b3982d2a901b4454991766da295ab78822add5612a2df83bcee814cf50973e80d7ef38111b1bd87da2ae92438a2c8cbcc70b31ee319939a3b9c761dbc13b5c086d6b64bf7ae7dacc14622375d92a8ff9af7eb962162bbddebf90acb32adb5e4e4029f1c96019949ecfbfeffd7ac1e3fbcc6b6168c34be3d5a2e5999fcbb39bba7adbca78eab09b9bc39f7fa4b93411f4cc175e70c0a083e96bfaefb04a9580b4753c1738a6a760ae1afd851a1a4bdad231cf56e9284d832483df215a46c1c21bdf0c6cfe951c18f1ee4078c79c13d63edb6e14feaeffabc90ad317e4875fe648101b0864097e998f0ca3025ef9638cd2b0caecd3770ab54a1d9c6ca959b0f5dcbc90caeefc4135baca6fd475224269bbe1b0203010001 |

Peer Ids

Peer IDs are derived by hashing the encoded public key with multihash. Keys that serialize to more than 42 bytes must be hashed using the sha256 multihash; keys that serialize to at most 42 bytes must be hashed using the "identity" multihash codec.

Specifically, to compute a peer ID of a key:

  1. Encode the public key as described in the keys section.
  2. If the length of the serialized bytes is less than or equal to 42, compute the "identity" multihash of the serialized bytes. In other words, no hashing is performed, but the multihash format is still followed (byte plus varint plus serialized bytes). The idea here is that if the serialized byte array is short enough, we can fit it in a multihash verbatim without having to condense it using a hash function.
  3. If the length is greater than 42, then hash it using the SHA256 multihash.
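
Expressed in Go against the go-multihash package (constant names assumed from that library), the rule is:

package peerid

import (
	mh "github.com/multiformats/go-multihash"
)

// FromPublicKey computes a peer ID multihash from a serialized
// PublicKey protobuf, applying the 42-byte threshold described above.
func FromPublicKey(serialized []byte) (mh.Multihash, error) {
	if len(serialized) <= 42 {
		// Short keys are embedded verbatim via the identity multihash.
		return mh.Sum(serialized, mh.IDENTITY, -1)
	}
	// Longer keys are condensed with sha2-256.
	return mh.Sum(serialized, mh.SHA2_256, -1)
}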

String representation

There are two ways to represent peer IDs in text: as a raw base58btc encoded multihash (e.g., Qm..., 1...) and as a multibase encoded CID (e.g., bafz...). Libp2p is slowly transitioning from the first (legacy) format to the second (new).

Implementations MUST support parsing both forms of peer IDs. Implementations SHOULD display peer IDs using the first (raw base58btc encoded multihash) format until the second format is widely supported.

Peer IDs encoded as CIDs must be encoded using CIDv1 and must use the libp2p-key multicodec (0x72). By default, such peer IDs SHOULD be encoded using the base32 multibase (RFC4648, without padding).

For reference, CIDs (encoded in text) have the following format:

<multibase-prefix><cid-version><multicodec><multihash>

Encoding

To encode a peer ID using the legacy format, simply encode it with base58btc.

To encode a peer ID using the new format, create a CID with the libp2p-key multicodec and encode it using multibase.

Decoding

To decode a peer ID:

  • If it starts with 1 or Qm, it's a bare base58btc encoded multihash. Decode it according to the base58btc algorithm.
  • If it starts with a multibase prefix, it's a CIDv1 CID. Decode it according to the multibase and CID spec.
    • Once decoded, verify that the CID's multicodec is libp2p-key.
    • Finally, extract the multihash from the CID. This is the peer ID.
  • Otherwise, it's not a valid peer ID.
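
A brief sketch of producing both representations using go-libp2p's peer package (function names assumed from that implementation):

package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p/core/peer"
)

func main() {
	// Decode accepts both the legacy base58btc multihash form and
	// the multibase-encoded CIDv1 form.
	id, err := peer.Decode("QmYyQSo1c1Ym7orWxLYvCrM2EmxFTANf8wXmmE7DWjhx5N")
	if err != nil {
		panic(err)
	}

	// Legacy representation: raw base58btc encoded multihash.
	fmt.Println(id)

	// New representation: CIDv1 with the libp2p-key multicodec,
	// rendered in base32 multibase by default.
	fmt.Println(peer.ToCid(id).String())
}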

Examples:

  • bafzbeie5745rpv2m6tjyuugywy4d5ewrqgqqhfnf445he3omzpjbx5xqxe -- Peer ID (sha256) encoded as a CID.
  • QmYyQSo1c1Ym7orWxLYvCrM2EmxFTANf8wXmmE7DWjhx5N -- Peer ID (sha256) encoded as a raw base58btc multihash.
  • 12D3KooWD3eckifWpRn9wQpMG9R9hX3sD158z7EqHWmweQAJU5SA -- Peer ID (ed25519, using the "identity" multihash) encoded as a raw base58btc multihash.

NAT Discovery

How we detect if we're behind a NAT.


| Lifecycle Stage | Maturity | Status | Latest Revision |
|-----------------|----------|--------|-----------------|
| 3A | Recommendation | Active | r1, 2023-02-16 |

Authors: @marten-seemann

Interest Group: @mxinden, @vyzo, @raulk, @stebalien, @willscott

See the lifecycle document for context about the maturity level and spec status.


Overview

A priori, a node cannot know if it is behind a NAT / firewall or if it is publicly reachable. Knowing its NAT status is essential for the node to be well-behaved in the network: a node that's behind a NAT doesn't need to advertise its (undialable) addresses to the rest of the network, preventing superfluous dials from other peers. Furthermore, it might actively seek to improve its connectivity by finding a relay server, which would allow other peers to establish a relayed connection.

To determine if it is located behind a NAT, nodes use the autonat protocol. Using this protocol, the node requests other peers to dial its presumed public addresses. If a couple of these dial attempts succeed, the node can be reasonably certain that it is not located behind a NAT. Likewise, if a couple of these dial attempts fail, this is a strong indicator that a NAT is blocking incoming connections.

AutoNAT Protocol

The AutoNAT Protocol uses the Protocol ID /libp2p/autonat/1.0.0. The node wishing to determine its NAT status opens a stream using this protocol ID, and then sends a Dial message. The Dial message contains a list of multiaddresses. Upon receiving this message, the peer starts to dial these addresses. It MAY add the observed address of the connection on which the request was received to the list of addresses. It MAY dial the addresses in parallel. The peer MAY also use a different IP and peer ID than it uses for its regular libp2p connection to perform these dial backs.

In order to prevent attacks like the one described in RFC 3489, Section 12.1.1 (see excerpt below), implementations MUST NOT dial any multiaddress unless it is based on the IP address the requesting node is observed as. This restriction also implies that implementations MUST NOT accept dial requests via relayed connections, as one cannot validate the IP address of the requesting node.

RFC 3489 12.1.1 Attack I: DDOS Against a Target

In this case, the attacker provides a large number of clients with the same faked MAPPED-ADDRESS that points to the intended target. This will trick all the STUN clients into thinking that their addresses are equal to that of the target. The clients then hand out that address in order to receive traffic on it (for example, in SIP or H.323 messages). However, all of that traffic becomes focused at the intended target. The attack can provide substantial amplification, especially when used with clients that are using STUN to enable multimedia applications.

If all dials fail, the receiver sends a DialResponse message with the ResponseStatus E_DIAL_ERROR. If at least one of the dials completes successfully, it sends a DialResponse with the ResponseStatus OK. It SHOULD include the address it successfully dialed in its response.

The initiator uses the responses obtained from multiple peers to determine its NAT status. If more than 3 peers report a successfully dialed address, the node SHOULD assume that it is not located behind a NAT and is publicly accessible. On the other hand, if more than 3 peers report unsuccessful dials, the node SHOULD assume that it is not publicly accessible. Nodes are encouraged to periodically re-check their status, especially after changing the set of addresses they're listening on.

RPC messages

Messages are exchanged by:

  1. Opening a new stream.
  2. Sending the RPC request message.
  3. Listening for the RPC response message.

All RPC messages sent over a stream are prefixed with the message length in bytes, encoded as an unsigned variable length integer as defined by the multiformats unsigned-varint spec.

syntax = "proto2";

message Message {
  enum MessageType {
    DIAL          = 0;
    DIAL_RESPONSE = 1;
  }

  enum ResponseStatus {
    OK               = 0;
    E_DIAL_ERROR     = 100;
    E_DIAL_REFUSED   = 101;
    E_BAD_REQUEST    = 200;
    E_INTERNAL_ERROR = 300;
  }

  message PeerInfo {
    optional bytes id = 1;
    repeated bytes addrs = 2;
  }

  message Dial {
    optional PeerInfo peer = 1;
  }

  message DialResponse {
    optional ResponseStatus status = 1;
    optional string statusText = 2;
    optional bytes addr = 3;
  }

  optional MessageType type = 1;
  optional Dial dial = 2;
  optional DialResponse dialResponse = 3;
}
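
The varint length prefix makes reading and writing these RPC messages straightforward; here is a sketch in Go, where the pb package is a hypothetical one generated from the schema above:

package autonat

import (
	"bufio"
	"encoding/binary"
	"io"

	"google.golang.org/protobuf/proto"

	pb "example.com/autonat/pb" // hypothetical package generated from the schema above
)

// writeMsg sends one RPC message prefixed with its length as an
// unsigned varint.
func writeMsg(w io.Writer, msg *pb.Message) error {
	data, err := proto.Marshal(msg)
	if err != nil {
		return err
	}
	lenBuf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(lenBuf, uint64(len(data)))
	if _, err := w.Write(lenBuf[:n]); err != nil {
		return err
	}
	_, err = w.Write(data)
	return err
}

// readMsg reads one length-prefixed RPC message from the stream.
func readMsg(r *bufio.Reader, msg *pb.Message) error {
	size, err := binary.ReadUvarint(r)
	if err != nil {
		return err
	}
	data := make([]byte, size)
	if _, err := io.ReadFull(r, data); err != nil {
		return err
	}
	return proto.Unmarshal(data, msg)
}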

Security Considerations

Note that in the current iteration of this protocol, a node doesn't check if a peer's report of a successful dial is accurate. This might be solved in a future iteration of this protocol, see https://github.com/libp2p/go-libp2p-autonat/issues/10 for a detailed discussion.

AutonatV2: spec

| Lifecycle Stage | Maturity | Status | Latest Revision |
|-----------------|----------|--------|-----------------|
| 1A | Working Draft | Active | r2, 2023-04-15 |

Authors: @sukunrt

Interest Group: @marten-seemann, @marcopolo, @mxinden

Overview

A priori, a node cannot know if it is behind a NAT / firewall or if it is publicly reachable. Moreover, the node may be publicly reachable on some of its addresses and not on others. Knowing the reachability status of its addresses is crucial for proper network behavior: the node can avoid advertising unreachable addresses, reducing unnecessary connection attempts from other peers. If the node has no publicly accessible addresses, it may proactively improve its connectivity by locating a relay server, enabling other peers to connect through a relayed connection.

In autonat v2, the client sends a request with a priority-ordered list of addresses and a nonce. On receiving this request, the server dials the first address in the list that it is capable of dialing and provides the nonce over the resulting connection. Upon completion of the dial, the server responds to the client with the dial outcome.

As the server dials exactly one address from the list, autonat v2 allows nodes to determine reachability for individual addresses. Using autonat v2, nodes can build an address pipeline where they test individual addresses discovered by different sources, like identify, UPnP mappings, or circuit addresses, for reachability. Having a priority-ordered list of addresses provides the ability to verify low-priority addresses: implementations can generate low-priority address guesses and add them to requests for high-priority addresses as a nice-to-have. This is especially helpful when introducing a new transport, which will initially not be widely supported in the network; requests for verifying other addresses can be reused to get information about the new transport's addresses.

The client can verify that the server did successfully dial an address of the same transport as it reported in the response by checking the local address of the connection on which the nonce was received.

Compared to autonat v1, there are three major differences:

  1. autonat v1 allowed testing reachability for the node. autonat v2 allows testing reachability for an individual address.
  2. autonat v2 provides a mechanism for nodes to verify whether the peer actually successfully dialed an address.
  3. autonat v2 provides a mechanism for nodes to dial an IP address different from the requesting node's observed IP address without risking amplification attacks. autonat v1 disallowed such dials to prevent amplification attacks.

AutoNAT V2 Protocol

Autonat V2 Interaction

A client node wishing to determine reachability of its addresses sends a DialRequest message to a server on a stream with protocol ID /libp2p/autonat/2/dial-request. Each DialRequest is sent on a new stream.

This DialRequest message has a list of addresses and a fixed64 nonce. The list is ordered in descending order of priority for verification. AutoNAT V2 is primarily for testing reachability on the public Internet. The client SHOULD NOT send any private address, as defined in RFC 1918, in the list. The server SHOULD NOT dial any private address.

Upon receiving this request, the server selects an address from the list to dial. The server SHOULD use the first address it is willing to dial. The server MUST NOT dial any address other than this one. If this selected address has an IP address different from the requesting node's observed IP address, the server initiates the amplification attack prevention mechanism (see Amplification Attack Prevention). On completion, the server proceeds to the next step. If the selected address has the same IP address as the client's observed IP address, the server proceeds to the next step, skipping the Amplification Attack Prevention steps.

The server dials the selected address, opens a stream with Protocol ID /libp2p/autonat/2/dial-back and sends a DialBack message with the nonce received in the request. The client on receiving this message replies with a DialBackResponse message with the status set to OK. The client MUST close this stream after sending the response. The dial back response provides the server assurance that the message was delivered so that it can close the connection.

Upon completion of the dial back, the server sends a DialResponse message to the client node on the /libp2p/autonat/2/dial-request stream. The response contains addrIdx, the index of the address the server selected to dial and DialStatus, a dial status indicating the outcome of the dial back. The DialStatus for an address is set according to Requirements for DialStatus. The response also contains an appropriate ResponseStatus set according to Requirements For ResponseStatus.

The client MUST check that the nonce received in the DialBack is the same as the nonce it sent in the DialRequest. If the nonce is different, it MUST discard this response.

The server MUST close the stream after sending the response. The client MUST close the stream after receiving the response.

Requirements for DialStatus

On receiving a DialRequest, the server first selects an address that it will dial.

If server chooses to not dial any of the requested addresses, ResponseStatus is set to E_DIAL_REFUSED. The fields addrIdx and DialStatus are meaningless in this case. See Requirements For ResponseStatus.

If the server selects an address for dialing, addrIdx is set to the index (zero-based) of the address in the list, and DialStatus is set according to the following considerations:

If the server was unable to connect to the client on the selected address, DialStatus is set to E_DIAL_ERROR, indicating the selected address is not publicly reachable.

If the server was able to connect to the client on the selected address, but an error occurred while sending the nonce on the /libp2p/autonat/2/dial-back stream, DialStatus is set to E_DIAL_BACK_ERROR. This might happen under resource-limited conditions on the client or server, or when either the client or the server is misconfigured.

If the server was able to connect to the client and successfully send a nonce on the /libp2p/autonat/2/dial-back stream, DialStatus is set to OK.

Requirements for ResponseStatus

The ResponseStatus sent by the server in the DialResponse message MUST be set according to the following requirements

E_REQUEST_REJECTED: The server didn't serve the request because of rate limiting, resource limit reached or blacklisting.

E_DIAL_REFUSED: The server didn't dial back any address because it was incapable of dialing or unwilling to dial any of the requested addresses.

E_INTERNAL_ERROR: An error not classified within the above error codes occurred on the server, preventing it from completing the request.

OK: The server completed the request successfully. A request is considered a success when the server selects an address to dial and dials it, successfully or unsuccessfully.

Implementations MUST discard responses with status codes they do not understand.

Amplification Attack Prevention

Interaction

When a client asks a server to dial an address that is not the client's observed IP address, the server asks the client to send a non-trivial amount of bytes as a cost for dialing a different IP address. To make amplification attacks unattractive, servers SHOULD ask for 30 kB to 100 kB. Since most handshakes cost less than 10 kB in bandwidth, 30 kB is sufficient to make attacks unattractive.

On receiving a DialRequest, the server selects the first address it is capable of dialing. If this selected address has an IP different from the client's observed IP, the server sends a DialDataRequest message with the selected address's index (zero-based) and numBytes set to a sufficiently large value on the /libp2p/autonat/2/dial-request stream.

Upon receiving a DialDataRequest message, the client decides whether to accept or reject the cost of the dial. If the client rejects the cost, it resets the stream and the DialRequest is considered aborted. If the client accepts the cost, it starts transferring numBytes bytes to the server. The client transfers these bytes wrapped in DialDataResponse protobufs, where the data field of each individual protobuf is limited to 4096 bytes in length. This allows implementations to use a small buffer for reading and sending the data. Only the size of the data field of the DialDataResponse protobufs counts towards the bytes transferred. Once the server has received at least numBytes bytes, it proceeds to dial the selected address. Servers SHOULD allow the last DialDataResponse message received from the client to be larger than the minimum required amount; this allows clients to serialize their DialDataResponse message once and reuse it for all requests.
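
As an illustration of this chunking rule, here is a minimal Go sketch of the client side; sendDialData and its send callback are hypothetical stand-ins for serializing a DialDataResponse and writing it, length-prefixed, to the dial-request stream:

// sendDialData sketches the client side of the data transfer: it
// sends numBytes bytes wrapped in DialDataResponse messages whose
// data field is at most 4096 bytes long.
func sendDialData(numBytes uint64, send func(data []byte) error) error {
    buf := make([]byte, 4096) // contents are irrelevant; only the length counts
    for remaining := numBytes; remaining > 0; {
        chunk := uint64(len(buf))
        if remaining < chunk {
            chunk = remaining
        }
        if err := send(buf[:chunk]); err != nil {
            return err
        }
        remaining -= chunk
    }
    return nil
}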

If an attacker asks a server to dial a victim node, the only benefit the attacker gets is forcing the server and the victim to do a cryptographic handshake, which costs some bandwidth and compute. The attacker could perform many handshakes with the victim directly, keeping its own compute cost low by reusing the same key; the only benefit of going via the server is saving the bandwidth required for a handshake. So the prevention mechanism focuses only on bandwidth costs. There is a minor benefit in bypassing IP blocklists, but that is made unattractive by the fact that servers may ask for 5x more data than the bandwidth cost of a handshake.

UDP-based protocols, like QUIC and DNS-over-UDP, need to prevent similar amplification attacks caused by IP spoofing. To verify that received packets don't have a spoofed IP, the server sends a random token to the client, which echoes the token back. For example, in QUIC, an attacker can use the victim's IP in an Initial packet to make the server send the victim a much larger ServerHello packet. QUIC servers use a Retry packet containing a token to validate that the client can receive packets at the address it claims. See QUIC Address Validation for details of the scheme.

Implementation Suggestions

For any given address, client implementations SHOULD do the following:

  • Periodically recheck reachability status.
  • Query multiple servers to determine reachability.

The suggested heuristic for implementations is to consider an address reachable if more than 3 servers report a successful dial, and to consider an address unreachable if more than 3 servers report unsuccessful dials. Implementations are free to use different heuristics than this one.
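
A sketch of this heuristic in Go; the tallying structure is illustrative, and the threshold of 3 is the suggested default rather than a protocol requirement:

// addrResults tallies autonat v2 outcomes for a single address.
type addrResults struct {
    successes int // servers that reported DialStatus OK
    failures  int // servers that reported E_DIAL_ERROR
}

// status applies the suggested heuristic: more than 3 successful
// reports mean reachable, more than 3 failed reports mean
// unreachable, anything else is still undetermined.
func (r addrResults) status() string {
    switch {
    case r.successes > 3:
        return "reachable"
    case r.failures > 3:
        return "unreachable"
    default:
        return "unknown"
    }
}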

Servers SHOULD NOT reuse their listening port when making a dial back. In case the client has reused its listen port when dialing out to the server, not reusing the listen port for the dial back prevents accidental hole punches. Clients SHOULD rely only on the nonce, and not on the peerID, when verifying the dial back, as the server is free to use a separate peerID for dial backs.

Servers SHOULD determine whether they have IPv6 and IPv4 connectivity. IPv4 only servers SHOULD refuse requests for dialing IPv6 addresses and IPv6 only servers SHOULD refuse requests for dialing IPv4 addresses.

RPC Messages

All RPC messages sent over a stream are prefixed with the message length in bytes, encoded as an unsigned variable length integer as defined by the multiformats unsigned-varint spec.
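
For illustration, a minimal Go sketch of this framing; Go's encoding/binary varints use the same encoding as the multiformats unsigned-varint for the lengths involved, and the maxLen bound is an implementation choice, not part of this spec:

package framing

import (
    "bufio"
    "encoding/binary"
    "fmt"
    "io"
)

// writeMsg writes one RPC message: an unsigned varint length prefix
// followed by the serialized protobuf bytes.
func writeMsg(w io.Writer, msg []byte) error {
    var prefix [binary.MaxVarintLen64]byte
    n := binary.PutUvarint(prefix[:], uint64(len(msg)))
    if _, err := w.Write(prefix[:n]); err != nil {
        return err
    }
    _, err := w.Write(msg)
    return err
}

// readMsg reads one length-prefixed RPC message, bounding the length
// to protect against resource-exhaustion attacks.
func readMsg(r *bufio.Reader, maxLen uint64) ([]byte, error) {
    length, err := binary.ReadUvarint(r)
    if err != nil {
        return nil, err
    }
    if length > maxLen {
        return nil, fmt.Errorf("message length %d exceeds limit %d", length, maxLen)
    }
    msg := make([]byte, length)
    if _, err := io.ReadFull(r, msg); err != nil {
        return nil, err
    }
    return msg, nil
}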

All RPC messages on stream /libp2p/autonat/2/dial-request are of type Message. A DialRequest message is sent as a Message with the msg field set to DialRequest. DialResponse and DialDataRequest are handled similarly.

On stream /libp2p/autonat/2/dial-back, a DialBack message is sent directly.


message Message {
    oneof msg {
        DialRequest dialRequest   = 1;
        DialResponse dialResponse = 2;
        DialDataRequest dialDataRequest = 3;
        DialDataResponse dialDataResponse = 4;
    }
}


message DialRequest {
    repeated bytes addrs = 1;
    fixed64 nonce = 2;
}


message DialDataRequest {
    uint32 addrIdx = 1;
    uint64 numBytes = 2;
}


enum DialStatus {
    UNUSED            = 0;
    E_DIAL_ERROR      = 100;
    E_DIAL_BACK_ERROR = 101;
    OK                = 200;
}


message DialResponse {
    enum ResponseStatus {
        E_INTERNAL_ERROR   = 0;
        E_REQUEST_REJECTED = 100;
        E_DIAL_REFUSED     = 101;
        OK  = 200;
    }

    ResponseStatus status = 1;
    uint32 addrIdx        = 2;
    DialStatus dialStatus = 3;
}


message DialDataResponse {
    bytes data = 1;
}


message DialBack {
    fixed64 nonce = 1;
}

message DialBackResponse {
    enum DialBackStatus {
        OK = 0;
    }

    DialBackStatus status = 1;
}

libp2p Kademlia DHT specification

| Lifecycle Stage | Maturity       | Status | Latest Revision |
| 3A              | Recommendation | Active | r2, 2022-12-09  |

Authors: @raulk, @jhiesey, @mxinden

Interest Group: @guillaumemichel

See the lifecycle document for context about the maturity level and spec status.


Overview

The Kademlia Distributed Hash Table (DHT) subsystem in libp2p is a DHT implementation largely based on the Kademlia [0] whitepaper, augmented with notions from S/Kademlia [1], Coral [2] and the BitTorrent DHT.

This specification assumes the reader has prior knowledge of those systems. So rather than explaining DHT mechanics from scratch, we focus on differential areas:

  1. Specialisations and peculiarities of the libp2p implementation.
  2. Actual wire messages.
  3. Other algorithmic or non-standard behaviours worth pointing out.

For everything else that isn't explicitly stated herein, it is safe to assume behaviour similar to Kademlia-based libraries.

Code snippets use a Go-like syntax.

Definitions

Replication parameter (k)

The amount of replication is governed by the replication parameter k. The recommended value for k is 20.

Distance

In all cases, the distance between two keys is XOR(sha256(key1), sha256(key2)).
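
A minimal Go sketch of this distance function, together with the shared-prefix length derived from it (which the routing table described in the next section is organized around):

package kademlia

import (
    "crypto/sha256"
    "math/bits"
)

// distance returns XOR(sha256(key1), sha256(key2)) as a 32-byte
// value; lower values (read as big-endian integers) mean closer keys.
func distance(key1, key2 []byte) [32]byte {
    h1 := sha256.Sum256(key1)
    h2 := sha256.Sum256(key2)
    var d [32]byte
    for i := range d {
        d[i] = h1[i] ^ h2[i]
    }
    return d
}

// commonPrefixLen returns the number of leading zero bits of a
// distance, i.e. the shared key prefix length L used to place a peer
// in the routing table.
func commonPrefixLen(d [32]byte) int {
    for i, b := range d {
        if b != 0 {
            return i*8 + bits.LeadingZeros8(b)
        }
    }
    return 256
}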

Kademlia routing table

An implementation of this specification must try to maintain k peers with shared key prefix of length L, for every L in [0..(keyspace-length - 1)], in its routing table. Given the keyspace length of 256 through the sha256 hash function, L can take values between 0 (inclusive) and 255 (inclusive). The local node shares a prefix length of 256 with its own key only.

Implementations may use any data structure to maintain their routing table. Examples are the k-bucket data structure outlined in the Kademlia paper [0] or XOR-tries (see go-libp2p-xor).

Alpha concurrency parameter (α)

The concurrency of node and value lookups is limited by the parameter α, with a default value of 10. This implies that each lookup process can perform no more than 10 in-flight requests at any given time.

Client and server mode

When the libp2p Kademlia protocol is run on top of a network of heterogeneous nodes, unrestricted nodes should operate in server mode and restricted nodes, e.g. those with intermittent availability, high latency, low bandwidth, low CPU/RAM/Storage, etc., should operate in client mode.

As an example, publicly routable nodes running the libp2p Kademlia protocol, e.g. servers in a datacenter, should operate in server mode and non-publicly routable nodes, e.g. laptops behind a NAT and firewall, should operate in client mode. The concrete factors used to classify nodes into clients and servers depend on the characteristics of the network topology and the properties of the Kademlia DHT. Factors to take into account are e.g. network size, replication factor and republishing period.

For instance, setting the replication factor to a low value would require more reliable peers, whereas a higher replication factor could allow for less reliable peers at the cost of more overhead. Ultimately, peers that act as servers should help the network (i.e., provide positive utility in terms of availability, reachability, bandwidth). Any node that slows down network operations (e.g., by being unreachable or overloaded) for the majority of the times it is contacted should instead operate as a client node.

Nodes, both those operating in client and server mode, add another node to their routing table if and only if that node operates in server mode. This distinction allows restricted nodes to utilize the DHT, i.e. query the DHT, without decreasing the quality of the distributed hash table, i.e. without polluting the routing tables.

Nodes operating in server mode advertise the libp2p Kademlia protocol identifier via the identify protocol. In addition server mode nodes accept incoming streams using the Kademlia protocol identifier. Nodes operating in client mode do not advertise support for the libp2p Kademlia protocol identifier. In addition they do not offer the Kademlia protocol identifier for incoming streams.

DHT operations

The libp2p Kademlia DHT offers the following types of operations:

  • Peer routing

    • Finding the closest nodes to a given key via FIND_NODE.
  • Value storage and retrieval

    • Storing a value on the nodes closest to the value's key by looking up the closest nodes via FIND_NODE and then putting the value to those nodes via PUT_VALUE.

    • Getting a value by its key from the nodes closest to that key via GET_VALUE.

  • Content provider advertisement and discovery

    • Adding oneself to the list of providers for a given key at the nodes closest to that key by finding the closest nodes via FIND_NODE and then adding oneself via ADD_PROVIDER.

    • Getting providers for a given key from the nodes closest to that key via GET_PROVIDERS.

In addition the libp2p Kademlia DHT offers the auxiliary bootstrap operation.

Peer routing

The below is one possible algorithm to find nodes closest to a given key on the DHT. Implementations may diverge from this base algorithm as long as they adhere to the wire format and make progress towards the target key.

Let's assume we’re looking for nodes closest to key Key. We then enter an iterative network search.

We keep track of the set of peers we've already queried (Pq) and the set of next query candidates sorted by distance from Key in ascending order (Pn). At initialization Pn is seeded with the k peers from our routing table we know are closest to Key, based on the XOR distance function (see distance definition).

Then we loop:

  1. The lookup terminates when the initiator has queried and gotten responses from the k (see replication parameter k) closest nodes it has seen.

    (See Kademlia paper [0].)

    The lookup might terminate early in case the local node queried all known nodes, with the number of nodes being smaller than k.

  2. Pick as many peers from the candidate peers (Pn) as the α concurrency factor allows. Send each a FIND_NODE(Key) request, and mark it as queried in Pq.

  3. Upon a response:

    1. If successful the response will contain the k closest nodes the peer knows to the key Key. Add them to the candidate list Pn, except for those that have already been queried.
    2. If an error or timeout occurs, discard it.
  4. Go to 1.
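
In Go-like pseudocode (matching the convention above), the loop looks roughly as follows; closestFromRoutingTable, pickUnqueried, sendFindNode and the other helpers are assumed, and a real implementation would issue the α requests concurrently rather than in sequence:

// Sketch of the iterative FIND_NODE lookup. Pn is kept sorted by
// XOR distance to key; Pq records peers that were already queried.
func lookup(key Key) []Peer {
    Pn := closestFromRoutingTable(key, k) // seed with the k closest known peers
    Pq := map[Peer]bool{}
    for {
        // Termination: the k closest peers seen have all responded,
        // or every known candidate has already been queried.
        if kClosestResponded(Pn, Pq) || len(unqueried(Pn, Pq)) == 0 {
            return kClosest(Pn, key)
        }
        for _, p := range pickUnqueried(Pn, Pq, alpha) {
            Pq[p] = true
            closer, err := sendFindNode(p, key) // FIND_NODE(Key) RPC
            if err != nil {
                continue // errors and timeouts are discarded
            }
            Pn = addCandidates(Pn, closer, Pq) // skip already-queried peers
        }
    }
}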

Value storage and retrieval

Value storage

To put a value, the DHT finds the k (or fewer) closest peers to the key of the value using the FIND_NODE RPC (see the peer routing section), and then sends a PUT_VALUE RPC message with the record value to each of those peers.

Value retrieval

When getting a value from the DHT, implementations may use a mechanism like quorums to define confidence in the values found on the DHT; put differently, a mechanism to determine when a query is finished. E.g. with quorums, one would collect at least Q (quorum) responses from distinct nodes to check for consistency before returning an answer.

Entry validation: Should the responses from different peers diverge, the implementation should use some validation mechanism to resolve the conflict and select the best result (see entry validation section).

Entry correction: Nodes that returned worse records, and nodes that returned no record but were among the closest to the key, are updated via a direct PUT_VALUE RPC call when the lookup completes. Thus the DHT network eventually converges on the best value for each record, as a result of nodes collaborating with one another.

The below is one possible algorithm to lookup a value on the DHT. Implementations may diverge from this base algorithm as long as they adhere to the wire format and make progress towards the target key.

Let's assume we're looking for key Key. We first try to fetch the value from the local store. If found, and Q is 0 or 1, the search is complete.

Otherwise, the local result counts for one towards the search of Q values. We then enter an iterative network search.

We keep track of:

  • the number of values we've fetched (cnt).
  • the best value we've found (best), and which peers returned it (Pb)
  • the set of peers we've already queried (Pq) and the set of next query candidates sorted by distance from Key in ascending order (Pn).
  • the set of peers with outdated values (Po).

At initialization we seed Pn with the α peers from our routing table we know are closest to Key, based on the XOR distance function.

Then we loop:

  1. If we have collected Q or more answers, we cancel outstanding requests and return best. If there are no outstanding requests and Pn is empty, we terminate early and return best. In either case, we send PUT_VALUE(Key, best) messages to the outdated peers (Po): those that held an older value, as well as those that returned no value despite being among the k closest peers to the key.
  2. Pick as many peers from the candidate peers (Pn) as the α concurrency factor allows. Send each a GET_VALUE(Key) request, and mark it as queried in Pq.
  3. Upon a response:
    1. If successful, and we receive a value:
      1. If this is the first value we've seen, we store it in best, along with the peer who sent it in Pb.
      2. Otherwise, we resolve the conflict by e.g. calling Validator.Select(best, new):
        1. If the new value wins, store it in best, and mark all formerly "best" peers (Pb) as outdated peers (Po). The current peer becomes the new best peer (Pb).
        2. If the new value loses, we add the current peer to Po.
    2. If successful with or without a value, the response will contain the closest nodes the peer knows to the Key. Add them to the candidate list Pn, except for those that have already been queried.
    3. If an error or timeout occurs, discard it.
  4. Go to 1.

Entry validation

Implementations should validate DHT entries during retrieval and before storage, e.g. by allowing a record Validator to be supplied when constructing a DHT node. Below is a sample interface of such a Validator:

// Validator is an interface that should be implemented by record
// validators.
type Validator interface {
	// Validate validates the given record, returning an error if it's
	// invalid (e.g., expired, signed by the wrong key, etc.).
	Validate(key string, value []byte) error

	// Select selects the best record from the set of records (e.g., the
	// newest).
	//
	// Decisions made by select should be stable.
	Select(key string, values [][]byte) (int, error)
}

Validate() should be a pure function that reports the validity of a record. It may validate a cryptographic signature, or similar. It is called on two occasions:

  1. To validate values retrieved in a GET_VALUE query.
  2. To validate values received in a PUT_VALUE query before storing them in the local data store.

Similarly, Select() is a pure function that returns the best record out of 2 or more candidates. It may use a sequence number, a timestamp, or other heuristic of the value to make the decision.
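
For illustration, a toy Validator where values are 8-byte big-endian sequence numbers and the highest sequence number wins might look like this (the record format is hypothetical, not part of this spec):

package seqvalidator

import (
	"encoding/binary"
	"errors"
)

// SeqValidator treats record values as 8-byte big-endian sequence
// numbers; the highest sequence number is the best record.
type SeqValidator struct{}

func (SeqValidator) Validate(key string, value []byte) error {
	if len(value) != 8 {
		return errors.New("value is not an 8-byte sequence number")
	}
	return nil
}

func (SeqValidator) Select(key string, values [][]byte) (int, error) {
	best, bestSeq := -1, uint64(0)
	for i, v := range values {
		if len(v) != 8 {
			continue
		}
		// Strictly greater keeps the decision stable: the first
		// occurrence of the highest sequence number wins.
		if seq := binary.BigEndian.Uint64(v); best == -1 || seq > bestSeq {
			best, bestSeq = i, seq
		}
	}
	if best == -1 {
		return 0, errors.New("no valid values")
	}
	return best, nil
}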

Content provider advertisement and discovery

There are two things at play with regard to provider record (and therefore content) liveness and reachability:

Content needs to be reachable, despite peer churn; and nodes that store and serve provider records should not serve records for stale content, i.e., content that the original provider does not wish to make available anymore.

The following two parameters help cover both of these cases.

  1. Provider Record Republish Interval: The content provider needs to make sure that the nodes chosen to store the provider record are still online when clients ask for the record. In order to guarantee this, while taking into account the peer churn, content providers republish the records they want to provide. Choosing the particular value for the Republish interval is network-specific and depends on several parameters, such as peer reliability and churn.

    • For the IPFS network it is currently set to 22 hours.
  2. Provider Record Expiration Interval: The network needs to provide content that content providers are still interested in providing. In other words, nodes should not keep records for content that content providers have stopped providing (aka stale records). In order to guarantee this, provider records should expire after some interval, i.e., nodes should stop serving those records, unless the content provider has republished the provider record. Again, the specific setting depends on the characteristics of the network.

    • In the IPFS DHT the Expiration Interval is set to 48 hours.

The values chosen for those parameters should be subject to continuous monitoring and investigation. Ultimately, the values of those parameters should balance the tradeoff between provider record liveness (due to node churn) and traffic overhead (to republish records). The latest parameters are based on the comprehensive study published in provider-record-measurements.

Provider records are managed through the ADD_PROVIDER and GET_PROVIDERS messages.

It is also worth noting that the keys for provider records are multihashes. This is because:

  • Provider records are used as a rendezvous point for all the parties who have advertised that they store some piece of content.
  • The same multihash can be in different CIDs (e.g. CIDv0 vs CIDv1 of a SHA-256 dag-pb object, or the same multihash but with different codecs such as dag-pb vs raw).
  • Therefore, the rendezvous point should converge on the minimal thing everyone agrees on, which is the multihash, not the CID.

Content provider advertisement

When the local node wants to indicate that it provides the value for a given key, the DHT finds the (k = 20) closest peers to the key using the FIND_NODE RPC (see peer routing section), and then sends an ADD_PROVIDER RPC with its own PeerInfo to each of these peers. The study in provider-record-measurements proved that the replication factor of k = 20 is a good setting, although continuous monitoring and investigation may change this recommendation in the future.

Each peer that receives the ADD_PROVIDER RPC should validate that the received PeerInfo matches the sender's peerID, and if it does, that peer should store the PeerInfo in its datastore. Implementations may choose not to store the addresses of the providing peer, e.g. to reduce the amount of required storage or to avoid storing potentially outdated address information. Implementations that choose to keep the network address (i.e., the multiaddress) of the providing peer should do so only for a period during which they are confident that peers' network addresses do not change after the provider record has been (re-)published. As with previous constant values, this is dependent on the network's characteristics. A safe value here is the Routing Table Refresh Interval; in the kubo IPFS implementation, this is set to 30 minutes. After that period, peers provide the provider's peerID only, in order to avoid pointing to stale network addresses (i.e., the case where the peer has moved to a new network address).

Content provider discovery

Getting the providers for a given key is done in the same way as getting a value for a given key (see getting values section) except that instead of using the GET_VALUE RPC message the GET_PROVIDERS RPC message is used.

When a node receives a GET_PROVIDERS RPC, it must look up the requested key in its datastore, and respond with any corresponding records in its datastore, plus a list of closer peers in its routing table.

Bootstrap process

The bootstrap process is responsible for keeping the routing table filled and healthy throughout time. The below is one possible algorithm to bootstrap. Implementations may diverge from this base algorithm as long as they adhere to the wire format and keep their routing table up-to-date, especially with peers closest to themselves.

The process runs once on startup, then periodically with a configurable frequency (default: 10 minutes). On every run, we generate a random peer ID for every non-empty k-bucket of the routing table and look it up via the process defined in peer routing. Peers encountered throughout the search are inserted in the routing table, as per usual business.

In addition, to improve awareness of nodes close to oneself, implementations should include a lookup for their own peer ID.

Every repetition is subject to a QueryTimeout (default: 10 seconds), which upon firing, aborts the run.
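
A sketch of this schedule in Go; targets and lookup are assumed helpers (one random peer ID per non-empty k-bucket plus the node's own ID, and the peer routing search, respectively), and the intervals mirror the defaults above:

package bootstrap

import (
	"context"
	"time"
)

// runBootstrap runs one refresh on startup and then one per interval.
func runBootstrap(ctx context.Context, targets func() [][]byte, lookup func(context.Context, []byte)) {
	refresh := func() {
		for _, t := range targets() {
			// Each repetition is bounded by the QueryTimeout (default: 10 seconds).
			qctx, cancel := context.WithTimeout(ctx, 10*time.Second)
			lookup(qctx, t)
			cancel()
		}
	}
	refresh() // once on startup
	ticker := time.NewTicker(10 * time.Minute) // configurable frequency
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			refresh()
		case <-ctx.Done():
			return
		}
	}
}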

RPC messages

Remote procedure calls are performed by:

  1. Opening a new stream.
  2. Sending the RPC request message.
  3. Listening for the RPC response message.
  4. Closing the stream.

On any error, the stream is reset.

Implementations may choose to re-use streams by sending one or more RPC request messages on a single outgoing stream before closing it. Implementations must handle additional RPC request messages on an incoming stream.

All RPC messages sent over a stream are prefixed with the message length in bytes, encoded as an unsigned variable length integer as defined by the multiformats unsigned-varint spec.

All RPC messages conform to the following protobuf:

syntax = "proto2";

// Record represents a dht record that contains a value
// for a key value pair
message Record {
    // The key that references this record
    bytes key = 1;

    // The actual value this record is storing
    bytes value = 2;

    // Note: These fields were removed from the Record message
    //
    // Hash of the authors public key
    // optional string author = 3;
    // A PKI signature for the key+value+author
    // optional bytes signature = 4;

    // Time the record was received, set by receiver
    // Formatted according to https://datatracker.ietf.org/doc/html/rfc3339
    string timeReceived = 5;
};

message Message {
    enum MessageType {
        PUT_VALUE = 0;
        GET_VALUE = 1;
        ADD_PROVIDER = 2;
        GET_PROVIDERS = 3;
        FIND_NODE = 4;
        PING = 5;
    }

    enum ConnectionType {
        // sender does not have a connection to peer, and no extra information (default)
        NOT_CONNECTED = 0;

        // sender has a live connection to peer
        CONNECTED = 1;

        // sender recently connected to peer
        CAN_CONNECT = 2;

        // sender recently tried to connect to peer repeatedly but failed to connect
        // ("try" here is loose, but this should signal "made strong effort, failed")
        CANNOT_CONNECT = 3;
    }

    message Peer {
        // ID of a given peer.
        bytes id = 1;

        // multiaddrs for a given peer
        repeated bytes addrs = 2;

        // used to signal the sender's connection capabilities to the peer
        ConnectionType connection = 3;
    }

    // defines what type of message it is.
    MessageType type = 1;

    // defines what coral cluster level this query/response belongs to.
    // in case we want to implement coral's cluster rings in the future.
    int32 clusterLevelRaw = 10; // NOT USED

    // Used to specify the key associated with this message.
    // PUT_VALUE, GET_VALUE, ADD_PROVIDER, GET_PROVIDERS
    bytes key = 2;

    // Used to return a value
    // PUT_VALUE, GET_VALUE
    Record record = 3;

    // Used to return peers closer to a key in a query
    // GET_VALUE, GET_PROVIDERS, FIND_NODE
    repeated Peer closerPeers = 8;

    // Used to return Providers
    // GET_VALUE, ADD_PROVIDER, GET_PROVIDERS
    repeated Peer providerPeers = 9;
}

These are the requirements for each MessageType:

  • FIND_NODE: In the request key must be set to the binary PeerId of the node to be found. In the response closerPeers is set to the k closest Peers.

  • GET_VALUE: In the request key is an unstructured array of bytes. record is set to the value for the given key (if found in the datastore) and closerPeers is set to the k closest peers.

  • PUT_VALUE: In the request, record is set to the record to be stored, and the key field of the Message is set to equal the key of the Record. The target node validates the record and, if it is valid, stores it in the datastore and echoes the request back as the response.

  • GET_PROVIDERS: In the request key is set to a CID. The target node returns the closest known providerPeers (if any) and the k closest known closerPeers.

  • ADD_PROVIDER: In the request key is set to a CID. The target node verifies that key is a valid CID; all providerPeers that match the RPC sender's PeerID are recorded as providers.

  • PING: Deprecated message type replaced by the dedicated ping protocol. Implementations may still handle incoming PING requests for backwards compatibility. Implementations must not actively send PING requests.

Note: Any time a relevant Peer record is encountered, the associated multiaddrs are stored in the node's peerbook.


References

[0]: Maymounkov, P., & Mazières, D. (2002). Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. In P. Druschel, F. Kaashoek, & A. Rowstron (Eds.), Peer-to-Peer Systems (pp. 53–65). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-45748-8_5

[1]: Baumgart, I., & Mies, S. (2014). S / Kademlia : A practicable approach towards secure key-based routing S / Kademlia : A Practicable Approach Towards Secure Key-Based Routing, (June). https://doi.org/10.1109/ICPADS.2007.4447808

[2]: Freedman, M. J., & Mazières, D. (2003). Sloppy Hashing and Self-Organizing Clusters. In IPTPS. Springer Berlin / Heidelberg. Retrieved from https://www.cs.princeton.edu/~mfreed/docs/coral-iptps03.pdf

Multicast DNS (mDNS)

Local peer discovery with zero configuration using multicast DNS.

| Lifecycle Stage | Maturity      | Status | Latest Revision |
| 1A              | Working Draft | Active | r2, 2021-10-12  |

Authors: @richardschneider

Interest Group: @yusefnapora, @raulk, @daviddias, @jacobheun

See the lifecycle document for context about the maturity level and spec status.


Overview

The goal is to allow peers to discover each other when on the same local network with zero configuration. mDNS uses a multicast system of DNS records; this allows all peers on the local network to see all query responses.

Conceptually, it is very simple. When a peer starts (or detects a network change), it sends a query for all peers. As responses come in, the peer adds the other peers' information into its local database of peers.

Definitions

  • service-name is the DNS Service Discovery (DNS-SD) service name for all peers. It is defined as _p2p._udp.local.

  • host-name is the fully qualified name of the peer. It is derived from the peer-name and p2p.local.

  • peer-name is the case-insensitive unique identifier of the peer, and is less than 64 characters.

    As this field doesn't carry any meaning, it is sufficient to ensure the uniqueness of this identifier. Peers SHOULD generate a random, lower-case alphanumeric string of at least 32 characters in length when booting up their node. Peers SHOULD NOT use their Peer ID here, because a future Peer ID could exceed the DNS label limit of 63 characters.

If a private network is in use, then the service-name contains the base-16 encoding of the network's fingerprint as in _p2p-X._udp.local. This prevents public and private networks from discovering each other's peers.

Peer Discovery

Request

To find all peers, a DNS message is sent with the question _p2p._udp.local PTR. Peers will then start responding with their details.

Note that a peer must respond to its own query. This allows other peers to passively discover it.

Response

On receipt of a find all peers query, a peer sends a DNS response message (QR = 1) that contains the answer

<service-name> PTR <peer-name>.<service-name>

The additional records of the response contain the peer's discovery details:

<peer-name>.<service-name> TXT "dnsaddr=..."

The TXT record contains the multiaddresses that the peer is listening on. Each multiaddress is a TXT attribute with the form dnsaddr=/.../p2p/QmId. Multiple dnsaddr attributes and/or TXT records are allowed.

DNS Service Discovery

DNS-SD support is not needed for peers to discover each other. However, it is extremely useful for network administrators to discover what is running on the network.

Meta Query

This allows discovery of all services. The question is _services._dns-sd._udp.local PTR.

A peer responds with the answer

    _services._dns-sd._udp.local PTR <service-name>

Find All Response

On receipt of a find all peers query, the following additional records should be included:

    <peer-name>.<service-name> SRV ... <host-name>
    <host-name>              A <ipv4 address>
    <host-name>              AAAA <ipv6 address>

Gotchas

Many existing tools ignore the Additional Records, and always send individual queries for the peer's discovery details. To accommodate this, a peer should respond to the following queries:

  • <peer-name>.<service-name> SRV
  • <peer-name>.<service-name> TXT
  • <host-name> A
  • <host-name> AAAA

Issues

[ ] mDNS requires link-local addresses. Loopback and "NAT busting" addresses should not be sent and must be ignored on receipt?

References

Meta Query

Goal: find all services on the local network.

Question

_services._dns-sd._udp.local PTR

Answer

_services._dns-sd._udp.local IN PTR _p2p._udp.local

Find All Peers

Goal: find all peers on the local network.

Question

_p2p._udp.local PTR

Answer

_p2p._udp.local IN PTR `<peer-name>`._p2p._udp.local

Additional Records

  • <peer-name>._p2p._udp.local IN TXT dnsaddr=/ip6/2001:DB8::7573:b0a8:46b0:bfea/tcp/4001/p2p/id
  • <peer-name>._p2p._udp.local IN TXT dnsaddr=/ip4/192.0.2.0/tcp/4001/p2p/id

mplex

The spec for the friendly Stream Multiplexer (that works in 3 languages!)

| Lifecycle Stage | Maturity       | Status | Latest Revision |
| 3A              | Recommendation | Active | r0, 2018-10-10  |

Authors: @daviddias, @Stebalien, @tomaka

Interest Group: @yusefnapora, @richardschneider, @jacobheun

See the lifecycle document for context about the maturity level and spec status.


Overview

Mplex is a Stream Multiplexer protocol used by js-ipfs and go-ipfs in their implementations. The protocol is based on multiplex, the JavaScript-only Stream Multiplexer. After many battlefield tests, we felt the need to improve and fix some of its bugs and mechanics, resulting in this new version used by libp2p.

This document will attempt to define a specification for the wire protocol and algorithm used in both implementations.

Mplex is a very simple protocol that does not provide many features offered by other stream multiplexers. Notably, mplex does not provide backpressure at the protocol level.


Message format

Every communication in mplex consists of a header, and a length prefixed data segment.

The header is an unsigned base128 varint. The lower three bits are the message flags, and the rest of the bits (shifted down by three bits) are the stream ID this message pertains to:

header = readUvarint()
flag = header & 0x07
id = header >> 3

The maximum header length is 9 bytes (per the unsigned-varint spec), which encodes at most 63 bits of data; with 3 bits used for message flags, the maximum stream ID is 60 bits (maximum value of 2^60 - 1).

Flag Values

| NewStream        | 0 |
| MessageReceiver  | 1 |
| MessageInitiator | 2 |
| CloseReceiver    | 3 |
| CloseInitiator   | 4 |
| ResetReceiver    | 5 |
| ResetInitiator   | 6 |

The data segment is length prefixed by another unsigned varint. This results in one message looking like:

| header  | length  | data           |
| uvarint | uvarint | 'length' bytes |
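
A minimal Go sketch of reading and writing this framing; the function names are illustrative, and enforcing the 1MiB payload limit is left to the implementation:

package mplex

import (
	"bufio"
	"encoding/binary"
	"io"
)

// writeFrame writes one mplex message: a varint header packing the
// stream ID and flag, then a varint length prefix and the data.
func writeFrame(w io.Writer, id, flag uint64, data []byte) error {
	var buf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(buf[:], id<<3|flag)
	if _, err := w.Write(buf[:n]); err != nil {
		return err
	}
	n = binary.PutUvarint(buf[:], uint64(len(data)))
	if _, err := w.Write(buf[:n]); err != nil {
		return err
	}
	_, err := w.Write(data)
	return err
}

// readFrame reads one mplex message and unpacks the header.
func readFrame(r *bufio.Reader) (uint64, uint64, []byte, error) {
	header, err := binary.ReadUvarint(r)
	if err != nil {
		return 0, 0, nil, err
	}
	length, err := binary.ReadUvarint(r)
	if err != nil {
		return 0, 0, nil, err
	}
	data := make([]byte, length) // real implementations bound this (1 MiB max per message)
	if _, err := io.ReadFull(r, data); err != nil {
		return 0, 0, nil, err
	}
	return header >> 3, header & 0x07, data, nil // stream ID, flag, payload
}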

Protocol

Mplex operates over a reliable ordered pipe between two peers, such as a TCP socket, or a unix pipe.

Opening a new stream

To open a new stream, first allocate a new stream ID. Then, send a message with the flag set to NewStream, the ID set to the newly allocated stream ID, and the data of the message set to the name of the stream.

Stream names are purely for debugging purposes and are not otherwise considered by the protocol. An empty string may also be used for the stream name, and they may also be repeated (using the same stream name for every stream is valid). Reusing a stream ID after closing a stream may result in undefined behaviour.

The party that opens a stream is called the stream initiator. Both parties can open a substream with the same ID, therefore this distinction is used to identify whether each message concerns the channel opened locally or remotely.

Writing to a stream

To write data to a stream, one must send a message with the flag MessageReceiver (1) or MessageInitiator (2) (depending on whether or not the writer is the one initiating the stream). The data field should contain the data you wish to write to the stream, up to 1MiB per message.

Closing a stream

Mplex supports half-closed streams. Closing a stream closes it for writing and closes the remote end for reading but allows writing in the other direction.

To close a stream, send a message with a zero length body and a CloseReceiver (3) or CloseInitiator (4) flag (depending on whether or not the closer is the one initiating the stream). Writing to a stream after it has been closed is a protocol violation. Reading from a remote-closed stream should return all data sent before closing the stream and then EOF thereafter.

Resetting a stream

To immediately close a stream for both reading and writing, use reset. This should generally only be used on error; during normal operation, both sides should close instead.

To reset a stream, send a message with a zero length body and a ResetReceiver (5) or ResetInitiator (6) flag. Reset must immediately close both ends of the stream for both reading and writing. Writing to a stream after it has been reset is a protocol violation. Since reset is generally sent when an error happens, all future reads from a reset stream should return an error (not EOF).

Implementation notes

If a stream is being actively written to, the reader must take care to keep up with inbound data. Due to the lack of back pressure at the protocol level, the implementation must handle slow readers by doing one or both of:

  1. Blocking the entire connection until the offending stream is read.
  2. Resetting the offending stream.

For example, the go-mplex implementation blocks for a short period of time and then resets the stream if necessary.

yamux

| Lifecycle Stage | Maturity       | Status | Latest Revision |
| 3A              | Recommendation | Active | r0, 2023-02-17  |

Authors: @thomaseizinger

Interest Group: @marten-seemann, @wemeetagain, @ianopolous

See the lifecycle document for context about maturity level and spec status.

Overview

Yamux is a Stream Multiplexer protocol originally specified by @hashicorp. The specification lives here: https://github.com/hashicorp/yamux/blob/master/spec.md

The below sections are a verbatim copy (modulo formatting changes) of the spec at the time this document was created. This allows us to preserve the specification in case the linked document is ever removed or edited.

Specification

The protocol string of yamux for multistream-select is: /yamux/1.0.0.

Framing

Yamux uses a streaming connection underneath, but imposes a message framing so that it can be shared between many logical streams. Each frame contains a header like:

  • Version (8 bits)
  • Type (8 bits)
  • Flags (16 bits)
  • StreamID (32 bits)
  • Length (32 bits)

This means that each header has a 12 byte overhead. All fields are encoded in network order (big endian). Each field is described below:
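
For illustration, a minimal Go sketch of this 12-byte header layout (the struct and method names are not part of the spec):

package yamux

import "encoding/binary"

// header holds the fields of a yamux frame header, per the framing
// description above.
type header struct {
	Version  uint8
	Type     uint8
	Flags    uint16
	StreamID uint32
	Length   uint32
}

// encode packs the header into its 12-byte network-order (big-endian) form.
func (h header) encode() [12]byte {
	var buf [12]byte
	buf[0] = h.Version
	buf[1] = h.Type
	binary.BigEndian.PutUint16(buf[2:4], h.Flags)
	binary.BigEndian.PutUint32(buf[4:8], h.StreamID)
	binary.BigEndian.PutUint32(buf[8:12], h.Length)
	return buf
}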

Version Field

The version field is used for future backward compatibility. At the current time, the field is always set to 0, to indicate the initial version.

Type Field

The type field is used to switch the frame message type. The following message types are supported:

  • 0x0 Data - Used to transmit data. May transmit zero length payloads depending on the flags.

  • 0x1 Window Update - Used to update the sender's receive window size. This is used to implement per-stream flow control.

  • 0x2 Ping - Used to measure RTT. It can also be used to heart-beat and do keep-alives over TCP.

  • 0x3 Go Away - Used to close a session.

Flag Field

The flags field is used to provide additional information related to the message type. The following flags are supported:

  • 0x1 SYN - Signals the start of a new stream. May be sent with a data or window update message. Also sent with a ping to indicate outbound.

  • 0x2 ACK - Acknowledges the start of a new stream. May be sent with a data or window update message. Also sent with a ping to indicate response.

  • 0x4 FIN - Performs a half-close of a stream. May be sent with a data message or window update.

  • 0x8 RST - Reset a stream immediately. May be sent with a data or window update message.

StreamID Field

The StreamID field is used to identify the logical stream the frame is addressing. The client side should use odd ID's, and the server even. This prevents any collisions. Additionally, the 0 ID is reserved to represent the session.

Both Ping and Go Away messages should always use the 0 StreamID.

Length Field

The meaning of the length field depends on the message type:

  • Data - provides the length of bytes following the header
  • Window update - provides a delta update to the window size
  • Ping - Contains an opaque value, echoed back
  • Go Away - Contains an error code

Message Flow

There is no explicit connection setup, as Yamux relies on an underlying transport to be provided. However, there is a distinction between client and server side of the connection.

Opening a stream

To open a stream, an initial data or window update frame is sent with a new StreamID. The SYN flag should be set to signal a new stream.

The receiver must then reply with either a data or window update frame with the StreamID along with the ACK flag to accept the stream or with the RST flag to reject the stream.

Because we are relying on the reliable stream underneath, a connection can begin sending data once the SYN flag is sent. The corresponding ACK does not need to be received. This is particularly well suited for an RPC system where a client wants to open a stream and immediately fire a request without waiting for the RTT of the ACK.

This does introduce the possibility of a connection being rejected after data has been sent already. This is a slight semantic difference from TCP, where the connection cannot be refused after it is opened. Clients should be prepared to handle this by checking for an error that indicates a RST was received.

Closing a stream

To close a stream, either side sends a data or window update frame along with the FIN flag. This does a half-close indicating the sender will send no further data.

Once both sides have closed the connection, the stream is closed.

Alternatively, if an error occurs, the RST flag can be used to hard close a stream immediately.

Flow Control

Yamux initially starts each stream with a 256KiB window size. There is no window size for the session.

To prevent streams from stalling, window update frames should be sent regularly. Yamux can be configured to provide a larger limit for window sizes. Both sides assume the initial 256KiB window, but can immediately send a window update as part of the SYN/ACK indicating a larger window.

Both sides should track the number of bytes sent in Data frames only, as only they are tracked as part of the window size.

Session termination

When a session is being terminated, the Go Away message should be sent. The Length should be set to one of the following to provide an error code:

  • 0x0 Normal termination
  • 0x1 Protocol error
  • 0x2 Internal error

Implementation considerations

ACK backlog & backpressure

Yamux allows for a stream to be opened (and used) before it is acknowledged by the remote. Yamux also does not specify a backpressure mechanism for opening new streams.

This presents a problem: A peer must read from the socket and decode the frames to make progress on existing streams. But any frame could also open yet another stream.

The ACK backlog is defined as the number of streams that a peer has opened which have not yet been acknowledged. To support a basic form of backpressure, implementations:

  • SHOULD at most allow an ACK backlog of 256 streams.
  • MAY buffer unacknowledged inbound streams instead of resetting them when the application currently cannot handle any more streams. Such a buffer MUST be bounded in size to mitigate DoS attacks.
  • MAY delay acknowledging new streams until the application has received or is about to send the first DATA frame.

noise-libp2p - Secure Channel Handshake

A libp2p transport secure channel handshake built with the Noise Protocol Framework.

| Lifecycle Stage | Maturity       | Status | Latest Revision |
| 3A              | Recommendation | Active | r5, 2022-12-07  |

Authors: @yusefnapora

Interest Group: @raulk, @tomaka, @romanb, @shahankhatch, @Mikerah, @djrtwo, @dryajov, @mpetrunic, @AgeManning, @morrigan, @araskachoi, @mhchia

See the lifecycle document for context about the maturity level and spec status.


Overview

The Noise Protocol Framework is a framework for building security protocols by composing a small set of cryptographic primitives into patterns with verifiable security properties.

This document specifies noise-libp2p, a libp2p channel security handshake built using the Noise Protocol Framework. As a framework for building protocols rather than a protocol itself, Noise presents a large decision space with many tradeoffs. The Design Considerations section goes into detail about the choices made when designing the protocol.

Secure channels in libp2p are established with the help of a transport upgrader, a component that layers security and stream multiplexing over "raw" connections like TCP sockets. When peers connect, the upgrader uses a protocol called multistream-select to negotiate which security and multiplexing protocols to use. The upgrade process is described in the connection establishment spec.

The transport upgrade process is likely to evolve soon, as we are in the process of designing multiselect 2, a successor to multistream-select. Some noise-libp2p features are designed to enable proposed features of multiselect 2, however noise-libp2p is fully compatible with the current upgrade process and multistream-select. See the Negotiation section for details about protocol negotiation.

Every Noise connection begins with a handshake between an initiating peer and a responding peer, or in libp2p terms, a dialer and a listener. Over the course of the handshake, peers exchange public keys and perform Diffie-Hellman exchanges to arrive at a pair of symmetric keys that can be used to efficiently encrypt traffic. The Noise Handshake section describes the handshake pattern and how libp2p-specific data is exchanged during the handshake.

During the handshake, the static DH key used for Noise is authenticated using the libp2p identity keypair, as described in the Static Key Authentication section.

Following a successful handshake, peers use the resulting encryption keys to send ciphertexts back and forth. The format for transport messages and the wire protocol used to exchange them is described in the Wire Format section. The cryptographic primitives used to secure the channel are described in the Cryptographic Primitives section.

Negotiation

libp2p has an existing protocol negotiation mechanism which is used to reach agreement on the secure channel and multiplexing protocols used for new connections. A description of the current protocol negotiation flow is available in the libp2p connections spec.

noise-libp2p is identified by the protocol ID string /noise. Peers using multistream-select for protocol negotiation may send this protocol ID during connection establishment to attempt to use noise-libp2p.

Future versions of this spec may define new protocol IDs using the /noise prefix, for example /noise/2.

The Noise Handshake

During the Noise handshake, peers perform an authenticated key exchange according to the rules defined by a concrete Noise protocol. A concrete Noise protocol is identified by the choice of handshake pattern and cryptographic primitives used to construct it.

This section covers the method of authenticating the Noise static key, the libp2p-specific data that is exchanged in handshake message payloads, and the supported handshake pattern.

Static Key Authentication

The Security Considerations section of the Noise spec says:

* Authentication: A Noise protocol with static public keys verifies that the
corresponding private keys are possessed by the participant(s), but it's up to
the application to determine whether the remote party's static public key is
acceptable. Methods for doing so include certificates which sign the public key
(and which may be passed in handshake payloads), preconfigured lists of public
keys, or "pinning" / "key-continuity" approaches where parties remember public
keys they encounter and check whether the same party presents the same public
key in the future.

All libp2p peers possess a cryptographic keypair which is used to derive their peer id, which we will refer to as their "identity keypair." To avoid potential static key reuse, and to allow libp2p peers with any type of identity keypair to use Noise, noise-libp2p uses a separate static keypair for Noise that is distinct from the peer's identity keypair.

A given libp2p peer will have one or more static Noise keypairs throughout its lifetime. Because the static key is authenticated using the libp2p identity key, it is not necessary for the key to actually be "static" in the traditional sense, and implementations MAY generate a new static Noise keypair for each new session. Alternatively, a single static keypair may be generated when noise-libp2p is initialized and used for all sessions. Implementations SHOULD NOT store the static Noise key to disk, as there is no benefit and a heightened risk of exposure.

To authenticate the static Noise key used in a handshake, noise-libp2p includes a signature of the static Noise public key in a handshake payload. This signature is produced with the private libp2p identity key, which proves that the sender was in possession of the private identity key at the time the payload was generated.

libp2p Data in Handshake Messages

In addition to authenticating the static Noise key, noise-libp2p implementations MAY send additional "early data" in the handshake message payload. The contents of this early data are opaque to noise-libp2p, however it is assumed that it will be used to advertise supported stream multiplexers, thus avoiding a round-trip negotiation after the handshake completes.

The use of early data MUST be restricted to internal libp2p APIs, and the early data payload MUST NOT be used to transmit user or application data. Some handshake messages containing the early data payload may be susceptible to replay attacks, therefore the processing of early data must be idempotent. The noise-libp2p implementation itself MUST NOT process the early data payload in any way during the handshake, except to produce and validate the signature as described below.

Early data provided by a remote peer should only be made available to other libp2p components after the handshake is complete and the payload signature has been validated. If the handshake fails for any reason, the early data payload MUST be discarded immediately.

Any early data provided to noise-libp2p MUST be included in the handshake payload as a byte string without alteration by the noise-libp2p implementation.

The libp2p Handshake Payload

The Noise Protocol Framework caters for sending early data alongside handshake messages. We leverage this construct to transmit:

  1. the libp2p identity key along with a signature, to authenticate each party to the other.
  2. extensions used by the libp2p stack.

The extensions are inserted into the first message of the handshake pattern that guarantees secrecy. Specifically, this means that the initiator MUST NOT send extensions in its first message. The initiator sends its extensions in message 3 (the closing message), and the responder sends theirs in message 2 (its only message). It should be stressed that while the second message of the handshake pattern has forward secrecy, the sender has not yet authenticated the responder, so this payload might be sent to any party, including an active attacker.

When decrypted, the payload contains a serialized protobuf NoiseHandshakePayload message with the following schema:

syntax = "proto2";

message NoiseExtensions {
    repeated bytes webtransport_certhashes = 1;
    repeated string stream_muxers = 2;
}

message NoiseHandshakePayload {
  optional bytes identity_key = 1;
  optional bytes identity_sig = 2;
  optional NoiseExtensions extensions = 4;
}

The identity_key field contains a serialized PublicKey message as defined in the peer id spec.

The identity_sig field is produced using the libp2p identity private key according to the signing rules in the peer id spec. The data to be signed is the UTF-8 string noise-libp2p-static-key:, followed by the Noise static public key, encoded according to the rules defined in section 5 of RFC 7748.
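
As an illustration with an Ed25519 identity key, the signed byte string can be constructed as follows; signNoiseStatic is a hypothetical helper, and other identity key types sign the same bytes with their own algorithm per the peer id spec:

package noisesig

import "crypto/ed25519"

// signNoiseStatic produces the identity_sig payload field: a
// signature over "noise-libp2p-static-key:" followed by the 32-byte
// X25519 static Noise public key (already in its RFC 7748 encoding).
func signNoiseStatic(identityKey ed25519.PrivateKey, noiseStaticPub []byte) []byte {
	msg := append([]byte("noise-libp2p-static-key:"), noiseStaticPub...)
	return ed25519.Sign(identityKey, msg)
}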

The extensions field contains Noise extensions and is described in Noise Extensions.

Upon receiving the handshake payload, peers MUST decode the public key from the identity_key field into a usable form. The key MUST then be used to validate the identity_sig field against the static Noise key received in the handshake. If the signature is invalid, the connection MUST be terminated immediately.

Handshake Pattern

Noise defines twelve fundamental interactive handshake patterns for exchanging public keys between parties and performing Diffie-Hellman computations. The patterns are named according to whether static keypairs are used, and if so, by what means each party gains knowledge of the other's static public key.

noise-libp2p supports the XX handshake pattern, which provides mutual authentication and encryption of static keys and handshake payloads and is resistant to replay attacks.

Prior revisions of this spec included a compound protocol involving the IK and XXfallback patterns, but this was removed due to the benefits not justifying the considerable additional complexity.

XX

XX:
  -> e
  <- e, ee, s, es
  -> s, se

In the XX handshake pattern, both parties send their static Noise public keys to the other party.

The first handshake message contains the initiator's ephemeral public key, which allows subsequent key exchanges and message payloads to be encrypted.

The second and third handshake messages include a handshake payload, which contains a signature authenticating the sender's static Noise key as described in the Static Key Authentication section and may include other internal libp2p data.

The XX handshake MUST be supported by noise-libp2p implementations.

Noise Extensions

Since the Noise handshake pattern itself doesn't define any extensibility mechanism, this specification defines an extension registry, modeled after RFC 6066 (for TLS) and RFC 9000 (for QUIC).

Note that this document only defines the NoiseExtensions code points, and leaves it up to the protocol using a given code point to define the semantics associated with it.

Code points above 1024 MAY be used for experimentation. Code points up to this value MUST be registered in this document before deployment.

Cryptographic Primitives

The Noise framework allows protocol designers to choose from a small set of Diffie-Hellman key exchange functions, symmetric ciphers, and hash functions.

For simplicity, and to avoid the need to explicitly negotiate Noise protocols, noise-libp2p defines a single "cipher suite".

noise-libp2p implementations MUST support the 25519 DH functions, ChaChaPoly cipher functions, and SHA256 hash function as defined in the Noise spec.

Noise Protocol Name

A Noise HandshakeState is initialized with the hash of a Noise protocol name, which defines the handshake pattern and cipher suite used. Because noise-libp2p supports a single cipher suite and handshake pattern, the Noise protocol name MUST be: Noise_XX_25519_ChaChaPoly_SHA256.

Wire Format

noise-libp2p defines a simple message framing format for sending data back and forth over the underlying transport connection.

All data is segmented into messages with the following structure:

| noise_message_len | noise_message   |
| 2 bytes           | variable length |

The noise_message_len field stores the length in bytes of the noise_message field, encoded as a 16-bit big-endian unsigned integer.

The noise_message field contains a Noise Message as defined in the Noise spec, which has a maximum length of 65535 bytes.
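
A minimal Go sketch of this framing (the function names are illustrative):

package noisewire

import (
	"encoding/binary"
	"errors"
	"io"
)

// writeNoiseMessage frames one Noise message with its 2-byte
// big-endian length prefix.
func writeNoiseMessage(w io.Writer, msg []byte) error {
	if len(msg) > 65535 {
		return errors.New("noise message exceeds 65535 bytes")
	}
	var l [2]byte
	binary.BigEndian.PutUint16(l[:], uint16(len(msg)))
	if _, err := w.Write(l[:]); err != nil {
		return err
	}
	_, err := w.Write(msg)
	return err
}

// readNoiseMessage reads one length-prefixed Noise message.
func readNoiseMessage(r io.Reader) ([]byte, error) {
	var l [2]byte
	if _, err := io.ReadFull(r, l[:]); err != nil {
		return nil, err
	}
	msg := make([]byte, binary.BigEndian.Uint16(l[:]))
	_, err := io.ReadFull(r, msg)
	return msg, err
}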

During the handshake phase, noise_message will be a Noise handshake message. Noise handshake messages may contain encrypted payloads. If so, they will have the structure described in the Encrypted Payloads section.

After the handshake completes, noise_message will be a Noise transport message, which is defined as an AEAD ciphertext consisting of an encrypted payload plus 16 bytes of authentication data.

Encryption and I/O

During the handshake phase, the initiator (Alice) will initialize a Noise HandshakeState object with the Noise protocol name Noise_XX_25519_ChaChaPoly_SHA256.

Alice and Bob exchange handshake messages, during which they authenticate each other's static Noise keys. Handshake messages are framed as described in the Wire Format section, and if a handshake message contains a payload, it will have the structure described in Encrypted Payloads.

Following a successful handshake, each peer will possess two Noise CipherState objects. One is used to encrypt outgoing data to the remote party, and the other is used to decrypt incoming data.

After the handshake, peers continue to exchange messages in the format described in the Wire Format section. However, instead of containing a Noise handshake message, the contents of the noise_message field will be a Noise transport message, which is an AEAD ciphertext consisting of an encrypted payload plus 16 bytes of authentication data, as defined in the Noise spec.

In the unlikely event that peers exchange more than 2^64 - 1 messages, they MUST terminate the connection to avoid reusing nonces, in accordance with the Noise spec.

Design Considerations

No Negotiation of Noise Protocols

Supporting a single cipher suite allows us to avoid negotiating which concrete Noise protocol to use for a given connection. This removes a huge source of incidental complexity and makes implementations much simpler. Changes to the cipher suite will require a new version of noise-libp2p, but this should happen infrequently enough to be a non-issue.

Users who require cipher agility are encouraged to adopt TLS 1.3, which supports negotiation of cipher suites.

Why the XX handshake pattern?

An earlier draft of this spec included a compound protocol called Noise Pipes that used the IK and XXfallback handshake patterns to enable a slightly more efficient handshake when the remote peer's static Noise key is known a priori. During development of the Go and JavaScript implementations, this was determined to add too much complexity to be worthwhile, and the benefit turned out to be smaller than originally hoped. See the discussion on github for more context.

Why ChaChaPoly?

We debated supporting AESGCM in addition to or instead of ChaChaPoly. The desire for a simple protocol without explicit negotiation of ciphers and handshake patterns led us to support a single cipher, so the question became which to support.

While AES has broad hardware support that can lead to significant performance improvements on some platforms, secure and performant software implementations are hard to come by. To avoid excluding runtime platforms without hardware AES support, we chose the ChaChaPoly cipher, which is possible to implement in software on all platforms.

Distinct Noise and Identity Keys

Using a separate keypair for Noise adds complexity to the protocol by requiring signature validation and transmission of libp2p public keys during the handshake.

However, none of the key types supported by libp2p for use as identity keys are fully compatible with Noise. While it is possible to convert an ed25519 key into the X25519 format used with Noise, it is not possible to do the reverse. This makes it difficult to use any libp2p identity key directly as the Noise static key.

Also, Noise recommends only using Noise static keys with other Noise protocols using the same hash function. Since we can't guarantee that users won't also use their libp2p identity keys in other contexts (e.g. SECIO handshakes, signing pubsub messages, etc), requiring separate keys seems prudent.

Why Not Noise Signatures?

Since we're using signatures for authentication, the Noise Signatures extension is a natural candidate for adoption.

Unfortunately, the Noise Signatures spec requires both parties to use the same signature algorithm, which would prevent peers with different identity key types from completing a Noise Signatures handshake. Also, only Ed25519 signatures are currently supported by the spec, while libp2p identity keys may be of other, unsupported types like RSA.

Changelog

r1 - 2020-01-20

  • Renamed protobuf fields
  • Edited for clarity

r2 - 2020-03-30

  • Removed Noise Pipes and related handshake patterns
  • Removed padding within encrypted payloads

r3 - 2022-09-20

  • Change Protobuf definition to proto2 (due to the layout of the protobuf used, this is a backwards-compatible change)

r4 - 2022-09-22

  • Add Noise extension registry

Plaintext Secure Channel

An insecure connection handshake for non-production environments.

⚠️ Intended only for debugging and interoperability testing purposes. ⚠️

Lifecycle Stage | Maturity      | Status | Latest Revision
1A              | Working Draft | Active | r0, 2019-05-27

Authors: @yusefnapora

Interest Group: @raulk, @Warchant, @Stebalien, @mhchia

See the lifecycle document for context about the maturity level and spec status.

Overview

Secure communications are a key feature of libp2p, and encrypted transport is configured by default in libp2p implementations to encourage security for all production traffic. However, there are some use cases such as testing in which encryption is unnecessary. For such cases, the plaintext "security" protocol can be used. By conforming to the same interface as real security adapters like SECIO and TLS, the plaintext module can be used as a drop-in replacement when encryption is not needed.

As the name suggests, the plaintext security module does no encryption, and all data is transmitted in plain text. However, peer identity in libp2p is derived from public keys, even when peers are communicating over an insecure channel. For this reason, peers using the plaintext protocol still exchange public keys and peer ids when connecting to each other.

It bears repeating that the plaintext protocol was designed for development and testing ONLY, and MUST NOT be used in production environments. No encryption or authentication of any kind is provided. Also note that enabling the plaintext module will effectively nullify the security guarantees of any other security modules that may be enabled, as an attacker will be able to negotiate a plaintext connection at any time.

This document describes the exchange of peer ids and keys that occurs when initiating a plaintext connection. This exchange happens after the plaintext protocol has been negotiated as part of the connection upgrade process.

Protocol Id and Version History

The plaintext protocol described in this document has the protocol id of /plaintext/2.0.0.

An earlier version, /plaintext/1.0.0, was implemented in several languages, but it did not include any exchange of public keys or peer ids. This led to undefined behavior in parts of libp2p that assumed the presence of a peer id.

As version 1.0.0 had no associated wire protocol, it was never specified.

Messages

Peers exchange their peer id and public key encoded in a protobuf message using the protobuf version 2 syntax.

syntax = "proto2";

message Exchange {
  optional bytes id = 1;
  optional PublicKey pubkey = 2;
}

The id field contains the peer's id encoded as a multihash, using the binary multihash encoding.

The PublicKey message uses the same definition specified in the peer id spec. For reference, it is defined as follows:

syntax = "proto2";

enum KeyType {
	RSA = 0;
	Ed25519 = 1;
	Secp256k1 = 2;
	ECDSA = 3;
}

message PublicKey {
	required KeyType Type = 1;
	required bytes Data = 2;
}

The encoding of the Data field in the PublicKey message is specified in the key encoding section of the peer id spec.

Protocol

Prerequisites

Prior to undertaking the exchange described below, it is assumed that the two parties have already established a dedicated bidirectional channel, and that they have negotiated the plaintext protocol id as described in the protocol negotiation section of the connection establishment spec.

Message Framing

All handshake messages sent over the wire are prefixed with the message length in bytes, encoded as an unsigned variable length integer as defined by the multiformats unsigned-varint spec. Actual payloads exchanged once the plaintext handshake has completed are NOT prefixed with their lengths, but sent as-is.
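
For example, a non-normative Go sketch of sending one length-prefixed handshake message (the function name is hypothetical) might be:

package example

import (
	"encoding/binary"
	"io"
)

// writeHandshakeMsg prefixes the marshalled Exchange message with its
// length as an unsigned varint, then writes the payload as-is.
func writeHandshakeMsg(w io.Writer, payload []byte) error {
	var prefix [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(prefix[:], uint64(len(payload)))
	if _, err := w.Write(prefix[:n]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}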

Exchange

Once the plaintext protocol has been negotiated, both peers immediately send an Exchange message containing their peer id and public key.

Upon receiving an Exchange message from the remote peer, each side will validate that the given peer id is consistent with the given public key by deriving a peer id from the key and asserting that it's a match with the id field in the Exchange message.
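
A non-normative sketch of this check in Go, using go-libp2p's core packages (import paths vary between versions; the function name is hypothetical):

package example

import (
	"bytes"
	"errors"

	ic "github.com/libp2p/go-libp2p/core/crypto"
	"github.com/libp2p/go-libp2p/core/peer"
)

// validateExchange derives a peer id from the public key received in
// the Exchange message and checks it against the message's id field.
func validateExchange(id []byte, pub ic.PubKey) error {
	derived, err := peer.IDFromPublicKey(pub)
	if err != nil {
		return err
	}
	if !bytes.Equal([]byte(derived), id) {
		return errors.New("peer id does not match public key")
	}
	return nil
}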

Dialing a peer in libp2p requires knowledge of the listening peer's peer id. As a result, the dialing peer ALSO verifies that the peer id presented by the listening peer matches the peer id that they attempted to dial. As the listening peer has no prior knowledge of the dialer's id, only one peer is able to perform this additional check.

Once each side has received the Exchange message, they may store the public key and peer id for the remote peer in their local peer metadata storage (e.g. go-libp2p's peerstore, or js-libp2p's peer-book).

Following delivery and verification of Exchange messages, the plaintext protocol is complete. Should a verification or timeout error occur, the connection MUST be terminated abruptly.

The connection is now ready for insecure and unauthenticated data exchange. While we do exchange public keys upfront, replay attacks and forgery are trivial, and we perform no authentication of messages. Therefore, we reiterate the unsuitability of /plaintext/2.0.0 for production usage.

Pre-shared Key Based Private Networks in libp2p

Lifecycle Stage | Maturity       | Status | Latest Revision
3A              | Recommendation | Active | r0, 2017-02-17

Authors: @Kubuxu

Interest Group: @yusefnapora, @jacobheun, @lgierth, @daviddias

See the lifecycle document for context about the maturity level and spec status.


This document describes the first version of private networks (PN) featured in libp2p.

For the first implementation, only pre-shared key (PSK) functionality is available, as the Public Key Infrastructure approach is much more complex and requires more technical preparation.

It was implemented as an additional encryption layer before any IPFS traffic, and was designed to leak the absolute minimum of information on its own. All traffic leaving the node inside a PN is encrypted and there is no characteristic handshake.

Interface

An IPFS node or libp2p swarm is either in a public network, or it is a member of a private network.

A private network is defined by the 256-bit secret key, which has to be known and used by all members inside the network.

In the case of an IPFS node, this key is stored inside the IPFS repo in a file named swarm.key. The file uses a path-based multicodec where, for now, the codec that is defined and used is /key/swarm/psk/1.0.0/. The codec expects the next path-based multicodec to define the base encoding for the rest of the file (/bin/, /base16/, /base64/), which is the 256-bit PSK. The key has to be exactly 256-bits (32 bytes) long.
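
For example, a swarm.key file using the /base16/ encoding could look like the following (the key shown is a made-up placeholder, not a real secret):

/key/swarm/psk/1.0.0/
/base16/
b014416087025d9e34862cedb87468f2a2e0b32aeda32d3c18e85f0a8b2b3c9a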

Security Guarantees

Nodes of different private networks must not be able to connect to each other. This extends to a node in a private network attempting to connect to a node in the public network. This means that no information exchange, apart from the handshakes required for private network authentication, should take place.

This guarantee is only provided when knowledge of the private key is limited to trusted parties.

Safeguard

In the libp2p swarm there is a safeguard implemented that prevents it from dialing with no PSK set, which would mean the node would connect with the rest of the public network.

It can be enabled by setting LIBP2P_FORCE_PNET=1 in the environment before starting IPFS or any other libp2p based application. In the event that the node is trying to connect with no PSK, thus connecting to the public network, an error will be raised and the process will be aborted.

Cryptography of Private Networks

The cryptography behind PNs was chosen to have a minimal resource overhead but to maintain security guarantees that connections should be established with and only with nodes sharing the same PSK. We have decided to encrypt all traffic, thus reducing the possible attack surface of protocols that are part of IPFS/libp2p.

It is important to mention that traffic in a private network is encrypted twice: once with the PSK, and once with the regular cryptographic stack for libp2p (secio or, in the future, TLS 1.3). This allows the PSK layer to provide only the security guarantee above without, for example, having to worry about the authenticity of the data. Possible replay attacks will be caught by the regular cryptographic layer above the PN layer.

Choosing stream ciphers

We considered three stream ciphers: AES-CTR, Salsa20 and ChaCha. Salsa20 and ChaCha are very similar ciphers, as ChaCha is fully based on Salsa20. Unfortunately, due to ChaCha's lack of adoption, we were not able to find vetted implementations in the relevant programming languages, so the final consideration was between AES-CTR and Salsa20.

There are three main reasons why we decided on Salsa20 over AES-CTR:

  1. We plan on using the same PSK among many nodes, so the nonce has to be randomized, and for security a nonce collision should be a very unlikely event (a frequently used bound: 2^-32). The Salsa20 family provides the XSalsa20 [1] stream cipher with a 192-bit nonce. In comparison, the usual mode of operation for AES-CTR uses a 96-bit nonce, which gives only about 7.9e28 possible nonces, and only about 6.0e9 nonces form a birthday problem set with a collision probability higher than 2^-32. In the case of XSalsa20, over 1.7e24 nonces would have to be generated to reach the same collision probability.
  2. The stream counter for the Salsa20 family is 64 bits long, which in combination with the 64-byte block size gives a total stream length of 2^70 bytes (1 ZiB). This is more than will ever be transmitted through any connection. AES-CTR (in its usual configuration of a 96-bit nonce and 32-bit counter), with a block size of 16 bytes, results in a stream length of 2^36 bytes, which is only 64 GiB; re-keying (re-nonceing in our case) would therefore be necessary. As the nonce space is already much smaller for AES, re-nonceing would further increase the nonce collision risk.
  3. Speed was the last factor, and a very important one, as the encryption layer adds overhead to all traffic. In our benchmarks, Salsa20 performs twice as well as AES-CTR on recent Intel 6th generation processors and on ARM-based processors (800 MB/s vs 400 MB/s, and 13.5 MB/s vs 7 MB/s respectively).

Algorithm

The algorithm is very simple: a new nonce is created and exchanged with the other party, and an XSalsa20 stream is initialized on each side. After the 24 bytes of random data (the nonce), all traffic is encrypted using XSalsa20. If the nodes are not using the same PSK, decryption will produce scrambled data, which prevents any data exchange at higher layers.

On the writing side:

// (⊕ denotes bytewise xor operation)
SS = <shared secret>
N = randomNonce(24) // 24 byte nonce
write(out, N)       // send nonce
S20 = newXSalsa20Stream(SS, N)
for data = <data to send> {
  write(out, (data ⊕ S20))
}

On the reading side:

// (⊕ denotes bytewise xor operation)
SS = <shared secret>
N = byte[24]        // 24 byte nonce
read(in, N)         // read nonce
S20 = newXSalsa20Stream(SS, N)
for data = read(in) {
  process(data ⊕ S20)
}

A pair of reading and writing modules is created for each connection.
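
As a concrete, non-normative illustration, the writing side could be implemented in Go with golang.org/x/crypto/salsa20, which selects XSalsa20 when given a 24-byte nonce. This simplified sketch encrypts a single buffer; a real implementation must keep the keystream position across successive writes on the connection:

package example

import (
	"crypto/rand"
	"io"

	"golang.org/x/crypto/salsa20"
)

// writeEncrypted sends a fresh 24-byte nonce in the clear, then the
// payload XORed with the XSalsa20 keystream derived from the PSK.
func writeEncrypted(out io.Writer, psk *[32]byte, payload []byte) error {
	nonce := make([]byte, 24)
	if _, err := rand.Read(nonce); err != nil {
		return err
	}
	if _, err := out.Write(nonce); err != nil {
		return err
	}
	ct := make([]byte, len(payload))
	salsa20.XORKeyStream(ct, payload, nonce, psk) // 24-byte nonce selects XSalsa20
	_, err := out.Write(ct)
	return err
}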

PubSub interface for libp2p

Generalized publish/subscribe interface for libp2p.

Lifecycle Stage | Maturity       | Status | Latest Revision
3A              | Recommendation | Active | r3, 2020-09-25

Authors: @whyrusleeping, @protolambda, @raulk, @vyzo.

Interest Group: @yusefnapora, @raulk, @vyzo, @Stebalien, @jamesray1, @vasco-santos

See the lifecycle document for context about the maturity level and spec status.

Overview

This is the specification for generalized pubsub over libp2p. Pubsub in libp2p is currently still experimental and this specification is subject to change. This document does not go over specific implementation of pubsub routing algorithms, it merely describes the common wire format that implementations will use.

libp2p pubsub currently uses reliable, ordered streams between peers. It assumes that each peer is certain of the identity of each peer it is communicating with. It does not assume that messages between peers are encrypted; however, encryption is enabled by default on libp2p streams.

You can find information about the PubSub research and notes in the following repos:

  • https://github.com/libp2p/research-pubsub
  • https://github.com/libp2p/pubsub-notes

Stream management

Data should be exchanged between peers using two separately negotiated streams, one inbound, one outbound. These streams are treated as unidirectional streams. The outbound stream is used only to write data. The inbound stream is used only to read data.

The RPC

All communication between peers happens in the form of exchanging protobuf RPC messages between participating peers.

The RPC protobuf is as follows:

syntax = "proto2";
message RPC {
	repeated SubOpts subscriptions = 1;
	repeated Message publish = 2;

	message SubOpts {
		optional bool subscribe = 1;
		optional string topicid = 2;
	}
}

This is a relatively simple message containing zero or more subscription messages, and zero or more content messages. The subscription messages contain a topicid string that specifies the topic, and a boolean signifying whether to subscribe or unsubscribe to the given topic. True signifies 'subscribe' and false signifies 'unsubscribe'.
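
For example, an RPC that subscribes to one topic and unsubscribes from another (rendered here in protobuf text format, with hypothetical topic names) would carry:

RPC {
  subscriptions { subscribe: true  topicid: "chat-room-1" }
  subscriptions { subscribe: false topicid: "chat-room-2" }
}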

The Message

The RPC message can contain zero or more messages of type 'Message'. The Message protobuf looks like this:

syntax = "proto2";
message Message {
	optional string from = 1;
	optional bytes data = 2;
	optional bytes seqno = 3;
	required string topic = 4;
	optional bytes signature = 5;
	optional bytes key = 6;
}

The optional fields may be omitted, depending on the signature policy and message ID function.

The from field (optional) denotes the author of the message. This is the peer who initially authored the message, and NOT the peer who propagated it. Thus, as the message is routed through a swarm of pubsubbing peers, the original authorship is preserved.

The seqno field (optional) is a 64-bit big-endian uint that is a linearly increasing number that is unique among messages originating from each given peer. No two messages on a pubsub topic from the same peer should have the same seqno value, however messages from different peers may have the same sequence number. In other words, this number is not globally unique. It is used in conjunction with from to derive a unique message_id (in the default configuration).
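
For instance, a publisher might produce seqno values like this in Go (a non-normative sketch; the counter is per local peer):

package example

import (
	"encoding/binary"
	"sync/atomic"
)

var seqnoCounter uint64 // monotonically increasing, one per local peer

// nextSeqno returns the next sequence number as a 64-bit big-endian value.
func nextSeqno() []byte {
	seqno := make([]byte, 8)
	binary.BigEndian.PutUint64(seqno, atomic.AddUint64(&seqnoCounter, 1))
	return seqno
}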

Henceforth, we define the term origin-stamped messaging to refer to messages whose from and seqno fields are populated.

The data (optional) field is an opaque blob of data representing the payload. It can contain any data that the publisher wants it to.

The topic field specifies a topic that this message is being published to.

The signature and key fields (optional) are used for message signing, if such feature is enabled, as explained below.

The size of the Message should be limited, say to 1 MiB, but could also be made configurable (for more information see issue 118); messages over this size should be rejected. Note that for applications where state such as messages is stored, such as blockchains, it is suggested to have some kind of storage economics (see e.g. here, here and here).

Message Identification

Pubsub requires messages to be uniquely identified via a message ID. This enables a wide range of processes like de-duplication, tracking, scoring, circuit-breaking, and others.

The message_id is calculated from the Message struct.

By default, origin-stamping is in force. This strategy relies on the string concatenation of the from and seqno fields, to uniquely identify a message based on the author.

Alternatively, a user-defined message_id_fn may be supplied, where message_id_fn(Message) => message_id. Such a function could compute the hash of the data field within the Message, and thus one could reify content-addressed messaging.

If fabricated collisions are not a concern, or are difficult enough to produce within the window in which the message is relevant, a message_id based on a short digest of the inputs may benefit performance.
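
A non-normative Go sketch of both strategies, using a minimal stand-in Message struct rather than the generated protobuf type:

package example

import "crypto/sha256"

// Message is a simplified stand-in for the protobuf-generated type.
type Message struct {
	From  []byte
	Seqno []byte
	Data  []byte
}

// defaultMessageID implements the default origin-stamped strategy:
// the concatenation of the from and seqno fields.
func defaultMessageID(m *Message) string {
	return string(m.From) + string(m.Seqno)
}

// contentMessageID is an example user-defined message_id_fn deriving
// the id from a hash of the payload: content-addressed messaging.
func contentMessageID(m *Message) string {
	h := sha256.Sum256(m.Data)
	return string(h[:])
}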

[[ Margin note ]]: There's a potential caveat with using hashes instead of seqnos: the peer won't be able to send identical messages (e.g. keepalives) within the timecache interval, as they will get treated as duplicates. This consequence may or may not be relevant to the application at hand. Reference: #116.

Note that the availability of these fields on the Message object will depend on the signature policy configured for the topic.

Whichever strategy is chosen, it is crucial that all peers participating in a topic implement identical message ID calculation logic, or the topic will malfunction.

Message Signing

Signature behavior is configured in two axes: signature creation, and signature verification.

Signature creation. There are two configurations possible:

  • Sign: when publishing a message, perform origin-stamping and produce a signature.
  • NoSign: when publishing a message, do not perform origin-stamping and do not produce a signature.

For signing purposes, the signature and key fields are used:

  • The signature field contains the signature.
  • The key field contains the signing key when it cannot be inlined in the source peer ID (from). When present, it must match the peer ID.

The signature is computed over the marshalled message protobuf excluding the signature field itself.

This includes any fields that are not recognized, but still included in the marshalled data.

The protobuf blob is prefixed by the string libp2p-pubsub: before signing.
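
A non-normative sketch of this scheme, assuming an Ed25519 identity key and taking the already-marshalled message (with the signature field excluded) as input:

package example

import "crypto/ed25519"

// signMessage signs the marshalled message (signature field excluded),
// prefixed with the string "libp2p-pubsub:".
func signMessage(priv ed25519.PrivateKey, marshalled []byte) []byte {
	return ed25519.Sign(priv, append([]byte("libp2p-pubsub:"), marshalled...))
}

// verifyMessage checks a signature produced by signMessage.
func verifyMessage(pub ed25519.PublicKey, marshalled, sig []byte) bool {
	return ed25519.Verify(pub, append([]byte("libp2p-pubsub:"), marshalled...), sig)
}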

[[ Margin note: ]] Protobuf serialization is non-deterministic/non-canonical, and the same data structure may result in different, valid serialised bytes across implementations, as well as other issues. In the near future, the signature creation and verification algorithm will be made deterministic.

Signature verification. There are two configurations possible:

  • Strict: a signature is either always expected or never expected, depending on the policy.
  • Lax (legacy, insecure, nondeterministic, to be deprecated): accept a signed message if the signature verification passes, or if it's unsigned.

When signature validation fails for a signed message, the implementation must drop the message and omit propagation. Locally, it may treat this event in whichever manner it wishes (e.g. logging, penalization, etc.).

Signature Policy Options

The usage of the signature, key, from, and seqno fields in Message is configurable per topic.

[[ Implementation note ]]: At the time of writing this section, go-libp2p-pubsub (reference implementation of this spec) allows for configuring the signature policy at the global pubsub instance level. This needs to be pushed down to topic-level configuration. Other implementations should support topic-level configuration, as this spec mandates.

The intersection of signing behaviours across the two axes (signature creation and signature verification) gives way to four signature policy options:

  • StrictSign, StrictNoSign. Deterministic, usage encouraged.
  • LaxSign, LaxNoSign. Non-deterministic, legacy, usage discouraged. Mostly for backwards compatibility. Will be deprecated. If the implementation decides to support these, their use should be discouraged through deprecation warnings.

StrictSign option

On the producing side:

  • Build messages with the signature, key (from may be enough for certain inlineable public key types), from and seqno fields.

On the consuming side:

  • Enforce the fields to be present, reject otherwise.
  • Propagate only if the fields are valid and signature can be verified, reject otherwise.

StrictNoSign option

On the producing side:

  • Build messages without the signature, key, from and seqno fields.
  • The corresponding protobuf key-value pairs are absent from the marshalled message, not just empty.

On the consuming side:

  • Enforce the fields to be absent, reject otherwise.
  • Propagate only if the fields are absent, reject otherwise.
  • A message_id function will not be able to use the above fields, and should instead rely on the data field. A commonplace strategy is to calculate a hash.

LaxSign legacy option

Not required for backwards-compatibility. Considered insecure, nevertheless defined for completeness.

Always sign, and verify incoming signatures, but accept unsigned messages.

On the producing side:

  • Build messages with the signature, key (from may be enough), from and seqno fields.

On the consuming side:

  • The signature may be absent, in which case it is not verified.
  • If a signature is present, verify it, and reject the message if the signature is invalid.

LaxNoSign option

Previous default for 'no signature verification' mode.

Do not sign nor origin-stamp, but verify incoming signatures, and accept unsigned messages.

On the producing side:

  • Build messages without the signature, key, from and seqno fields.

On the consuming side:

  • Accept and propagate messages carrying the above fields.
  • If a signature is present, verify it, and reject the message if the signature is invalid.

[[ Margin note: ]] For content-addressed messaging, StrictNoSign is the most appropriate policy option, coupled with a user-defined message_id_fn, and a validator function to verify protocol-defined signatures.

When publisher anonymity is being sought, StrictNoSign is also the most appropriate policy, as it refrains from outputting the from and seqno fields.

The Topic Descriptor

The topic descriptor message is used to define various options and parameters of a topic. It currently specifies the topic's human readable name, its authentication options, and its encryption options. The AuthOpts and EncOpts of the topic descriptor message are not used in current implementations, but may be used in future. For clarity, this is added as a comment in the file, and may be removed once used.

The TopicDescriptor protobuf is as follows:

syntax = "proto2";
message TopicDescriptor {
	optional string name = 1;
	// AuthOpts and EncOpts are unused as of Oct 2018, but
	// are planned to be used in future.
	optional AuthOpts auth = 2;
	optional EncOpts enc = 3;

	message AuthOpts {
		optional AuthMode mode = 1;
		repeated bytes keys = 2;

		enum AuthMode {
			NONE = 0;
			KEY = 1;
			WOT = 2;
		}
	}

	message EncOpts {
		optional EncMode mode = 1;
		repeated bytes keyHashes = 2;

		enum EncMode {
			NONE = 0;
			SHAREDKEY = 1;
			WOT = 2;
		}
	}
}

The name field is a string used to identify or mark the topic. It can be descriptive or random or anything that the creator chooses.

Note that instead of using TopicDescriptor.name, for privacy reasons the TopicDescriptor struct may be hashed, and used as the topic ID. Another option is to use a CID as a topic ID. While a consensus has not been reached, for forwards and backwards compatibility, using an enum TopicID that allows custom types in variants (i.e. Name, hashedTopicDescriptor, CID) may be the most suitable option if it is available within an implementation's language (otherwise it would be implementation defined).

The auth field specifies how authentication will work for this topic. Only authenticated peers may publish to a given topic. See 'AuthOpts' below for details.

The enc field specifies how messages published to this topic will be encrypted. See 'EncOpts' below for details.

AuthOpts

The AuthOpts message describes an authentication scheme. The mode field specifies which scheme to use, and the keys field is an array of keys. The meaning of the keys field is defined by the selected AuthMode.

There are currently three options defined for the AuthMode enum:

AuthMode 'NONE'

No authentication, anyone may publish to this topic.

AuthMode 'KEY'

Only peers whose peerIDs are listed in the keys array may publish to this topic, messages from any other peer should be dropped.

AuthMode 'WOT'

Web Of Trust: any trusted peer may publish to the topic. A trusted peer is one whose peerID is listed in the keys array, or any peer who is 'trusted' by another trusted peer. The mechanism of signifying trust in another peer is yet to be defined.

EncOpts

The EncOpts message describes an encryption scheme for messages in a given topic. The mode field denotes which encryption scheme will be used, and the keyHashes field specifies a set of hashes of keys whose purpose may be defined by the selected mode.

There are currently three options defined for the EncMode enum:

EncMode 'NONE'

Messages are not encrypted, anyone can read them.

EncMode 'SHAREDKEY'

Messages are encrypted with a preshared key. The salted hash of the key used is denoted in the keyHashes field of the EncOpts message. The mechanism for sharing the keys and salts is undefined.

EncMode 'WOT'

Web Of Trust publishing. Messages are encrypted with some certificate or certificate chain shared amongst trusted peers. (Spec writer's note: this is the least clearly defined option and my description here may be wildly incorrect, needs checking).

Topic Validation

Implementations MUST support attaching validators to topics.

Validators have access to the Message and can apply any logic to determine its validity. When propagating a message for a topic, implementations will invoke all validators attached to that topic, and will continue propagation if, and only if, all validations pass.

In its simplest form, a validator is a function with signature (peer.ID, *Message) => bool, where the return value is true if validation passes, and false otherwise.
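
For example, a non-normative Go sketch of a validator that rejects oversized payloads (the types are simplified stand-ins for an implementation's own):

package example

// PeerID and Message are simplified stand-in types.
type PeerID string

type Message struct {
	Data []byte
}

// Validator returns true if the message is valid.
type Validator func(from PeerID, msg *Message) bool

// maxSizeValidator builds a validator that rejects messages whose
// payload exceeds the given limit.
func maxSizeValidator(limit int) Validator {
	return func(from PeerID, msg *Message) bool {
		return len(msg.Data) <= limit
	}
}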

Local handling of failed validation is left up to the implementation (e.g. logging).

Implementations MAY allow dynamically adding and removing validators at runtime.

gossipsub: An extensible baseline pubsub protocol

Gossipsub logo

Gossipsub is an extensible baseline pubsub protocol, based on randomized topic meshes and gossip. It is a general purpose pubsub protocol with moderate amplification factors and good scaling properties. The protocol is designed to be extensible by more specialized routers, which may add protocol messages and gossip in order to provide behaviour optimized for specific application profiles.

Specification

  • gossipsub-v1.0: v1.0 of the gossipsub protocol. This is a revised specification using more normative language. The original v1.0 specification is here, and is still a good read.
  • gossipsub-v1.1: v1.1 of the gossipsub protocol.
  • gossipsub-v1.2: v1.2 of the gossipsub protocol. This adds the IDONTWANT control message to the spec.
  • (not in use) episub: a research note on a protocol building on top of gossipsub to implement epidemic broadcast trees.


Episub: Proximity Aware Epidemic PubSub for libp2p

Lifecycle Stage | Maturity      | Status | Latest Revision
1A              | Working Draft | Active | r1, 2018-06-28

Authors: @vyzo

Interest Group: @yusefnapora, @raulk, @vyzo, @Stebalien, @jamesray1, @vasco-santos

Author's note:

  • This is based on an earlier research draft about an epidemic broadcast protocol for libp2p pubsub. It serves as reference for the design of episub, an extended gossipsub router optimized for single source multicast and scenarios with a few fixed sources broadcasting to a large number of clients in a topic.

Introduction

This document proposes a successor to the FloodSub protocol: a topic pubsub protocol based on the following papers:

  1. Epidemic Broadcast Trees, 2007 (PDF, DOI: 10.1109/SRDS.2007.27)
  2. HyParView: a membership protocol for reliable gossip-based broadcast, 2007 (PDF, DOI: 10.1109/DSN.2007.56)
  3. GoCast: Gossip-enhanced Overlay Multicast for Fast and Dependable Group Communication, 2005 (PDF)

The protocol implements the Plumtree algorithm from [1], with membership managed using HyParView[2] and proximity-aware overlay construction based on the scheme proposed in GoCast[3]. The marrying of proximity awareness from GoCast with Plumtree was suggested by the original authors of Plumtree in [1].

The protocol has two distinct components: the membership management protocol (subscribe) and the broadcast protocol (publish).

The membership management protocol (Peer Sampling Service in [1]) maintains two lists of peers that are subscribed to the topic. The active list contains peers with active broadcast connections. The passive list is a partial view of the overlay at large, and is used for directing new joins, replacing failed peers in the active list and optimizing the overlay. The active list is symmetric, meaning that if a node P has node Q in its active list, then Q also has P in its active list.

The broadcast protocol lazily constructs and optimizes a multicast tree using epidemic broadcast. The peer splits the active list into two sets of peers: the eager peers and the lazy peers. The eager peers form the edges of the multicast tree, while the lazy peers form a gossip mesh supporting the multicast tree.

When a new message is broadcast, it is pushed to the eager peers, while lazy peers only receive message summaries and have to pull missing messages. Initially, all peers in the active list are eager forming a connected mesh. As messages propagate, peers prune eager links when receiving duplicate messages, thus constructing a multicast tree. The tree is repaired when peers receive lazy messages that were not propagated via eager links by grafting an eager link on top of a lazy one.

In steady state, the protocol optimizes the multicast tree in two ways. Whenever a message is received via both an eager link and a lazy message summary, its hop count is compared. When the eager transmission hop count exceeds the lazy hop count by some threshold, then the lazy link can replace the eager link as a tree edge, reducing latency as measured in hops. In addition, active peers may be periodically replaced by passive peers with better network proximity, thus reducing propagation latency in time.

Membership Management Protocol

Design Parameters for View Sizes

The size of the active and passive lists is a design parameter in HyParView, dependent on the size N of the overlay:

A(N) = log(N) + c
P(N) = k * A(N)

The authors in [2] select c=1 and k=6, while fixing N to a target size of 10,000 nodes. Long term, the membership list sizes should be dynamically adjusted based on overlay size estimations. For practical purposes, we can start with a large target size, and introduce dynamic sizing later in the development cycle.

A second parameter that needs to be adjusted is the number of random and nearby neighbors in A for proximity optimizations. In [3], the authors use two parameters C_rand and C_near to set the size of the neighbor list such that

A = C_rand + C_near

In their analysis they fix C_rand=1 and C_near=5, with their rationale being that a single random link is sufficient to connect the overlay, at least in bimodal distributions, while overlays without any random links may fail to connect at all. Nonetheless, the random link parameter is directly related to the connectivity of the overlay. A higher C_rand ensures connectivity with high probability and fault tolerance. The fault-tolerance and connectivity properties of HyParView stem from the random overlay structure, so in order to preserve them and still optimize for proximity, we need to set

C_rand = log(N)

For a real-world implementation at the scale of IPFS, we can use the following starting values:

N = 10,000
C_rand = 4
C_near = 3
A = 7
P = 42

Joining the Overlay

In order to subscribe to the topic, a node P needs to locate one or more nodes in the topic and join the overlay. The initial contact nodes can be obtained via rendezvous with DHT provider records.

Once a list of initial contact nodes has been obtained, the node selects nodes at random and sends them a GETNODES message in order to obtain an up-to-date view of the overlay from the passive list of a subscribed node, regardless of the age of the provider records. Once an up-to-date passive view of the overlay has been obtained, the node proceeds to join.

In order to join, it picks C_rand nodes at random and sends JOIN messages to them with some initial TTL set as a design parameter.

The JOIN message propagates with a random walk until a node is willing to accept it or the TTL expires. Upon receiving a JOIN message, a node Q evaluates it with the following criteria:

  • Q tries to open a connection to P. If the connection cannot be opened (e.g. because of NAT), then it checks the TTL of the message. If it is 0, the request is dropped, otherwise Q decrements the TTL and forwards the message to a random node in its active list.
  • If the TTL of the request is 0 or if the size of Q's active list is less than A, it accepts the join, adds P to its active list and sends a NEIGHBOR message.
  • Otherwise it decrements the TTL and forwards the message to a random node in its active list.

When Q accepts P as a new neighbor, it also sends a FORWARDJOIN message to a random node in its active list. The FORWARDJOIN propagates with a random walk until its TTL is 0, while being added to the passive list of the receiving nodes.

If P fails to join because of connectivity issues, it decrements the TTL and tries another starting node. This is repeated until a TTL of zero reuses the connection in the case of NATed hosts.

Once the first links have been established, P then needs to increase its active list size to A by connecting to more nodes. This is accomplished by ordering the subscriber list by RTT, picking the nearest nodes, and sending them NEIGHBOR requests. A neighbor request may be accepted with a NEIGHBOR message or rejected with a DISCONNECT message.

Upon receiving a NEIGHBOR request a node Q evaluates it with the following criteria:

  • If the size of Q's active list is less than A, it accepts the new node.
  • If P does not have enough active links (fewer than C_rand, as specified in the message), Q accepts P as a random neighbor.
  • Otherwise Q takes an RTT measurement to P. If P is closer than any of Q's near neighbors by a factor of alpha, then Q evicts that near neighbor (provided Q has enough active links) and accepts P as a new near neighbor.
  • Otherwise the request is rejected.

Note that during joins, the size of the active list for some nodes may end up being larger than A. Similarly, P may end up with fewer links than A after an initial join. This follows [3] and tries to minimize fluttering in joins, leaving the active list pruning for the stabilization period of the protocol.

Leaving the Overlay

In order to unsubscribe, the node can just leave the overlay by sending DISCONNECT messages to its active neighbors. References to the node in the various passive lists scattered across the overlay will be lazily pruned over time by the passive view management component of the protocol.

In order to facilitate fast clean up of departing nodes, we can also introduce a LEAVE message that eagerly propagates across the network. A node that wants to unsubscribe from the topic, emits a LEAVE to its active list neighbors in place of DISCONNECT. Upon receiving a LEAVE, a node removes the node from its active list and passive lists. If the node was removed from one of the lists or if the TTL is greater than zero, then the LEAVE is propagated further across the active list links. This will ensure a random diffusion through the network that would clean most of the active lists eagerly, at the cost of some bandwidth.

Active View Management

The active list is generally managed reactively: failures are detected by TCP, either when a message is sent or when the connection is detected as closed.

In addition to the reactive management strategy, the active list has stabilization and optimization components that run periodically with a randomized timer, and also serve as failure detectors. The stabilization component attempts to prune active lists that are larger than A, say because of a slew of recent joins, and grow active lists that are smaller than A because of some failures or previous inability to neighbor with enough nodes.

When a node detects that its active list is too large, it queries the neighbors for their active lists.

  • If some neighbors have more than C_rand random neighbors, then links can be dropped with a DISCONNECT message until the size of the active list is A again.
  • If the list is still too large, then it checks the active lists for neighbors that are connected with each other. In this case, one of the links can be dropped with a DISCONNECT message.
  • If the list is still too large, then we cannot safely drop connections and it will remain that large until the next stabilization period.

When a node detects that its active list is too small, then it tries to open more connections by picking nodes from its passive list, as described in the Join section.

The optimization component tries to optimize the C_near connections by replacing links with closer nodes. In order to do so, it takes RTT samples from active list nodes and maintains a smoothed running average. The neighbors are reordered by RTT and the closest ones are considered the near nodes. It then checks the RTT samples of passive list nodes and selects the closest node. If the RTT is smaller by a factor of alpha than a near neighbor and it has enough random neighbors, then it disconnects and adopts the new node from the passive list as a neighbor.

Passive View Management

The passive list is managed cyclically, as per [2]. Periodically, with a randomized timer, each node performs a passive list shuffle with one of its active neighbors. The purpose of the shuffle is to update the passive lists of the nodes involved. The node that initiates the shuffle creates an exchange list that contains its id, k_a peers from its active list and k_p peers from its passive list, where k_a and k_p are protocol parameters (unspecified in [2]). It then sends a SHUFFLE request to a random neighbor, which is propagated with a random walk with an associated TTL. If the TTL is greater than 0 and the number of nodes in the receiver's active list is greater than 1, then it propagates the request further. Otherwise, it selects nodes from its passive list at random, sends back a SHUFFLEREPLY and replaces them with the shuffle contents. The originating node receiving the SHUFFLEREPLY also replaces nodes in its passive list with the contents of the message. Care should be taken for issues with transitive connectivity due to NAT. If a node cannot connect to the originating node for a SHUFFLEREPLY, then it should not perform the shuffle. Similarly, the originating node could time out waiting for a shuffle reply and try again with a lower TTL, until a TTL of zero reuses the connection in the case of NATed hosts.

In addition to shuffling, proximity awareness and leave cleanup requires that we compute RTT samples and check connectivity to nodes in the passive list. Periodically, the node selects some nodes from its passive list at random and tries to open a connection if it doesn't already have one. It then checks that the peer is still subscribed to the overlay. If the connection attempt is successful and the node is still subscribed to the topic, it then updates the RTT estimate for the peer in the list with a ping. Otherwise, it removes it from the passive list for cleanup.

Broadcast Protocol

Broadcast State

Once it has joined the overlay, the node starts its main broadcast logic loop. The loop receives messages to publish from the application, messages published from other nodes, and notifications from the management protocol about new active neighbors and disconnections.

The state of the broadcast loop consists of two sets of peers, the eager and lazy lists, with the eager list initialized to the initial neighbors and the lazy list empty. The loop also maintains a time-based cache of recent messages, together with a queue of lazy message notifications. In addition to the cache, it maintains a list of missing messages known by lazy gossip but not yet received through the multicast tree.

Message Propagation and Multicast Tree Construction

When a node publishes a message, it broadcasts a GOSSIP message with a hopcount of 1 to all its eager peers, adds the message to the cache, and adds the message id to the lazy notification queue.

When a node receives a GOSSIP message from a neighbor, first it checks its cache to see if it has already seen this message. If the message is in the cache, it prunes the edge of the multicast graph by sending a PRUNE message to the peer, removing the peer from the eager list, and adding it to the lazy list.

If the node hasn't seen the message before, it delivers the message to the application and then adds the peer to the eager list and proceeds to broadcast. The hopcount is incremented and then the node forwards it to its eager peers, excluding the source. It also adds the message to the cache, and pushes the message id to the lazy notification queue.

The loop runs a short periodic timer, with a period on the order of 0.1s, for gossiping message summaries. Every time it fires, the node flushes the lazy notification queue, sending all the recently received message ids in an IHAVE message to its lazy peers. The IHAVE notifications summarize recent messages the node has seen that have not propagated through the eager links.

Multicast Tree Repair

When a failure occurs, at least one multicast tree branch is affected, as messages are not transmitted by eager push. The IHAVE messages exchanged through lazy gossip are used both to recover missing messages but also to provide a quick mechanism to heal the multicast tree.

When a node receives an IHAVE message for unknown messages, it simply marks the messages as missing and places them in the missing message queue. It then starts a timer and waits to receive the message via eager push before the timer expires. The timer duration is a protocol parameter that should be configured considering the diameter of the overlay and the target recovery latency. A more realistic implementation uses a persistent heartbeat timer to check for missing messages periodically, marking a message on the first touch and considering it missing on the second.

When a message is detected as missing, the node selects the first IHAVE announcement it has seen for the missing message and sends a GRAFT message to the peer, piggybacking other missing messages. The GRAFT message serves a dual purpose: it triggers the transmission of the missing messages and at the same time adds the link to the multicast tree, healing it.

Upon receiving a GRAFT message, a node adds the peer to the eager list and transmits the missing messages from its cache as GOSSIP. Note that the message is not removed from the missing list until it is received as a response to a GRAFT. If the message has not been received by the next timer tick, say because the grafted peer has also failed, then another graft is attempted and so on, until enough ticks have elapsed to consider the message lost.

Multicast Tree Optimization

The multicast tree is constructed lazily, following the path of the first published message from some source. Therefore, the tree may not directly take advantage of new paths that may appear in the overlay as a result of new nodes/links. The overlay may also be suboptimal for all but the first source.

To overcome these limitations and adapt the overlay to multiple sources, the authors in [1] propose an optimization: every time a message is received, it is checked against the missing list and the hopcount of messages in that list. If the eager transmission hopcount exceeds the hopcount of the lazy transmission, then the tree is a candidate for optimization. If the tree were optimal, the hopcount for messages received by eager push would be less than or equal to the hopcount of messages propagated by lazy push. Thus the eager link can be replaced by the lazy link, resulting in a shorter tree.

To promote stability in the tree, the authors in [1] suggest that this optimization be performed only if the difference in hopcount is greater than a threshold value. This value is a design parameter that affects the overall stability of the tree: the lower the value, the more eagerly the protocol will try to optimize the tree by exchanging links. But if the threshold value is too low, it may result in fluttering with multiple active sources. Thus, the value should be higher, closer to the diameter of the tree, to avoid constant changes.

Active View Changes

The active peer list is maintained by the Membership Management protocol: nodes may be removed because of failure or overlay reorganization, and new nodes may be added to the list because of new connections. The Membership Management protocol communicates these changes to the broadcast loop via NeighborUp and NeighborDown notifications.

When a new node is added to the active list, the broadcast loop receives a NeighborUp notification; it simply adds the node to the eager peer list. On the other hand, when a node is removed with a NeighborDown notification, the loop has to consider if the node was an eager or lazy peer. If the node was a lazy peer, it doesn't need to do anything as the departure does not affect the multicast tree. If the node was an eager peer however, the loss of that edge may result in a disconnected tree.

There are two strategies in reaction to the loss of an eager peer. The first one is to do nothing and wait for lazy push to repair the tree naturally with IHAVE messages in the next message broadcast. This might result in delays propagating the next few messages, but it is advocated by the authors in [1]. An alternative is to eagerly repair the tree by promoting lazy peers to eager with empty GRAFT messages and letting the protocol prune duplicate paths naturally with PRUNE messages in the next message transmission. This may have some bandwidth cost, but it is perhaps more appropriate for applications that value latency minimization, which is the case for many IPFS applications.

Protocol Messages

A quick summary of referenced protocol messages and their payload. All messages are assumed to be enclosed in a suitable envelope and have a source and monotonic sequence id.

;; Initial node discovery
GETNODES {}

NODES {
 peers []peer.ID
 ttl int
}

;; Topic querying (membership check for passive view management)
GETTOPICS {}

TOPICS {
 topics []topic.ID
}

;; Membership Management protocol
JOIN {
 peer peer.ID
 ttl int
}

FORWARDJOIN {
 peer peer.ID
 ttl int
}

NEIGHBOR {
 peers []peer.ID
}

DISCONNECT {}

LEAVE {
 source peer.ID
 ttl int
}

SHUFFLE {
 peer peer.ID
 peers []peer.ID
 ttl int
}

SHUFFLEREPLY {
 peers []peer.ID
}

;; Broadcast protocol
GOSSIP {
 source peer.ID
 hops int
 msg []bytes
}

IHAVE {
 summary []MessageSummary
}

MessageSummary {
 id message.ID
 hops int
}

PRUNE {}

GRAFT {
 msgs []message.ID
}

Differences from Plumtree/HyParView

There are some noteworthy differences between the protocol described here and the published Plumtree/HyParView protocols. There may be further differences in minor details, but this document is written from a practical implementer's point of view.

Membership Management protocol:

  • The node views are managed with proximity awareness. The HyParView protocol has no provisions for proximity; these come from GoCast's implementation of proximity-aware overlays. Note, however, that we don't use UDP for RTT measurements, and we increase C_rand to improve fault tolerance at the price of some proximity optimization.
  • Joining nodes don't get all A connections by kicking out extant nodes, as this would result in overlay instability during periods of high churn. Instead, nodes ensure that the first few links are created even if they oversubscribe their fanout, but they don't go out of their way to create remaining links beyond the necessary C_rand links. Nodes later bring the active list into balance with a stabilization protocol. Also noteworthy is that only C_rand JOIN messages are propagated with a random walk; the remaining joins are considered near joins and are handled with normal NEIGHBOR requests. In short, the join protocol is reworked, with the influence of GoCast.
  • There is no active view stabilization/optimization protocol in HyParView. This is very much influenced from GoCast, where the protocol allows oversubscribing and later drops extraneous connections and replaces nodes for proximity optimization.
  • NEIGHBOR messages play a dual role in the proposed protocol implementation, as they can be used for establishing active links and retrieving membership lists.
  • There are no connectivity checks in HyParView, nor retries with reduced TTLs, but these are incredibly important in a world full of NATs.
  • There is no LEAVE provision in HyParView.

Broadcast protocol:

  • IHAVE messages are aggregated and lazily pushed via a background timer. Plumtree eagerly pushes IHAVE messages, which is wasteful and loses the opportunity for aggregation. The authors do suggest lazy aggregation as a possible optimization nonetheless.
  • GRAFT messages similarly aggregate multiple message requests.
  • Missing messages and overlay repair are managed by a single background timer instead of creating timers left and right for every missing message; that's impractical from an implementation point of view, at least in Go.
  • There is no provision for eager overlay repair on NeighborDown messages in Plumtree.

gossipsub v1.0 (OLD): An extensible baseline pubsub protocol

DISCLAIMER: This is the original specification, please refer to gossipsub-v1.0 from now on

Lifecycle Stage | Maturity       | Status | Latest Revision
3A              | Recommendation | Active | r1, 2018-08-29

Authors: @vyzo

Interest Group: @yusefnapora, @raulk, @whyrusleeping, @Stebalien, @jamesray1, @vasco-santos, @daviddias, @yiannisbot

See the lifecycle document for context about the maturity level and spec status.


This is the specification for an extensible baseline pubsub protocol, based on randomized topic meshes and gossip. It is a general purpose pubsub protocol with moderate amplification factors and good scaling properties. The protocol is designed to be extensible by more specialized routers, which may add protocol messages and gossip in order to provide behaviour optimized for specific application profiles.

Context - In the beginning was floodsub

The initial pubsub experiment in libp2p was floodsub. It implements pubsub in the most basic manner, with two defining aspects:

  • ambient peer discovery; and
  • most basic routing: flooding.

Ambient Peer Discovery

With ambient peer discovery, the function is pushed outside the scope of the protocol. Instead, the mechanism for discovering peers is provided for by the environment. In practice, this can be embodied by DHT walks, rendezvous points, etc. This protocol relies on the ambient connection events produced by such mechanisms. Whenever a new peer is connected, the protocol checks to see if the peer implements floodsub and/or gossipsub, and if so, it sends it a hello packet that announces the topics that it is currently subscribing to.

This allows the peer to maintain soft overlays for all topics of interest. The overlay is maintained by exchanging subscription control messages whenever there is a change in the topic list. The subscription messages are not propagated further, so each peer maintains a topic view of its direct peers only. Whenever a peer disconnects, it is removed from the overlay.

Ambient peer discovery can be driven by arbitrary external means, which allows orthogonal development and no external dependencies for the protocol implementation.

There are a couple of options we are exploring as canonical approaches for the discovery driver:

  • DHT rendezvous using provider records; peers in the topic announce a provider record named after the topic.
  • Rendezvous through known or dynamically discovered rendezvous points.

Flood routing

With flooding, routing is almost trivial: for each incoming message, forward to all known peers in the topic. There is a bit of logic, as the router maintains a timed cache of previous messages, so that seen messages are not further forwarded. It also never forwards a message back to the source or the peer that forwarded the message.

Retrospective

Evaluating floodsub as a viable pubsub protocol reveals the following highly desirable properties:

  • it is straightforward to implement.
  • it minimizes latency; messages are delivered across minimum latency paths, modulo overlay connectivity.
  • it is highly robust; there is very little maintenance logic or state.

The problem however is that messages don't just follow the minimum latency paths; they follow all edges, thus creating a flood. The outbound degree of the network is unbounded, whereas we want it to be bounded in order to reduce bandwidth requirements and increase decentralization and scalability. In other words, this unbounded outbound degree creates a problem for individual densely connected nodes, as they may have a large number of connected peers and cannot afford the bandwidth to forward all these pubsub messages. Similarly, the amplification factor is only bounded by the sum of degrees of all nodes in the overlay, which creates a scaling problem for densely connected overlays at large.

Proposed alternatives - Controlling the flood

In order to scale pubsub without excessive bandwidth waste or peer overload, we need a router that bounds the degree of each peer and globally controls the amplification factor.

randomsub: A random message router

Let's first consider the simplest bounded floodsub variant, which we call randomsub. In this construction, the router is still stateless, apart from a list of known peers in the topic. But instead of forwarding messages to all peers, it forwards to a random subset of up to D peers, where D is the desired degree of the network.
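In pseudo-go, the per-message peer selection might look like this (a sketch; randomSubset is reused by later sketches in this document):

// randomSubset picks up to d peers uniformly at random.
func randomSubset(peers []PeerID, d int) []PeerID {
	idx := rand.Perm(len(peers))
	if len(idx) > d {
		idx = idx[:d]
	}
	out := make([]PeerID, 0, len(idx))
	for _, i := range idx {
		out = append(out, peers[i])
	}
	return out
}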

The problem with this construction is that the message propagation patterns are non-deterministic. This results in extreme message route instability, manifesting as message reordering and varying timing patterns, which is an undesirable property for many applications.

meshsub: An overlay mesh router

Nonetheless, the idea of limiting the flow of messages to a random subset of peers is solid. But instead of randomly selecting peers on a per message basis, we can form an overlay mesh where each peer forwards to a subset of its peers on a stable basis. We construct a router in this fashion, dubbed meshsub.

Each peer maintains its own view of the mesh for each topic, which is a list of bidirectional links to other peers. That is, in steady state, whenever a peer A is in the mesh of peer B, then peer B is also in the mesh of peer A.

The overlay is initially constructed in a random fashion. Whenever a peer joins a topic, then it selects D peers (in the topic) at random and adds them to the mesh, notifying them with a control message. When it leaves the topic, it notifies its peers and forgets the mesh for the topic.

The mesh is maintained with the following periodic stabilization algorithm:

at each peer:
  loop:
    if |peers| < D_low:
       select D - |peers| non-mesh peers at random and add them to the mesh
    if |peers| > D_high:
       select |peers| - D mesh peers at random and remove them from the mesh
    sleep t

The parameters of the algorithm are D which is the target degree, and two relaxed degree parameters D_low and D_high which represent admissible mesh degree bounds.

gossipsub: The gossiping mesh router

The meshsub router offers a baseline construction with good amplification control properties, which we augment with gossip about message flow. The gossip is emitted to random subsets of peers not in the mesh, similar to randomsub, and it allows us to propagate metadata about message flow throughout the network. The metadata can be arbitrary, but as a baseline we include the message ids of seen messages in the last few seconds. The messages are cached, so that peers receiving the gossip can request them for transmission with a control message.

The router can use this metadata to improve the mesh, for instance an episub router built on top of gossipsub can create epidemic broadcast trees. Beyond that, the metadata can restart message transmission at different points in the overlay to rectify downstream message loss. Or it can simply jump hops opportunistically and accelerate message transmission for peers who are at some distance in the mesh.

Essentially, gossipsub is a blend of meshsub for data and randomsub for mesh metadata. It provides bounded degree and amplification factor with the meshsub construction and augments it using gossip propagation of metadata with the randomsub technique.

Protocol Architecture - Gossipsub

We can now provide a specification of the pubsub protocol by sketching out the router implementation. The router is backwards compatible with floodsub, as it accepts floodsub peers and behaves like floodsub towards them.

For a video presentation and visualization of Gossipsub, watch Scalable PubSub with GossipSub - Dimitris Vyzovitis from the IPFS London Hack Week of 2018 Q4.

Control messages

The protocol defines four control messages:

  • GRAFT: graft a mesh link; this notifies the peer that it has been added to the local mesh view.
  • PRUNE: prune a mesh link; this notifies the peer that it has been removed from the local mesh view.
  • IHAVE: gossip; this notifies the peer that the following messages were recently seen and are available on request.
  • IWANT: request transmission of messages announced in an IHAVE message.

Router state

The router maintains the following state:

  • peers: a set of all known peers; peers.gossipsub denotes the gossipsub peers while peers.floodsub denotes the floodsub peers.
  • mesh: the overlay meshes as a map of topics to lists of peers.
  • fanout: the peers to which we are publishing without topic membership, as a map of topics to lists of peers.
  • seen: this is the timed message ID cache, which tracks seen messages.
  • mcache: a message cache that contains the messages for the last few heartbeat ticks.

The message cache is a data structure that stores windows of message IDs and the corresponding messages. It supports the following operations (a sketch follows the list):

  • mcache.put(m): adds a message to the current window and the cache.
  • mcache.get(id): retrieves a message from the cache by its ID, if it is still present.
  • mcache.window(): retrieves the message IDs for messages in the current history window.
  • mcache.shift(): shifts the current window, discarding messages older than the history length of the cache.
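In pseudo-go, a minimal sliding-window cache along these lines might look as follows; this is a sketch with hypothetical types, and real implementations also index entries by topic for gossip emission:

type cacheEntry struct {
	id, topic string
}

type MessageCache struct {
	msgs    map[string]*Message // message ID -> message
	history [][]cacheEntry      // history[0] is the current window;
	                            // initialized with historyLength empty windows
}

func (mc *MessageCache) Put(m *Message) {
	mc.msgs[m.ID] = m
	mc.history[0] = append(mc.history[0], cacheEntry{m.ID, m.Topic})
}

func (mc *MessageCache) Get(id string) (*Message, bool) {
	m, ok := mc.msgs[id]
	return m, ok
}

func (mc *MessageCache) Window() []string {
	ids := make([]string, 0, len(mc.history[0]))
	for _, e := range mc.history[0] {
		ids = append(ids, e.id)
	}
	return ids
}

func (mc *MessageCache) Shift() {
	// discard messages that fall out of the history window
	for _, e := range mc.history[len(mc.history)-1] {
		delete(mc.msgs, e.id)
	}
	copy(mc.history[1:], mc.history[:len(mc.history)-1])
	mc.history[0] = nil
}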

The seen cache is the flow control mechanism. It tracks the message IDs of seen messages for the last two minutes. It is separate from mcache for implementation reasons in Go (the seen cache is inherited from the pubsub framework), but they could be the same data structure. Note that the two minute cache interval is non-normative; a router could use a different value, chosen to approximate the propagation delay in the overlay with some healthy margin.

Topic membership

Topic membership is controlled by two operations supported by the router, as part of the pubsub api:

  • On JOIN(topic) the router joins the topic. If it already has D or more peers in the fanout of the topic, it moves D of them to mesh[topic] and notifies them with a GRAFT(topic) control message. Otherwise, if the fanout holds fewer than D peers (say x, possibly zero, including the case where the topic is not in the fanout), it still adds those peers (if any) to mesh[topic] as above, selects the remaining D-x peers from peers.gossipsub[topic], and likewise adds them to mesh[topic] and notifies them with a GRAFT(topic) control message; see the sketch after this list.
  • On LEAVE(topic) the router leaves the topic. It notifies the peers in mesh[topic] with a PRUNE(topic) message and forgets mesh[topic].
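In pseudo-go, JOIN might be sketched as follows; the Router fields and sendGraft are hypothetical, D is the target degree, and randomSubset is defined in the randomsub sketch above:

func (r *Router) Join(topic string) {
	peers := r.fanout[topic] // reuse fanout peers when available
	if len(peers) > r.D {
		peers = peers[:r.D]
	}
	if len(peers) < r.D {
		// select the remaining peers from the topic, excluding those already chosen
		chosen := make(map[PeerID]bool)
		for _, p := range peers {
			chosen[p] = true
		}
		var candidates []PeerID
		for _, p := range r.gossipsubPeers[topic] {
			if !chosen[p] {
				candidates = append(candidates, p)
			}
		}
		peers = append(peers, randomSubset(candidates, r.D-len(peers))...)
	}
	delete(r.fanout, topic)
	r.mesh[topic] = peers
	for _, p := range peers {
		r.sendGraft(p, topic) // GRAFT(topic) control message
	}
}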

Note that the router can publish messages without topic membership. In order to maintain stable routes in that case, it maintains a list of peers for each topic it has published in the fanout map. If the router does not publish any messages of a topic for some time, then the fanout peers for that topic are forgotten, so this is soft state.

Also note that as part of the pubsub api, the peer emits SUBSCRIBE and UNSUBSCRIBE control messages to all its peers whenever it joins or leaves a topic. This is provided by the ambient peer discovery mechanism and is nominally not part of the router; a standalone implementation would have to implement those control messages.

Message processing

Upon receiving a message, the router first processes the payload of the message. If it contains a valid message that has not been previously seen, then it publishes the message:

  • It forwards the message to every peer in peers.floodsub[topic], provided it's not the source of the message.
  • It forwards the message to every peer in mesh[topic], provided it's not the source of the message.

After processing the payload, it then processes the control messages in the envelope:

  • On GRAFT(topic) it adds the peer to mesh[topic] if it is subscribed to the topic. If it is not subscribed, it responds with a PRUNE(topic) control message.
  • On PRUNE(topic) it removes the peer from mesh[topic].
  • On IHAVE(ids) it checks the seen set and requests unknown messages with an IWANT message.
  • On IWANT(ids) it forwards all requested messages that are present in mcache to the requesting peer.

When the router publishes a message that originates from the router itself (i.e. from the application layer), it proceeds similarly to the payload processing above (a sketch follows the list):

  • It forwards the message to every peer in peers.floodsub[topic].
  • If it is subscribed to the topic, then it must have a set of peers in mesh[topic], to which the message is forwarded.
  • If it is not subscribed to the topic, it then forwards the message to the peers in fanout[topic]. If this set is empty, it chooses D peers from peers.gossipsub[topic] to become the new fanout[topic] peers and forwards to them.
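In pseudo-go (a sketch; the router fields are hypothetical and randomSubset is defined in the randomsub sketch above):

func (r *Router) Publish(topic string, m *Message) {
	r.seen[m.ID] = struct{}{}
	r.mcache.Put(m)

	// floodsub peers always receive the message
	for _, p := range r.floodsubPeers[topic] {
		r.send(p, m)
	}

	peers, subscribed := r.mesh[topic]
	if !subscribed {
		if len(r.fanout[topic]) == 0 {
			// no fanout peers yet: choose D gossipsub peers in the topic
			r.fanout[topic] = randomSubset(r.gossipsubPeers[topic], r.D)
		}
		peers = r.fanout[topic]
	}
	for _, p := range peers {
		r.send(p, m)
	}
}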

Heartbeat

The router periodically runs a heartbeat procedure, which is responsible for maintaining the mesh, emitting gossip, and shifting the message cache.

The mesh is maintained exactly as prescribed by meshsub:

for each topic in mesh:
 if |mesh[topic]| < D_low:
   select D - |mesh[topic]| peers from peers.gossipsub[topic] - mesh[topic]
    ; i.e. not including those peers that are already in the topic mesh.
   for each new peer:
     add peer to mesh[topic]
     emit GRAFT(topic) control message to peer

 if |mesh[topic]| > D_high:
   select |mesh[topic]| - D peers from mesh[topic]
   for each selected peer:
     remove peer from mesh[topic]
     emit PRUNE(topic) control message to peer

The fanout map is maintained by keeping track of the last published time for each topic:

for each topic in fanout:
  if time since last published > ttl
    remove topic from fanout
  else if |fanout[topic]| < D
    select D - |fanout[topic]| peers from peers.gossipsub[topic] - fanout[topic]
    add the peers to fanout[topic]

Gossip is emitted by selecting peers for each topic that are not already part of the mesh:

for each topic in mesh+fanout:
  let mids be mcache.window[topic]
  if mids is not empty:
    select D peers from peers.gossipsub[topic] not in mesh[topic] or fanout[topic]
    for each peer
      emit IHAVE(mids)

shift the mcache

Note that we used the same parameter D as the target degree for gossip for simplicity, but this is not normative. A separate parameter D_lazy can be used to explicitly control the gossip propagation factor, which allows for tuning the tradeoff between eager and lazy transmission of messages.

Control message piggybacking

Gossip and other control messages do not have to be transmitted on their own message. Instead, they can be coalesced and piggybacked on any other message in the regular flow, for any topic. This can lead to message rate reduction whenever there is some correlated flow between topics, and can be significant for densely connected peers.

For piggyback implementation details, consult the Go implementation.

Protobuf

The protocol extends the existing RPC message structure with a new field, control. This is an instance of ControlMessage which may contain one or more control messages. The four control messages are ControlIHave for IHAVE messages, ControlIWant for IWANT messages, ControlGraft for GRAFT messages and ControlPrune for PRUNE messages.

The protobuf is as follows:

syntax = "proto2";

message RPC {
    // ...
	optional ControlMessage control = 3;
}

message ControlMessage {
	repeated ControlIHave ihave = 1;
	repeated ControlIWant iwant = 2;
	repeated ControlGraft graft = 3;
	repeated ControlPrune prune = 4;
}

message ControlIHave {
	optional string topicID = 1;
	repeated string messageIDs = 2;
}

message ControlIWant {
	repeated string messageIDs = 1;
}

message ControlGraft {
	optional string topicID = 1;
}

message ControlPrune {
	optional string topicID = 1;
}

gossipsub v1.1: Security extensions to improve on attack resilience and bootstrapping

| Lifecycle Stage | Maturity | Status | Latest Revision |
|---|---|---|---|
| 2A | Candidate Recommendation | Active | r8, 2021-12-14 |

Authors: @vyzo

Interest Group: @yusefnapora, @raulk, @whyrusleeping, @Stebalien, @daviddias, @protolambda, @djrtwo, @dryajov, @mpetrunic, @AgeManning, @Nashatyrev, @mhchia

See the lifecycle document for context about maturity level and spec status.


Overview

This document specifies extensions to gossipsub v1.0 intended to improve bootstrapping and protocol attack resistance. The extensions change the algorithms that prescribe local peer behaviour and are fully backwards compatible with v1.0 of the protocol. Peers that implement these extensions advertise v1.1 of the protocol using /meshsub/1.1.0 as the protocol string.

Protocol extensions

Explicit Peering Agreements

The protocol now supports explicit peering agreements between node operators. With explicit peering, the application can specify a list of peers to remain connected to and unconditionally forward messages to each other outside of the vagaries of the peer scoring system and other defensive measures.

For every explicit peer, the router must establish and maintain a connection. The connections are initially established when the router boots, and they are periodically checked for connectivity, with a reconnect if connectivity is lost. The recommended period for connectivity checks is 5 minutes.

Peering agreements are established out of band and are reciprocal. Explicit peers exist outside the mesh: every new valid incoming message is forwarded to them, and incoming RPCs are always accepted from them. It is an error to GRAFT on an explicit peer; such an attempt should be logged and rejected with a PRUNE.

PRUNE Backoff and Peer Exchange

Gossipsub relies on ambient peer discovery in order to find peers within a topic of interest. This puts pressure on the implementation of a scalable peer discovery service that can support the protocol. With Peer Exchange, the protocol can now bootstrap from a small set of nodes, without relying on an external peer discovery service.

Peer Exchange (PX) kicks in when pruning a mesh because of oversubscription. Instead of simply telling the pruned peer to go away, the pruning peer may provide a set of other peers to which the pruned peer can connect in order to reform its mesh (see Peer Scoring below).

In addition, both the pruned and the pruning peer add a backoff period for each other, within which they will not try to regraft. Both sides will immediately prune a GRAFT received within the backoff period and extend the backoff. When a peer tries to regraft too early, the pruning peer may apply a behavioural penalty for the action and penalize the peer through P₇ (see Peer Scoring below).

When unsubscribing from a topic, the backoff period should be allowed to expire before subscribing to the topic again; otherwise a healthy mesh will be difficult to form. A shorter backoff period can be used for the unsubscribe case, allowing faster resubscription.

The recommended duration for the backoff period is 1 minute, while the recommended number of peers to exchange is larger than D_hi so that the pruned peer can reliably form a full mesh. In order to correctly synchronize the two peers, the pruning peer should include the backoff period in the PRUNE message. The peer has to wait the full backoff period before attempting to graft again —plus some slack to account for the offset until the next heartbeat that clears the backoff— otherwise it risks getting its graft rejected and being penalized in its score if it attempts to graft too early.
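In pseudo-go, the backoff bookkeeping on receipt of a PRUNE might look like this (a sketch with hypothetical field names; the wire format carrying the backoff is shown in the next section):

func (r *Router) handlePrune(p PeerID, topic string, backoffSeconds uint64) {
	r.removeFromMesh(topic, p)
	d := time.Duration(backoffSeconds) * time.Second
	if d == 0 {
		d = time.Minute // recommended default when no backoff is given
	}
	if r.backoff[topic] == nil {
		r.backoff[topic] = make(map[PeerID]time.Time)
	}
	r.backoff[topic][p] = time.Now().Add(d)
	// before grafting, the peer must check this expiry and add some
	// heartbeat slack to avoid regrafting marginally too early
}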

In order to implement PX, we extend the PRUNE control message to include an optional set of peers the pruned peer can connect to. This set of peers includes the Peer ID and a signed peer record for each peer exchanged. In order to facilitate the transition to the usage of signed peer records within the libp2p ecosystem, the emitting peer is allowed to omit the signed peer record if it doesn't have one. In this case, the pruned peer will have to rely on the ambient peer discovery service (if set up) to discover the addresses for the peer.

Protobuf

The ControlPrune message is extended with a peers field as follows.

syntax = "proto2";
message ControlPrune {
	optional string topicID = 1;
	repeated PeerInfo peers = 2; // gossipsub v1.1 PX
	optional uint64 backoff = 3; // gossipsub v1.1 backoff time (in seconds)
}

message PeerInfo {
	optional bytes peerID = 1;
	optional bytes signedPeerRecord = 2;
}

Flood Publishing

In gossipsub v1.0, peers publish new messages to the members of their mesh if they are subscribed to the topic to which they're publishing. A peer can also publish to topics they are not subscribed to, in which case they will select peers from their fanout map.

In gossipsub v1.1 publishing is (optionally) done by publishing the message to all connected peers with a score above a publish threshold (see Peer Scoring below). This applies regardless of whether the publisher is subscribed to the topic. With flood publishing enabled, the mesh is used when propagating messages from other peers, but a peer's own messages will always be published to all known peers in the topic.
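In pseudo-go (a sketch; score and PublishThreshold are defined under Peer Scoring below, and the router fields are hypothetical):

func (r *Router) floodPublish(topic string, m *Message) {
	for _, p := range r.gossipsubPeers[topic] {
		if r.score(p) >= r.PublishThreshold {
			r.send(p, m) // own messages go to every known topic peer above threshold
		}
	}
}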

This behaviour is prescribed to counter eclipse attacks and to ensure that a newly published message from an honest node reaches all connected honest nodes and gets out to the network at large. When flood publishing is in use, there is no point in utilizing a fanout map or emitting gossip when the peer is a pure publisher not subscribed to the topic.

This behaviour also reduces message propagation latency as the message is injected to more points in the network.

Adaptive Gossip Dissemination

In gossipsub v1.0 gossip is emitted to a fixed number of peers, as specified by the D_lazy parameter. In gossipsub v1.1 the dissemination of gossip is adaptive; instead of emitting gossip to a fixed number of peers, we emit gossip to a percentage of our peers with a minimum of D_lazy peers.

The parameter controlling the emission of gossip is called the gossip factor. When a node wants to emit gossip during the heartbeat, it first selects all peers with a peer score above a gossip threshold (see Peer Scoring below). From these peers, it randomly selects a gossip-factor fraction, with a minimum of D_lazy peers, and emits gossip to them.

The recommended value for the gossip factor is 0.25, which with the default of 3 rounds of gossip per message ensures that each peer has at least a 50% chance of receiving gossip about a message. More specifically, for 3 rounds of gossip, the probability of a peer not receiving gossip about a fresh message is (3/4)³=27/64=0.421875. So each peer receives gossip about a fresh message with a 0.578125 probability.
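In pseudo-go, the selection reduces to the following (a sketch with assumed names; randomSubset as defined earlier):

func gossipTargets(eligible []PeerID, dLazy int, gossipFactor float64) []PeerID {
	target := int(gossipFactor * float64(len(eligible)))
	if target < dLazy {
		target = dLazy // never gossip to fewer than D_lazy peers
	}
	return randomSubset(eligible, target)
}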

This behaviour is prescribed to counter sybil attacks and ensures that a message from an honest node propagates in the network with high probability.

Outbound Mesh Quotas

In gossipsub v1.0 mesh peers are randomly selected, without any weight given to the direction of the connection. In contrast, gossipsub v1.1 implements outbound connection quotas, so that a peer tries to always maintain a number of outbound connections in the mesh.

Specifically, we define a new overlay parameter D_out, which must be set below D_lo and at most D/2, such that:

  • When the peer prunes because of oversubscription, it selects survivor peers under the constraint that at least D_out peers are outbound connections; see also Peer Scoring below.
  • When the peer receives a GRAFT while oversubscribed (with mesh degree at D_hi or higher), it only accepts the new peer in the mesh if it is an outbound connection.
  • During heartbeat maintenance, if the peer already has at least D_lo peers in the mesh but not enough outbound connections, then it selects as many peers as needed to fill the quota and grafts them into the mesh.

This behaviour is prescribed to counter sybil attacks and ensures that a coordinated inbound attack can never fully take over the mesh of a target peer.

Peer Scoring

In gossipsub v1.1 we introduce a peer scoring component: each individual peer maintains a score for other peers. The score is locally computed by each individual peer based on observed behaviour and is not shared. The score is a real value, computed as a weighted mix of parameters, with pluggable application-specific scoring. The score is computed across all (configured) topics with a weighted mix, such that faulty behaviour in one topic percolates to other topics. Furthermore, the score is retained for some period of time when a peer disconnects, so that malicious peers cannot easily reset their score when it drops to negative and well behaving peers don't lose their status because of a disconnection.

The intention is to detect malicious or faulty behaviour and penalize the misbehaving peers with a negative score.

Score Thresholds

The score is plugged into various gossipsub algorithms such that peers with negative scores are removed from the mesh. Peers with a heavily negative score are further penalized or even ignored if the score drops too low.

More specifically, the following thresholds apply (a pseudo-go sketch of how a router might consult them follows the list):

  • 0: the baseline threshold; peers with a score below this threshold are pruned from the mesh during the heartbeat and ignored when looking for peers to graft. Furthermore, no PX information is emitted towards those peers and PX is ignored from them. In addition, when performing PX only peers with non-negative scores are exchanged.
  • GossipThreshold: when a peer's score drops below this threshold, no gossip is emitted towards that peer and gossip from that peer is ignored. This threshold should be negative, such that some information can be propagated to/from mildly negatively scoring peers.
  • PublishThreshold: when a peer's score drops below this threshold, self-published messages are not propagated towards this peer when (flood) publishing. This threshold should be negative, and less than or equal to the gossip threshold.
  • GraylistThreshold: when a peer's score drops below this threshold, the peer is graylisted and its RPCs are ignored. This threshold must be negative, and less than the gossip/publish threshold.
  • AcceptPXThreshold: when a peer sends us PX information with a prune, we only accept it and connect to the supplied peers if the originating peer's score exceeds this threshold. This threshold should be non-negative and for increased security a large positive score attainable only by bootstrappers and other trusted well-connected peers.
  • OpportunisticGraftThreshold: when the median peer score in the mesh drops below this value, the router may select more peers with a score above the median to opportunistically graft on the mesh (see Opportunistic Grafting below). This threshold should be positive, with a relatively small value compared to scores achievable through topic contributions.
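For illustration, a router might consult the thresholds as follows; this is a sketch, with all names assumed:

func (r *Router) canGossipTo(p PeerID) bool {
	return r.score(p) >= r.GossipThreshold
}

func (r *Router) canPublishTo(p PeerID) bool {
	return r.score(p) >= r.PublishThreshold
}

func (r *Router) acceptRPCFrom(p PeerID) bool {
	return r.score(p) >= r.GraylistThreshold // graylisted peers are ignored
}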

Heartbeat Maintenance

The score is checked explicitly during heartbeat maintenance such that:

  • Peers with a negative score are pruned from all meshes.
  • When pruning because of oversubscription, the peer keeps the best D_score scoring peers and selects the remaining peers to keep at random. This protects the mesh from takeover attacks and ensures that the best scoring peers are kept in the mesh. At the same time, some random peers are kept so that the protocol remains responsive to new peers joining the mesh. The selection is done under the constraint that D_out peers are outbound connections; if the scoring plus random selection does not result in enough outbound connections, then the random and lower scoring peers in the selection are replaced with outbound connection peers.
  • When selecting peers to graft because of undersubscription, peers with a negative score are ignored.

Opportunistic Grafting

It may be possible that the router gets stuck with a mesh of poorly performing peers, either due to churn of good peers or because of a successful large scale cold boot or covert flash attack. When this happens, the router will normally react through mesh failure penalties (see The Score Function below), but this reaction time may be slow: the peers selected to replace the negative scoring peers are selected at random among the non-negative scoring peers, which may result in multiple rounds of selections amongst a sybil poisoned pool. Furthermore, the population of sybils may be so large that the sticky mesh failure penalties completely decay before any good peers are selected, thus making sybils re-eligible for grafting.

In order to recover from such disaster scenarios and generally adaptively optimize the mesh over time, gossipsub v1.1 introduces an opportunistic grafting mechanism. Periodically, the router checks the median score of peers in the mesh against the OpportunisticGraftThreshold. If the median score is below the threshold, the router opportunistically grafts (at least) two peers with score above the median into the mesh. This improves an underperforming mesh by introducing good scoring peers that may have been gossiping to us. It also allows the router to get out of sticky disaster situations by replacing sybils attempting an eclipse with peers that have recently forwarded messages through gossip.

The recommended period for opportunistic grafting is 1 minute, while the router should graft 2 peers (with the default parameters) so that it has the opportunity to become a conduit between them and establish a score in the mesh. Nonetheless, the number of peers that are opportunistically grafted is controlled by the application. It may be desirable to graft more peers if the application has configured a larger mesh than the default parameters.

The Score Function

The score function is a weighted mix of parameters, 4 of them per topic and 3 of them globally applicable.

Score(p) = TopicCap(Σtᵢ*(w₁(tᵢ)*P₁(tᵢ) + w₂(tᵢ)*P₂(tᵢ) + w₃(tᵢ)*P₃(tᵢ) + w₃b(tᵢ)*P₃b(tᵢ) + w₄(tᵢ)*P₄(tᵢ))) + w₅*P₅ + w₆*P₆ + w₇*P₇

where tᵢ is the topic weight for each topic where per topic parameters apply.

The parameters are defined as follows:

  • P₁: Time in Mesh for a topic. This is the time a peer has been in the mesh, capped to a small value and mixed with a small positive weight. This is intended to boost peers already in the mesh so that they are not prematurely pruned because of oversubscription.
  • P₂: First Message Deliveries for a topic. This is the number of messages first delivered by the peer in the topic, mixed with a positive weight. This is intended to reward peers who first forward a valid message.
  • P₃: Mesh Message Delivery Rate for a topic. This parameter is a threshold for the expected message delivery rate within the mesh in the topic. If the number of deliveries is above the threshold, then the value is 0. If the number is below the threshold, then the value of the parameter is the square of the deficit. This is intended to penalize peers in the mesh who are not delivering the expected number of messages so that they can be removed from the mesh. The parameter is mixed with a negative weight.
  • P₃b: Mesh Message Delivery Failures for a topic. This is a sticky parameter that counts the number of mesh message delivery failures. Whenever a peer is pruned with a negative score, the parameter is augmented by the rate deficit at the time of prune. This is intended to keep history of prunes so that a peer that was pruned because of underdelivery cannot quickly get re-grafted into the mesh. The parameter is mixed with negative weight.
  • P₄: Invalid Messages for a topic. This is the number of invalid messages delivered in the topic. This is intended to penalize peers who transmit invalid messages, according to application-specific validation rules. It is mixed with a negative weight.
  • P₅: Application-Specific score. This is the score component assigned to the peer by the application itself, using application-specific rules. The weight is positive, but the parameter itself has an arbitrary real value, so that the application can signal misbehaviour with a negative score or gate peers before an application-specific handshake is completed.
  • P₆: IP Colocation Factor. This parameter is a threshold for the number of peers using the same IP address. If the number of peers in the same IP exceeds the threshold, then the value is the square of the surplus, otherwise it is 0. This is intended to make it difficult to carry out sybil attacks by using a small number of IPs. The parameter is mixed with a negative weight.
  • P₇: Behavioural Penalty. This parameter captures penalties applied for misbehaviour. The parameter has an associated (decaying) counter, which is explicitly incremented by the router on specific events. The value of the parameter is the square of the counter and is mixed with a negative weight.

The TopicCap function allows the application to specify an optional cap to the contribution to the score across all topics.

Topic Parameter Calculation and Decay

The topic parameters are implemented using counters maintained internally by the router whenever an event of interest occurs. The counters decay periodically so that their values are not continuously increasing and ensure that a large positive or negative score isn't sticky for the lifetime of the peer.

The decay interval is configurable by the application, with shorter intervals resulting in faster decay.

Each decaying parameter can have its own decay factor, which is a configurable parameter that controls how much the parameter will decay during each decay period.

The decay factor is a float in the range (0.0, 1.0) that is multiplied with the current parameter value at each decay interval update. For example, suppose the value for P₂ (First Message Deliveries) is 120, with a decay factor FirstMessageDeliveriesDecay = 0.97. At the decay interval, the value will be updated to 120 * 0.97 == 116.4.

The decay factor and interval together determine the absolute rate of decay for each parameter. With a decay interval of 1 second and a decay factor of 0.97, a parameter will decrease by 3% every second, while a factor of 0.90 would cause it to lose 10% per second, and so on.

P₁: Time in Mesh

In order to compute P₁, the router records the time when the peer is GRAFTed. The time in mesh is calculated lazily during the decay update to avoid a large number of calls to gettimeofday. The parameter value is the division of the time elapsed since the GRAFT with an application configurable quantum.

For example, with a quantum of one second, a peer's P₁ value will be equal to the number of seconds elapsed since they were GRAFTed onto the mesh. With a quantum of 5 minutes, the P₁ value will be the number of 5 minute intervals elapsed since GRAFTing. The P₁ value will be capped to an application configurable maximum.

In pseudo-go:

// topic configuration parameters
var TimeInMeshQuantum time.Duration
var TimeInMeshCap     float64

// lazily updated time in mesh
var meshTime time.Duration

// P₁
p1 := float64(meshTime / TimeInMeshQuantum)
if p1 > TimeInMeshCap {
  p1 = TimeInMeshCap
}
P₂: First Message Deliveries

In order to compute P₂, the router maintains a counter that increments whenever a message is first delivered in the topic by the peer. The parameter has a cap that applies at the time of increment.

In pseudo-go:

// topic configuration parameters
var FirstMessageDeliveriesCap float64

// counter updated every time a first message delivery occurs
var firstMessageDeliveries float64

// counter update
firstMessageDeliveries += 1
if firstMessageDeliveries > FirstMessageDeliveriesCap {
  firstMessageDeliveries = FirstMessageDeliveriesCap
}

// P₂
p2 := firstMessageDeliveries
P₃ and P₃b: Mesh Message Delivery

In order to compute P₃, the router maintains a counter that increments whenever a first or near-first message delivery occurs in the topic by a peer in the mesh. A near-first message delivery is a delivery that occurs while the first-received copy of the message is still being validated, or within a configurable window after validation of the first delivery. The window should be small (on the order of milliseconds) to avoid allowing a mesh peer to build score by simply replaying messages back to the current router. The parameter has a cap that applies at the time of increment.

In order to avoid triggering the penalty too early, the parameter has an activation window. This is a configurable value that is the time that the peer must have been in the mesh before the parameter applies.

In pseudo-go:

// topic configuration parameters
var MeshMessageDeliveriesCap, MeshMessageDeliveriesThreshold     float64
var MeshMessageDeliveriesWindow, MeshMessageDeliveriesActivation time.Duration

// time in mesh, lazily updated
var meshTime time.Duration

// counter updated every time a first or near-first message delivery occurs by a mesh peer
var meshMessageDeliveries float64

// counter update
meshMessageDeliveries += 1
if meshMessageDeliveries > MeshMessageDeliveriesCap {
  meshMessageDeliveries = MeshMessageDeliveriesCap
}

// calculation of P₃
var deficit float64
if meshTime > MeshMessageDeliveriesActivation && meshMessageDeliveries < MeshMessageDeliveriesThreshold {
  deficit = MeshMessageDeliveriesThreshold - meshMessageDeliveries
}

p3 := deficit * deficit

In order to calculate P₃b, the router maintains a counter that is updated whenever the peer is pruned with an active deficit in message delivery. The parameter is uncapped.

In pseudo-go:

// counter updated at prune time
var meshFailurePenalty float64

// counter update
if meshTime > MeshMessageDeliveriesActivation && meshMessageDeliveries < MeshMessageDeliveriesThreshold {
  deficit := MeshMessageDeliveriesThreshold - meshMessageDeliveries
  meshFailurePenalty += deficit * deficit
}
}

// P₃b
p3b := meshFailurePenalty
P₄: Invalid Messages

In order to compute P₄, the router maintains a counter that increments whenever a message fails validation. The value of the parameter is the square of the counter, which is uncapped.

In pseudo-go:

// counter updated every time a message fails validation
var invalidMessageDeliveries float64

// counter update
invalidMessageDeliveries += 1

// P₄
p4 := invalidMessageDeliveries * invalidMessageDeliveries
Parameter Decay

The counters associated with P₂, P₃, P₃b, and P₄ decay periodically by multiplying with a configurable decay factor. When the value drops below a threshold it is considered zero.

In pseudo-go:

// decay factors
var FirstMessageDeliveriesDecay, MeshMessageDeliveriesDecay, MeshFailurePenaltyDecay, InvalidMessageDeliveriesDecay float64

// 0-threshold
var DecayToZero float64

// periodic decay of counters
firstMessageDeliveries *= FirstMessageDeliveriesDecay
if firstMessageDeliveries < DecayToZero {
  firstMessageDeliveries = 0
}

meshMessageDeliveries *= MeshMessageDeliveriesDecay
if meshMessageDeliveries < DecayToZero {
  meshMessageDeliveries = 0
}

meshFailurePenalty *= MeshFailurePenaltyDecay
if meshFailurePenalty < DecayToZero {
  meshFailurePenalty = 0
}

invalidMessageDeliveries *= InvalidMessageDeliveriesDecay
if invalidMessageDeliveries < DecayToZero {
  invalidMessageDeliveries = 0
}

Guidelines for Tuning the Scoring Function

TBD: We are currently developing multiple types of simulations that will inform us on how best to recommend tuning the scoring function. We will update this section once that work is complete.

Extended Validators

The pubsub subsystem incorporates application-specific message validators so that the application can signal invalid message delivery and trigger the P₄ penalty. However, there are circumstances where a message should not be delivered to the application or forwarded to the network, yet should not trigger the P₄ penalty either. Known use cases include duplicate beacon messages, or an application that is syncing its blockchain and is thus unable to ascertain the validity of new messages.

In order to address this situation, all gossipsub v1.1 implementations must support extended validators with an enumerated decision interface. The outcome of extended validation can be at a minimum one of three things:

  • Accept message; in this case the message is considered valid, and it should be delivered and forwarded to the network.
  • Reject message; in this case the message is considered invalid, and it should be rejected and trigger the P₄ penalty.
  • Ignore message; in this case the message is neither delivered nor forwarded to the network, but the router does not trigger the P₄ penalty.

Overview of New Parameters

The extensions that make up gossipsub v1.1 introduce several new application configurable parameters. This section summarizes all the new parameters along with a brief description.

The following parameters apply globally:

| Parameter | Type | Description | Reasonable Default |
|---|---|---|---|
| PruneBackoff | Duration | Time after pruning a mesh peer before we consider grafting them again. | 1 minute |
| UnsubscribeBackoff | Duration | Backoff to use when unsubscribing from a topic; the peer should not resubscribe to this topic before it expires. | 10 seconds |
| FloodPublish | Boolean | Whether to enable flood publishing. | true |
| GossipFactor | Float [0.0, 1.0] | Fraction of peers to send gossip to, if we have more than D_lazy available. | 0.25 |
| D_score | Integer | Number of peers to retain by score when pruning because of oversubscription. | 4 or 5 for a D of 6 |
| D_out | Integer | Number of outbound connections to keep in the mesh. Must be less than D_lo and at most D/2. | 2 for a D of 6 |

The remaining parameters apply to Peer Scoring. Because many parameters are interrelated and may be application-specific, reasonable defaults are not shown here. See Guidelines for Tuning the Scoring Function to understand how to tune the parameters to the needs of an application.

The following peer scoring parameters apply globally to all peers and topics:

| Parameter | Type | Description | Constraints |
|---|---|---|---|
| GossipThreshold | Float | No gossip is emitted to peers below this threshold; incoming gossip from them is ignored. | Must be < 0 |
| PublishThreshold | Float | No self-published messages are sent to peers below this threshold. | Must be <= GossipThreshold |
| GraylistThreshold | Float | All RPC messages are ignored from peers below this threshold. | Must be < PublishThreshold |
| AcceptPXThreshold | Float | PX information from peers below this threshold is ignored. | Must be >= 0 |
| OpportunisticGraftThreshold | Float | If the median score in the mesh drops below this threshold, the router may opportunistically graft better scoring peers. | Must be >= 0 |
| DecayInterval | Duration | Interval at which parameter decay is calculated. | |
| DecayToZero | Float | Limit below which we consider a decayed parameter to be "zero". | Should be close to 0.0 |
| RetainScore | Duration | Time to remember peer scores after a peer disconnects. | |

The remaining peer score parameters affect how scores are computed for each peer based on their observed behavior.

Parameters with type Weight are floats that determine how much a score parameter contributes to the overall score for a peer. See The Score Function for details.

There are some parameters that apply to the peer "as a whole", regardless of which topics they are subscribed to:

| Parameter | Type | Description | Constraints |
|---|---|---|---|
| AppSpecificWeight | Weight | Weight of P₅, the application-specific score. | Must be positive; the score values themselves may be negative. |
| IPColocationFactorWeight | Weight | Weight of P₆, the IP colocation score. | Must be negative, to penalize peers with multiple IPs. |
| IPColocationFactorThreshold | Integer | Number of IPs a peer may have before being penalized. | Must be at least 1. Values above the threshold are penalized. |
| BehaviourPenaltyWeight | Weight | Weight of P₇, the behaviour penalty. | Must be negative, to penalize peers for misbehaviour. |
| BehaviourPenaltyDecay | Float | Decay factor for P₇. | Must be between 0 and 1. |

The remaining parameters are applied to a peer's behavior within a single topic. Implementations should be able to accept configurations for multiple topics, keyed by topic ID string. Each topic may be configured with the following params. If a topic is not configured, a peer's behavior in that topic will not contribute to their score. If a peer is in multiple configured topics, each topic will contribute to their total score according to the TopicWeight parameter.

| Parameter | Type | Description | Constraints |
|---|---|---|---|
| TopicWeight | Weight | How much does behavior in this topic contribute to the overall score? | |
| P₁: Time in Mesh | | | |
| TimeInMeshWeight | Weight | Weight of P₁. | Should be a small positive value. |
| TimeInMeshQuantum | Duration | Time a peer must be in mesh to accrue one "point" for P₁. | |
| TimeInMeshCap | Float | Maximum value for P₁. | Should be a small positive value. |
| P₂: First Message Deliveries | | | |
| FirstMessageDeliveriesWeight | Weight | Weight of P₂. | Should be positive, to reward fast peers. |
| FirstMessageDeliveriesDecay | Float | Decay factor for P₂. | |
| FirstMessageDeliveriesCap | Float | Maximum value for P₂. | |
| P₃: Mesh Message Delivery Rate | | | |
| MeshMessageDeliveriesWeight | Weight | Weight of P₃. | Should be negative, to penalize peers below threshold. |
| MeshMessageDeliveriesDecay | Float | Decay factor for P₃. | |
| MeshMessageDeliveriesThreshold | Float | Value for P₃ below which we start penalizing peers. | Should be positive. Value depends on the expected message rate for the topic. |
| MeshMessageDeliveriesCap | Float | Maximum value for P₃. | Must be >= MeshMessageDeliveriesThreshold. |
| MeshMessageDeliveriesActivation | Duration | Time a peer must be in the mesh before we start applying the P₃ score. | |
| MeshMessageDeliveryWindow | Duration | Time after first delivery that is considered "near-first". | Should be small, e.g. 1-5 ms. |
| P₃b: Mesh Message Delivery Failures | | | |
| MeshFailurePenaltyWeight | Weight | Weight of P₃b. | Should be negative, to penalize failed deliveries. |
| MeshFailurePenaltyDecay | Float | Decay factor for P₃b. | |
| P₄: Invalid Messages | | | |
| InvalidMessageDeliveriesWeight | Weight | Weight of P₄. | Should be negative, to penalize invalid messages. |
| InvalidMessageDeliveriesDecay | Float | Decay factor for P₄. | |

Spam Protection Measures

In order to counter spam that elicits responses and consumes resources, the following measures have been taken:

  • GRAFT messages for unknown topics are ignored; in gossipsub v1.0 the router would always respond with a PRUNE, which opens up an avenue for flooding with spam GRAFT messages and consuming resources.
  • IWANT message responses are limited in the number of retransmissions to a given peer; in gossipsub v1.0 the router always responds to IWANT messages when the message is in the cache. In gossipsub v1.1 the router responds a limited number of times to each peer, so that IWANT spam does not cause a significant drain of resources.
  • IHAVE messages are capped to a certain number of IHAVE messages and aggregate number of message IDs advertised per heartbeat, in order to reduce the exposure to floods. If more IHAVE advertisements are received than the limit (or more messages are advertised than the limit), then additional IHAVE messages are ignored.
  • In-flight IWANT requests, sent as a response to an IHAVE advertisement, are probabilistically tracked. For each IHAVE advertisement that elicits an IWANT request, the router tracks a random message ID within the advertised set. If the message is not received (from any peer) within a period of time, then a behavioural penalty is applied to the advertising peer through P₇ (a sketch of this tracking follows the list). This measure helps protect against spam IHAVE floods by quickly flagging and graylisting peers who advertise bogus message IDs and/or do not follow up on the IWANT requests.
  • Invalid message spam, either directly transmitted or as a response to an IHAVE message is penalized by the score function. A peer transmitting lots of spam will quickly get graylisted, reducing the surface of spam-induced computation (eg validation). The application can take further steps and blacklist the peer if the spam persists after the negative score decays.
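In pseudo-go, the promise tracking from the fourth measure might be sketched as follows; the field names and helper methods are assumed:

type promise struct {
	peer     PeerID
	deadline time.Time
}

// trackPromise records one random message ID from an IHAVE advertisement
// that elicited an IWANT request.
func (r *Router) trackPromise(p PeerID, requested []string) {
	if len(requested) == 0 {
		return
	}
	id := requested[rand.Intn(len(requested))]
	r.promises[id] = promise{peer: p, deadline: time.Now().Add(r.promiseTTL)}
}

// checkPromises runs periodically and penalizes peers whose advertised
// messages never arrived (from any peer) in time.
func (r *Router) checkPromises() {
	now := time.Now()
	for id, pr := range r.promises {
		if _, delivered := r.seen[id]; delivered {
			delete(r.promises, id)
		} else if now.After(pr.deadline) {
			r.addBehaviouralPenalty(pr.peer) // feeds the P₇ counter
			delete(r.promises, id)
		}
	}
}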

Recommendations for Network Operators

An important issue to consider when deploying gossipsub is the peer discovery mechanism, which must provide a secure way of discovering new peers. Prior to gossipsub v1.1, operators were required to utilize an external peer discovery mechanism to locate peers participating in particular topics; with gossipsub v1.1 this is now entirely optional and the network can bootstrap purely through a small set of network entry points (bootstrappers) by utilizing Peer Exchange. In other words, gossipsub 1.1 is now self-sufficient in this regard, as long as the node manages to find at least one peer participating in the topic of interest.

In order to successfully bootstrap the network without a discovery service, network operators should:

  • Create and operate a set of stable bootstrapper nodes, whose addresses are known ahead of time by the application.
  • The bootstrappers should be configured without a mesh (ie set D=D_lo=D_hi=D_out=0) and with Peer Exchange enabled, utilizing Signed Peer Records.
  • The application should assign a high application-specific score to the bootstrappers and set AcceptPXThreshold to a high enough value attainable only by the bootstrappers.

In this manner, the bootstrappers act purely as gossip and peer exchange nodes that facilitate the formation and maintenance of the network. Note that the score function is still present in the bootstrappers, which ensures that invalid messages, colocation, and behavioural penalties apply to misbehaving nodes such that they do not receive PX or are advertised to the rest of the network. In addition, network operators may configure the application-specific scoring function such that the bootstrappers enforce further constraints into accepting new nodes (eg protocol handshakes, staked participation, and so on).

It should be emphasized that the security of the peer discovery service affects the ability of the system to bootstrap securely and recover from large-scale attacks. Network operators must take care to ensure that whichever peer discovery mechanism they opt to utilize is resilient to attacks and can always return some honest peers so that connections between honest peers can be established. Furthermore, it is strongly recommended that any external discovery service is augmented by bootstrappers/directory nodes configured with Peer Exchange and high application-specific scores, as outlined above.

gossipsub v1.2: TODO

| Lifecycle Stage | Maturity | Status | Latest Revision |
|---|---|---|---|
| 1A | Working Draft | Active | r1, 2023-07-14 |

Authors: @Nashatyrev, @Menduist

Interest Group: @vyzo, @Nashatyrev, @Menduist

See the lifecycle document for context about maturity level and spec status.

Overview

This document aims to provide a minimal extension to the gossipsub v1.1 protocol.

The proposed extensions are backwards-compatible and aim to enhance the efficiency (minimize amplification/duplicates and decrease message latency) of the gossip mesh networks for larger messages.

In more specific terms, a new control message is introduced: IDONTWANT. It's primarily intended to notify mesh peers that the node already received a message and there is no need to send its duplicate.

Specification

Protocol Id

Nodes that support this Gossipsub extension should additionally advertise the version number 1.2.0. Gossipsub nodes can advertise their own protocol ID prefix; by default this is meshsub, giving the default protocol ID:

  • /meshsub/1.2.0

Parameters

This section lists the configuration parameters that need to be agreed on across clients to avoid peer penalization.

| Parameter | Description | Reasonable Default |
|---|---|---|
| max_idontwant_messages | The maximum number of IDONTWANT messages per heartbeat per peer | ??? |

IDONTWANT Message

Basic scenario

When a peer receives the first instance of a message, it immediately broadcasts (rather than queuing for later piggybacking) an IDONTWANT with the messageId to all its mesh peers. This can be done prior to message validation to further increase the effectiveness of the approach.

On the other side, a node maintains a per-peer dont_send_message_ids set. Upon receiving an IDONTWANT from a peer, the messageId is added to that peer's dont_send_message_ids set. When later relaying that message to the mesh, peers found in dont_send_message_ids MUST be skipped.

Old entries from dont_send_message_ids SHOULD be pruned during heartbeat processing. The prune strategy is outside of the spec scope and can be decided by implementations.
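In pseudo-go, the receiver-side bookkeeping might look like this (a sketch with assumed names; peer state is assumed to be initialized when the peer connects):

type peerState struct {
	dontSendMessageIDs map[string]time.Time // timestamps allow heartbeat pruning
}

func (r *Router) handleIDontWant(p PeerID, msgIDs []string) {
	ps := r.peerState[p]
	for _, id := range msgIDs {
		ps.dontSendMessageIDs[id] = time.Now()
	}
}

// relayTargets skips mesh peers that already announced the message.
func (r *Router) relayTargets(topic, msgID string) []PeerID {
	var out []PeerID
	for _, p := range r.mesh[topic] {
		if _, skip := r.peerState[p].dontSendMessageIDs[msgID]; skip {
			continue
		}
		out = append(out, p)
	}
	return out
}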

The IDONTWANT message is optional for both sender and receiver: the sender MAY choose not to utilize it, and the receiver MAY ignore it. Sending a message after the corresponding IDONTWANT should not be penalized.

IDONTWANT may have a negative effect on small messages, as it can increase overall traffic and CPU load; it is therefore better suited to larger messages. The exact policy for applying IDONTWANT is outside the scope of this spec, and every implementation MAY choose whatever is most appropriate for it. Possible options are to choose a message size threshold and broadcast IDONTWANT on a per-message basis when the size is exceeded, or to simply use IDONTWANT for all messages on selected topics.

To prevent DoS, the number of IDONTWANT control messages is limited to max_idontwant_messages per heartbeat.

Cancelling IWANT

If a node requested a message via IWANT and subsequently receives it from another peer, it MAY try to cancel its pending IWANT request with a corresponding IDONTWANT message. This helps in cases where a peer delays or queues IWANT requests; the IWANT request SHOULD be removed from the queue if it has not been processed yet.

Protobuf Extension

The protobuf messages are identical to those specified in the gossipsub v1.0.0 specification with the following control message modifications:

message RPC {
 // ... see definition in the gossipsub specification
}

message ControlMessage {
    // messages from v1.0
    repeated ControlIDontWant idontwant = 5;
}

message ControlIDontWant {
    repeated bytes messageIDs = 1;
}

gossipsub v1.3: Extensions Control Message

| Lifecycle Stage | Maturity | Status | Latest Revision |
|---|---|---|---|
| 3A | Candidate Recommendation | Active | r0, 2025-06-23 |

Authors: @marcopolo

Interest Group: @cortze, @cskiraly, @ppopth, @jxs, @raulk, @divagant-martian

See the lifecycle document for context about the maturity level and spec status.

Overview

This version specifies a way for gossipsub peers to describe their characteristics to each other without requiring a new protocol ID per extension.

The extensions.proto file registry MUST be updated upon introducing a new extension, either canonical or experimental, to the network.

Motivation

This version makes Gossipsub easier to extend by allowing applications to selectively make use of the extensions it would benefit from. It removes the need to make Gossipsub extensions follow a strict ordering. Finally, it allows extensions to iterate independently from Gossipsub's versioning.

The Extensions Control Message

If a peer supports any extension, the Extensions control message MUST be included in the first message on the stream. An Extensions control message MUST NOT be sent more than once. If a peer supports no extensions, it may omit sending the Extensions control message.

Extensions are not negotiated; they describe characteristics of the sending peer that can be used by the receiving peer. However, a negotiation can be implied: each peer uses the Extensions control message to advertise a set of supported values. The specification of an extension describes how each peer combines the two sets to define its behavior.

Peers MUST ignore unknown extensions.

Extensions that modify or replace core protocol functionality will be difficult to combine with other extensions that modify or replace the same functionality unless the behavior of the combination is explicitly defined. Such extensions SHOULD define their interaction with previously defined extensions modifying the same protocol components.

Protocol ID

The Gossipsub version for this specification is v1.3.0. The protocol ID is /meshsub/1.3.0.

Process to add a new extension to this spec

Canonical Extensions

A Canonical Extension is an extension that is well defined, has multiple implementations, has shown to be useful in real networks, and has rough consensus on becoming a canonical extension. The extension specification MUST be defined in the libp2p/specs GitHub repo. After an extension meets the stated criteria, extensions.proto MUST be updated to include the extension in the ControlExtensions protobuf with a link to the extension's specification doc in a comment. The extension SHOULD use the next lowest available field number.

Any new messages defined by the extension MUST be added to the RPC message definition in the extensions.proto protobuf. Extensions SHOULD minimize the number of new messages they introduce. Try to introduce a single new message and use it as a container for further messages, similar to the strategy used by the ControlMessage in the RPC.

All extension messages MUST be an optional field.

Experimental Extensions

In contrast with a Canonical Extension, an Experimental Extension is still being evaluated and iterated upon. Adding an experimental extension to extensions.proto lets others see what is being tried and ensures there are no misinterpretations of messages on the wire. A patch to extensions.proto is not needed when experimenting with an extension in a controlled environment, nor when not using the /meshsub/1.3.0 protocol ID.

If the extension is being tested on a live network, a PR MUST be created that adds the extension to the ControlExtensions protobuf with a link to the extension's specification. Experimental extensions MUST use a large field number randomly generated to be at least 4 bytes long when varint encoded. The extension author MUST ensure this field number does not conflict with any existing field.

New messages defined by this extension should follow the same guidelines as new messages for canonical extensions. Field numbers MUST be randomly generated and be at least 4 bytes long when varint encoded.

Maintainers MUST check that the extension is well specified, in the experimental range, and that the extension will be tested on a live network. If so, maintainers SHOULD merge the change.

Protobuf

The extensions.proto file can be found at ./extensions/extensions.proto.

Implementations MUST use the protobuf messages defined in the extensions.proto file.

Implementation status of Gossipsub versions and Extensions

This doc is meant to provide an overview of the implementation status of Gossipsub versions and Extensions.

Gossipsub Versions

| | 1.2 | 1.3-alpha |
|---|---|---|
| Go libp2p | | Open PR |
| Rust libp2p | | In Progress |
| JS libp2p | | Not started |
| Nim libp2p | | Not started |
| Java libp2p | | Not started |

Gossipsub Extensions

| | Choke Extensions | Partial Messages |
|---|---|---|
| Go libp2p | Not Implemented | PR |
| Rust libp2p | Not Implemented | Not Implemented |
| JS libp2p | Not Implemented | Not Implemented |
| Nim libp2p | Not Implemented | Not Implemented |
| Java libp2p | Not Implemented | Not Implemented |

Gossipsub Implementation Improvements

| | Batch Publishing | IDONTWANT on First Publish | WFR Gossip |
|---|---|---|---|
| Go libp2p | PR | | |
| Rust libp2p | Not Implemented | Not Implemented | |
| JS libp2p | Not Implemented | Not Implemented | Not Implemented |
| Nim libp2p | Not Implemented | Not Implemented | Not Implemented |
| Java libp2p | Not Implemented | Not Implemented | Not Implemented |

gossipsub v1.1: Functional Extension for Validation Queue Protection

| Lifecycle Stage | Maturity | Status | Latest Revision |
|---|---|---|---|
| 1A | Working Draft | Active | r1, 2020-09-05 |

Authors: @vyzo

Interest Group: @yusefnapora, @raulk, @whyrusleeping, @Stebalien, @daviddias, @protolambda, @djrtwo, @dryajov, @mpetrunic, @AgeManning, @Nashatyrev, @mhchia

See the lifecycle document for context about maturity level and spec status.


Overview

This document specifies an extension to gossipsub v1.1 intended to provide a circuit breaker so that routers can withstand concerted attacks targeting the validation queue with a flood of spam. This extension does not modify the protocol in any way and works in conjunction with the defensive mechanisms of gossipsub v1.1.

Validation Queue Protection

An important aspect of gossipsub is its reliance on validators to signal, from the application to the router, acceptance of incoming messages. The validation is asynchronous, with a typical implementation strategy using a front-end queue and a limit on the number of concurrent validations. This creates a potential target for attacks, as an attacker can overload the queue by brute force, sending spam messages at a very high rate. The effect would be that legitimate messages get dropped by the validation front end, resulting in denial of service.

In order to protect the system from this class of attacks, gossipsub v1.1 incorporates a circuit breaker that sits before the validation queue and can make informed decisions on whether to push a message into the validation queue. This defensive mechanism kicks in when the system detects an elevated rate of dropped messages, and makes decisions on whether to accept incoming messages for validation based on the statistical performance of peers at the origin IP address. The decision is probabilistic and implements a Random Early Drop (RED) strategy that drops messages with a probability that depends on the acceptance rates for messages from the origin IP. This strategy can neuter attacks on the validation queue, because messages are no longer dropped indiscriminately in a drop-tail fashion.

Random Early Drop Algorithm

The algorithm has two aspects:

  • The decision on whether to trigger RED.
  • The decision on whether to drop a message from an origin IP address.

In order to trigger RED, the circuit breaker maintains the following queue statistics:

  • a decaying counter for the number of message validations.
  • a decaying counter for the number of dropped messages.

The decision on triggering RED is based on comparing the ratio of dropped messages to validations. If the ratio exceeds an application configured threshold, then the RED algorithm triggers and a decision on whether to accept the message for validation is made based on origin IP statistics. There is also a quiet period, such that if no messages have been dropped for a while, the circuit breaker turns back off.

In order to make the actual RED decision, the circuit breaker maintains the following statistics per IP:

  • a decaying counter for the number of accepted messages.
  • a decaying counter for the number of duplicate messages, mixed with a weight W_duplicate.
  • a decaying counter for the number of ignored messages, mixed with a weight W_ignored.
  • a decaying counter for the number of rejected messages, mixed with a weight W_rejected.

The router generates a random float r and accepts the message if and only if

r < (1 + accepted) / (1 + accepted + W_duplicate * duplicate + W_ignored * ignored + W_rejected * rejected)

The number of accepted messages is biased by 1 so that a single negative event cannot sinkhole an IP. It also always gives a chance for a message to be accepted, albeit with sharply decreasing probability as negative events accumulate.

All the counters decay linearly with an application configured decay factor, so that the system adapts to varying network conditions.

Also note that per IP statistics are retained for a configured period of time after disconnection, so that an attacker cannot easily clear traces of misbehaviour by disconnecting.

Finally, the circuit breaker should allow the application to configure per topic accepted delivery weights, so that deliveries in priority topics can be given more weight. If a topic is not configured, then its delivery weight is 1.
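
To make the decision rule concrete, here is a minimal Go sketch of the acceptance check. The ipStats type and redAccept function are hypothetical names; the linear decay, quiet-period, and per-topic weight bookkeeping described above are elided.

import "math/rand"

// ipStats holds the per-IP decaying counters described above. A real
// router would decay these linearly using SourceDecayCoefficient.
type ipStats struct {
  accepted, duplicate, ignored, rejected float64
}

// redAccept makes the probabilistic RED decision for a message from an
// origin IP, using the mixin weights W_duplicate, W_ignore and W_reject.
func redAccept(s ipStats, wDuplicate, wIgnore, wReject float64) bool {
  p := (1 + s.accepted) /
    (1 + s.accepted + wDuplicate*s.duplicate + wIgnore*s.ignored + wReject*s.rejected)
  return rand.Float64() < p
}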

RED Parameters

The circuit breaker utilizes the following application configured parameters:

| Parameter | Purpose | Default |
|---|---|---|
| ActivationThreshold | dropped to validated message ratio threshold for triggering the circuit breaker | 0.33 |
| GlobalDecayCoefficient | linear decay coefficient for global stats | computed such that the counter decays to 1% after 2 minutes |
| SourceDecayCoefficient | linear decay coefficient for per IP stats | computed such that the counter decays to 1% after 1 hour |
| QuietInterval | interval of no dropped message events before turning off the circuit breaker | 1 minute |
| W_duplicate | counter mixin weight for duplicate messages | 0.125 |
| W_ignore | counter mixin weight for ignored messages | 1.0 |
| W_reject | counter mixin weight for rejected messages | 16.0 |
| RetentionPeriod | duration of stats retention after disconnection | 6 hours |

With the default parameters, we are rapidly penalising rejections, mildly penalising ignored messages, and softly weighting duplicate messages because they occur normally for mesh peers. The result is that clearly misbehaving peers, whose messages lead to outright rejections, will account for a substantial part of the decision to break the circuit, while underperforming peers will also factor in, but with less force.

p2p-circuit relay

Circuit Switching for libp2p, also known as TURN or Relay in Networking literature.

Specifications:

Direct Connection Upgrade through Relay

| Lifecycle Stage | Maturity | Status | Latest Revision |
|---|---|---|---|
| 3A | Recommendation | Active | r1, 2021-11-20 |

Authors: @vyzo

Interest Group: @raulk, @stebalien, @whyrusleeping, @mxinden, @marten-seemann

See the lifecycle document for context about maturity level and spec status.

Table of Contents

Introduction

NAT traversal is a quintessential problem in peer-to-peer networks.

We currently utilize relays, which allow us to traverse NATs by using a third party as a proxy. Relays are a reliable fallback that can connect peers behind NAT, albeit with a high-latency, low-bandwidth connection. Unfortunately, they are expensive to scale and maintain if they have to carry all the NATed node traffic in the network.

It is often possible for two peers behind NAT to communicate directly by utilizing a technique called hole punching[1]. The technique relies on the two peers synchronizing and simultaneously opening connections to each other to their predicted external address. It works well for UDP, and reasonably well for TCP.

The problem in hole punching, apart from not working all the time, is the need for rendezvous and synchronization. This is usually accomplished using dedicated signaling servers [2]. However, this introduces yet another piece of infrastructure, while still requiring the use of relays as a fallback for the cases where a direct connection is not possible.

In this specification, we describe a synchronization protocol for direct connectivity with hole punching that eschews signaling servers and utilizes existing relay connections instead. That is, peers start with a relay connection and synchronize directly, without the use of a signaling server. If the hole punching attempt is successful, the peers upgrade their connection to a direct connection and they can close the relay connection. If the hole punching attempt fails, they can keep using the relay connection as they were.

The Protocol

Consider two peers, A and B. A wants to connect to B, which is behind a NAT and advertises relay addresses. A may itself be behind a NAT or be a public node.

The protocol starts with the completion of a relay connection from A to B. Upon observing the new connection, the inbound peer (here B) checks the addresses advertised by A via identify. If that set includes public addresses, then A may be reachable by a direct connection, in which case B attempts a unilateral connection upgrade by initiating a direct connection to A.

If the unilateral connection upgrade attempt fails, or if A is itself a NATed peer that doesn't advertise a public address, then B initiates the direct connection upgrade protocol as follows:

  1. B opens a stream to A using the /libp2p/dcutr protocol.

  2. B sends to A a Connect message containing its observed (and possibly predicted) addresses from identify and starts a timer to measure RTT of the relay connection.

  3. Upon receiving the Connect, A responds with a Connect message containing its observed (and possibly predicted) addresses.

  4. Upon receiving the Connect, B sends a Sync message and starts a timer for half the RTT measured from the time between sending the initial Connect and receiving the response. The purpose of the Sync message and B's timer is to allow the two peers to synchronize so that they perform a simultaneous open that allows hole punching to succeed (see the timing sketch after this list).

  5. Simultaneous Connect. The two nodes follow the steps below in parallel for every address obtained from the Connect message:

    • For a TCP address:
      • Upon receiving the Sync, A immediately dials the address to B.
      • Upon expiry of the timer, B dials the address to A.
      • This will result in a TCP Simultaneous Connect. For the purpose of all protocols run on top of this TCP connection, A is assumed to be the client and B the server.
    • For a QUIC address:
      • Upon receiving the Sync, A immediately dials the address to B.
      • Upon expiry of the timer, B starts to send UDP packets filled with random bytes to A's address. Packets should be sent repeatedly at random intervals between 10 and 200 ms.
      • This will result in a QUIC connection where A is the client and B is the server.
  6. Once a single connection has been established, A SHOULD cancel all outstanding connection attempts. The peers should migrate to the established connection by prioritizing it over the existing relay connection. All new streams should be opened on the direct connection, while the relay connection should be closed after a grace period. Existing long-lived streams will have to be recreated on the new connection once the relay connection is closed.

    If all connection attempts fail, go back to step (1). Inbound peers (here B) SHOULD retry twice (for a total of 3 attempts) before considering the upgrade as failed.
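
To illustrate the timing in steps 2-5, here is a hedged Go sketch of B's side of the synchronization; the sendConnect, awaitConnect, sendSync and dial callbacks are hypothetical stand-ins for the real stream and transport operations.

import "time"

// holePunchB measures the relay RTT across the Connect exchange, sends
// Sync, waits half the RTT, and then dials every address from A's
// Connect message in parallel.
func holePunchB(addrs []string, sendConnect, awaitConnect, sendSync func(), dial func(string)) {
  start := time.Now()
  sendConnect()  // step 2: send Connect and start the RTT timer
  awaitConnect() // step 3: block until A's Connect arrives
  rtt := time.Since(start)
  sendSync()          // step 4: send Sync...
  time.Sleep(rtt / 2) // ...then wait half the measured RTT
  for _, a := range addrs {
    go dial(a) // step 5: simultaneous open on every address
  }
}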

RPC messages

All RPC messages sent over a stream are prefixed with the message length in bytes, encoded as an unsigned variable length integer as defined by the multiformats unsigned-varint spec.

Implementations SHOULD refuse encoded RPC messages (length prefix excluded) larger than 4 KiB.
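
For illustration, a minimal Go sketch of the framing; writeRPC is a hypothetical helper, and Go's standard binary.PutUvarint produces the same bytes as the multiformats unsigned-varint for message sizes in this range.

import (
  "encoding/binary"
  "fmt"
  "io"
)

// writeRPC writes one length-prefixed RPC message: an unsigned varint
// length followed by the encoded protobuf bytes. Senders stay under the
// 4 KiB limit that receivers are expected to enforce.
func writeRPC(w io.Writer, msg []byte) error {
  if len(msg) > 4096 {
    return fmt.Errorf("RPC message too large: %d bytes", len(msg))
  }
  var prefix [binary.MaxVarintLen64]byte
  n := binary.PutUvarint(prefix[:], uint64(len(msg)))
  if _, err := w.Write(prefix[:n]); err != nil {
    return err
  }
  _, err := w.Write(msg)
  return err
}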

RPC messages conform to the following protobuf schema:

syntax = "proto2";

package holepunch.pb;

message HolePunch {
  enum Type {
    CONNECT = 100;
    SYNC = 300;
  }

  required Type type=1;

  repeated bytes ObsAddrs = 2;
}

ObsAddrs is a list of multiaddrs encoded in the binary multiaddr representation. See Addressing specification for details.

FAQ

  • Why exchange CONNECT and SYNC messages once more on each retry?

    Doing an additional CONNECT and SYNC for each retry prevents a flawed RTT measurement on the first attempt from distorting all subsequent retry attempts.

References

  1. Peer-to-Peer Communication Across Network Address Translators. B. Ford and P. Srisuresh. https://pdos.csail.mit.edu/papers/p2pnat.pdf
  2. Interactive Connectivity Establishment (ICE): A Protocol for Network Address Translator (NAT) Traversal for Offer/Answer Protocols. IETF RFC 5245. https://tools.ietf.org/html/rfc5245

Circuit Relay v0.1.0

| Lifecycle Stage | Maturity | Status | Latest Revision |
|---|---|---|---|
| 3A | Recommendation | Active | r1, 2018-06-03 |

Authors: @daviddias

Interest Group: @lgierth, @hsanjuan, @jamesray1, @vyzo, @yusefnapora

See the lifecycle document for context about the maturity level and spec status.

Implementations

Table of Contents

Overview

The circuit relay is a means to establish connectivity between libp2p nodes (e.g. IPFS nodes) that wouldn't otherwise be able to establish a direct connection to each other.

Relay is needed in situations where nodes are behind NAT, reverse proxies, firewalls and/or simply don't support the same transports (e.g. go-ipfs vs. browser-ipfs). Even though libp2p has modules for NAT port mapping (go-libp2p-nat), this isn't always an option, nor does it always work (e.g. non-residential routers, hotspots, etc.). The circuit relay protocol exists to overcome those scenarios.

Unlike a transparent tunnel, where a libp2p peer would just proxy a communication stream to a destination (the destination being unaware of the original source), a circuit relay makes the destination aware of the original source and the circuit followed to establish communication between the two. This provides the destination side with full knowledge of the circuit which, if needed, could be rebuilt in the opposite direction. As a word of caution, dialing a peer back on its source addr:port usually won't work. However, most libp2p implementations (e.g. go-libp2p) enable SO_REUSEPORT and SO_REUSEADDR, and use the listening address as the local address when dialing, to facilitate this connection reversibility.

Apart from that, this relayed connection behaves just like a regular connection would, but over an existing fully formed libp2p stream with another peer (instead of e.g. a raw TCP connection). Think of this as a "virtualized connection". This enables further resource efficiency and maximizes the utility of the underlying connection, as once a NAT'ted peer A has established a connection to a relay R, many peers (B1...Bn) can establish relayed connections to A over that single physical connection. The relay node acts like a circuit switcher over streams between the two nodes, enabling them to reach each other.

Relayed connections are end-to-end encrypted just like regular connections.

The circuit relay consists of both a (tunneled) libp2p transport and a libp2p protocol, mounted on the host. The libp2p transport is the means of establishing and accepting connections, and the libp2p protocol is the means to relaying connections.

+-----+    /ip4/.../tcp/.../ws/p2p/QmRelay    +-------+    /ip4/.../tcp/.../p2p/QmTwo       +-----+
|QmOne| <------------------------------------>|QmRelay|<----------------------------------->|QmTwo|
+-----+   (/libp2p/relay/circuit multistream) +-------+ (/libp2p/relay/circuit multistream) +-----+
     ^                                         +-----+                                         ^
     |           /p2p-circuit/QmTwo            |     |                                         |
     +-----------------------------------------+     +-----------------------------------------+

Notes for the reader:

  • In this document, we use /p2p/Qm... multiaddrs. libp2p previously used /ipfs/Qm... for multiaddrs, and you'll likely see uses of this notation in the wild. /ipfs and /p2p multiaddrs are equivalent, but /ipfs is deprecated and /p2p should be preferred.
  • You may also see /ipfs/Qm... used for content-addressed pathing in IPFS. These are not multiaddrs and this confusion is one of the many motivations for switching to /p2p/Qm... multiaddrs.

Dramatization

Cast:

  • QmOne, the dialing node (browser).
  • QmTwo, the listening node (go-ipfs).
  • QmRelay, a node which speaks the circuit relay protocol (go-ipfs or js-ipfs).

Scene 1:

  • QmOne wants to connect to QmTwo, and through peer routing has acquired a set of addresses of QmTwo.
  • QmTwo doesn't support any of the transports used by QmOne.
  • Awkward silence.

Scene 2:

  • All three nodes have learned to speak the /ipfs/relay/circuit protocol.
  • QmRelay is configured to allow relaying connections between other nodes.
  • QmOne is configured to use QmRelay for relaying.
  • QmOne automatically added /p2p-circuit/p2p/QmTwo to its set of QmTwo addresses.
  • QmOne tries to connect via relaying, because it shares this transport with QmTwo.
  • A lively and prolonged dialogue ensues.

Addressing

/p2p-circuit multiaddrs don't carry any meaning of their own. They need to encapsulate a /p2p address, or be encapsulated in a /p2p address, or both.

As with all other multiaddrs, encapsulation of different protocols determines which metaphorical tubes to connect to each other.

A /p2p-circuit address is formatted as follows:

[<relay peer multiaddr>]/p2p-circuit/<destination peer multiaddr>

Examples:

  • /p2p-circuit/p2p/QmVT6GYwjeeAF5TR485Yc58S3xRF5EFsZ5YAF4VcP3URHt - Arbitrary relay node that can relay to QmVT6GYwjeeAF5TR485Yc58S3xRF5EFsZ5YAF4VcP3URHt (target)
  • /ip4/192.0.2.0/tcp/5002/p2p/QmdPU7PfRyKehdrP5A3WqmjyD6bhVpU1mLGKppa2FjGDjZ/p2p-circuit/p2p/QmVT6GYwjeeAF5TR485Yc58S3xRF5EFsZ5YAF4VcP3URHt - Specific relay node to relay to QmVT6GYwjeeAF5TR485Yc58S3xRF5EFsZ5YAF4VcP3URHt (target)

This opens the door to multi-hop relay, where the second relay is encapsulated in the first relay's multiaddr, such that one relay relays to the next, in a daisy-chain fashion. Example:

<1st relay>/p2p-circuit/<2nd relay>/p2p-circuit/<dst multiaddr>

A few examples:

Using any relay available:

  • /p2p-circuit/p2p/QmTwo
    • Dial QmTwo, through any available relay node (or find one node that can relay).
    • The relay node will use peer routing to find an address for QmTwo if it doesn't have a direct connection.
  • /p2p-circuit/ip4/../tcp/../p2p/QmTwo
    • Dial QmTwo, through any available relay node, but force the relay node to use the encapsulated /ip4 multiaddr for connecting to QmTwo.

Specify a relay:

  • /p2p/QmRelay/p2p-circuit/p2p/QmTwo
    • Dial QmTwo, through QmRelay.
    • Use peer routing to find an address for QmRelay.
    • The relay node will also use peer routing, to find an address for QmTwo.
  • /ip4/../tcp/../p2p/QmRelay/p2p-circuit/p2p/QmTwo
    • Dial QmTwo, through QmRelay.
    • Includes info for connecting to QmRelay.
    • The relay node will use peer routing to find an address for QmTwo, if not already connected.

Double relay:

  • /p2p-circuit/p2p/QmTwo/p2p-circuit/p2p/QmThree
    • Dial QmThree, through a relayed connection to QmTwo.
    • The relay nodes will use peer routing to find an address for QmTwo and QmThree.
    • go-libp2p (reference implementation) does not support nested relayed connections for now, see Future Work section.

Wire format

We start the description of the wire format by illustrating a possible flow scenario, and then describe it in detail phase by phase.

Relay Message

Every message in the relay protocol uses the following protobuf:

syntax = "proto2";

message CircuitRelay {

  enum Status {
    SUCCESS                    = 100;
    HOP_SRC_ADDR_TOO_LONG      = 220;
    HOP_DST_ADDR_TOO_LONG      = 221;
    HOP_SRC_MULTIADDR_INVALID  = 250;
    HOP_DST_MULTIADDR_INVALID  = 251;
    HOP_NO_CONN_TO_DST         = 260;
    HOP_CANT_DIAL_DST          = 261;
    HOP_CANT_OPEN_DST_STREAM   = 262;
    HOP_CANT_SPEAK_RELAY       = 270;
    HOP_CANT_RELAY_TO_SELF     = 280;
    HOP_BACKOFF                = 290;
    STOP_SRC_ADDR_TOO_LONG     = 320;
    STOP_DST_ADDR_TOO_LONG     = 321;
    STOP_SRC_MULTIADDR_INVALID = 350;
    STOP_DST_MULTIADDR_INVALID = 351;
    STOP_RELAY_REFUSED         = 390;
    MALFORMED_MESSAGE          = 400;
  }

  enum Type { // RPC identifier: HOP, STOP, STATUS or CAN_HOP
    HOP = 1;
    STOP = 2;
    STATUS = 3;
    CAN_HOP = 4; // is peer a relay?
  }

  message Peer {
    required bytes id = 1;    // peer id
    repeated bytes addrs = 2; // peer's known addresses
  }

  optional Type type = 1;     // Type of the message

  optional Peer srcPeer = 2;  // srcPeer and dstPeer are used when Type is HOP or STOP
  optional Peer dstPeer = 3;

  optional Status code = 4;   // Status code, used when Type is STATUS
}

High level overview of establishing a relayed connection

Setup:

  • Peers involved, A, B, R
  • A wants to connect to B, but needs to relay through R

Assumptions:

  • A has connection to R, R has connection to B

Events:

  • phase I: Open a request for a relayed stream (A to R).
    • A dials a new stream sAR to R using protocol /libp2p/circuit/relay/0.1.0.
    • A sends a CircuitRelay message with { type: 'HOP', srcPeer: '/p2p/QmA', dstPeer: '/p2p/QmB' } to R through sAR.
    • R receives stream sAR and reads the message from it.
  • phase II: Open a stream to be relayed (R to B).
    • R opens a new stream sRB to B using protocol /libp2p/circuit/relay/0.1.0.
    • R sends a CircuitRelay message with { type: 'STOP', srcPeer: '/p2p/QmA', dstPeer: '/p2p/QmB' } on sRB.
    • R sends a CircuitRelay message with { type: 'STATUS', code: 'SUCCESS' } on sAR.
  • phase III: Streams are piped together, establishing a circuit (sketched after this list)
    • B receives stream sRB and reads the message from it.
    • B sends a CircuitRelay message with { type: 'STATUS', code: 'SUCCESS' } on sRB.
    • B passes stream to NewConnHandler to be handled like any other new incoming connection.
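
The bridging in phase III amounts to piping the two streams into each other. A minimal sketch, assuming both streams are exposed as io.ReadWriteCloser:

import "io"

// bridge pipes the two relay streams (sAR and sRB) into each other,
// closing each side when the other stops sending.
func bridge(a, b io.ReadWriteCloser) {
  go func() {
    io.Copy(a, b) // B -> R -> A direction
    a.Close()
  }()
  io.Copy(b, a) // A -> R -> B direction
  b.Close()
}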

Under the microscope

  • We've arbitrarily defined a maximum length of 1024 bytes for multiaddrs.
  • Multiaddrs are transferred in their binary packed format.
  • Peer IDs are transferred in their non-base-encoded format (i.e. a byte array containing the multihash of the public key).

Status codes table

This is a table of status codes and sample messages that may occur during a relay setup. Codes in the 200 range are returned by the relay node. Codes in the 300 range are returned by the destination node.

| Code | Message | Meaning |
|---|---|---|
| 100 | "success" | Relay was setup correctly |
| 220 | "src address too long" | |
| 221 | "dst address too long" | |
| 250 | "failed to parse src addr: no such protocol ipfs" | The <src> multiaddr in the header was invalid |
| 251 | "failed to parse dst addr: no such protocol ipfs" | The <dst> multiaddr in the header was invalid |
| 260 | "passive relay has no connection to dst" | |
| 261 | "active relay couldn't dial to dst: conn refused" | relay could not form new connection to target peer |
| 262 | "couldn't dial to dst" | relay has conn to dst, but failed to open a stream |
| 270 | "dst does not support relay" | |
| 280 | "can't relay to itself" | The relay got its own address as destination |
| 290 | "temporary backoff" | The relay wants us to back off and try again later |
| 320 | "src address too long" | |
| 321 | "dst address too long" | |
| 350 | "failed to parse src addr" | src multiaddr in the header was invalid |
| 351 | "failed to parse dst addr" | dst multiaddr in the header was invalid |
| 390 | "connection refused by stop endpoint" | The stop endpoint couldn't accept the connection |
| 400 | "malformed message" | A malformed or too long message was received |

Implementation details

Interfaces

These are go-ipfs specific.

As explained above, the relay is both a transport (tpt.Transport) and a mounted stream protocol (p2pnet.StreamHandler). In addition it provides a means of specifying relay nodes to listen/dial through.

Note: the usage of p2pnet.StreamHandler is a little bit off herein, but it gets the point across.

import (
  tpt "github.com/libp2p/go-libp2p-transport"
  p2phost "github.com/libp2p/go-libp2p-host"
  p2pnet "github.com/libp2p/go-libp2p-net"
  p2proto "github.com/libp2p/go-libp2p-protocol"
)

const ID p2proto.ID = "/libp2p/circuit/relay/0.1.0"

type CircuitRelay interface {
  tpt.Transport
  p2pnet.StreamHandler

  EnableRelaying(enabled bool)
}

func NewCircuitRelay(h p2phost.Host)

Removing existing relay protocol in Go

Note that there is an existing swarm protocol colloquially called relay. It lives in the go-libp2p package and is named /ipfs/relay/line/0.1.0.

  • Introduced in ipfs/go-ipfs#478 (28-Dec-2014).
  • No changes except for ipfs/go-ipfs@de50b2156299829c000b8d2df493b4c46e3f24e9.
    • Changed to use multistream muxer.
  • Shortcomings
    • No end-to-end encryption.
    • No rate limiting (DoS by resource exhaustion).
    • Doesn't verify src id in ReadHeader(), easy to fix.
  • Capable of accepting connections, and relaying connections.
  • Not capable of connecting via relaying.

Since the existing protocol is incomplete, insecure, and certainly not used, we can safely remove it.

Future work

We have considered more features, but won't be adding them in the first iteration of Circuit Relay. These features are:

  • Multihop relay - With this specification, we are only enabling single hop relays to exist. Multihop relay will come at a later stage as Packet Switching.
  • Relay discovery mechanism - At the moment we're not including a mechanism for discovering relay nodes. For the time being, they should be configured statically.

Circuit Relay v2

This is the version 2 of the libp2p Circuit Relay protocol.

| Lifecycle Stage | Maturity | Status | Latest Revision |
|---|---|---|---|
| 3A | Recommendation | Active | r3, 2023-02-28 |

Authors: @vyzo

Interest Group: @mxinden, @stebalien, @raulk

See the lifecycle document for context about the maturity level and spec status.

Table of Contents

Introduction

This is the specification of v2 of the p2p-circuit relay protocol.

Compared to the first version of the protocol, there are some significant departures:

  • The protocol has been split into two subprotocols, hop and stop
    • The hop protocol is client-initiated, and is used when clients send commands to relays; it is used for reserving resources in the relay and opening a switched connection to a peer through the relay.
    • The stop protocol governs the endpoints of circuit switched connections.
  • The concept of resource reservation has been introduced, whereby peers wishing to use a relay explicitly reserve resources and obtain reservation vouchers which can be distributed to their peers for routing purposes.
  • The concept of limited relaying has been introduced, whereby relays provide switched connectivity with a limited duration and data cap.

Rationale

The evolution of the protocol towards v2 has been influenced by our experience in operating open relays in the wild. The original protocol, while very flexible, has some limitations when it comes to the practicalities of relaying connections.

The main problem is that v1 has no mechanism to reserve resources in the relay, which leads to continuous over-subscription of relays and the necessity of (often ineffective) heuristics for balancing resources. In practice, running a relay proved to be an expensive proposition requiring dedicated hosts with significant hardware and bandwidth costs. In addition, there is ongoing work in Hole Punching coordination for direct connection upgrade through relays, which doesn't require an unlimited relay connection.

In order to address the situation and seamlessly support pervasive hole punching, we have introduced limited relays and slot reservations. This allows relays to effectively manage their resources and provide service at a small scale, thus enabling the deployment of an army of relays for extreme horizontal scaling without excessive bandwidth costs and dedicated hosts.

Furthermore, the original decision to conflate circuit initiation and termination in the same protocol has made it very hard to provide relay service on demand, decoupled from whether client functionality is supported by the host.

In order to address this problem, we have split the protocol into the hop and stop subprotocols. This allows us to always enable the client-side functionality in a host, while providing the option to later mount the relay service in public hosts, after the reachability of the host has been determined through AutoNAT.

The Protocol

Interaction

The following diagram illustrates the interaction between three peers, A, B, and R, in the course of establishing a relayed connection. Peer A is a private peer, which is not publicly reachable; it utilizes the services of peer R as the relay. Peer B is another peer who wishes to connect to peer A through R.

Circuit v2 Protocol Interaction

Instructions to reproduce diagram

Use https://plantuml.com/ and the specification below to reproduce the diagram.

@startuml
participant A
participant R
participant B

skinparam sequenceMessageAlign center

== Reservation ==

A -> R: [hop] RESERVE
R -> A: [hop] STATUS:OK

hnote over A: Reservation timeout approaching.
hnote over A: Refresh.

A -> R: [hop] RESERVE
R -> A: [hop] STATUS:OK

hnote over A: ...

== Circuit Establishment ==

B -> R: [hop] CONNECT to A
R -> A: [stop] CONNECT from B
A -> R: [stop] STATUS:OK
R -> B: [hop] STATUS:OK

B <-> A: Connection
@enduml

The first part of the interaction is A's reservation of a relay slot in R. This is accomplished by opening a connection to R and sending a RESERVE message in the hop protocol; if the reservation is successful, the relay responds with a STATUS:OK message and provides A with a reservation voucher. A keeps the connection to R alive for the duration of the reservation, refreshing the reservation as needed.

The second part of the interaction is the establishment of a circuit switched connection from B to A through R. It is assumed that B has obtained a circuit multiaddr for A of the form /p2p/QmR/p2p-circuit/p2p/QmA out of band using some peer discovery service (e.g. the DHT or a rendezvous point).

In order to connect to A, B then connects to R, opens a hop protocol stream and sends a CONNECT message to the relay. The relay verifies that it has a reservation and connection for A and opens a stop protocol stream to A, sending a CONNECT message.

Peer A then responds to the relay with a STATUS:OK message; the relay in turn responds to B with a STATUS:OK message in the open hop stream and then proceeds to bridge the two streams into a relayed connection. The relayed connection flows in the hop stream between the connection initiator and the relay and in the stop stream between the relay and the connection termination point.

B and A upgrade the relayed connection with a security protocol and a multiplexer, just like they would e.g. upgrade a TCP connection.

Hop Protocol

The Hop protocol governs interaction between clients and the relay; it uses the protocol ID /libp2p/circuit/relay/0.2.0/hop.

There are two parts of the protocol:

  • reservation, by peers that wish to receive relay service
  • connection initiation, by peers that wish to connect to a peer through the relay.

Reservation

In order to make a reservation, a peer opens a connection to the relay and sends a HopMessage with type = RESERVE:

HopMessage {
  type = RESERVE
}

The relay responds with a HopMessage of type = STATUS, indicating whether the reservation has been accepted.

If the reservation is accepted, then the message has the following form:

HopMessage {
  type = STATUS
  status = OK
  reservation = Reservation {...}
  limit = Limit {...}
}

If the reservation is rejected, the relay responds with a HopMessage of the form

HopMessage {
  type = STATUS
  status = ...
}

where the status field has a value other than OK. Common rejection status codes are:

  • PERMISSION_DENIED if the reservation is rejected because of peer filtering using ACLs.
  • RESERVATION_REFUSED if the reservation is rejected for some other reason, e.g. because there are too many reservations.

The reservation field provides information about the reservation itself; the struct has the following fields:

Reservation {
   expire = ...
   addrs = [...]
   voucher = ...
}
  • the expire field contains the expiration time as a UTC UNIX time in seconds. The reservation becomes invalid after this time and it's the responsibility of the client to refresh.
  • the addrs field contains all the public relay addrs, including the peer ID of the relay node but not the trailing p2p-circuit part; the client can use this list to construct its own p2p-circuit relay addrs for advertising by encapsulating /p2p-circuit/p2p/QmPeer, where QmPeer is its peer ID (see the sketch after this list).
  • the voucher is the binary representation of the reservation voucher -- see Reservation Vouchers for details.
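
As a sketch of the address construction (using plain strings in place of a real multiaddr type; advertiseAddrs is a hypothetical helper):

import "fmt"

// advertiseAddrs builds the p2p-circuit addrs a client can advertise,
// from the relay addrs returned in the reservation; selfID is the
// client's own peer ID.
func advertiseAddrs(relayAddrs []string, selfID string) []string {
  out := make([]string, 0, len(relayAddrs))
  for _, a := range relayAddrs {
    out = append(out, fmt.Sprintf("%s/p2p-circuit/p2p/%s", a, selfID))
  }
  return out
}

For example, a relay addr of /ip4/203.0.113.1/tcp/4001/p2p/QmRelay yields /ip4/203.0.113.1/tcp/4001/p2p/QmRelay/p2p-circuit/p2p/QmPeer.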

The limit field in HopMessage, if present, provides information about the limits applied by the relay to the relayed connection. When omitted, it indicates that the relay does not apply any limits.

The struct has the following fields:

Limit {
  duration = ...
  data = ...
}
  • the duration field indicates the maximum duration of a relayed connection in seconds; if 0, there is no limit applied.
  • the data field indicates the maximum number of bytes allowed to be transmitted in each direction; if 0 there is no limit applied.

Note that the reservation remains valid until its expiration, as long as there is an active connection from the peer to the relay. If the peer disconnects, the reservation is no longer valid.

The server may drop a connection according to its connection management policy after all reservations have expired. The expectation is that the server will make a best-effort attempt to maintain the connection for the duration of any reservations and tag it to prevent accidental termination by its connection management policy. If a relay server becomes overloaded, however, it may still drop a connection with reservations in order to maintain its resource quotas.

If more data than the limit specified in the data field is transferred over the relayed connection, or the relayed connection has been open for longer than duration, the relay should reset the stream to the source and the stream to the destination.
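
One direction of the data-cap enforcement could be sketched as follows; relayWithLimit is a hypothetical helper, and the duration limit and the actual stream resets are elided.

import "io"

// relayWithLimit copies at most limit bytes from src to dst and reports
// whether the data cap was exceeded, in which case the relay should
// reset both the source and destination streams.
func relayWithLimit(dst io.Writer, src io.Reader, limit int64) (exceeded bool, err error) {
  n, err := io.Copy(dst, io.LimitReader(src, limit))
  if err != nil || n < limit {
    return false, err
  }
  // Exactly limit bytes were copied; probe for pending data past the cap.
  var probe [1]byte
  if m, _ := src.Read(probe[:]); m > 0 {
    return true, nil // over the cap: reset both streams
  }
  return false, nil
}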

If the reservation for the connection has expired, then the relay may apply its normal connection management policy to the connection; otherwise it should retain the connection, unless doing so would prevent it from maintaining its resource quotas.

Note: Implementations should not accept reservations over already relayed connections.

Connection Initiation

In order to initiate a connection to a peer through a relay, the initiator opens a connection and sends a HopMessage of type = CONNECT:

HopMessage {
  type = CONNECT
  peer = Peer {...}
}

The peer field contains the peer ID of the target peer and optionally the address of that peer for the case of active relay:

Peer {
  id = ...
  addrs = [...]
}

Note: Active relay functionality is considered deprecated for security reasons, at least in public relays.

The protocol reserves the field nonetheless to support the functionality for the rare cases where it is actually desirable to use active relay functionality in a controlled environment.

If the relay has a reservation (and thus an active connection) from the peer, then it opens the second hop of the connection using the stop protocol; the details are not relevant for the hop protocol and the only thing that matters is whether it succeeds in opening the relay connection or not. If the relayed connection is successfully established, then the relay responds with HopMessage with type = STATUS and status = OK:

HopMessage {
  type = STATUS
  status = OK
  limit = Limit {...}
}

At this point the original hop stream becomes the relayed connection. The limit field, if present, communicates to the initiator the limits applied to the relayed connection with the semantics described above.

If the relayed connection cannot be established, then the relay responds with a HopMessage of type = STATUS and the status field having a value other than OK. Common failure status codes are:

  • PERMISSION_DENIED if the connection is rejected because of peer filtering using ACLs.
  • NO_RESERVATION if there is no active reservation for the target peer
  • RESOURCE_LIMIT_EXCEEDED if there are too many relayed connections from the initiator or to the target peer.
  • CONNECTION_FAILED if the relay failed to terminate the connection to the target peer.

Note: Implementations should not accept connection initiations over already relayed connections.

Stop Protocol

The Stop protocol governs connection termination between the relay and the target peer; it uses the protocol ID /libp2p/circuit/relay/0.2.0/stop.

In order to terminate a relayed connection, the relay opens a stream using an existing connection to the target peer. If there is no existing connection, an active relay may attempt to open one using the initiator supplied address, but as discussed in the previous section this functionality is generally deprecated.

The relay sends a StopMessage with type = CONNECT and the following form:

StopMessage {
  type = CONNECT
  peer = Peer { ID = ...}
  limit = Limit { ...}
}
  • the peer field contains a Peer struct with the peer ID of the connection initiator.
  • the limit field, if present, conveys the limits applied to the relayed connection with the semantics described above.

If the target peer accepts the connection it responds to the relay with a StopMessage of type = STATUS and status = OK:

StopMessage {
  type = STATUS
  status = OK
}

At this point the original stop stream becomes the relayed connection.

If the target fails to terminate the connection for some reason, then it responds to the relay with a StopMessage of type = STATUS and the status code set to something other than OK. Common failure status codes are:

  • CONNECTION_FAILED if the target internally failed to create the relayed connection for some reason.

Reservation Vouchers

Successful relay slot reservations should come with Reservation Vouchers. These are cryptographic certificates signed by the relay that testify that it is willing to provide service to the reserving peer. The intention is to eventually require the use of reservation vouchers for dialing relay addresses, but this is not currently enforced so the vouchers are only advisory.

The voucher itself is a Signed Envelope. The envelope domain is libp2p-relay-rsvp and uses the multicodec code 0x0302.

The payload of the envelope has the following form, in canonicalized protobuf format:

syntax = "proto3";
message Voucher {
  // These fields are marked optional for backwards compatibility with proto2.
  // Users should make sure to always set these.
  optional bytes relay = 1;
  optional bytes peer = 2;
  optional uint64 expiration = 3;
}
  • the relay field is the peer ID of the relay.
  • the peer field is the peer ID of the reserving peer.
  • the expiration field is the UNIX UTC expiration time for the reservation.

The wire representation is canonicalized, where elements of the message are written in field id order, with no unknown fields.

Protobuf

syntax = "proto3";
message HopMessage {
  enum Type {
    RESERVE = 0;
    CONNECT = 1;
    STATUS = 2;
  }

  // This field is marked optional for backwards compatibility with proto2.
  // Users should make sure to always set this.
  optional Type type = 1;

  optional Peer peer = 2;
  optional Reservation reservation = 3;
  optional Limit limit = 4;

  optional Status status = 5;
}

message StopMessage {
  enum Type {
    CONNECT = 0;
    STATUS = 1;
  }

  // This field is marked optional for backwards compatibility with proto2.
  // Users should make sure to always set this.
  optional Type type = 1;

  optional Peer peer = 2;
  optional Limit limit = 3;

  optional Status status = 4;
}

message Peer {
  // This field is marked optional for backwards compatibility with proto2.
  // Users should make sure to always set this.
  optional bytes id = 1;
  repeated bytes addrs = 2;
}

message Reservation {
  // This field is marked optional for backwards compatibility with proto2.
  // Users should make sure to always set this.
  optional uint64 expire = 1; // Unix expiration time (UTC)
  repeated bytes addrs = 2;   // relay addrs for reserving peer
  optional bytes voucher = 3; // reservation voucher
}

message Limit {
  optional uint32 duration = 1; // seconds
  optional uint64 data = 2;     // bytes
}

enum Status {
  // zero value field required for proto3 compatibility
  UNUSED                  = 0;
  OK                      = 100;
  RESERVATION_REFUSED     = 200;
  RESOURCE_LIMIT_EXCEEDED = 201;
  PERMISSION_DENIED       = 202;
  CONNECTION_FAILED       = 203;
  NO_RESERVATION          = 204;
  MALFORMED_MESSAGE       = 400;
  UNEXPECTED_MESSAGE      = 401;
}

Rendezvous Protocol

| Lifecycle Stage | Maturity | Status | Latest Revision |
|---|---|---|---|
| 1A | Working Draft | Active | r3, 2021-07-12 |

Authors: @vyzo

Interest Group: @daviddias, @whyrusleeping, @Stebalien, @jacobheun, @yusefnapora, @vasco-santos

See the lifecycle document for context about the maturity level and spec status.

Table of Contents

Overview

The protocol described in this specification is intended to provide a lightweight mechanism for generalized peer discovery. It can be used for purposes like bootstrapping, real-time peer discovery, and application-specific routing. Any node implementing the rendezvous protocol can act as a rendezvous point, allowing the discovery of relevant peers in a decentralized manner.

Use Cases

Depending on the application, the protocol could be used in the following context:

  • During bootstrap, a node can use known rendezvous points to discover peers that provide critical services. For instance, rendezvous can be used to discover circuit relays for connectivity-restricted nodes.
  • During initialization, a node can use rendezvous to discover peers to connect with the rest of the application. For instance, rendezvous can discover pubsub peers within a topic.
  • In a real-time setting, applications can poll rendezvous points in order to discover new peers in a timely fashion.
  • In an application-specific routing setting, rendezvous points can be used to progressively discover peers that can answer specific queries or host shards of content.

Replacing ws-star-rendezvous

We intend to replace ws-star-rendezvous with a few rendezvous daemons and a fleet of p2p-circuit relays. Real-time applications will utilize rendezvous both for bootstrap and in a real-time setting. During bootstrap, rendezvous will be used to discover circuit relays that provide connectivity for browser nodes. Subsequently, rendezvous will be utilized throughout the application's lifetime for real-time peer discovery by registering and polling rendezvous points. This allows us to replace a fragile centralized component with a horizontally scalable ensemble of daemons.

Rendezvous and pubsub

Rendezvous can be naturally combined with pubsub for effective real-time discovery. At a basic level, rendezvous can bootstrap pubsub: nodes can utilize rendezvous to discover their peers within a topic. Alternatively, pubsub can also be used to build rendezvous services. In this scenario, several rendezvous points can federate using pubsub for internal real-time distribution while still providing a simple interface to clients.

The Protocol

The rendezvous protocol provides facilities for real-time peer discovery within application-specific namespaces. Peers connect to the rendezvous point and register their presence in one or more namespaces. It is not allowed to register arbitrary peers in a namespace; only the peer initiating the registration can register itself. The register message contains a serialized signed peer record created by the peer, which others can validate.

Other nodes can discover peers registered with the rendezvous point by querying the rendezvous point. The query specifies the namespace for limiting application scope and, optionally, a maximum number of peers to return. The namespace can be omitted in the query, which asks for all peers registered to the rendezvous point.

The query can also include a cookie obtained from the response to a previous query, such that only registrations that weren't included in the previous response will be returned. This lets peers progressively refresh their network view without overhead, simplifying real-time discovery. It also allows for the pagination of query responses so peers can manage large numbers of peer registrations.

The rendezvous protocol runs over libp2p streams using the protocol id /rendezvous/1.0.0.

Registration Lifetime

An optional TTL parameter in the REGISTER message controls the registration lifetime. If a TTL is specified, then the registration persists until the TTL expires. If no TTL is set, a default of 2 hours is implied. There may be a rendezvous-point-specific upper bound on TTL, with a maximum value of 72 hours. If the TTL of a registration is inadmissible, the rendezvous point may reject the registration with an E_INVALID_TTL status.

Peers can refresh their registrations at any time with a new REGISTER message; the TTL of the new message supersedes previous registrations. Peers can also cancel existing registrations at any time with an explicit UNREGISTER message. An UNREGISTER message does not have an explicit response. UNREGISTER messages for a namespace that a client is not registered for should be treated as a no-op.

The registration response includes the actual TTL of the registration, so that peers know when to refresh.

Interaction

Clients A and B connect to the rendezvous point R and register for namespace my-app with a REGISTER message:

A -> R: REGISTER{my-app, {QmA, AddrA}}
R -> A: {OK}
B -> R: REGISTER{my-app, {QmB, AddrB}}
R -> B: {OK}

Client C connects and registers for namespace another-app:

C -> R: REGISTER{another-app, {QmC, AddrC}}
R -> C: {OK}

Another client D can discover peers in my-app by sending a DISCOVER message; the rendezvous point responds with the list of current peer registrations and a cookie.

D -> R: DISCOVER{ns: my-app}
R -> D: {[REGISTER{my-app, {QmA, Addr}}
          REGISTER{my-app, {QmB, Addr}}],
         c1}

If D wants to discover all peers registered with R, then it can omit the namespace in the query:

D -> R: DISCOVER{}
R -> D: {[REGISTER{my-app, {QmA, Addr}}
          REGISTER{my-app, {QmB, Addr}}
          REGISTER{another-app, {QmC, AddrC}}],
         c2}

If D wants to poll for real-time discovery progressively, it can use the cookie obtained from a previous response to ask only for new registrations.

Here we consider a new client E registering after the first query, and a subsequent query that discovers just that peer by including the cookie:

E -> R: REGISTER{my-app, {QmE, AddrE}}
R -> E: {OK}
D -> R: DISCOVER{ns: my-app, cookie: c1}
R -> D: {[REGISTER{my-app, {QmE, AddrE}}],
         c3}
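
A client's progressive polling loop could be sketched like this; discoverOnce and handle are hypothetical stand-ins for sending DISCOVER{ns, cookie}, parsing the response, and processing each returned registration.

import "time"

// pollDiscover repeatedly queries a rendezvous point for a namespace,
// threading the cookie from each response into the next query so that
// only new registrations are returned.
func pollDiscover(ns string, discoverOnce func(ns string, cookie []byte) ([]string, []byte, error), handle func(string)) error {
  var cookie []byte
  for {
    regs, next, err := discoverOnce(ns, cookie)
    if err != nil {
      return err
    }
    for _, r := range regs {
      handle(r) // newly discovered registration
    }
    cookie = next
    time.Sleep(30 * time.Second) // poll interval is application-specific
  }
}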

Spam mitigation

The protocol, as described so far, is susceptible to spam attacks from adversarial actors who generate a large number of peer identities and register under a namespace of interest (e.g., the relay namespace).

It is TBD how exactly the protocol will mitigate such attacks. See https://github.com/libp2p/specs/issues/341 for a discussion on this topic.

Protobuf

syntax = "proto2";

message Message {
  enum MessageType {
    REGISTER = 0;
    REGISTER_RESPONSE = 1;
    UNREGISTER = 2;
    DISCOVER = 3;
    DISCOVER_RESPONSE = 4;
  }

  enum ResponseStatus {
    OK                            = 0;
    E_INVALID_NAMESPACE           = 100;
    E_INVALID_SIGNED_PEER_RECORD  = 101;
    E_INVALID_TTL                 = 102;
    E_INVALID_COOKIE              = 103;
    E_NOT_AUTHORIZED              = 200;
    E_INTERNAL_ERROR              = 300;
    E_UNAVAILABLE                 = 400;
  }

  message Register {
    optional string ns = 1;
    optional bytes signedPeerRecord = 2;
    optional uint64 ttl = 3; // in seconds
  }

  message RegisterResponse {
    optional ResponseStatus status = 1;
    optional string statusText = 2;
    optional uint64 ttl = 3; // in seconds
  }

  message Unregister {
    optional string ns = 1;
    // optional bytes id = 2; deprecated as per https://github.com/libp2p/specs/issues/335
  }

  message Discover {
    optional string ns = 1;
    optional uint64 limit = 2;
    optional bytes cookie = 3;
  }

  message DiscoverResponse {
    repeated Register registrations = 1;
    optional bytes cookie = 2;
    optional ResponseStatus status = 3;
    optional string statusText = 4;
  }

  optional MessageType type = 1;
  optional Register register = 2;
  optional RegisterResponse registerResponse = 3;
  optional Unregister unregister = 4;
  optional Discover discover = 5;
  optional DiscoverResponse discoverResponse = 6;
}

Recommendations for Rendezvous Points configurations

Rendezvous points should have well-defined configurations to enable libp2p nodes running the rendezvous protocol to have friendly defaults, as well as to guarantee the security and efficiency of a Rendezvous point. This will be particularly important in a federation, where rendezvous points should share the same expectations.

Regarding the validation of registrations, rendezvous points should have the following:

  • a minimum acceptable TTL of 2 hours
  • a maximum acceptable TTL of 72 hours
  • a maximum namespace length of 255

Rendezvous points are also recommended to allow:

  • a maximum of 1000 registrations for each peer
    • to defend against trivial DoS attacks
  • a maximum of 1000 peers returned per namespace query

SECIO 1.0.0

A stream security transport for libp2p. Streams wrapped by SECIO use secure sessions to encrypt all traffic.

SECIO is deprecated and we advise against using it. See this blog post for details.

| Lifecycle Stage | Maturity Level | Status | Latest Revision |
|---|---|---|---|
| 3D | Recommendation | Deprecated | r1, 2021-03-26 |

Authors: @jbenet, @bigs, @yusefnapora

Interest Group: @Stebalien, @richardschneider, @tomaka, @raulk

See the lifecycle document for context about maturity level and spec status.

Table of Contents

Implementations

Algorithm Support

SECIO allows participating peers to support a subset of the following algorithms.

Exchanges

The following elliptic curves are used for ephemeral key generation:

  • P-256
  • P-384
  • P-521

Ciphers

The following symmetric ciphers are used for encryption of messages once the SECIO channel is established:

  • AES-256
  • AES-128

Note that current versions of go-libp2p support the Blowfish cipher; however, support for Blowfish will be dropped in future releases and should not be considered part of the SECIO spec.

Hashes

The following hash algorithms are used for key stretching and for HMACs once the SECIO channel is established:

  • SHA256
  • SHA512

Data Structures

The SECIO wire protocol features two message types defined in the version 2 syntax of the protobuf description language.

syntax = "proto2";

message Propose {
	optional bytes rand = 1;
	optional bytes pubkey = 2;
	optional string exchanges = 3;
	optional string ciphers = 4;
	optional string hashes = 5;
}

message Exchange {
	optional bytes epubkey = 1;
	optional bytes signature = 2;
}

These two messages, Propose and Exchange, are the only serialized types required to implement SECIO.

Protocol

Prerequisites

Prior to undertaking the SECIO handshake described below, it is assumed that we have already established a dedicated bidirectional channel between both parties, and that both have agreed to proceed with the SECIO handshake using multistream-select or some other form of protocol negotiation.

Message framing

All messages sent over the wire are prefixed with the message length in bytes, encoded as an unsigned 32-bit big-endian integer. The message length should always be less than 8 MiB (0x800000 bytes).

Proposal Generation

SECIO channel negotiation begins with a proposal phase.

Each side will construct a Propose protobuf message (as defined above), setting the fields as follows:

| field | value |
|---|---|
| rand | A 16 byte random nonce, generated using the most secure means available |
| pubkey | The sender's public key, serialized as described in the peer-id spec |
| exchanges | A list of supported key exchanges as a comma-separated string |
| ciphers | A list of supported ciphers as a comma-separated string |
| hashes | A list of supported hashes as a comma-separated string |

Both parties serialize this message and send it over the wire. If either party has prior knowledge of the other party's peer id, they may attempt to validate that the given public key can be used to generate the same peer id, and may close the connection if there is a mismatch.

Determining Roles and Algorithms

Next, the peers use a deterministic formula to compute their roles in the coming exchanges. Each peer computes:

oh1 := sha256(concat(remotePeerPubKeyBytes, myNonce))
oh2 := sha256(concat(myPubKeyBytes, remotePeerNonce))

Where myNonce is the rand component of the local peer's Propose message, and remotePeerNonce is the rand field from the remote peer's proposal.

With these hashes, determine which peer's preferences to favor. This peer will be referred to as the "preferred peer". If oh1 == oh2, then the peer is communicating with itself and should return an error. If oh1 < oh2, use the remote peer's preferences. If oh1 > oh2, prefer the local peer's preferences.

Given our preference, we now sort through each of the exchanges, ciphers, and hashes provided by both peers, selecting the first item from our preferred peer's set that is also shared by the other peer.
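
A minimal Go sketch of this role computation (preferRemote is a hypothetical name):

import (
  "bytes"
  "crypto/sha256"
  "errors"
)

// preferRemote reports whether the remote peer's preference order wins,
// following the oh1/oh2 comparison described above.
func preferRemote(myPubKey, myNonce, remotePubKey, remoteNonce []byte) (bool, error) {
  oh1 := sha256.Sum256(append(append([]byte{}, remotePubKey...), myNonce...))
  oh2 := sha256.Sum256(append(append([]byte{}, myPubKey...), remoteNonce...))
  switch bytes.Compare(oh1[:], oh2[:]) {
  case 0:
    return false, errors.New("secio: talking to self, identical digests")
  case -1:
    return true, nil // oh1 < oh2: favor the remote peer's preferences
  default:
    return false, nil // oh1 > oh2: favor the local peer's preferences
  }
}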

Key Exchange

Now the peers prepare a key exchange.

Both peers generate an ephemeral keypair using the elliptic curve algorithm that was chosen from the proposed exchanges in the previous step.

With keys generated, both peers create an Exchange message. They start by generating a "corpus" that they will sign.

corpus := concat(myProposalBytes, remotePeerProposalBytes, ephemeralPubKey)

The corpus is then signed using the permanent private key associated with the local peer's peer id, producing a byte array signature.

| field | value |
|---|---|
| epubkey | The ephemeral public key, marshaled as described below |
| signature | The signature of the corpus described above |

The peers serialize their Exchange messages and write them over the wire. Upon receiving the remote peer's Exchange, the local peer will compute the remote peer's expected corpus using the known proposal bytes and the ephemeral public key sent by the remote peer in the Exchange. The signature can then be validated using the permanent public key of the remote peer obtained in the initial proposal.

Peers MUST close the connection if the signature does not validate.

Key marshaling

Within the Exchange message, ephemeral public keys are marshaled into the uncompressed form specified in section 4.3.6 of ANSI X9.62.

This is the behavior provided by the go standard library's elliptic.Marshal function.

Shared Secret Generation

Peers now generate their shared secret by combining their ephemeral private key with the remote peer's ephemeral public key.

First, the remote ephemeral public key is unmarshaled into a point on the elliptic curve used in the agreed-upon exchange algorithm. If the point is not valid for the agreed-upon curve, secret generation fails and the connection must be closed.

The remote ephemeral public key is then combined with the local ephemeral private key by means of elliptic curve scalar multiplication. The result of the multiplication is the shared secret, which will then be stretched to produce MAC and cipher keys, as described in the next section.
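
As a sketch using Go's crypto/ecdh package (available since Go 1.20), which performs the point validation and scalar multiplication in one step:

import "crypto/ecdh"

// sharedSecret unmarshals the remote ephemeral public key (rejecting
// points that are not on the agreed-upon curve) and multiplies it by
// our ephemeral private key.
func sharedSecret(curve ecdh.Curve, priv *ecdh.PrivateKey, remotePub []byte) ([]byte, error) {
  pub, err := curve.NewPublicKey(remotePub)
  if err != nil {
    return nil, err // invalid point: the connection must be closed
  }
  return priv.ECDH(pub)
}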

Key Stretching

The key stretching process uses an HMAC algorithm to derive encryption and MAC keys and a stream cipher initialization vector from the shared secret.

Key stretching produces the following three values for each peer:

  • A MAC key used to initialize an HMAC algorithm for message verification
  • A cipher key used to initialize a block cipher
  • An initialization vector (IV), used to generate a CTR stream cipher from the block cipher

The key stretching function will return two data structures k1 and k2, each containing the three values above.

Before beginning the stretching process, the size of the IV and cipher key are determined according to the agreed-upon cipher algorithm. The sizes (in bytes) used are as follows:

| cipher type | cipher key size | IV size |
|---|---|---|
| AES-128 | 16 | 16 |
| AES-256 | 32 | 16 |

The generated MAC key will always have a size of 20 bytes.

Once the sizes are known, we can compute the total size of the output we need to generate as outputSize := 2 * (ivSize + cipherKeySize + macKeySize).

The stretching algorithm will then proceed as follows:

First, an HMAC instance is initialized using the agreed upon hash function and shared secret.

A fixed seed value of "key expansion" (encoded into bytes as UTF-8) is fed into the HMAC to produce an initial digest a.

Then, the following process repeats until outputSize bytes have been generated:

  • reset the HMAC instance or generate a new one using the same hash function and shared secret
  • compute digest b by feeding a and the seed value into the HMAC:
    • b := hmac_digest(concat(a, "key expansion"))
  • append b to previously generated output (if any).
    • if, after appending b, the generated output exceeds outputSize, the output is truncated to outputSize and generation ends.
  • reset the HMAC and feed a into it, producing a new value for a to be used in the next iteration
    • a = hmac_digest(a)
  • repeat until outputSize is reached

Having generated outputSize bytes, the output is then split into six parts to produce the final return values k1 and k2:

| k1.IV | k1.CipherKey | k1.MacKey | k2.IV | k2.CipherKey | k2.MacKey |

The size of each field is determined by the cipher key and IV sizes detailed above.
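
The stretching loop itself is compact. Here is a sketch for SHA256 (stretchKeys is a hypothetical name; splitting the output into k1 and k2 is omitted):

import (
  "crypto/hmac"
  "crypto/sha256"
)

// stretchKeys runs the "key expansion" HMAC loop described above until
// outputSize bytes have been produced.
func stretchKeys(secret []byte, outputSize int) []byte {
  seed := []byte("key expansion")
  mac := hmac.New(sha256.New, secret)

  mac.Write(seed)
  a := mac.Sum(nil) // initial digest a = HMAC(seed)

  var out []byte
  for len(out) < outputSize {
    mac.Reset()
    mac.Write(a)
    mac.Write(seed)
    out = append(out, mac.Sum(nil)...) // b = HMAC(a || seed)

    mac.Reset()
    mac.Write(a)
    a = mac.Sum(nil) // a = HMAC(a) for the next iteration
  }
  return out[:outputSize] // truncate to exactly outputSize
}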

Creating the Cipher and HMAC signer

With k1 and k2 computed, swap the two values if the remote peer is the preferred peer. After swapping if necessary, k1 becomes the local peer's key and k2 the remote peer's key.

Each peer now generates an HMAC signer using the agreed upon algorithm and the MacKey produced by the key stretcher.

Each peer will also initialize the agreed-upon block cipher using the generated CipherKey, and will then initialize a CTR stream cipher from the block cipher using the generated initialization vector IV.

Initiate Secure Channel

With the cipher and HMAC signer created, the secure channel is ready to be opened.

Secure Message Framing

To communicate over the channel, peers send packets containing an encrypted body and an HMAC signature of the encrypted body.

The encrypted body is produced by applying the stream cipher initialized previously to an arbitrary plaintext message payload. The encrypted data is then fed into the HMAC signer to produce the HMAC signature.

Once the encrypted body and HMAC signature are known, they are concatenated together, and their combined length is prefixed to the resulting payload.

Each packet is of the form:

[uint32 length of packet | encrypted body | hmac signature of encrypted body]

The packet length is in bytes, and it is encoded as an unsigned 32-bit integer in network (big endian) byte order.
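
A sketch of sealing one packet, given the CTR stream cipher and HMAC signer created earlier (sealPacket is a hypothetical name):

import (
  "crypto/cipher"
  "encoding/binary"
  "hash"
)

// sealPacket encrypts the plaintext with the CTR stream cipher, appends
// the HMAC of the ciphertext, and prefixes the big-endian uint32 length.
func sealPacket(stream cipher.Stream, mac hash.Hash, plaintext []byte) []byte {
  body := make([]byte, len(plaintext))
  stream.XORKeyStream(body, plaintext)

  mac.Reset()
  mac.Write(body)
  sig := mac.Sum(nil)

  packet := make([]byte, 4, 4+len(body)+len(sig))
  binary.BigEndian.PutUint32(packet, uint32(len(body)+len(sig)))
  return append(append(packet, body...), sig...)
}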

Initial Packet Verification

The first packet transmitted by each peer must be the remote peer's nonce.

Each peer will decrypt the message body and validate the HMAC signature, comparing the decrypted output to the nonce received in the initial Propose message. If either peer is unable to validate the initial packet against the known nonce, they must abort the connection.

If both peers successfully validate the initial packet, the secure channel has been opened and is ready for use, using the framing rules described above.

libp2p TLS Handshake

| Lifecycle Stage | Maturity | Status | Latest Revision |
|---|---|---|---|
| 3A | Recommendation | Active | r0, 2019-03-23 |

Authors: @marten-seemann

Interest Group: @Stebalien, @jacobheun, @raulk, @Kubuxu, @yusefnapora

See the lifecycle document for context about the maturity level and spec status.

Introduction

This document describes how TLS 1.3 is used to secure libp2p connections. Endpoints authenticate to their peers by encoding their public key into an X.509 certificate extension. The protocol described here allows peers to use arbitrary key types, not constrained to those for which signing of X.509 certificates is specified.

Handshake Protocol

The libp2p handshake uses TLS 1.3 (and higher). Endpoints MUST NOT negotiate lower TLS versions.

During the handshake, peers authenticate each other’s identity as described in Peer Authentication. Endpoints MUST verify the peer's identity. Specifically, this means that servers MUST require client authentication during the TLS handshake, and MUST abort a connection attempt if the client fails to provide the requested authentication information.

When negotiating the usage of this handshake dynamically, via a protocol agreement mechanism like multistream-select 1.0, it MUST be identified with the following protocol ID:

/tls/1.0.0

Peer Authentication

In order to be able to use arbitrary key types, peers don’t use their host key to sign the X.509 certificate they send during the handshake. Instead, the host key is encoded into the libp2p Public Key Extension, which is carried in a self-signed certificate.

The key used to generate and sign this certificate SHOULD NOT be related to the host's key. Endpoints MAY generate a new key and certificate for every connection attempt, or they MAY reuse the same key and certificate for multiple connections.

Endpoints MUST choose a key that will allow the peer to verify the certificate (i.e. choose a signature algorithm that the peer supports), and SHOULD use a key type that (a) allows for efficient signature computation, and (b) reduces the combined size of the certificate and the signature. In particular, RSA SHOULD NOT be used unless no elliptic curve algorithms are supported.

Endpoints MUST NOT send a certificate chain that contains more than one certificate. The certificate MUST have NotBefore and NotAfter fields set such that the certificate is valid at the time it is received by the peer. When receiving the certificate chain, an endpoint MUST check these conditions and abort the connection attempt if (a) the presented certificate is not yet valid, OR (b) if it is expired. Endpoints MUST abort the connection attempt if more than one certificate is received, or if the certificate’s self-signature is not valid.

The certificate MUST contain the libp2p Public Key Extension. If this extension is missing, endpoints MUST abort the connection attempt. This extension MAY be marked critical. The certificate MAY contain other extensions. Implementations MUST ignore non-critical extensions with unknown OIDs. Endpoints MUST abort the connection attempt if the certificate contains critical extensions that the endpoint does not understand.

Certificates MUST omit the deprecated subjectUniqueId and issuerUniqueId fields. Endpoints MAY abort the connection attempt if either is present.

Note for clients: Since clients complete the TLS handshake immediately after sending the certificate (and the TLS ClientFinished message), the handshake will appear as having succeeded before the server had the chance to verify the certificate. In this state, the client can already send application data. If certificate verification fails on the server side, the server will close the connection without processing any data that the client sent.

libp2p Public Key Extension

In order to prove ownership of its host key, an endpoint sends two values:

  • the public host key
  • a signature performed using the private host key

The public host key allows the peer to calculate the peer ID of the peer it is connecting to. Clients MUST verify that the peer ID derived from the certificate matches the peer ID they intended to connect to, and MUST abort the connection if there is a mismatch.

The peer signs the concatenation of the string libp2p-tls-handshake: and the encoded public key that is used to generate the certificate carrying the libp2p Public Key Extension, using its private host key. The public key is encoded as a SubjectPublicKeyInfo structure as described in RFC 5280, Section 4.1:

SubjectPublicKeyInfo ::= SEQUENCE {
  algorithm             AlgorithmIdentifier,
  subject_public_key    BIT STRING
}
AlgorithmIdentifier  ::= SEQUENCE {
  algorithm             OBJECT IDENTIFIER,
  parameters            ANY DEFINED BY algorithm OPTIONAL
}

This signature provides cryptographic proof that the peer was in possession of the private host key at the time the certificate was signed. Peers MUST verify the signature, and abort the connection attempt if signature verification fails.
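
As a non-normative sketch, assuming an Ed25519 host key and Go's standard library (the signCertKey name is illustrative), producing this signature could look like:

package libp2ptls

import (
	"crypto"
	"crypto/ed25519"
	"crypto/x509"
)

const signPrefix = "libp2p-tls-handshake:"

// signCertKey signs the certificate's public key (encoded as a
// SubjectPublicKeyInfo) with the private host key, prefixed by the
// handshake string.
func signCertKey(hostKey ed25519.PrivateKey, certPub crypto.PublicKey) ([]byte, error) {
	spki, err := x509.MarshalPKIXPublicKey(certPub) // RFC 5280 SubjectPublicKeyInfo, DER-encoded
	if err != nil {
		return nil, err
	}
	msg := append([]byte(signPrefix), spki...)
	return ed25519.Sign(hostKey, msg), nil
}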

The public host key and the signature are ASN.1-encoded into the SignedKey data structure, which is carried in the libp2p Public Key Extension. The libp2p Public Key Extension is an X.509 extension with the Object Identifier 1.3.6.1.4.1.53594.1.1, allocated by IANA to the libp2p project at Protocol Labs.

SignedKey ::= SEQUENCE {
  publicKey OCTET STRING,
  signature OCTET STRING
}

The publicKey field of SignedKey contains the public host key of the endpoint, encoded using the following protobuf:

syntax = "proto2";

enum KeyType {
	RSA = 0;
	Ed25519 = 1;
	Secp256k1 = 2;
	ECDSA = 3;
}

message PublicKey {
	required KeyType Type = 1;
	required bytes Data = 2;
}

How the public key is encoded into the Data bytes depends on the Key Type.

  • Ed25519: Only the 32 bytes of the public key
  • Secp256k1: Only the compressed form of the public key. 33 bytes.
  • All other key types are encoded as a SubjectPublicKeyInfo structure in PKIX, ASN.1 DER form.

ALPN

"libp2p" is used as the application protocol for ALPN.

The server MUST abort the handshake if it doesn't support any of the application protocols offered by the client.
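
As a non-normative sketch, a Go implementation might configure its TLS stack roughly as follows (certificate generation and the verification of the libp2p Public Key Extension are omitted):

package libp2ptls

import "crypto/tls"

// baseTLSConfig sketches the TLS settings this spec requires: TLS 1.3 only,
// "libp2p" as the ALPN protocol, and mandatory client authentication.
func baseTLSConfig() *tls.Config {
	return &tls.Config{
		MinVersion: tls.VersionTLS13,         // endpoints MUST NOT negotiate lower versions
		NextProtos: []string{"libp2p"},       // ALPN
		ClientAuth: tls.RequireAnyClientCert, // servers MUST require client authentication
		// The certificate is self-signed, so standard chain verification is
		// disabled and replaced by a VerifyPeerCertificate callback that
		// checks the libp2p Public Key Extension (not shown here).
		InsecureSkipVerify: true,
	}
}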

Inlined Muxer Negotiation

See Multiplexer Negotiation over TLS.

Test vectors

The following items present test vectors that a compatible implementation should pass. Due to the randomness required when signing certificates, it is hard to provide test cases for generating certificates. These test cases instead verify that implementations can correctly parse certificates with all key types. Implementations are encouraged to also perform roundtrip tests on their own certificate generation.

All certificates in these test cases are hex-encoded.

1. Valid certificate authenticating an ED25519 Peer ID

Certificate:

308201ae30820156a0030201020204499602d2300a06082a8648ce3d040302302031123010060355040a13096c69627032702e696f310a300806035504051301313020170d3735303130313133303030305a180f34303936303130313133303030305a302031123010060355040a13096c69627032702e696f310a300806035504051301313059301306072a8648ce3d020106082a8648ce3d030107034200040c901d423c831ca85e27c73c263ba132721bb9d7a84c4f0380b2a6756fd601331c8870234dec878504c174144fa4b14b66a651691606d8173e55bd37e381569ea37c307a3078060a2b0601040183a25a0101046a3068042408011220a77f1d92fedb59dddaea5a1c4abd1ac2fbde7d7b879ed364501809923d7c11b90440d90d2769db992d5e6195dbb08e706b6651e024fda6cfb8846694a435519941cac215a8207792e42849cccc6cd8136c6e4bde92a58c5e08cfd4206eb5fe0bf909300a06082a8648ce3d0403020346003043021f50f6b6c52711a881778718238f650c9fb48943ae6ee6d28427dc6071ae55e702203625f116a7a454db9c56986c82a25682f7248ea1cb764d322ea983ed36a31b77

PeerId: 12D3KooWM6CgA9iBFZmcYAHA6A2qvbAxqfkmrYiRQuz3XEsk4Ksv

2. Valid certificate authenticating an ECDSA Peer ID

Certificate:

308201f63082019da0030201020204499602d2300a06082a8648ce3d040302302031123010060355040a13096c69627032702e696f310a300806035504051301313020170d3735303130313133303030305a180f34303936303130313133303030305a302031123010060355040a13096c69627032702e696f310a300806035504051301313059301306072a8648ce3d020106082a8648ce3d030107034200040c901d423c831ca85e27c73c263ba132721bb9d7a84c4f0380b2a6756fd601331c8870234dec878504c174144fa4b14b66a651691606d8173e55bd37e381569ea381c23081bf3081bc060a2b0601040183a25a01010481ad3081aa045f0803125b3059301306072a8648ce3d020106082a8648ce3d03010703420004bf30511f909414ebdd3242178fd290f093a551cf75c973155de0bb5a96fedf6cb5d52da7563e794b512f66e60c7f55ba8a3acf3dd72a801980d205e8a1ad29f2044730450220064ea8124774caf8f50e57f436aa62350ce652418c019df5d98a3ac666c9386a022100aa59d704a931b5f72fb9222cb6cc51f954d04a4e2e5450f8805fe8918f71eaae300a06082a8648ce3d04030203470030440220799395b0b6c1e940a7e4484705f610ab51ed376f19ff9d7c16757cfbf61b8d4302206205c03fbb0f95205c779be86581d3e31c01871ad5d1f3435bcf375cb0e5088a

PeerId: QmfXbAwNjJLXfesgztEHe8HwgVDCMMpZ9Eax1HYq6hn9uE

3. Valid certificate authenticating a secp256k1 Peer ID

Certificate:

308201ba3082015fa0030201020204499602d2300a06082a8648ce3d040302302031123010060355040a13096c69627032702e696f310a300806035504051301313020170d3735303130313133303030305a180f34303936303130313133303030305a302031123010060355040a13096c69627032702e696f310a300806035504051301313059301306072a8648ce3d020106082a8648ce3d030107034200040c901d423c831ca85e27c73c263ba132721bb9d7a84c4f0380b2a6756fd601331c8870234dec878504c174144fa4b14b66a651691606d8173e55bd37e381569ea38184308181307f060a2b0601040183a25a01010471306f0425080212210206dc6968726765b820f050263ececf7f71e4955892776c0970542efd689d2382044630440220145e15a991961f0d08cd15425bb95ec93f6ffa03c5a385eedc34ecf464c7a8ab022026b3109b8a3f40ef833169777eb2aa337cfb6282f188de0666d1bcec2a4690dd300a06082a8648ce3d0403020349003046022100e1a217eeef9ec9204b3f774a08b70849646b6a1e6b8b27f93dc00ed58545d9fe022100b00dafa549d0f03547878338c7b15e7502888f6d45db387e5ae6b5d46899cef0

PeerId: 16Uiu2HAkutTMoTzDw1tCvSRtu6YoixJwS46S1ZFxW8hSx9fWHiPs

4. Invalid certificate

This certificate has a mismatch between the Peer ID that it claims to authenticate vs the key that was used to sign it.

Certificate:

308201f73082019da0030201020204499602d2300a06082a8648ce3d040302302031123010060355040a13096c69627032702e696f310a300806035504051301313020170d3735303130313133303030305a180f34303936303130313133303030305a302031123010060355040a13096c69627032702e696f310a300806035504051301313059301306072a8648ce3d020106082a8648ce3d030107034200040c901d423c831ca85e27c73c263ba132721bb9d7a84c4f0380b2a6756fd601331c8870234dec878504c174144fa4b14b66a651691606d8173e55bd37e381569ea381c23081bf3081bc060a2b0601040183a25a01010481ad3081aa045f0803125b3059301306072a8648ce3d020106082a8648ce3d03010703420004bf30511f909414ebdd3242178fd290f093a551cf75c973155de0bb5a96fedf6cb5d52da7563e794b512f66e60c7f55ba8a3acf3dd72a801980d205e8a1ad29f204473045022100bb6e03577b7cc7a3cd1558df0da2b117dfdcc0399bc2504ebe7de6f65cade72802206de96e2a5be9b6202adba24ee0362e490641ac45c240db71fe955f2c5cf8df6e300a06082a8648ce3d0403020348003045022100e847f267f43717358f850355bdcabbefb2cfbf8a3c043b203a14788a092fe8db022027c1d04a2d41fd6b57a7e8b3989e470325de4406e52e084e34a3fd56eef0d0df

Future Extensibility

Future versions of this handshake protocol MAY use the Server Name Indication (SNI) in the ClientHello as defined in RFC 6066, section 3 to announce their support for other versions.

In order to keep this flexibility for future versions, clients that only support the version of the handshake defined in this document MUST NOT send any value in the Server Name Indication. Servers that support only this version MUST ignore this field if present.

Design considerations for the libp2p TLS Handshake

Requirements

There are two main requirements that prevent us from using the straightforward way to run a TLS handshake (which would be to simply use the host key to create a self-signed certificate).

  1. We want to use different key types: RSA, ECDSA, Ed25519, and Secp256k1 (and maybe more in the future?).
  2. We want to be able to send the key type along with the key (see https://github.com/libp2p/specs/issues/111).

The first point is problematic in practice, because Go currently only supports RSA and ECDSA certificates. Support for Ed25519 was planned for Go 1.12 but was deferred recently, and the Go team is now evaluating interest in this in order to prioritize their work, so this might or might not happen in Go 1.13. I'm not aware of any plans for Secp256k1 at the moment. The second requirement implies that we might want to add some additional (free-form) information to the handshake, and we need to find a field to stuff that into.

The handshake protocol described here:

  • supports arbitrary key types, independent of the signature algorithms implemented by the TLS library used
  • defines how future versions of this protocol can be negotiated without requiring any out-of-band information or additional roundtrips

Design Choices

TLS 1.3 - What about older versions?

The handshake protocol requires TLS 1.3 support. This means that the handshake between two peers that have never communicated before will typically complete in just a single roundtrip. With older TLS versions, a handshake typically takes two roundtrips. By not specifying support for older TLS versions, we increase performance and simplify the protocol.

Why we're not using the host key for the certificate

The current proposal uses a self-signed certificate to carry the host's public key in the libp2p Public Key Extension. The key used to generate the self-signed certificate has no relationship with the host key. This key can be generated for every single connection, or can be generated at boot time.

One optimisation that was considered when designing the protocol was to use the libp2p host key to generate the certificate in the case of RSA and ECDSA keys (which we can assume to be supported signature schemes by all peers). That would have allowed us to strip the host key and the signature from the key extension, in order to

  1. reduce the size of the certificate and
  2. reduce the number of signature verifications the peer has to perform from 2 to 1.

The protocol does not include this optimisation, because

  1. assuming that the peer uses an ECDSA key for generating the self-signed certificate, this only saves roughly 150 bytes if the host key is an ECDSA key as well, and it even slightly increases the size of the certificate in the case of an RSA host key. Furthermore, for ECDSA keys, the size of all handshake messages combined is less than 900 bytes, so having a slightly larger certificate won't require us to send more (TCP / QUIC) packets.
  2. For a client, the number of signature verifications shouldn't pose a problem, since it controls the rate of its dials. This might only be a problem for servers, since a malicious client could force a server to waste resources on signature verification. However, this is not a particularly interesting DoS vector, since the client's certificate is sent in its second flight (after receiving the ServerHello and the server's certificate), so it requires the attacker to actually perform most of the TLS handshake, including encrypting the certificate chain with a key that's tied to that handshake.

Versioning - How we could roll out a new version of this protocol in the future

An earlier version of this document included a version negotiation mechanism. While it is a desirable property to be able to change things in the future, it also adds a lot of complexity.

To keep things simple, the current proposal does not include a version negotiation mechanism. A future version of this protocol might:

  1. Change the format in which the keys are transmitted. An X.509 extension has an ID (the Object Identifier, OID), so we can use a new OID if we want to change the way we encode information. X.509 certificates allow us to include multiple extensions, so we can even send the old and the new version during a transition period. In the handshake protocol defined here, peers are required to skip over extensions that they don't understand.
  2. For more involved changes, a new version might (ab)use the SNI field in the ClientHello to announce support for new versions. To allow for this to work, the current version requires clients to not send any value in the SNI field and servers to completely ignore this field, no matter what its contents are.

QUIC in libp2p

| Lifecycle Stage | Maturity       | Status | Latest Revision |
|-----------------|----------------|--------|-----------------|
| 3A              | Recommendation | Active | r1, 2022-12-30  |

Authors: @marten-seemann

Interest Group: @elenaf9, @MarcoPolo

See the lifecycle document for context about the maturity level and spec status.

QUIC vs. TCP

QUIC (RFC 9000) is, alongside TCP, one of the transports that allows non-browser libp2p nodes to establish connections to each other. Due to its inherently faster handshake latency (a single network roundtrip), and generally better performance characteristics, it is RECOMMENDED that libp2p implementations offer QUIC as one of their transports. However, UDP is blocked in a small fraction of networks, therefore it is RECOMMENDED that libp2p nodes also offer a TCP-based connection option as a fallback.

Multiaddress

A QUIC multiaddress encodes the IP address and UDP port. For example, these are valid QUIC multiaddresses:

  • /ip4/127.0.0.1/udp/1234/quic-v1: A QUIC listener running on localhost on port 1234.
  • /ip6/2001:db8:3333:4444:5555:6666:7777:8888/udp/443/quic-v1: A QUIC listener running on 2001:db8:3333:4444:5555:6666:7777:8888 on port 443.
  • /ip4/12.34.56.78/udp/4321/quic: A QUIC listener, supporting QUIC draft-29 (see below)

QUIC Versions

When IPFS first rolled out QUIC support, RFC 9000 was not finished yet. Back then, QUIC was rolled out based on IETF QUIC working group draft-29. Nodes supporting draft-29 use the /quic multiaddress component (instead of /quic-v1) to signal support for the draft version. Nodes supporting RFC 9000 use the /quic-v1 multiaddress component.

New implementations SHOULD implement support for RFC 9000. Support for draft-29 is currently being phased out of production networks, and will be deprecated at some point in the future.

ALPN

"libp2p" is used as the application protocol for ALPN. Note that QUIC enforces the use of ALPN, so the handshake will fail if both peers can't agree on the application protocol.

Peer Authentication

Peers authenticate each other using the TLS handshake logic described in the libp2p TLS spec.

WebRTC

| Lifecycle Stage | Maturity                 | Status | Latest Revision |
|-----------------|--------------------------|--------|-----------------|
| 2A              | Candidate Recommendation | Active | r1, 2023-04-12  |

Authors: @mxinden

Interest Group: @marten-seemann

WebRTC flavors in libp2p:

  1. WebRTC

    libp2p transport protocol enabling two private nodes (e.g. two browsers) to establish a direct connection.

  2. WebRTC Direct

    libp2p transport protocol without the need for trusted TLS certificates. Enable browsers to connect to public server nodes without those server nodes providing a TLS certificate within the browser's trustchain. Note that we can not do this today with our Websocket transport as the browser requires the remote to have a trusted TLS certificate. Nor can we establish a plain TCP or QUIC connection from within a browser. We can establish a WebTransport connection from the browser (see WebTransport specification).

Shared concepts

Multiplexing

The WebRTC browser APIs do not support half-closing of streams nor resets of the sending part of streams. RTCDataChannel.close() flushes the remaining messages and closes the local write and read side. After calling RTCDataChannel.close() one can no longer read from nor write to the channel. This lack of functionality is problematic, given that libp2p protocols running on top of transport protocols, like WebRTC, expect to be able to half-close or reset a stream. See Connection Establishment in libp2p.

To support half-closing and resets of streams, libp2p WebRTC uses message framing. Messages on a RTCDataChannel are embedded into the Protobuf message below and sent on the RTCDataChannel prefixed with the message length in bytes, encoded as an unsigned variable length integer as defined by the multiformats unsigned-varint spec.

It is an adaptation from the QUIC RFC. When in doubt on the semantics of these messages, consult the QUIC RFC.

syntax = "proto2";

package webrtc.pb;

message Message {
  enum Flag {
    // The sender will no longer send messages on the stream.
    FIN = 0;
    // The sender will no longer read messages on the stream. Incoming data is
    // being discarded on receipt.
    STOP_SENDING = 1;
    // The sender abruptly terminates the sending part of the stream. The
    // receiver MAY discard any data that it already received on that stream.
    RESET_STREAM = 2;
    // Sending the FIN_ACK flag acknowledges the previous receipt of a message
    // with the FIN flag set. Receiving a FIN_ACK flag gives the recipient
    // confidence that the remote has received all sent messages.
    FIN_ACK = 3;
  }

  optional Flag flag = 1;

  optional bytes message = 2;
}

Note that in contrast to QUIC (see QUIC RFC - 3.5 Solicited State Transitions) a libp2p WebRTC endpoint receiving a STOP_SENDING frame SHOULD NOT send a RESET_STREAM frame in reply. In QUIC, the RESET_STREAM frame is needed for accurate accounting of the number of bytes sent for connection-level flow control. The libp2p WebRTC message framing is not concerned with flow control and thus does not need a RESET_STREAM frame to be sent in reply to a STOP_SENDING frame.

Encoded messages including their length prefix MUST NOT exceed 16kiB to support all major browsers. See "Understanding message size limits". Implementations MAY choose to send smaller messages, e.g. to reduce delays sending flagged messages.
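
A minimal sketch of the framing in Go, assuming encoded holds a serialized Message protobuf and the data channel exposes a Send method (as in Pion's API):

package webrtcmsg

import (
	"encoding/binary"
	"errors"
)

const maxFrameSize = 16384 // 16kiB, including the varint length prefix

// sendFramed length-prefixes the encoded Message protobuf with an
// unsigned varint and sends it on the data channel.
func sendFramed(dc interface{ Send([]byte) error }, encoded []byte) error {
	frame := binary.AppendUvarint(nil, uint64(len(encoded)))
	frame = append(frame, encoded...)
	if len(frame) > maxFrameSize {
		return errors.New("frame exceeds 16kiB limit")
	}
	return dc.Send(frame)
}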

Ordering

Implementations MAY expose an unordered byte stream abstraction to the user by overriding the default value of ordered (true) to false when creating a new data channel via RTCPeerConnection.createDataChannel.
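
As a sketch using Pion's Go API (which mirrors the W3C API), where pc is an existing *webrtc.PeerConnection:

package transport

import "github.com/pion/webrtc/v3"

// newUnorderedChannel creates a data channel with ordering disabled,
// overriding the default ordered=true. The label is empty per this spec.
func newUnorderedChannel(pc *webrtc.PeerConnection) (*webrtc.DataChannel, error) {
	ordered := false
	return pc.CreateDataChannel("", &webrtc.DataChannelInit{Ordered: &ordered})
}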

Head-of-line blocking

WebRTC datachannels and the underlying SCTP are message-oriented rather than stream-oriented (e.g. see RTCDataChannel.send() and RTCDataChannel.onmessage()). libp2p streams, on the other hand, are byte-oriented. Thus we run the risk of head-of-line blocking.

Given that the browser does not give us access to the MTU on a given connection, we can not make an informed decision on the optimal message size.

We follow the recommendation of QUIC, requiring "a minimum IP packet size of at least 1280 bytes". We assume an IPv4 minimum header size of 20 bytes, an IPv6 header size of 40 bytes, a UDP header size of 8 bytes, an SCTP packet common header of 12 bytes, and an SCTP data chunk header of 16 bytes.

  • IPv4: 1280 bytes - 20 bytes - 8 bytes - 12 bytes - 16 bytes = 1224 bytes
  • IPv6: 1280 bytes - 40 bytes - 8 bytes - 12 bytes - 16 bytes = 1204 bytes

Thus for payloads that would suffer from head-of-line blocking, implementations SHOULD choose a message size equal to or below 1204 bytes. Or, in case the implementation can differentiate by IP version, equal to or below 1224 bytes on IPv4 and 1204 bytes on IPv6.

Long term we hope to be able to give better recommendations based on real-world experiments.

RTCDataChannel negotiation

RTCDataChannels are negotiated in-band by the WebRTC user agent (e.g. Firefox, Pion, ...). In other words libp2p WebRTC implementations MUST NOT change the default value negotiated: false when creating a standard libp2p stream of type RTCDataChannel via RTCPeerConnection.createDataChannel. Setting negotiated: true is reserved only for creating Noise handshake channels under certain protocol conditions.

The WebRTC user agent (i.e. not the application) decides on the RTCDataChannel ID based on the local node's connection role. For the interested reader, see the RFC 8832 Protocol Overview. It is RECOMMENDED that user agents reuse IDs once their RTCDataChannel closes. IDs MAY be reused according to RFC 8831: "Streams are available for reuse after a reset has been performed", see RFC 8831, 6.7 Closing a Data Channel. Up to 65535 (2^16) concurrent data channels can be opened at any given time.

According to RFC 8832 a RTCDataChannel initiator "MAY start sending messages containing user data without waiting for the reception of the corresponding DATA_CHANNEL_ACK message", thus using negotiated: false does not imply an additional round trip for each new RTCDataChannel.

RTCDataChannel label

RTCPeerConnection.createDataChannel() requires passing a label for the to-be-created RTCDataChannel. When calling createDataChannel implementations MUST pass an empty string. When receiving an RTCDataChannel via RTCPeerConnection.ondatachannel implementations MUST NOT require label to be an empty string. This allows future versions of this specification to make use of the RTCDataChannel label property.

Closing an RTCDataChannel

Some WebRTC implementations do not guarantee that any queued messages will be sent after a datachannel is closed. Other implementations maintain separate outgoing message and transport queues, the status of which may not be visible to the user. Consequently we must add an additional layer of signaling to ensure reliable data delivery.

When a node wishes to close a stream for writing, it MUST send a message with the FIN flag set.

If a FIN flag is received the node SHOULD respond with a FIN_ACK.

A node SHOULD only consider its write-half closed once it has received a FIN_ACK.

When a FIN_ACK and a FIN have been received, the node may close the datachannel.

The node MAY close the datachannel without receiving a FIN_ACK, for example in the case of a timeout, but there will be no guarantee that all previously sent messages have been received by the remote.

If a node has previously sent a STOP_SENDING flag to the remote node, it MUST continue to act on any flags present in received messages in order to successfully process an incoming FIN_ACK.

Example of closing an RTCDataChannel

NodeA closes for writing, NodeB delays allowing the channel to close until it also finishes writing.

sequenceDiagram
    A->>B: DATA
    A->>B: FIN
    B->>A: FIN_ACK
    B->>A: DATA
    B->>A: FIN
    A->>B: FIN_ACK

After A has received the FIN it is free to close the datachannel since it has previously received a FIN_ACK. If B receives the FIN_ACK before this it may close the channel since it previously received a FIN.

This way the channel can be closed from either end without data loss.

FAQ

  • Why use Protobuf for WebRTC message framing? Why not use our own, potentially smaller encoding schema?

    The Protobuf framing adds an overhead of 5 bytes. The unsigned-varint prefix adds another 2 bytes. On a large message the overhead is negligible ((5 bytes + 2 bytes) / (16384 bytes - 7 bytes) = 0.000427246). On a small message, e.g. a multistream-select message with ~40 bytes the overhead is high ((5 bytes + 2 bytes) / 40 bytes = 0.175) but likely irrelevant.

    Using Protobuf allows us to evolve the protocol in a backwards-compatible way going forward. Using Protobuf is consistent with many other libp2p protocols. These benefits outweigh the drawback of the additional overhead.

  • Why not use central TURN servers? Why rely on libp2p's Circuit Relay v2 instead?

    As a peer-to-peer networking library, libp2p should rely as little as possible on central infrastructure.

WebRTC

| Lifecycle Stage | Maturity                 | Status | Latest Revision |
|-----------------|--------------------------|--------|-----------------|
| 2A              | Candidate Recommendation | Active | r0, 2023-04-12  |

Authors: @mxinden

Motivation

libp2p transport protocol enabling two private nodes (e.g. two browsers) to establish a direct connection.

Browser A wants to connect to Browser node B with the help of server node R. Both A and B cannot listen for incoming connections due to running in a constrained environment (i.e. a browser) with its only transport capability being the W3C WebRTC RTCPeerConnection API and being behind a NAT and/or firewall. Note that A and/or B may as well be non-browser nodes behind NATs and/or firewalls. However, for two non-browser nodes using TCP or QUIC hole punching with DCUtR will be the more efficient way to establish a direct connection.

On a historical note, this specification replaces the existing libp2p WebRTC star and libp2p WebRTC direct protocols.

Connection Establishment

  1. B advertises support for the WebRTC browser-to-browser protocol by appending /webrtc to its relayed multiaddr, meaning it takes the form of <relayed-multiaddr>/webrtc/p2p/<b-peer-id>.

  2. Upon discovery of B's multiaddress, A learns that B supports the WebRTC transport and knows how to establish a relayed connection to B to run the /webrtc-signaling/0.0.1 protocol on top.

  3. A establishes a relayed connection to B. Note that further steps depend on the relayed connection to be authenticated, i.e. that data sent on the relayed connection can be trusted.

  4. A (outbound side of relayed connection) creates an RTCPeerConnection provided by a W3C compliant WebRTC implementation (e.g. a browser). A creates a datachannel via RTCPeerConnection.createDataChannel with the label init. This channel is required to ensure that ICE information is shared in the SDP offer. See the STUN section on what STUN servers to configure at creation time. A creates an SDP offer via RTCPeerConnection.createOffer(). A initiates the signaling protocol to B via the relayed connection from (3) (see Signaling Protocol) and sends the offer to B. Note that A being the initiator of the stream is merely a convention preventing both nodes from simultaneously initiating a new connection, which could potentially result in two WebRTC connections. A MUST as well be able to handle an incoming signaling protocol stream to support the case where B initiates the signaling process.

  5. On reception of the incoming stream, B (inbound side of relayed connection) creates an RTCPeerConnection. Again, see the STUN section on what STUN servers to configure at creation time. B receives A's offer sent in (4) via the signaling protocol stream and provides the offer to its RTCPeerConnection via RTCPeerConnection.setRemoteDescription. B then creates an answer via RTCPeerConnection.createAnswer and sends it to A via the existing signaling protocol stream (see Signaling Protocol).

  6. A receives B's answer via the signaling protocol stream and sets it locally via RTCPeerConnection.setRemoteDescription.

  7. A and B send their local ICE candidates via the existing signaling protocol stream to enable trickle ICE. Both nodes continuously read from the stream, adding incoming remote candidates via RTCPeerConnection.addIceCandidate().

  8. On successful establishment of the direct connection, A closes the init data channel created in step 4, B and A close the signaling protocol stream. On failure B and A reset the signaling protocol stream.

    Behavior for transferring data on a relayed connection, in the case where the direct connection failed, is out of scope for this specification and dependent on the application.

  9. Messages on RTCDataChannels on the established RTCPeerConnection are framed using the message framing mechanism described in multiplexing.

Diagram

sequenceDiagram
    participant a as Browser A
    participant cr as CircuitRelayV2Peer
    participant b as Browser B
    participant stun as STUN Server
    b->>cr: Establish Relayed Connection (WebTransport, WebRTC)
    b-->>a: Shares its own relayed webrtc multiaddress (out of band)
    a->>b: Establishes a relayed connection to Browser B
    a-->>a: Creates RTCPeerConnection with STUN server config, init DataChannel and SDP offer
    a->>b: Initiates libp2p /webrtc-signaling/0.0.1 protocol stream over relayed connection and sends SDP offer
    b-->>b: Creates RTCPeerConnection with STUN server config, sets Browser A's SDP offer, and creates SDP answer
    b->>a: Sends SDP answer over signaling stream
    a-->>a: Set SDP answer with RTCPeerConnection.setRemoteDescription
    a->>+stun: What's my public IP and port
    stun->>-a: Browser A observed ip and port: 8.8.8.1:63333
    b->>+stun: What's my public IP and port
    stun->>-b: Browser B observed ip and port: 6.6.6.1:52222
    b->a: Exchange ICE candidates over signaling stream and pass them to RTCPeerConnection.addIceCandidate()
    b->a: Establish direct connection

STUN

A node needs to discover its public IP and port, which is forwarded to the remote node in order to connect to the local node. On non-browser libp2p nodes doing a hole punch with TCP or QUIC, the libp2p node discovers its public address via the identify protocol. One cannot use the identify protocol on browser nodes to discover one's public IP and port, given that the browser uses a new port for each connection. For example, say that the local browser node establishes a WebRTC connection C1 via browser-to-server to a server node and runs the identify protocol. The returned observed public port P1 will most likely (depending on the NAT) be a different port than the port observed on another connection C2. The only browser-supported mechanism to discover one's public IP and port for a given WebRTC connection is the non-libp2p protocol STUN. This is why this specification depends on STUN, and thus on the availability of one or more STUN servers for A and B to discover their public addresses.

Implementations MAY use one of the publicly available STUN servers, or deploy a dedicated server for a given libp2p network. Further specification of the usage of STUN is out of scope for this specification.

It is not necessary for A and B to use the same STUN server when establishing a WebRTC connection.

Signaling Protocol

The protocol id is /webrtc-signaling/0.0.1. Messages are sent prefixed with the message length in bytes, encoded as an unsigned variable length integer as defined by the multiformats unsigned-varint spec.

syntax = "proto3";

message Message {
    // Specifies type in `data` field.
    enum Type {
        // String of `RTCSessionDescription.sdp`
        SDP_OFFER = 0;
        // String of `RTCSessionDescription.sdp`
        SDP_ANSWER = 1;
        // String of `RTCIceCandidate.toJSON()`
        ICE_CANDIDATE = 2;
    }

    optional Type type = 1;
    optional string data = 2;
}

FAQ

  • Why is there no additional Noise handshake needed?

    This specification (browser-to-browser) requires A and B to exchange their SDP offer and answer over an authenticated channel. Offer and answer contain the TLS certificate fingerprint. The browser validates the TLS certificate fingerprint through the DTLS handshake during the WebRTC connection establishment.

    In contrast, the browser-to-server specification allows exchange of the server's multiaddr, containing the server's TLS certificate fingerprint, over unauthenticated channels. In other words, the browser-to-server specification does not consider the TLS certificate fingerprint in the server's multiaddr to be trusted.

  • Why use a custom signaling protocol? Why not use DCUtR?

    DCUtR offers time synchronization through a two-step protocol (first Connect, then Sync). This is not needed for WebRTC.

    DCUtR does not provide a mechanism to trickle local address candidates to the remote as they are discovered. Trickling candidates just-in-time allows for faster WebRTC connection establishment.

  • Why does A and not B initiate the signaling protocol?

    In DCUtR B (inbound side of the relayed connection) initiates the DCUtR protocol by opening the DCUtR protocol stream. The reason is that in case A is publicly reachable, B might be able to use connection reversal to connect to A directly. This reason does not apply to the WebRTC browser-to-browser protocol. Given that A and B at this point already have a relayed connection established, they might as well use it to exchange SDP, instead of using connection reversal and WebRTC browser-to-server. Thus, for the WebRTC browser-to-browser protocol, A initiates the signaling protocol by opening the signaling protocol stream.

WebRTC Direct

| Lifecycle Stage | Maturity                 | Status | Latest Revision |
|-----------------|--------------------------|--------|-----------------|
| 2A              | Candidate Recommendation | Active | r1, 2023-04-12  |

Authors: @mxinden

Interest Group: @marten-seemann

Motivation

No need for trusted TLS certificates. Enable browsers to connect to public server nodes without those server nodes providing a TLS certificate within the browser's trustchain. Note that we can not do this today with our Websocket transport as the browser requires the remote to have a trusted TLS certificate. Nor can we establish a plain TCP or QUIC connection from within a browser. We can establish a WebTransport connection from the browser (see WebTransport specification).

Addressing

WebRTC Direct multiaddresses are composed of an IP component and a UDP port component, followed by /webrtc-direct and a multihash of the certificate that the node uses.

Examples:

  • /ip4/1.2.3.4/udp/1234/webrtc-direct/certhash/<hash>/p2p/<peer-id>
  • /ip6/fe80::1ff:fe23:4567:890a/udp/1234/webrtc-direct/certhash/<hash>/p2p/<peer-id>

The TLS certificate fingerprint in /certhash is a multibase encoded multihash.

For compatibility, implementations MUST support the hash algorithm sha-256 and the base encoding base64url. Implementations MAY support other hash algorithms and base encodings, but they may not be able to connect to all other nodes.
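
A sketch of computing the /certhash component in Go, assuming the go-multihash and go-multibase libraries:

package webrtcdirect

import (
	"github.com/multiformats/go-multibase"
	"github.com/multiformats/go-multihash"
)

// certHash returns the multibase-encoded sha-256 multihash of a
// DER-encoded certificate, as used in the /certhash component.
func certHash(certDER []byte) (string, error) {
	digest, err := multihash.Sum(certDER, multihash.SHA2_256, -1)
	if err != nil {
		return "", err
	}
	return multibase.Encode(multibase.Base64url, digest)
}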

Connection Establishment

Browser to public Server

Scenario: Browser A wants to connect to server node B where B is publicly reachable but B does not have a TLS certificate trusted by A.

  1. Server node B generates a TLS certificate, listens on a UDP port and advertises the corresponding multiaddress (see Addressing) through some external mechanism.

    Given that B is publicly reachable, B acts as an ICE Lite agent. It binds to a UDP port, waiting for incoming STUN and SCTP packets, and multiplexes based on source IP and source port.

  2. Browser A discovers server node B's multiaddr, containing B's IP, UDP port, TLS certificate fingerprint and optionally libp2p peer ID (e.g. /ip6/2001:db8::/udp/1234/webrtc-direct/certhash/<hash>/p2p/<peer-id>), through some external mechanism.

  3. A instantiates an RTCPeerConnection. See RTCPeerConnection().

    A (i.e. the browser) SHOULD NOT reuse the same certificate across RTCPeerConnections. Reusing the certificate can be used to identify A across connections by on-path observers given that WebRTC uses TLS 1.2.

  4. A constructs B's SDP answer locally based on B's multiaddr.

    A generates a random string prefixed with "libp2p+webrtc+v1/". The prefix allows us to use the ufrag as an upgrade mechanism to roll out a new version of the libp2p WebRTC protocol on a live network. While a hack, this might be very useful in the future. A sets the string as the username (ufrag or username fragment) and password on the SDP of the remote's answer.

    A MUST set the a=max-message-size:16384 SDP attribute. See Multiplexing for the rationale.

    Finally A sets the remote answer via RTCPeerConnection.setRemoteDescription().

  5. A creates a local offer via RTCPeerConnection.createOffer(). A sets the same username and password on the local offer as done in (4) on the remote answer.

    A MUST set the a=max-message-size:16384 SDP attribute. See Multiplexing for the rationale.

    Finally A sets the modified offer via RTCPeerConnection.setLocalDescription().

    Note that this process, oftentimes referred to as "SDP munging", is disallowed by the specification but not enforced across the major browsers (Safari, Firefox, Chrome) due to use-cases in the wild. See also https://bugs.chromium.org/p/chromium/issues/detail?id=823036

  6. Once A sets the SDP offer and answer, it will start sending STUN requests to B. B reads the ufrag from the incoming STUN request's username field. B then infers A's SDP offer using the IP, port, and ufrag of the request as follows:

    1. B sets the ice-ufrag and ice-pwd equal to the value read from the username field.

    2. B sets an arbitrary sha-256 digest as the remote fingerprint as it does not verify fingerprints at this point.

    3. B sets the connection field (c) to the IP and port of the incoming request c=IN <ip> <port>.

    4. B sets the a=max-message-size:16384 SDP attribute. See Multiplexing for the rationale.

    B sets this offer as the remote description. B generates an answer and sets it as the local description.

    The ufrag in combination with the IP and port of A can be used by B to identify the connection, i.e. demultiplex incoming UDP datagrams per incoming connection.

    Note that this step requires B to allocate memory for each incoming STUN message from A. This could be leveraged for a DoS attack where A sends many STUN messages with different ufrags using different UDP source ports, forcing B to allocate a new peer connection for each. B SHOULD have a rate limiting mechanism in place as a defense measure. See also https://datatracker.ietf.org/doc/html/rfc5389#section-16.1.2.

  7. A and B execute the DTLS handshake as part of the standard WebRTC connection establishment.

    At this point B does not know the TLS certificate fingerprint of A. Thus B can not verify A's TLS certificate fingerprint during the DTLS handshake. Instead B needs to disable certificate fingerprint verification (see e.g. Pion's disableCertificateFingerprintVerification option).

    On success of the DTLS handshake the connection provides confidentiality and integrity but not authenticity. The latter is guaranteed through the succeeding Noise handshake. See Connection Security section.

  8. Messages on each RTCDataChannel are framed using the message framing mechanism described in Multiplexing.

  9. The remote is authenticated via an additional Noise handshake. See Connection Security section.

WebRTC can run both on UDP and TCP. libp2p WebRTC implementations MUST support UDP and MAY support TCP.

Connection Security

Note that the below uses the message framing described in multiplexing.

While WebRTC offers confidentiality and integrity via TLS, one still needs to authenticate the remote peer by its libp2p identity.

After Connection Establishment:

  1. A and B open a WebRTC data channel with id: 0 and negotiated: true (pc.createDataChannel("", {negotiated: true, id: 0});).

  2. B starts a Noise XX handshake on the new channel. See noise-libp2p.

    A and B use the Noise Prologue mechanism. More specifically A and B set the Noise Prologue to <PREFIX><FINGERPRINT_A><FINGERPRINT_B> before starting the actual Noise handshake. <PREFIX> is the UTF-8 byte representation of the string libp2p-webrtc-noise:. <FINGERPRINT_A><FINGERPRINT_B> is the concatenation of the two TLS fingerprints of A (Noise handshake responder) and then B (Noise handshake initiator), in their multihash byte representation.

    On Chrome A can access its TLS certificate fingerprint directly via RTCCertificate#getFingerprints. Firefox does not allow A to do so. Browser compatibility can be found here. In practice, this is not an issue since the fingerprint is embedded in the local SDP string.

  3. On success of the authentication handshake, the used datachannel is closed and the plain WebRTC connection is used with its multiplexing capabilities via datachannels. See Multiplexing.

Note: WebRTC supports different hash functions to hash the TLS certificate (see https://datatracker.ietf.org/doc/html/rfc8122#section-5). The hash function used in WebRTC and the hash function used in the multiaddr /certhash component MUST be the same. On mismatch the final Noise handshake MUST fail.

A knows B's fingerprint hash algorithm through B's multiaddr. A MUST use the same hash algorithm to calculate the fingerprint of its (i.e. A's) TLS certificate. B assumes that A uses the same hash algorithm it discovered through B's multiaddr. For now implementations MUST support sha-256. Future iterations of this specification may add support for other hash algorithms.

Implementations SHOULD setup all the necessary callbacks (e.g. ondatachannel) before starting the Noise handshake. This is to avoid scenarios like one where A initiates a stream before B got a chance to set the ondatachannel callback. This would result in B ignoring all the messages coming from A targeting that stream.

Implementations MAY open streams before completion of the Noise handshake. Applications MUST take special care what application data they send, since at this point the peer is not yet authenticated. Similarly, the receiving side MAY accept streams before completion of the handshake.
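
As a non-normative sketch, the prologue construction in Go, where fpA and fpB are the multihash byte representations of A's and B's certificate fingerprints (compare the test vector below):

package webrtcdirect

// noisePrologue concatenates the fixed prefix with the two TLS certificate
// fingerprints (multihash bytes): responder A first, then initiator B.
func noisePrologue(fpA, fpB []byte) []byte {
	prologue := []byte("libp2p-webrtc-noise:")
	prologue = append(prologue, fpA...)
	return append(prologue, fpB...)
}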

Test vectors

Noise prologue

All of these test vectors represent hex-encoded bytes.

Both client and server use SHA-256

Here client is A and server is B.

client_fingerprint = "3e79af40d6059617a0d83b83a52ce73b0c1f37a72c6043ad2969e2351bdca870"
server_fingerprint = "30fc9f469c207419dfdd0aab5f27a86c973c94e40548db9375cca2e915973b99"

prologue = "6c69627032702d7765627274632d6e6f6973653a12203e79af40d6059617a0d83b83a52ce73b0c1f37a72c6043ad2969e2351bdca870122030fc9f469c207419dfdd0aab5f27a86c973c94e40548db9375cca2e915973b99"

FAQ

  • Why exchange the TLS certificate fingerprint in the multiaddr? Why not base it on the libp2p public key?

    Browsers do not allow loading a custom certificate. One can only generate a certificate via RTCPeerConnection.generateCertificate().

  • Why not embed the peer ID in the TLS certificate, thus rendering the additional "peer certificate" exchange obsolete?

    Browsers do not allow editing the properties of the TLS certificate.

  • How about distributing the multiaddr in a signed peer record, thus rendering the additional "peer certificate" exchange obsolete?

    Signed peer records are not yet rolled out across the many libp2p protocols. Making the libp2p WebRTC protocol dependent on the former is not deemed worth it at this point in time. Later versions of the libp2p WebRTC protocol might adopt this optimization.

    Note, one can roll out a new version of the libp2p WebRTC protocol through a new multiaddr protocol, e.g. /webrtc-direct-2.

  • Why exchange fingerprints in an additional authentication handshake on top of an established WebRTC connection? Why not only exchange signatures of ones TLS fingerprints signed with ones libp2p private key on the plain WebRTC connection?

    Once A and B have established a WebRTC connection, A sends signature_libp2p_a(fingerprint_a) to B and vice versa. While this has the benefit of only requiring two messages, and thus one round trip, it is prone to a key compromise and replay attack. Say that E is able to attain signature_libp2p_a(fingerprint_a) and somehow compromise A's TLS private key; E can now impersonate A without knowing A's libp2p private key.

    If one requires the signatures to contain both fingerprints, e.g. signature_libp2p_a(fingerprint_a, fingerprint_b), the above attack still works, just that E can only impersonate A when talking to B.

    Adding a cryptographic identifier of the unique connection (i.e. session) to the signature (signature_libp2p_a(fingerprint_a, fingerprint_b, connection_identifier)) would protect against this attack. To the best of our knowledge the browser does not give us access to such identifier.

  • Can a browser know upfront the UDP port it is listening on for incoming connections? Does the browser reuse the UDP port across many WebRTC connections? If that is the case one could connect to any public node, with the remote telling the local node what port it is perceived on. Thus one could use libp2p's identify and AutoNAT protocol instead of relying on STUN.

    No, a browser uses a new UDP port for each RTCPeerConnection.

  • Why not load a remote node's certificate into one's browser trust-store and then connect e.g. via WebSocket.

    This would require a mechanism to discover remote node's certificates upfront. More importantly, this does not scale with the number of connections a typical peer-to-peer application establishes.

  • Can an attacker launch an amplification attack with the STUN endpoint of the server?

    We follow the reasoning of the QUIC protocol, namely requiring:

    an endpoint MUST limit the amount of data it sends to the unvalidated address to three times the amount of data received from that address.

    https://datatracker.ietf.org/doc/html/rfc9000#section-8

    This is the case for STUN response messages, which are only slightly larger than the request messages. See also https://datatracker.ietf.org/doc/html/rfc5389#section-16.1.2.

  • Why does B start the Noise handshake and not A?

    Given that WebRTC uses DTLS 1.2, B is the one that can send data first.

HTTP

| Lifecycle Stage | Maturity      | Status | Latest Revision |
|-----------------|---------------|--------|-----------------|
| 1A              | Working Draft | Active | r0, 2023-01-23  |

Authors: @marten-seemann, @MarcoPolo

Interest Group: @lidel, @thomaseizinger

Introduction

This document defines how libp2p nodes can offer and use an HTTP transport alongside their other transports to support application protocols with HTTP semantics. This allows a wider variety of nodes to participate in the libp2p network, for example:

  • Browsers communicating with other libp2p nodes without needing a WebSocket, WebTransport, or WebRTC connection.
  • HTTP-only edge workers can run application protocols and respond to peers on the network.
  • curl from the command line can make requests to other libp2p nodes.

The HTTP transport will also allow application protocols to make use of HTTP intermediaries such as HTTP caching, and layer 7 proxying and load balancing. This is all in addition to the existing features that libp2p provides such as:

  • Connectivity – Work on top of WebRTC, WebTransport, QUIC, TCP, or an HTTP transport.
  • Hole punching – Work with peers behind NATs.
  • Peer ID Authentication – Authenticate your peer by their libp2p peer id.
  • Peer discovery – Learn about a peer given their peer id.

HTTP Semantics vs Encodings vs Transport

HTTP is a bit of an overloaded term. This section aims to clarify what we’re talking about when we say “HTTP”.

graph TB
    subgraph "HTTP Semantics"
        HTTP
    end
    subgraph "Encoding"
        HTTP1.1[HTTP/1.1]
        HTTP2[HTTP/2]
        HTTP3[HTTP/3]
    end
    subgraph "Transports"
        Libp2p[libp2p streams]
        HTTPTransport[HTTP transport]
    end
    HTTP --- HTTP1.1
    HTTP1.1 --- Libp2p
    HTTP --- HTTP2
    HTTP --- HTTP3
    HTTP1.1 --- HTTPTransport
    HTTP2 --- HTTPTransport
    HTTP3 --- HTTPTransport

  • HTTP semantics (RFC 9110) is the stateless application-level protocol that you work with when writing, for example, HTTP APIs.

  • HTTP encoding is the thing that takes your high level request/response defined in terms of HTTP semantics and encodes it into a form that can be sent over the wire.

  • HTTP transport is the thing that takes your encoded request/response and sends it over the wire. For HTTP/1.1 and HTTP/2, this is a TCP+TLS connection. For HTTP/3, this is a QUIC connection.

When this document says HTTP it is generally referring to HTTP semantics.

Interoperability with existing HTTP systems

A goal of this spec is to allow libp2p to be able to interoperate with existing HTTP servers and clients. Care is taken in this document to not introduce anything that would break interoperability with existing systems.

HTTP Transport

Nodes MUST use HTTPS (i.e., they MUST NOT use plaintext HTTP). It is RECOMMENDED to use HTTP/2 and HTTP/3.

Nodes signal support for their HTTP transport using the /http component in their multiaddr. E.g., /dns4/example.com/tls/http. See the HTTP multiaddr component spec for more details.

Namespace

libp2p does not squat the global namespace. libp2p application protocols can be discovered via the well-known resource .well-known/libp2p/protocols. This allows server operators to dynamically change the URLs of the application protocols offered, and not hard-code any assumptions about how a certain resource is meant to be interpreted.


{
    "protocols": {
        "/kad/1.0.0": {"path": "/kademlia/"},
        "/ipfs/gateway": {"path": "/"},
    }
}

The resource contains a mapping of application protocols to a URL namespace. For example, this configuration file would tell a client

  1. The Kademlia application protocol is available with prefix /kademlia and,
  2. The IPFS Trustless Gateway API is mounted at /.

It is valid to expose a service at /. It is RECOMMENDED that implementations facilitate the coexistence of different service endpoints by ensuring that more specific URLs are resolved before less specific ones. For example, when registering handlers, more specific paths like /kademlia/foo should take precedence over less specific handlers, such as /.
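
As a non-normative sketch, a Go client could discover a server's protocol mapping as follows (the type and function names are illustrative; the JSON shape matches the example above):

package libp2phttp

import (
	"encoding/json"
	"net/http"
)

type protocolEntry struct {
	Path string `json:"path"`
}

// discoverProtocols fetches the well-known resource and returns the
// mapping from application protocol IDs to URL path prefixes.
func discoverProtocols(baseURL string) (map[string]protocolEntry, error) {
	resp, err := http.Get(baseURL + "/.well-known/libp2p/protocols")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var wk struct {
		Protocols map[string]protocolEntry `json:"protocols"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&wk); err != nil {
		return nil, err
	}
	return wk.Protocols, nil
}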

Peer ID Authentication

When using the HTTP Transport, Peer ID authentication is optional. You only pay for it if you need it. This benefits use cases that don’t need peer authentication (e.g., fetching content addressed data) or authenticate some other way (not tied to libp2p peer ids).

Specific authentication schemes for authenticating Peer IDs will be defined in a future spec.

Using HTTP semantics over stream transports

Application protocols using HTTP semantics can run over any libp2p stream transport. Clients open a new stream using /http/1.1 as the protocol identifier. Clients encode their HTTP request as an HTTP/1.1 message and send it over the stream. Clients parse the response as an HTTP/1.1 message and then close the stream. Clients SHOULD NOT pipeline requests over a single stream. Clients and servers SHOULD set the Connection: close header to signal that this is not a persistent connection.

HTTP/1.1 is chosen as the minimum bar for interoperability, but other encodings of HTTP semantics are possible as well and may be specified in a future update.
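
A sketch of one request/response exchange in Go, assuming s is a stream opened with the /http/1.1 protocol ID that implements io.ReadWriteCloser:

package libp2phttp

import (
	"bufio"
	"io"
	"net/http"
)

// roundTrip writes a single HTTP/1.1-encoded request to the stream and
// parses the response; the stream is not reused for further requests.
func roundTrip(s io.ReadWriteCloser, req *http.Request) (*http.Response, error) {
	req.Header.Set("Connection", "close") // signal a non-persistent connection
	if err := req.Write(s); err != nil {
		return nil, err
	}
	return http.ReadResponse(bufio.NewReader(s), req)
}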

Multiaddr URI scheme

In places where a URI is expected, implementations SHOULD accept a multiaddr URI in addition to a standard http or https URI. A multiaddr URI is a URI with the multiaddr scheme. It is constructed by taking the "multiaddr:" string and appending the string encoded representation of the multiaddr. E.g. the multiaddr /ip4/1.2.3.4/udp/54321/quic-v1 would be represented as multiaddr:/ip4/1.2.3.4/udp/54321/quic-v1.

This URI can be extended to include HTTP paths with the /http-path component. This allows a user to make an HTTP request to a specific HTTP resource using a multiaddr. For example, a user could make a GET request to multiaddr:/ip4/1.2.3.4/udp/54321/quic-v1/p2p/12D.../http-path/.well-known%2Flibp2p. This also allows an HTTP redirect to another host and another HTTP resource.

Using other request-response semantics (not HTTP)

This document has focused on using HTTP semantics, but HTTP may not be the common divisor amongst all transports (current and future). It may be desirable to use some other request-response semantics for your application-level protocol, perhaps something like rust-libp2p’s request-response abstraction. Nothing specified in this document prohibits mapping other semantics onto HTTP semantics to keep the benefits of using an HTTP transport.

As a simple example, to support simple request-response semantics, the request MUST be encoded within a POST request to the proper URL (as defined in the Namespace section). The response is read from the body of the HTTP response. The client MUST authenticate the server and itself before making the request. The reason to choose POST is that this mapping makes no assumptions on whether the request is cacheable. If HTTP caching is desired, users should either build on HTTP semantics or choose another mapping with different assumptions.

Other mappings may also be valid, as long as nodes agree.

Peer ID Authentication over HTTP

| Lifecycle Stage | Maturity      | Status | Latest Revision |
|-----------------|---------------|--------|-----------------|
| 1A              | Working Draft | Active | r1, 2025-05-28  |

Authors: @MarcoPolo

Interest Group: @sukunrt, @achingbrain

Introduction

This spec defines an HTTP authentication scheme for libp2p Peer IDs in accordance with RFC 9110. The authentication scheme is called libp2p-PeerID.

Protocol Overview

At a high level, challenges are exchanged and signed by each peer to authenticate themselves to each other. The protocol works whether the Client provides the first challenge, or the Server provides the first challenge.

Example Diagram of Server initiated handshake

┌─────────┐                   ┌────────┐
│ Client  │                   │ Server │
└─────────┘                   └────────┘
     │   initial request           │
     ├────────────────────────────>│
     │                             │
     │   401; challenge-client     │
     │<────────────────────────────┤
     │                             │
     │   client-sig +              │
     │   challenge-server          │
     │   [client authenticated]    │
     ├────────────────────────────>│
     │                             │
     │   server-sig                │
     │   [server authenticated]    │
     │<────────────────────────────┤
     │                             │
     │   application data          │
     ├────────────────────────────>│
     │                             │
     │   resp                      │
     │<────────────────────────────┤

Example Diagram of Client initiated handshake

┌────────┐                    ┌────────┐
│ Client │                    │ Server │
└────────┘                    └────────┘
     │   challenge-server          │
     ├────────────────────────────>│
     │                             │
     │   challenge-client +        │
     │   server-sig                │
     │   [server authenticated]    │
     │<────────────────────────────┤
     │                             │
     │   client-sig +              │
     │   application data          │
     │   [client authenticated]    │
     ├────────────────────────────>│
     │                             │
     │   resp                      │
     │<────────────────────────────┤

Parameters

| Param Name | Description |
|------------|-------------|
| hostname | The server name used in the TLS connection (SNI). |
| challenge-server | The random quoted-string value the client generates to challenge the server to prove its identity. |
| challenge-client | The random quoted-string value the server generates to challenge the client to prove its identity. |
| sig | A base64-encoded signature. |
| public-key | A base64-encoded value of the peer's public key. This MUST be the key used for the peer's Peer ID. The key itself is encoded per the Peer ID spec. |
| opaque | An opaque blob generated by the server. If a client receives this value, it must return it unmodified. A server may use this to authenticate statelessly. |

Params are encoded per RFC 9110 auth-param's ABNF. Generally it'll be something like: hostname="example.com", challenge-server="challenge-string"

Signing

Signatures sign some set of parameters prefixed by the string libp2p-PeerID. The parameters are sorted alphabetically, each prepended with a varint length prefix, and concatenated together to form the data to be signed. The parameter name and value are separated by a =, with the value appended directly after the =. Strings MUST be UTF-8 encoded. Byte arrays MUST be appended as-is. The signing algorithm is defined by the key type used. Refer to the Peer ID spec for specifics on the signing algorithm.
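
A minimal, non-normative sketch of assembling the data to be signed; the helper name buildSigPayload is an assumption for the example, and parameter values are raw bytes (UTF-8 for strings):

package main

import (
	"encoding/binary"
	"sort"
)

// buildSigPayload produces: "libp2p-PeerID" || varint(len(f1)) || f1 || ...
// where each field fN is "name=value" and fields are sorted by name.
func buildSigPayload(params map[string][]byte) []byte {
	names := make([]string, 0, len(params))
	for name := range params {
		names = append(names, name)
	}
	sort.Strings(names) // parameters are sorted alphabetically

	out := []byte("libp2p-PeerID")
	for _, name := range names {
		field := append([]byte(name+"="), params[name]...)
		out = binary.AppendUvarint(out, uint64(len(field))) // varint length prefix
		out = append(out, field...)
	}
	return out
}

Feeding the challenge-server, client-public-key, and hostname parameters from the example below through this construction should reproduce the "data to sign" rows.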

Signing Example

| Parameter | Value |
|-----------|-------|
| hostname | example.com |
| Server Private Key (pb encoded as hex) | 0801124001010101010101010101010101010101010101010101010101010101010101018a88e3dd7409f195fd52db2d3cba5d72ca6709bf1d94121bf3748801b40f6f5c |
| challenge-server | ERERERERERERERERERERERERERERERERERERERERERE= |
| Client Public Key (pb encoded as hex) | 080112208139770ea87d175f56a35466c34c7ecccb8d8a91b4ee37a25df60f5b8fc9b394 |
| data to sign (percent encoded) | libp2p-PeerID=challenge-server=ERERERERERERERERERERERERERERERERERERERERERE=6client-public-key=%08%01%12%20%819w%0E%A8%7D%17_V%A3Tf%C3L~%CC%CB%8D%8A%91%B4%EE7%A2%5D%F6%0F%5B%8F%C9%B3%94%14hostname=example.com |
| data to sign (hex encoded) | 6c69627032702d5065657249443d6368616c6c656e67652d7365727665723d455245524552455245524552455245524552455245524552455245524552455245524552455245524552453d36636c69656e742d7075626c69632d6b65793d080112208139770ea87d175f56a35466c34c7ecccb8d8a91b4ee37a25df60f5b8fc9b39414686f73746e616d653d6578616d706c652e636f6d |
| signature (base64 encoded) | UA88qZbLUzmAxrD9KECbDCgSKAUBAvBHrOCF2X0uPLR1uUCF7qGfLPc7dw3Olo-LaFCDpk5sXN7TkLWPVvuXAA== |

Note that the = immediately after the libp2p-PeerID prefix is the varint length of the challenge-server parameter: the parameter is 61 bytes long, and the varint encoding of 61 is 0x3d, the ASCII code for =.

Base64 Encoding

The base64 encoding follows Base 64 Encoding with URL and Filename Safe Alphabet from RFC 4648. Padding MAY be omitted. The reason this is not a multibase encoding is to aid clients or servers that cannot, or prefer not to, import a multibase dependency.
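
In Go, for instance, this corresponds to base64.URLEncoding, or base64.RawURLEncoding when padding is omitted (a non-normative illustration):

package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	raw := []byte{0xde, 0xad, 0xbe, 0xef}
	fmt.Println(base64.URLEncoding.EncodeToString(raw))    // "3q2-7w==" (padded)
	fmt.Println(base64.RawURLEncoding.EncodeToString(raw)) // "3q2-7w" (padding omitted)
}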

Public Key Encoding

The authentication flow below exchanges the peer's public key instead of its Peer ID, as the Peer ID alone may not be enough to validate a signature. The public key is encoded per the "Keys" section of the Peer ID spec.

Mutual Client and Server Peer ID Authentication

The following protocol allows both the client and server to authenticate each other's Peer ID by having them each sign a challenge issued by the other. The protocol operates as follows:

Server Initiated Handshake

  1. The client makes an HTTP request to an authenticated resource.

  2. The server responds with status code 401 (Unauthorized) and sets the header:

    WWW-Authenticate: libp2p-PeerID challenge-client="<challenge-string>", public-key="<base64-encoded-public-key-bytes>", opaque="<opaque-value>"
    

    The public-key parameter is the server's public key. It is the same public key used to derive the server's peer id.

    The opaque parameter is opaque to the client. The client MUST return the opaque parameter back to the server. The server MAY use the opaque parameter to encode state.

  3. The client makes another HTTP request to the same authenticated resource and sets the header:

    Authorization: libp2p-PeerID public-key="<base64-encoded-public-key-bytes>", opaque="<opaque-from-server>", challenge-server="<challenge-string>", sig="<base64-signature-bytes>"
    

    The public-key parameter is the client's public key. It is the same public key used to derive the client's peer id.

    The sig param represents a signature over the parameters:

    • challenge-client
    • server-public-key the bytes of the server's public-key encoded per the Peer ID spec.
    • hostname
  4. The server SHOULD verify the signature using the server name used in the TLS session. The server MUST return 401 Unauthorized if the server fails to validate the signature. If the signature is valid, the server has authenticated the client's public key, and thus its PeerID. The server SHOULD proceed to serve the HTTP request. The server MUST set the following response headers:

    Authentication-Info: libp2p-PeerID sig="<base64-signature-bytes>" bearer="<base64-encoded-opaque-blob>"
    

    The server MAY include an expires field which contains the expiry time of the bearer token in RFC 3339 format:

    Authentication-Info: libp2p-PeerID sig="<base64-signature-bytes>" bearer="<base64-encoded-opaque-blob>" expires="<RFC-3339-formatted-date-string>"
    

    Note that the expires field is only advisory; the server may expire the token at any time.

    The sig param represents a signature over the parameters:

    • challenge-server
    • client-public-key the bytes of the client's public-key encoded per the Peer ID spec.
    • hostname

    The bearer token allows the client to make future Peer ID authenticated requests. The value is opaque to the client, and the server may use it to store authentication state such as:

    • The client's Peer ID.
    • The hostname parameter.
    • The token creation date (to allow tokens to expire).
  5. The client MUST verify the signature. After verification the client has authenticated the server's Peer ID. The client SHOULD send the bearer token for Peer ID authenticated requests.

Client Initiated Handshake

The client initiated version of this handshake follows the same structure, except that the client initially sends a challenge-server and the order in which the peers are authenticated is reversed. The server MAY ignore the initial request and respond by starting the server initiated handshake.

The client initiated handshake is as follows:

  1. The client makes an HTTP request to a known authenticated resource and sets the header:

    Authorization: libp2p-PeerID challenge-server="<challenge-string>", public-key="<base64-encoded-public-key-bytes>"
    
  2. The server responds with status code 401 (Unauthorized) and sets the header:

    WWW-Authenticate: libp2p-PeerID challenge-client="<challenge-string>", public-key="<base64-encoded-public-key-bytes>", sig="<base64-signature-bytes>", opaque="<opaque-value>"
    

    The sig param represents a signature over the parameters:

    • challenge-server
    • client-public-key the bytes of the client's public-key encoded per the Peer ID spec.
    • hostname
  3. The client MUST verify the signature. After verification the client has authenticated the server's Peer ID.

    The client makes another HTTP request to the same authenticated resource and sets the header:

    Authorization: libp2p-PeerID opaque="<opaque-from-server>", sig="<base64-signature-bytes>"
    

    The client MAY send application data in this request.

    The sig param represents a signature over the parameters:

    • challenge-client
    • server-public-key the bytes of the server's public-key encoded per the Peer ID spec.
    • hostname
  4. The server MUST verify the signature. The server SHOULD verify the signature using the server name used in the TLS session. The server MUST return 401 Unauthorized if the server fails to validate the signature. If the signature is valid, the server has authenticated the client's public key, and thus its PeerID. The server SHOULD proceed to serve the HTTP request. The server MUST set the following response headers:

    Authentication-Info: libp2p-PeerID bearer="<base64-encoded-opaque-blob>"
    

    The bearer token allows the client to make future Peer ID authenticated requests. The value is opaque to the client, and the server MAY use it to store authentication state such as:

    • The client's Peer ID.
    • The hostname parameter.
    • The token creation date (to allow tokens to expire).

    The server MAY include an expires field which contains the expiry time of the bearer token in RFC 3339 format:

    Authentication-Info: libp2p-PeerID bearer="<base64-encoded-opaque-blob>" expires="<RFC-3339-formatted-date-string>"
    

    Note that the expires field is only advisory; the server may expire the token at any time.

  5. The client SHOULD send the bearer token for future Peer ID authenticated requests.

libp2p bearer token

The libp2p bearer token is a token given to the client by the server that allows the client (the bearer) to make Peer ID authenticated requests to the server. Once the client receives this token, it SHOULD save it and use it for future authenticated requests.

The server SHOULD return a 401 Unauthorized and follow the above Mutual authentication protocol when it wants the client to request a new libp2p bearer token.

To use the bearer token, the client MUST set the Authorization header as follows:

Authorization: libp2p-PeerID bearer="<base64-encoded-opaque-blob>"

Authentication URI Endpoint

Because the client needs to make a request to authenticate the server, and the client may not want to make the real request before authenticating the server, the server MAY provide an authentication endpoint. This authentication endpoint is like any other application protocol, and it shows up in .well-known/libp2p/protocols, but it only does the authentication flow. The client and server SHOULD NOT send any data besides what is defined in the above authentication flow. The protocol id for the authentication endpoint is /http-peer-id-auth/1.0.0.

Considerations for Implementations

  • Implementations SHOULD only authenticate over a secured connection (i.e. TLS).
  • Implementations SHOULD limit the maximum length of any variable length field.
    • The suggested maximum length of any authentication-related header is 2048 bytes.

Security Considerations

Protection against man-in-the-middle (MITM) attacks comes from Web PKI. If the client is in an environment where Web PKI cannot be fully trusted (e.g. an enterprise network with a custom enterprise root CA installed on the client), then this authentication scheme cannot protect the client from a MITM attack.

This authentication scheme is also not secure in cases where you do not own your domain name or the TLS certificate. If someone else can get a valid certificate for your domain, you may be vulnerable to a MITM attack.

Complete Server Initiated Handshake Example

The following is a complete and reproducible handshake, generated by the current implementation of this spec in go-libp2p. This is a server-initiated handshake.

Understanding the opaque value is not necessary in order to understand the spec. Servers are free to do whatever they want with the opaque field. The opaque value represents encoded server state authenticated with an HMAC. The details can be found in the go-libp2p source.

Parameters

| Parameter | Value |
|-----------|-------|
| hostname | example.com |
| Server Private Key (pb encoded as hex) | 0801124001010101010101010101010101010101010101010101010101010101010101018a88e3dd7409f195fd52db2d3cba5d72ca6709bf1d94121bf3748801b40f6f5c |
| Server HMAC Key (hex) | 0000000000000000000000000000000000000000000000000000000000000000 |
| Challenge Client | ERERERERERERERERERERERERERERERERERERERERERE= |
| Client Private Key (pb encoded as hex) | 0801124002020202020202020202020202020202020202020202020202020202020202028139770ea87d175f56a35466c34c7ecccb8d8a91b4ee37a25df60f5b8fc9b394 |
| Challenge Server | MzMzMzMzMzMzMzMzMzMzMzMzMzMzMzMz |
| "Now" time | 1970-01-01 00:00:00 +0000 UTC |

Handshake Diagram

sequenceDiagram
Client->>Server: Initial request
Server->>Client: WWW-Authenticate=libp2p-PeerID challenge-client="ERERERERERERERERERERERERERERERERERERERERERE=", opaque="0H1Y9sq1zrfTJZCCTcTymI2tV_TF9-PzdMip2dFkiqZ7ImNoYWxsZW5nZS1jbGllbnQiOiJFUkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFPSIsImhvc3RuYW1lIjoiZXhhbXBsZS5jb20iLCJjcmVhdGVkLXRpbWUiOiIxOTY5LTEyLTMxVDE2OjAwOjAwLTA4OjAwIn0="
Client->>Server: Authorization=libp2p-PeerID public-key="CAESIIE5dw6ofRdfVqNUZsNMfszLjYqRtO43ol32D1uPybOU", challenge-server="MzMzMzMzMzMzMzMzMzMzMzMzMzMzMzMz", sig="5RT0BbFdn-hMgE4pQ_GH9tnlKpptGUQZvkh8kVLbwy81Rzli_vfiNOsuGTcMk8lyUfkmTFmk79b5XUZCR3-RBw==", opaque="0H1Y9sq1zrfTJZCCTcTymI2tV_TF9-PzdMip2dFkiqZ7ImNoYWxsZW5nZS1jbGllbnQiOiJFUkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFPSIsImhvc3RuYW1lIjoiZXhhbXBsZS5jb20iLCJjcmVhdGVkLXRpbWUiOiIxOTY5LTEyLTMxVDE2OjAwOjAwLTA4OjAwIn0="
Note left of Server: Server has authenticated Client
Server->>Client: Authentication-Info=libp2p-PeerID sig="HQ7BJRaSpRhNCORNiALNJENdwXUyq0eM2cxNoxe-XnQw6oEAMaeYnjMYaHHjgq0XNxZmy4W2ngKUcI1CgprLCQ==", bearer="YhlYjHWTMOkTleROtjMiChL7Mx15_GDYfi971mdJCqB7ImlzLXRva2VuIjp0cnVlLCJwZWVyLWlkIjoiMTJEM0tvb1dKV29hcVpoRGFvRUZzaEY3UmgxYnBZOW9oaWhGaHpjVzZkNjlMcjJOQVN1cSIsImhvc3RuYW1lIjoiZXhhbXBsZS5jb20iLCJjcmVhdGVkLXRpbWUiOiIxOTY5LTEyLTMxVDE2OjAwOjAwLTA4OjAwIn0=", public-key="CAESIIqI4910CfGV_VLbLTy6XXLKZwm_HZQSG_N0iAG0D29c"
Note right of Client: Client has authenticated Server

Note over Client: Future requests use the bearer token
Client->>Server: Authorization=libp2p-PeerID bearer="YhlYjHWTMOkTleROtjMiChL7Mx15_GDYfi971mdJCqB7ImlzLXRva2VuIjp0cnVlLCJwZWVyLWlkIjoiMTJEM0tvb1dKV29hcVpoRGFvRUZzaEY3UmgxYnBZOW9oaWhGaHpjVzZkNjlMcjJOQVN1cSIsImhvc3RuYW1lIjoiZXhhbXBsZS5jb20iLCJjcmVhdGVkLXRpbWUiOiIxOTY5LTEyLTMxVDE2OjAwOjAwLTA4OjAwIn0="

Complete Client Initiated Handshake Example

Below is the same as above, but using the client initiated handshake.

Parameters

| Parameter | Value |
|-----------|-------|
| hostname | example.com |
| Server Private Key (pb encoded as hex) | 0801124001010101010101010101010101010101010101010101010101010101010101018a88e3dd7409f195fd52db2d3cba5d72ca6709bf1d94121bf3748801b40f6f5c |
| Server HMAC Key (hex) | 0000000000000000000000000000000000000000000000000000000000000000 |
| Challenge Client | ERERERERERERERERERERERERERERERERERERERERERE= |
| Client Private Key (pb encoded as hex) | 0801124002020202020202020202020202020202020202020202020202020202020202028139770ea87d175f56a35466c34c7ecccb8d8a91b4ee37a25df60f5b8fc9b394 |
| Challenge Server | MzMzMzMzMzMzMzMzMzMzMzMzMzMzMzMz |
| "Now" time | 1970-01-01 00:00:00 +0000 UTC |

Handshake Diagram

sequenceDiagram
Client->>Server: Authorization=libp2p-PeerID challenge-server="MzMzMzMzMzMzMzMzMzMzMzMzMzMzMzMz", public-key="CAESIIE5dw6ofRdfVqNUZsNMfszLjYqRtO43ol32D1uPybOU"
Server->>Client: WWW-Authenticate=libp2p-PeerID challenge-client="ERERERERERERERERERERERERERERERERERERERERERE=", public-key="CAESIIqI4910CfGV_VLbLTy6XXLKZwm_HZQSG_N0iAG0D29c", sig="HQ7BJRaSpRhNCORNiALNJENdwXUyq0eM2cxNoxe-XnQw6oEAMaeYnjMYaHHjgq0XNxZmy4W2ngKUcI1CgprLCQ==", opaque="1JrloFj6hobNG859qexB0_odSQlwsb1QSFUMebPJLFp7ImNsaWVudC1wdWJsaWMta2V5IjoiQ0FFU0lJRTVkdzZvZlJkZlZxTlVac05NZnN6TGpZcVJ0TzQzb2wzMkQxdVB5Yk9VIiwiY2hhbGxlbmdlLWNsaWVudCI6IkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFUkU9IiwiaG9zdG5hbWUiOiJleGFtcGxlLmNvbSIsImNyZWF0ZWQtdGltZSI6IjE5NjktMTItMzFUMTY6MDA6MDAtMDg6MDAifQ=="
Note right of Client: Client has authenticated Server

Client->>Server: Authorization=libp2p-PeerID opaque="1JrloFj6hobNG859qexB0_odSQlwsb1QSFUMebPJLFp7ImNsaWVudC1wdWJsaWMta2V5IjoiQ0FFU0lJRTVkdzZvZlJkZlZxTlVac05NZnN6TGpZcVJ0TzQzb2wzMkQxdVB5Yk9VIiwiY2hhbGxlbmdlLWNsaWVudCI6IkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFUkVSRVJFUkU9IiwiaG9zdG5hbWUiOiJleGFtcGxlLmNvbSIsImNyZWF0ZWQtdGltZSI6IjE5NjktMTItMzFUMTY6MDA6MDAtMDg6MDAifQ==", sig="OrwJPO4buHKJdKXP2av8PFwv3XF_-m5MqndskeVV5UzufYzBCTm7RBaFnBS1sEhuQHZSZPh9RJgN5NmLzrUrBQ=="
Note left of Server: Server has authenticated Client
Server->>Client: Authentication-Info=libp2p-PeerID bearer="YhlYjHWTMOkTleROtjMiChL7Mx15_GDYfi971mdJCqB7ImlzLXRva2VuIjp0cnVlLCJwZWVyLWlkIjoiMTJEM0tvb1dKV29hcVpoRGFvRUZzaEY3UmgxYnBZOW9oaWhGaHpjVzZkNjlMcjJOQVN1cSIsImhvc3RuYW1lIjoiZXhhbXBsZS5jb20iLCJjcmVhdGVkLXRpbWUiOiIxOTY5LTEyLTMxVDE2OjAwOjAwLTA4OjAwIn0="
Note over Client: Future requests use the bearer token

HTTP Transport Component

| Lifecycle Stage | Maturity | Status | Latest Revision |
|-----------------|----------|--------|-----------------|
| 1A | Working Draft | Active | r0, 2023-05-31 |

Authors: @marcopolo

Interest Group: @marcopolo, @mxinden, @marten-seemann

Table of Contents

Context

This document is only about advertising support for an HTTP transport. It doesn't make any assertions about how libp2p should interact with that transport. That will be defined in a future document.

This exists to clarify the role of the /http component in Multiaddrs early to avoid confusion and conflicting interpretations.

What is an HTTP transport

A node has an HTTP transport if it can speak some standardized version of HTTP. Intuitively, if you can curl it with HTTP, then it speaks HTTP.

Most environments will have a way to create an HTTP client and server, and the specific HTTP version used will be opaque. We use the /http component at the end of the multiaddr to signal that this server supports an HTTP transport. The end user agent decides which HTTP version to use, based on the multiaddr prefix, application, server negotiation, and specific use case. This follows what existing http:// URL implementations do.

Multiaddr representation

The multiaddr of a node with an HTTP transport ends with /http and is prefixed by information that would let an HTTP client know how to reach the server (remember that multiaddrs are interpreted right to left).

The following are examples of multiaddrs for HTTP transport capable nodes:

  • /dns/example.com/tls/http
  • /ip4/1.2.3.4/tcp/443/tls/http
  • /ip6/2001:0db8:85a3:0000:0000:8a2e:0370:7334/tcp/443/tls/http
  • /ip4/1.2.3.4/udp/50781/quic-v1/http

Note: When we use /quic-v1/http or /tcp/443/tls/http (or any other transport) implementations MUST use the correct HTTP ALPN (e.g. h3 or h2 respectively) and not libp2p when using the HTTP transport.

HTTP Paths (and other HTTP Semantics)

It may be tempting to add an HTTP path to the end of the multiaddr to specify some information about a user protocol. However, the /http component is not a user protocol, and it doesn't accept any parameters. It only signals that a node is capable of an HTTP transport.

The HTTP path exists at the semantics level. HTTP semantics are transport-agnostic and defined by RFC 9110. You can use these semantics on any transport including, but not limited to, the HTTP transports HTTP/1.1, HTTP/2, and HTTP/3.

Recommendation on including HTTP semantics in multiaddrs

In general, it's better to keep the multiaddrs as a way of addressing an endpoint and keep the semantics independent of any specific transport. This way you can use the same semantics among many specific transports.

However, sometimes it's helpful to share a single multiaddr that contains some extra application-level data (as opposed to transport data). The recommendation is to use a new multicodec in the private range for your application, then place the application parameters to the right of your new multicodec and the transport information to the left. E.g. <transport>/myapp/<parameters> or /ip4/127.0.0.1/tcp/8080/http/myapp/custom-prefix/foo%2fbar. Your application has the flexibility to handle the parameters in any way it wants (e.g. set HTTP headers, an HTTP path prefix, cookies, etc).

This is a bit cumbersome when you are trying to use multiple transports, since you may end up with many multiaddrs with different transports but the same suffix. A potential solution is to keep them separate: a list of multiaddrs for the transports being used, and another multiaddr for the application-level data. This is one suggestion; many other strategies would work as well.

libp2p WebSockets

| Lifecycle Stage | Maturity | Status | Latest Revision |
|-----------------|----------|--------|-----------------|
| 3A | Recommendation | Active | r0, 2024-10-23 |

Authors: @achingbrain

Interest Group: @MarcoPolo

See the lifecycle document for context about maturity level and spec status.

Introduction

WebSockets are a way for web applications to maintain bidirectional communications with server-side processes.

All major browsers have shipped WebSocket support and the implementations are both robust and well understood.

A WebSocket request starts as a regular HTTP request, which is renegotiated as a WebSocket connection using the HTTP protocol upgrade mechanism.

Drawbacks

WebSockets suffer from head of line blocking and provide no mechanism for stream multiplexing, encryption or authentication so additional features must be added by the developer or by libp2p.

In practice they only run over TCP, so they are less effective with DCUtR hole punching.

Certificates

With some exceptions browsers will prevent making connections to unencrypted WebSockets when the request is made from a Secure Context.

Given that libp2p makes extensive use of the SubtleCrypto API, and that API is only available in Secure Contexts, it's safe to assume that any incoming libp2p connections initiated over WebSockets originate from a Secure Context.

Consequently server-side processes listening for incoming libp2p connections via WebSockets must use TLS certificates that can be verified by the connecting user agent.

These must be obtained externally and configured in the same way as you would for an HTTP server.

The only exception to this is if both server and client are operating exclusively on loopback or localhost addresses such as in a testing or offline environment. Such addresses should not be shared outside of these environments.

Stream Multiplexing

WebSockets have no built-in stream multiplexing. Server-side processes listening for incoming libp2p connections via WebSockets should support multistream-select and negotiate an appropriate stream multiplexer such as yamux.

Authentication

WebSockets have no built-in authentication mechanism. Server-side processes listening for incoming libp2p connections via WebSockets should support multistream-select and negotiate an appropriate authentication mechanism such as noise.

Encryption

At the time of writing, the negotiated authentication mechanism should also be used to encrypt all traffic sent over the WebSocket even if TLS certificates are also used at the transport layer.

A mechanism to avoid this but also maintain backwards compatibility with existing server-side processes will be specified in a future revision to this spec.

Addressing

A WebSocket address contains /ws, /tls/ws or /wss and runs over TCP. If a TCP port is omitted, a secure WebSocket (e.g. /tls/ws or /wss) is assumed to run on TCP port 443, and an insecure WebSocket is assumed to run on TCP port 80, similar to HTTP addresses.

Examples:

  • /ip4/192.0.2.0/tcp/1234/ws (an insecure address with a TCP port)
  • /ip4/192.0.2.0/tcp/1234/tls/ws (a secure address with a TCP port)
  • /ip4/192.0.2.0/tcp/1234/tls/sni/foo.example.com/ws (a secure address with resolved DNS address with explicit SNI value for TLS)
  • /ip4/192.0.2.0/ws (an insecure address that defaults to TCP port 80)
  • /ip4/192.0.2.0/tls/ws (a secure address that defaults to TCP port 443)
  • /ip4/192.0.2.0/wss (/tls may be omitted when using /wss)
  • /dns/example.com/wss (a DNS address)
  • /dns/example.com/wss/http-path/path%2Fto%2Fendpoint (an address with a path)

libp2p WebTransport

| Lifecycle Stage | Maturity | Status | Latest Revision |
|-----------------|----------|--------|-----------------|
| 2A | Candidate Recommendation | Active | r0, 2022-10-12 |

Authors: @marten-seemann

Interest Group: @MarcoPolo, @mxinden, @elenaf9

See the lifecycle document for context about maturity level and spec status.

Introduction

WebTransport is a way for browsers to establish a stream-multiplexed and bidirectional connection to servers. The WebTransport protocol is currently under development at the IETF. The primary way to do this is by running on top of an HTTP/3 connection (WebTransport over HTTP/3). For situations where it is not possible to establish an HTTP/3 connection (e.g. when UDP is blocked), there's an HTTP/2 fallback (WebTransport using HTTP/2).

In this document, we mean WebTransport over HTTP/3 when using the term WebTransport.

Chrome has implemented and shipped support for draft-02, and Firefox is working on WebTransport support.

The most exciting feature for libp2p (other than the numerous performance benefits that QUIC gives us) is that the W3C added a browser API allowing browsers to establish connections to nodes with self-signed certificates, provided they know the hash of the certificate in advance: serverCertificateHashes. This API is already implemented in Chrome, and Firefox is likely to implement this part of the specification as well.

Certificates

According to the w3c WebTransport specification, there are two ways for a browser to validate the certificate used on a WebTransport connection.

  1. by verifying the chain of trust of the certificate. This means that the certificate has to be signed by a CA (Certificate Authority) that the browser trusts. This is how browsers verify certificates when establishing a regular HTTPS connection.
  2. by verifying that the cryptographic hash of the certificate matches a specific value, using the serverCertificateHashes option.

libp2p nodes that possess a CA-signed TLS certificate MAY use that certificate on WebTransport connections. These nodes SHOULD NOT add a /certhash component (see Addressing) to addresses they advertise, since this will cause clients to verify the certificate by the hash (instead of verifying the certificate chain).

The rest of this section applies to nodes that wish to use self-signed certificates and make use of the verification by certificate hash.

According to the w3c specification, the certificate MUST be valid for at most 14 days and MUST NOT use an RSA key. Nodes then include the hash of one (or more) certificates in their multiaddr (see Addressing).

Servers need to take care of regularly renewing their certificates. In the following, the RECOMMENDED logic for rolling certificates is described. At first boot of the node, it creates one self-signed certificate with a validity of 14 days, starting immediately, and another certificate with the 14 day validity period starting on the expiry date of the first certificate. The node advertises a multiaddress containing the certificate hashes of these two certificates. Once the first certificate has expired, the node starts using the already generated next certificate. At the same time, it again generates a new certificate for the following period and updates the multiaddress it advertises.
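
A non-normative sketch of the recommended schedule; the helper name plan is an assumption, and the advertised multiaddr would include the certhashes of both windows' certificates:

package main

import "time"

const certValidity = 14 * 24 * time.Hour // maximum validity per the w3c spec

// plan returns the validity windows for the current certificate and the
// pre-generated next one; call it again with currEnd once the first expires.
func plan(now time.Time) (currStart, currEnd, nextStart, nextEnd time.Time) {
	currStart, currEnd = now, now.Add(certValidity)
	nextStart, nextEnd = currEnd, currEnd.Add(certValidity) // next starts at expiry of the first
	return
}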

Addressing

WebTransport multiaddresses are composed of a QUIC multiaddress, followed by /webtransport and a list of multihashes of the certificates that the server uses (if not using a CA-signed certificate). Examples:

  • /ip4/192.0.2.0/udp/443/quic/webtransport (when using a CA-signed certificate)
  • /ip4/192.0.2.0/udp/1234/quic/webtransport/certhash/<hash1> (when using a single self-signed certificate)
  • /ip6/fe80::1ff:fe23:4567:890a/udp/1234/quic/webtransport/certhash/<hash1>/certhash/<hash2>/certhash/<hash3> (when using multiple self-signed certificates)

WebTransport HTTP endpoint

WebTransport needs an HTTPS URL to establish a WebTransport session, e.g. https://example.com/webtransport. At the time of writing, multiaddresses don't allow the encoding of URLs, therefore this spec standardizes the endpoint. The HTTP endpoint of a libp2p WebTransport server MUST be located at /.well-known/libp2p-webtransport.

To allow future evolution of the way we run the libp2p handshake over WebTransport, we use a URL parameter. The handshake described in this document MUST be signaled by setting the type URL parameter to noise.

Example: The WebTransport URL of a WebTransport server advertising /ip4/192.0.2.0/udp/1443/quic/webtransport/ would be https://192.0.2.0:1443/.well-known/libp2p-webtransport?type=noise.

Security Handshake

Unfortunately, the self-signed certificate doesn't allow the nodes to authenticate each others' peer IDs. It is therefore necessary to run an additional libp2p handshake on a newly established WebTransport connection. The first stream that the client opens on a new WebTransport session is used to perform a libp2p handshake using Noise (https://github.com/libp2p/specs/tree/master/noise). The client SHOULD start the handshake right after sending the CONNECT request, without waiting for the server's response.

In order to verify end-to-end encryption of the connection, the peers need to establish that no MITM intercepted the connection. To do so, the server MUST include the certificate hash of the currently used certificate as well as the certificate hashes of all future certificates it has already advertised to the network in the webtransport_certhashes Noise extension (see Noise Extension section of the Noise spec). The hash of recently used, but expired certificates SHOULD also be included.

On receipt of the webtransport_certhashes extension, the client MUST verify that the certificate hash of the certificate that was used on the connection is contained in the server's list. If the client was willing to accept multiple certificate hashes, but cannot determine which certificate was actually used to establish the connection (this will commonly be the case for browser clients), it MUST verify that all certificate hashes are contained in the server's list. If verification fails, it MUST abort the handshake.

For the client, the libp2p connection is fully established once it has sent the last Noise handshake message. For the server, processing of that message completes the handshake.

Perf

| Lifecycle Stage | Maturity | Status | Latest Revision |
|-----------------|----------|--------|-----------------|
| 1A | Working Draft | Active | r0, 2022-11-16 |

Authors: @marcopolo

Interest Group: @marcopolo, @mxinden, @marten-seemann

Table of Contents

Context

The perf protocol represents a standard benchmarking protocol that we can use to talk about performance within and across libp2p implementations. This lets us analyze performance, guide improvements, and protect against regressions.

Protocol

The /perf/1.0.0 protocol (from here on referred to simply as perf) is a client-driven set of benchmarks. To avoid reinventing the wheel, this perf protocol is almost identical to Nick Banks' QUIC Performance Internet-Draft, but adapted to libp2p. The protocol first performs an upload of a client-chosen number of bytes. Once that upload has finished, the server sends back as many bytes as the client requested.

The bytes themselves should be a predetermined arbitrary set of bytes. Zeros are fine, and so are random bytes, as long as the same set is used each time; generating a fresh set of random bytes on the fly may limit throughput to how fast you can generate them.

The protocol is as follows (a client-side Go sketch appears after the two lists):

Client:

  1. Open a libp2p stream to the server.
  2. Tell the server how many bytes we want the server to send us as a single big-endian uint64 number. Zero is a valid number, as is the max uint64 value.
  3. Write some amount of data to the stream. Zero is a valid amount.
  4. Close the write side of our stream.
  5. Read from the read side of the stream. This should be the same number of bytes as we told the server in step 2.

Server, on handling a new perf stream:

  1. Read the big-endian uint64 number. This is how many bytes we'll send back in step 3.
  2. Read from the stream until we get an EOF (client's write side was closed).
  3. Send the number of bytes defined in step 1 back to the client. This MUST NOT be run concurrently with step 2.
  4. Close the stream.
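
A non-normative client-side sketch in Go, assuming a connected go-libp2p host h and server peer ID srv; the helper names runPerf and zeroReader are assumptions for the example:

package main

import (
	"context"
	"encoding/binary"
	"io"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
)

// zeroReader supplies an endless stream of zero bytes for the upload.
type zeroReader struct{}

func (zeroReader) Read(p []byte) (int, error) {
	for i := range p {
		p[i] = 0
	}
	return len(p), nil
}

// runPerf performs one perf round: upload `up` bytes, then read `down` back.
func runPerf(ctx context.Context, h host.Host, srv peer.ID, up, down uint64) error {
	s, err := h.NewStream(ctx, srv, "/perf/1.0.0") // step 1: open the stream
	if err != nil {
		return err
	}
	defer s.Close()

	// Step 2: tell the server how many bytes to send back (big-endian uint64).
	if err := binary.Write(s, binary.BigEndian, down); err != nil {
		return err
	}
	// Step 3: upload `up` bytes (zeros are a fine predetermined pattern).
	if _, err := io.CopyN(s, zeroReader{}, int64(up)); err != nil {
		return err
	}
	// Step 4: close our write side so the server sees EOF.
	if err := s.CloseWrite(); err != nil {
		return err
	}
	// Step 5: read back the bytes requested in step 2.
	_, err = io.Copy(io.Discard, s)
	return err
}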

Benchmarks

The above protocol is flexible enough to run the following benchmarks and more. The exact specifics of the benchmark (e.g. how much data to download or for how long) are left up to the benchmark implementation. Consider these rough guidelines for how to run one of these benchmarks.

Other benchmarks can be run with the same protocol. The following benchmarks have immediate usefulness, but other benchmarks can be added as we find them useful. Consult the QUIC Performance Internet-Draft for some other benchmarks (called scenarios in the document).

Single connection throughput

For an upload test, the client sets the server response size to 0 bytes, writes some amount of data, and closes the stream.

For a download test, the client sets the server response size to N bytes and closes the write side of the stream.

The measurements are gathered and reported by the client by dividing the number of bytes transferred by the total time from stream open to stream close.

A timer based variant is also possible where we see how much data a client can upload or download within a specific time. For upload it's the same as before and the client closes the stream after the timer ends. For download, the client should request a response size of max uint64, then close the stream after the timer ends.

Handshakes per second

This benchmark measures connection setup efficiency. A transport that takes many RTTs will perform worse here than one that takes fewer.

To run this benchmark:

  1. Set up N clients
  2. Each client opens K connections/s to a single server
  3. Once a connection is established, the client closes it and establishes another one.

Handshakes per second are calculated by taking the total number of connections successfully established and dividing it by the duration of the test.

Security Considerations

Since this protocol lets clients ask servers to do significant work, it SHOULD NOT be enabled by default in any implementation. Users are advised not to enable this on publicly reachable nodes.

Authenticating by Peer ID could mitigate the security concern by only allowing trusted clients to use the protocol. Support for this is left to the implementation.

Prior Art

As mentioned above, this document is inspired by Nick Banks' QUIC Performance Internet-Draft

iperf

@mxinden's libp2p perf

@marten-seemann's libp2p perf test

@vyzo's libp2p perf test

RFC 0001: Text Peer Ids as CIDs

Abstract

This is an RFC to modify the Peer Id spec to alter the default string representation from Multihash to CIDv1 in Base32 and to support encoding/decoding text Peer Ids as CIDs.

Motivation

  1. Current text representation of Peer Id (multihash in Base58btc) is case-sensitive. This means we can't use it in case-insensitive contexts such as domain names (RFC1035 + RFC1123) or FAT filesystems.
  2. CIDs provide multibase support, and base32 makes a safe default that will work in case-insensitive contexts, enabling us to put Peer Ids in domains or create files with Peer Ids as names.
  3. It's much easier to upgrade wire protocols than text. This RFC makes Peer Ids in text form fully self describing, making them more future-proof. A dedicated multicodec in text-encoded CID will indicate that it's a hash of a libp2p public key.

Detailed design

  1. Switch text encoding and decoding of Peer Ids from Multihash to CID.
  2. The new text representation should be CIDv1 with additional requirements (a sketch follows this list):
    • MUST have multicodec set to libp2p-key (0x72)
    • SHOULD have multibase set to base32 (Base32 without padding, as specified by RFC4648)
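
A non-normative sketch of the conversion using the go-cid and multiformats Go libraries; the helper name peerIDToCIDv1 is an assumption for the example. A binary peer ID is already a multihash of the public key, so it can be cast directly:

package main

import (
	"github.com/ipfs/go-cid"
	mbase "github.com/multiformats/go-multibase"
	mh "github.com/multiformats/go-multihash"
)

// peerIDToCIDv1 re-encodes a binary peer ID as a CIDv1 string with the
// libp2p-key multicodec (0x72), rendered in base32.
func peerIDToCIDv1(peerID []byte) (string, error) {
	h, err := mh.Cast(peerID) // validate that the bytes are a multihash
	if err != nil {
		return "", err
	}
	c := cid.NewCidV1(cid.Libp2pKey, h)
	return c.StringOfBase(mbase.Base32)
}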

Upgrade path

  1. Release support for reading Peer Id represented with CIDv1
  2. Wait three months or until the next release (whichever comes first)
  3. Switch the default Peer Id output format to CIDv1 in Base32

Backward compatibility

The old text representation (Multihash encoded as base58btc) is a valid CIDv0 and does not require any special handling.

Alternatives

We could just add a multibase prefix to multihash, but that requires more work and introduces a new format. This option was rejected as using CID enables reuse of existing serializers/deserializers and does not create any new standards.

Unresolved questions

This RFC punts pids-as-cids on the wire down the road but that's something we can revisit if it ever becomes relevant.

RFC 0002 - Signed Envelopes

Abstract

This RFC proposes a "signed envelope" structure that contains an arbitrary byte string payload, a signature of the payload, and the public key that can be used to verify the signature.

This was spun out of an earlier draft of the address records RFC, since it's generically useful.

Problem Statement

Sometimes we'd like to store some data in a public location (e.g. a DHT, etc), or make use of potentially untrustworthy intermediaries to relay information. It would be nice to have an all-purpose data container that includes a signature of the data, so we can verify that the data came from a specific peer and that it hasn't been tampered with.

Domain Separation

Signatures can be used for a variety of purposes, and a signature made for a specific purpose MUST NOT be considered valid for a different purpose.

Without this property, an attacker could convince a peer to sign a payload in one context and present it as valid in another, for example, presenting a signed address record as a pubsub message.

We separate signatures into "domains" by prefixing the data to be signed with a string unique to each domain. This string is not contained within the payload or the outer envelope structure. Instead, each libp2p subsystem that makes use of signed envelopes will provide their own domain string when constructing the envelope, and again when validating the envelope. If the domain string used to validate is different from the one used to sign, the signature validation will fail.

Domain strings may be any valid UTF-8 string, but should be fairly short and descriptive of their use case, for example "libp2p-routing-record".

Payload Type Information

The envelope record can contain an arbitrary byte string payload, which will need to be interpreted in the context of a specific use case. To assist in "hydrating" the payload into an appropriate domain object, we include a "payload type" field. This field consists of a multicodec code, optionally followed by an arbitrary byte sequence.

This allows very compact type hints that contain just a multicodec, as well as "path" multicodecs of the form /some/thing, using the "namespace" multicodec, whose binary value is equivalent to the UTF-8 / character.

Use of the payload type field is encouraged, but the field may be left empty without invalidating the envelope.

Wire Format

Since we already have a protobuf definition for public keys, we can use protobuf for this as well and easily embed the key in the envelope:

syntax = "proto3";

package record.pb;

// Envelope encloses a signed payload produced by a peer, along with the public
// key of the keypair it was signed with so that it can be statelessly validated
// by the receiver.
//
// The payload is prefixed with a byte string that determines the type, so it
// can be deserialized deterministically. Often, this byte string is a
// multicodec.
message Envelope {
  // public_key is the public key of the keypair the enclosed payload was
  // signed with.
  PublicKey public_key = 1;

  // payload_type encodes the type of payload, so that it can be deserialized
  // deterministically.
  bytes payload_type = 2;

  // payload is the actual payload carried inside this envelope.
  bytes payload = 3;

  // signature is the signature produced by the private key corresponding to
  // the enclosed public key, over the payload, prefixing a domain string for
  // additional security.
  bytes signature = 5;
}

The public_key field contains the public key whose secret counterpart was used to sign the message. This MUST be consistent with the peer id of the signing peer, as the recipient will derive the peer id of the signer from this key.

The payload_type field contains a multicodec-prefixed type indicator as described in the Payload Type Information section.

The payload field contains the arbitrary byte string payload.

The signature field contains a signature of all fields except public_key, generated as described below.

Signature Production / Verification

When signing, a peer will prepare a buffer by concatenating the following:

  • The length of the domain separation string in bytes
  • The domain separation string, encoded as UTF-8
  • The length of the payload_type field in bytes
  • The value of the payload_type field
  • The length of the payload field in bytes
  • The value of the payload field

The length values for each field are encoded as unsigned variable-length integers as defined in the multiformats uvarint spec.

Then they will sign the buffer according to the rules in the peer id spec and set the signature field accordingly.

To verify, a peer will "inflate" the public_key into a domain object that can verify signatures, prepare a buffer as above and verify the signature field against it.
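
A non-normative sketch of the buffer construction; the helper name unsignedBuffer is an assumption, and lengths use the multiformats uvarint encoding:

package main

import "encoding/binary"

// unsignedBuffer concatenates the length-prefixed fields in spec order:
// domain separation string, payload_type, payload.
func unsignedBuffer(domain string, payloadType, payload []byte) []byte {
	var buf []byte
	for _, field := range [][]byte{[]byte(domain), payloadType, payload} {
		buf = binary.AppendUvarint(buf, uint64(len(field))) // uvarint length prefix
		buf = append(buf, field...)
	}
	return buf
}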

RFC 0003 - Peer Routing Records

Abstract

This RFC proposes a method for distributing peer routing records, which contain a peer's publicly reachable listen addresses, and may be extended in the future to contain additional metadata relevant to routing. This serves a similar purpose to Ethereum Node Records. Like ENR records, libp2p routing records should be extensible, so that we can add information relevant to as-yet unknown use cases.

The record described here does not include a signature, but it is expected to be serialized and wrapped in a signed envelope, which will prove the identity of the issuing peer. The dialer can then prioritize self-certified addresses over addresses from an unknown origin.

Problem Statement

All libp2p peers keep a "peer store", which maps peer ids to a set of known addresses for each peer. When the application layer wants to contact a peer, the dialer will pull addresses from the peer store and try to initiate a connection on one or more addresses.

Addresses for a peer can come from a variety of sources. If we have already made a connection to a peer, the libp2p identify protocol will inform us of other addresses that they are listening on. We may also discover their address by querying the DHT, checking a fixed "bootstrap list", or perhaps through a pubsub message or an application-specific protocol.

In the case of the identify protocol, we can be fairly certain that the addresses originate from the peer we're speaking to, assuming that we're using a secure, authenticated communication channel. However, more "ambient" discovery methods such as DHT traversal and pubsub depend on potentially untrustworthy third parties to relay address information.

Even in the case of receiving addresses via the identify protocol, our confidence that the address came directly from the peer is not actionable, because the peer store does not track the origin of an address. Once added to the peer store, all addresses are considered equally valid, regardless of their source.

We would like to have a means of distributing verifiable address records, which we can prove originated from the addressed peer itself. We also need a way to track the "provenance" of an address within libp2p's internal components such as the peer store. Once those pieces are in place, we will also need a way to prioritize addresses based on their authenticity, with the most strict strategy being to only dial certified addresses.

Complications

While producing a signed record is fairly trivial, there are a few aspects to this problem that complicate things.

  1. Addresses are not static. A given peer may have several addresses at any given time, and the set of addresses can change at arbitrary times.
  2. Peers may not know their own addresses. It's often impossible to automatically infer one's own public address, and peers may need to rely on third party peers to inform them of their observed public addresses.
  3. A peer may inadvertently or maliciously sign an address that they do not control. In other words, a signature isn't a guarantee that a given address is valid.
  4. Some addresses may be ambiguous. For example, addresses on a private subnet are valid within that subnet but are useless on the public internet.

The first point can be addressed by having records contain a sequence number that increases monotonically when new records are issued, and by having newer records replace older ones.

The other points, while worth thinking about, are out of scope for this RFC. However, we can take care to make our records extensible so that we can add additional metadata in the future. Some thoughts along these lines are in the Future Work section below.

Address Record Format

Here's a protobuf that might work:

syntax = "proto3";

package peer.pb;

// PeerRecord messages contain information that is useful to share with other peers.
// Currently, a PeerRecord contains the public listen addresses for a peer, but this
// is expected to expand to include other information in the future.
//
// PeerRecords are designed to be serialized to bytes and placed inside of
// SignedEnvelopes before sharing with other peers.
message PeerRecord {

  // AddressInfo is a wrapper around a binary multiaddr. It is defined as a
  // separate message to allow us to add per-address metadata in the future.
  message AddressInfo {
    bytes multiaddr = 1;
  }

  // peer_id contains a libp2p peer id in its binary representation.
  bytes peer_id = 1;

  // seq contains a monotonically-increasing sequence counter to order PeerRecords in time.
  uint64 seq = 2;

  // addresses is a list of public listen addresses for the peer.
  repeated AddressInfo addresses = 3;
}

The AddressInfo wrapper message is used instead of a bare multiaddr to allow us to extend addresses with additional metadata in the future.

The seq field contains a sequence number that MUST increase monotonically as new records are created. Newer records MUST have a higher seq value than older records. To avoid persisting state across restarts, implementations MAY use unix epoch time as the seq value, however they MUST NOT attempt to interpret a seq value from another peer as a valid timestamp.
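
For illustration, a minimal non-normative sketch of the epoch-based approach; the helper name nextSeq is hypothetical:

package main

import "time"

// nextSeq derives a seq value from unix epoch time, avoiding persisted
// state across restarts. Note that two records issued within the same
// second would share a value. Receivers MUST NOT interpret another
// peer's seq as a timestamp.
func nextSeq() uint64 {
	return uint64(time.Now().Unix())
}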

Example

  {
    peer_id: "QmAlice...",
    seq: 1570215229,
    addresses: [
      {
        multiaddr: "/ip4/192.0.2.0/tcp/42/p2p/QmAlice",
      },
      {
        multiaddr: "/ip4/198.51.100.0/tcp/42/p2p/QmAlice",
      }
    ]
  }

A peer SHOULD only include addresses that it believes are routable via the public internet, ideally having confirmed that this is the case via some external mechanism such as a successful AutoNAT dial-back.

In some cases we may want to include localhost or LAN-local addresses; for example, when testing the DHT using many processes on a single machine. To support this, implementations may use a global runtime configuration flag or environment variable to control whether local addresses will be included.

Certification / Verification

This structure can be serialized and contained in a signed envelope, which lets us issue "self-certified" address records that are signed by the peer that the addresses belong to.

To produce a "self-certified" address, a peer will construct a RoutingState containing their listen addresses and serialize it to a byte array using a protobuf encoder. The serialized records will then be wrapped in a signed envelope, which is signed with the libp2p peer's private host key. The corresponding public key MUST be included in the envelope's public_key field.

When receiving a RoutingState wrapped in a signed envelope, a peer MUST validate the signature before deserializing the RoutingState record. If the signature is invalid, the envelope MUST be discarded without deserializing the envelope payload.

Once the signature has been verified and the RoutingState has been deserialized, the receiving peer MUST verify that the peer_id contained in the RoutingState matches the public_key from the envelope. If the public key in the envelope cannot derive the peer id contained in the routing state record, the RoutingState MUST be discarded.
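
A non-normative sketch of that final check using go-libp2p's peer package; envelope signature verification and payload deserialization (not shown) MUST happen first, and the helper name checkPeerID is an assumption:

package main

import (
	"errors"

	"github.com/libp2p/go-libp2p/core/crypto"
	"github.com/libp2p/go-libp2p/core/peer"
)

// checkPeerID verifies that the peer_id carried in the RoutingState matches
// the public key from the envelope that signed it.
func checkPeerID(envelopeKey crypto.PubKey, statePeerID []byte) error {
	derived, err := peer.IDFromPublicKey(envelopeKey)
	if err != nil {
		return err
	}
	claimed, err := peer.IDFromBytes(statePeerID)
	if err != nil {
		return err
	}
	if derived != claimed {
		return errors.New("peer_id does not match envelope public key; discard the RoutingState")
	}
	return nil
}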

Signed Envelope Domain

Signed envelopes require a "domain separation" string that defines the scope or purpose of a signature.

When wrapping a RoutingState in a signed envelope, the domain string MUST be libp2p-routing-state.

Signed Envelope Payload Type

Signed envelopes contain a payload_type field that indicates how to interpret the contents of the envelope.

Ideally, we should define a new multicodec for routing records, so that we can identify them in a few bytes. While we're still spec'ing and working on the initial implementation, we can use the UTF-8 string "/libp2p/routing-state-record" as the payload_type value.

Peer Store APIs

We will need to add a few methods to the peer store:

  • AddCertifiedAddrs(envelope) -> Maybe<Error>

    • Add a self-certified address, wrapped in a signed envelope. This should validate the envelope signature & store the envelope for future reference. If any certified addresses already exist for the peer, only accept the new envelope if it has a greater seq value than existing envelopes.
  • CertifiedAddrs(peer_id) -> Set<Multiaddr>

    • return the set of self-certified addresses for the given peer id
  • SignedRoutingState(peer_id) -> Maybe<SignedEnvelope>

    • retrieve the signed envelope that was most recently added to the peerstore for the given peer, if any exists.

And possibly:

  • IsCertified(peer_id, multiaddr) -> Boolean
    • has a particular address been self-certified by the given peer?

We'll also need a method that constructs a new RoutingState containing our listen addresses and wraps it in a signed envelope. This may belong on the Host instead of the peer store, since it needs access to the private signing key.

When adding records to the peerstore, a receiving peer MUST keep track of the latest seq value received for each peer and reject incoming RoutingState messages unless they contain a greater seq value than the last received.

After integrating the information from the RoutingState into the peerstore, implementations SHOULD retain the original signed envelope. This will allow other libp2p systems to share signed RoutingState records with other peers in the network, preserving the signature of the issuing peer. The Exchanging Records section lists some systems that would need to retrieve the original signed record from the peerstore.

Dialing Strategies

Once self-certified addresses are available via the peer store, we can update the dialer to prefer using them when possible. Some systems may want to only dial self-certified addresses, so we should include some configuration options to control whether non-certified addresses are acceptable.

Exchanging Records

We currently have several systems in libp2p that deal with peer addressing and could be updated to use signed routing records.

Of these, the highest priority for updating seems to be the DHT, since it's actively used by several deployed systems and is vulnerable to routing attacks by malicious peers. We should work on extending the FIND_NODE, ADD_PROVIDER, and GET_PROVIDERS RPC messages to support returning signed records in addition to the unsigned address information they currently support.

We should also either define a new "secure peer routing" interface or extend the existing peer routing interfaces to support signed records, so that we don't end up with a bunch of similar but incompatible APIs for exchanging signed address records.

Future Work

Some things that were originally considered in this RFC were trimmed so that we can focus on delivering a basic self-certified record, which is a pressing need.

This includes a notion of "routability", which could be used to communicate whether a given address is global (reachable via the public internet), LAN-local, etc. We may also want to include some kind of confidence score or priority ranking, so that peers can communicate which addresses they would prefer other peers to use.

To allow these fields to be added in the future, we wrap multiaddrs in the AddressInfo message instead of having the addresses field be a list of "raw" multiaddrs.

Another potentially useful extension would be a compact protocol table or bloom filter that could be used to test whether a peer supports a given protocol before interacting with them directly. This could be added as a new field in the RoutingState message.