Skip to content

Latest commit

 

History

History
244 lines (200 loc) · 16.9 KB

reader-privacy.md

File metadata and controls

244 lines (200 loc) · 16.9 KB

Reader Privacy Preservation

wip

Author(s):

Maintainer(s): ​


Abstract

The lookup APIs provided by IPNI nodes are able to observe what data is being accessed by the clients. This is true regardless of whether the data itself is public or not. Because IPNI nodes continuously catalogue the content hosted by all the providers, and provide a central lookup API the need for reader privacy is amplified. This makes IPNI a difficult choice as an alternative routing system in projects such as IPFS, which use a more decentrailsed routing system that by nature reduces the possibility of mass query snooping. ​ Building on top of IPFS's reader privacy mechanism, a.k.a, double-hashing, this document specifies how a similar technique is applied to IPNI in order to preserve the reader's privacy while continuing to facilitate low-latency provider lookup. ​

Table of Contents

For more technical implementation details please see the Addendum. ​

Introduction

​ IPFS requires various privacy protections, which includes privacy for the content routing subsystem. Readers (clients accessing files) need privacy for the content they consume and writers (hosts storing and distributing content) need privacy for the content they publish. It is very easy for a content router node or a passive observer to learn which file is requested by which client during the routing process, as the potential adversary easily learns about the requested CID. A snooping actor could request the same CID and download the associated file to monitor the user’s behavior. This is obviously undesirable and has been for some time now a strong request from the community.

This specification describes how IPNI provides Readers Privacy. It prevents passive observers from tracking a user's actions as described above.

Non Goals

  • Writer, i.e. content provider or publisher, Privacy, which will be done in a separate specification;
  • Retrieval Privacy, which is out of scope for the content routing subsystem;
  • Rogue IPNI behaviour (explained in the Security section). ​

Background

​ IPNI builds its indexes by ingesting chains of Advertisements. Advertisement is a construct that allows Storage Providers to publish their CIDs in bulk (FIL deals) instead of doing that individually for each CID. A group of CIDs is represented by a ContextID that is unique per provider as can be seen on the diagram below: ​ Index building flow

Magic Values

All salts below are 64-bytes long, and represent a string padded with \x00.

  • SALT_DOUBLEHASH = bytes("CR_DOUBLEHASH\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00")
  • SALT_ENCRYPTIONKEY = bytes("CR_ENCRYPTIONKEY\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00")
  • SALT_NONCE = bytes("CR_NONCE\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00")

Definitions

  • Advertisement is IPNI Advertisement.
  • Storage Provider is a party who stores the data and wants that data to be discoverable through IPNI.
  • Publisher is a party who publishes multihashes, via Advertisements, to IPNI on behalf of a Storage Provider.
  • Client is a party who wants to find the content by its CID using IPNI for the purpose of retreiving from the Storage Provider.
  • Passive Observer is a snooping party that wants to understand what content is being looked up by observing Client-to-IPNI traffic.
  • enc is AESGCM encryption. The following notation will be used for the rest of the specification enc(passphrase, nonce, payload).
  • hash is SHA256 hashing.
  • || is concatenation of two values.
  • deriveKey is deriving a 32-byte encryption key from a passphrase that is done as hash(SALT_ENCRYPTIONKEY || passphrase).
  • Nonce is a 12-byte nonce used as Initialization Vector (IV) for the AESGCM encryption. IPNI expects an explicit instruction to delete a record (comparing to the DHT where records expire). Hence the IPNI server needs to be able to compare encrypted values without having to decrypt them as that would require a key that it is unaware of. That means that the nonce has to be deterministically chosen so that enc(passphrase, nonce, payload) produces the same output for the same passpharase + payload pair. Nonce must be calculated as hash(SALT_NONCE || passphrase || len(payload) || payload)[:12], where len(payload) is an 8-byte length of the payload encoded in Little Endian format. Choice of nonce is not enforced by the IPNI specification. The described approach will be used while IPNI encrypts Advertisements on behaf of Publishers. However once Writer Privacy is implemented, the choice of nonce will be left up to the Publisher.
  • CID is the Content IDentifier.
  • MH is the Multihash contained in a CID. It corresponds to the digest of a hash function over some content. MH is represented as a 32-byte array.
  • HASH2 is a second hash over the multihash. Second Hashes must be of Multihash format with DBL_SHA_256 codec. The digest must be calculated as hash(SALT_DOUBLEHASH || MH).
  • ProviderRecord is a data structure that contains such information about Storage Provider as ther PeerID and Addresses.
  • ProviderRecordKey is a concatentation of peerID || contextID. There is no need for explicitly encoding lengths as they are already encoded as a part of the multihash format.
  • EncProviderRecordKey is Nonce || enc(deriveKey(multihash), Nonce, ProviderRecordKey).
  • HashProviderRecordKey is a hash over ProviderRecordKey that must be calculated as hash(SALT_DOUBLEHASH || ProviderRecordKey).
  • Metadata is IPNI metadata that is supplied in Advertisements.
  • EncMetadata is Nonce || enc(deriveKey(ProviderRecordKey), Nonce, Metadata).

Specification

​ This specification improves Reader Privacy by making changes to the Step 3, depicted above, where the Client supplies the multihash (or CID) to IPNI in order to lookup corresponding Storage Providers.

  • A Client who wants to do a lookup will calculate HASH2 and use it for the lookup query;
  • In response to that IPNI will return a list of EncProviderRecordKeys that containencrypted peerID and contextID of the Storage Providers that have the content represented by the original MH. In order to make sense of that payload, a Passive Observer would need to get hold of the original MH that isn't revealed during this communication round;
  • Using the original MH, the Client decrypts EncProviderRecordKeys and uses the peerID to fetch a ProviderRecord. ProviderRecords can be cached on the Client side and hence this rountrip can be avoided most of the times. Peer addresses can also be discovered through alternative sources such as libp2p peerstore;
  • Using addresses from the ProviderRecord the Client will reach out to the Storage Provider directly and fetch the desired content;
  • The Client might choose to fetch IPNI Metadata that will require another lookup round by HashProviderRecordKey to get EncMetadata in response.

By utilising such scheme only a party that knows the original CID can decode the protocol, which is never revealed.

sequenceDiagram
    participant Client
    participant IPNI
    participant StorageProvider
    Client->>Client: calculates HASH2
    Client->>IPNI: sends a find request for HASH2
    IPNI->>Client: sends a list of [EncProviderRecordKey]
    loop EncProviderRecordKeys
        Client->>Client: decrypts EncProviderRecordKey and extracts peerID from it
        Client->>IPNI: [Optional] sends ProviderRecord lookup request for peerID
        IPNI->>Client: [Optional] sends a ProviderRecord
        Client->>IPNI: [Optional] sends EncMetadata lookup request for hash(ProviderRecordKey) 
        IPNI->>Client: [Optional] sends EncMetadata
        Client->>Client: [Optional] decrypts the EncMetadata using ProviderRecordKey
    end
    Client->>StorageProvider: reaches out to the Storage Provider for the desired content
Loading

Extended Providers

Extended Providers allow a Publisher to add an extra information to all their past and future Advertisements or to a single Advertisement with a specific ContextID. That can be done by sending just a single Advertisement without having to re-publish the whole Advertisement chain. If present Extended Providers are applied to the IPNI output on the server which results into more ProviderRecords being returned to the user. Same will not be possible for privacy preserving lookups as the required fields such as PeerID and ContextID are opaque to the server.

While the mechanics stays the same, applying Extended Providers to the decrypted values will have to be done at the Client side. If exist, Extended Providers should be included as a field in the ProviderRecord which would make them cacheable too.

Security

​ Security model of the Reader Privacy proposal boils down to inability of a Passive Observer to algorithmically derive the original MH from HASH2 that is used for IPNI lookups. IPNI Advertisments are not encrypted, but authenticated and contain plain multihash values in them. Before Writer Privacy is implemented a Passive Observer could build a map of HASH2 -> MH by re-ingesting Advertisements chain from each Publisher in order to collect all original multihashes which can then be used to decrypt EncProviderRecords and so on. Doing that will require significant resources as it involves crawling the entire network. Eliminating this requires a Writer Privacy upgrade.

Even with both Reader and Writer Privacies in place a rogue IPNI actor might abuse this security model. For example:

  • Someone wants to detect who is looking for a particular piece of content, i.e. surveilling content. For example, an IPNI endpoint that wants to know how frequently people are requesting some website it cares about;
  • Someone wants to do mass surveillance on readily accessible data. For example, a group running an IPNI endpoint also runs web crawlers looking for IPFS links, or runs a public HTTP gateway and can log those requests.

Rogue IPNI behaviour will be addressed by IPNI writer privacy and reputation system that is out of scope for this specification.

Reader Privacy is a first step towards fully private content routing protocol.

Wider security implications are discussed in the IPFS Reader Privacy specification: TODO update the link once the PR is merged.

Hashing and Encryption Function Upgrades

All multihashes contain a codec. If a hashing or encryption funciton will have to rotate then different types of multihahses can coexist together and can be processed differently by IPNI implementations. It will not be possible to apply a fix retroactivelly to the data returned by previous lookup requests, however IPNI implementations can start blocking all new ones that use a compromised scheme, allowing some transtition period.

Moving an IPNI implementation to a new hash / encryption function will require reingesting all data from the beginning of its publication. Before Writer Privacy is impemented the index can be migrated over to new functions by reingesting existing Advertisement chains. With Writer Privacy, Publishers will have to republish Advertisments using new functions (as the data in the Advertisements themselves will have to be re-hashed / re-encrypted). Both old and new scheme can coexist together for some time. The old scheme should be retired either immediately or once the indexes have been rebuilt and the users have been migrated over.

An exact operational procedure will be different for differnet IPNI implementations.

Trade Offs

  • Multiple lookups. In the best case scenario Reader Privacy protocol will require one roundtrip to get a list of peers for a given CID. Worst case scenario, when both ProviderRecords and Metadatas need to be fetched from IPNI, will require 3+ lookups.

  • Extra compute. At minimum, Clients must perform an extra hash computation per CID and decryption per ProviderRecordKey that will add some overhead to each lookup.

  • Extra storage space. Storing encrypted data will require more space due to padding, nonce, and the addition of an encrypted provider key per multihash;

  • Bulk deletes. Encrypted PeerID will be different for each multihash and hence bulk delete operations (delete everything for a provider X) will not be possible. Such deletion will require a garbage collection mechanism that rereads deleted advertisements and deletes the HASH2 for all multihashes.

  • Operational overhead. Reader Privacy roll out will be a gradual process as many clients will have to migrate over. During the transition period IPNI implementations will have to serve both plain and hashed lookups. That will involve either serving hashed and regular queries from the same instances. Servers will have to do decryption on behalf of their clients using a plain multihash that has been provided in the lookup request;

  • Data Migration. Existing indexes will have to re-ingest all index content that they want to provide Reader Privacy for.

Threat Modelling

There are three actors involved into the IPNI workflow: Publisher, Client and IPNI. Publishers makes update to indexes by publishing Advertisements. Advertisements are signed by their Publishers and can be authenticated. Advertisements are organised in a chain and are ingested strictly in order. It's not possible to reorder Advertisements wihtout having to fork the chain. Advertisements processing is idempotent - re-ingesting the same Advertismeent twice doesn't affect IPNI state. The IPNI specification is agnostic to transport protocols so particular protocol choice is up to the implementation. Compromised Publisher's identity is out of scope of this specification.

Clients consume index by performing CID (multihash) lookups. Additional hashing and encryption aims to prevent a Passive Observer from being able to infer what data is being looked up by spying at the Client-to-IPNI traffic. Withouty a writer privacy solution, a malicious indexer that keeps a map of HASH2->MH could expose a client. Therefore communicaton protocol between client and IPNI should be chosen carefully to prevent MITM attacks.

EncProviderRecords do not contain any authentication data. It is possible for a malicious IPNI to
present a wrong dataset to the Client. Clients can tackle that by excluding such IPNIs from their pool. Returning wrong datasets will eventually affect the IPNI's reputation score when a reputation tracking system is available. Data integrity is built into IPFS - Clients can verify that the data returned by a Storage Provider matches the CID. So even if an IPNI is compromised the data itself is not compromised.

Data Formats

All binary data must be b58 encoded when transferred over the wire.

Related Resources

​ TODO: link to corresponding IPFS spec once materialised. ​

Copyright

​ Copyright and related rights waived via CC0.