Clarification on Phillips 2021 dataset processing #9

@s-canchi

Hi, thank you for putting together FLAb. It is a great resource. While working with the Phillips et al. 2021
binding affinity data (phillips2021binding_*.csv), I had a few questions about how the data was processed
and wanted to check my understanding.

  1. Genotype filtering

The four Phillips CSVs appear to contain only genotypes where the first position is '1' (i.e., the first
mutation is the somatic allele). For example, phillips2021binding_cr9114_h3_kd.csv has 32,768 rows, all
starting with '1', which is exactly 2^15 (half of the full 2^16 = 65,536 combinatorial library). The same
pattern holds for the other three files.

Was this filtering intentional? The original data from the paper
contains the full genotype space including the germline sequence (all-0 genotype). I could not find
documentation for this in the README or metadata files, so I wanted to confirm.
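
For reference, here is a minimal sketch of the check behind the counts above. It runs on a synthetic 16-position library rather than on the FLAb files themselves, and the function name `first_position_counts` is my own. Applying it to the real CSVs would mean passing in whichever column holds the genotype strings, which I am not assuming a name for here.

```python
from collections import Counter
from itertools import product


def first_position_counts(genotypes):
    """Count how many genotype strings start with each character."""
    return Counter(g[0] for g in genotypes if g)


# Synthetic stand-in for the full combinatorial library: all 16-bit genotypes.
full_library = ["".join(bits) for bits in product("01", repeat=16)]

# Mimic the pattern observed in the FLAb CSVs: keep only genotypes starting with '1'.
filtered = [g for g in full_library if g.startswith("1")]

counts_full = first_position_counts(full_library)
counts_filtered = first_position_counts(filtered)

print(len(full_library), dict(counts_full))
print(len(filtered), dict(counts_filtered))
print("germline present after filtering:", "0" * 16 in filtered)
```

On the synthetic library, the filtered set has 2^15 = 32,768 entries and no longer contains the all-0 germline genotype, which matches what I see in the four Phillips files.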

  2. Flu B antigen

Phillips et al. measured CR9114 binding against three antigens: H1, H3, and Flu B. The Flu B data does not
appear to be included in FLAb. Was this excluded deliberately (perhaps because only 198/65,536 variants show
measurable binding)?

  3. Metadata labels

In flab_metadata.csv, the Phillips entries list the assay as SPR Kd and the units as -log( Kd [nM]) Fab.
My reading of the paper is that the measurement method is Tite-Seq (flow cytometry + deep sequencing)
rather than SPR, and that the antibody format is scFv (single-chain variable fragment on yeast display)
rather than Fab. Could you confirm whether these labels are correct, or if they should be updated?

Thanks again for maintaining this resource. Happy to discuss further.
