A small step to add icode (insertion code) support and mmCIF support #212

Open
HUSRCF wants to merge 5 commits into jensengroup:master from HUSRCF:master

HUSRCF commented Jun 9, 2026

Summary

This PR adds native PDBx/mmCIF input support using gemmi, while preserving the existing PDB parsing and downstream PROPka calculation flow.

Main goals are:

  • support .cif / .mmcif inputs directly;
  • remove PDB fixed-width assumptions from internal identity handling;
  • correctly distinguish insertion-code residues;
  • support mmCIF chain IDs and residue/atom numbers that exceed legacy PDB limits;
  • keep the existing MolecularContainer -> ConformationContainer -> Group -> pKa pipeline unchanged where possible.

Main changes

Native mmCIF parser

Added read_mmcif() to propka/input.py, implemented with gemmi.

The parser reads _atom_site records and projects them into existing Atom objects. It handles:

  • ATOM / HETATM
  • auth_* fields with fallback to label_*
  • multi-character chain IDs
  • residue numbers beyond PDB fixed-width limits
  • atom IDs beyond PDB serial limits
  • insertion codes from _atom_site.pdbx_PDB_ins_code
  • model numbers from _atom_site.pdbx_PDB_model_num
  • altloc IDs from _atom_site.label_alt_id
  • required-field validation for atom identity and coordinates

The mmCIF path then reuses the existing downstream setup.
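The field handling listed above can be illustrated without gemmi. The following is a minimal sketch, not the code in this PR: it assumes each _atom_site row has already been read into a plain dict of strings, and the helper names pick and project_row are hypothetical. The column names are standard mmCIF items; defaulting a missing model number to 1 is an assumption made for the example.

```python
from typing import Mapping, Optional

# mmCIF marks inapplicable values with "." and unknown values with "?".
PLACEHOLDERS = {"", ".", "?"}


def pick(row: Mapping[str, str], *names: str) -> Optional[str]:
    """Return the first real (non-placeholder) value among the given columns."""
    for name in names:
        value = row.get(name)
        if value is not None and value not in PLACEHOLDERS:
            return value
    return None


def project_row(row: Mapping[str, str]) -> dict:
    """Project one _atom_site row into the identity and coordinate fields."""
    required = {
        # Prefer author-assigned (auth_*) values, fall back to label_*.
        "chain_id": pick(row, "auth_asym_id", "label_asym_id"),
        "res_num": pick(row, "auth_seq_id", "label_seq_id"),
        "res_name": pick(row, "auth_comp_id", "label_comp_id"),
        "atom_name": pick(row, "auth_atom_id", "label_atom_id"),
        "x": pick(row, "Cartn_x"),
        "y": pick(row, "Cartn_y"),
        "z": pick(row, "Cartn_z"),
    }
    missing = [name for name, value in required.items() if value is None]
    if missing:
        raise ValueError("_atom_site row lacks required fields: " + ", ".join(missing))
    return {
        "chain_id": required["chain_id"],       # may be multi-character
        "res_num": int(required["res_num"]),    # no fixed-width limit
        "res_name": required["res_name"],
        "atom_name": required["atom_name"],
        "x": float(required["x"]),
        "y": float(required["y"]),
        "z": float(required["z"]),
        "icode": pick(row, "pdbx_PDB_ins_code") or "",
        # Assumption for this sketch: a missing model number means model 1.
        "model": int(pick(row, "pdbx_PDB_model_num") or "1"),
    }


example_row = {
    "group_PDB": "ATOM",
    "auth_asym_id": "AA", "label_asym_id": "B",
    "auth_seq_id": "?", "label_seq_id": "10048",
    "label_comp_id": "GLU", "label_atom_id": "CA",
    "Cartn_x": "1.5", "Cartn_y": "-2.0", "Cartn_z": "0.25",
    "pdbx_PDB_ins_code": "A",
}
example_atom = project_row(example_row)
```

In this example the chain ID comes from auth_asym_id, while the residue number falls back to label_seq_id because the auth value is the "?" placeholder.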

Insertion-code aware identity

Added structured identity properties to Atom:

atom.residue_key = (chain_id, res_num, icode)
atom.atom_key = (chain_id, res_num, icode, atom_name)

These are now used for internal matching where string display labels were previously used.
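A rough sketch of how such keys behave, using a stand-in class rather than propka's actual Atom (the dataclass and its field names here are assumptions made for illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomStub:
    """Stand-in for propka's Atom, holding only the identity fields."""
    chain_id: str
    res_num: int
    icode: str
    atom_name: str

    @property
    def residue_key(self):
        # Insertion code is part of the residue identity.
        return (self.chain_id, self.res_num, self.icode)

    @property
    def atom_key(self):
        return (*self.residue_key, self.atom_name)


atoms = [
    AtomStub("E", 48, "A", "CA"),
    AtomStub("E", 48, "B", "CA"),
    AtomStub("E", 48, "C", "CA"),
]

# With icode in the key, the three residues stay distinct.
distinct_residues = {atom.residue_key for atom in atoms}

# A (chain, resnum) key without icode collapses them into one slot.
collapsed_residues = {(atom.chain_id, atom.res_num) for atom in atoms}
```

Because the keys are plain tuples, they can be compared, hashed, and used as dict keys without parsing a display string.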

Updated affected logic in:

  • conformation top-up

  • molecular top-up

  • group lookup

  • desolvation same-residue exclusion

  • generated hydrogen identity

  • Group.__eq__

  • Iterative.__eq__

  • hydrogen protonation grouping

This prevents residues such as E 48A, E 48B, and E 48C (covered by a test case) from being collapsed into the same (chain, resnum) identity.

Safer labels for display

Insertion codes are now shown with an explicit separator in display labels:

chain XX + icode A -> ...XX:A
chain XXA, no icode -> ...XXA

This avoids ambiguity between multi-character mmCIF chain IDs and insertion codes.
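The ambiguity can be shown with a small sketch. The helper name chain_icode_suffix is hypothetical, and it builds only the trailing part of the label shown above; the rest of the label format is left out.

```python
def chain_icode_suffix(chain_id: str, icode: str = "") -> str:
    """Build the trailing chain/icode part of a display label."""
    icode = icode.strip()
    # Only add the separator when an insertion code is present.
    return f"{chain_id}:{icode}" if icode else chain_id


with_icode = chain_icode_suffix("XX", "A")   # chain XX, icode A
without_icode = chain_icode_suffix("XXA")    # chain XXA, no icode

# Plain concatenation would give the same text for both cases.
naive_with_icode = "XX" + "A"
naive_without_icode = "XXA" + ""
```

With the separator the two cases produce different strings, whereas plain concatenation yields "XXA" for both.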

Sorting without single-character chain assumptions

The old atom sorting key used numeric packing with ord(chain_id), which only works for one-character PDB chain IDs.
This PR replaces it with tuple sorting:

(chain_id, res_num, icode, element_order)

This supports mmCIF chain IDs such as AA and AB, and also makes the code more readable.
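A minimal sketch of tuple-based sorting follows. The AtomRecord type and the ELEMENT_ORDER table are hypothetical stand-ins, since the actual element ordering used by propka is not shown in this description.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    chain_id: str
    res_num: int
    icode: str
    element: str


# Hypothetical ordering table, for illustration only.
ELEMENT_ORDER = {"N": 0, "C": 1, "O": 2}


def sort_key(atom: AtomRecord):
    # Tuples compare field by field, so no numeric packing is needed
    # and chain IDs of any length sort as ordinary strings.
    element_order = ELEMENT_ORDER.get(atom.element, len(ELEMENT_ORDER))
    return (atom.chain_id, atom.res_num, atom.icode, element_order)


atoms = [
    AtomRecord("AB", 1, "", "C"),
    AtomRecord("AA", 10048, "", "N"),
    AtomRecord("AA", 48, "B", "O"),
    AtomRecord("AA", 48, "A", "C"),
    AtomRecord("AA", 48, "", "N"),
]
ordered = sorted(atoms, key=sort_key)
```

By contrast, calling ord() on a two-character string such as "AA" raises TypeError, which is why a key built on ord(chain_id) cannot handle multi-character chain IDs. In this sketch an empty insertion code sorts before "A" because the empty string compares lowest.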

Tests and reproducibility

Added dry-run tests for:

  • mmCIF atom-site parsing
  • large atom/residue identifiers
  • multi-character chain IDs
  • insertion-code preservation
  • required mmCIF field validation
  • invalid optional model number diagnostics
  • conformation top-up with insertion-code-separated residues
  • multi-character chain atom sorting

Added 3SGB comparison fixtures:

tests/pdb/3SGB_noicode.pdb
tests/results/3SGB_new.dat
tests/results/3SGB_new_noicode.dat

And a reproduction script:

tests/reproduce_3sgb_icode_results.sh

The script regenerates both new result files and verifies that:

  • 3SGB_new_noicode.dat matches the legacy 3SGB.dat;
  • 3SGB_new.dat differs from the legacy 3SGB.dat only at residues where insertion codes appear, which is where insertion-code-aware behavior changes the pKa values.

This confirms that the legacy reference behavior is reproduced when insertion codes are explicitly removed, while the new parser correctly preserves insertion-code residue identity.

In addition, I ran some mmCIF/PDB cross-verification checks in my fork. To keep this PR small, I did not include them here. They are available in my fork:

https://github.com/HUSRCF/propka2.git

on the main branch, and can be run with:

./tests/cif_pdb_cross_verify/verify_pdb_cif_consistency.sh

Notes

This PR intentionally treats insertion code as part of residue identity. This can change pKa values for structures where insertion-code residues were previously collapsed into the same chain/residue-number slot.

HUSRCF commented Jun 9, 2026

Thank you very much for your work, which has been extremely useful in my development. I look forward to your reply and comments!
