A small step to add icode (insertion code) support and mmCIF support by HUSRCF · Pull Request #212 · jensengroup/propka

HUSRCF · 2026-06-09T17:25:12Z

Summary

This PR adds native PDBx/mmCIF input support using gemmi, while preserving the existing PDB parsing and downstream PROPka calculation flow.

Main goals are:

support .cif / .mmcif inputs directly;
remove PDB fixed-width assumptions from internal identity handling;
correctly distinguish insertion-code residues;
support mmCIF chain IDs and residue/atom numbers that exceed legacy PDB limits;
keep the existing MolecularContainer -> ConformationContainer -> Group -> pKa pipeline unchanged where possible.

Main changes

Native mmCIF parser

Added read_mmcif() in propka/input.py, implemented with gemmi.

The parser reads _atom_site records and projects them into existing Atom objects. It handles:

ATOM / HETATM
auth_* fields with fallback to label_*
multi-character chain IDs
residue numbers beyond PDB fixed-width limits
atom IDs beyond PDB serial limits
insertion codes from _atom_site.pdbx_PDB_ins_code
model numbers from _atom_site.pdbx_PDB_model_num
altloc IDs from _atom_site.label_alt_id
required-field validation for atom identity and coordinates

The mmCIF path then reuses the existing downstream setup.
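The field handling listed above can be illustrated with a minimal sketch. This is not the code from this PR: it does not call gemmi, it takes one _atom_site row as a plain dict of strings, and the names AtomSiteIdentity, project_atom_site_row, _pick and _optional are hypothetical. It only shows the auth_* to label_* fallback, the treatment of the mmCIF placeholders "?" and ".", and required-field validation.

```python
from typing import Dict, NamedTuple

# mmCIF uses "." (inapplicable) and "?" (unknown) as placeholders.
MISSING = {"", ".", "?"}


class AtomSiteIdentity(NamedTuple):
    """Hypothetical container for the fields projected from one row."""
    group: str
    serial: int
    chain_id: str
    res_name: str
    res_num: int
    icode: str
    atom_name: str
    altloc: str
    model: int
    x: float
    y: float
    z: float


def _pick(row: Dict[str, str], auth_tag: str, label_tag: str) -> str:
    """Return the auth_* value, falling back to label_* when it is missing."""
    for tag in (auth_tag, label_tag):
        value = row.get(tag, "")
        if value not in MISSING:
            return value
    raise ValueError(f"_atom_site row lacks both {auth_tag} and {label_tag}")


def _optional(row: Dict[str, str], tag: str, default: str = "") -> str:
    """Return an optional value, mapping mmCIF placeholders to a default."""
    value = row.get(tag, "")
    return default if value in MISSING else value


def project_atom_site_row(row: Dict[str, str]) -> AtomSiteIdentity:
    # Atom serial and coordinates have no fallback, so validate them first.
    for tag in ("id", "Cartn_x", "Cartn_y", "Cartn_z"):
        if row.get(tag, "") in MISSING:
            raise ValueError(f"_atom_site row lacks required field {tag}")
    return AtomSiteIdentity(
        group=_optional(row, "group_PDB", "ATOM"),
        serial=int(row["id"]),  # plain int, no PDB serial-width limit
        chain_id=_pick(row, "auth_asym_id", "label_asym_id"),
        res_name=_pick(row, "auth_comp_id", "label_comp_id"),
        res_num=int(_pick(row, "auth_seq_id", "label_seq_id")),
        icode=_optional(row, "pdbx_PDB_ins_code"),
        atom_name=_pick(row, "auth_atom_id", "label_atom_id"),
        altloc=_optional(row, "label_alt_id"),
        model=int(_optional(row, "pdbx_PDB_model_num", "1")),
        x=float(row["Cartn_x"]),
        y=float(row["Cartn_y"]),
        z=float(row["Cartn_z"]),
    )


row = {
    "group_PDB": "ATOM", "id": "100001",
    "auth_asym_id": "?", "label_asym_id": "AA",  # falls back to label_*
    "auth_comp_id": "GLU", "auth_seq_id": "10048",
    "pdbx_PDB_ins_code": "A", "auth_atom_id": "CA",
    "label_alt_id": ".", "pdbx_PDB_model_num": "1",
    "Cartn_x": "1.0", "Cartn_y": "2.0", "Cartn_z": "3.0",
}
atom = project_atom_site_row(row)
print(atom.chain_id, atom.res_num, atom.icode)
```

Because every identifier is parsed as a whole token rather than sliced from a fixed column, multi-character chain IDs and large residue or atom numbers need no special handling in this sketch.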

Insertion-code aware identity

Added structured identity properties to Atom:

atom.residue_key = (chain_id, res_num, icode)
atom.atom_key = (chain_id, res_num, icode, atom_name)

These are now used for internal matching where string display labels were previously used.

Updated affected logic in:

conformation top-up
molecular top-up
group lookup
desolvation same-residue exclusion
generated hydrogen identity
Group.eq
Iterative.eq
hydrogen protonation grouping

This prevents residues such as E 48A, E 48B, and E 48C (as in the test case) from being collapsed into the same (chain, resnum) identity.
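As a rough illustration of why the structured keys matter, the sketch below groups three atoms by the old (chain, resnum) pair and by residue_key. The Atom class here is a simplified stand-in written for this example, not the propka Atom class, which carries many more fields.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """Simplified stand-in holding only the identity fields."""
    chain_id: str
    res_num: int
    icode: str
    name: str

    @property
    def residue_key(self):
        return (self.chain_id, self.res_num, self.icode)

    @property
    def atom_key(self):
        return (self.chain_id, self.res_num, self.icode, self.name)


# CA atoms of residues E 48A, E 48B and E 48C.
atoms = [Atom("E", 48, icode, "CA") for icode in ("A", "B", "C")]

by_legacy = defaultdict(list)  # keyed without the insertion code
by_key = defaultdict(list)     # keyed with the insertion code
for atom in atoms:
    by_legacy[(atom.chain_id, atom.res_num)].append(atom)
    by_key[atom.residue_key].append(atom)

print(len(by_legacy))  # 1: all three residues share one slot
print(len(by_key))     # 3: each residue keeps its own identity
```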

Safer labels for display

Insertion codes are now shown with an explicit separator in display labels:

chain XX + icode A -> ...XX:A
chain XXA, no icode -> ...XXA

This avoids ambiguity between multi-character mmCIF chain IDs and insertion codes.
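A minimal sketch of the separator rule, covering only the chain and insertion-code part of the label. The function name chain_icode_suffix is hypothetical, and the rest of the label (shown as ... above) is deliberately left out.

```python
def chain_icode_suffix(chain_id: str, icode: str = "") -> str:
    """Join chain ID and insertion code, using ':' only when an icode exists."""
    icode = icode.strip()
    return f"{chain_id}:{icode}" if icode else chain_id


# Plain concatenation cannot tell these two cases apart.
print("XX" + "A" == "XXA" + "")       # True: the labels collide

# With the explicit separator they stay distinct.
print(chain_icode_suffix("XX", "A"))  # XX:A
print(chain_icode_suffix("XXA"))      # XXA
```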

Sorting without single-character chain assumptions

The old atom sorting key used numeric packing with ord(chain_id), which only works for one-character PDB chain IDs.
This PR replaces it with tuple sorting:

(chain_id, res_num, icode, element_order)

This supports mmCIF chain IDs such as AA, AB, etc., while making the code more readable.
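The difference can be sketched as follows. AtomRecord and sort_key are hypothetical names for this example, and element_order is treated as an opaque integer because its exact definition is not given here.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    """Simplified stand-in for the fields used in the sort key."""
    chain_id: str
    res_num: int
    icode: str
    element_order: int


def sort_key(atom: AtomRecord):
    # Tuples compare field by field, so no numeric packing is needed.
    return (atom.chain_id, atom.res_num, atom.icode, atom.element_order)


# ord() accepts exactly one character, so it fails on a chain ID like "AA".
try:
    ord("AA")
    ord_works = True
except TypeError:
    ord_works = False
print(ord_works)  # False

records = [
    AtomRecord("AB", 10, "", 0),
    AtomRecord("AA", 48, "B", 0),
    AtomRecord("AA", 48, "", 1),
    AtomRecord("AA", 48, "A", 0),
    AtomRecord("AA", 48, "", 0),
    AtomRecord("AA", 5, "", 0),
]
sorted_records = sorted(records, key=sort_key)
for record in sorted_records:
    print(record)
```

In this sketch a residue without an insertion code sorts before 48A and 48B because the empty string compares lowest, and chain IDs compare lexicographically, so AA sorts before B.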

Tests and reproducibility

Added dry-run tests for:

mmCIF atom-site parsing
large atom/residue identifiers
multi-character chain IDs
insertion-code preservation
required mmCIF field validation
invalid optional model number diagnostics
conformation top-up with insertion-code-separated residues
multi-character chain atom sorting

Added 3SGB comparison fixtures:

tests/pdb/3SGB_noicode.pdb
tests/results/3SGB_new.dat
tests/results/3SGB_new_noicode.dat

And a reproduction script:

tests/reproduce_3sgb_icode_results.sh

The script regenerates both new result files and verifies that:
3SGB_new_noicode.dat matches the legacy 3SGB.dat
3SGB_new.dat differs from the legacy 3SGB.dat only at residues where an insertion code appears, which is where insertion-code-aware behavior changes the pKa values

This confirms that the legacy reference behavior is reproduced when insertion codes are explicitly removed, while the new parser correctly preserves insertion-code residue identity.

In addition, some mmCIF/PDB cross-verification tests were run in my fork. To keep this PR simple, I did not add them here; they are available in my fork:

https://github.com/HUSRCF/propka2.git

on the main branch, and can be run with:

./tests/cif_pdb_cross_verify/verify_pdb_cif_consistency.sh

Notes

This PR intentionally treats insertion code as part of residue identity. This can change pKa values for structures where insertion-code residues were previously collapsed into the same chain/residue-number slot.

HUSRCF · 2026-06-09T17:26:46Z

And thank you very much for your work, which has been extremely useful in my development. I look forward to your reply and comments!

HUSRCF added 5 commits April 29, 2026 23:58

Add native mmCIF parsing with gemmi

e2c4f1f

Restore compact group interaction tables

a376497

Reduce mmCIF diff formatting noise

2a07b48

Merge branch 'main'

462beaa

overall test

a640eb7

HUSRCF wants to merge 5 commits into jensengroup:master from HUSRCF:master
