A small step to add icode (insertion code) support and mmCIF support #212
Open
HUSRCF wants to merge 5 commits into
Thank you very much for your work, which has been extremely useful in my development. I look forward to your reply and comments!
Summary
This PR adds native PDBx/mmCIF input support using gemmi, while preserving the existing PDB parsing and downstream PROPKA calculation flow.
Main goals are:
read .cif/.mmcif inputs directly;
keep the MolecularContainer -> ConformationContainer -> Group -> pKa pipeline unchanged where possible.
Main changes
Native mmCIF parser
Added read_mmcif() in propka/input.py through gemmi.
The parser reads _atom_site records and projects them into existing Atom objects. It handles:
ATOM/HETATM
auth_* fields with fallback to label_*
_atom_site.pdbx_PDB_ins_code
_atom_site.pdbx_PDB_model_num
_atom_site.label_alt_id
The mmCIF path then reuses the existing downstream setup.
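The auth_* to label_* fallback listed above can be sketched in plain Python. This is only an illustration of the idea, not the code in read_mmcif(): it works on a plain dict instead of gemmi objects, and the helper names pick_field and atom_identity are hypothetical.

```python
# Hypothetical helpers for illustration; not the PR's read_mmcif() implementation.
NULL_VALUES = {"", "?", "."}  # mmCIF markers for unknown / inapplicable values


def pick_field(row, name):
    """Return row['auth_<name>'] unless it is null, else row['label_<name>']."""
    for prefix in ("auth_", "label_"):
        value = row.get(prefix + name)
        if value is not None and value not in NULL_VALUES:
            return value
    return None


def atom_identity(row):
    """Project one _atom_site row (a plain dict here) onto identity fields.

    Assumes the row carries a usable residue number in auth_seq_id or label_seq_id.
    """
    icode = row.get("pdbx_PDB_ins_code")
    return {
        "record": row.get("group_PDB"),  # ATOM or HETATM
        "chain_id": pick_field(row, "asym_id"),
        "res_num": int(pick_field(row, "seq_id")),
        "res_name": pick_field(row, "comp_id"),
        "atom_name": pick_field(row, "atom_id"),
        "icode": "" if icode is None or icode in NULL_VALUES else icode,
    }


example_row = {
    "group_PDB": "ATOM",
    "auth_asym_id": "E", "label_asym_id": "A",
    "auth_seq_id": "48", "label_seq_id": "33",
    "auth_comp_id": "?", "label_comp_id": "GLY",  # auth value is null, so label is used
    "label_atom_id": "CA",                         # auth column absent, so label is used
    "pdbx_PDB_ins_code": "A",
}
identity = atom_identity(example_row)
```

Treating "?" and "." as missing keeps a null insertion code from turning into a literal character in the residue identity.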
Insertion-code aware identity
Added structured identity properties to Atom:
atom.residue_key = (chain_id, res_num, icode)
atom.atom_key = (chain_id, res_num, icode, atom_name)
These are now used for internal matching where string display labels were previously used.
Updated affected logic in:
conformation top-up
molecular top-up
group lookup
desolvation same-residue exclusion
generated hydrogen identity
Group.__eq__
Iterative.__eq__
hydrogen protonation grouping
This prevents residues such as E 48A, E 48B, and E 48C (covered by a test case) from being collapsed into the same (chain, resnum) identity.
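A minimal sketch of why the extra tuple element matters. SketchAtom is a stand-in written for this example and is not propka's Atom class; only the two key definitions are taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class SketchAtom:
    """Illustrative stand-in for an atom record; not propka's Atom class."""
    chain_id: str
    res_num: int
    icode: str
    atom_name: str

    @property
    def residue_key(self):
        # Insertion code is part of the residue identity.
        return (self.chain_id, self.res_num, self.icode)

    @property
    def atom_key(self):
        return (self.chain_id, self.res_num, self.icode, self.atom_name)


atoms = [
    SketchAtom("E", 48, "A", "CA"),
    SketchAtom("E", 48, "B", "CA"),
    SketchAtom("E", 48, "C", "CA"),
]

# Identity without the insertion code: all three residues share one slot.
legacy_ids = {(a.chain_id, a.res_num) for a in atoms}
# Identity with the insertion code: the three residues stay distinct.
icode_ids = {a.residue_key for a in atoms}
```

Here legacy_ids has one element while icode_ids has three, which is the collapse that residue_key is meant to prevent.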
Safer labels for display
Insertion codes are now shown with an explicit separator in display labels:
chain XX + icode A -> ...XX:A
chain XXA, no icode -> ...XXA
This avoids ambiguity between multi-character mmCIF chain IDs and insertion codes.
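The separator rule can be sketched as below. The function name chain_icode_suffix is hypothetical, and the leading part of the label (shown as ... above) is left out because its exact format is not given here.

```python
def chain_icode_suffix(chain_id, icode=""):
    """Return the chain part of a display label.

    A non-empty insertion code is appended after an explicit ':' so it
    cannot be mistaken for the last character of a longer chain ID.
    """
    return f"{chain_id}:{icode}" if icode else chain_id


with_icode = chain_icode_suffix("XX", "A")   # chain XX, insertion code A
without_icode = chain_icode_suffix("XXA")    # chain XXA, no insertion code

# Plain concatenation would give the same string for both cases.
ambiguous = ("XX" + "A") == ("XXA" + "")
```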
Sorting without single-character chain assumptions
The old atom sorting key used numeric packing with ord(chain_id), which only works for one-character PDB chain IDs.
This PR replaces it with tuple sorting:
(chain_id, res_num, icode, element_order)
This supports mmCIF chain IDs such as AA, AB, etc., while making the code more readable.
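A small sketch of the difference, assuming atoms are reduced to the four fields in the tuple. SortAtom and sort_key are names invented for this example, not the names used in the PR.

```python
from typing import NamedTuple


class SortAtom(NamedTuple):
    """Illustrative record holding only the fields used for ordering."""
    chain_id: str
    res_num: int
    icode: str
    element_order: int


def sort_key(atom):
    # Tuples compare element by element, so chain IDs of any length work.
    return (atom.chain_id, atom.res_num, atom.icode, atom.element_order)


atoms = [
    SortAtom("AB", 2, "", 0),
    SortAtom("AA", 10, "", 1),
    SortAtom("AA", 10, "", 0),
    SortAtom("AA", 9, "A", 0),
    SortAtom("AA", 9, "", 0),
]
ordered = sorted(atoms, key=sort_key)

# ord() only accepts a single character, so any packing based on
# ord(chain_id) fails for a two-character chain ID.
try:
    ord("AA")
    ord_accepts_multichar = True
except TypeError:
    ord_accepts_multichar = False
```

With this key, a residue without an insertion code sorts before the same residue number with one, because the empty string compares lower than any letter.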
Tests and reproducibility
Added dry-run tests for:
Added 3SGB comparison fixtures:
tests/pdb/3SGB_noicode.pdb
tests/results/3SGB_new.dat
tests/results/3SGB_new_noicode.dat
And a reproduction script:
tests/reproduce_3sgb_icode_results.sh
The script regenerates both new result files and verifies:
3SGB_new_noicode.dat matches legacy 3SGB.dat
3SGB_new.dat differs from legacy 3SGB.dat only at the residues where insertion codes appear, which is where insertion-code-aware behavior changes the pKa values
This confirms that the legacy reference behavior is reproduced when insertion codes are explicitly removed, while the new parser correctly preserves insertion-code residue identity.
In addition, I ran some mmCIF/PDB cross-verification in my forked repo. To keep this PR small, I did not add those checks here. You can find them in my fork:
https://github.com/HUSRCF/propka2.git at branch main, and run them with
./tests/cif_pdb_cross_verify/verify_pdb_cif_consistency.sh
Notes
This PR intentionally treats insertion code as part of residue identity. This can change pKa values for structures where insertion-code residues were previously collapsed into the same chain/residue-number slot.