kira-mmcif

crates.io: kira-mmcif
lib.rs: kira-mmcif
version: 0.1.0
created_at: 2026-01-18 17:17:18.7792+00
updated_at: 2026-01-18 17:17:18.7792+00
description: Low-level, streaming mmCIF parser focused on protein coordinates.
repository: https://github.com/ARyaskov/kira-mmcif
id: 2052696
size: 40,572
author: Andrei Riaskóv (ARyaskov)

Kira mmCIF

Low-level, streaming mmCIF parser focused on protein coordinates. The crate reads _atom_site data and exposes a stable, Gemmi-inspired API with a deterministic, protein-oriented data contract.

Scope (by design):

  • Reads mmCIF (STAR/CIF) files.
  • Extracts ATOM records from _atom_site only.
  • Single model (MODEL 1).
  • AltLoc handling: accepts . or A, ignores others.
  • Ignores symmetry, assemblies, validation, secondary structure, and other metadata.

Public API

Top-level entry point

use kira_mmcif::{read_structure, MmCifError, Structure};

let structure: Structure = read_structure("input.cif")?;

Signature:

pub fn read_structure<P: AsRef<Path>>(path: P) -> Result<Structure, MmCifError>;

Installation

Add to Cargo.toml:

[dependencies]
kira-mmcif = "0.1"

Data model (Gemmi-inspired, Rust-native)

pub struct Structure {
    pub models: Vec<Model>,
}

pub struct Model {
    pub chains: Vec<Chain>,
}

pub struct Chain {
    pub id: ChainId,
    pub residues: Vec<Residue>,
}

pub struct Residue {
    pub name: ResidueName,
    pub seq_id: i32,
    pub atoms: SmallVec<[Atom; 4]>,
}

pub struct Atom {
    pub name: AtomName,
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

Enums and IDs:

pub enum AtomName { N, CA, C, O }

pub enum ResidueName {
    ALA, ARG, ASN, ASP, CYS, GLN, GLU, GLY, HIS, ILE,
    LEU, LYS, MET, PHE, PRO, SER, THR, TRP, TYR, VAL,
    UNK,
}

pub struct ChainId(pub u8); // 'A'..='Z' or 'a'..='z' => 0..=25

Utility mapping (public methods):

impl AtomName {
    pub fn from_label_atom_id(label: &str) -> Option<Self>;
    pub fn as_u8(self) -> u8; // N=0, CA=1, C=2, O=3
}

impl ResidueName {
    pub fn from_label_comp_id(label: &str) -> Self; // unknown => UNK
    pub fn as_u8(self) -> u8; // AA index, UNK=255
}

impl ChainId {
    pub fn from_label_asym_id(label: &str) -> Option<Self>;
    pub fn as_u8(self) -> u8;
}
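
The mapping helpers above can be illustrated with a minimal, self-contained sketch. This is not the crate's source: the types are re-declared locally, and the rule that `from_label_asym_id` accepts only a single ASCII letter (case-insensitively) is an assumption inferred from the `ChainId` comment.

```rust
// Illustrative re-implementation; the crate's real code may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AtomName { N, CA, C, O }

impl AtomName {
    /// Map a `label_atom_id` value to a backbone atom; anything else is `None`.
    pub fn from_label_atom_id(label: &str) -> Option<Self> {
        match label {
            "N" => Some(AtomName::N),
            "CA" => Some(AtomName::CA),
            "C" => Some(AtomName::C),
            "O" => Some(AtomName::O),
            _ => None,
        }
    }

    /// Numeric code as documented: N=0, CA=1, C=2, O=3.
    pub fn as_u8(self) -> u8 {
        match self {
            AtomName::N => 0,
            AtomName::CA => 1,
            AtomName::C => 2,
            AtomName::O => 3,
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ChainId(pub u8);

impl ChainId {
    /// Assumption: only single-letter ASCII ids are accepted, and
    /// upper and lower case map to the same index (0..=25).
    pub fn from_label_asym_id(label: &str) -> Option<Self> {
        match label.as_bytes() {
            [b] if b.is_ascii_alphabetic() => {
                Some(ChainId(b.to_ascii_uppercase() - b'A'))
            }
            _ => None,
        }
    }

    pub fn as_u8(self) -> u8 {
        self.0
    }
}
```

Under these assumptions, a side-chain atom such as `CB` yields `None` and is therefore skipped, and a multi-character chain id such as `AA` is rejected.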

ProteinIR adapter

This is the stable contract for downstream analysis pipelines.

pub struct ProteinIR {
    pub atoms: AtomSoA,
    pub residues: Vec<ResidueIR>,
    pub chains: Vec<ChainIR>,
}

pub struct AtomSoA {
    pub x: Vec<f32>,
    pub y: Vec<f32>,
    pub z: Vec<f32>,
    pub residue_idx: Vec<u32>,
    pub atom_kind: Vec<u8>, // N=0, CA=1, C=2, O=3
}

pub struct ResidueIR {
    pub chain_id: u8,
    pub residue_name: u8,   // AA index
    pub residue_number: i32,
    pub atom_offset: u32,
    pub atom_count: u8,
    pub has_n: bool,
    pub has_ca: bool,
    pub has_c: bool,
    pub has_o: bool,
}

pub struct ChainIR {
    pub chain_id: u8,
    pub residue_start: u32,
    pub residue_end: u32, // inclusive
}
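
To show how the offsets fit together, here is a hedged sketch that collects the CA coordinates of one chain. The structs are trimmed local mirrors of the ones above, `ca_coords` is a hypothetical helper (not part of the crate), and it assumes that `atom_offset` indexes into the `AtomSoA` vectors and that each residue's atoms are stored contiguously.

```rust
// Trimmed local mirrors of the IR types; only the fields used here are kept.
pub struct AtomSoA {
    pub x: Vec<f32>,
    pub y: Vec<f32>,
    pub z: Vec<f32>,
    pub residue_idx: Vec<u32>,
    pub atom_kind: Vec<u8>, // N=0, CA=1, C=2, O=3
}

pub struct ResidueIR {
    pub atom_offset: u32,
    pub atom_count: u8,
}

pub struct ChainIR {
    pub chain_id: u8,
    pub residue_start: u32,
    pub residue_end: u32, // inclusive
}

pub struct ProteinIR {
    pub atoms: AtomSoA,
    pub residues: Vec<ResidueIR>,
    pub chains: Vec<ChainIR>,
}

/// Hypothetical helper: CA coordinates of every residue in `chain`.
pub fn ca_coords(ir: &ProteinIR, chain: &ChainIR) -> Vec<[f32; 3]> {
    const CA: u8 = 1;
    let mut out = Vec::new();
    // `residue_end` is inclusive, hence `..=`.
    for r in chain.residue_start..=chain.residue_end {
        let res = &ir.residues[r as usize];
        let start = res.atom_offset as usize;
        let end = start + res.atom_count as usize;
        for i in start..end {
            if ir.atoms.atom_kind[i] == CA {
                out.push([ir.atoms.x[i], ir.atoms.y[i], ir.atoms.z[i]]);
            }
        }
    }
    out
}
```

The inclusive `residue_end` is the detail most likely to cause an off-by-one error when porting loops from half-open ranges.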

Adapter usage:

use kira_mmcif::{ProteinIR, Structure};

let protein_ir = ProteinIR::try_from(&structure)?;

Errors

pub enum MmCifError {
    Io(std::io::Error),
    Parse(String),
    MissingField(&'static str),
    InvalidChainId(String),
    InvalidModelCount(usize),
}
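
A caller will typically want to branch on the variant. The sketch below mirrors the enum locally so it stands alone; `describe` and its message strings are hypothetical and only illustrate an exhaustive match.

```rust
// Local mirror of the error enum shown above.
#[derive(Debug)]
pub enum MmCifError {
    Io(std::io::Error),
    Parse(String),
    MissingField(&'static str),
    InvalidChainId(String),
    InvalidModelCount(usize),
}

/// Hypothetical helper: turn an error into a short, user-facing message.
pub fn describe(err: &MmCifError) -> String {
    match err {
        MmCifError::Io(e) => format!("I/O failure: {e}"),
        MmCifError::Parse(msg) => format!("malformed mmCIF: {msg}"),
        MmCifError::MissingField(name) => format!("required column missing: {name}"),
        MmCifError::InvalidChainId(id) => format!("unsupported chain id: {id}"),
        MmCifError::InvalidModelCount(n) => format!("unexpected model count: {n}"),
    }
}
```

Because the match is exhaustive, adding a variant to the mirrored enum makes the compiler flag every place that needs updating.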

Parsing rules (strict by scope)

Required _atom_site fields:

  • _atom_site.group_PDB
  • _atom_site.label_atom_id
  • _atom_site.label_comp_id
  • _atom_site.label_asym_id
  • _atom_site.label_seq_id
  • _atom_site.Cartn_x
  • _atom_site.Cartn_y
  • _atom_site.Cartn_z

Supported extras (optional):

  • _atom_site.label_alt_id (altLoc filter)
  • _atom_site.pdbx_PDB_model_num (MODEL filter)
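
One way to enforce the required/optional split is to resolve column positions once from the `loop_` header. The following is a sketch under stated assumptions, not the crate's implementation: `resolve_required` and `resolve_optional` are hypothetical names, and tags are compared case-insensitively on the assumption that CIF data names are case-insensitive.

```rust
const REQUIRED: [&str; 8] = [
    "_atom_site.group_PDB",
    "_atom_site.label_atom_id",
    "_atom_site.label_comp_id",
    "_atom_site.label_asym_id",
    "_atom_site.label_seq_id",
    "_atom_site.Cartn_x",
    "_atom_site.Cartn_y",
    "_atom_site.Cartn_z",
];

/// Column index of each required tag (in `REQUIRED` order),
/// or the name of the first tag that is missing from `header`.
pub fn resolve_required(header: &[&str]) -> Result<[usize; 8], &'static str> {
    let mut idx = [0usize; 8];
    for (slot, name) in idx.iter_mut().zip(REQUIRED) {
        *slot = header
            .iter()
            .position(|h| h.eq_ignore_ascii_case(name))
            .ok_or(name)?;
    }
    Ok(idx)
}

/// Column index of an optional tag, if the file provides it.
pub fn resolve_optional(header: &[&str], name: &str) -> Option<usize> {
    header.iter().position(|h| h.eq_ignore_ascii_case(name))
}
```

The `Err` value carries a `&'static str`, which lines up with the shape of `MmCifError::MissingField(&'static str)`.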

Filtering behavior:

  • Only group_PDB == "ATOM" is kept.
  • Only model 1 is kept if the model column is present.
  • Only rows whose altLoc is . or A are kept if the altLoc column is present (? is treated as missing, like .).
  • Non-backbone atoms are ignored (AtomName::from_label_atom_id must match).
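
The four filtering rules above can be condensed into a single predicate. This is a minimal sketch, not the crate's code: `keep_row` is a hypothetical name, and `None` stands for "the optional column is absent from the file".

```rust
/// Decide whether one `_atom_site` row survives the filters listed above.
/// `model_num` and `alt_id` are `None` when the file has no such column.
pub fn keep_row(
    group_pdb: &str,
    atom_id: &str,
    model_num: Option<&str>,
    alt_id: Option<&str>,
) -> bool {
    // Rule 1: ATOM records only (HETATM is dropped).
    let is_atom = group_pdb == "ATOM";
    // Rule 2: model 1 only, when a model column exists.
    let model_ok = model_num.map_or(true, |m| m == "1");
    // Rule 3: altLoc `.`, `?` (missing) or `A`, when an altLoc column exists.
    let alt_ok = alt_id.map_or(true, |a| matches!(a, "." | "?" | "A"));
    // Rule 4: backbone atoms only.
    let backbone = matches!(atom_id, "N" | "CA" | "C" | "O");
    is_atom && model_ok && alt_ok && backbone
}
```

For example, `keep_row("ATOM", "CA", Some("1"), Some("B"))` returns `false` because altLoc `B` is filtered out.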

Ordering guarantees:

  • Chains preserve original label_asym_id ordering as they appear in the file.
  • Residues are sorted by label_seq_id within each chain.
  • Atoms are emitted in file order within each residue.
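
The ordering guarantees can be sketched as a small grouping routine. `Row` and `group_rows` are hypothetical names used only for illustration, and the sketch assumes that rows sharing a chain and `label_seq_id` belong to the same residue.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub chain: String,
    pub seq_id: i32,
    pub atom: String,
}

/// (chain id, residues), where each residue is (seq_id, atom names).
pub type Grouped = Vec<(String, Vec<(i32, Vec<String>)>)>;

pub fn group_rows(rows: &[Row]) -> Grouped {
    let mut chains: Grouped = Vec::new();
    for row in rows {
        // Chains keep the order in which they first appear in the file.
        let ci = match chains.iter().position(|c| c.0 == row.chain) {
            Some(i) => i,
            None => {
                chains.push((row.chain.clone(), Vec::new()));
                chains.len() - 1
            }
        };
        let residues = &mut chains[ci].1;
        // Atoms are appended in file order within their residue.
        match residues.iter_mut().find(|r| r.0 == row.seq_id) {
            Some(r) => r.1.push(row.atom.clone()),
            None => residues.push((row.seq_id, vec![row.atom.clone()])),
        }
    }
    // Residues are sorted by seq_id within each chain.
    for (_, residues) in &mut chains {
        residues.sort_by_key(|r| r.0);
    }
    chains
}
```

The linear scans keep the sketch short; a real implementation would more likely use an index map for large structures.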

Non-goals

  • No secondary structure, bonds, or geometry validation.
  • No exposure of CIF/STAR internals.

Example

use kira_mmcif::{read_structure, ProteinIR};

let structure = read_structure("protein.cif")?;
let protein_ir = ProteinIR::try_from(&structure)?;

println!("chains: {}", protein_ir.chains.len());
println!("residues: {}", protein_ir.residues.len());
println!("atoms: {}", protein_ir.atoms.x.len());
# Ok::<(), Box<dyn std::error::Error>>(())