| Crates.io | proseg |
| lib.rs | proseg |
| version | 3.0.8 |
| created_at | 2024-04-26 01:15:38.103665+00 |
| updated_at | 2025-09-15 21:41:50.173239+00 |
| description | Probabilistic cell segmentation for in situ spatial transcriptomics |
| homepage | |
| repository | https://github.com/dcjones/proseg |
| max_upload_size | |
| id | 1220878 |
| size | 1,574,386 |
Proseg (probabilistic segmentation) is a cell segmentation method for spatial transcriptomics. Xenium, CosMx, MERSCOPE, and Visium HD platforms are currently supported, but it can be easily adapted to others.

Read the paper:
And the Research Brief:
Proseg can be built and installed with cargo by running:
cargo install proseg
For most users, the easiest way to install cargo is with rustup.
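If cargo is not already installed, the standard rustup installer (from rustup.rs) is one way to get it, for example:

```sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```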
It can also be built manually from source, which is useful mainly if you want to try a specific revision or make changes:
git clone https://github.com/dcjones/proseg.git
cd proseg
cargo build --release
Proseg can then be run with:
target/release/proseg
Proseg segments cells based on the observed spatial distribution of transcripts. This generally works well at reducing spurious coexpression and transcript misassignment, but has some limitations to be aware of:
* Cells with very few observed transcripts will tend to have unreliable boundaries.
* Proseg relies on prior (usually image-based) segmentation to determine the number and approximate location of cells. It doesn't introduce new cells, so if the prior segmentation missed many cells, Proseg is not able to correct for that error.
* In highly homogeneous regions consisting of cells with very similar expression profiles, boundaries between cells will be much less reliable.
Proseg 3 has a few changes that users of earlier versions should be aware of.
By default, proseg will output a spatialdata zarr directory that can be read by the spatialdata package in Python. Many other output options are still available, but they are disabled by default. To use the results in R, the count table can first be exported to an h5ad file in Python:
import spatialdata
sdata = spatialdata.read_zarr("proseg-output.zarr")
sdata.tables["table"].write_h5ad("proseg-anndata.h5ad")
This h5ad file can then be read into R with zellkonverter.
* The proseg-to-baysor command now operates on these zarr directories.
* By default, count matrices generated by Proseg 3 are integer point estimates, not continuous expected counts, which should make some downstream analysis simpler.
* Some simplifications were made to the model and sampling procedure. The sampling schedule is now controlled with four arguments: --burnin-samples and --samples give the number of iterations, and --burnin-voxel-size and --voxel-size give the x/y size of the voxels in microns. The burn-in voxel size must be an integer multiple of the final voxel size. (See the example invocation after this list.)
* The --nbglayers argument has been removed. There is now just one --voxel-layers argument controlling how many voxels are stacked on the z-axis.
* The voxel morphology prior has been changed. Instead of --perimeter-bound and --perimeter-eta, there is a single --cell-compactness argument, where smaller numbers lead to more compact (equivalently, more circular) cells.
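To illustrate, a Proseg 3 invocation using these arguments might look like the following sketch; the input file and all argument values are illustrative only, not recommended defaults, and the --xenium preset is described further below:

```sh
proseg --xenium transcripts.parquet \
    --burnin-samples 150 \
    --samples 150 \
    --burnin-voxel-size 4 \
    --voxel-size 1 \
    --voxel-layers 4 \
    --cell-compactness 0.03
```

Note that the burn-in voxel size (4) is an integer multiple of the final voxel size (1), as required.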
Proseg is run on a table of transcript positions, which must include, in some form, preliminary assignments of transcripts to nuclei or cells. Xenium, CosMx, and MERSCOPE all provide this out of the box.
Proseg is invoked from the command line like:
proseg [arguments...] /path/to/transcripts.csv.gz
The method is general purpose. There are command line arguments to tell it which
columns in the csv file to use and how they should be interpreted, but
typically one of the presets --xenium, --cosmx, --merfish, or --visiumhd
is used for common platforms.
Proseg is a sampling method and, in its current form, is non-deterministic. From run to run, results will vary slightly, but not in a way that would seriously affect the interpretation of the data.
To see a list of command line arguments, run
proseg --help
Most of these can be disregarded. The most relevant ones are described below.
The spatialdata zarr output that proseg generates can be read with
import spatialdata
sdata = spatialdata.read_zarr("proseg-output.zarr")
This object contains:
* sdata.points["transcripts"] (this can use significant space, so it can be excluded if not needed with --exclude-spatialdata-transcripts)
* sdata.shapes["cell_boundaries"]
* sdata.tables["table"], which contains:
    * obs
    * var
    * X
    * obsm["spatial"]
    * uns["proseg_run"]

The most commonly used command line arguments are:

* --nthreads N: Sets the number of threads to parallelize across. By default proseg will use all available CPU cores, which may be a bad idea on a shared machine.
* --output-spatialdata output.zarr: Proseg will output a spatialdata zarr directory, which can be read by the spatialdata python package and contains all metadata, the count matrix, and cell geometry.
* --overwrite: An existing zarr directory will not be overwritten unless this argument is passed.
* --voxel-layers N: Number of layers on the z-axis used to model 3D cells.
* --samples N: Run the sampler for N iterations.
* --burnin-samples N: Run the sampler for a preliminary N samples at a lower resolution.
* --voxel-size S: Voxel size in microns on the x/y axis.
* --burnin-voxel-size S: Larger voxel size to use for the burn-in phase. (This must be an integer multiple of the final voxel size.)
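Putting a few of these together, a run writing a spatialdata zarr might look something like the following; the thread count, voxel sizes, and file names are illustrative assumptions, not defaults:

```sh
proseg --xenium transcripts.csv.gz \
    --nthreads 8 \
    --output-spatialdata proseg-output.zarr \
    --overwrite \
    --burnin-voxel-size 2 \
    --voxel-size 1
```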
In addition to the spatialdata zarr output, results can be written to a number of separate tables, which can be either gzipped csv files or parquet files, and to GeoJSON files giving cell boundaries (see the example after this list).

* --output-counts counts.mtx.gz: Output a cell-by-gene count matrix in gzipped matrix market format. (This can be read with e.g. mmread in scipy.)
* --output-expected-counts expected-counts.mtx.gz: Output an expected count matrix, where the counts are non-integer estimates obtained by taking the mean over multiple samples.
* --output-cell-metadata cell-metadata.csv.gz: Cell centroids, volume, and other information.
* --output-transcript-metadata transcript-metadata.csv.gz: Transcript ids, genes, revised positions, assignment probability, etc.
* --output-gene-metadata gene-metadata.csv.gz: Per-gene summary statistics.
* --output-rates rates.csv.gz: Cell-by-gene Poisson rate parameters. These are essentially expected relative expression values, but may be too overly smoothed for use in downstream analysis.
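For example, to also write the count matrix and a couple of the metadata tables alongside the zarr output (file names here are just examples):

```sh
proseg --xenium transcripts.csv.gz \
    --output-counts counts.mtx.gz \
    --output-cell-metadata cell-metadata.csv.gz \
    --output-transcript-metadata transcript-metadata.csv.gz
```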
Cell boundaries can be output in a number of ways:

* --output-cell-polygons cell-polygons.geojson.gz: 2D consensus polygons for each cell in GeoJSON format. These are flattened from 3D, with each xy position assigned to the dominant cell.
* --output-cell-polygon-layers cell-polygons-layers.geojson.gz: Output a separate, non-overlapping cell polygon for each z-layer, preserving the 3D segmentation.

A number of options can alter assumptions made by the model; these generally should not need to be changed:
* --ncomponents 10: Cell gene expression is modeled as a mixture of negative binomial distributions. This parameter controls the number of mixture components. More components will tend to nudge the cells into more distinct types, but setting it too high risks manifesting cell types that are not real.
* --no-diffusion: By default Proseg models cells as leaky, under the assumption that some amount of RNA leaks from cells and diffuses elsewhere. This seems to be the case in much of the Xenium data we've seen, but could be a harmfully incorrect assumption in some data. This argument disables that part of the model.
* --diffusion-probability 0.2: Prior probability that a transcript is diffused and should be repositioned.
* --diffusion-sigma-far 4: Prior standard deviation on the repositioning distance of diffused transcripts.
* --diffusion-sigma-near 1: Prior standard deviation on the repositioning distance of non-diffused transcripts.
* --nuclear-reassignment-prob 0.2: Prior probability that the initial nuclear assignment (if any) is incorrect.
* --cell-compactness 0.03: Larger numbers allow less spherical cells.

If the dataset consists of multiple unrelated tissue sections, it's generally safer to split these and segment them separately. This avoids potential sub-optimal modeling due to different z-coordinate distributions.
If proseg crashes without any kind of error message, or if it suddenly starts running extremely slowly, that's usually a sign that it ran out of memory. The best option would be to run on a system with more memory, but increasing the voxel size, lowering the number of voxel layers, and disabling diffusion will all reduce the memory usage, typically at the cost of lower transcript assignment accuracy.
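As a sketch of such a lower-memory configuration (the specific values are illustrative, not tuned recommendations):

```sh
proseg --xenium transcripts.csv.gz \
    --burnin-voxel-size 4 \
    --voxel-size 2 \
    --voxel-layers 2 \
    --no-diffusion
```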
Proseg is aided by locally repositioning transcripts (I call this
"transcript repositioning" or "transcript diffusion"). If you're skeptical that
transcripts leak or diffuse, it's still useful to allow some amount of
repositioning to overcome limitations of the voxel resolution. Setting
--diffusion-probability 0.0 will still let transcripts make small-scale adjustments,
unlike --no-diffusion, which completely disables that part of the model.
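For example, to keep only small-scale repositioning while assuming no long-range diffusion:

```sh
proseg --xenium transcripts.csv.gz --diffusion-probability 0.0
```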
Proseg will work on Xenium transcript tables in either csv.gz or parquet format. The latter will be slightly more efficient to read.
proseg --xenium transcripts.csv.gz
# or
proseg --xenium transcripts.parquet
After segmenting with Proseg, the data can be converted back to a format readable by Xenium Explorer using the proseg-to-baysor command:
```sh
proseg-to-baysor proseg-output.zarr \
    --output-transcript-metadata proseg-to-baysor-transcript-metadata.csv \
    --output-cell-polygons proseg-to-baysor-cell-polygons.geojson
```

The resulting files can then be imported with xeniumranger:

```sh
xeniumranger import-segmentation \
    --id sample_id \
    --xenium-bundle /path/to/xenium/bundle \
    --transcript-assignment proseg-to-baysor-transcript-metadata.csv \
    --viz-polygons proseg-to-baysor-cell-polygons.geojson \
    --units=microns
```
where /path/to/xenium/bundle is the original Xenium bundle and the name of the new output is set with --id.

Known issues:
Proseg works on a CosMx transcript table with:
proseg --cosmx sample_tx_file.csv
Earlier versions of CosMx did not automatically provide a single table of global
transcript positions. To work around this, we provide a Julia program in
extra/stitch-cosmx.jl to construct a table from the flat files downloaded from
AtoMx.
To run this, some dependencies are required, which can be installed with
julia -e 'import Pkg; Pkg.add(["Glob", "CSV", "DataFrames", "CodecZlib", "ArgParse"])'
Then the program can be run like:
julia stitch-cosmx.jl /path/to/cosmx-flatfiles transcripts.csv.gz
to output a complete transcripts table to transcripts.csv.gz.
From here, proseg can be run with (note --cosmx-micron rather than --cosmx, since this format differs slightly):
proseg --cosmx-micron transcripts.csv.gz
Proseg should work on the provided transcripts table with --merscope:
proseg --merscope detected_transcripts.csv.gz
Running Proseg on Visium HD is slightly more complicated than for the in situ methods. First, Proseg needs
to read a sparse count matrix across bins/squares. Instead of passing a transcripts table to proseg,
it should be pointed at the binned_outputs directory.
Proseg still requires some form of registered image-based segmentation. The easiest way is
to use the tools provided by Space Ranger to handle image segmentation for Visium HD,
which produces a barcode_mappings.parquet file that Proseg can read:
proseg --visiumhd \
    --spaceranger-barcode-mappings /path/to/barcode_mappings.parquet \
    /path/to/binned_outputs
Alternatively, it's possible, though trickier, to use a Cellpose mask, assuming it's correctly registered:
proseg --visiumhd \
    --cellpose-masks /path/to/cellpose_masks.npy \
    /path/to/binned_outputs
There isn't a standardized cellpose output format. See the extras/cellpose-xenium.py file for code
to run cellpose and output to a format proseg can read.
Typically Proseg is run with prior segmentation in the form of a transcript table with prior cell assignments. It's also possible to initialize directly using Cellpose (or similar) masks, and there is potentially some benefit to doing this, because it can provide a more detailed pixel-level prior to proseg.
This does require more effort and care, since it's critical that coordinates are properly registered and scaled from mask pixels to transcript positions in microns. These are the relevant arguments:
* --cellpose-masks masks.npy.gz: An (optionally gzipped) npy file with a 2D masks array of uint32 type, where 0 represents the unassigned state and values in [1, ncells] give each pixel's cell assignment.
* --cellpose-cellprobs cellprobs.npy.gz: An (optionally gzipped) npy file giving segmentation uncertainty. This should be a float32 array matching the dimensions of the masks array.
* --cellpose-scale 0.324: Microns per pixel of the cellpose mask.
* --cellpose-x-transform a b c: Affine transformation from pixels to microns for the x-coordinate (where x_micron = a*x_pixel + b*y_pixel + c).
* --cellpose-y-transform a b c: Affine transformation from pixels to microns for the y-coordinate (where y_micron = a*x_pixel + b*y_pixel + c).

Don't hesitate to open an issue or to email me (email is in my bio). This data is complex and varied. Weird cases do arise which may not be correctly considered by Proseg, so I appreciate being made aware of them.