| Crates.io | codec-eval |
| lib.rs | codec-eval |
| version | 0.2.0 |
| created_at | 2025-12-25 15:33:45.238715+00 |
| updated_at | 2025-12-25 15:33:45.238715+00 |
| description | Image codec comparison and evaluation library |
| homepage | |
| repository | https://github.com/imazen/codec-eval |
| max_upload_size | |
| id | 2004663 |
| size | 223,741 |
A practical guide to comparing image codecs fairly, with metric accuracy data, viewing-condition considerations, and scientific methodology.
Integrating your codec? See INTEGRATION.md.
Want to improve this tool? See CONTRIBUTING.md. We actively want input from codec developers—you know your domain better than we do.
# Quick start
cargo add codec-eval --git https://github.com/imazen/codec-eval
# Or use the CLI
cargo install --git https://github.com/imazen/codec-eval codec-eval-cli
API-first design: you provide encode/decode callbacks and the library handles everything else.
use codec_eval::{EvalSession, EvalConfig, ViewingCondition};

// Configure report output, the target viewing conditions,
// and the quality levels to sweep.
let config = EvalConfig::builder()
    .report_dir("./reports")
    .viewing(ViewingCondition::desktop())
    .quality_levels(vec![60.0, 80.0, 95.0])
    .build();

let mut session = EvalSession::new(config);

// Register a codec (name, version) with an encode callback.
session.add_codec("my-codec", "1.0", Box::new(|image, request| {
    my_codec::encode(image, request.quality)
}));

// Run every registered codec over a directory of test images.
let report = session.evaluate_corpus("./test_images")?;
Features:
| Metric | Correlation with Human Perception | Best For |
|---|---|---|
| PSNR | ~67% | Legacy benchmarks only |
| SSIM/DSSIM | ~82% | Quick approximation |
| Butteraugli | 80-91% | High-quality threshold (score < 1.0) |
| SSIMULACRA2 | 87-98% | Recommended — best overall accuracy |
| VMAF | ~90% | Video, large datasets |
Based on Kornel Lesiński's guide:
❌ JPEG → WebP → AVIF (each conversion adds artifacts)
✓ PNG/TIFF → WebP
✓ PNG/TIFF → AVIF
Always start from a lossless source. Converting lossy→lossy compounds artifacts and skews results.
Don't compare mozjpeg -quality 80 against cjxl -quality 80 — quality scales differ between encoders.
Instead, match by:
A single test image can favor certain codecs. Use diverse datasets:
Codec rankings change across the quality spectrum:
A codec that's 5% smaller but 100x slower may not be practical. Report encoding speed alongside file size and quality, as in the sketch below.
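To make the size, quality, and speed trade-off concrete, here is a rough sketch of a per-codec quality sweep that records file size, encode time, and a perceptual score at each setting. The `encode` and `score` callbacks are placeholders for your own codec and metric hookup, not part of the codec-eval API.

```rust
use std::time::Instant;

/// One point on a codec's rate-distortion curve, plus timing.
struct SweepPoint {
    quality_setting: f32,
    bytes: usize,
    encode_ms: f64,
    ssimulacra2: f64,
}

/// Sweep a codec across quality settings so codecs can be compared at
/// matched file size or matched score, not at matched quality knobs.
/// `encode` and `score` are placeholder callbacks supplied by you.
fn sweep_codec(
    source: &[u8],
    qualities: &[f32],
    encode: impl Fn(&[u8], f32) -> Vec<u8>,
    score: impl Fn(&[u8], &[u8]) -> f64,
) -> Vec<SweepPoint> {
    qualities
        .iter()
        .map(|&q| {
            let start = Instant::now();
            let encoded = encode(source, q);
            SweepPoint {
                quality_setting: q,
                bytes: encoded.len(),
                encode_ms: start.elapsed().as_secs_f64() * 1000.0,
                ssimulacra2: score(source, &encoded),
            }
        })
        .collect()
}
```

Comparing codecs at matched size or matched score from sweeps like this avoids the mismatched-quality-scale trap described above.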
The current best metric for perceptual quality assessment.
| Score | Quality Level | Typical Use Case |
|---|---|---|
| < 30 | Poor | Thumbnails, previews |
| 40-50 | Low | Aggressive compression |
| 50-70 | Medium | General web images |
| 70-80 | Good | Photography sites |
| 80-85 | Very High | Professional/archival |
| > 85 | Excellent | Near-lossless |
Accuracy: 87% overall, up to 98% on high-confidence comparisons.
Tool: ssimulacra2_rs
ssimulacra2_rs original.png compressed.jpg
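To use this measurement from a Rust harness without binding a metrics library, one option is to shell out to the CLI and parse the printed score. A minimal sketch, assuming the ssimulacra2_rs binary is on PATH, accepts the two paths in the order shown, and prints a single number (adjust for the tool you actually installed):

```rust
use std::process::Command;

/// Run the SSIMULACRA2 CLI on an (original, compressed) pair and parse
/// the score it prints to stdout. Higher scores mean better quality.
fn ssimulacra2_score(original: &str, compressed: &str) -> Result<f64, Box<dyn std::error::Error>> {
    let output = Command::new("ssimulacra2_rs")
        .args([original, compressed])
        .output()?;
    if !output.status.success() {
        return Err(format!("ssimulacra2_rs exited with {:?}", output.status).into());
    }
    let stdout = String::from_utf8(output.stdout)?;
    // Take the first whitespace-separated token that parses as a float.
    stdout
        .split_whitespace()
        .find_map(|tok| tok.parse::<f64>().ok())
        .ok_or_else(|| "could not parse a score from the output".into())
}
```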
Structural similarity, derived from SSIM but outputs distance (lower = better).
Accuracy: Validated against the TID2013 database (Spearman correlation -0.84 to -0.95, depending on distortion type).
Tool: dssim
dssim original.png compressed.jpg
| DSSIM Score | Approximate Quality |
|---|---|
| < 0.001 | Visually identical |
| 0.001-0.01 | Excellent |
| 0.01-0.05 | Good |
| 0.05-0.10 | Acceptable |
| > 0.10 | Noticeable artifacts |
Note: Values are not directly comparable between DSSIM versions. Always report version.
Google's perceptual metric, good for high-quality comparisons.
Accuracy: 80-91% (varies by image type).
Best for: Determining if compression is "transparent" (score < 1.0).
Limitation: Less reliable for heavily compressed images.
Netflix's Video Multi-Method Assessment Fusion.
Accuracy: ~90% for video, slightly less for still images.
Best for: Large-scale automated testing, video frames.
Peak Signal-to-Noise Ratio — purely mathematical, ignores perception.
Accuracy: ~67% — only slightly better than chance.
Use only: For backwards compatibility with legacy benchmarks.
The number of pixels that fit in one degree of visual field. Critical for assessing when compression artifacts become visible.
| PPD | Context | Notes |
|---|---|---|
| 30 | 1080p at arm's length | Casual viewing |
| 60 | 20/20 vision threshold | Most artifacts visible |
| 80 | Average human acuity limit | Diminishing returns above this |
| 120 | 4K at close range | Overkill for most content |
| 159 | iPhone 15 Pro | "Retina" display density |
Formula:
PPD = (viewing_distance_inches × resolution_ppi × π) / 180
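The same calculation as a small Rust sketch, using the small-angle form above; the numbers in the comments are illustrative assumptions, not measurements:

```rust
use std::f64::consts::PI;

/// Pixels per degree of visual field for a display of `ppi` pixels per
/// inch viewed from `distance_inches` away (small-angle approximation).
fn pixels_per_degree(distance_inches: f64, ppi: f64) -> f64 {
    distance_inches * ppi * PI / 180.0
}

fn main() {
    // A ~92 ppi 1080p desktop monitor at 20 inches: ≈ 32 PPD.
    println!("{:.0}", pixels_per_degree(20.0, 92.0));
    // A ~460 ppi phone display at 20 inches: ≈ 161 PPD.
    println!("{:.0}", pixels_per_degree(20.0, 460.0));
}
```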
| Device Type | Typical PPD | Compression Tolerance |
|---|---|---|
| Desktop monitor | 40-80 | Medium quality acceptable |
| Laptop | 80-120 | Higher quality needed |
| Smartphone | 120-160 | Very high quality or artifacts visible |
| 4K TV at 3m | 30-40 | More compression acceptable |
The international standard for subjective video/image quality assessment.
Key elements:
When to use: Final validation of codec choices, publishing research.
| Method | Description | Best For |
|---|---|---|
| DSIS | Show reference, then test image | Impairment detection |
| DSCQS | Side-by-side, both unlabeled | Quality comparison |
| 2AFC | "Which is better?" forced choice | Fine discrimination |
| Flicker test | Rapid A/B alternation | Detecting subtle differences |
When metrics aren't enough, subjective testing provides ground truth. But poorly designed studies produce unreliable data.
Randomization:
Blinding:
Controls:
| Comparison Type | Minimum N | Recommended N |
|---|---|---|
| Large quality difference (obvious) | 15 | 20-30 |
| Medium difference (noticeable) | 30 | 50-80 |
| Small difference (subtle) | 80 | 150+ |
Power analysis: For 80% power to detect a 0.5 MOS difference with SD=1.0, you need ~64 participants per condition.
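That figure comes from the standard two-sample approximation n = 2 × ((z for α/2 + z for power) × SD / difference)². A sketch of the arithmetic, with the z values fixed for a two-sided α = 0.05 and 80% power:

```rust
/// Per-condition sample size for comparing two means:
/// n = 2 * ((z_alpha + z_beta) * sd / delta)^2.
fn sample_size_per_condition(delta_mos: f64, sd: f64) -> f64 {
    let z_alpha = 1.96;  // two-sided alpha = 0.05
    let z_beta = 0.8416; // 80% power
    2.0 * ((z_alpha + z_beta) * sd / delta_mos).powi(2)
}

fn main() {
    // Detect a 0.5 MOS difference with SD = 1.0: ≈ 63 per condition,
    // in line with the ~64 quoted above.
    println!("{:.0}", sample_size_per_condition(0.5, 1.0).ceil());
}
```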
Pre-study:
Exclusion criteria (define before data collection):
Embed these throughout the study:
Types of attention checks:
1. Obvious pairs - Original vs heavily compressed (SSIMULACRA2 < 30)
2. Identical pairs - Same image twice (should report "same" or 50/50 split)
3. Reversed pairs - Same comparison shown twice, order flipped
4. Instructed response - "For this pair, select the LEFT image"
Threshold: Exclude participants who fail > 2 attention checks or > 20% of obvious pairs.
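A sketch of applying that rule, assuming you keep per-participant tallies; the `Participant` struct here is hypothetical bookkeeping, not a codec-eval type:

```rust
/// Hypothetical per-participant attention-check tallies.
struct Participant {
    id: String,
    failed_attention_checks: u32,
    obvious_pairs_total: u32,
    obvious_pairs_failed: u32,
}

/// Keep only participants who fail at most 2 attention checks and at
/// most 20% of the obvious pairs.
fn retain_valid(participants: Vec<Participant>) -> Vec<Participant> {
    participants
        .into_iter()
        .filter(|p| {
            let obvious_fail_rate =
                p.obvious_pairs_failed as f64 / p.obvious_pairs_total.max(1) as f64;
            p.failed_attention_checks <= 2 && obvious_fail_rate <= 0.20
        })
        .collect()
}
```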
Position bias: Tendency to favor left/right or first/second.
Fatigue effects: Quality judgments degrade over time.
Anchoring: First few images bias subsequent judgments.
Central tendency: Avoiding extreme ratings.
For rating data (MOS):
1. Calculate mean and 95% CI per condition
2. Check normality (Shapiro-Wilk) - often violated
3. Use robust methods:
- Trimmed means (10-20% trim)
- Bootstrap confidence intervals
- Non-parametric tests (Wilcoxon, Kruskal-Wallis)
4. Report effect sizes (Cohen's d, or MOS difference)
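A dependency-free sketch of the robust summary in steps 1 to 3 above (a trimmed mean plus a percentile-bootstrap 95% CI); in practice a statistics package is the usual choice, but the mechanics are small enough to show inline:

```rust
/// Trimmed mean: drop the lowest and highest `trim` fraction of ratings.
fn trimmed_mean(ratings: &[f64], trim: f64) -> f64 {
    let mut v = ratings.to_vec();
    v.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let k = (v.len() as f64 * trim).floor() as usize;
    let kept = &v[k..v.len() - k];
    kept.iter().sum::<f64>() / kept.len() as f64
}

/// Percentile-bootstrap 95% CI for the 10% trimmed mean, using a tiny
/// xorshift PRNG so the sketch needs no external crates.
fn bootstrap_ci_95(ratings: &[f64], iters: usize) -> (f64, f64) {
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    let mut next = move || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        state
    };
    let n = ratings.len();
    let mut means: Vec<f64> = (0..iters)
        .map(|_| {
            let resample: Vec<f64> = (0..n)
                .map(|_| ratings[(next() % n as u64) as usize])
                .collect();
            trimmed_mean(&resample, 0.10)
        })
        .collect();
    means.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let lo = means[(iters as f64 * 0.025) as usize];
    let hi = means[((iters as f64 * 0.975) as usize).min(iters - 1)];
    (lo, hi)
}
```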
For forced choice (2AFC):
1. Calculate preference percentage per pair
2. Binomial test for significance (H0: 50%)
3. Apply multiple comparison correction:
- Bonferroni (conservative)
- Holm-Bonferroni (less conservative)
- Benjamini-Hochberg FDR (for many comparisons)
4. Report: "Codec A preferred 67% of time (p < 0.01, N=100)"
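A sketch of step 2 for 2AFC data: an exact two-sided binomial test against the 50% null, written in log space so it stays numerically stable for large N. A statistics library is the usual choice, and the correction in step 3 still has to be applied to the resulting p-values:

```rust
/// Two-sided exact binomial test against H0: p = 0.5.
/// Returns the p-value for observing `wins` preferences out of `n` trials.
fn binomial_test_half(wins: u64, n: u64) -> f64 {
    // log(k!) for k = 0..=n, built cumulatively.
    let log_fact: Vec<f64> = std::iter::once(0.0)
        .chain((1..=n).scan(0.0, |acc, i| {
            *acc += (i as f64).ln();
            Some(*acc)
        }))
        .collect();
    // log P(X = k) under Binomial(n, 0.5).
    let log_pmf = |k: u64| {
        log_fact[n as usize] - log_fact[k as usize] - log_fact[(n - k) as usize]
            + n as f64 * 0.5f64.ln()
    };
    // The null distribution is symmetric, so the two-sided p-value is
    // twice the tail beyond the more extreme of the two counts.
    let extreme = wins.max(n - wins);
    let tail: f64 = (extreme..=n).map(|k| log_pmf(k).exp()).sum();
    (2.0 * tail).min(1.0)
}

fn main() {
    // 67 preferences for codec A out of 100 comparisons.
    println!("p = {:.4}", binomial_test_half(67, 100));
}
```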
Outlier handling:
1. Define criteria BEFORE analysis (pre-registration)
2. Report both with and without outlier exclusion
3. Use robust statistics that down-weight outliers
4. Never exclude based on "inconvenient" results
Always include:
Example:
"N=87 participants completed the study (12 excluded: 8 failed attention checks, 4 incomplete). Codec A was preferred over Codec B in 62% of comparisons (95% CI: 55-69%, p=0.003, binomial test). This corresponds to a mean quality difference of 0.4 MOS points (95% CI: 0.2-0.6)."
| Pitfall | Problem | Solution |
|---|---|---|
| Small N | Underpowered, unreliable | Power analysis before study |
| No attention checks | Can't detect random responders | Embed 10-15% check trials |
| Post-hoc exclusion | Cherry-picking results | Pre-register exclusion criteria |
| Only reporting means | Hides variability | Show distributions + CI |
| Multiple comparisons | Inflated false positives | Apply correction (Bonferroni, FDR) |
| Unbalanced design | Confounds codec with position/order | Full counterbalancing |
| Lab-only testing | May not generalize | Include diverse participants/displays |
Cloudinary 2021 (1.4 million opinions):
# 1. Encode to same file size
convert source.png -define webp:target-size=50000 output.webp
cjxl source.png output.jxl --target_size 50000
# 2. Measure with SSIMULACRA2
ssimulacra2 source.png output.webp
ssimulacra2 source.png output.jxl
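If you script this in Rust rather than shell, one way to reach a common target size is to bisect each encoder's quality setting. The `encode` closure below is a placeholder for whatever encoder invocation you use, and the sketch assumes output size grows with the quality setting:

```rust
/// Bisect a quality setting until the encoded output lands within
/// `tolerance` bytes of `target_bytes`. Returns the last probed
/// (quality, bytes) pair.
fn match_file_size(
    target_bytes: usize,
    tolerance: usize,
    encode: impl Fn(f32) -> Vec<u8>,
) -> (f32, Vec<u8>) {
    let (mut lo, mut hi) = (0.0_f32, 100.0_f32);
    let mut best = (lo, encode(lo));
    for _ in 0..12 {
        let q = (lo + hi) / 2.0;
        let out = encode(q);
        best = (q, out.clone());
        if out.len().abs_diff(target_bytes) <= tolerance {
            break;
        }
        // Larger than the target: lower the quality ceiling, and vice versa.
        if out.len() > target_bytes { hi = q } else { lo = q }
    }
    best
}
```

Once both codecs hit the same size, compare their SSIMULACRA2 scores as above.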
| Implementation | Type | Install | Notes |
|---|---|---|---|
| ssimulacra2_rs | CLI (Rust) | cargo install ssimulacra2_rs | Recommended |
| ssimulacra2 | Library (Rust) | cargo add ssimulacra2 | For integration |
| ssimulacra2-cuda | GPU (CUDA) | cargo install ssimulacra2-cuda | Fast batch processing |
| libjxl | CLI (C++) | Build from source | Original implementation |
# Install CLI
cargo install ssimulacra2_rs
# Usage
ssimulacra2_rs original.png compressed.jpg
# Output: 76.543210 (higher = better, scale 0-100)
| Implementation | Type | Install | Notes |
|---|---|---|---|
| dssim | CLI (Rust) | cargo install dssim | Recommended |
# Install
cargo install dssim
# Basic comparison (lower = better)
dssim original.png compressed.jpg
# Output: 0.02341
# Generate difference visualization
dssim -o difference.png original.png compressed.jpg
Accuracy: Validated against TID2013 database. Spearman correlation -0.84 to -0.95 depending on distortion type.
| Implementation | Type | Install | Notes |
|---|---|---|---|
| butteraugli | CLI (C++) | Build from source | Original |
| libjxl | CLI (C++) | Build from source | Includes butteraugli |
| Implementation | Type | Install | Notes |
|---|---|---|---|
| libvmaf | CLI + Library | Package manager or build | Official Netflix implementation |
# Ubuntu/Debian
apt install libvmaf-dev
# Usage (via ffmpeg)
# First input is the distorted file, second is the reference
ffmpeg -i compressed.mp4 -i original.mp4 -lavfi libvmaf -f null -
| Tool | Purpose | Ecosystem |
|---|---|---|
| imageflow | High-performance image processing with quality calibration | Rust + C ABI |
| libvips | Fast image processing library | C + bindings |
| sharp | Node.js image processing (uses libvips) | npm |
See CONTRIBUTING.md for the full guide.
We especially want contributions from:
This project is designed to be community-driven. Fork it, experiment, share what you learn.
This guide is released under CC0 — use freely without attribution.