| Crates.io | codec-eval |
| lib.rs | codec-eval |
| version | 0.2.0 |
| created_at | 2025-12-25 15:33:45.238715+00 |
| updated_at | 2025-12-25 15:33:45.238715+00 |
| description | Image codec comparison and evaluation library |
| homepage | |
| repository | https://github.com/imazen/codec-eval |
| max_upload_size | |
| id | 2004663 |
| size | 223,741 |
A practical guide to comparing image codecs fairly, with metric accuracy data, viewing-condition considerations, and scientific methodology.
Integrating your codec? See INTEGRATION.md.
Want to improve this tool? See CONTRIBUTING.md. We actively want input from codec developers—you know your domain better than we do.
# Quick start
cargo add codec-eval --git https://github.com/imazen/codec-eval
# Or use the CLI
cargo install --git https://github.com/imazen/codec-eval codec-eval-cli
API-first design: you provide encode/decode callbacks and the library handles everything else.
use codec_eval::{EvalSession, EvalConfig, ViewingCondition};

// Configure report output, the target viewing conditions,
// and the quality levels to sweep.
let config = EvalConfig::builder()
    .report_dir("./reports")
    .viewing(ViewingCondition::desktop())
    .quality_levels(vec![60.0, 80.0, 95.0])
    .build();

let mut session = EvalSession::new(config);

// Register a codec (name, version) with an encode callback.
session.add_codec("my-codec", "1.0", Box::new(|image, request| {
    my_codec::encode(image, request.quality)
}));

// Run every registered codec over a directory of test images.
let report = session.evaluate_corpus("./test_images")?;
Features:
| Metric | Correlation with Human Perception | Best For |
|---|---|---|
| PSNR | ~67% | Legacy benchmarks only |
| SSIM/DSSIM | ~82% | Quick approximation |
| Butteraugli | 80-91% | High-quality threshold (score < 1.0) |
| SSIMULACRA2 | 87-98% | Recommended — best overall accuracy |
| VMAF | ~90% | Video, large datasets |
Based on Kornel Lesiński's guide:
❌ JPEG → WebP → AVIF (each conversion adds artifacts)
✓ PNG/TIFF → WebP
✓ PNG/TIFF → AVIF
Always start from a lossless source. Converting lossy→lossy compounds artifacts and skews results.
Don't compare mozjpeg -quality 80 against cjxl -quality 80 — quality scales differ between encoders.
Instead, match by:
A single test image can favor certain codecs. Use diverse datasets:
Codec rankings change across the quality spectrum:
A codec that's 5% smaller but 100x slower may not be practical. Report encoding speed alongside file size and quality, as in the sketch below.
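To make the size, quality, and speed trade-off concrete, here is a rough sketch of a per-codec quality sweep that records file size, encode time, and a perceptual score at each setting. The `encode` and `score` callbacks are placeholders for your own codec and metric hookup, not part of the codec-eval API.

```rust
use std::time::Instant;

/// One point on a codec's rate-distortion curve, plus timing.
struct SweepPoint {
    quality_setting: f32,
    bytes: usize,
    encode_ms: f64,
    ssimulacra2: f64,
}

/// Sweep a codec across quality settings so codecs can be compared at
/// matched file size or matched score, not at matched quality knobs.
/// `encode` and `score` are placeholder callbacks supplied by you.
fn sweep_codec(
    source: &[u8],
    qualities: &[f32],
    encode: impl Fn(&[u8], f32) -> Vec<u8>,
    score: impl Fn(&[u8], &[u8]) -> f64,
) -> Vec<SweepPoint> {
    qualities
        .iter()
        .map(|&q| {
            let start = Instant::now();
            let encoded = encode(source, q);
            SweepPoint {
                quality_setting: q,
                bytes: encoded.len(),
                encode_ms: start.elapsed().as_secs_f64() * 1000.0,
                ssimulacra2: score(source, &encoded),
            }
        })
        .collect()
}
```

Comparing codecs at matched size or matched score from sweeps like this avoids the mismatched-quality-scale trap described above.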
The current best metric for perceptual quality assessment.
| Score | Quality Level | Typical Use Case |
|---|---|---|
| < 30 | Poor | Thumbnails, previews |
| 40-50 | Low | Aggressive compression |
| 50-70 | Medium | General web images |
| 70-80 | Good | Photography sites |
| 80-85 | Very High | Professional/archival |
| > 85 | Excellent | Near-lossless |
Accuracy: 87% overall, up to 98% on high-confidence comparisons.
Tool: ssimulacra2_rs
ssimulacra2_rs original.png compressed.jpg
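To use this measurement from a Rust harness without binding a metrics library, one option is to shell out to the CLI and parse the printed score. A minimal sketch, assuming the ssimulacra2_rs binary is on PATH, accepts the two paths in the order shown, and prints a single number (adjust for the tool you actually installed):

```rust
use std::process::Command;

/// Run the SSIMULACRA2 CLI on an (original, compressed) pair and parse
/// the score it prints to stdout. Higher scores mean better quality.
fn ssimulacra2_score(original: &str, compressed: &str) -> Result<f64, Box<dyn std::error::Error>> {
    let output = Command::new("ssimulacra2_rs")
        .args([original, compressed])
        .output()?;
    if !output.status.success() {
        return Err(format!("ssimulacra2_rs exited with {:?}", output.status).into());
    }
    let stdout = String::from_utf8(output.stdout)?;
    // Take the first whitespace-separated token that parses as a float.
    stdout
        .split_whitespace()
        .find_map(|tok| tok.parse::<f64>().ok())
        .ok_or_else(|| "could not parse a score from the output".into())
}
```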
Structural similarity, derived from SSIM but outputs distance (lower = better).
Accuracy: Validated against the TID2013 database (Spearman correlation -0.84 to -0.95, depending on distortion type).
Tool: dssim
dssim original.png compressed.jpg
| DSSIM Score | Approximate Quality |
|---|---|
| < 0.001 | Visually identical |
| 0.001-0.01 | Excellent |
| 0.01-0.05 | Good |
| 0.05-0.10 | Acceptable |
| > 0.10 | Noticeable artifacts |
Note: Values are not directly comparable between DSSIM versions. Always report version.
Google's perceptual metric, good for high-quality comparisons.
Accuracy: 80-91% (varies by image type).
Best for: Determining if compression is "transparent" (score < 1.0).
Limitation: Less reliable for heavily compressed images.
Netflix's Video Multi-Method Assessment Fusion.
Accuracy: ~90% for video, slightly less for still images.
Best for: Large-scale automated testing, video frames.
Peak Signal-to-Noise Ratio — purely mathematical, ignores perception.
Accuracy: ~67% — only slightly better than chance.
Use only: For backwards compatibility with legacy benchmarks.
The number of pixels that fit in one degree of visual field. Critical for assessing when compression artifacts become visible.
| PPD | Context | Notes |
|---|---|---|
| 30 | 1080p at arm's length | Casual viewing |
| 60 | 20/20 vision threshold | Most artifacts visible |
| 80 | Average human acuity limit | Diminishing returns above this |
| 120 | 4K at close range | Overkill for most content |
| 159 | iPhone 15 Pro | "Retina" display density |
Formula:
PPD = (viewing_distance_inches × resolution_ppi × π) / 180
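The same calculation as a small Rust sketch, using the small-angle form above; the numbers in the comments are illustrative assumptions, not measurements:

```rust
use std::f64::consts::PI;

/// Pixels per degree of visual field for a display of `ppi` pixels per
/// inch viewed from `distance_inches` away (small-angle approximation).
fn pixels_per_degree(distance_inches: f64, ppi: f64) -> f64 {
    distance_inches * ppi * PI / 180.0
}

fn main() {
    // A ~92 ppi 1080p desktop monitor at 20 inches: ≈ 32 PPD.
    println!("{:.0}", pixels_per_degree(20.0, 92.0));
    // A ~460 ppi phone display at 20 inches: ≈ 161 PPD.
    println!("{:.0}", pixels_per_degree(20.0, 460.0));
}
```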
| Device Type | Typical PPD | Compression Tolerance |
|---|---|---|
| Desktop monitor | 40-80 | Medium quality acceptable |
| Laptop | 80-120 | Higher quality needed |
| Smartphone | 120-160 | Very high quality or artifacts visible |
| 4K TV at 3m | 30-40 | More compression acceptable |
The international standard for subjective video/image quality assessment.
Key elements:
When to use: Final validation of codec choices, publishing research.
| Method | Description | Best For |
|---|---|---|
| DSIS | Show reference, then test image | Impairment detection |
| DSCQS | Side-by-side, both unlabeled | Quality comparison |
| 2AFC | "Which is better?" forced choice | Fine discrimination |
| Flicker test | Rapid A/B alternation | Detecting subtle differences |
When metrics aren't enough, subjective testing provides ground truth. But poorly designed studies produce unreliable data.
Randomization:
Blinding:
Controls:
| Comparison Type | Minimum N | Recommended N |
|---|---|---|
| Large quality difference (obvious) | 15 | 20-30 |
| Medium difference (noticeable) | 30 | 50-80 |
| Small difference (subtle) | 80 | 150+ |
Power analysis: For 80% power to detect a 0.5 MOS difference with SD=1.0, you need ~64 participants per condition.
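That figure comes from the standard two-sample approximation n = 2 × ((z for α/2 + z for power) × SD / difference)². A sketch of the arithmetic, with the z values fixed for a two-sided α = 0.05 and 80% power:

```rust
/// Per-condition sample size for comparing two means:
/// n = 2 * ((z_alpha + z_beta) * sd / delta)^2.
fn sample_size_per_condition(delta_mos: f64, sd: f64) -> f64 {
    let z_alpha = 1.96;  // two-sided alpha = 0.05
    let z_beta = 0.8416; // 80% power
    2.0 * ((z_alpha + z_beta) * sd / delta_mos).powi(2)
}

fn main() {
    // Detect a 0.5 MOS difference with SD = 1.0: ≈ 63 per condition,
    // in line with the ~64 quoted above.
    println!("{:.0}", sample_size_per_condition(0.5, 1.0).ceil());
}
```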
Pre-study:
Exclusion criteria (define before data collection):
Embed these throughout the study:
Types of attention checks:
1. Obvious pairs - Original vs heavily compressed (SSIMULACRA2 < 30)
2. Identical pairs - Same image twice (should report "same" or 50/50 split)
3. Reversed pairs - Same comparison shown twice, order flipped
4. Instructed response - "For this pair, select the LEFT image"
Threshold: Exclude participants who fail > 2 attention checks or > 20% of obvious pairs.
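A sketch of applying that rule, assuming you keep per-participant tallies; the `Participant` struct here is hypothetical bookkeeping, not a codec-eval type:

```rust
/// Hypothetical per-participant attention-check tallies.
struct Participant {
    id: String,
    failed_attention_checks: u32,
    obvious_pairs_total: u32,
    obvious_pairs_failed: u32,
}

/// Keep only participants who fail at most 2 attention checks and at
/// most 20% of the obvious pairs.
fn retain_valid(participants: Vec<Participant>) -> Vec<Participant> {
    participants
        .into_iter()
        .filter(|p| {
            let obvious_fail_rate =
                p.obvious_pairs_failed as f64 / p.obvious_pairs_total.max(1) as f64;
            p.failed_attention_checks <= 2 && obvious_fail_rate <= 0.20
        })
        .collect()
}
```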
Position bias: Tendency to favor left/right or first/second.
Fatigue effects: Quality judgments degrade over time.
Anchoring: First few images bias subsequent judgments.
Central tendency: Avoiding extreme ratings.
For rating data (MOS):
1. Calculate mean and 95% CI per condition
2. Check normality (Shapiro-Wilk) - often violated
3. Use robust methods:
- Trimmed means (10-20% trim)
- Bootstrap confidence intervals
- Non-parametric tests (Wilcoxon, Kruskal-Wallis)
4. Report effect sizes (Cohen's d, or MOS difference)
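A dependency-free sketch of the robust summary in steps 1 to 3 above (a trimmed mean plus a percentile-bootstrap 95% CI); in practice a statistics package is the usual choice, but the mechanics are small enough to show inline:

```rust
/// Trimmed mean: drop the lowest and highest `trim` fraction of ratings.
fn trimmed_mean(ratings: &[f64], trim: f64) -> f64 {
    let mut v = ratings.to_vec();
    v.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let k = (v.len() as f64 * trim).floor() as usize;
    let kept = &v[k..v.len() - k];
    kept.iter().sum::<f64>() / kept.len() as f64
}

/// Percentile-bootstrap 95% CI for the 10% trimmed mean, using a tiny
/// xorshift PRNG so the sketch needs no external crates.
fn bootstrap_ci_95(ratings: &[f64], iters: usize) -> (f64, f64) {
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    let mut next = move || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        state
    };
    let n = ratings.len();
    let mut means: Vec<f64> = (0..iters)
        .map(|_| {
            let resample: Vec<f64> = (0..n)
                .map(|_| ratings[(next() % n as u64) as usize])
                .collect();
            trimmed_mean(&resample, 0.10)
        })
        .collect();
    means.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let lo = means[(iters as f64 * 0.025) as usize];
    let hi = means[((iters as f64 * 0.975) as usize).min(iters - 1)];
    (lo, hi)
}
```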
For forced choice (2AFC):
1. Calculate preference percentage per pair
2. Binomial test for significance (H0: 50%)
3. Apply multiple comparison correction:
- Bonferroni (conservative)
- Holm-Bonferroni (less conservative)
- Benjamini-Hochberg FDR (for many comparisons)
4. Report: "Codec A preferred 67% of time (p < 0.01, N=100)"
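A sketch of step 2 for 2AFC data: an exact two-sided binomial test against the 50% null, written in log space so it stays numerically stable for large N. A statistics library is the usual choice, and the correction in step 3 still has to be applied to the resulting p-values:

```rust
/// Two-sided exact binomial test against H0: p = 0.5.
/// Returns the p-value for observing `wins` preferences out of `n` trials.
fn binomial_test_half(wins: u64, n: u64) -> f64 {
    // log(k!) for k = 0..=n, built cumulatively.
    let log_fact: Vec<f64> = std::iter::once(0.0)
        .chain((1..=n).scan(0.0, |acc, i| {
            *acc += (i as f64).ln();
            Some(*acc)
        }))
        .collect();
    // log P(X = k) under Binomial(n, 0.5).
    let log_pmf = |k: u64| {
        log_fact[n as usize] - log_fact[k as usize] - log_fact[(n - k) as usize]
            + n as f64 * 0.5f64.ln()
    };
    // The null distribution is symmetric, so the two-sided p-value is
    // twice the tail beyond the more extreme of the two counts.
    let extreme = wins.max(n - wins);
    let tail: f64 = (extreme..=n).map(|k| log_pmf(k).exp()).sum();
    (2.0 * tail).min(1.0)
}

fn main() {
    // 67 preferences for codec A out of 100 comparisons.
    println!("p = {:.4}", binomial_test_half(67, 100));
}
```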
Outlier handling:
1. Define criteria BEFORE analysis (pre-registration)
2. Report both with and without outlier exclusion
3. Use robust statistics that down-weight outliers
4. Never exclude based on "inconvenient" results
Always include:
Example:
"N=87 participants completed the study (12 excluded: 8 failed attention checks, 4 incomplete). Codec A was preferred over Codec B in 62% of comparisons (95% CI: 55-69%, p=0.003, binomial test). This corresponds to a mean quality difference of 0.4 MOS points (95% CI: 0.2-0.6)."
| Pitfall | Problem | Solution |
|---|---|---|
| Small N | Underpowered, unreliable | Power analysis before study |
| No attention checks | Can't detect random responders | Embed 10-15% check trials |
| Post-hoc exclusion | Cherry-picking results | Pre-register exclusion criteria |
| Only reporting means | Hides variability | Show distributions + CI |
| Multiple comparisons | Inflated false positives | Apply correction (Bonferroni, FDR) |
| Unbalanced design | Confounds codec with position/order | Full counterbalancing |
| Lab-only testing | May not generalize | Include diverse participants/displays |
Cloudinary 2021 (1.4 million opinions):
# 1. Encode to same file size
convert source.png -define webp:target-size=50000 output.webp
cjxl source.png output.jxl --target_size 50000
# 2. Measure with SSIMULACRA2
ssimulacra2 source.png output.webp
ssimulacra2 source.png output.jxl
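If you script this in Rust rather than shell, one way to reach a common target size is to bisect each encoder's quality setting. The `encode` closure below is a placeholder for whatever encoder invocation you use, and the sketch assumes output size grows with the quality setting:

```rust
/// Bisect a quality setting until the encoded output lands within
/// `tolerance` bytes of `target_bytes`. Returns the last probed
/// (quality, bytes) pair.
fn match_file_size(
    target_bytes: usize,
    tolerance: usize,
    encode: impl Fn(f32) -> Vec<u8>,
) -> (f32, Vec<u8>) {
    let (mut lo, mut hi) = (0.0_f32, 100.0_f32);
    let mut best = (lo, encode(lo));
    for _ in 0..12 {
        let q = (lo + hi) / 2.0;
        let out = encode(q);
        best = (q, out.clone());
        if out.len().abs_diff(target_bytes) <= tolerance {
            break;
        }
        // Larger than the target: lower the quality ceiling, and vice versa.
        if out.len() > target_bytes { hi = q } else { lo = q }
    }
    best
}
```

Once both codecs hit the same size, compare their SSIMULACRA2 scores as above.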
| Implementation | Type | Install | Notes |
|---|---|---|---|
| ssimulacra2_rs | CLI (Rust) | cargo install ssimulacra2_rs | Recommended |
| ssimulacra2 | Library (Rust) | cargo add ssimulacra2 | For integration |
| ssimulacra2-cuda | GPU (CUDA) | cargo install ssimulacra2-cuda | Fast batch processing |
| libjxl | CLI (C++) | Build from source | Original implementation |
# Install CLI
cargo install ssimulacra2_rs
# Usage
ssimulacra2_rs original.png compressed.jpg
# Output: 76.543210 (higher = better, scale 0-100)
| Implementation | Type | Install | Notes |
|---|---|---|---|
| dssim | CLI (Rust) | cargo install dssim | Recommended |
# Install
cargo install dssim
# Basic comparison (lower = better)
dssim original.png compressed.jpg
# Output: 0.02341
# Generate difference visualization
dssim -o difference.png original.png compressed.jpg
Accuracy: Validated against TID2013 database. Spearman correlation -0.84 to -0.95 depending on distortion type.
| Implementation | Type | Install | Notes |
|---|---|---|---|
| butteraugli | CLI (C++) | Build from source | Original |
| libjxl | CLI (C++) | Build from source | Includes butteraugli |
| Implementation | Type | Install | Notes |
|---|---|---|---|
| libvmaf | CLI + Library | Package manager or build | Official Netflix implementation |
# Ubuntu/Debian
apt install libvmaf-dev
# Usage (via ffmpeg)
# First input is the distorted file, second is the reference
ffmpeg -i compressed.mp4 -i original.mp4 -lavfi libvmaf -f null -
| Tool | Purpose | Ecosystem |
|---|---|---|
| imageflow | High-performance image processing with quality calibration | Rust + C ABI |
| libvips | Fast image processing library | C + bindings |
| sharp | Node.js image processing (uses libvips) | npm |
See CONTRIBUTING.md for the full guide.
We especially want contributions from:
This project is designed to be community-driven. Fork it, experiment, share what you learn.
This guide is released under CC0 — use freely without attribution.