similar_lines

Crates.io: similar_lines
lib.rs: similar_lines
version: 0.1.0
created_at: 2025-10-13 02:40:57.195592+00
updated_at: 2025-10-13 02:40:57.195592+00
description: Detect identical lines shared between two repositories using a suffix-array index
homepage: https://github.com/vincentzed/inference/tree/main/similar_lines
repository: https://github.com/vincentzed/inference
max_upload_size
id: 1879934
size: 8,408,233
Vincent Zhong (vincentzed)

README

Similar Lines

Detect identical lines of source code shared between two repositories using a suffix-array index backed by libsufr. The project provides both a reusable Rust library and a CLI front-end.

How it Works

  1. Walk both trees with the same ignore semantics as Git.
  2. Normalize each candidate line (trim, expand tabs) and discard lines shorter than the configured threshold.
  3. Concatenate the surviving lines into a single byte string S = L₀ ∘ 0x1E ∘ L₁ ∘ …, using the byte 0x1E (the ASCII record separator) as the delimiter between lines.
  4. Build a suffix array A over S; adjacent suffixes with an LCP (longest-common-prefix) equal to the entire line correspond to duplicated lines.
  5. Group matches that contain occurrences from both repositories and report their locations.
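
As an illustration of step 2, the following is a minimal sketch of line normalization. The function name `normalize_line`, the `tab_width` parameter, and the choice to measure length in bytes are assumptions made for this example; the crate's own normalization may differ in detail.

```rust
/// Normalize one raw source line, or return `None` if it is too short.
/// Hypothetical helper: the name, the tab width parameter, and the
/// byte-based length check are assumptions, not the crate's API.
fn normalize_line(raw: &str, tab_width: usize, min_length: usize) -> Option<String> {
    // Trim surrounding whitespace, then expand any interior tabs to spaces.
    let normalized = raw.trim().replace('\t', &" ".repeat(tab_width));
    // Discard lines below the configured threshold (measured in bytes here).
    if normalized.len() < min_length {
        None
    } else {
        Some(normalized)
    }
}
```

With a threshold of 5, a line such as `  }  ` is dropped, while an indented `let x = 1;` survives without its indentation.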

Index construction runs in O(|S| log |S|) time, followed by a single linear scan of the sorted suffixes.
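
Steps 3 to 5 can be sketched with the standard library alone. This is not how libsufr builds its index: the sketch sorts suffixes with a plain comparison sort, which is far slower than O(|S| log |S|) in the worst case, and is meant only to show the LCP test. The `Repo` enum and `shared_lines` function are hypothetical names, and the sketch assumes that every line is terminated by 0x1E and that no line contains that byte.

```rust
const SEP: u8 = 0x1E; // ASCII record separator, used as the line delimiter

/// Which repository a line came from (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Repo {
    A,
    B,
}

/// Length of the longest common prefix of two byte strings.
fn lcp(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Return (line content, indices into `lines`) for every line that occurs
/// in both repositories. Assumes no line contains the SEP byte.
fn shared_lines(lines: &[(Repo, &str)]) -> Vec<(String, Vec<usize>)> {
    // Step 3: concatenate, terminating each line with SEP and
    // remembering the offset at which each line starts.
    let mut s = Vec::new();
    let mut starts = Vec::with_capacity(lines.len());
    for (_, line) in lines {
        starts.push(s.len());
        s.extend_from_slice(line.as_bytes());
        s.push(SEP);
    }

    // Step 4 (naive): sort all suffix offsets lexicographically.
    let mut sa: Vec<usize> = (0..s.len()).collect();
    sa.sort_by(|&a, &b| s[a..].cmp(&s[b..]));

    // Keep only suffixes that begin a line, as line indices in suffix order.
    let ordered: Vec<usize> = sa
        .iter()
        .filter_map(|pos| starts.binary_search(pos).ok())
        .collect();

    // Linear scan: neighbours whose LCP covers the whole line plus its
    // separator are copies of the same line.
    let mut groups = Vec::new();
    let mut i = 0;
    while i < ordered.len() {
        let len = lines[ordered[i]].1.len();
        let mut j = i + 1;
        while j < ordered.len()
            && lcp(&s[starts[ordered[j - 1]]..], &s[starts[ordered[j]]..]) > len
        {
            j += 1;
        }
        let mut members = ordered[i..j].to_vec();
        members.sort_unstable();
        // Step 5: report only groups with occurrences from both repositories.
        let has_a = members.iter().any(|&k| lines[k].0 == Repo::A);
        let has_b = members.iter().any(|&k| lines[k].0 == Repo::B);
        if has_a && has_b {
            groups.push((lines[ordered[i]].1.to_string(), members));
        }
        i = j;
    }
    groups
}
```

Because all suffixes sharing a prefix are contiguous in sorted order, comparing each line-start suffix only with its predecessor is enough to collect every copy of a line.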

Requirements

  • Rust 1.80 or newer
  • Network access the first time you build, to fetch the libsufr dependency

Building

cargo build --release

Code Quality

cargo fmt
cargo clippy --all-targets --all-features -- -D warnings
cargo test

CLI Usage

cargo run --release -- \
  /path/to/repo-a \
  /path/to/repo-b \
  --min-length 40 \
  --max-results 25 \
  --format json \
  --output matches.json

Key flags:

  • --min-length: minimum normalized line length (defaults to 30)
  • --max-results: optional cap on reported groups
  • --format: text (default) or json
  • --output: write to a file instead of stdout

Library Usage

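use similar_lines::{find_similar_lines, Config};

fn example() -> anyhow::Result<()> {
    // Compare two repositories, keeping lines with a normalized length
    // of at least 48 and reporting at most 5 groups.
    let results = find_similar_lines(&Config {
        repo_a: "../repo-a".into(),
        repo_b: "../repo-b".into(),
        min_length: 48,
        max_results: Some(5),
    })?;

    // Print each shared line with the number of places it occurs.
    for group in results {
        println!("{} ({} hits)", group.content, group.occurrences.len());
    }
    Ok(())
}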

Each returned MatchGroup holds the duplicated line (content) and all of its occurrences, each of which records the repository name, relative path, and line number.
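
To make the shape of a result concrete, the sketch below defines stand-in types and renders one group. Only the `content` and `occurrences` field names come from the usage example above; the `Occurrence` field names (`repo`, `path`, `line`), their types, and the `render` helper are assumptions for illustration and may not match the crate's definitions.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for one location of a shared line.
/// Field names and types are assumptions, not the crate's definitions.
#[derive(Debug, Clone)]
struct Occurrence {
    repo: String,
    path: PathBuf,
    line: usize,
}

/// Hypothetical stand-in for a group of identical lines.
#[derive(Debug, Clone)]
struct MatchGroup {
    content: String,
    occurrences: Vec<Occurrence>,
}

/// Render a group as the shared line followed by one
/// `repo:path:line` entry per occurrence.
fn render(group: &MatchGroup) -> String {
    let mut out = format!("{} ({} hits)\n", group.content, group.occurrences.len());
    for occ in &group.occurrences {
        out.push_str(&format!("  {}:{}:{}\n", occ.repo, occ.path.display(), occ.line));
    }
    out
}
```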

Notes

  • Binary files are skipped using content_inspector heuristics.
  • Paths within a repository are reported relative to the provided root.
  • The CLI reports matches only when both repositories contribute at least one occurrence.
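
The first two notes can be illustrated with standard-library code. The binary check below is a deliberately simplified stand-in (a NUL byte among the first 1024 bytes) and is not content_inspector's implementation; both function names are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Simplified binary check: treat a file as binary if a NUL byte appears
/// in its first 1024 bytes. This is an assumption for illustration only;
/// content_inspector applies its own heuristics.
fn looks_binary(bytes: &[u8]) -> bool {
    bytes.iter().take(1024).any(|&b| b == 0)
}

/// Express `file` relative to the repository `root`, or `None` if the
/// file does not live under that root.
fn relative_to_root(root: &Path, file: &Path) -> Option<PathBuf> {
    file.strip_prefix(root).ok().map(Path::to_path_buf)
}
```

A path such as /repo-a/src/lib.rs under the root /repo-a would be reported as src/lib.rs.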