| Crates.io | similarity-generic |
| lib.rs | similarity-generic |
| version | 0.4.1 |
| created_at | 2025-06-30 14:07:31.738824+00 |
| updated_at | 2025-08-13 17:44:36.970873+00 |
| description | Generic language similarity analyzer using tree-sitter |
| homepage | |
| repository | https://github.com/mizchi/similarity |
| max_upload_size | |
| id | 1731879 |
| size | 123,889 |
A generic code similarity analyzer using tree-sitter parsers. This tool provides configurable similarity detection for languages without dedicated implementations.
Tree-sitter grammars vary significantly between languages - each defines its own node types, field names, and AST structure. Creating dedicated parsers for every language requires:
- function_declaration in Go vs method_declaration in Java

This tool solves these challenges by:
For languages where performance is critical (Python, TypeScript, Rust), we provide optimized dedicated implementations. For everything else, similarity-generic offers a practical solution.
⚠️ This tool only supports languages with pre-installed tree-sitter parsers. It cannot analyze arbitrary file extensions (e.g., .xyz) without corresponding tree-sitter grammar support in the binary.
To add support for a new language:
Out of the box, similarity-generic supports:
- Go (go)
- Java (java)
- C (c)
- C++ (cpp, c++)
- C# (csharp, cs)
- Ruby (ruby, rb)

For Python, TypeScript/JavaScript, and Rust, please use the dedicated implementations:
- similarity-py - Optimized Python analyzer
- similarity-ts - Optimized TypeScript/JavaScript analyzer
- similarity-rs - (planned) Optimized Rust analyzer

cargo install similarity-generic
The binary includes the following tree-sitter parsers:
- tree-sitter-go
- tree-sitter-java
- tree-sitter-c
- tree-sitter-cpp
- tree-sitter-c-sharp
- tree-sitter-ruby

These are compiled into the binary, so no additional runtime dependencies are required.
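The relationship between the language identifiers (including their aliases) and the bundled grammars can be pictured as a simple lookup. The sketch below is illustrative only: `bundled_parser_for` is a hypothetical helper, not part of the crate's API, and exact, case-sensitive matching is an assumption.

```rust
/// Maps a `--language` value (or one of its aliases) to the tree-sitter
/// grammar bundled for it. Returns `None` when no grammar is compiled in.
/// Hypothetical helper for illustration; not the crate's actual API.
fn bundled_parser_for(language: &str) -> Option<&'static str> {
    match language {
        "go" => Some("tree-sitter-go"),
        "java" => Some("tree-sitter-java"),
        "c" => Some("tree-sitter-c"),
        "cpp" | "c++" => Some("tree-sitter-cpp"),
        "csharp" | "cs" => Some("tree-sitter-c-sharp"),
        "ruby" | "rb" => Some("tree-sitter-ruby"),
        // No grammar in the binary: a configuration file alone cannot help.
        _ => None,
    }
}
```

A language such as Kotlin falls through to `None`, which mirrors the restriction that a parser must be compiled into the binary before a configuration for it is useful.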
# Analyze Go code
similarity-generic path/to/file.go --language go
# Analyze Java code with custom threshold
similarity-generic src/Main.java --language java --threshold 0.9
# Show all functions in a file
similarity-generic file.cpp --language cpp --show-functions
You can provide custom language configurations using JSON files:
similarity-generic path/to/code --config my-language.json
- --language, -l - Specify the language (go, java, c, cpp, csharp, ruby)
- --config, -c - Path to custom language configuration JSON
- --threshold, -t - Similarity threshold (0.0-1.0, default: 0.85)
- --show-functions - Display all extracted functions
- --supported - Show list of supported languages
- --show-config - Display example configuration for a language

# Show supported languages
similarity-generic --supported
# Show Go language configuration
similarity-generic --show-config go
# Show C++ configuration
similarity-generic --show-config cpp
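To make the effect of --threshold concrete, here is a minimal sketch of how scored function pairs might be filtered and printed. The `ScoredPair` type and all three functions are hypothetical and only illustrate the idea. Whether the tool compares with `>=` or `>` is an assumption, and the output format is copied from the example output in this README.

```rust
/// A scored pair of function names. Hypothetical type for illustration.
#[derive(Debug, Clone, PartialEq)]
struct ScoredPair {
    left: String,
    right: String,
    /// Similarity in the range 0.0..=1.0.
    score: f64,
}

/// Rejects thresholds outside the documented 0.0-1.0 range (NaN included).
fn validate_threshold(threshold: f64) -> Result<f64, String> {
    if (0.0..=1.0).contains(&threshold) {
        Ok(threshold)
    } else {
        Err(format!("threshold must be between 0.0 and 1.0, got {}", threshold))
    }
}

/// Keeps only the pairs whose score is at least `threshold`.
/// Assumption: the comparison is inclusive.
fn filter_by_threshold(pairs: &[ScoredPair], threshold: f64) -> Vec<ScoredPair> {
    pairs.iter().filter(|p| p.score >= threshold).cloned().collect()
}

/// Formats a pair as `left <-> right: NN.NN%`.
fn format_pair(pair: &ScoredPair) -> String {
    format!("{} <-> {}: {:.2}%", pair.left, pair.right, pair.score * 100.0)
}
```

With the default threshold of 0.85, a pair scored 0.925 is kept and a pair scored 0.40 is dropped. Raising the threshold to 0.95 would drop both.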
Language configurations are JSON files that define how to parse and extract functions from source code.
{
  "language": "string",                  // Language identifier
  "function_nodes": ["string"],          // AST node types representing functions
  "type_nodes": ["string"],              // AST node types representing types/classes
  "field_mappings": {                    // Field names in AST nodes
    "name_field": "string",              // Field containing function/type name
    "params_field": "string",            // Field containing parameters
    "body_field": "string",              // Field containing function body
    "decorator_field": "string",         // Optional: Field for decorators
    "class_field": "string"              // Optional: Field for parent class
  },
  "value_nodes": ["string"],             // Node types to extract text from
  "test_patterns": {                     // Optional: Patterns to identify tests
    "attribute_patterns": ["string"],    // Attribute patterns
    "name_prefixes": ["string"],         // Function name prefixes
    "name_suffixes": ["string"]          // Function name suffixes
  }
}
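For readers who think in types, the schema can be modelled as plain Rust structs, with the fields marked optional becoming `Option`. This is a sketch under assumptions: the struct names and the `check_config` rules are hypothetical and are not taken from the crate's source, which may model and validate configurations differently.

```rust
/// Field names used to pull pieces out of a function or type node.
#[derive(Debug, Clone, PartialEq)]
struct FieldMappings {
    name_field: String,
    params_field: String,
    body_field: String,
    decorator_field: Option<String>, // optional in the schema
    class_field: Option<String>,     // optional in the schema
}

/// Patterns used to recognise test functions.
#[derive(Debug, Clone, PartialEq)]
struct TestPatterns {
    attribute_patterns: Vec<String>,
    name_prefixes: Vec<String>,
    name_suffixes: Vec<String>,
}

/// One language configuration, mirroring the JSON keys one-to-one.
#[derive(Debug, Clone, PartialEq)]
struct LanguageConfig {
    language: String,
    function_nodes: Vec<String>,
    type_nodes: Vec<String>,
    field_mappings: FieldMappings,
    value_nodes: Vec<String>,
    test_patterns: Option<TestPatterns>, // optional in the schema
}

/// Basic sanity checks. The rules here are assumptions chosen for the
/// example: a language name, at least one function node type, and
/// non-empty required field mappings.
fn check_config(config: &LanguageConfig) -> Result<(), String> {
    if config.language.trim().is_empty() {
        return Err("language must not be empty".to_string());
    }
    if config.function_nodes.is_empty() {
        return Err("function_nodes must list at least one node type".to_string());
    }
    let m = &config.field_mappings;
    for (key, value) in [
        ("name_field", &m.name_field),
        ("params_field", &m.params_field),
        ("body_field", &m.body_field),
    ] {
        if value.trim().is_empty() {
            return Err(format!("field_mappings.{} must not be empty", key));
        }
    }
    Ok(())
}
```

A check like this catches an empty `function_nodes` list before any parsing starts, which is easier to diagnose than a run that silently finds no functions.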
{
  "language": "go",
  "function_nodes": [
    "function_declaration",
    "method_declaration"
  ],
  "type_nodes": [
    "type_declaration",
    "struct_type",
    "interface_type"
  ],
  "field_mappings": {
    "name_field": "name",
    "params_field": "parameters",
    "body_field": "body"
  },
  "value_nodes": [
    "identifier",
    "interpreted_string_literal",
    "raw_string_literal",
    "int_literal",
    "float_literal",
    "true",
    "false",
    "nil"
  ],
  "test_patterns": {
    "attribute_patterns": [],
    "name_prefixes": ["Test", "Benchmark"],
    "name_suffixes": ["_test"]
  }
}
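The test_patterns entries in the Go configuration suggest a simple name check. The sketch below shows one plausible reading, assuming plain case-sensitive prefix and suffix matching. The function names are hypothetical and the tool's real matching rules may differ.

```rust
/// Returns true when `name` starts with any prefix or ends with any suffix.
/// Assumption: matching is case-sensitive and purely textual.
fn looks_like_test(name: &str, prefixes: &[&str], suffixes: &[&str]) -> bool {
    prefixes.iter().any(|p| name.starts_with(*p))
        || suffixes.iter().any(|s| name.ends_with(*s))
}

/// Prefixes and suffixes copied from the Go example configuration.
const GO_PREFIXES: [&str; 2] = ["Test", "Benchmark"];
const GO_SUFFIXES: [&str; 1] = ["_test"];

/// Applies the Go patterns to a single function name.
fn is_go_test(name: &str) -> bool {
    looks_like_test(name, &GO_PREFIXES, &GO_SUFFIXES)
}
```

Under this reading, `TestCalculateSum` is classified as a test and `calculateSum` is not. Purely textual matching also flags a name like `Testify`, so prefixes are best kept specific.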
For a hypothetical language:
{
  "language": "mylang",
  "function_nodes": [
    "function_definition",
    "lambda_expression"
  ],
  "type_nodes": [
    "class_definition",
    "trait_definition"
  ],
  "field_mappings": {
    "name_field": "identifier",
    "params_field": "parameter_list",
    "body_field": "block",
    "decorator_field": "annotations"
  },
  "value_nodes": [
    "identifier",
    "string_literal",
    "number_literal"
  ],
  "test_patterns": {
    "attribute_patterns": ["@test", "@Test"],
    "name_prefixes": ["test_"],
    "name_suffixes": ["_test", "Test"]
  }
}
If the language's tree-sitter parser is already included in the binary, you can create a custom configuration:
To add support for a language not currently included:
# In Cargo.toml
tree-sitter-yourlang = "0.x"
Update the parser in src/main.rs and generic_tree_sitter_parser.rs.
Note: You cannot simply create a configuration file for an arbitrary language. The tree-sitter parser must be compiled into the binary first.
To discover the node types for your language:
# Find similar functions in a Go file
similarity-generic main.go --language go
# Example output:
# Comparing functions for similarity...
# calculateSum <-> computeTotal: 92.50%
# Analyze entire Go project
find . -name "*.go" -exec similarity-generic {} --language go \;
# Show functions in the example file
$ similarity-generic examples/sample.go --language go --show-functions
Found 4 functions:
calculateSum examples/sample.go:6-12
computeTotal examples/sample.go:14-20
printMessage examples/sample.go:23-25
TestCalculateSum examples/sample.go:28-33
# Detect similar functions
$ similarity-generic examples/sample.go --language go
Comparing functions for similarity...
calculateSum <-> computeTotal: 91.30%
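When scripting around the tool, the --show-functions listing can be turned into structured data. The parser below is a sketch that assumes each entry looks exactly like the sample lines above (`name path:start-end`, separated by whitespace). `FunctionLocation` and `parse_function_line` are hypothetical names, and the real output format is not guaranteed to stay stable.

```rust
/// One entry from a `--show-functions` listing. Hypothetical type.
#[derive(Debug, PartialEq)]
struct FunctionLocation {
    name: String,
    path: String,
    start_line: usize,
    end_line: usize,
}

/// Parses one line such as `calculateSum examples/sample.go:6-12`.
/// Returns `None` for lines that do not match, for example the
/// `Found 4 functions:` header.
fn parse_function_line(line: &str) -> Option<FunctionLocation> {
    // The function name is everything before the first whitespace.
    let (name, location) = line.trim().split_once(char::is_whitespace)?;
    // Split on the last ':' so that paths containing ':' still work.
    let (path, range) = location.trim().rsplit_once(':')?;
    let (start, end) = range.split_once('-')?;
    Some(FunctionLocation {
        name: name.to_string(),
        path: path.to_string(),
        start_line: start.parse().ok()?,
        end_line: end.parse().ok()?,
    })
}
```

Lines that do not fit the pattern are skipped by returning `None`, so the header and blank lines need no special handling. Paths that contain spaces are handled too, since everything after the first whitespace is treated as the location.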
# Get the current Go configuration
similarity-generic --show-config go > my-go-config.json
# Edit my-go-config.json to customize behavior
# For example, add more test patterns or change node types
# Use the custom configuration
similarity-generic main.go --config my-go-config.json
Note: Custom configurations only work for languages already supported by the binary. You cannot analyze .kt (Kotlin) files unless tree-sitter-kotlin is added to the project.
The generic parser uses tree-sitter, which is generally slower than specialized parsers. For languages with dedicated implementations (Python, TypeScript, Rust), use those tools for better performance:
- similarity-py - ~10x faster for Python
- similarity-ts - Uses oxc_parser for superior TypeScript/JavaScript performance
- similarity-rs - (planned) Optimized Rust implementation

To contribute support for a new language:
Add the dependency in Cargo.toml:
tree-sitter-yourlang = { workspace = true }
Create a config in language_configs/yourlang.json:
{
  "language": "yourlang",
  "function_nodes": ["function_declaration"],
  "type_nodes": ["class_declaration"],
  "field_mappings": {
    "name_field": "name",
    "params_field": "parameters",
    "body_field": "body"
  },
  "value_nodes": ["identifier", "string_literal"]
}
Update the parser in main.rs and generic_tree_sitter_parser.rs.
For example, to add Kotlin support, add the dependency in Cargo.toml:
tree-sitter-kotlin = "0.3"
Create config in language_configs/kotlin.json:
{
  "language": "kotlin",
  "function_nodes": ["function_declaration"],
  "type_nodes": ["class_declaration", "object_declaration"],
  "field_mappings": {
    "name_field": "simple_identifier",
    "params_field": "value_parameters",
    "body_field": "function_body"
  },
  "value_nodes": ["simple_identifier", "string_literal"],
  "test_patterns": {
    "attribute_patterns": ["@Test"],
    "name_prefixes": ["test"],
    "name_suffixes": []
  }
}
Update parser in src/main.rs:
"kotlin" => tree_sitter_kotlin::LANGUAGE.into(),
The build process automatically embeds all JSON files from language_configs/ into the binary, making them available at runtime without external file dependencies.
MIT