| Crates.io | dset |
| lib.rs | dset |
| version | 0.1.12 |
| created_at | 2025-02-15 19:25:01.212763+00 |
| updated_at | 2025-03-18 15:15:22.444074+00 |
| description | A Rust library for processing and managing dataset-related files, with a focus on machine learning datasets, captions, and safetensors files |
| homepage | https://github.com/rakki194/dset |
| repository | https://github.com/rakki194/dset |
| max_upload_size | |
| id | 1557036 |
| size | 190,333 |
A Rust library for processing and managing dataset-related files, particularly for machine learning datasets, captions, and safetensors files.
SafeTensors Processing
Caption File Handling
File Concatenation
File Operations
JSON Processing
Content Processing
Performance Features
Error Handling
You can inspect the state dictionary of a safetensors file using the inspect_state_dict function:
use dset::st::inspect_state_dict;
use std::path::Path;
use anyhow::Result;
async fn example() -> Result<()> {
let state_dict = inspect_state_dict(Path::new("model.safetensors")).await?;
println!("State Dictionary: {:?}", state_dict);
Ok(())
}
This function reads the state dictionary from the specified safetensors file and returns it as a JSON value.
The library excels at processing e621 JSON post data into standardized caption files, ideal for creating training datasets. The configuration is highly customizable using E621Config:
use dset::caption::{E621Config, process_e621_json_file};
use std::path::Path;
use std::collections::HashMap;
use anyhow::Result;
async fn process_with_custom_config() -> Result<()> {
// Create custom rating conversions
let mut custom_ratings = HashMap::new();
custom_ratings.insert("s".to_string(), "safe".to_string());
custom_ratings.insert("q".to_string(), "maybe".to_string());
custom_ratings.insert("e".to_string(), "nsfw".to_string());
let config = E621Config::new()
.with_filter_tags(false) // Disable tag filtering
.with_rating_conversions(Some(custom_ratings)) // Custom rating names
.with_format(Some("Rating: {rating}\nArtists: {artists}\nTags: {general}".to_string())); // Custom format
process_e621_json_file(Path::new("e621_post.json"), Some(config)).await
}
Tag Filtering (filter_tags: bool, default: true)
Rating Conversions (rating_conversions: Option<HashMap<String, String>>, pass None to use raw ratings)
Artist Formatting:
artist_prefix: Option<String> (default: Some("by "))
artist_suffix: Option<String> (default: None)
Set both to None for raw artist names
Format String (format: Option<String>)
Default: "{rating}, {artists}, {characters}, {species}, {copyright}, {general}, {meta}"
Placeholders:
{rating} - The rating (after conversion)
{artists} - Artist tags (with configured formatting)
{characters} - Character tags
{species} - Species tags
{copyright} - Copyright tags
{general} - General tags
{meta} - Meta tags
Artist Tags
Character Tags
Species Tags
Copyright Tags
General Tags
Meta Tags
Tag filtering is enabled by default but can be disabled. When enabled, the filtering system uses precompiled regular expressions to match and drop unwanted tags (for example, bare year tags such as "2023"). To disable filtering, call .with_filter_tags(false) on the E621Config builder.
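As a simplified illustration of this kind of filtering, the sketch below drops only bare four-digit year tags. This is a hypothetical stand-in, not the library's actual pattern set, which uses precompiled regexes and covers more cases:

```rust
// Hypothetical, simplified stand-in for should_ignore_e621_tag:
// drops bare four-digit year tags like "2023".
fn is_year_tag(tag: &str) -> bool {
    tag.len() == 4 && tag.chars().all(|c| c.is_ascii_digit())
}

// Keep every tag that the predicate does not reject.
fn filter_tags(tags: Vec<&str>) -> Vec<&str> {
    tags.into_iter().filter(|t| !is_year_tag(t)).collect()
}

fn main() {
    let tags = vec!["wolf", "2023", "digital_art"];
    println!("{:?}", filter_tags(tags)); // ["wolf", "digital_art"]
}
```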
Creates .txt caption files from e621 JSON posts, formatted as: [rating], [artist tags], [character tags], [other tags]
use dset::caption::{E621Config, process_e621_json_file};
use std::path::Path;
use anyhow::Result;
async fn process_e621() -> Result<()> {
// Process with default settings
process_e621_json_file(Path::new("e621_post.json"), None).await?;
// Process with custom format
let config = E621Config::new()
.with_format(Some("{rating}\nBy: {artists}\nTags: {general}".to_string()));
process_e621_json_file(Path::new("e621_post.json"), Some(config)).await?;
// Process with raw ratings (no conversion)
let config = E621Config::new()
.with_rating_conversions(None);
process_e621_json_file(Path::new("e621_post.json"), Some(config)).await?;
Ok(())
}
With default settings:
safe, by artist name, character name, species, tag1, tag2
With custom format:
Rating: safe
Artists: by artist name
Tags: tag1, tag2
With raw ratings:
s, by artist name, character name, species, tag1, tag2
use dset::caption::{E621Config, process_e621_json_file};
use std::path::Path;
use anyhow::Result;
use tokio::fs;
async fn batch_process_e621() -> Result<()> {
// Optional: customize processing for all files
let config = E621Config::new()
.with_filter_tags(false)
.with_format(Some("{rating}\n{artists}\n{general}".to_string()));
let mut entries = fs::read_dir("e621_posts").await?;
// tokio's ReadDir is not an Iterator; poll it with next_entry()
while let Some(entry) = entries.next_entry().await? {
    let path = entry.path();
    if path.extension().map_or(false, |ext| ext == "json") {
        process_e621_json_file(&path, Some(config.clone())).await?;
    }
}
Ok(())
}
The library provides comprehensive support for managing AI reasoning datasets, particularly useful for training language models in structured reasoning tasks. This functionality helps maintain consistent formatting and organization of reasoning data.
The reasoning dataset format consists of three main components:
Messages - Individual conversation messages:
Message {
content: String, // The message content
role: String, // The role (e.g., "user", "reasoning", "assistant")
}
Reasoning Entries - Complete reasoning interactions:
ReasoningEntry {
user: String, // The user's question/request
reasoning: String, // Detailed step-by-step reasoning
assistant: String, // Final summarized response
template: String, // Structured template combining all roles
conversations: Vec<Message>, // Complete conversation history
}
Dataset Collection - Collection of reasoning entries:
ReasoningDataset {
entries: Vec<ReasoningEntry>
}
Structured Data Management
Template Generation
Templates use <|im_start|> and <|im_end|> tokens
File Operations
Dataset Manipulation
Creating and Managing Datasets
use dset::reasoning::{ReasoningDataset, ReasoningEntry, Message};
use anyhow::Result;
async fn manage_dataset() -> Result<()> {
// Create a new dataset
let mut dataset = ReasoningDataset::new();
// Create an entry
let entry = ReasoningEntry {
user: "What motivates Luna?".to_string(),
reasoning: "Luna's motivations can be analyzed based on several factors:\n1. Desire for acceptance\n2. Self-expression needs\n3. Personal growth aspirations".to_string(),
assistant: "Luna is motivated by acceptance, self-expression, and personal growth.".to_string(),
template: ReasoningDataset::create_template(
"What motivates Luna?",
"Luna's motivations can be analyzed...",
"Luna is motivated by acceptance, self-expression, and personal growth."
),
conversations: vec![
Message {
content: "What motivates Luna?".to_string(),
role: "user".to_string(),
},
Message {
content: "Luna's motivations can be analyzed...".to_string(),
role: "reasoning".to_string(),
},
Message {
content: "Luna is motivated by acceptance, self-expression, and personal growth.".to_string(),
role: "assistant".to_string(),
},
],
};
// Add entry to dataset
dataset.add_entry(entry);
// Save dataset to file
dataset.save("reasoning_data.json").await?;
// Load dataset from file
let loaded_dataset = ReasoningDataset::load("reasoning_data.json").await?;
assert_eq!(loaded_dataset.len(), 1);
Ok(())
}
Working with Templates
use dset::reasoning::ReasoningDataset;
// Create a template string
let template = ReasoningDataset::create_template(
"What is the best approach?",
"Let's analyze this step by step...",
"Based on the analysis, the best approach is..."
);
// Template output format:
// <|im_start|>user
// What is the best approach?
// <|im_end|>
// <|im_start|>reasoning
// Let's analyze this step by step...
// <|im_end|>
// <|im_start|>assistant
// Based on the analysis, the best approach is...
// <|im_end|>
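The template format above can be reproduced with a small helper. The following is a sketch of the expected output shape, not the library's implementation, and the exact whitespace between blocks may differ:

```rust
// Sketch of a ChatML-style template builder matching the format shown
// above. Not dset's implementation; exact whitespace may differ.
fn build_template(user: &str, reasoning: &str, assistant: &str) -> String {
    [("user", user), ("reasoning", reasoning), ("assistant", assistant)]
        .iter()
        .map(|(role, content)| format!("<|im_start|>{role}\n{content}\n<|im_end|>"))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let t = build_template("What is X?", "Consider the options...", "X is Y.");
    println!("{t}");
}
```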
The dataset is saved in a structured JSON format:
{
"entries": [
{
"user": "What motivates Luna?",
"reasoning": "Luna's motivations can be analyzed...",
"assistant": "Luna is motivated by acceptance, self-expression, and personal growth.",
"template": "<|im_start|>user\n...<|im_end|>...",
"conversations": [
{
"content": "What motivates Luna?",
"role": "user"
},
{
"content": "Luna's motivations can be analyzed...",
"role": "reasoning"
},
{
"content": "Luna is motivated by acceptance, self-expression, and personal growth.",
"role": "assistant"
}
]
}
]
}
Structured Reasoning
Role Attribution
Template Management (using <|im_start|> and <|im_end|> tokens)
Error Handling
Async Operations
cargo add dset
The library uses the log crate for logging. To enable logging in your application:
Add a logging implementation like env_logger to your project:
cargo add env_logger
Initialize the logger in your application:
use env_logger;
fn main() {
env_logger::init();
// Your code here...
}
Set the log level using the RUST_LOG environment variable:
export RUST_LOG=info # Show info and error messages
export RUST_LOG=debug # Show debug, info, and error messages
export RUST_LOG=trace # Show all log messages
The library uses different log levels:
error: For unrecoverable errors
warn: For recoverable errors or unexpected conditions
info: For important operations and successful processing
debug: For detailed processing information
trace: For very detailed debugging information
process_safetensors_file(path: &Path) -> Result<()>
Processes a safetensors file by extracting its metadata and saving it as a JSON file.
Parameters:
path: Path to the safetensors file
Returns: Result indicating success or failure
Error Handling: Provides detailed context for failures including file opening issues, memory mapping errors, and metadata extraction failures
Performance: Uses memory mapping for efficient file access without loading the entire file into memory
Example:
process_safetensors_file(Path::new("model.safetensors")).await?;
// Creates model.safetensors.metadata.json
get_json_metadata(path: &Path) -> Result<Value>
Extracts and parses JSON metadata from a safetensors file.
Parameters:
path: Path to the safetensors file
Returns: The extracted metadata as a serde_json Value
Error Handling: Provides context for file opening, memory mapping, and JSON parsing errors
Performance: Uses memory mapping for efficient handling of large files
Example:
let metadata = get_json_metadata(Path::new("model.safetensors")).await?;
println!("Model metadata: {}", metadata);
decode_json_strings(value: Value) -> Value
Recursively decodes JSON-encoded strings within a serde_json::Value.
Parameters:
value: JSON value potentially containing encoded strings
Returns: Decoded JSON value with nested structures properly parsed
Example:
let raw_json = json!({"config": "{\"param\": 123}"});
let decoded = decode_json_strings(raw_json);
// Results in: {"config": {"param": 123}}
extract_training_metadata(raw_metadata: &Value) -> Value
Extracts and processes training metadata from raw safetensors metadata.
Parameters:
raw_metadata: Raw metadata from a safetensors file
Returns: Processed metadata with decoded JSON strings
Behavior: Checks the __metadata__ field first
Example:
let raw_meta = get_json_metadata(Path::new("model.safetensors")).await?;
let training_meta = extract_training_metadata(&raw_meta);
process_file(path: &Path) -> Result<()>
Processes a caption file in either JSON or plain text format.
Parameters:
path: Path to the caption file
Returns: Result indicating success or failure
Behavior: Auto-detects whether the file is JSON or plain text and processes it accordingly
Error Handling: Provides context for file I/O errors and JSON parsing failures
Example:
process_file(Path::new("caption.json")).await?;
json_to_text(json: &Value) -> Result<String>
Extracts caption text from a JSON value.
Parameters:
json: JSON value containing caption data
Returns: Extracted caption text
Error Handling: Returns error for unsupported JSON structures
Example:
let json = serde_json::from_str("{\"caption\": \"A beautiful landscape\"}")?;
let text = json_to_text(&json)?;
// text = "A beautiful landscape"
caption_file_exists_and_not_empty(path: &Path) -> bool
Checks if a caption file exists and has content.
Parameters:
path: Path to the caption file
Returns: Boolean indicating if the file exists and is not empty
Performance: Uses efficient file operations to avoid unnecessary reads
Example:
if caption_file_exists_and_not_empty(Path::new("caption.txt")).await {
println!("Caption file is valid");
}
process_e621_json_file(file_path: &Path, config: Option<E621Config>) -> Result<()>
Processes an e621 JSON file and creates a caption file.
Parameters:
file_path: Path to the e621 JSON file
config: Optional configuration for customizing processing
Returns: Result indicating success or failure
Behavior: Parses the post data, processes its tags, and writes a .txt caption file using the configured format
Error Handling: Provides context for file I/O errors and JSON parsing failures
Example:
let config = E621Config::new().with_filter_tags(true);
process_e621_json_file(Path::new("post.json"), Some(config)).await?;
process_e621_json_data(data: &Value, file_path: &Arc<PathBuf>, config: Option<E621Config>) -> Result<()>
Processes e621 JSON data and creates a caption file.
Parameters:
data: e621 JSON data as a serde_json Value
file_path: Path to the JSON file (used for output path calculation)
config: Optional configuration for customizing processing
Returns: Result indicating success or failure
Example:
let json_data = serde_json::from_str(json_str)?;
let path = Arc::new(PathBuf::from("post.json"));
process_e621_json_data(&json_data, &path, None).await?;
format_text_content(content: &str) -> Result<String>
Formats text content by normalizing whitespace.
Parameters:
content: Text content to format
Returns: Formatted text
Behavior: Trims leading and trailing whitespace and collapses runs of spaces and newlines into single spaces
Example:
let formatted = format_text_content(" Multiple spaces \n\n and newlines ")?;
// formatted = "Multiple spaces and newlines"
replace_string(path: &Path, search: &str, replace: &str) -> Result<()>
Replaces occurrences of a string in a file.
Parameters:
path: Path to the file
search: String to search for
replace: String to replace with
Returns: Result indicating success or failure
Performance: Reads the entire file into memory, so be cautious with very large files
Error Handling: Provides context for file I/O errors
Example:
replace_string(Path::new("caption.txt"), "old text", "new text").await?;
replace_special_chars(path: PathBuf) -> Result<()>
Replaces special characters in a file with standard ASCII equivalents.
Parameters:
path: Path to the file
Returns: Result indicating success or failure
Behavior: Converts characters such as smart quotes to their ASCII equivalents
Error Handling: Provides context for file I/O errors
Example:
replace_special_chars(PathBuf::from("document.txt")).await?;
should_ignore_e621_tag(tag: &str) -> bool
Determines if an e621 tag should be ignored.
Parameters:
tag: Tag to check
Returns: Boolean indicating if the tag should be ignored
Performance: Uses precompiled regex patterns for efficiency
Example:
if !should_ignore_e621_tag("2023") {
tags.push("2023");
}
process_e621_tags(tags_dict: &Value, config: Option<&E621Config>) -> Vec<String>
Processes e621 tags from a JSON dictionary.
Parameters:
tags_dict: JSON dictionary containing e621 tags
config: Optional configuration for customizing processing
Returns: Vector of processed tags
Example:
let tags = process_e621_tags(&tags_json, Some(&config));
ReasoningDataset::new() -> Self
Creates a new empty reasoning dataset.
Returns: Empty ReasoningDataset
Example:
let dataset = ReasoningDataset::new();
ReasoningDataset::load<P: AsRef<Path>>(path: P) -> Result<Self>
Loads a reasoning dataset from a JSON file.
Parameters:
path: Path to the JSON file
Returns: Loaded ReasoningDataset
Error Handling: Provides context for file I/O errors and JSON parsing failures
Example:
let dataset = ReasoningDataset::load("dataset.json").await?;
ReasoningDataset::save<P: AsRef<Path>>(&self, path: P) -> Result<()>
Saves the reasoning dataset to a JSON file.
Parameters:
path: Path to save the JSON file
Returns: Result indicating success or failure
Behavior: Creates a pretty-printed JSON file
Error Handling: Provides context for file I/O errors and JSON serialization failures
Example:
dataset.save("dataset.json").await?;
ReasoningDataset::add_entry(&mut self, entry: ReasoningEntry)
Adds a new entry to the dataset.
Parameters:
entry: ReasoningEntry to add
Example:
dataset.add_entry(entry);
ReasoningDataset::len(&self) -> usize
Returns the number of entries in the dataset.
Returns: Number of entries
Example:
let count = dataset.len();
ReasoningDataset::is_empty(&self) -> bool
Returns true if the dataset is empty.
Returns: Boolean indicating if the dataset is empty
Example:
if dataset.is_empty() {
println!("Dataset is empty");
}
ReasoningDataset::create_template(user: &str, reasoning: &str, assistant: &str) -> String
Creates a template string from user, reasoning, and assistant content.
Parameters:
user: User's question or request
reasoning: Detailed reasoning steps
assistant: Assistant's response
Returns: Formatted template string
Behavior: Creates a template with <|im_start|> and <|im_end|> tokens
Example:
let template = ReasoningDataset::create_template(
"What is X?",
"X can be determined by...",
"X is Y"
);
split_content(content: &str) -> (Vec<String>, String)
Splits content into tags and sentences.
Parameters:
content: Text content to split
Returns: Tuple of (tags vector, sentences string)
Example:
let (tags, text) = split_content("tag1, tag2, tag3., This is the main text.");
// tags = ["tag1", "tag2", "tag3"]
// text = "This is the main text."
use dset::{process_safetensors_file, get_json_metadata};
use std::path::Path;
use anyhow::Result;
async fn extract_metadata(path: &str) -> Result<()> {
// Extracts metadata and saves it as a JSON file
process_safetensors_file(Path::new(path)).await?;
// The output will be saved as "{path}.json"
// Alternatively, get the metadata directly
let metadata = get_json_metadata(Path::new(path)).await?;
println!("Model metadata: {}", metadata);
Ok(())
}
use dset::{
caption::process_file,
caption::process_json_to_caption,
caption::caption_file_exists_and_not_empty
};
use std::path::Path;
use anyhow::Result;
async fn handle_captions() -> Result<()> {
let path = Path::new("image1.txt");
// Check if caption file exists and has content
if caption_file_exists_and_not_empty(&path).await {
// Process the caption file (auto-detects format)
process_file(&path).await?;
}
// Convert JSON caption to text format
process_json_to_caption(Path::new("image2.json")).await?;
Ok(())
}
use dset::rename_file_without_image_extension;
use std::path::Path;
use std::io;
async fn handle_files() -> io::Result<()> {
// Remove intermediate image extensions from files
let path = Path::new("image.jpg.toml");
rename_file_without_image_extension(&path).await?; // Will rename to "image.toml"
// Won't modify files that are actually images
let img = Path::new("photo.jpg");
rename_file_without_image_extension(&img).await?; // Will remain "photo.jpg"
Ok(())
}
The library provides two main types of JSON processing capabilities besides the e621 caption processing:
Converts JSON files containing tag-probability pairs into caption files. Tags with probabilities above 0.2 are included in the output.
{
"tag1": 0.9,
"tag2": 0.5,
"tag3": 0.1
}
The above JSON would be converted to a caption file containing:
tag1, tag2
Example usage:
use dset::process_json_to_caption;
use std::path::Path;
use anyhow::Result;
async fn process_tags() -> Result<()> {
// Process a JSON file containing tag probabilities
// Input: tags.json
// {
// "person": 0.98,
// "smiling": 0.85,
// "outdoor": 0.45,
// "blurry": 0.15
// }
//
// Output: tags.txt
// person, smiling, outdoor
process_json_to_caption(Path::new("tags.json")).await?;
Ok(())
}
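The threshold rule described above (keep tags with probability above 0.2) can be sketched as a minimal re-implementation. This is an illustration, not dset's actual code, and the library's output ordering may differ from the alphabetical order used here:

```rust
use std::collections::BTreeMap;

// Minimal sketch of the tag-probability rule: keep tags whose
// probability exceeds 0.2, then join them with ", ".
// BTreeMap gives deterministic (alphabetical) iteration order.
fn tags_to_caption(tags: &BTreeMap<String, f64>) -> String {
    tags.iter()
        .filter(|&(_, &p)| p > 0.2)
        .map(|(tag, _)| tag.as_str())
        .collect::<Vec<_>>()
        .join(", ")
}

fn main() {
    let tags = BTreeMap::from([
        ("person".to_string(), 0.98),
        ("smiling".to_string(), 0.85),
        ("outdoor".to_string(), 0.45),
        ("blurry".to_string(), 0.15),
    ]);
    // "blurry" (0.15) is filtered out by the 0.2 threshold
    println!("{}", tags_to_caption(&tags)); // outdoor, person, smiling
}
```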
The function handles errors gracefully and processes files asynchronously.
The library provides two functions for general JSON handling:
format_json_file: Pretty prints any JSON file with proper indentation
process_json_file: Allows custom processing of JSON data with an async handler
Example usage:
use dset::{format_json_file, process_json_file};
use std::path::{Path, PathBuf};
use serde_json::Value;
use anyhow::Result;
async fn handle_json() -> Result<()> {
// Format a JSON file
format_json_file(Path::new("data.json").to_path_buf()).await?;
// Process JSON with custom handler
process_json_file(Path::new("data.json"), |json: &Value| async {
println!("Processing: {}", json);
Ok(())
}).await?;
Ok(())
}
Both functions handle errors gracefully and provide async processing capabilities.
use dset::split_content;
use log::info;
fn process_tags_and_text() {
let content = "tag1, tag2, tag3., This is the main text.";
let (tags, sentences) = split_content(content);
info!("Tags: {:?}", tags); // ["tag1", "tag2", "tag3"]
info!("Text: {}", sentences); // "This is the main text."
}
use dset::caption::{format_text_content, replace_string, replace_special_chars};
use std::path::{Path, PathBuf};
use anyhow::Result;
use log::info;
async fn example() -> Result<()> {
// Format text by normalizing whitespace
let formatted = format_text_content(" Multiple spaces \n\n and newlines ")?;
assert_eq!(formatted, "Multiple spaces and newlines");
// Replace text in a file
replace_string(Path::new("caption.txt"), "old text", "new text").await?;
// Replace special characters in a file (smart quotes, etc.)
replace_special_chars(PathBuf::from("document.txt")).await?;
Ok(())
}
The library uses anyhow for comprehensive error handling:
use dset::process_safetensors_file;
use std::path::Path;
use anyhow::{Context, Result};
use log::info;
async fn example() -> Result<()> {
process_safetensors_file(Path::new("model.safetensors"))
.await
.context("Failed to process safetensors file")?;
info!("Successfully processed safetensors file");
Ok(())
}
The concat module provides utilities for combining files with different extensions, which is particularly useful for dataset preparation. It supports concatenating tag files, caption files, and other auxiliary files into a single output file with intelligent tag deduplication.
Base files are matched by image extensions (.jpg, .png); auxiliary files use extensions such as .caption, .wd, and .tags.
The module uses two main types to configure concatenation:
Predefined configurations for common use cases:
pub enum FileExtensionPreset {
/// Concatenates .caption, .wd, .tags files into .txt
CaptionWdTags,
/// Concatenates .florence, .wd, .tags files
FlorenceWdTags,
}
Custom configuration for file concatenation:
pub struct ConcatConfig {
/// Base file extensions to find (without the dot)
pub base_extensions: Vec<String>,
/// Extensions to concatenate (without the dot)
pub extensions_to_concat: Vec<String>,
/// Output file extension (without the dot)
pub output_extension: String,
/// Set to true to remove duplicate tags
pub remove_duplicates: bool,
/// Tag separator to use when concatenating
pub tag_separator: String,
/// Set to true to deduplicate files with identical content
pub deduplicate_files: bool,
}
The concatenation process scans a directory for base image files (.png, .jpg), locates the matching auxiliary files, and merges their contents into a single output file.
The concat_tags function specifically handles parsing, deduplicating, and joining the tags with the configured separator.
use dset::concat::{ConcatConfig, FileExtensionPreset, concat_files};
use std::path::Path;
use anyhow::Result;
async fn concat_with_preset() -> Result<()> {
// Use a predefined preset for common configurations
let config = ConcatConfig::from_preset(FileExtensionPreset::CaptionWdTags);
// Process files in the specified directory
let processed_count = concat_files(Path::new("./dataset"), &config, false).await?;
println!("Processed {} files", processed_count);
Ok(())
}
use dset::concat::{ConcatConfig, concat_files};
use std::path::Path;
use anyhow::Result;
async fn concat_with_deduplication() -> Result<()> {
// Create a custom configuration with deduplication enabled
let config = ConcatConfig::new(
// Base extensions to look for
vec!["png".into(), "jpg".into()],
// Extensions to concatenate
vec!["caption".into(), "wd".into(), "tags".into()],
// Output extension
"txt".into(),
// Remove duplicate tags
true,
// Tag separator
", ".into(),
).with_deduplication(true); // Enable file deduplication
// Process files with dry run first (preview only)
let would_process = concat_files(Path::new("./dataset"), &config, true).await?;
println!("Would process {} files", would_process);
// Process files for real
let processed = concat_files(Path::new("./dataset"), &config, false).await?;
println!("Actually processed {} files", processed);
println!("Skipped {} duplicates", would_process - processed);
Ok(())
}
use dset::concat::{ConcatConfig, FileExtensionPreset, process_image_file};
use std::path::Path;
use anyhow::Result;
async fn process_single_file() -> Result<()> {
// Create configuration with the CaptionWdTags preset
let config = ConcatConfig::from_preset(FileExtensionPreset::CaptionWdTags);
// Process a single file
let success = process_image_file(
Path::new("./dataset/image.jpg"),
&config,
false // Not a dry run
).await?;
if success {
println!("Successfully processed image.jpg");
} else {
println!("Failed to process image.jpg (likely missing required files)");
}
Ok(())
}
use dset::concat::{ConcatConfig, check_duplicate_content};
use std::path::Path;
use std::sync::Arc;
use tokio::sync::Mutex;
use std::collections::HashMap;
use anyhow::Result;
async fn check_for_duplicates() -> Result<()> {
// Create configuration
let config = ConcatConfig::from_preset(FileExtensionPreset::CaptionWdTags)
.with_deduplication(true);
// Set up the deduplication hash table
let content_hashes = Arc::new(Mutex::new(HashMap::new()));
// Check if a file is a duplicate
let file_path = Path::new("./dataset/image1.jpg");
let is_duplicate = check_duplicate_content(
&file_path,
&config,
content_hashes.clone()
).await;
if is_duplicate {
println!("File is a duplicate of a previously processed file");
} else {
println!("File is unique");
}
Ok(())
}
The default output format is:
tag1, tag2, tag3, tag4, caption_content
Where the tags are the merged contents of the tag files (sorted, and deduplicated when remove_duplicates is set) and caption_content is the text from the caption file.
For example, with these input files:
image.jpg - The base image file
image.caption - Contains "a photo of a person"
image.wd - Contains "masterpiece, digital art"
image.tags - Contains "tag1, tag2, tag3"
The output image.txt would contain:
digital art, masterpiece, tag1, tag2, tag3, a photo of a person
Note that tags from the WebUI description file (.wd) are alphabetically sorted.
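Assuming sorting and deduplication behave as described above, the merge can be sketched like this. The sketch reproduces the example output but is not the library's implementation:

```rust
// Rough sketch of the tag merge described above: split the tag files,
// sort and optionally dedupe the tags, then append the caption text.
// Not dset's actual code.
fn merge_tags(tag_files: &[&str], caption: &str, remove_duplicates: bool, sep: &str) -> String {
    let mut tags: Vec<String> = tag_files
        .iter()
        .flat_map(|s| s.split(','))
        .map(|t| t.trim().to_string())
        .filter(|t| !t.is_empty())
        .collect();
    tags.sort();
    if remove_duplicates {
        tags.dedup(); // safe because the list is sorted
    }
    let mut out = tags.join(sep);
    if !caption.is_empty() {
        out.push_str(sep);
        out.push_str(caption);
    }
    out
}

fn main() {
    let out = merge_tags(
        &["masterpiece, digital art", "tag1, tag2, tag3"],
        "a photo of a person",
        true,
        ", ",
    );
    println!("{out}"); // digital art, masterpiece, tag1, tag2, tag3, a photo of a person
}
```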
Contributions are welcome! Please feel free to submit a Pull Request. When contributing:
This project is licensed under the MIT License.