| Crates.io | dset |
| lib.rs | dset |
| version | 0.1.12 |
| created_at | 2025-02-15 19:25:01.212763+00 |
| updated_at | 2025-03-18 15:15:22.444074+00 |
| description | A Rust library for processing and managing dataset-related files, with a focus on machine learning datasets, captions, and safetensors files |
| homepage | https://github.com/rakki194/dset |
| repository | https://github.com/rakki194/dset |
| max_upload_size | |
| id | 1557036 |
| size | 190,333 |
A Rust library for processing and managing dataset-related files, particularly for machine learning datasets, captions, and safetensors files.
SafeTensors Processing
Caption File Handling
File Concatenation
File Operations
JSON Processing
Content Processing
Performance Features
Error Handling
You can inspect the state dictionary of a safetensors file using the inspect_state_dict function:
use dset::st::inspect_state_dict;
use std::path::Path;
use anyhow::Result;
async fn example() -> Result<()> {
let state_dict = inspect_state_dict(Path::new("model.safetensors")).await?;
println!("State Dictionary: {:?}", state_dict);
Ok(())
}
This function reads the state dictionary from the specified safetensors file and returns it as a JSON value.
The library excels at processing e621 JSON post data into standardized caption files, ideal for creating training datasets. The configuration is highly customizable using E621Config:
use dset::caption::{E621Config, process_e621_json_file};
use std::path::Path;
use std::collections::HashMap;
use anyhow::Result;
async fn process_with_custom_config() -> Result<()> {
// Create custom rating conversions
let mut custom_ratings = HashMap::new();
custom_ratings.insert("s".to_string(), "safe".to_string());
custom_ratings.insert("q".to_string(), "maybe".to_string());
custom_ratings.insert("e".to_string(), "nsfw".to_string());
let config = E621Config::new()
.with_filter_tags(false) // Disable tag filtering
.with_rating_conversions(Some(custom_ratings)) // Custom rating names
.with_format(Some("Rating: {rating}\nArtists: {artists}\nTags: {general}".to_string())); // Custom format
process_e621_json_file(Path::new("e621_post.json"), Some(config)).await
}
Tag Filtering (filter_tags: bool, default: true)
Rating Conversions (rating_conversions: Option<HashMap<String, String>>, pass None to use raw ratings)
Artist Formatting:
artist_prefix: Option<String> (default: Some("by "))
artist_suffix: Option<String> (default: None)
Set both to None for raw artist names
Format String (format: Option<String>)
Default: "{rating}, {artists}, {characters}, {species}, {copyright}, {general}, {meta}"
Placeholders:
{rating} - The rating (after conversion)
{artists} - Artist tags (with configured formatting)
{characters} - Character tags
{species} - Species tags
{copyright} - Copyright tags
{general} - General tags
{meta} - Meta tags
Artist Tags
Character Tags
Species Tags
Copyright Tags
General Tags
Meta Tags
Tag filtering is enabled by default but can be disabled. When enabled, the filtering system uses precompiled regular expressions to match and drop unwanted tags (for example, bare year tags such as "2023"). To disable filtering, call .with_filter_tags(false) on the E621Config builder.
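As a simplified illustration of this kind of filtering, the sketch below drops only bare four-digit year tags. This is a hypothetical stand-in, not the library's actual pattern set, which uses precompiled regexes and covers more cases:

```rust
// Hypothetical, simplified stand-in for should_ignore_e621_tag:
// drops bare four-digit year tags like "2023".
fn is_year_tag(tag: &str) -> bool {
    tag.len() == 4 && tag.chars().all(|c| c.is_ascii_digit())
}

// Keep every tag that the predicate does not reject.
fn filter_tags(tags: Vec<&str>) -> Vec<&str> {
    tags.into_iter().filter(|t| !is_year_tag(t)).collect()
}

fn main() {
    let tags = vec!["wolf", "2023", "digital_art"];
    println!("{:?}", filter_tags(tags)); // ["wolf", "digital_art"]
}
```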
Creates .txt caption files from e621 JSON posts, formatted as: [rating], [artist tags], [character tags], [other tags]
use dset::caption::{E621Config, process_e621_json_file};
use std::path::Path;
use anyhow::Result;
async fn process_e621() -> Result<()> {
// Process with default settings
process_e621_json_file(Path::new("e621_post.json"), None).await?;
// Process with custom format
let config = E621Config::new()
.with_format(Some("{rating}\nBy: {artists}\nTags: {general}".to_string()));
process_e621_json_file(Path::new("e621_post.json"), Some(config)).await?;
// Process with raw ratings (no conversion)
let config = E621Config::new()
.with_rating_conversions(None);
process_e621_json_file(Path::new("e621_post.json"), Some(config)).await?;
Ok(())
}
With default settings:
safe, by artist name, character name, species, tag1, tag2
With custom format:
Rating: safe
Artists: by artist name
Tags: tag1, tag2
With raw ratings:
s, by artist name, character name, species, tag1, tag2
use dset::caption::{E621Config, process_e621_json_file};
use std::path::Path;
use anyhow::Result;
use tokio::fs;
async fn batch_process_e621() -> Result<()> {
// Optional: customize processing for all files
let config = E621Config::new()
.with_filter_tags(false)
.with_format(Some("{rating}\n{artists}\n{general}".to_string()));
let mut entries = fs::read_dir("e621_posts").await?;
// tokio's ReadDir is not an Iterator; poll it with next_entry()
while let Some(entry) = entries.next_entry().await? {
    let path = entry.path();
    if path.extension().map_or(false, |ext| ext == "json") {
        process_e621_json_file(&path, Some(config.clone())).await?;
    }
}
Ok(())
}
The library provides comprehensive support for managing AI reasoning datasets, particularly useful for training language models in structured reasoning tasks. This functionality helps maintain consistent formatting and organization of reasoning data.
The reasoning dataset format consists of three main components:
Messages - Individual conversation messages:
Message {
content: String, // The message content
role: String, // The role (e.g., "user", "reasoning", "assistant")
}
Reasoning Entries - Complete reasoning interactions:
ReasoningEntry {
user: String, // The user's question/request
reasoning: String, // Detailed step-by-step reasoning
assistant: String, // Final summarized response
template: String, // Structured template combining all roles
conversations: Vec<Message>, // Complete conversation history
}
Dataset Collection - Collection of reasoning entries:
ReasoningDataset {
entries: Vec<ReasoningEntry>
}
Structured Data Management
Template Generation
Templates use <|im_start|> and <|im_end|> tokens
File Operations
Dataset Manipulation
Creating and Managing Datasets
use dset::reasoning::{ReasoningDataset, ReasoningEntry, Message};
use anyhow::Result;
async fn manage_dataset() -> Result<()> {
// Create a new dataset
let mut dataset = ReasoningDataset::new();
// Create an entry
let entry = ReasoningEntry {
user: "What motivates Luna?".to_string(),
reasoning: "Luna's motivations can be analyzed based on several factors:\n1. Desire for acceptance\n2. Self-expression needs\n3. Personal growth aspirations".to_string(),
assistant: "Luna is motivated by acceptance, self-expression, and personal growth.".to_string(),
template: ReasoningDataset::create_template(
"What motivates Luna?",
"Luna's motivations can be analyzed...",
"Luna is motivated by acceptance, self-expression, and personal growth."
),
conversations: vec![
Message {
content: "What motivates Luna?".to_string(),
role: "user".to_string(),
},
Message {
content: "Luna's motivations can be analyzed...".to_string(),
role: "reasoning".to_string(),
},
Message {
content: "Luna is motivated by acceptance, self-expression, and personal growth.".to_string(),
role: "assistant".to_string(),
},
],
};
// Add entry to dataset
dataset.add_entry(entry);
// Save dataset to file
dataset.save("reasoning_data.json").await?;
// Load dataset from file
let loaded_dataset = ReasoningDataset::load("reasoning_data.json").await?;
assert_eq!(loaded_dataset.len(), 1);
Ok(())
}
Working with Templates
use dset::reasoning::ReasoningDataset;
// Create a template string
let template = ReasoningDataset::create_template(
"What is the best approach?",
"Let's analyze this step by step...",
"Based on the analysis, the best approach is..."
);
// Template output format:
// <|im_start|>user
// What is the best approach?
// <|im_end|>
// <|im_start|>reasoning
// Let's analyze this step by step...
// <|im_end|>
// <|im_start|>assistant
// Based on the analysis, the best approach is...
// <|im_end|>
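The template format above can be reproduced with a small helper. The following is a sketch of the expected output shape, not the library's implementation, and the exact whitespace between blocks may differ:

```rust
// Sketch of a ChatML-style template builder matching the format shown
// above. Not dset's implementation; exact whitespace may differ.
fn build_template(user: &str, reasoning: &str, assistant: &str) -> String {
    [("user", user), ("reasoning", reasoning), ("assistant", assistant)]
        .iter()
        .map(|(role, content)| format!("<|im_start|>{role}\n{content}\n<|im_end|>"))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let t = build_template("What is X?", "Consider the options...", "X is Y.");
    println!("{t}");
}
```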
The dataset is saved in a structured JSON format:
{
"entries": [
{
"user": "What motivates Luna?",
"reasoning": "Luna's motivations can be analyzed...",
"assistant": "Luna is motivated by acceptance, self-expression, and personal growth.",
"template": "<|im_start|>user\n...<|im_end|>...",
"conversations": [
{
"content": "What motivates Luna?",
"role": "user"
},
{
"content": "Luna's motivations can be analyzed...",
"role": "reasoning"
},
{
"content": "Luna is motivated by acceptance, self-expression, and personal growth.",
"role": "assistant"
}
]
}
]
}
Structured Reasoning
Role Attribution
Template Management (using <|im_start|> and <|im_end|> tokens)
Error Handling
Async Operations
cargo add dset
The library uses the log crate for logging. To enable logging in your application:
Add a logging implementation like env_logger to your project:
cargo add env_logger
Initialize the logger in your application:
use env_logger;
fn main() {
env_logger::init();
// Your code here...
}
Set the log level using the RUST_LOG environment variable:
export RUST_LOG=info # Show info and error messages
export RUST_LOG=debug # Show debug, info, and error messages
export RUST_LOG=trace # Show all log messages
The library uses different log levels:
error: For unrecoverable errors
warn: For recoverable errors or unexpected conditions
info: For important operations and successful processing
debug: For detailed processing information
trace: For very detailed debugging information
process_safetensors_file(path: &Path) -> Result<()>
Processes a safetensors file by extracting its metadata and saving it as a JSON file.
Parameters:
path: Path to the safetensors file
Returns: Result indicating success or failure
Error Handling: Provides detailed context for failures including file opening issues, memory mapping errors, and metadata extraction failures
Performance: Uses memory mapping for efficient file access without loading the entire file into memory
Example:
process_safetensors_file(Path::new("model.safetensors")).await?;
// Creates model.safetensors.metadata.json
get_json_metadata(path: &Path) -> Result<Value>
Extracts and parses JSON metadata from a safetensors file.
Parameters:
path: Path to the safetensors file
Returns: The extracted metadata as a serde_json Value
Error Handling: Provides context for file opening, memory mapping, and JSON parsing errors
Performance: Uses memory mapping for efficient handling of large files
Example:
let metadata = get_json_metadata(Path::new("model.safetensors")).await?;
println!("Model metadata: {}", metadata);
decode_json_strings(value: Value) -> Value
Recursively decodes JSON-encoded strings within a serde_json::Value.
Parameters:
value: JSON value potentially containing encoded strings
Returns: Decoded JSON value with nested structures properly parsed
Example:
let raw_json = json!({"config": "{\"param\": 123}"});
let decoded = decode_json_strings(raw_json);
// Results in: {"config": {"param": 123}}
extract_training_metadata(raw_metadata: &Value) -> Value
Extracts and processes training metadata from raw safetensors metadata.
Parameters:
raw_metadata: Raw metadata from a safetensors file
Returns: Processed metadata with decoded JSON strings
Behavior: Checks the __metadata__ field first
Example:
let raw_meta = get_json_metadata(Path::new("model.safetensors")).await?;
let training_meta = extract_training_metadata(&raw_meta);
process_file(path: &Path) -> Result<()>
Processes a caption file in either JSON or plain text format.
Parameters:
path: Path to the caption file
Returns: Result indicating success or failure
Behavior: Auto-detects whether the file is JSON or plain text and processes it accordingly
Error Handling: Provides context for file I/O errors and JSON parsing failures
Example:
process_file(Path::new("caption.json")).await?;
json_to_text(json: &Value) -> Result<String>
Extracts caption text from a JSON value.
Parameters:
json: JSON value containing caption data
Returns: Extracted caption text
Error Handling: Returns error for unsupported JSON structures
Example:
let json = serde_json::from_str("{\"caption\": \"A beautiful landscape\"}")?;
let text = json_to_text(&json)?;
// text = "A beautiful landscape"
caption_file_exists_and_not_empty(path: &Path) -> bool
Checks if a caption file exists and has content.
Parameters:
path: Path to the caption file
Returns: Boolean indicating if the file exists and is not empty
Performance: Uses efficient file operations to avoid unnecessary reads
Example:
if caption_file_exists_and_not_empty(Path::new("caption.txt")).await {
println!("Caption file is valid");
}
process_e621_json_file(file_path: &Path, config: Option<E621Config>) -> Result<()>
Processes an e621 JSON file and creates a caption file.
Parameters:
file_path: Path to the e621 JSON file
config: Optional configuration for customizing processing
Returns: Result indicating success or failure
Behavior: Parses the post data, processes its tags, and writes a .txt caption file using the configured format
Error Handling: Provides context for file I/O errors and JSON parsing failures
Example:
let config = E621Config::new().with_filter_tags(true);
process_e621_json_file(Path::new("post.json"), Some(config)).await?;
process_e621_json_data(data: &Value, file_path: &Arc<PathBuf>, config: Option<E621Config>) -> Result<()>
Processes e621 JSON data and creates a caption file.
Parameters:
data: e621 JSON data as a serde_json Value
file_path: Path to the JSON file (used for output path calculation)
config: Optional configuration for customizing processing
Returns: Result indicating success or failure
Example:
let json_data = serde_json::from_str(json_str)?;
let path = Arc::new(PathBuf::from("post.json"));
process_e621_json_data(&json_data, &path, None).await?;
format_text_content(content: &str) -> Result<String>
Formats text content by normalizing whitespace.
Parameters:
content: Text content to format
Returns: Formatted text
Behavior: Trims leading and trailing whitespace and collapses runs of spaces and newlines into single spaces
Example:
let formatted = format_text_content(" Multiple spaces \n\n and newlines ")?;
// formatted = "Multiple spaces and newlines"
replace_string(path: &Path, search: &str, replace: &str) -> Result<()>
Replaces occurrences of a string in a file.
Parameters:
path: Path to the file
search: String to search for
replace: String to replace with
Returns: Result indicating success or failure
Performance: Reads the entire file into memory, so be cautious with very large files
Error Handling: Provides context for file I/O errors
Example:
replace_string(Path::new("caption.txt"), "old text", "new text").await?;
replace_special_chars(path: PathBuf) -> Result<()>
Replaces special characters in a file with standard ASCII equivalents.
Parameters:
path: Path to the file
Returns: Result indicating success or failure
Behavior: Converts characters such as smart quotes to their ASCII equivalents
Error Handling: Provides context for file I/O errors
Example:
replace_special_chars(PathBuf::from("document.txt")).await?;
should_ignore_e621_tag(tag: &str) -> bool
Determines if an e621 tag should be ignored.
Parameters:
tag: Tag to check
Returns: Boolean indicating if the tag should be ignored
Performance: Uses precompiled regex patterns for efficiency
Example:
if !should_ignore_e621_tag("2023") {
tags.push("2023");
}
process_e621_tags(tags_dict: &Value, config: Option<&E621Config>) -> Vec<String>
Processes e621 tags from a JSON dictionary.
Parameters:
tags_dict: JSON dictionary containing e621 tags
config: Optional configuration for customizing processing
Returns: Vector of processed tags
Example:
let tags = process_e621_tags(&tags_json, Some(&config));
ReasoningDataset::new() -> Self
Creates a new empty reasoning dataset.
Returns: Empty ReasoningDataset
Example:
let dataset = ReasoningDataset::new();
ReasoningDataset::load<P: AsRef<Path>>(path: P) -> Result<Self>
Loads a reasoning dataset from a JSON file.
Parameters:
path: Path to the JSON file
Returns: Loaded ReasoningDataset
Error Handling: Provides context for file I/O errors and JSON parsing failures
Example:
let dataset = ReasoningDataset::load("dataset.json").await?;
ReasoningDataset::save<P: AsRef<Path>>(&self, path: P) -> Result<()>
Saves the reasoning dataset to a JSON file.
Parameters:
path: Path to save the JSON file
Returns: Result indicating success or failure
Behavior: Creates a pretty-printed JSON file
Error Handling: Provides context for file I/O errors and JSON serialization failures
Example:
dataset.save("dataset.json").await?;
ReasoningDataset::add_entry(&mut self, entry: ReasoningEntry)
Adds a new entry to the dataset.
Parameters:
entry: ReasoningEntry to add
Example:
dataset.add_entry(entry);
ReasoningDataset::len(&self) -> usize
Returns the number of entries in the dataset.
Returns: Number of entries
Example:
let count = dataset.len();
ReasoningDataset::is_empty(&self) -> bool
Returns true if the dataset is empty.
Returns: Boolean indicating if the dataset is empty
Example:
if dataset.is_empty() {
println!("Dataset is empty");
}
ReasoningDataset::create_template(user: &str, reasoning: &str, assistant: &str) -> String
Creates a template string from user, reasoning, and assistant content.
Parameters:
user: User's question or request
reasoning: Detailed reasoning steps
assistant: Assistant's response
Returns: Formatted template string
Behavior: Creates a template with <|im_start|> and <|im_end|> tokens
Example:
let template = ReasoningDataset::create_template(
"What is X?",
"X can be determined by...",
"X is Y"
);
split_content(content: &str) -> (Vec<String>, String)
Splits content into tags and sentences.
Parameters:
content: Text content to split
Returns: Tuple of (tags vector, sentences string)
Example:
let (tags, text) = split_content("tag1, tag2, tag3., This is the main text.");
// tags = ["tag1", "tag2", "tag3"]
// text = "This is the main text."
use dset::{process_safetensors_file, get_json_metadata};
use std::path::Path;
use anyhow::Result;
async fn extract_metadata(path: &str) -> Result<()> {
// Extracts metadata and saves it as a JSON file
process_safetensors_file(Path::new(path)).await?;
// The output will be saved as "{path}.json"
// Alternatively, get the metadata directly
let metadata = get_json_metadata(Path::new(path)).await?;
println!("Model metadata: {}", metadata);
Ok(())
}
use dset::{
caption::process_file,
caption::process_json_to_caption,
caption::caption_file_exists_and_not_empty
};
use std::path::Path;
use anyhow::Result;
async fn handle_captions() -> Result<()> {
let path = Path::new("image1.txt");
// Check if caption file exists and has content
if caption_file_exists_and_not_empty(&path).await {
// Process the caption file (auto-detects format)
process_file(&path).await?;
}
// Convert JSON caption to text format
process_json_to_caption(Path::new("image2.json")).await?;
Ok(())
}
use dset::rename_file_without_image_extension;
use std::path::Path;
use std::io;
async fn handle_files() -> io::Result<()> {
// Remove intermediate image extensions from files
let path = Path::new("image.jpg.toml");
rename_file_without_image_extension(&path).await?; // Will rename to "image.toml"
// Won't modify files that are actually images
let img = Path::new("photo.jpg");
rename_file_without_image_extension(&img).await?; // Will remain "photo.jpg"
Ok(())
}
The library provides two main types of JSON processing capabilities besides the e621 caption processing:
Converts JSON files containing tag-probability pairs into caption files. Tags with probabilities above 0.2 are included in the output.
{
"tag1": 0.9,
"tag2": 0.5,
"tag3": 0.1
}
The above JSON would be converted to a caption file containing:
tag1, tag2
Example usage:
use dset::process_json_to_caption;
use std::path::Path;
use anyhow::Result;
async fn process_tags() -> Result<()> {
// Process a JSON file containing tag probabilities
// Input: tags.json
// {
// "person": 0.98,
// "smiling": 0.85,
// "outdoor": 0.45,
// "blurry": 0.15
// }
//
// Output: tags.txt
// person, smiling, outdoor
process_json_to_caption(Path::new("tags.json")).await?;
Ok(())
}
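The threshold rule described above (keep tags with probability above 0.2) can be sketched as a minimal re-implementation. This is an illustration, not dset's actual code, and the library's output ordering may differ from the alphabetical order used here:

```rust
use std::collections::BTreeMap;

// Minimal sketch of the tag-probability rule: keep tags whose
// probability exceeds 0.2, then join them with ", ".
// BTreeMap gives deterministic (alphabetical) iteration order.
fn tags_to_caption(tags: &BTreeMap<String, f64>) -> String {
    tags.iter()
        .filter(|&(_, &p)| p > 0.2)
        .map(|(tag, _)| tag.as_str())
        .collect::<Vec<_>>()
        .join(", ")
}

fn main() {
    let tags = BTreeMap::from([
        ("person".to_string(), 0.98),
        ("smiling".to_string(), 0.85),
        ("outdoor".to_string(), 0.45),
        ("blurry".to_string(), 0.15),
    ]);
    // "blurry" (0.15) is filtered out by the 0.2 threshold
    println!("{}", tags_to_caption(&tags)); // outdoor, person, smiling
}
```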
The function handles errors gracefully and processes files asynchronously.
The library provides two functions for general JSON handling:
format_json_file: Pretty prints any JSON file with proper indentation
process_json_file: Allows custom processing of JSON data with an async handler
Example usage:
use dset::{format_json_file, process_json_file};
use std::path::{Path, PathBuf};
use serde_json::Value;
use anyhow::Result;
async fn handle_json() -> Result<()> {
// Format a JSON file
format_json_file(Path::new("data.json").to_path_buf()).await?;
// Process JSON with custom handler
process_json_file(Path::new("data.json"), |json: &Value| async {
println!("Processing: {}", json);
Ok(())
}).await?;
Ok(())
}
Both functions handle errors gracefully and provide async processing capabilities.
use dset::split_content;
use log::info;
fn process_tags_and_text() {
let content = "tag1, tag2, tag3., This is the main text.";
let (tags, sentences) = split_content(content);
info!("Tags: {:?}", tags); // ["tag1", "tag2", "tag3"]
info!("Text: {}", sentences); // "This is the main text."
}
use dset::caption::{format_text_content, replace_string, replace_special_chars};
use std::path::{Path, PathBuf};
use anyhow::Result;
use log::info;
async fn example() -> Result<()> {
// Format text by normalizing whitespace
let formatted = format_text_content(" Multiple spaces \n\n and newlines ")?;
assert_eq!(formatted, "Multiple spaces and newlines");
// Replace text in a file
replace_string(Path::new("caption.txt"), "old text", "new text").await?;
// Replace special characters in a file (smart quotes, etc.)
replace_special_chars(PathBuf::from("document.txt")).await?;
Ok(())
}
The library uses anyhow for comprehensive error handling:
use dset::process_safetensors_file;
use std::path::Path;
use anyhow::{Context, Result};
use log::info;
async fn example() -> Result<()> {
process_safetensors_file(Path::new("model.safetensors"))
.await
.context("Failed to process safetensors file")?;
info!("Successfully processed safetensors file");
Ok(())
}
The concat module provides utilities for combining files with different extensions, which is particularly useful for dataset preparation. It supports concatenating tag files, caption files, and other auxiliary files into a single output file with intelligent tag deduplication.
Base files are matched by image extensions (.jpg, .png); auxiliary files use extensions such as .caption, .wd, and .tags.
The module uses two main types to configure concatenation:
Predefined configurations for common use cases:
pub enum FileExtensionPreset {
/// Concatenates .caption, .wd, .tags files into .txt
CaptionWdTags,
/// Concatenates .florence, .wd, .tags files
FlorenceWdTags,
}
Custom configuration for file concatenation:
pub struct ConcatConfig {
/// Base file extensions to find (without the dot)
pub base_extensions: Vec<String>,
/// Extensions to concatenate (without the dot)
pub extensions_to_concat: Vec<String>,
/// Output file extension (without the dot)
pub output_extension: String,
/// Set to true to remove duplicate tags
pub remove_duplicates: bool,
/// Tag separator to use when concatenating
pub tag_separator: String,
/// Set to true to deduplicate files with identical content
pub deduplicate_files: bool,
}
The concatenation process scans a directory for base image files (.png, .jpg), locates the matching auxiliary files, and merges their contents into a single output file.
The concat_tags function specifically handles parsing, deduplicating, and joining the tags with the configured separator.
use dset::concat::{ConcatConfig, FileExtensionPreset, concat_files};
use std::path::Path;
use anyhow::Result;
async fn concat_with_preset() -> Result<()> {
// Use a predefined preset for common configurations
let config = ConcatConfig::from_preset(FileExtensionPreset::CaptionWdTags);
// Process files in the specified directory
let processed_count = concat_files(Path::new("./dataset"), &config, false).await?;
println!("Processed {} files", processed_count);
Ok(())
}
use dset::concat::{ConcatConfig, concat_files};
use std::path::Path;
use anyhow::Result;
async fn concat_with_deduplication() -> Result<()> {
// Create a custom configuration with deduplication enabled
let config = ConcatConfig::new(
// Base extensions to look for
vec!["png".into(), "jpg".into()],
// Extensions to concatenate
vec!["caption".into(), "wd".into(), "tags".into()],
// Output extension
"txt".into(),
// Remove duplicate tags
true,
// Tag separator
", ".into(),
).with_deduplication(true); // Enable file deduplication
// Process files with dry run first (preview only)
let would_process = concat_files(Path::new("./dataset"), &config, true).await?;
println!("Would process {} files", would_process);
// Process files for real
let processed = concat_files(Path::new("./dataset"), &config, false).await?;
println!("Actually processed {} files", processed);
println!("Skipped {} duplicates", would_process - processed);
Ok(())
}
use dset::concat::{ConcatConfig, FileExtensionPreset, process_image_file};
use std::path::Path;
use anyhow::Result;
async fn process_single_file() -> Result<()> {
// Create configuration with the CaptionWdTags preset
let config = ConcatConfig::from_preset(FileExtensionPreset::CaptionWdTags);
// Process a single file
let success = process_image_file(
Path::new("./dataset/image.jpg"),
&config,
false // Not a dry run
).await?;
if success {
println!("Successfully processed image.jpg");
} else {
println!("Failed to process image.jpg (likely missing required files)");
}
Ok(())
}
use dset::concat::{ConcatConfig, check_duplicate_content};
use std::path::Path;
use std::sync::Arc;
use tokio::sync::Mutex;
use std::collections::HashMap;
use anyhow::Result;
async fn check_for_duplicates() -> Result<()> {
// Create configuration
let config = ConcatConfig::from_preset(FileExtensionPreset::CaptionWdTags)
.with_deduplication(true);
// Set up the deduplication hash table
let content_hashes = Arc::new(Mutex::new(HashMap::new()));
// Check if a file is a duplicate
let file_path = Path::new("./dataset/image1.jpg");
let is_duplicate = check_duplicate_content(
&file_path,
&config,
content_hashes.clone()
).await;
if is_duplicate {
println!("File is a duplicate of a previously processed file");
} else {
println!("File is unique");
}
Ok(())
}
The default output format is:
tag1, tag2, tag3, tag4, caption_content
Where the tags are the merged contents of the tag files (sorted, and deduplicated when remove_duplicates is set) and caption_content is the text from the caption file.
For example, with these input files:
image.jpg - The base image file
image.caption - Contains "a photo of a person"
image.wd - Contains "masterpiece, digital art"
image.tags - Contains "tag1, tag2, tag3"
The output image.txt would contain:
digital art, masterpiece, tag1, tag2, tag3, a photo of a person
Note that tags from the WebUI description file (.wd) are alphabetically sorted.
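Assuming sorting and deduplication behave as described above, the merge can be sketched like this. The sketch reproduces the example output but is not the library's implementation:

```rust
// Rough sketch of the tag merge described above: split the tag files,
// sort and optionally dedupe the tags, then append the caption text.
// Not dset's actual code.
fn merge_tags(tag_files: &[&str], caption: &str, remove_duplicates: bool, sep: &str) -> String {
    let mut tags: Vec<String> = tag_files
        .iter()
        .flat_map(|s| s.split(','))
        .map(|t| t.trim().to_string())
        .filter(|t| !t.is_empty())
        .collect();
    tags.sort();
    if remove_duplicates {
        tags.dedup(); // safe because the list is sorted
    }
    let mut out = tags.join(sep);
    if !caption.is_empty() {
        out.push_str(sep);
        out.push_str(caption);
    }
    out
}

fn main() {
    let out = merge_tags(
        &["masterpiece, digital art", "tag1, tag2, tag3"],
        "a photo of a person",
        true,
        ", ",
    );
    println!("{out}"); // digital art, masterpiece, tag1, tag2, tag3, a photo of a person
}
```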
Contributions are welcome! Please feel free to submit a Pull Request. When contributing:
This project is licensed under the MIT License.