rexturl

Crates.io: rexturl
lib.rs: rexturl
version: 0.4.1
created_at: 2022-09-16 13:09:55.671937+00
updated_at: 2025-08-21 05:47:23.104121+00
description: A simple tool to split urls in their protocol, host, port, path and query parts.
homepage: https://github.com/vschwaberow/rexturl.git
repository: https://github.com/vschwaberow/rexturl.git
id: 667341
size: 366,557
owner: Volker Schwaberow (vschwaberow)

README

rexturl


A command-line tool for parsing and manipulating URLs with predictable output formats.

Key Features

Clean UX Design

  • One flag controls format: --format {plain,tsv,csv,json,jsonl,custom,sql}
  • Precise field selection: --fields domain,path,url
  • Custom templates: --template '{scheme}://{domain}{path}'
  • SQL generation: Multi-dialect INSERT statements with proper escaping
  • Consistent output: Same field order across all formats
  • Machine-friendly: Proper headers, null handling, exit codes

Technical Implementation

  • Custom URL parser with optimized component extraction
  • Zero-copy parsing with minimal allocations
  • Parallel processing for bulk operations
  • Multi-part TLD support (co.uk, com.au, etc.)
  • Template engine with conditional logic and escaping modes
  • SQL generation with dialect-specific type mapping

Processing Features

  • Field extraction: scheme, username, host, domain, subdomain, port, path, query, fragment
  • Data processing: Sort, deduplicate, filter
  • Input flexibility: Command line args or stdin

Installation

cargo install rexturl

or clone the repository and build from source:

git clone https://github.com/vschwaberow/rexturl.git
cd rexturl
cargo build --release

Quick Start

Extract domain from URL:

rexturl --urls "https://www.example.com/path" --fields domain
# Output: example.com

TSV format with headers:

echo "https://blog.example.co.uk/posts" | rexturl --fields subdomain,domain,path --format tsv --header
# Output:
# subdomain    domain          path
# blog         example.co.uk   /posts

JSON output for APIs:

curl -s api.com/urls | rexturl --fields domain --format json
# Output: {"urls":[{"domain":"api.com"}]}

Usage

rexturl [OPTIONS]

Input Methods

  • --urls <URLS> - Specify URLs as command-line arguments
  • stdin - Pipe URLs from other commands (default if no --urls)
  • Supports single or multiple URLs

Options

Core Options

Option | Values | Description
--format | plain, tsv, csv, json, jsonl, custom, sql | Output format (default: plain)
--fields | domain,path,url | Comma-separated fields to extract
--urls | URL strings | Input URLs to process
--header | - | Include header row for tabular formats
--sort | - | Sort output by first field
--unique | - | Remove duplicate entries

Available Fields

Field | Description | Example
url | Original URL string | https://www.example.com/path
scheme | Protocol | https
username | Username portion | user
host / hostname | Full hostname | www.example.com
subdomain | Subdomain only | www
domain | Registrable domain | example.com
port | Port number | 8080
path | URL path | /path
query | Query parameters | q=search
fragment | Fragment identifier | section

Advanced Options

Option | Values | Description
--pretty | - | Pretty-print JSON output
--strict | - | Exit code 2 if any URL fails to parse
--no-newline | - | Suppress trailing newline
--null-empty | Custom string | Value for missing fields (default: \N)
--color | auto, never, always | Colored output for plain format

Custom Format Options

Option | Values | Description
--template | Template string | Custom format template (e.g., '{scheme}://{domain}{path}')
--escape | none, shell, csv, json, sql | Escaping mode for custom format

SQL Output Options

Option | Values | Description
--sql-table | Table name | SQL table name (default: urls)
--sql-create-table | - | Include CREATE TABLE statement
--sql-dialect | postgres, mysql, sqlite, generic | SQL dialect for type mapping

Legacy Field Flags (Still Supported)

These flags automatically add the corresponding field to the output; use --fields for explicit control:

Flag | Equivalent | Description
--domain | --fields domain | Extract domain
--host | --fields subdomain | Extract subdomain
--scheme | --fields scheme | Extract scheme
--path | --fields path | Extract path

Deprecated Options

Option | Use Instead | Description
--json | --format json | JSON output (deprecated)
--all | --fields with specific names | All fields (deprecated)
--custom | --format and --fields | Custom format (deprecated)

Examples

Most Common Use Cases

1. Extract domains for analysis:

cat urls.txt | rexturl --fields domain --sort --unique
# Clean list of unique domains

2. Create a spreadsheet-ready CSV:

rexturl --urls "https://api.example.com/v1/users" --fields subdomain,domain,path --format csv --header
# subdomain,domain,path
# api,example.com,/v1/users

3. JSON for APIs and scripts:

curl -s api.com/endpoints | rexturl --fields domain,path --format json
# {"urls":[{"domain":"api.com","path":"/endpoints"}]}

All Format Examples

Plain (default):

rexturl --urls "https://blog.example.com/posts" --fields subdomain,domain,path
# blog example.com /posts

TSV with header:

echo "https://api.example.com/v1" | rexturl --fields subdomain,domain,path --format tsv --header
# subdomain    domain        path
# api          example.com   /v1

CSV for spreadsheets:

rexturl --fields url,domain --format csv --header < urls.txt
# url,domain
# https://www.example.com,example.com

JSON for APIs:

echo "https://api.example.com" | rexturl --fields domain,path --format json --pretty
# {
#   "urls": [
#     {
#       "domain": "example.com", 
#       "path": "/"
#     }
#   ]
# }

JSONL for streaming:

cat large-urls.txt | rexturl --fields domain --format jsonl | head -3
# {"domain":"example.com"}
# {"domain":"api.com"}  
# {"domain":"blog.net"}

Custom format with templates:

rexturl --urls "https://api.example.com/v1/users" --format custom --template "{scheme}://{domain}{path}"
# https://example.com/v1/users

SQL INSERT statements:

rexturl --urls "https://www.example.com/path" --format sql --fields domain,path
# INSERT INTO urls (domain, path) VALUES ('example.com', '/path');

Advanced Examples

Multi-part TLD handling:

rexturl --urls "https://blog.example.co.uk/posts" --fields subdomain,domain,path --format tsv
# blog    example.co.uk    /posts

Handle missing values:

echo "https://example.com" | rexturl --fields domain,port --format tsv --null-empty "N/A"
# example.com    N/A

Error handling with strict mode:

rexturl --urls "not-a-url" --strict --fields domain
# Error: Failed to parse URL: not-a-url
# Exit code: 2

Legacy syntax (still works):

rexturl --urls "https://www.example.com" --domain --path
# example.com /

Domain and Subdomain Extraction

rexturl includes intelligent handling for domains and subdomains:

  • Multi-part TLD Support: Automatically detects complex TLDs like co.uk, org.uk, com.au, etc.
  • Domain Extraction: The --domain flag extracts the registrable domain name
  • Subdomain Extraction: When using --host alone, it extracts the subdomain portion
  • Smart Detection: Handles edge cases with nested subdomains and international domains

Supported multi-part TLDs include: co.uk, org.uk, ac.uk, gov.uk, me.uk, net.uk, sch.uk, com.au, net.au, org.au, edu.au, gov.au, co.nz, net.nz, org.nz, govt.nz, co.za, org.za, com.br, net.br, org.br, co.jp, com.mx, com.ar, com.sg, com.my, co.id, com.hk, co.th, in.th
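
The lookup can be pictured with a small Rust sketch: check whether the last two labels of the hostname form a known multi-part TLD, and size the registrable domain accordingly. This is an illustrative simplification with hypothetical names (MULTI_PART_TLDS, split_host), not the crate's actual implementation.

// Illustrative sketch only; not rexturl's actual code.
const MULTI_PART_TLDS: &[&str] = &["co.uk", "org.uk", "com.au", "co.nz", "co.jp"];

/// Split a hostname into (subdomain, registrable domain).
fn split_host(host: &str) -> (Option<String>, String) {
    let labels: Vec<&str> = host.split('.').collect();

    // A multi-part TLD takes two trailing labels, so the registrable domain
    // spans the last three labels; otherwise it spans the last two.
    let domain_labels = if labels.len() >= 3
        && MULTI_PART_TLDS.contains(&labels[labels.len() - 2..].join(".").as_str())
    {
        3
    } else {
        2
    };

    if labels.len() <= domain_labels {
        return (None, host.to_string());
    }
    let split = labels.len() - domain_labels;
    (Some(labels[..split].join(".")), labels[split..].join("."))
}

fn main() {
    let (sub, dom) = split_host("blog.example.co.uk");
    println!("{:?} {}", sub, dom); // Some("blog") example.co.uk
}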

Examples:

# Using custom format for specific extraction
echo "https://blog.example.co.uk/posts" | rexturl --format custom --template "Subdomain: {subdomain}, Domain: {domain}"
# Output: Subdomain: blog, Domain: example.co.uk

# Extract all components (tab-separated format)
rexturl --urls "https://user@blog.example.co.uk:8080/posts?q=test#frag" --fields scheme,username,hostname,port,path,query,fragment,domain --format tsv
# Output: https	user	blog.example.co.uk	8080	/posts	q=test	frag	example.co.uk

# Extract components with URLs flag
rexturl --urls "https://blog.example.co.uk/posts" --fields domain
# Output: example.co.uk

Custom Templates

Template Syntax

Use --format custom --template for flexible output formatting:

Basic fields:

  • {field} - Insert field value or empty string if missing
  • {field:default} - Insert field value or default if missing
  • {field?text} - Insert text only if field has a value
  • {field!text} - Insert text only if field is missing

Available fields:

  • {scheme} - URL scheme (http, https, etc.)
  • {username} - Username portion of the URL
  • {host} - Full hostname
  • {hostname} - Alias for host
  • {subdomain} - Subdomain portion (e.g., "www" in www.example.com)
  • {domain} - Domain name (e.g., "example.com")
  • {port} - Port number
  • {path} - URL path
  • {query} - Query string (without the leading ?)
  • {fragment} - Fragment identifier (without the leading #)

Escaping modes:

  • --escape none - No escaping (default)
  • --escape shell - Shell-safe quoting
  • --escape csv - CSV-compatible escaping
  • --escape json - JSON string escaping
  • --escape sql - SQL value escaping

Template Examples

# Basic template
rexturl --urls "https://example.com/api" --format custom --template "Host: {host}, Path: {path}"
# Output: Host: example.com, Path: /api

# With defaults
rexturl --urls "https://example.com" --format custom --template "{scheme}://{domain}:{port:80}"
# Output: https://example.com:80

# Conditional text
rexturl --urls "https://example.com/path?q=test" --format custom --template "{domain}{query?&found}"
# Output: example.com&found

# Shell escaping
rexturl --urls "https://example.com/path with spaces" --format custom --template "{url}" --escape shell
# Output: 'https://example.com/path with spaces'
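
The placeholder forms used above ({field}, {field:default}, {field?text}, {field!text}) can be summarized in a small Rust resolver. This is only a sketch of the documented semantics with an assumed field-lookup signature; it is not the crate's template engine.

// Illustrative sketch of the placeholder semantics described above;
// not rexturl's actual template engine.
fn resolve_placeholder(body: &str, value: Option<&str>) -> String {
    if let Some(idx) = body.find(':') {
        // {field:default} -> the value, or the default when the field is missing
        let default = &body[idx + 1..];
        return value.unwrap_or(default).to_string();
    }
    if let Some(idx) = body.find('?') {
        // {field?text} -> text only when the field has a value
        return if value.is_some() { body[idx + 1..].to_string() } else { String::new() };
    }
    if let Some(idx) = body.find('!') {
        // {field!text} -> text only when the field is missing
        return if value.is_none() { body[idx + 1..].to_string() } else { String::new() };
    }
    // {field} -> the value, or an empty string
    value.unwrap_or("").to_string()
}

fn main() {
    // Mirrors the "{scheme}://{domain}:{port:80}" example above.
    assert_eq!(resolve_placeholder("port:80", None), "80");
    // Mirrors the "{domain}{query?&found}" example above.
    assert_eq!(resolve_placeholder("query?&found", Some("q=test")), "&found");
    assert_eq!(resolve_placeholder("fragment!none", None), "none");
    println!("placeholder semantics check passed");
}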

SQL Output

Generate SQL INSERT statements from URL data:

# Basic SQL output
rexturl --urls "https://www.example.com/path" --format sql --fields domain,path
# INSERT INTO urls (domain, path) VALUES ('example.com', '/path');

# With CREATE TABLE
rexturl --urls "https://example.com" --format sql --fields domain --sql-create-table
# CREATE TABLE IF NOT EXISTS urls (
#     id SERIAL PRIMARY KEY,
#     domain VARCHAR(253),
#     created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
# );
# INSERT INTO urls (domain) VALUES ('example.com');

# Custom table and dialect
rexturl --urls "https://example.com:3306" --format sql --fields domain,port --sql-table my_urls --sql-dialect mysql
# INSERT INTO my_urls (domain, port) VALUES ('example.com', '3306');
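
The escaping behind these statements can be sketched as follows. The quote-doubling rule and the helper names are illustrative assumptions; the crate's dialect-specific type mapping is more involved than this.

// Illustrative sketch of SQL INSERT generation with basic value escaping;
// not rexturl's actual implementation.
fn sql_quote(value: &str) -> String {
    // Standard SQL string escaping: double any embedded single quote.
    format!("'{}'", value.replace('\'', "''"))
}

fn insert_statement(table: &str, fields: &[&str], values: &[&str]) -> String {
    let cols = fields.join(", ");
    let vals = values.iter().map(|v| sql_quote(v)).collect::<Vec<_>>().join(", ");
    format!("INSERT INTO {} ({}) VALUES ({});", table, cols, vals)
}

fn main() {
    let stmt = insert_statement("urls", &["domain", "path"], &["example.com", "/path"]);
    println!("{}", stmt);
    // INSERT INTO urls (domain, path) VALUES ('example.com', '/path');
}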

Performance & Architecture

URL Parser Implementation

  • Custom URL parser with optimized component extraction
  • Zero-copy parsing with minimal memory allocations
  • Parallel processing using Rayon for bulk operations

Architecture

  • Unified data model: Single UrlRecord struct for all formats
  • Template engine: Flexible custom formatting with conditional logic
  • SQL generation: Multi-dialect support with proper type mapping
  • Predictable output: Same field order across all formats
  • Proper error handling: Exit codes and stderr for failures
  • Streaming support: Memory-efficient for large datasets
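
As a rough mental model of the points above, the sketch below shows a single record type feeding all formatters, with Rayon parallelizing the bulk parse. The struct fields, the parser stub, and the rayon dependency are assumptions for illustration, not rexturl's actual API.

// Illustrative sketch of the unified-record, parallel-processing idea;
// field names and helpers are assumptions, not rexturl's actual API.
use rayon::prelude::*;

#[derive(Debug)]
struct UrlRecord {
    scheme: Option<String>,
    host: Option<String>,
    path: Option<String>,
}

// Hypothetical parser stub standing in for the crate's custom URL parser.
fn parse_record(url: &str) -> Option<UrlRecord> {
    let (scheme, rest) = url.split_once("://")?;
    let (host, path) = rest.split_once('/').unwrap_or((rest, ""));
    Some(UrlRecord {
        scheme: Some(scheme.to_string()),
        host: Some(host.to_string()),
        path: Some(format!("/{}", path)),
    })
}

fn main() {
    let urls = vec![
        "https://www.example.com/path",
        "https://blog.example.co.uk/posts",
    ];

    // Parse the batch in parallel; every formatter then consumes the same records.
    let records: Vec<UrlRecord> = urls
        .par_iter()
        .filter_map(|u| parse_record(u))
        .collect();

    for r in &records {
        println!("{:?}", r);
    }
}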

Benchmarks

cargo bench
# fast_url_parsing        time:   [823.79 ns 827.53 ns 831.87 ns]
# fast_url_component_access time: [69.100 ns 69.309 ns 69.527 ns]

Technical Details

  • Modular design: Separate parsing, formatting, and domain intelligence
  • Multi-part TLD support: Handles complex domains like example.co.uk
  • Memory efficient: <1KB overhead per URL

Changelog

For a detailed list of changes and version history, see CHANGELOG.md.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes with proper tests
  4. Ensure all tests pass (cargo test)
  5. Run formatting and linting (cargo fmt && cargo clippy)
  6. Commit your changes (git commit -m 'Add some amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
