Crates.io | url-cleaner |
lib.rs | url-cleaner |
created_at | 2024-04-24 17:34:47.253328+00 |
updated_at | 2025-03-09 11:24:16.779149+00 |
description | A CLI tool and library for URL manipulation with a focus on stripping tracking garbage. |
repository | https://github.com/Scripter17/url-cleaner |
Websites often put unique identifiers into URLs so that when you send a link to a friend and they open it, the website knows it was you who sent it to them.
As most people do not understand and therefore cannot consent to this, it is polite to remove the maltext before sending URLs to people.
URL Cleaner is an extremely versatile tool designed to make this process as comprehensive, easy, and fast as possible.
The main privacy concern when using URL Cleaner for day-to-day activities is that, unless the `no-network` flag is set, URL Cleaner expands redirects/shortlinks.
For example, passing a bit.ly link to URL Cleaner effectively clicks on that link: an HTTP request is sent to bit.ly.
While the default config removes as much tracking stuff as possible before sending the request, some redirect sites may merge the sender and the destination information into the same part of the URL.
For example, if Alice and Bob share the same social media post with you, the social media may give Alice the URL https://example.com/share/1234 but give Bob the URL https://example.com/share/5678.
In this case, it's impossible (or extremely difficult to find a way) to expand either link without telling the social media who you got the URL from.
In general it's impossible to prove that a redirect website doesn't merge sender and destination information, so one should always assume it does.
If you consider this a problem, please use the `no-network` flag.
Some redirect websites will still be handled, but only because they can be expanded entirely offline.
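To illustrate the mechanism (this is generic HTTP behaviour, not URL Cleaner's internals): expanding a shortlink means asking the redirect site where it points, which necessarily hands that site the identifier embedded in the link.

```bash
# Following the hypothetical share link from the example above requires
# contacting the site, which receives the "1234" identifier in the process.
curl -sI "https://example.com/share/1234" | grep -i '^location:'
```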
The lesser privacy concern is that the default config makes no attempt to hide from websites that you (or the person sending you a link) use URL Cleaner.
For example, Amazon product listings are shortened from a paragraph of crap to just https://amazon.com/dp/PRODUCT-ID.
In the past (and possibly the future), extreme cases of this were gated behind a `minimize` flag, and the default would only try to remove tracking stuff.
It was made the default because I consider the benefit of blending in with other URL cleaning programs extremely slim.
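As a sketch of what that looks like (the input and output here are illustrative, following the https://amazon.com/dp/PRODUCT-ID shape described above; the exact result depends on the current default config):

```bash
url-cleaner "https://amazon.com/Some-Product-Name/dp/PRODUCT-ID/ref=sr_1_5?crid=ABC123&keywords=example"
# Illustrative output:
# https://amazon.com/dp/PRODUCT-ID
```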
These packages are required on Kubuntu 24.04 (and therefore probably all Debian-based distros):

- `libssl-dev` for the `http` feature flag.
- `libsqlite3-dev` for the `caching` feature flag.

There are likely plenty more dependencies that various Linux distros may or may not pre-install.
If you can't compile it I'll try to help you out, and if you can make it work on your own please let me know so I can add to this list.
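On Debian-based systems, installing the packages listed above amounts to something like the following (a sketch; the Rust toolchain itself is assumed to already be installed):

```bash
sudo apt install libssl-dev libsqlite3-dev
```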
By default, compiling URL Cleaner includes the `default-config.json` file in the binary. Because of this, URL Cleaner can be used simply with `url-cleaner "https://example.com/of?a=dirty#url"`.
Additionally, URL Cleaner can take jobs from STDIN lines. `cat urls.txt | url-cleaner` works by printing each result on the same line as its input.
See Parsing output for details on the output format, and yes, JSON output is supported.
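For example:

```bash
# Clean a single URL passed as an argument
url-cleaner "https://example.com/of?a=dirty#url"

# Clean a file of URLs; each result is printed on the same line as its input
cat urls.txt | url-cleaner
```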
The default config is intended to always obey the following rules:

- See the `unmangle` flag for details.
- The `command` and `custom` features, as well as any features starting with `debug` or `experiment`, are never expected to be enabled. The `command` feature is enabled by default for convenience, but should be disabled in situations where untrusted/user-provided configs have a chance to be run.

Currently no guarantees are made, though when the above rules are broken it is considered a bug and I'd appreciate being told about it.
Additionally, these rules may be changed at any time for any reason, usually just for clarification.
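For example, a sketch of installing without the `command` feature (this assumes the feature names match the ones described in this README and that `http` and `caching` are the other defaults you want to keep; check the crate's Cargo.toml for the authoritative feature list):

```bash
cargo install url-cleaner --no-default-features --features http,caching
```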
Flags:

- `breezewiki`: Replace fandom/known Breezewiki hosts with the `breezewiki-host` variable.
- `unbreezewiki`: Replace Breezewiki hosts with fandom.com.
- `nitter`: Replace twitter/known Nitter hosts with the `nitter-host` variable.
- `unnitter`: Replace Nitter hosts with x.com.
- `invidious`: Replace youtube/known Invidious hosts with the `invidious-host` variable.
- `uninvidious`: Replace Invidious hosts with youtube.com.
- `embed-compatibility`: Sets the domain of twitter domains (and supported twitter redirects like `vxtwitter.com`) to the variable `twitter-embed-host` and `bsky.app` to the variable `bsky-embed-host`.
- `discord-unexternal`: Replace `images-ext-1.discordapp.net` with the original images they refer to.
- `assume-1-dot-2-is-redirect`: Treat all hosts that match the Regex `^.\...$` as redirects. Let's be real, they all are.
- `bypass.vip`: Use bypass.vip to expand linkvertise and some other links.
- `no-https-upgrade`: Disable replacing `http://` with `https://`.
- `no-network`: Don't make any HTTP requests.
- `no-unmangle-host-is-http-or-https`: Don't convert `https://https//example.com/abc` to `https://example.com/abc`.
- `no-unmangle-path-is-url`: Don't convert `https://example1.com/https://example2.com/user` to `https://example2.com/user`.
- `no-unmangle-path-is-url-encoded-url`: Don't convert `https://example.com/https%3A%2F%2Fexample.com%2Fuser` to `https://example.com/user`.
- `no-unmangle-second-path-segment-is-url`: Don't convert `https://example1.com/profile/https://example2.com/profile/user` to `https://example2.com/profile/user`.
- `no-unmangle-subdomain-ends-in-reg-domain`: Don't convert `https://profile.example.com.example.com` to `https://profile.example.com`.
- `no-unmangle-subdomain-starting-with-www-segment`: Don't convert `https://www.username.example.com` to `https://username.example.com`.
- `no-unmangle-twitter-first-path-segment-is-twitter-domain`: If a twitter domain's first path segment is a twitter domain, don't remove it.
- `onion-location`: Replace hosts with results from the `Onion-Location` HTTP header if present. This makes an HTTP request one time per domain and caches it.
- `tor2web`: Append the suffix specified by the `tor2web-suffix` variable to `.onion` domains.
- `tor2web2tor`: Replace `**.onion.**` domains with `**.onion` domains.
- `tumblr-unsubdomain-blog`: Changes `blog.tumblr.com` URLs to `tumblr.com/blog` URLs. Doesn't move `at` or `www` subdomains.
- `unmangle`: "Unmangle" certain "invalid but I know what you mean" URLs. Should not be used with untrusted URLs, as malicious actors can use this to sneak malicious URLs past, for example, email spam filters.
- `unmobile`: Convert `https://m.example.com`, `https://mobile.example.com`, `https://abc.m.example.com`, and `https://abc.mobile.example.com` into `https://example.com` and `https://abc.example.com`.
- `youtube-unlive`: Turns `https://youtube.com/live/abc` into `https://youtube.com/watch?v=abc`.
- `youtube-unplaylist`: Removes the `list` query parameter from `https://youtube.com/watch` URLs.
- `youtube-unshort`: Turns `https://youtube.com/shorts/abc` into `https://youtube.com/watch?v=abc`.
- `youtube-unembed`: Turns `https://youtube.com/embed/abc` into `https://youtube.com/watch?v=abc`.
- `remove-unused-search-query`: Remove search queries from URLs that aren't search results (for example, posts).
- `instagram-unprofilecard`: Turns `https://instagram.com/username/profilecard` into `https://instagram.com/username`.
- `keep-lang`: Keeps language query parameters.
Variables:

- `breezewiki-host`: The domain to replace fandom/Breezewiki domains with when the `breezewiki` flag is enabled.
- `nitter-host`: The domain to replace twitter/nitter domains with when the `nitter` flag is enabled.
- `invidious-host`: The domain to replace youtube/Invidious domains with when the `invidious` flag is enabled.
- `twitter-embed-host`: The domain to use for twitter when the `embed-compatibility` flag is set. Defaults to `vxtwitter.com`.
- `bluesky-embed-host`: The domain to use for bluesky when the `embed-compatibility` flag is set. Defaults to `fxbsky.com`.
- `bypass.vip-api-key`: The API key used for bypass.vip's premium backend. Overrides the `URL_CLEANER_BYPASS_VIP_API_KEY` environment variable.
- `tor2web-suffix`: The suffix to append to the end of `.onion` domains if the `tor2web` flag is enabled. Should not start with `.` as that's added automatically. Left unset by default.

Environment variables:

- `URL_CLEANER_BYPASS_VIP_API_KEY`: The API key used for bypass.vip's premium backend. Can be overridden with the `bypass.vip-api-key` variable.
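For example, a sketch of supplying the key via the environment (the key value is a placeholder):

```bash
export URL_CLEANER_BYPASS_VIP_API_KEY="YOUR-API-KEY"
url-cleaner "https://example.com/some/shortlink"
```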
Sets and lists:

- `bypass.vip-host-without-www-dot-prefixes`: The `HostWithoutWWWDotPrefix`es of websites bypass.vip can expand.
- `email-link-format-1-hosts`: (TEMPORARY NAME) Hosts that use unknown link format 1.
- `https-upgrade-host-blacklist`: Hosts to never upgrade from `http` to `https`.
- `redirect-host-without-www-dot-prefixes`: Hosts that are considered redirects in the sense that they return HTTP 3xx status codes. URLs with hosts in this set (as well as URLs with hosts that are "www." then a host in this set) will have the `ExpandRedirect` mapper applied.
- `redirect-reg-domains`: The `redirect-host-without-www-dot-prefixes` set but using the `RegDomain` of the URL.
- `remove-empty-fragment-reg-domain-blacklist`: The `RegDomain`s to not remove an empty fragment (the #stuff at the end, but specifically just a #) from.
- `remove-empty-query-reg-domain-blacklist`: The `RegDomain`s to not remove an empty query from.
- `remove-www-subdomain-reg-domain-blacklist`: `RegDomain`s where a `www` `Subdomain` is important and thus won't have it removed.
- `unmangle-path-is-url-blacklist`: Effectively the `no-unmangle-path-is-url` flag for the specified `Host`s.
- `unmangle-subdomain-ends-in-reg-domain-reg-domain-blacklist`: Effectively the `no-unmangle-subdomain-ends-in-reg-domain` flag for the specified `RegDomain`s.
- `unmangle-subdomain-starting-with-www-segment-reg-domain-blacklist`: Effectively the `no-unmangle-subdomain-starting-with-www-segment` flag for the specified `RegDomain`s.
- `unmobile-reg-domain-blacklist`: Effectively unsets the `unmobile` flag for the specified `RegDomain`s.
- `utps`: The set of "universal tracking parameters" that are always removed for any URL with a host not in the `utp-host-whitelist` set. Please note that the `utps` common mapper in the default config also removes any parameter starting with any string in the `utp-prefixes` list, and thus parameters starting with those can be omitted from this set.
- `utps-reg-domain-whitelist`: `RegDomain`s to never remove universal tracking parameters from.
- `utp-prefixes`: If a query parameter starts with any of the strings in this list (such as `utm_`) it is removed.
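As a sketch of the `utp-prefixes` behaviour described above (the output is illustrative and depends on the current default config):

```bash
url-cleaner "https://example.com/page?utm_source=newsletter&id=5"
# Illustrative output, assuming utm_source matches the utm_ prefix rule
# and id isn't treated as a tracking parameter for this host:
# https://example.com/page?id=5
```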
Maps and job context:

- `hwwwwdp_lang_query_params`: The name of the `HostWithoutWWWDotPrefix`'s language query parameter.
- `hwwwwdp_categories`: Categories of similar websites with shared cleaning methods.
- `redirect_shortcut`: For links that use redirect sites but have the final URL in the link's text/title/whatever, this is used to avoid sending that HTTP request.
- `site_name`: For furaffinity contact info links, the name of the website the contact info is for. Used for unmangling.
- `link_text`: The text of the link the job came from.
- `SOURCE_REG_DOMAIN`: The `RegDomain` of the "source" of the jobs. Usually the webpage it came from.
- `SOURCE_URL`: The URL of the "source" of the jobs. Usually the webpage it came from.

Reasonably fast. `benchmarking/benchmark.sh` is a Bash script that runs some Hyperfine and Valgrind benchmarking so I can reliably check for regressions.
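For a rough idea of what that measures, a single run looks something like this (a sketch, not the actual contents of `benchmarking/benchmark.sh`; hyperfine must be installed):

```bash
hyperfine 'url-cleaner "https://x.com?a=2"'
```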
On a mostly stock Lenovo ThinkPad T460s (Intel i5-6300U (4) @ 3.000GHz) running Kubuntu 24.10 (kernel 6.11.0) with "not much" going on (Firefox, Steam, etc. are closed), hyperfine gives me the following benchmark:
Last updated 2025-03-09.
The numbers are in milliseconds.
{
"https://x.com?a=2": {
"0" : 7.383,
"1" : 7.478,
"10" : 7.485,
"100" : 7.840,
"1000" : 10.115,
"10000": 32.909
},
"https://example.com?fb_action_ids&mc_eid&ml_subscriber_hash&oft_ck&s_cid&unicorn_click_id": {
"0" : 7.380,
"1" : 7.451,
"10" : 7.549,
"100" : 7.872,
"1000" : 11.211,
"10000": 45.314
},
"https://www.amazon.ca/UGREEN-Charger-Compact-Adapter-MacBook/dp/B0C6DX66TN/ref=sr_1_5?crid=2CNEQ7A6QR5NM&keywords=ugreen&qid=1704364659&sprefix=ugreen%2Caps%2C139&sr=8-5&ufe=app_do%3Aamzn1.fos.b06bdbbe-20fd-4ebc-88cf-fa04f1ca0da8": {
"0" : 7.378,
"1" : 7.391,
"10" : 7.614,
"100" : 8.530,
"1000" : 12.563,
"10000": 60.176
}
}
For reasons not yet known to me, everything from an Intel i5-8500 (6) @ 4.100GHz to an AMD Ryzen 9 7950X3D (32) @ 5.759GHz seems to max out at between 110 and 140ms per 100k (not a typo) of the Amazon URL, despite the second CPU being significantly more powerful.
In practice, when using URL Cleaner Site and its userscript, performance is significantly (but not severely) worse.
Often the first few cleanings will take a few hundred milliseconds each because the page is still loading.
However, because of the overhead of using HTTP (even if it's just to localhost), subsequent cleanings, for me, are basically always at least 10ms.
The people and projects I have stolen various parts of the default config from.
The Minimum Supported Rust Version is the latest stable release. URL Cleaner may or may not work on older versions, but there's no guarantee.
Although URL Cleaner has various feature flags that can be disabled at compile time to make handling untrusted input safer, no guarantees are made, especially if the config file being used is untrusted.
That said, if you notice any rules that use but don't actually need HTTP requests or other data-leaky features, please let me know.
Note: JSON output is supported.
Unless a `Debug` variant is used, the following should always be true:
The `--json`/`-j` flag can be used to have URL Cleaner output JSON instead of lines.
The exact format is currently in flux, though it should always be identical to URL Cleaner Site's output.
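Since the shape is in flux, a reasonable way to consume it is to inspect it rather than assume a structure (a sketch; `jq .` just pretty-prints whatever is returned):

```bash
cat urls.txt | url-cleaner --json | jq .
```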
Currently, the exit code is determined by the following rules:
URL Cleaner should only ever panic under the following circumstances:
- (When the `debug` feature is enabled) The mutex controlling debug printing indenting is poisoned and a lock is attempted. This should only be possible when URL Cleaner is used as a library.

Outside of these cases, URL Cleaner should never panic. However, as this is equivalent to saying "URL Cleaner has no bugs", no actual guarantees can be made.
URL Cleaner does not accept donations. If you feel the need to donate please instead donate to The Tor Project and/or The Internet Archive.