content_inspector

Crates.io	content_inspector
lib.rs	content_inspector
version	0.2.4
created_at	2018-06-03 17:56:59.765896+00
updated_at	2018-11-04 16:27:15.394881+00
description	Fast inspection of binary buffers to guess/determine the encoding
homepage	https://github.com/sharkdp/content_inspector
repository	https://github.com/sharkdp/content_inspector
id	68379
size	27,566

David Peter (sharkdp)

content_inspector

A simple library for fast inspection of binary buffers to guess the type of content.

This is mainly intended to quickly determine whether a given buffer contains "binary" or "text" data. Programs like grep or git diff use similar mechanisms to decide whether to treat some files as "binary data" or not.

The analysis is based on a very simple heuristic: it searches for NULL bytes (which indicate "binary" content) and detects byte order marks (which indicate a particular text encoding). Note that this analysis can fail. For example, UTF-8-encoded text can legally contain NULL bytes, even if this is unlikely. Conversely, some binary formats (like binary PGM) may not contain any NULL bytes. Also, for performance reasons, only the first 1024 bytes are checked for a NULL byte (if no BOM was detected).
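
The heuristic described above can be sketched in a few lines of standard-library Rust. This is an illustration only, not the crate's actual implementation: the `Guess` enum, the `guess` function and the `MAX_SCAN` constant are hypothetical names introduced for this example, and the order in which the byte order marks are tested is an assumption.

```rust
/// Hypothetical result type for this sketch (not the crate's `ContentType`).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Guess {
    Utf8,
    Utf8Bom,
    Utf16Le,
    Utf16Be,
    Utf32Le,
    Utf32Be,
    Binary,
}

/// Number of leading bytes that are scanned for a NULL byte.
const MAX_SCAN: usize = 1024;

/// Guess the content type of `buffer` from its BOM or from NULL bytes.
fn guess(buffer: &[u8]) -> Guess {
    // The 4-byte UTF-32 marks are listed before the 2-byte UTF-16 marks,
    // because the UTF-32LE mark begins with the UTF-16LE mark.
    const BOMS: [(&[u8], Guess); 5] = [
        (&[0xEF, 0xBB, 0xBF], Guess::Utf8Bom),
        (&[0x00, 0x00, 0xFE, 0xFF], Guess::Utf32Be),
        (&[0xFF, 0xFE, 0x00, 0x00], Guess::Utf32Le),
        (&[0xFE, 0xFF], Guess::Utf16Be),
        (&[0xFF, 0xFE], Guess::Utf16Le),
    ];
    for &(bom, kind) in BOMS.iter() {
        if buffer.starts_with(bom) {
            return kind;
        }
    }

    // No BOM found: look for a NULL byte in the first MAX_SCAN bytes only.
    let scanned = &buffer[..buffer.len().min(MAX_SCAN)];
    if scanned.contains(&0) {
        Guess::Binary
    } else {
        Guess::Utf8
    }
}
```

In this sketch, a buffer whose only NULL byte lies beyond the first 1024 bytes is still classified as text, which mirrors the limitation mentioned above.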

If this library reports a certain type of encoding (say UTF_16LE), there is no guarantee that the binary buffer can actually be decoded as UTF-16LE.
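
If a decodable string is required, the reported encoding therefore has to be verified separately. The following sketch uses only the standard library to do this for UTF-16LE; `decode_utf16le` is a hypothetical helper written for this example and is not part of the crate.

```rust
/// Try to decode a buffer that begins with the UTF-16LE byte order mark.
/// Returns `None` if the mark is missing, the payload has an odd length,
/// or the code units do not form valid UTF-16.
fn decode_utf16le(buffer: &[u8]) -> Option<String> {
    if !buffer.starts_with(&[0xFF, 0xFE]) {
        return None;
    }
    let payload = &buffer[2..];
    if payload.len() % 2 != 0 {
        return None;
    }

    // Combine each pair of bytes into one little-endian 16-bit code unit.
    let units: Vec<u16> = payload
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();

    // `String::from_utf16` rejects unpaired surrogates.
    String::from_utf16(&units).ok()
}
```

For instance, the bytes FF FE 00 D8 begin with the UTF-16LE byte order mark, but the payload is a lone surrogate (0xD800), so the helper returns `None` even though a BOM check alone would label the buffer as UTF-16LE.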

Usage

use content_inspector::{ContentType, inspect};

assert_eq!(ContentType::UTF_8, inspect(b"Hello"));
assert_eq!(ContentType::BINARY, inspect(b"\xFF\xE0\x00\x10\x4A\x46\x49\x46\x00"));

assert!(inspect(b"Hello").is_text());

CLI example

This crate also comes with a small example command-line program (see examples/inspect.rs) that demonstrates the usage:

> inspect
USAGE: inspect FILE [FILE...]

> inspect testdata/*
testdata/create_text_files.py: UTF-8
testdata/file_sources.md: UTF-8
testdata/test.jpg: binary
testdata/test.pdf: binary
testdata/test.png: binary
testdata/text_UTF-16BE-BOM.txt: UTF-16BE
testdata/text_UTF-16LE-BOM.txt: UTF-16LE
testdata/text_UTF-32BE-BOM.txt: UTF-32BE
testdata/text_UTF-32LE-BOM.txt: UTF-32LE
testdata/text_UTF-8-BOM.txt: UTF-8-BOM
testdata/text_UTF-8.txt: UTF-8

If you only want to detect whether a file is binary or text, this is about 250 times faster than file --mime ....

License

Licensed under either of

Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.
