Command line tool to validate and normalize UTF-8 data
Go to file
Leonora Tindall 80848d0802
Add --buffered and same stream check
Before this, running the program with the same input and output file
would destroy the file's contents. Now, it isn't allowed without the new
--buffered option, which reads the whole file into memory before opening
it for writing.

This requires stevedonovan/lapp#1 to be merged so for now this version
is not publishable to crates.io as it is using my fork of lapp.
2019-10-21 08:26:35 -05:00
src Add --buffered and same stream check 2019-10-21 08:26:35 -05:00
.gitignore First commit - working version from RBR 2019-10-19 16:05:52 -04:00
Cargo.lock Add --buffered and same stream check 2019-10-21 08:26:35 -05:00
Cargo.toml Add --buffered and same stream check 2019-10-21 08:26:35 -05:00
README Add --buffered and same stream check 2019-10-21 08:26:35 -05:00

README

utf8-norm, validate and normalize UTF-8 Unicode data

ABOUT

Version 1.1.0 licensed GPLv3. (C) 2019 Leonora Tindall <nora@nora.codes>
Fast command line Unicode normalization, supporting stream safety transformations as well
as NFC, NFD, NFKD, and NFKC. Exits with failure if the incoming stream is not valid UTF-8.

USAGE

Usage: utf8-norm [--nfc | --nfd | --nfkc | --nfkd] [--stream-safe] [--crlf] <infile> <outfile>

    <infile> (default stdin) - file from which to read bytes.
    <outfile> (default stdout) - file to which to write normalized Unicode.
    -w, --crlf  - write CRLF (Windows) instead of LF only (Unix) at the end of lines.
    -d, --nfd   - write NFD (canonical decomposition).
    -D, --nfkd  - write NFKD (compatibility decomposition).
    -c, --nfc   - write NFC (canonical composition computed from NFD). This is the default.
    -C, --nfkc  - write NFKC (canonical composition computed from NFC).
    -s, --stream-safe   - write stream-safe bytes (Conjoining Grapheme Joiners, UAX15-D4).
    -b, --buffered  - read the entire input file into memory before operating on it.
    -V, --version - output version information and exit.

The --buffered option is primarily useful for reading and writing to the same file. It will
read bytes from the input until end of file and only then begin processing lines of the
input.

utf8-norm was created at Rust Belt Rust 2019 in Dayton, OH. Thanks to @j41manning for her
excellent talk regarding Unicode handling in Rust.

Natively install as `cargo install utf8-norm` or from your distribution's package manager.