utf8-norm, validate and normalize UTF-8 Unicode data

ABOUT

Version 1.1.0 licensed GPLv3. (C) 2019 Leonora Tindall <nora@nora.codes>
Fast command-line Unicode normalization, supporting stream safety transformations as well
as NFC, NFD, NFKD, and NFKC. Exits with failure if the incoming stream is not valid UTF-8.
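
The validity check mentioned above can be illustrated with the Rust standard library alone. This is only a sketch: the helper name `validate_utf8` is hypothetical and is not taken from the utf8-norm source, whose actual implementation may differ.

```rust
use std::str;

/// Hypothetical helper (not from the utf8-norm source): returns the input as
/// a `&str` if it is valid UTF-8, or the byte offset of the first invalid
/// sequence otherwise.
fn validate_utf8(bytes: &[u8]) -> Result<&str, usize> {
    str::from_utf8(bytes).map_err(|e| e.valid_up_to())
}

fn main() {
    // "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
    assert_eq!(validate_utf8(&[0xC3, 0xA9]), Ok("é"));
    // A lone continuation byte (0xA9) after "a" is not valid UTF-8;
    // the valid prefix is one byte long.
    assert_eq!(validate_utf8(&[b'a', 0xA9]), Err(1));
}
```

A tool built this way could report the offset and exit with a non-zero status when the check fails.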

USAGE

Usage: utf8-norm [--nfc | --nfd | --nfkc | --nfkd] [--stream-safe] [--crlf] [--buffered] [<infile> [<outfile>]]

<infile> (default stdin) - file from which to read bytes.
<outfile> (default stdout) - file to which to write normalized Unicode.
-w, --crlf - write CRLF (Windows) instead of LF only (Unix) at the end of lines.
-d, --nfd - write NFD (canonical decomposition).
-D, --nfkd - write NFKD (compatibility decomposition).
-c, --nfc - write NFC (canonical composition computed from NFD). This is the default.
-C, --nfkc - write NFKC (canonical composition computed from NFKD).
-s, --stream-safe - write stream-safe bytes (Combining Grapheme Joiners, UAX15-D4).
-b, --buffered - read the entire input file into memory before operating on it.
-V, --version - output version information and exit.
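
To see why the choice of form in the options above matters, the sketch below compares the precomposed and decomposed spellings of "é". It uses only the Rust standard library, so it does not perform any normalization itself; the helper name `code_points` is hypothetical. The two strings render identically but differ byte for byte: NFC output would use the first spelling and NFD output the second.

```rust
/// Hypothetical helper: lists the Unicode code points of a string.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}

fn main() {
    // Precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE (the NFC spelling).
    let composed = "\u{00E9}";
    // Decomposed: U+0065 followed by U+0301 COMBINING ACUTE ACCENT (the NFD spelling).
    let decomposed = "e\u{0301}";

    // Rust compares strings byte-wise, so without normalization these are unequal.
    assert_ne!(composed, decomposed);
    assert_eq!(composed.as_bytes(), b"\xC3\xA9");
    assert_eq!(decomposed.as_bytes(), b"\x65\xCC\x81");

    assert_eq!(code_points(composed), vec![0xE9]);
    assert_eq!(code_points(decomposed), vec![0x65, 0x301]);
}
```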

utf8-norm operates linewise on the input unless --buffered is specified.
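
A linewise loop of the kind described above might look roughly like the following. This is a sketch under stated assumptions, not the utf8-norm implementation: the function name `copy_lines` and its `crlf` parameter are hypothetical, and the normalization step is left as a comment because the standard library does not provide one.

```rust
use std::io::{self, BufRead, Write};

/// Hypothetical linewise loop: reads one line at a time and writes it back
/// with the requested line ending.
fn copy_lines<R: BufRead, W: Write>(input: R, mut output: W, crlf: bool) -> io::Result<()> {
    let ending = if crlf { "\r\n" } else { "\n" };
    for line in input.lines() {
        // `lines()` strips a trailing "\n" or "\r\n" and yields an error
        // if the line is not valid UTF-8.
        let line = line?;
        // A real tool would normalize `line` here before writing it.
        output.write_all(line.as_bytes())?;
        output.write_all(ending.as_bytes())?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // In-memory input and output keep the example self-contained.
    let mut out = Vec::<u8>::new();
    copy_lines("one\ntwo\n".as_bytes(), &mut out, true)?;
    assert_eq!(out, b"one\r\ntwo\r\n");
    Ok(())
}
```

Because each line is written as soon as it is read, memory use stays proportional to the longest line rather than to the whole input. Note that this sketch always terminates the final line, even if the input did not.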

The --buffered option is primarily useful for reading from and writing to the same file. It
reads bytes from the input until end of file and only then begins processing the lines of
the input.
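
The read-everything-first pattern described above can be sketched as follows. Opening a file for writing normally truncates it, so the contents must already be in memory when input and output are the same path. The function name `rewrite_in_place` and the uppercase stand-in transformation are hypothetical; utf8-norm itself may be structured differently.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical sketch: reads the whole file, transforms it in memory, and
/// only then writes the result back to the same path.
fn rewrite_in_place(path: &Path, transform: impl Fn(&str) -> String) -> io::Result<()> {
    // Read everything first; this fails if the file is not valid UTF-8.
    let original = fs::read_to_string(path)?;
    let rewritten = transform(&original);
    // `fs::write` truncates the file, which is safe now that its contents
    // are held in memory.
    fs::write(path, rewritten)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("utf8_norm_sketch.txt");
    fs::write(&path, "hello\n")?;
    // Stand-in transformation; a real tool would normalize here.
    rewrite_in_place(&path, |s: &str| s.to_uppercase())?;
    assert_eq!(fs::read_to_string(&path)?, "HELLO\n");
    fs::remove_file(&path)
}
```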

EXAMPLES

Write the contents of input.txt, compatibly decomposed, with CRLF line endings,
to output.txt:

    utf8-norm --nfkd --crlf input.txt output.txt

Normalize file.md in place to the canonical composition, buffering the file in memory to
avoid overwriting it with zeros:

    utf8-norm --buffered file.md file.md

Emit the output of my_program to stdout, in the canonical composition, linewise:

    my_program | utf8-norm

Buffer the entire output of my_program in memory, and emit it to
my_program.out in the canonical composition after receiving end-of-file:

    my_program | utf8-norm --buffered - my_program.out

ABOUT

utf8-norm was created at Rust Belt Rust 2019 in Dayton, OH. Thanks to @j41manning for her
excellent talk regarding Unicode handling in Rust.

Install with `cargo install utf8-norm` or from your distribution's package manager.