json: disallow overlong and out-of-range UTF-8 - zig - General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software. https://ziglang.org

author	hryx <codroid@gmail.com>	2020-01-05 23:16:38 -0800
committer	Andrew Kelley <andrew@ziglang.org>	2020-01-07 12:07:44 -0500
commit	2933a8241a54af436f2df5eac73aa2acf5eabd40 (patch)
tree	b0a937c67db1a30df6251273ab51a80c9682f053 /src/link.cpp
parent	9390e8b84883331757d3ce15cfca89279aceb090 (diff)
download	zig-2933a8241a54af436f2df5eac73aa2acf5eabd40.tar.gz zig-2933a8241a54af436f2df5eac73aa2acf5eabd40.zip

json: disallow overlong and out-of-range UTF-8

Fixes #2379

= Overlong (non-shortest) sequences

UTF-8's unique encoding scheme allows for some Unicode codepoints to be
represented in multiple ways. For any of these characters, the spec
forbids all but the shortest form. These disallowed longer sequences are
called "overlong". As an interesting side effect of this rule, the bytes
C0 and C1 never appear in valid UTF-8.

= Codepoint range

UTF-8 disallows representation of codepoints beyond U+10FFFF, which is
the highest character which can be encoded in UTF-16. Because a 4-byte
sequence is capable of resulting in such characters, they must be
explicitly rejected. This rule also has an interesting side effect, which
is that bytes F5 to FF never appear.

= References

Detecting an overlong version of a codepoint could get gnarly, but
luckily The Unicode Consortium did the hard work by creating this handy
table of valid byte sequences:

https://unicode.org/versions/corrigendum1.html

I thought this mapped nicely to the parser's state machine, so I
rearranged the relevant states to make use of it.
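
To illustrate the table-driven idea described above, here is a minimal sketch in Python. It is not the code from this commit, and it is not Zig's JSON parser. The names `RULES` and `is_valid_utf8` are hypothetical, and the byte ranges are written out from the two rules in this message (no overlong forms, nothing above U+10FFFF) rather than copied from the parser. Surrogate handling is deliberately left out, since this commit message does not discuss it.

```python
# Hypothetical rule table: each entry constrains the lead byte and the
# second byte; any further bytes must be plain continuation bytes (80..BF).
# (lead_lo, lead_hi, second_lo, second_hi, total_length)
RULES = [
    (0x00, 0x7F, None, None, 1),  # ASCII
    (0xC2, 0xDF, 0x80, 0xBF, 2),  # C0 and C1 excluded: always overlong
    (0xE0, 0xE0, 0xA0, 0xBF, 3),  # second byte >= A0 rules out overlong forms
    (0xE1, 0xEF, 0x80, 0xBF, 3),
    (0xF0, 0xF0, 0x90, 0xBF, 4),  # second byte >= 90 rules out overlong forms
    (0xF1, 0xF3, 0x80, 0xBF, 4),
    (0xF4, 0xF4, 0x80, 0x8F, 4),  # second byte <= 8F keeps us <= U+10FFFF
    # F5..FF have no entry, so they are rejected as lead bytes.
]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data has no overlong or out-of-range sequences."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # Find the rule whose lead-byte range covers this byte.
        for lead_lo, lead_hi, sec_lo, sec_hi, length in RULES:
            if lead_lo <= lead <= lead_hi:
                break
        else:
            return False  # stray continuation byte, C0/C1, or F5..FF
        if i + length > n:
            return False  # sequence is truncated
        if length > 1:
            # The second byte carries the overlong / range restriction.
            if not (sec_lo <= data[i + 1] <= sec_hi):
                return False
            # Remaining bytes are ordinary continuation bytes.
            for b in data[i + 2 : i + length]:
                if not (0x80 <= b <= 0xBF):
                    return False
        i += length
    return True
```

With this table, `b"\xC0\x80"` (an overlong encoding of U+0000) and `b"\xF4\x90\x80\x80"` (one past U+10FFFF) are both rejected, while `b"\xF4\x8F\xBF\xBF"` (U+10FFFF itself) is accepted. Because each rule depends only on the lead byte, a state machine can apply the same checks by choosing its next state from the lead byte.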

Diffstat (limited to 'src/link.cpp')

0 files changed, 0 insertions, 0 deletions
