• Invalid UTF-8 test

    From Rob Swindell to All on Sunday, July 07, 2019 15:45:49
    UTF-8 decoder capability and stress test
    ----------------------------------------

    Markus Kuhn <mkuhn@acm.org> - 1999-04-28

    This test text examines how UTF-8 decoders handle various types of
    corrupted or otherwise interesting UTF-8 sequences. According to ISO
    10646-1, sections R.7 and 2.3c, a device receiving UTF-8 shall
    interpret a "malformed sequence in the same way that it interprets a
    character that is outside the adopted subset".
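
    As a rough illustration of that requirement, the sketch below uses
    Python's built-in UTF-8 codec, which substitutes U+FFFD (the
    replacement character) for malformed input when asked to. The helper
    names decode_lenient and is_well_formed are made up for this example,
    and the sample bytes are only assumed to match a few of the test
    sequences that follow.

```python
# Minimal sketch: lenient vs. strict decoding with Python's standard codec.
# decode_lenient and is_well_formed are hypothetical helper names.

def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, replacing each malformed sequence with U+FFFD."""
    return data.decode("utf-8", errors="replace")


def is_well_formed(data: bytes) -> bool:
    """Return True if the strict decoder accepts the data as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    samples = {
        "correct Greek text": "κόσμε".encode("utf-8"),
        "lone continuation byte": b"\x80",
        "impossible byte": b"\xff",
        "overlong NUL": b"\xc0\x80",
    }
    for label, raw in samples.items():
        print(f"{label}: ok={is_well_formed(raw)} text={decode_lenient(raw)!r}")
```

    Note that Python's codec follows the current 4-byte definition of
    UTF-8, so it also rejects the 5- and 6-byte forms listed below.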

    Test sequences (all enclosed in ""):

    Correct UTF-8 text (Greek word 'kosme'): "κόσμε"
    Correct 2-byte sequence (U+00000080): "€"
    Correct 3-byte sequence (U+00000800): "ࠀ"
    Correct 4-byte sequence (U+00010000): "𐀀"
    Correct 5-byte sequence (U+00200000): "°êÇÇÇ"
    Correct 6-byte sequence (U+04000000): "ⁿäÇÇÇÇ"
    Correct 2-byte sequence (U+000007ff): "▀┐"
    Correct 3-byte sequence (U+0000ffff): "￿"
    Correct 4-byte sequence (U+001fffff): "≈┐┐┐"
    Correct 5-byte sequence (U+03ffffff): "√┐┐┐┐"
    Correct 6-byte sequence (U+7fffffff): "²┐┐┐┐┐"
    Overlong 2-byte sequence (U+0000): "└Ç"
    Overlong 3-byte sequence (U+0000): "αÇÇ"
    Overlong 4-byte sequence (U+0000): "≡ÇÇÇ"
    Overlong 5-byte sequence (U+0000): "°ÇÇÇÇ"
    Overlong 6-byte sequence (U+0000): "ⁿÇÇÇÇÇ"
    Unexpected continuation byte (10000000): "Ç"
    Another lonely continuation byte (10111111): "┐"
    Sequence of 2 unexpected continuation bytes: "Ç┐"
    Sequence of 3 unexpected continuation bytes: "Ç┐Ç"
    Sequence of 4 unexpected continuation bytes: "Ç┐Ç┐"
    Sequence of 5 unexpected continuation bytes: "Ç┐Ç┐Ç"
    Sequence of 6 unexpected continuation bytes: "Ç┐Ç┐Ç┐"
    Sequence of 7 unexpected continuation bytes: "Ç┐Ç┐Ç┐Ç"
    Sequence of all 64 possible continuation bytes (10000000-10111111): "ÇüéâäàåçêëèïîìÄÅ
    ÉæÆôöòûùÿÖÜ¢£¥₧ƒ
    áíóúñѪº¿⌐¬½¼¡«»
    ░▒▓│┤╡╢╖╕╣║╗╝╜╛┐"
    Sequence of all 32 first bytes of 2-byte sequences (11000000-11011111),
    each followed by a space character:
    "└ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬ ╧
    ╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐ ▀ "
    Sequence of all 16 first bytes of 3-byte sequences (11100000-11101111),
    each followed by a space character: "α ß Γ π Σ σ µ τ Φ Θ Ω δ ∞ ø ε ∩ "
    Sequence of all 8 first bytes of 4-byte sequences (11110000-11110111),
    each followed by a space character: "≡ ± ≥ ≤ ⌠ ⌡ ÷ ≈ "
    Sequence of all 4 first bytes of 5-byte sequences (11111000-11111011),
    each followed by a space character: "° ∙ · √ "
    Sequence of all 2 first bytes of 6-byte sequences (11111100-11111101),
    each followed by a space character: "ⁿ ² "
    Impossible byte (11111110): "■"
    Impossible byte (11111111): " "
    2-byte sequence with last byte missing: "└"
    3-byte sequence with last byte missing: "αÇ"
    4-byte sequence with last byte missing: "≡ÇÇ"
    5-byte sequence with last byte missing: "°ÇÇÇ"
    6-byte sequence with last byte missing: "ⁿÇÇÇÇ"
    All these 5 sequences with last byte missing concatenated:
    "└αÇ≡ÇÇ°ÇÇÇⁿÇÇÇÇ"

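    The byte ranges in the list above can be turned into a small
    structural checker. This is only a sketch: it follows the original
    scheme of up to 6 bytes per sequence that the test text uses (current
    UTF-8 stops at 4 bytes), it does not detect overlong forms or
    out-of-range code points, and the names expected_length and scan are
    invented for the example.

```python
from typing import List, Optional, Tuple


def expected_length(byte: int) -> Optional[int]:
    """Return the sequence length announced by a byte.

    0 means a continuation byte (10xxxxxx); None means 0xFE or 0xFF,
    which can never appear in UTF-8.
    """
    if byte < 0x80:
        return 1          # 0xxxxxxx: ASCII
    if byte < 0xC0:
        return 0          # 10xxxxxx: continuation byte
    if byte < 0xE0:
        return 2          # 110xxxxx
    if byte < 0xF0:
        return 3          # 1110xxxx
    if byte < 0xF8:
        return 4          # 11110xxx
    if byte < 0xFC:
        return 5          # 111110xx
    if byte < 0xFE:
        return 6          # 1111110x
    return None           # 11111110 and 11111111


def scan(data: bytes) -> List[Tuple[int, str]]:
    """List (offset, problem) pairs for structural errors in data."""
    problems: List[Tuple[int, str]] = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n is None:
            problems.append((i, "impossible byte"))
            i += 1
        elif n == 0:
            problems.append((i, "unexpected continuation byte"))
            i += 1
        else:
            # Consume the continuation bytes that actually follow the lead byte.
            j = i + 1
            while j < len(data) and j < i + n and expected_length(data[j]) == 0:
                j += 1
            if j - i < n:
                problems.append((i, "truncated sequence"))
            i = j
    return problems


if __name__ == "__main__":
    # The five "last byte missing" sequences, concatenated.
    concatenated = b"\xc0\xe0\x80\xf0\x80\x80\xf8\x80\x80\x80\xfc\x80\x80\x80\x80"
    for offset, problem in scan(concatenated):
        print(offset, problem)
```

    Run on the concatenated "last byte missing" bytes, the scan reports
    one truncated sequence per lead byte, which is the behaviour the test
    is probing for.
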
    digital man

    This Is Spinal Tap quote #17:
    David St. Hubbins: It's such a fine line between stupid, and uh... and clever.
    Norco, CA WX: 79.0°F, 54.0% humidity, 14 mph ESE wind, 0.00 inches rain/24hrs