Accordance fails to correctly restore workspace when search box contains en dash in verse reference

February 25, 2022

Environment

Accordance 13.3.2 (13.3.2.0)
Windows 10 (10.0.19044)

Description
Accordance accepts and correctly handles a Unicode en dash (U+2013) as part of a verse reference in the search box. However, upon saving and reloading a workspace with search text containing an en dash, Accordance displays an error indicating the search text is invalid and replaces the en dash in the original search text with three garbage characters.

Reproduction frequency
Always

Reproduction steps

1. Create a new workspace.
2. If necessary, open a new tab for the "ESV with Strong's" text.
3. If necessary, change the search type to "Verses".
4. Enter "Eph. 6:5–8" into the search box. Note that the range character in this verse reference is an en dash (U+2013), not a standard hyphen (U+002D).
5. Press "Enter" to execute the search. Observe that Accordance performs the search without error or any kind of warning about extra characters after the verse reference.
6. Save the workspace.
7. Close the workspace.
8. Reopen the workspace closed in step (7).

Expected behavior
The workspace should open without error. The text entered in step (4) should be present in the search box (although it would be acceptable if the original en dash was replaced with a standard hyphen).
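
To illustrate the acceptable fallback mentioned above (swapping the en dash for a standard hyphen), here is a minimal Python sketch. The function name normalize_range_dash is hypothetical and made up for this example; it says nothing about how Accordance actually parses references.

```python
# Hypothetical helper: replace an en dash (U+2013) in a verse reference
# with a standard hyphen (U+002D). Illustration only, not Accordance code.
EN_DASH = "\u2013"
HYPHEN = "\u002d"


def normalize_range_dash(reference: str) -> str:
    """Return the reference with every en dash replaced by a hyphen."""
    return reference.replace(EN_DASH, HYPHEN)


normalized = normalize_range_dash("Eph. 6:5\u20138")
print(normalized)
```

The same substitution could also be applied by hand (or by a clipboard tool) to text copied from an ebook before pasting it into the search box.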

Actual behavior
The error message "There are extra characters after the end of the verse reference" is displayed (see attached image open-workspace-error.png). The search box contains the text "Eph. 6:5ÔøΩ8".

Analysis
Examining the workspace file created in step (6), I observed that the string entered in the search box is stored as a sequence of one-byte characters, and, specifically, the Unicode en dash character (U+2013) in the search string is encoded as the single byte 0xD0, which is the code point for the en dash character in the Mac OS Roman encoding. Therefore, it seems safe to say that, when serializing the search string, Accordance (or the Mac emulation framework it runs within on Windows) encodes the Unicode search string using Mac OS Roman, translating applicable characters during the process (in this case, a Unicode en dash to a Mac OS Roman en dash).
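
The serialization half of this can be checked outside of Accordance. The sketch below uses Python's built-in "mac_roman" codec as a stand-in for whatever Accordance does internally (that is an assumption on my part) and shows the en dash coming out as the single byte 0xD0.

```python
# Encode the search text with Mac OS Roman, using Python's built-in
# "mac_roman" codec as a stand-in for Accordance's serializer (assumption).
search_text = "Eph. 6:5\u20138"  # U+2013 is the en dash

stored = search_text.encode("mac_roman")

# One byte per character; the en dash becomes 0xD0.
print(stored.hex(" "))  # 45 70 68 2e 20 36 3a 35 d0 38
```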

However, what happens during deserialization of the search string—from its Mac OS Roman-encoded form in the workspace file to its final display in the search box—is not as clear to me. The three characters "ÔøΩ" displayed in the search box after loading the workspace file are represented on Windows by the sequence of Unicode characters "U+00D4 U+00F8 U+03A9". In the Mac OS Roman encoding, those same three characters are represented by the three-byte sequence "0xEF 0xBF 0xBD". This three-byte sequence also represents the single Unicode character U+FFFD (REPLACEMENT CHARACTER) in the UTF-8 encoding. This is a total guess, but those various encodings seem to suggest something like the following is happening:
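
The byte-level relationships described in the previous paragraph can be confirmed with a short Python check, again using the built-in "mac_roman" codec as an approximation of the encoding tables involved (an assumption, since I can't see what Accordance actually uses).

```python
# The three characters shown in the search box after reloading.
displayed = "\u00d4\u00f8\u03a9"

# Their Mac OS Roman byte representation.
raw = displayed.encode("mac_roman")
print(raw.hex(" "))  # ef bf bd

# The same three bytes, read as UTF-8, are a single character: U+FFFD.
as_utf8 = raw.decode("utf-8")
print(f"U+{ord(as_utf8):04X}")  # U+FFFD
```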

1. The raw search string bytes read from the workspace file, "0x45 0x70 0x68 0x2E 0x20 0x36 0x3A 0x35 0xD0 0x38", are first UTF-8 decoded. The decoder sees 0xD0 ["11010000"], the lead byte of a two-byte sequence, followed by 0x38 ["00111000"], which is not a valid continuation byte because it doesn't match the mask "10xxxxxx". It therefore replaces the lone 0xD0 with U+FFFD and moves on to the next byte (0x38), which is a valid one-byte UTF-8 encoding. The resulting Unicode string is "Eph. 6:5�8".
2. The decoded Unicode string is then, for whatever reason, once again UTF-8 encoded, resulting in the byte sequence "0x45 0x70 0x68 0x2E 0x20 0x36 0x3A 0x35 0xEF 0xBF 0xBD 0x38".
3. The encoded byte sequence is now Mac OS Roman decoded. During this process, the Mac characters in the range 0x80-0xFF are translated into their Unicode equivalents (0xEF to U+00D4, 0xBF to U+00F8, and 0xBD to U+03A9). The resulting Unicode string is "Eph. 6:5ÔøΩ8", which is used to initialize the search box.
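
For what it's worth, the three steps above can be reproduced in a few lines of Python. This is only a sketch of the guessed scenario, not a claim about Accordance's real code path; the function name simulate_reload is made up for this example, and Python's "utf-8" and "mac_roman" codecs are assumed to behave like whatever decoders Accordance uses.

```python
def simulate_reload(stored: bytes) -> str:
    """Apply the guessed decode/encode/decode chain to the stored bytes."""
    # Step 1: UTF-8 decode, replacing invalid bytes with U+FFFD.
    decoded = stored.decode("utf-8", errors="replace")
    # Step 2: UTF-8 encode again (U+FFFD becomes EF BF BD).
    reencoded = decoded.encode("utf-8")
    # Step 3: Mac OS Roman decode, one character per byte.
    return reencoded.decode("mac_roman")


# Bytes as found in the workspace file (en dash stored as 0xD0).
workspace_bytes = bytes.fromhex("4570682e20363a35d038")

restored = simulate_reload(workspace_bytes)
print(restored)
```

Feeding in the bytes from the workspace file yields the same garbled text that Accordance shows in the search box, which at least makes the scenario plausible.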

Again, that's just one possible scenario. But, regardless of the actual details, it seems there is some kind of asymmetry in character encoding between serialization and deserialization of the search string: the string appears to be written as Mac OS Roman but, at some stage, read back as UTF-8.

Notes
The use case that prompted this report was copying a verse reference from an ebook that regularly employs en dashes rather than standard hyphens for ranges.

A sample workspace that demonstrates the bug is attached (search-en-dash-bug.accord).

Workaround
Simply dismiss the error message and reenter the search text, if necessary.

Note, however, that the invalid characters with which Accordance initializes the search box ("ÔøΩ" above) need to be replaced before the session is autosaved. If they are not, and the preference is set to restore the previous session, Accordance will crash the next time it is started (see attached crash log Accordance_2022_02_25 09_10_48.txt).

search-en-dash-bug.accord Accordance_2022_02_25 09_10_48.txt

Steven S
