The X12 spec for character sets defines the standard as graphic-character oriented.
The spec allows numerous encodings to be used for extended character sets, but it allows only graphical characters.
I'd like to be clear about what a graphical character is, for two reasons: so that I can implement length constraints and measure string lengths correctly, and so that I can verify that X12 data is valid and contains only characters that are in fact graphical.
My interpretation is that what X12 refers to as graphical characters matches what Unicode calls "grapheme clusters". http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
Could you clarify whether grapheme clusters map to the X12 definition of "graphical characters" when Unicode is used for the underlying EDI string data?
The spec does not explicitly define what a graphical character is.
I need to understand the spec in order to build systems that read, write, and validate X12 EDI.
To give a concrete example:
The char "g̈" is 3 bytes in UTF-8.
It is 2 code units (4 bytes) in UTF-16.
It is one grapheme cluster.
How long is it as an X12 string when it comes to enforcing min/max lengths? My guess is that it's one character and that X12 should be using grapheme clusters for its string length measurement.
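As a rough illustration of why the measure matters, the following Python sketch computes the UTF-8 byte length, the UTF-16 code unit count, and the code point count for the example. It assumes the example character is "g" followed by U+0308 COMBINING DIAERESIS; the variable names are illustrative only, and the sketch does not attempt grapheme cluster segmentation.

```python
# Assumption: the example character is "g" + U+0308 COMBINING DIAERESIS.
s = "g\u0308"

# Bytes when serialized as UTF-8 (1 byte for "g", 2 bytes for U+0308).
utf8_bytes = len(s.encode("utf-8"))

# Bytes when serialized as UTF-16; the "-le" variant avoids counting a BOM.
utf16_bytes = len(s.encode("utf-16-le"))

# Each UTF-16 code unit is 2 bytes.
utf16_code_units = utf16_bytes // 2

# Python's len() on a str counts Unicode code points, not grapheme clusters.
code_points = len(s)

print(utf8_bytes, utf16_bytes, utf16_code_units, code_points)
```

None of these four numbers is 1, so a length check based on bytes, code units, or code points would not treat the example as a single character.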
The submission contains three questions. They are summarized and numbered below, followed by the answers:
1. Could you clarify whether grapheme clusters map to the X12 definition of "graphical characters" when Unicode is used for the underlying EDI string data?
2. How long is a character made of several code points, such as the example given, as an X12 string when it comes to enforcing min/max lengths?
3. What about non-graphical characters such as Unicode control characters, e.g. \u0000, the null code point? Should these be disallowed by the X12 spec because they are not 'graphical'? https://en.wikipedia.org/wiki/Null_character
1. Originally X12 content was encoded, for the most part, as ASCII graphic characters. With the publication of the ISX segment, multiple character encodings are available to trading partners, including several Unicode encodings.
A grapheme cluster is a sequence of one or more Unicode code points that represents a single visual element or character in a writing system. Grapheme clusters can be composed of multiple individual characters or combining characters that are rendered as a single unit.
A graphic character, on the other hand, is a basic unit of writing in a particular writing system that is capable of being displayed on a screen or printed on paper. It is a character that has its own visual representation, and can include letters, numbers, punctuation marks, and other symbols.
In other words, a grapheme cluster is a sequence of Unicode code points that represents a single visual element or character, while a graphic character is a basic unit of writing that can be displayed on a screen or printed on paper.
2. If the transmission is encoded as one of the Unicode variants (see data element I70 - Character Encoding), the length of a grapheme cluster is one, even if multiple individual characters compose that single unit.
3. Yes, the NULL character has no visual representation and is disallowed.
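The answers above could be applied in a validator along the following lines. This is a minimal sketch under stated assumptions, not an X12-defined algorithm: `is_graphic`, `approx_grapheme_count`, and `check_element` are hypothetical names; "graphic" is assumed to follow the Unicode glossary definition (General Category L, M, N, P, S, or Zs); and the cluster count is a simplified approximation that only attaches combining marks to the preceding character, not a full UAX #29 segmentation (it ignores emoji ZWJ sequences, Hangul jamo, regional indicators, and similar cases).

```python
import unicodedata

# Assumption: "graphic" follows the Unicode glossary definition, i.e.
# General Category Letter, Mark, Number, Punctuation, Symbol, or Zs.
GRAPHIC_MAJOR_CLASSES = {"L", "M", "N", "P", "S"}


def is_graphic(ch):
    """Return True if a single code point is graphic under the assumption above."""
    cat = unicodedata.category(ch)
    return cat[0] in GRAPHIC_MAJOR_CLASSES or cat == "Zs"


def approx_grapheme_count(text):
    """Approximate grapheme cluster count (NOT full UAX #29).

    A combining mark (category M*) is attached to the preceding code point;
    every other code point, and a mark at the very start, begins a new cluster.
    """
    count = 0
    for i, ch in enumerate(text):
        is_mark = unicodedata.category(ch).startswith("M")
        if i == 0 or not is_mark:
            count += 1
    return count


def check_element(value, min_len, max_len):
    """Return a list of problems found in a hypothetical data element value."""
    problems = []
    for ch in value:
        if not is_graphic(ch):
            problems.append(f"non-graphic code point U+{ord(ch):04X}")
    length = approx_grapheme_count(value)
    if not (min_len <= length <= max_len):
        problems.append(f"length {length} outside {min_len}..{max_len}")
    return problems


print(check_element("g\u0308", 1, 1))   # "g" + combining diaeresis
print(check_element("AB\u0000", 1, 3))  # contains the null code point
```

A production implementation would more likely rely on a library that implements the UAX #29 boundary rules in full; the approximation here is only meant to show where the length and graphic-character checks would sit.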