DR-16-0022: Shared ML: Escaping strings in ST_Xstring

Francis Cave francis at franciscave.com
Fri Nov 10 11:40:31 CET 2017


Murata-san

 

I think you meant _[0-9a-fA-F]{4}_

 

I have tested this by creating a simple spreadsheet in LibreOffice. It seems that LibreOffice does not support this feature. See attached. Here is the string table:

 

<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="4" uniqueCount="4">

    <si>

        <t xml:space="preserve">_xaaaa_</t>

    </si>

    <si>

        <t xml:space="preserve">_xaaaa</t>

    </si>

    <si>

        <t xml:space="preserve">_xgggg_</t>

    </si>

    <si>

        <t xml:space="preserve">_xaaa_</t>

    </si>

</sst>

 

If I open the document in LibreOffice, the first string is displayed as ‘_xaaaa_’. If I open the same document in Excel, the first string is converted to the Unicode character #xaaaa.
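
The difference between the two applications can be illustrated with a minimal Python sketch of the decoding step. It assumes the escape is exactly "_x", four hexadecimal digits (either case), then "_". The name unescape_xstring is hypothetical, and the sketch is not claimed to match Excel's actual implementation.

```python
import re

# Assumed escape syntax: "_x", exactly four hex digits, "_".
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def unescape_xstring(text: str) -> str:
    """Replace each _xHHHH_ sequence with the character U+HHHH.

    Matching is left to right and non-overlapping, so the "_" that
    closes one escape is never reused as the start of the next.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# The four strings from the string table above.
for raw in ["_xaaaa_", "_xaaaa", "_xgggg_", "_xaaa_"]:
    print(repr(raw), "->", repr(unescape_xstring(raw)))
```

Under these assumptions only the first string changes (to the single character U+AAAA). The other three lack either the closing underscore or four valid hexadecimal digits, so they pass through unchanged.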

 

Kind regards,

 

Francis

 

 

 

From: eb2mmrt at gmail.com [mailto:eb2mmrt at gmail.com] On Behalf Of MURATA Makoto
Sent: 10 November 2017 02:00
To: SC 34 WG4 <e-SC34-WG4 at ecma-international.org>
Subject: Re: DR-16-0022: Shared ML: Escaping strings in ST_Xstring

 

Francis,

 

Thanks for your comments.  I now understand.

 

I did some experiments.  Excel appears to escape an underscore 

only when it is the first character of a string matching  _[0-9a-zA-F]{4}_
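
A minimal Python sketch of the writing side, under the assumption that the rule is "escape an underscore as _x005F_ only when it begins a sequence of the form _xHHHH_ with four hexadecimal digits". The function name is hypothetical and the rule is inferred from the experiment, not taken from the specification.

```python
import re

# An underscore that is immediately followed by "x", four hex digits and "_".
# The lookahead keeps the rest of the sequence out of the match, so only
# the leading underscore is replaced.
_NEEDS_ESCAPE = re.compile(r"_(?=x[0-9A-Fa-f]{4}_)")


def escape_literal_underscores(text: str) -> str:
    """Escape each underscore that would otherwise start an _xHHHH_ escape."""
    return _NEEDS_ESCAPE.sub("_x005F_", text)


for literal in ["SW_x3850_CPU", "_xzxcv_", "_x2000B_", "_x3000", "_xffff_"]:
    print(repr(literal), "->", repr(escape_literal_underscores(literal)))
```

With this rule, "SW_x3850_CPU" becomes "SW_x005F_x3850_CPU" and "_xffff_" becomes "_x005F_xffff_", while "_xzxcv_", "_x2000B_" and "_x3000" are left as they are.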

 

Regards,

Makoto

 

    <si>

        <t>SW_x005F_x3850_CPU</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_x005F_x3850_CPU</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_xzxcv</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_xzxcv_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_xzxcwev_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_xFFFFFF_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_x2000B_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_x3000</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_x005F_x3000_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_x005F_xFFFF_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_xFF_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_x0F_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_xF_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_xG_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_xGG_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_xGGG_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_xGGGG_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_x000G_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_xFFF_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_x005F_xffff_</t>

        <phoneticPr fontId="1"/>

    </si>

    <si>

        <t>_x005F_xf3f2_</t>

        <phoneticPr fontId="1"/>

    </si>

 

2017-11-09 23:50 GMT+09:00 Francis Cave <francis at franciscave.com>:

Murata-san

 

I think that this DR is asking how to serialise the literal string “SW_x3850_CPU”, not “SW㡐CPU”. If “_xHHHH_” is interpreted as the Unicode character #xHHHH, any literal string in the form “_xHHHH_” has to have the initial “_” escaped, which is what Charlie is saying that Excel does. But does this mean that “_” is always escaped by Excel, or only escaped in certain contexts, such as if followed by “x”? Does this need to be tested? 

 

Kind regards,

 

Francis

 

 

 

From: eb2mmrt at gmail.com [mailto:eb2mmrt at gmail.com] On Behalf Of MURATA Makoto
Sent: 09 November 2017 00:55
To: SC 34 WG4 <e-SC34-WG4 at ecma-international.org>
Subject: Re: DR-16-0022: Shared ML: Escaping strings in ST_Xstring

 

> §22.9.2.19, “ST_Xstring (Escaped String)” says:
>
> For all characters that cannot be represented in XML as defined by the
> XML 1.0 specification, the characters are escaped using the Unicode
> numerical character representation escape character format _xHHHH_, where H
> represents a hexadecimal character in the character's value.
> [Example: The Unicode character 8 is not permitted
> in an XML 1.0 document, so it must be escaped as _x0008_. end example]
>
> But it's not clear from this if all such combinations should be escaped?
> or just those in the range [001-031]. Excel itself handles such sequences by
> escaping the first underscore but unfortunately other consumers such as
> OpenOffice do not remove the escaping so I think this needs clarifying.

 

 

W3C XML clearly defines which characters are legal.  We should mention the well-formedness constraint “Legal Character”.

 

https://www.w3.org/TR/2006/REC-xml-20060816/#wf-Legalchar

 

Or, does this DR ask how we can represent a literal such as  "_x2345"?
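
For reference, the Legal Character constraint mentioned above can be written as a short Python sketch. The ranges are those of the Char production in XML 1.0. The function names are hypothetical, and the escaping function covers only the first reading of the DR (characters that XML 1.0 does not allow), not the question of literal underscores.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def escape_illegal_chars(text: str) -> str:
    """Write every character that XML 1.0 forbids as _xHHHH_.

    Every forbidden code point is at most U+FFFF, so four hex digits
    are always enough.
    """
    return "".join(
        ch if is_xml10_char(ord(ch)) else f"_x{ord(ch):04X}_"
        for ch in text
    )


print(escape_illegal_chars("a\x08b"))  # the spec's own example character
```

Note that the forbidden set is wider than the control characters below U+0020: it also includes the surrogate range and U+FFFE and U+FFFF.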

 

Regards,

Makoto

 

2016-12-07 5:20 GMT+09:00 Rex Jaeschke <rex at rexjaeschke.com>:

Here's a new DR from Charlie.

Rex





 

-- 


Praying for the victims of the Japan Tohoku earthquake

Makoto





 


-------------- next part --------------
A non-text attachment was scrubbed...
Name: escaped characters LO.XLSX
Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Size: 4756 bytes
Desc: not available
URL: <http://mailman.vse.cz/pipermail/sc34wg4/attachments/20171110/c7a4c28d/attachment-0001.xlsx>

