UTF-16 not parsing

UTF-16 not parsing

Michel_Dumontier

Hello,

 I’m unable to load the attached file, as I cannot specify a UTF-16 character set (French characters).

 

-=Michel=-



--
Michel Dumontier
Associate Professor of Bioinformatics
Carleton University
http://dumontierlab.com


_______________________________________________
p4-feedback mailing list
[hidden email]
https://mailman.stanford.edu/mailman/listinfo/p4-feedback

utf-16-test.owl (650 bytes) Download Attachment

Re: UTF-16 not parsing

Timothy Redmond

To start, I should say that I am not an expert on Unicode, UTF-8, or
UTF-16, and I hope that any experts out there will help me out.  But I
have been studying the specifications and I feel that I am starting to
understand.

The short story is that I think you really want to use UTF-8, which
Protege handles fine.  I don't know what encoding the original file
used, but it does not appear to be either UTF-16 or UTF-8.

>  UTF-16 character set (French characters)

Ok - this is the first thing that I had to figure out.  Plain ASCII
cannot represent accented French characters, but Unicode can.  UTF-8
and UTF-16 are just character encoding forms that map Unicode code
points to sequences of code units (8-bit units for UTF-8, 16-bit units
for UTF-16).  One of the big advantages of UTF-8 is that, considered as
a stream of bytes, it is compatible with ASCII: any ASCII file is
already valid UTF-8.  This seems like a big deal for OWL files.  In
fact, the Unicode FAQ [1] describes four different methods for
representing Unicode in an 8-bit format:

  1. UTF-8
  2. Use Java or C style escapes, of the form \uXXXXX or \xXXXXX.
  3. Use the &#xXXXX; or &#DDDDD; numeric character escapes as in HTML  
or XML.
  4. Use SCSU.

I have seen the first three of these many times.

UTF-16, on the other hand, requires at least two bytes per character
[2] and is not compatible with ASCII.
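
To make the comparison concrete, here is a minimal Python sketch of how
the same text comes out under UTF-8, UTF-16, and XML numeric escapes.
The sample string "café" is only an illustration and is not taken from
the attached file.

```python
# Illustrative sample text; U+00E9 is LATIN SMALL LETTER E WITH ACUTE.
text = "caf\u00e9"

utf8 = text.encode("utf-8")
utf16be = text.encode("utf-16-be")

print(utf8)     # b'caf\xc3\xa9'
print(utf16be)  # b'\x00c\x00a\x00f\x00\xe9'

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8 ...
assert "owl".encode("utf-8") == "owl".encode("ascii")
# ... but not in UTF-16, where every character takes at least two bytes.
assert "owl".encode("utf-16-be") == b"\x00o\x00w\x00l"

# Method 3 from the list: numeric character references keep the
# output pure ASCII.
xml_escaped = text.encode("ascii", errors="xmlcharrefreplace")
print(xml_escaped)  # b'caf&#233;'
```

Only the accented character grows to two bytes in UTF-8; in UTF-16 even
the plain ASCII letters carry a zero byte.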

> I’m unable to load the attached file as i cannot specify a UTF-16  
> character set

I don't believe that the file was valid UTF-16.  No character in the
file is represented by two bytes.  I have attached the octal dump of
the file to demonstrate this.  The one character that was clearly
intended to be French was the byte with octal value 0351 (hex E9).  It
is not UTF-16 because it is only one byte.  It is not UTF-8 because in
UTF-8 this byte can only be the start of a three-byte sequence.  My
first hint of this was that my text editor read the file as UTF-16 and
displayed it as international-looking garbage.
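
The reasoning about the 0351 byte can be checked with a short Python
sketch.  The bytes in `sample` are made up for illustration (the real
file is only available as an attachment), and the ISO-8859-1 decode
reflects the diagnosis given later in this thread.

```python
# Hypothetical bytes standing in for a fragment of the file.
sample = b"caf\xe9\r\n"

# Octal 0351 and hex E9 are the same byte value.
assert 0o351 == 0xE9

# 0xE9 is 0b11101001. The 1110xxxx pattern marks the lead byte of a
# three-byte UTF-8 sequence, so two continuation bytes must follow.
assert 0xE9 >> 4 == 0b1110

try:
    sample.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("valid UTF-8:", utf8_ok)

# Read as UTF-16, each pair of bytes becomes one unrelated character,
# which is why an editor might show "international-looking garbage".
as_utf16 = sample.decode("utf-16-be")
print(len(sample), "bytes ->", len(as_utf16), "UTF-16 characters")

# Read as ISO-8859-1, every byte maps to exactly one character.
latin1 = sample.decode("iso-8859-1")
print(repr(latin1))
```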

To test my understanding of Unicode, I created a small test file in
which I hand-crafted the UTF-8 (inserting octal characters) based on
the specification.  To get the e with the diacritic mark, I needed to
calculate the two-byte encoding of the Unicode character U+0308.  This
displayed as I expected in both Protege 3 and Protege 4.
(Interestingly, the text editor that I mentioned before could not
figure it out ;)).
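
The hand calculation for U+0308 can be reproduced in a few lines of
Python.  This is only a sketch of the two-byte UTF-8 form; the helper
name `encode_two_byte` is made up for illustration, and the result is
checked against Python's own encoder.

```python
import unicodedata


def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte form covers U+0080..U+07FF only")
    lead = 0b11000000 | (code_point >> 6)          # top 5 bits
    cont = 0b10000000 | (code_point & 0b00111111)  # low 6 bits
    return bytes([lead, cont])


combining = encode_two_byte(0x0308)
assert combining == "\u0308".encode("utf-8")

print(unicodedata.name("\u0308"))    # COMBINING DIAERESIS
print([oct(b) for b in combining])   # ['0o314', '0o210']

# A plain ASCII "e" followed by the combining mark.
crafted = b"e" + combining
decoded = crafted.decode("utf-8")

# NFC normalization folds the pair into the single precomposed letter.
composed = unicodedata.normalize("NFC", decoded)
print(hex(ord(composed)))            # 0xeb
```

The octal values printed are the ones that would be typed into the file
by hand.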

-Timothy

[1] http://unicode.org/faq/utf_bom.html#gen9
[2] http://unicode.org/faq/utf_bom.html#gen6




On Aug 28, 2009, at 9:39 PM, Michel_Dumontier wrote:

> Hello,
>  I’m unable to load the attached file as i cannot specify a UTF-16  
> character set (French characters)
>
> -=Michel=-
>
>
> --
> Michel Dumontier
> Associate Professor of Bioinformatics
> Carleton University
> http://dumontierlab.com
> <utf-16-test.owl>

utf-16-test-original.owl.txt (4K) Download Attachment
utf-8-timothy.owl (569 bytes) Download Attachment

Re: UTF-16 not parsing

Timothy Redmond
In reply to this post by Michel_Dumontier

I just looked up one more source [1].  Using this information and what  
I already know, I could hand craft (the hard way) utf-8 for any of the  
characters on that page.

-Timothy

[1] http://www.geocities.com/click2speak/unicode/chars_fr.html


Re: UTF-16 not parsing

Rinke Hoekstra-4
Hi Michel, Timothy,

The file is not UTF-16, it is ISO-8859-1 with CRLF line endings.  
Attached is a "proper" UTF-16 version (Big Endian) which loads fine in  
Protege 4 (I hope my mail client has not converted it)

I guess P4 got bogged down because it thought the file was UTF-16
(since that is what the XML header declares) while it wasn't.
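
For anyone who wants to produce such a file themselves, a conversion
along these lines should work.  This is a sketch, not the procedure
actually used for the attachment: the XML declaration and the `label`
element below are assumptions standing in for the real file contents.

```python
import codecs

# Hypothetical ISO-8859-1 input with CRLF line endings whose XML
# declaration claims UTF-16.
latin1_bytes = (
    b'<?xml version="1.0" encoding="UTF-16"?>\r\n'
    b"<label>caf\xe9</label>\r\n"
)

# Decode with the encoding the bytes really use ...
text = latin1_bytes.decode("iso-8859-1")

# ... then re-encode as big-endian UTF-16 with a leading byte order
# mark, so that the bytes finally match the declaration.
utf16_bytes = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The generic "utf-16" codec reads the BOM, picks big-endian, and
# returns the original text unchanged.
assert utf16_bytes.decode("utf-16") == text
print(len(latin1_bytes), "bytes ->", len(utf16_bytes), "bytes")
```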

-Rinke






On 4 sep 2009, at 09:20, Timothy Redmond wrote:

>
> I just looked up one more source [1].  Using this information and  
> what I already know, I could hand craft (the hard way) utf-8 for any  
> of the characters on that page.
>
> -Timothy
>
> [1] http://www.geocities.com/click2speak/unicode/chars_fr.html

utf-16-test-utf16-BE.owl (1K) Download Attachment

Re: UTF-16 not parsing

Michel_Dumontier
In reply to this post by Timothy Redmond
Hi,
  Yes, you're right that it should be UTF-8. I used an editor
(Notepad++) to force the file to UTF-8 (it was ANSI), which turned the
symbols into coded garbage, and then I inserted the proper characters
using the Windows character map. The result loads fine in P4. Since the
original file is not mine, I will have to find out how the characters
were being added.
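
One plausible explanation for the garbage, sketched in Python: if the
bytes are merely reinterpreted as UTF-8 instead of being converted, the
accented byte is no longer valid.  This assumes "ANSI" means Windows
code page 1252, and the sample bytes are made up for illustration.

```python
# Hypothetical ANSI (cp1252) bytes containing one accented letter.
ansi_bytes = b"caf\xe9"

# Reinterpreting the bytes as UTF-8 without converting them: the lone
# 0xE9 is an incomplete sequence and becomes U+FFFD (replacement char).
reinterpreted = ansi_bytes.decode("utf-8", errors="replace")
print(repr(reinterpreted))

# Converting: decode with the encoding the bytes really use, then
# encode as UTF-8. The accented letter becomes the two bytes C3 A9.
converted = ansi_bytes.decode("cp1252").encode("utf-8")
print(converted)  # b'caf\xc3\xa9'

# The converted bytes round-trip cleanly as UTF-8.
assert converted.decode("utf-8") == "caf\u00e9"
```

Converting this way would avoid re-entering the characters by hand,
although the XML header would still need to declare UTF-8.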

Thank you both (Timothy & Rinke) for looking into this,

-=Michel=-



> -----Original Message-----
> From: Timothy Redmond [mailto:[hidden email]]
> Sent: September-04-09 3:06 AM
> To: Submit feedback for Protege 4.0 beta
> Cc: [hidden email]; Michel_Dumontier
> Subject: Re: [p4-feedback] UTF-16 not parsing

Re: UTF-16 not parsing

Timothy Redmond
In reply to this post by Rinke Hoekstra-4

Hi Rinke,

Thanks for filling in the gaps!   The attachment was very  
educational.  I had been curious about what UTF-16 looked like.  It is  
interesting that Protege (Java) and my text editor are able to figure  
the format out.

That was very cool!

-Timothy

On Sep 4, 2009, at 12:45 AM, Rinke Hoekstra wrote:

> Hi Michel, Timothy,
>
> The file is not UTF-16, it is ISO-8859-1 with CRLF line endings.  
> Attached is a "proper" UTF-16 version (Big Endian) which loads fine  
> in Protege 4 (I hope my mail client has not converted it)
>
> I guess P4 got bogged because it thought the file was UTF-16 (since  
> that's in the XML header) while it wasn't.
>
> -Rinke