
Re: Non-ASCII characters in Green

To: Aaron Bingham <abingham@xxxxxx>
Subject: Re: Non-ASCII characters in Green
From: Jeff Rush <jrush@xxxxxxxxxx>
Date: Fri, 09 May 2003 00:41:51 -0500
Cc: udanax@xxxxxxxxxx
In-reply-to: <20030507101048.GA5199@nomad>
References: <20030507101048.GA5199@nomad>

Hi Aaron. I've not looked at it specifically but wondered how it ought to be supported. I believe your problem is that while you are fixing the Python code to be utf-8, the server portion written in C isn't. Now having said that, you -ought- to at least get it to store and return an arbitrary byte sequence as the server doesn't trim high order bits. I presume that utf-8 strings don't have an embedded zero byte that would mess up C code.
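
The zero-byte presumption can be checked quickly. What follows is only a sketch in present-day Python 3, not the Python that Pyxi runs on, and the sample word is an arbitrary choice for illustration:

```python
# Sample text containing U+00DF (sharp s); chosen only for illustration.
text = "Stra\u00dfe"

encoded = text.encode("utf-8")

# UTF-8 produces a zero byte only for U+0000 itself, so ordinary text
# should not terminate a C string early.
has_zero_byte = 0 in encoded

# Bytes belonging to a multi-byte UTF-8 sequence all have the high bit set,
# so they only survive a path that is 8-bit clean.
high_bit_bytes = [b for b in encoded if b >= 0x80]

print(has_zero_byte, [hex(b) for b in high_bit_bytes])
```

So the zero-byte worry should not apply to normal text, but every non-ASCII character depends on the high bit being preserved end to end.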

I'd look at the portion of the Python code that transmits the (utf-8) string over the TCP socket and ensure that the translation is occurring correctly. I'm ignorant of what happens when a Unicode string is passed to a socket write call. You mention you changed String_write() but did you change String_read() to examine the returned string and treat it as Unicode as appropriate?
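
To make the write/read symmetry concrete, here is a minimal sketch in present-day Python 3. The helpers send_text() and recv_text() are hypothetical names invented for this example; they are not Pyxi's String_write() and String_read(), and the sketch ignores whatever framing the real protocol uses. It only shows encoding explicitly before the socket write and decoding explicitly after the read:

```python
import socket

def send_text(sock, text):
    """Hypothetical helper: encode explicitly instead of relying on a default encoding."""
    sock.sendall(text.encode("utf-8"))

def recv_text(sock):
    """Hypothetical helper: read until EOF, then decode the whole payload as UTF-8."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    # Decode only after joining, so a multi-byte sequence split across
    # two recv() calls is not decoded in halves.
    return b"".join(chunks).decode("utf-8")

# A connected local socket pair stands in for the front-end/back-end link.
writer, reader = socket.socketpair()
send_text(writer, "Stra\u00dfe")
writer.close()  # signals EOF to the reader
received = recv_text(reader)
reader.close()

print(received == "Stra\u00dfe")
```

If only the write side encodes and the read side treats the bytes as plain 8-bit characters, the text will not round-trip even when the server stores the bytes faithfully.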

Re how to support it in the bigger scheme, the original Xanadoers believed that it ought to be transparent to the backend, and that to indicate whether the byte sequence in a particular document is 8 or 16 bits or encoded in some manner, a link would be added by the front-end indicating that. Of course all front-ends must then query for and respect that link, but no such standardization has yet been done.

I wonder whether 16-bit chars ought to be done with a different resource type (1 = bytes, 2 = links, 3 = words) so that it isn't even possible to address the bytes out-of-phase as you could using a link-type. I wouldn't use a different resource type for each encoding though, just for each physical chunk size.
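
To illustrate what "out-of-phase" means, here is a small sketch in present-day Python 3. It does not model Green's actual addressing; word_span() is a hypothetical accessor made up for this example, and UTF-16 big-endian is just one possible 16-bit encoding:

```python
text = "Gr\u00fc\u00dfe"               # five characters, ten bytes as 16-bit words
data = text.encode("utf-16-be")

# A byte-addressed span that starts on an even offset lines up with the words.
in_phase = data[2:6].decode("utf-16-be")

# The same length of span starting one byte later straddles word boundaries
# and decodes to unrelated characters.
out_of_phase = data[1:5].decode("utf-16-be")

def word_span(buf, start, count):
    """Hypothetical word-addressed accessor: offsets count 16-bit words, not bytes."""
    return buf[2 * start : 2 * (start + count)]

# With word addressing there is no offset that can land mid-character.
by_words = word_span(data, 1, 2).decode("utf-16-be")

print(repr(in_phase), repr(out_of_phase), repr(by_words))
```

With a separate resource type per chunk size, an address like the second one simply could not be expressed.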


-Jeff


Aaron Bingham wrote:

Hello,

Has anyone looked at supporting this?  I looked at it briefly
yesterday and came across a couple of problems in Pyxi, but after
fixing these I was still getting incorrect data from the backend, so
there must be a deeper issue here.

In order to get Pyxi to handle non-ASCII chars, I had to change
x88.XuConn.write() to pass both regular and Unicode strings to
String_write().  I also had to change the default encoding to utf-8.

After doing this I was able to insert some German text, but when I
reloaded it, regular ASCII characters were substituted.  For example,
a 223 LATIN SMALL LETTER SHARP S became 67 LATIN CAPITAL LETTER C.
Any ideas?
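
One observation on those two numbers, offered as a guess rather than a diagnosis: they are consistent with the first UTF-8 byte of the sharp s losing its high bit somewhere along the path. Where that would happen is not established here. A sketch in present-day Python 3:

```python
sharp_s = "\u00df"                      # code point 223, LATIN SMALL LETTER SHARP S
utf8 = sharp_s.encode("utf-8")          # two bytes: 0xC3 0x9F

# Clear the high bit of each byte, as a 7-bit path would.
stripped = bytes(b & 0x7F for b in utf8)

# The first stripped byte is 0x43, which is 67, LATIN CAPITAL LETTER C.
print(ord(sharp_s), list(utf8), list(stripped), chr(stripped[0]))
```

The second stripped byte is a control character, which a display may well drop, leaving only the C visible.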

Regards,

