m_mans has quit [Remote host closed the connection]
m_mans has joined #picolisp
m_mans has quit [Read error: Connection reset by peer]
m_mans has joined #picolisp
seninha has joined #picolisp
m_mans has quit [Remote host closed the connection]
seninha has quit [Quit: Leaving]
rob_w has joined #picolisp
<abu[m]>
I'm using iconv to export files from PicoLisp to Windows. Now I wonder what is the correct charset for Windows?
<abu[m]>
I used (in (list "/usr/bin/iconv" "-f" "UTF-8" "-t" "ISO-8859-15" (tmp File)) ...)
<abu[m]>
But this fails on some accented char
<abu[m]>
So I tried (in (list "/usr/bin/iconv" "-f" "utf8" "-t" "latin1" (tmp File)) ...)
<abu[m]>
But I wonder what is the "correct" destination charset
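For context, the full call presumably looks something like this sketch, with (echo) copying the converted bytes to the current output channel:
(in (list "/usr/bin/iconv" "-f" "UTF-8" "-t" "ISO-8859-15" (tmp File))
   (echo) )   # read iconv's output and copy it to the current output channel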
<beneroth>
might depend on the particular program that imports it
<abu[m]>
I'm exporting CSV from reports to Excel
<beneroth>
I think Windows itself usually uses ISO-8859-1 or UTF-16. But Windows applications vary a lot, it's all over the place. Many try to detect the encoding in use, sometimes falsely.
<abu[m]>
I suspected it is a mess ;)
<beneroth>
I recommend UTF-16 and TAB separation
<beneroth>
then you can have newlines and special chars in values without Excel getting the columns messed up... most of the time. I have one reported issue where it still messes it up when using another way to import it than just opening it, but I haven't fully investigated that one yet
<abu[m]>
The main purpose was to get away from TABs, because users often don't succeed in getting them into Excel properly
<beneroth>
I had such issues, tested around, settled on UTF-16 + TAB, that works.
<abu[m]>
latin1 seems to work now, but I don't know what other chars might give problems
<beneroth>
so my programs now usually offer "Excel CSV" and "RFC CSV" for download. The first for people who just open it in Excel (without import, just having .csv associated with Excel), and the second for the people who work with proper applications or their own programs
<beneroth>
Umlauts might be okay in latin1, but e.g. Asian chars might break everything
<abu[m]>
yes, probably
<abu[m]>
In this case it is BTG, and they have no Kanji in their data
<beneroth>
I use (out '("iconv" "-f" "UTF-8" "-t" "UTF-16") ...) then print with "^I" (Tab) as separator and (prinl "^M") for newlines
<beneroth>
windows newlines
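A minimal sketch of that export (Rows is a hypothetical list of rows, each a list of strings; iconv's converted output goes to the current stdout):
(let Rows '(("Name" "Comment") ("Müller" "line 1"))
   (out '("iconv" "-f" "UTF-8" "-t" "UTF-16")
      (for Row Rows
         (prin (glue "^I" Row))   # TAB-separated columns
         (prinl "^M") ) ) )   # "^M" plus prinl's "^J" gives a Windows CRLF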
<beneroth>
I have users from China/Korea/Japan
<abu[m]>
I see, so I should try UTF-16 perhaps
<abu[m]>
Tabs I want to avoid; it seems they are not automatically handled by Excel
<abu[m]>
My question is which charset does Excel understand out of the box, without the user having to fiddle with import settings?
seninha has joined #picolisp
<beneroth>
the files I export this way can be opened directly in excel without import settings
<beneroth>
users all believe it's an Excel file, not CSV
<beneroth>
TABS do work automatically in Excel... but it depends on the charset
<beneroth>
my problem was, in addition to Asian characters, I also had columns which can have newlines in the value, e.g. a text comment/description. This is the format I found where that works, without the newlines in values scrambling everything in Excel
<beneroth>
and of course I also have columns with Commas in it
<abu[m]>
I see, good news. So I can avoid fiddling with the double quotes.
<beneroth>
yep, no quotes then
<beneroth>
RFC standard CSV is comma + quotes. Works well in LibreOffice and other sane applications, but not with Microsoft.
<abu[m]>
Very good
<beneroth>
ah yes, because the Excel separator (comma) is also locale-dependent....
<beneroth>
for non-UTF-16 you can use "SEP=char" on the first line of the CSV; Excel will then use this as the separator. But that doesn't work with UTF-16 encoding of the file; then Excel ignores it.
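For illustration, a minimal sketch of that hint line (file name and data are hypothetical):
(out "export.csv"
   (prinl "SEP=;^M")   # first line: separator hint for Excel
   (prinl (glue ";" '("a" "b" "c")) "^M") )   # a plain data row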
<abu[m]>
I will experiment with the folks at BTG, as I have neither Win nor Excel
<beneroth>
it's incredibly stupid and full of edge cases and weird behaviors.
<beneroth>
I've spent days getting to the current formula
<abu[m]>
I believe so
<beneroth>
in UTF-8, Excel also expects a BOM, even though UTF-8 explicitly doesn't need one and the standard discourages a BOM
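A sketch of writing that BOM from pil (U+FEFF is decimal 65279; file name hypothetical):
(out "export.csv"
   (prin (char 65279))   # UTF-8 BOM, the bytes EF BB BF
   (prinl (glue "," '("a" "b" "c")) "^M") )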
<abu[m]>
In summary, you recommend TABs and UTF-16?
<beneroth>
and windows newlines, I guess that's also important
<beneroth>
^M^J
<abu[m]>
Yes, that's the easiest part ;) Thanks beneroth for all the research!
<abu[m]>
it is ^J^M
<abu[m]>
err, no ;)
<abu[m]>
return + newline of course
<abu[m]>
(prinl Line "\r")
<beneroth>
TABS, UTF-16, Windows newlines. And when the value contains , or " or a newline ^J, surround the value with quotes. Escape " in the value with double quotes "" (so "\"\"" in pil)
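A sketch of that quoting rule as a pil function (csvEscape is an invented name, not from the actual implementation):
(de csvEscape (Val)
   (let L (chop Val)
      (if (find '((C) (member C '("," "\"" "^J"))) L)
         (pack "\"" (replace L "\"" "\"\"") "\"")   # wrap in quotes, doubling inner quotes
         Val ) ) )
So a value like a"b comes out as "a""b".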
<beneroth>
I do (prinl Line "^M")
<beneroth>
^M and \r are the same I think
<abu[m]>
yes, in pil21 \r is handled by the reader
<abu[m]>
But, hm, even if the separator is TAB, you have to escape " or , in the value?
<beneroth>
yeah, that application is still pil64. And I seem to remember that pil handles \n \r etc. in some places, but not everywhere. It handles ^notation everywhere, so I always use that.
<beneroth>
escaping " - not sure actually, but that is what is done in the implementation which I know works reliably. Maybe the reasoning for putting the whole value into quotes was newlines or commas or special chars, and not the quotes themselves. But when the whole value is quoted, then contained quotes need to be escaped as double-double-quotes
<beneroth>
here the comment in my code explaining why this was done that way is missing :D
<abu[m]>
ok :)
<beneroth>
the other stuff even has links to the stackoverflow comments I sent you above
<beneroth>
was horrible to work out >.<
<abu[m]>
😈
<abu[m]>
I just see that \n and \t were also in pil64
<abu[m]>
New is e.g. \e
<abu[m]>
and \b
<beneroth>
I just stick with ^notation as it works reliably everywhere without me having to check or think about it :)
<abu[m]>
Yes, perfect
<abu[m]>
Anyway both ^ and \ are just meta characters in the reader
<abu[m]>
Over the years I tended to prefer \n etc., as it is a little more readable (?)
<abu[m]>
no big thing
<beneroth>
\n is certainly much more widespread, in other programming languages incl. newer ones. So people are more likely to know it, and might be surprised when some don't work in pil.
<beneroth>
in essence it's just a habit thing
<abu[m]>
T
<beneroth>
the \unnnn unicode notation is now also widespread, even in encodings like UTF-8 where it's totally unnecessary. But a lot of software requires certain characters to be escaped this way even when using UTF-8, to prevent injection bugs. Some standard formats explicitly specify \unnnn unicode notation and some \-notation for certain special characters (e.g. JSON).
<beneroth>
a bit of a mess. but in most practical situations not a biggie to handle.
<beneroth>
just not simple and elegant.
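A concrete example of that notation's limits: a char beyond 16 bits needs a surrogate pair, so JSON writes U+1F600 (😀) as \ud83d\ude00, while UTF-8 simply uses its four raw bytes.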
<abu[m]>
In PicoLisp we have something similar, but with a slightly special syntax
<abu[m]>
It is \nnn\ where "nnn" is any decimal number
<abu[m]>
just for the record here
<abu[m]>
: (= "\t" "^I" "\9\")
<abu[m]>
-> T
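Presumably this also reaches beyond 16 bits (an untested sketch; U+1F600 is decimal 128512):
: (= "😀" "\128512\")
-> T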
<beneroth>
ooooh, I didn't know about this \nnn\ notation!
<beneroth>
so decimals, okay.
<beneroth>
The \unnnn notation is hex for UTF-16 values
<abu[m]>
It was also in old Pils
<beneroth>
I never knew or I guess I forgot
<abu[m]>
Yes, but I find \u ugly
<beneroth>
it is, but it's a de-facto standard
<abu[m]>
And a fixed number of digits is impractical
<abu[m]>
yes
<abu[m]>
So now we have a new one ;)
<abu[m]>
\u is a typical after-thought extension
<abu[m]>
because \digit was already taken (for octals)
<abu[m]>
Nobody uses octals any more
<beneroth>
yep
<beneroth>
but limiting it to 4 hex digits was stupid
<abu[m]>
right
<beneroth>
I see the whole UTF-16 thing as stupid. I think UTF-8 was available at the same time, no? Or only shortly after?
<abu[m]>
That's why I introduced a terminating \
<beneroth>
Windows and Sun/Oracle Java natively use UTF-16, I think.
<abu[m]>
I don't remember when 16 was introduced, or never cared
<beneroth>
and C/C++
<abu[m]>
No, Java used UTF-8 with up to 3 bytes only
<beneroth>
but it must have been kinda obvious that limiting the length again would just bring back the same limitations that ASCII/ANSI had beforehand
<abu[m]>
UTF-16 is double bytes, as I understand it
<beneroth>
I think I read that in most cases it internally uses UTF-16
<beneroth>
source code can be in multiple encodings
<beneroth>
dunno
<beneroth>
just dumb xD
<abu[m]>
Seems like a confusion to me, about the 16-bit Character type in Java
<beneroth>
possible
<abu[m]>
It was definitely utf8 in the 1990s
<abu[m]>
but only max 3 bytes
<beneroth>
maybe it's also specific to the OS on which the JVM runs?
<abu[m]>
But I never looked at the utf16 format
<beneroth>
yeah, that would be 16 bit but not using UTF-16 encoding.
<abu[m]>
I think Java was always very platform independent
<abu[m]>
And they documented the bytes in the utf8 format
<beneroth>
they're not exactly compatible encodings, UTF-16 and UTF-8, they use different representations afaik. UTF-32 might be close to UTF-16
<abu[m]>
I used it for the first pil versions
<abu[m]>
I only mean the mapping of an n-bit-number to m utf8 bytes
<abu[m]>
The first byte gives the length
<abu[m]>
Special bit patterns in the most significant bits of each byte etc
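That mapping as a hypothetical pil sketch (utf8Bytes is an invented name, not the actual reader code):
(de utf8Bytes (C)   # C is a Unicode code point (a number)
   (cond
      ((> 128 C) (list C))   # 1 byte: 0xxxxxxx
      ((> 2048 C)   # 2 bytes: 110xxxxx 10xxxxxx
         (list (| 192 (>> 6 C)) (| 128 (& C 63))) )
      ((> 65536 C)   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
         (list (| 224 (>> 12 C)) (| 128 (& (>> 6 C) 63)) (| 128 (& C 63))) )
      (T   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
         (list (| 240 (>> 18 C)) (| 128 (& (>> 12 C) 63)) (| 128 (& (>> 6 C) 63)) (| 128 (& C 63))) ) ) )
(utf8Bytes 233) gives (195 169), the two bytes of "é"; (utf8Bytes 128512) gives (240 159 152 128), the four bytes of 😀.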
<beneroth>
yeah, I don't get why UTF-16 (or even UTF-32) ever got popularized when UTF-8 was available. If UTF-8 was only introduced later I would understand it (I once thought this was the reason, but I think that's false).
<abu[m]>
Exactly
<beneroth>
maybe a number of runtime/OS/DBMS devs were just lazy and went with the fixed length types
<abu[m]>
no idea
<beneroth>
but that would have been really dumb
<abu[m]>
Maybe it has to do with efficiency for higher numbers?
<beneroth>
C/C++ gained support for UTF-16 (wide chars) many years ago, but I think UTF-8 is still not in the language standard?
<abu[m]>
For 16 bit Kanjis you need 3 bytes in UTF-8
m_mans has joined #picolisp
<abu[m]>
Yes, poor utf8 support in some langs
<abu[m]>
also PostScript
<abu[m]>
Main reason why I switched to SVG
<beneroth>
maybe, but still, then it's a bad specific optimization instead of a general solution (UTF-8). When doing specific solutions, either it must really be because you have just a limited use-case (hardly applicable to DB/OS/general language runtimes), or you need a way to insert and support multiple specific solutions to keep it extendable and maintainable.
<abu[m]>
Except for the bigger size of Kanjis, UTF-8 is optimal I think
<beneroth>
sounds more like dumb laziness or premature optimization
<abu[m]>
true
<beneroth>
that was the good effect of high unicode emoticons/smileys
<beneroth>
it pushed UTF-8 support
<beneroth>
not really a good cause but good effect
<abu[m]>
yes, and they use chars beyond 16 bits
<abu[m]>
most smileys are larger than 65536
<beneroth>
yeah, that's why it pushed UTF-8
<beneroth>
otherwise systems would have stuck with UTF-16 or 3-byte UTF-8
seninha has quit [Quit: Leaving]
<abu[m]>
PicoLisp also switched to 4 bytes rather late
<abu[m]>
I did not feel an urge except for 😃s
<beneroth>
there were even security bugs, like in MySQL, where the SQL validator handled high unicode chars but the SQL interpreter interpreted them as 3-byte UTF-8, making it possible to craft an SQL command with smileys which would pass the security filter but be interpreted differently in the actual SQL engine
<beneroth>
yeah see :)
<beneroth>
I remember :)
<abu[m]>
I think it was you who pushed me
<beneroth>
likely. haha
<beneroth>
if it's possible to enter emoticons in an HTML textarea, then we should also be able to store them in pilDB
<beneroth>
let's do URLs with emoticons in them
<beneroth>
might be funny to test how much popular software gets broken by it
<abu[m]>
URLs are a tough beast
<beneroth>
URLs are ugly because different parts have different encodings and different rules
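As an illustration, building on the utf8Bytes sketch above (urlEncode is an invented name, and it is simplified: real URL syntax also requires escaping various reserved ASCII chars):
(de urlEncode (Str)
   (pack
      (mapcar
         '((C)
            (if (> 128 (char C))
               C   # plain ASCII passes through (simplified)
               (mapcar   # else one %XX escape per UTF-8 byte
                  '((B) (pack "%" (pad 2 (hex B))))
                  (utf8Bytes (char C)) ) ) )
         (chop Str) ) ) )
(urlEncode "a😀") gives "a%F0%9F%98%80".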
<abu[m]>
What was this really stupid 7bit notation called?
<abu[m]>
*really* complicated
<abu[m]>
and totally unreadable
<beneroth>
same with email addresses... the part before the @ (so the username, not the domain) is actually case-sensitive (it's up to the receiving server how to implement it).
<beneroth>
and theoretically there is even a standard to encode a comment into an email address which must be ignored
<beneroth>
I guess both are not widely supported
<abu[m]>
oh
<beneroth>
7bit? There are multiple ones, I think. The most widespread is the MIME encoding in email bodies