#picolisp on 2023-01-11 — irc logs at libera.irclog.whitequark.org

2021-05-27 09:06 beneroth changed the topic of #picolisp to: PicoLisp language | The scalpel of software development | Channel Log: https://libera.irclog.whitequark.org/picolisp | Check www.picolisp.com for more information

00:54 seninha has quit [Quit: Leaving]

03:04 m_mans has joined #picolisp

03:31 m_mans has quit [Remote host closed the connection]

03:31 m_mans has joined #picolisp

03:34 m_mans has quit [Read error: Connection reset by peer]

03:35 m_mans has joined #picolisp

04:17 seninha has joined #picolisp

04:45 m_mans has quit [Remote host closed the connection]

05:42 seninha has quit [Quit: Leaving]

07:33 rob_w has joined #picolisp

08:23 <abu[m]> I'm using iconv to export files from PicoLisp to Windows. Now I wonder what is the correct charset for Windows?

08:23 <abu[m]> I used (in (list "/usr/bin/iconv" "-f" "UTF-8" "-t" "ISO-8859-15" (tmp File))

08:24 <abu[m]> But this fails on some accented char

08:24 <abu[m]> So I tried (in (list "/usr/bin/iconv" "-f" "utf8" "-t" "latin1" (tmp File))

08:24 <abu[m]> But I wonder what is the "correct" destination charset

08:33 <beneroth> might depend on the particular program to import it

08:34 <abu[m]> I'm exporting CSV from reports to Excel

08:34 <beneroth> I think Windows itself usually uses ISO-8859-1 or UTF-16. But Windows applications vary a lot, it's all over the place. Many try to detect the encoding in use, sometimes incorrectly.

08:34 <abu[m]> I suspected it is a mess ;)

08:34 <beneroth> I recommend UTF-16 and TAB separation

08:35 <beneroth> then you can have newlines and special chars in values without Excel getting the columns messed up... most of the time. I have one reported issue where it still messes it up when using another way to import it than just opening it, but I haven't fully investigated that one yet

08:36 <abu[m]> The main purpose was to get away from TABs, because users often don't succeed in getting them into Excel properly

08:36 <beneroth> I had such issues, tested around, settled on UTF-16 + TAB, that works.

08:37 <abu[m]> latin1 seems to work now, but I don't know what other chars might give problems

08:37 <beneroth> so my programs now usually offer "Excel CSV" and "RFC CSV" for download. The first for people who just open it in Excel (without import, just having .csv associated with Excel), and the second for the people who work with proper applications or their own programs

08:37 <abu[m]> I use this now: http://pb1n.de/?023bb1

08:38 <beneroth> umlaute might be okay in latin1, but e.g. Asian chars might break everything
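
A rough way to compare the candidate charsets from this discussion is to test which characters each one can encode. The sketch below is Python rather than PicoLisp; the sample characters and the `encodable` helper are illustrative assumptions, and `cp1252` (the legacy Western European Windows code page) is added for comparison although it was not mentioned in the channel.

```python
# Sample characters are illustrative, not taken from the real export data.
SAMPLES = ["\u00e9", "\u20ac", "\u0153", "\u6f22"]  # e-acute, euro sign, oe ligature, a Kanji
CHARSETS = ["latin-1", "iso-8859-15", "cp1252", "utf-16"]

def encodable(char, charset):
    """Return True if `char` can be represented in `charset`."""
    try:
        char.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# Map each charset to the sample characters it can represent.
coverage = {cs: [c for c in SAMPLES if encodable(c, cs)] for cs in CHARSETS}

for charset, chars in coverage.items():
    print(charset, chars)
```

Only the UTF-16 row covers all four samples, which matches the concern that a single-byte charset works until the data contains a character outside its table.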

08:38 <abu[m]> yes, probably

08:38 <abu[m]> In this case it is BTG and they have no Kanji in their data

08:40 <beneroth> I use (out '("iconv" "-f" "UTF-8" "-t" "UTF-16") ...) then print with "^I" (Tab) as separator and (prinl "^M") for newlines

08:40 <beneroth> windows newlines

08:40 <beneroth> I have users from China/Korea/Japan

08:40 <abu[m]> I see, so I should try UTF-16 perhaps

08:41 <abu[m]> Tabs I want to avoid, seems they are not automatically handled by Excel

08:53 <abu[m]> My question is which charset does Excel understand out of the box, without the user having to fiddle with import settings?

09:15 seninha has joined #picolisp

09:19 <beneroth> the files I export this way can be opened directly in excel without import settings

09:19 <beneroth> users all believe it's an Excel file, not CSV

09:19 <beneroth> TABS do work automatically in Excel... but it depends on the charset

09:21 <beneroth> my problem was, in addition to Asian characters, I also had columns which can have newlines in the value, e.g. text comment/description. this is the format I found where that works, without the newlines in values scrambling everything in excel

09:21 <beneroth> and of course I also have columns with Commas in it

09:21 <abu[m]> I see, good news. So I can avoid fiddling with the double quotes.

09:21 <beneroth> yep, no quotes then

09:22 <beneroth> RFC standard CSV is comma + quotes. works well in LibreOffice and other sane applications. but not with Microsoft.

09:22 <abu[m]> Very good

09:22 <beneroth> ah yes, because Excel separator (comma) is also locale-dependent....

09:22 <beneroth> https://superuser.com/questions/238944/how-to-force-excel-to-open-csv-files-with-data-arranged-in-columns/1222081#1222081

09:22 <abu[m]> But TABs are OK everywhere?

09:23 <beneroth> when its UTF-16

09:23 <abu[m]> yes

09:24 <beneroth> for non UTF-16 you can use "SEP=char" on first line of the CSV, then excel will use this as separator. but that doesn't work with UTF-16 encoding of the file, then excel ignores it.

09:24 <abu[m]> I will experiment with the folks at BTG, as I have neither Win nor Excel

09:24 <beneroth> it's incredibly stupid and full of edge cases and weird behaviors.

09:24 <beneroth> I've spent days to come to the current formula

09:25 <abu[m]> I believe so

09:25 <beneroth> in UTF-8 excel also expects a BOM, even though UTF-8 explicitly doesn't need a BOM and discourages one
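
For reference, the UTF-8 BOM mentioned here is the three-byte prefix EF BB BF. A minimal Python sketch (not PicoLisp; the header line is a made-up example) shows that the `utf-8-sig` codec produces exactly the plain UTF-8 bytes with that prefix added.

```python
import codecs

text = "Gr\u00f6\u00dfe;Preis\r\n"       # hypothetical CSV header line
plain = text.encode("utf-8")             # UTF-8 without a BOM
with_bom = text.encode("utf-8-sig")      # same bytes, prefixed with the BOM

print(with_bom[:3] == codecs.BOM_UTF8)   # the prefix is the UTF-8 BOM
print(with_bom[3:] == plain)             # the rest is unchanged
```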

09:26 <abu[m]> In summary, you recommend TABs and UTF-16?

09:26 <beneroth> https://stackoverflow.com/questions/20395699/sep-statement-breaks-utf8-bom-in-csv-file-which-is-generated-by-xsl

09:26 <beneroth> yep

09:26 <beneroth> and windows newlines, I guess that's also important

09:26 <beneroth> ^M^J

09:26 <abu[m]> Yes, that's the easiest part ;) Thanks beneroth for all the research!

09:27 <abu[m]> it is ^J^M

09:27 <abu[m]> err, no ;)

09:27 <abu[m]> return + newline of course

09:28 <abu[m]> (prinl Line "\r")

09:28 <beneroth> TABS, UTF-16, windows newlines. And when the value contains , or " or newline ^J, then surround the value with quotes. escape " in the value with double quotes "" (so "\"\"" in pil)
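
The recipe summarized here (TAB separator, UTF-16, Windows newlines, quoting with doubled quotes) can be sketched in Python with the standard `csv` module. This is only an approximation under stated assumptions, not the PicoLisp code used in either application: `excel_tsv_bytes` is a hypothetical helper, and `csv.QUOTE_MINIMAL` with a TAB delimiter quotes values containing a TAB, a double quote, or a line break, but not values that merely contain a comma.

```python
import csv
import io

def excel_tsv_bytes(rows):
    """Encode rows as TAB-separated text with CRLF line ends, as UTF-16 with a BOM."""
    buf = io.StringIO(newline="")        # keep "\r\n" exactly as written
    writer = csv.writer(
        buf,
        delimiter="\t",                  # TAB between columns
        lineterminator="\r\n",           # Windows newlines
        quoting=csv.QUOTE_MINIMAL,       # quote only when needed, doubling embedded quotes
    )
    writer.writerows(rows)
    # The "utf-16" codec prepends a byte order mark; byte order follows the platform.
    return buf.getvalue().encode("utf-16")

# Hypothetical sample row: a comma, embedded quotes, and an embedded newline.
data = excel_tsv_bytes([["a,b", 'say "hi"', "line1\nline2"]])
print(data.decode("utf-16"))
```

Writing `data` to a file in binary mode gives a file in the format described above; whether Excel opens it cleanly still has to be tested on the target machines.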

09:28 <beneroth> I do (prinl Line "^M")

09:28 <beneroth> ^M and \r is the same I think

09:29 <abu[m]> yes, in pil21 \r is handled by the reader

09:30 <abu[m]> But, hm, even if the separator is TAB, you have to escape " or , in the value?

09:30 <beneroth> yeah that application is still pil64. and I seem to remember that pil handles \n \r etc. at some places, but not everywhere. and it handles ^notation everywhere, so I always use this.

09:33 <beneroth> escaping " - not sure actually, but that is what is done in the implementation for which I know it works reliably. maybe the reasoning for putting the whole value into quotes was newlines or commas or special chars, and not quotes themselves. but when the whole value is quoted, then contained quotes need to be escaped by using double-double-quotes

09:33 <beneroth> here the comment in my code is missing for why this was done that way :D

09:34 <abu[m]> ok :)

09:34 <beneroth> the other stuff even has links to these stackoverflow comments I sent you above

09:34 <beneroth> was horrible to work out >.<

09:34 <abu[m]> 😈

09:36 <abu[m]> I just see that \n and \t were also in pil64

09:36 <abu[m]> New is e.g. \e

09:41 <abu[m]> and \b

11:22 <beneroth> I just stick with ^notation as it works reliably everywhere without me having to check or think about it :)

11:22 <abu[m]> Yes, perfect

11:23 <abu[m]> Anyway both ^ and \ are just meta characters in the reader

11:25 <abu[m]> Over the years I tended to prefer \n etc., as it is a little more readable (?)

11:25 <abu[m]> no big thing

11:26 <beneroth> \n is certainly much more widespread, in other programming languages incl. newer ones. so people are more likely to know it, and might be surprised when some don't work in pil.

11:26 <beneroth> in essence it's just a habit thing

11:26 <abu[m]> T

11:28 <beneroth> the \nnnn unicode notation is now also widespread, even in encodings like UTF-8 where it's totally unnecessary. But much software requires certain characters to be escaped this way even when using UTF-8 to prevent injection bugs. some standard formats explicitly specify \nnnn unicode notation and some \-notation for certain special characters (e.g. JSON).

11:28 <beneroth> a bit of a mess. but in most practical situations not a biggie to handle.

11:29 <beneroth> just not simple and elegant.

11:29 <abu[m]> In PicoLisp we have something similar, but with a slightly special syntax

11:29 <abu[m]> It is \nnn\ where "nnn" is any decimal number

11:30 <abu[m]> just for the record here

11:32 <abu[m]> : (= "\t" "^I" "\9\")

11:32 <abu[m]> -> T
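
The three spellings compared in that expression all denote character code 9. As a cross-check outside PicoLisp, here is a small Python sketch; `caret` is a hypothetical helper that applies the usual caret-notation rule (ASCII code of the letter XOR 64).

```python
def caret(letter):
    """Map caret notation such as ^I to its control character."""
    return chr(ord(letter) ^ 0x40)

print(caret("I") == "\t" == chr(9))    # ^I, \t and decimal 9 are the same character
print(caret("M") == "\r" == chr(13))   # ^M is carriage return
print(caret("J") == "\n" == chr(10))   # ^J is line feed
```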

11:37 <beneroth> ooooh, I didn't know about this \nnn\ notation!

11:38 <beneroth> so decimals, okay.

11:38 <beneroth> The \unnnn notation is hex for UTF-16 values

11:38 <abu[m]> It was also in old Pils

11:38 <beneroth> I never knew or I guess I forgot

11:38 <abu[m]> Yes, but I find \u ugly

11:39 <beneroth> it is, but it's a de-facto standard

11:39 <abu[m]> And a fixed number of digits is impractical

11:39 <abu[m]> yes

11:39 <abu[m]> So now we have a new one ;)

11:40 <abu[m]> \u is a typical after-thought extension

11:40 <abu[m]> because \digit was already taken (for octals)

11:40 <abu[m]> Nobody uses octals any more

11:46 <beneroth> yep

11:47 <beneroth> but limiting it to 4 hex digits was stupid

11:47 <abu[m]> right

11:47 <beneroth> I see the whole UTF-16 thing as stupid. I think UTF-8 was available at same time, no? or only shortly after?

11:47 <abu[m]> That's why I introduced a terminating \

11:47 <beneroth> Windows and Sun/oracle and Java use natively UTF-16 I think.

11:48 <abu[m]> I don't remember when 16 was introduced, or never cared

11:48 <beneroth> and C/C++

11:48 <abu[m]> No, Java used UTF-8 with up to 3 bytes only

11:48 <beneroth> but it must have been kinda obvious that again limiting the length will just bring the same limitations that ASCII/ANSI had beforehand

11:49 <abu[m]> UTF-16 is double bytes as I understand it

11:49 <abu[m]> UTF-8 with max 3 bytes can hold 16 bits

11:50 <beneroth> you sure? https://stackoverflow.com/questions/50687683/javas-native-character-set-for-strings

11:50 <beneroth> I think I read that in most cases it uses UTF-16 internally

11:50 <beneroth> source code can be in multiple encodings

11:50 <beneroth> dunno

11:50 <beneroth> just dumb xD

11:51 <abu[m]> Seems like a confusion to me, about the 16-bit Character type in Java

11:51 <beneroth> possible

11:51 <abu[m]> It was definitely utf8 in the 1990s

11:51 <abu[m]> but only max 3 bytes

11:52 <beneroth> maybe also specific to the OS on which the JVM runs?

11:52 <abu[m]> But I never looked at the utf16 format

11:52 <beneroth> yeah that would be 16bit but not using UTF-16 encoding.

11:52 <abu[m]> I think Java was always very platform independent

11:53 <abu[m]> And they documented the bytes in the utf8 format

11:53 <beneroth> they're not exactly compatible encodings, UTF-16 and UTF-8, they use different representations afaik. UTF-32 might be close to UTF-16

11:53 <abu[m]> I used it for the first pil versions

11:53 <abu[m]> I only mean the mapping of an n-bit-number to m utf8 bytes

11:54 <abu[m]> The first byte gives the length

11:54 <abu[m]> Special bit patterns in the most significant bits of each byte etc
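
The bit patterns described here can be made concrete with a short sketch. It is written in Python, not taken from the PicoLisp sources, and both helper names are hypothetical: the first byte's leading bits give the sequence length, and every following byte starts with the bits 10.

```python
def utf8_sequence_length(lead_byte):
    """Return the total length of a UTF-8 sequence from its first byte."""
    if lead_byte < 0x80:             # 0xxxxxxx: single byte (ASCII)
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx: two bytes
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx: three bytes
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx: four bytes
        return 4
    raise ValueError("not a valid UTF-8 lead byte")

def is_continuation(byte):
    """Continuation bytes always have the form 10xxxxxx."""
    return byte >> 6 == 0b10

# Sample characters: ASCII letter, umlaut, Kanji, emoji.
for ch in ["A", "\u00fc", "\u6f22", "\U0001F603"]:
    encoded = ch.encode("utf-8")
    print(hex(ord(ch)), utf8_sequence_length(encoded[0]), len(encoded))
```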

11:54 <beneroth> yeah, I don't get why UTF-16 (or even UTF-32) ever got popularized when UTF-8 was available. if UTF-8 was only introduced later I would understand it (I thought once this was the reason, but I think that's false).

11:54 <abu[m]> Exactly

11:55 <beneroth> maybe a number of runtime/OS/DBMS devs were just lazy and went with the fixed length types

11:55 <abu[m]> no idea

11:55 <beneroth> but that would have been really dumb

11:56 <abu[m]> Maybe it has to do with efficiency for higher numbers?

11:56 <beneroth> C/C++ gained support for UTF-16 (wide chars) many years ago but I think UTF-8 is still not in the language standard?

11:56 <abu[m]> For 16 bit Kanjis you need 3 bytes in UTF-8

11:56 m_mans has joined #picolisp

11:56 <abu[m]> Yes, poor utf8 support in some langs

11:57 <abu[m]> also PostScript

11:57 <abu[m]> Main reason why I switched to SVG

11:58 <beneroth> maybe, but still then a bad specific optimization instead of a general solution (UTF-8). when doing specific solutions, either it must be really because you have just a limited use-case (hardly applicable to DB/OS/general lang runtimes) or then you need a way to insert and support multiple specific solutions to keep it extendable and maintainable.

11:58 <abu[m]> Except for bigger sizes of Kanjis Utf8 is optimal I think

11:58 <beneroth> sounds more like dumb laziness or premature optimization

11:58 <abu[m]> true

11:59 <beneroth> that was the good effect of high unicode emoticons/smileys

11:59 <beneroth> it pushed UTF-8 support

11:59 <beneroth> not really a good cause but good effect

11:59 <abu[m]> yes, and uses chars beyond 16 bits

11:59 <abu[m]> most smileys are larger than 65536
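
The size claims in this part of the discussion are easy to check. A minimal Python sketch (not PicoLisp; the four sample characters and the `encoded_sizes` helper are illustrative assumptions) compares the encoded length of each character in UTF-8 and UTF-16.

```python
# Sample characters: ASCII letter, umlaut, Kanji in the 16-bit range, emoji beyond it.
SAMPLES = ["A", "\u00fc", "\u6f22", "\U0001F603"]

def encoded_sizes(ch):
    """Return (code point, UTF-8 byte count, UTF-16 byte count) for one character."""
    utf8 = len(ch.encode("utf-8"))
    utf16 = len(ch.encode("utf-16-le"))   # the "-le" variant adds no BOM
    return ord(ch), utf8, utf16

for ch in SAMPLES:
    code, utf8, utf16 = encoded_sizes(ch)
    print(f"U+{code:04X}: {utf8} bytes in UTF-8, {utf16} bytes in UTF-16")
```

The Kanji needs three bytes in UTF-8 but two in UTF-16, while the emoji lies above 65535 and needs four bytes in both (a surrogate pair in UTF-16).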

12:00 <beneroth> yeah, that's why it pushed UTF-8

12:00 <beneroth> otherwise systems would have stuck with UTF-16 or 3byte-UTF-8

12:00 seninha has quit [Quit: Leaving]

12:01 <abu[m]> PicoLisp also switched to 4 bytes rather lately

12:01 <abu[m]> I did not feel an urge except for 😃s

12:01 <beneroth> there were even security bugs like MySQL where the SQL-validator did handle high unicode chars but the SQL interpreter interpreted them as 3byte-UTF-8, making it possible to craft an SQL command with smileys which would pass the security filter but would be interpreted differently in the actual SQL engine

12:01 <beneroth> yeah see :)

12:02 <beneroth> I remember :)

12:02 <abu[m]> I think it was you who pushed me

12:02 <beneroth> likely. haha

12:02 <beneroth> if it's possible to do emoticons in html textarea then we should also be able to store it in pilDB

12:03 <beneroth> lets do URLs with emoticons in it

12:03 <beneroth> might be funny to test how much popular software gets broken by it

12:03 <abu[m]> URLs are a tough beast

12:04 <beneroth> URLs are ugly because different parts have different encodings and different rules

12:04 <abu[m]> What was this really stupid 7bit notation called?

12:04 <abu[m]> *really* complicated

12:04 <abu[m]> and totally unreadable

12:04 <beneroth> same with email addresses.. the part before the @ (so the username, not the domain) is actually case-sensitive (up to the receiving server how to implement it).

12:05 <beneroth> and theoretically there is even a standard to encode a comment into an email address which must be ignored

12:05 <beneroth> I guess neither is widely supported

12:05 <abu[m]> oh

12:05 <beneroth> 7bit? there are multiple ones I think. the most widespread is the MIME-encoding in email bodies

12:06 <beneroth> there is even UTF-7 https://en.wikipedia.org/wiki/UTF-7

12:06 <abu[m]> No, there is something really strange, but I forgot the name

12:06 <abu[m]> not utf7

12:06 <abu[m]> ASCII 7

12:06 <abu[m]> The "web standard" :)

12:07 <beneroth> no idea. sounds horrible

12:07 <abu[m]> It is not "url encoding", but for domain names

12:10 <abu[m]> grr, I don't remember or find it

12:11 <beneroth> maybe just international domain name puny code?

12:11 <beneroth> https://en.wikipedia.org/wiki/Internationalized_domain_name

12:11 <beneroth> like domains with umlaute, resulting in xn--weirdencoding

12:11 <abu[m]> Yess!!! I also just found it

12:12 <abu[m]> Punycode

12:12 <abu[m]> Who is able to think up such a mess

12:12 <abu[m]> (instead of simply using utf8)

12:13 <abu[m]> "München" becomes "Mnchen-3ya"
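
This conversion can be reproduced with Python's built-in `punycode` and `idna` codecs. The sketch is Python, not PicoLisp, and the domain `münchen.de` is used purely as an illustration of how the `xn--` prefix is attached to an encoded label.

```python
label = "M\u00fcnchen"                      # "München"
puny = label.encode("punycode")             # raw Punycode of a single label
domain = "m\u00fcnchen.de".encode("idna")   # full domain name in ASCII form

print(puny)     # ASCII letters first, then "-" and the encoded position of the umlaut
print(domain)   # each non-ASCII label gets the "xn--" prefix
print(domain.decode("idna"))
```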

12:14 <abu[m]> Such overhead for just a single Umlaut

12:16 <beneroth> apparently a certain A. Castello https://datatracker.ietf.org/doc/html/rfc3492

12:16 <beneroth> *Costello

12:16 <beneroth> and that was 2003

12:17 <beneroth> hm...probably UTF-8 was even around already when the web was specified?

12:17 <beneroth> domains are probably older

12:17 <abu[m]> yes, in Java it was already in the mid-90s

12:17 <abu[m]> utf8

12:18 <abu[m]> The problem is that they wanted to stick with 7 bits by any means

12:19 <beneroth> probably for backwards-compatibility to existing network equipment and software

12:19 <beneroth> e.g. email

12:19 <abu[m]> This is also why e-mails get soooo big with attachments. Base64 encoding
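
The size increase from Base64 follows from its design: every 3 input bytes become 4 ASCII characters. A minimal Python sketch with a made-up 300-byte payload shows the ratio; real mail bodies grow slightly more because of the line breaks that MIME inserts.

```python
import base64

raw = bytes(300)                  # hypothetical 300-byte attachment (all zero bytes)
encoded = base64.b64encode(raw)   # 3 input bytes -> 4 output characters

print(len(raw), len(encoded))     # sizes before and after encoding
print(len(encoded) / len(raw))    # growth factor
```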

12:19 <abu[m]> yes

12:19 <abu[m]> Some hardware used a parity bit

12:19 <abu[m]> 100 years ago

12:19 <beneroth> it's probably not in use anymore xD

12:20 <abu[m]> Yes, for sure

12:20 <abu[m]> stupid

12:36 <beneroth> bbl

13:18 m_mans has quit [Remote host closed the connection]

13:19 m_mans has joined #picolisp

13:24 m_mans has quit [Ping timeout: 260 seconds]

16:00 Thorsten[m] has quit [Quit: You have been kicked for being idle]

16:14 <abu[m]> Cool! Works at BTG now

16:14 <abu[m]> Seems only UTF-16 is necessary: http://pb1n.de/?154543

16:15 seninha has joined #picolisp

16:23 rob_w has quit [Quit: Leaving]

16:24 clacke has joined #picolisp

17:02 <beneroth> nice

17:02 <beneroth> yeah not everything is needed. I needed TABs because I had comma and newline in values

17:04 <abu[m]> TAB is ideal, as it normally does not appear in user-generated data

17:43 m_mans has joined #picolisp

17:58 seninha has quit [Quit: Leaving]

17:58 seninha has joined #picolisp

18:19 m_mans has quit [Remote host closed the connection]

18:20 m_mans has joined #picolisp

18:24 m_mans has quit [Ping timeout: 246 seconds]

18:32 m_mans has joined #picolisp

18:38 m_mans has quit [Ping timeout: 252 seconds]