#pypy on 2022-05-22 — irc logs at libera.irclog.whitequark.org

2022-04-07 20:04 cfbolz changed the topic of #pypy to: #pypy PyPy, the flexible snake https://pypy.org | IRC logs: https://quodlibet.duckdns.org/irc/pypy/latest.log.html#irc-end and https://libera.irclog.whitequark.org/pypy | Matti: I made a bit of progress, the tests now only segfault towards the end

00:05 <nimaje> pypy is a bytecode vm the same as cpython is and generates .pyc files as well

00:17 <Heston> hmm, the description of pypy on its project page and wikipedia is a bit ambiguous then

00:17 <Heston> it's described as a JIT whereas cpython is an interpreter

00:18 <Heston> curious why it would use the same file extension if they're a different type of file

00:21 <nimaje> the jit part is runtime only (iirc there were some ideas to cache that)

00:22 <nimaje> pypy translates python source code to bytecode and interprets that; while interpreting it traces it and translates hot parts to machine code
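
A small sketch of the bytecode step described above: the standard-library dis module lists the bytecode a function was compiled to. The function add_one is a made-up example, and the exact opcodes printed differ between interpreters and versions.

```python
import dis


def add_one(x):
    # Hypothetical example function; any function would do.
    return x + 1


# Source code is compiled to bytecode before it runs; dis exposes that bytecode.
instructions = list(dis.get_instructions(add_one))
for ins in instructions:
    print(ins.opname, ins.argrepr)
```

The tracing and machine-code generation happen later, at run time, and are not visible through dis.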

00:59 <Heston> could you say then that the only difference is that the bytecode is interpreted more efficiently? Or is the generated bytecode quite different? From what I read, the garbage collector strategy is what offers a large speed improvement

01:10 <exarkun> At what point does a difference in runtime behavior go from "interpreting the byte code is more efficient" to something else?

01:11 <exarkun> Maybe observing the byte code that is running and the values it is running on and then generating native code that skips over some interpretation overhead that is unnecessary for those cases is "interpreting the byte code more efficiently"?

01:11 <exarkun> Or maybe it is a JIT instead of an interpreter? :)

01:57 mjacob has quit [Ping timeout: 260 seconds]

01:57 mjacob has joined #pypy

05:22 f4at has quit [Ping timeout: 246 seconds]

05:22 f4at has joined #pypy

06:24 <fijal> Heston: I think you could say that

06:30 <fijal> seems like pypy on m1 simply works?

06:30 <fijal> I need to fix _vmprof and _continuation, but modulo those, it just works

06:39 <fijal> I'm about to disappear because of missing electricity though

06:54 <tumbleweed> as it should (simply work)

06:56 <tumbleweed> supporting m1 properly means creating universal binaries, probably?

07:10 <LarstiQ> osx or Asahi? ;)

07:19 <mattip> osx

07:19 <arigato> LarstiQ: OSX. do you know if Asahi linux on M1 works just like an aarch64 linux at assembly level?

07:30 <mattip> tumbleweed: I added a comment to https://foss.heptapod.net/pypy/pypy/-/issues/3697 to provide a universal2 tarball

07:30 <antocuni> I'm considering adding support for "alloca" to RPython, but I'm unsure what's the best way to do it

07:30 <mattip> still looking for someone to contribute to the macos packaging story

07:32 <antocuni> I think that the end goal is to be able to do e.g. with "lltype.scoped_alloc(..., use_alloca=True)"

07:33 <antocuni> I see at least three ways

07:34 <antocuni> 1. add a new flavor to lltype.malloc; e.g. lltype.malloc(flavor='alloca'). This worries me because I've seen places which do things like "if flavor!='raw': assume_its_gc()"

07:35 <antocuni> 2. keep flavor='raw' but add an option, e.g. lltype.malloc(..., flavor='raw', use_alloca=True)

07:35 <antocuni> 3. bypass malloc and add directly an llop, which will be used by scoped_alloc directly

07:38 <antocuni> also, it's unclear how to implement it before translation: after translation, alloca() doesn't need a free() (obviously), but before translation we want to emulate it using malloc, and thus we need to free it at some point

07:40 <antocuni> bonus point if we manage to design a solution in which if by chance you use alloca() wrongly before translation, you get a clear error (e.g. if you use the memory after you returned from the function); this is not strictly necessary though, because I expect that the blessed usage will be "with scoped_alloc(...)", so in that case we have a clear point where to invalidate it
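
A minimal sketch of the "clear error before translation" idea from the message above, in plain Python rather than RPython. The names ScopedBuffer and scoped_buffer are hypothetical and are not part of lltype; the point is only that a context manager gives an obvious place to invalidate the emulated stack memory.

```python
from contextlib import contextmanager


class ScopedBuffer:
    """Hypothetical stand-in for stack memory that is emulated on the heap."""

    def __init__(self, size):
        self._data = bytearray(size)
        self._alive = True

    def _check(self):
        if not self._alive:
            raise RuntimeError("scoped buffer used after its scope ended")

    def __getitem__(self, index):
        self._check()
        return self._data[index]

    def __setitem__(self, index, value):
        self._check()
        self._data[index] = value

    def _invalidate(self):
        # Called when the scope ends: later accesses raise instead of
        # silently reading freed memory.
        self._alive = False
        self._data = None


@contextmanager
def scoped_buffer(size):
    buf = ScopedBuffer(size)
    try:
        yield buf
    finally:
        buf._invalidate()


with scoped_buffer(4) as buf:
    buf[0] = 7
    first = buf[0]

try:
    buf[0]  # wrong usage: the scope is already over
except RuntimeError as exc:
    message = str(exc)
```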

07:42 <LarstiQ> arigato: I don't know, but judging from https://asahilinux.org/2021/03/progress-report-january-february-2021/ my guess is yes

07:42 <LarstiQ> arigato: tweeting a question at the Asahi developers should confirm that

07:43 <LarstiQ> (or a bit more digging than I did now)

07:45 <tumbleweed> afaik, yes

07:45 <antocuni> I think I'm leaning towards option (2), it seems the simplest of the three

07:47 <tumbleweed> they're using standard linux distro arm64 binaries, from what I see

07:47 <tumbleweed> all the weird shit is confined to the kernel and booting

07:47 <tumbleweed> (and graphics)

08:02 <arigato> antocuni: any way would work. we also had something like that loooong ago but dropped it, but then we have more options now

08:07 <arigato> LarstiQ: so I guess the answer to "osx or Asahi?" is "both"

08:08 <arigato> but someone would need to check if there is still some issue

08:39 <cfbolz> antocuni: we used to have a flavor=stack

08:39 <cfbolz> there are some remnants of that still there I think

08:40 <antocuni> ah, and why did we kill it?

08:40 <cfbolz> antocuni: was never useful, I think

08:41 <cfbolz> antocuni: what kind of things do you plan to use it for?

08:41 <cfbolz> "just" the callbacks?

08:43 <cfbolz> (the biggest blocker in the past was this: with the new GCs, you couldn't put gc references into structs which are alloca'd, because the GC wouldn't find those roots, so they don't stay alive)

09:11 <antocuni> I think it would be usable for all the current usages of ScopedAlloc

09:11 <antocuni> and ScopedAlloc is always flavor='raw', so the GC is not a problem

09:12 <antocuni> e.g., we currently do an lltype.scoped_alloc for every HPy call

09:12 <antocuni> using alloca() would be a good speedup, probably

09:33 <cfbolz> antocuni: ok. what's the scoped_alloc used in each hpy call atm?

09:34 <antocuni> for METH_VARARGS, you pass all the handles inside an array of HPy

09:35 <cfbolz> ok

09:35 <antocuni> the relevant code is in interp_extfunc.py:call_varargs_kw

09:36 <cfbolz> writing a blog post atm, can't look ;-)

09:36 <antocuni> ok :)

09:38 <cfbolz> trying to explain what record_known_result does

09:40 * antocuni digging in the history to read about malloc(flavor='stack')

09:41 <cfbolz> let me guess, it was in 2006

09:41 <antocuni> just found the commit which killed it

09:41 <antocuni> c0c5ed69969c, 2007-08-04

09:41 <cfbolz> hah, close

09:42 <antocuni> ah no, it seems that that commit didn't kill it, it was just a refactoring

09:46 greedom has joined #pypy

09:58 greedom has quit [Remote host closed the connection]

09:59 greedom has joined #pypy

10:37 greedom has quit [Remote host closed the connection]

10:52 <fijal> mattip: I'm not volunteering

10:54 <fijal> next step is figuring out the assembler on OS X

10:56 marvin has quit [Remote host closed the connection]

10:56 lazka has quit [Quit: bye]

10:56 marvin_ has joined #pypy

10:56 lazka has joined #pypy

11:15 <fijal> not today though

11:22 <arigato> antocuni: about the implementation, for fixed-size stuff, you never need to call alloca(), instead you just emit C code that has got a local struct

11:23 <arigato> calling alloca() can be dangerous, actually

11:23 <arigato> for example, if a ScopedAlloc() appears in a loop (even after some inlining), then calling alloca() repeatedly will blow up the stack

11:26 <arigato> maybe we should not call alloca() at all because of that, and instead use something like ScopedAlloc(VARSIZE, flavor='stack', limit_length=50)

11:26 <arigato> which would be implemented by generating C code that contains an inline array (or struct-with-array) of length 50,

11:26 <arigato> and logic to take either a pointer to that if the actual length is <= 50, or do a malloc() and free() if > 50

11:28 <arigato> (or, of course, just not allow flavor='stack' for varsized types, as a first step)

11:41 <antocuni> uhm, I see the problem

11:42 <antocuni> we could also use C99 varsized local array, but I think they don't work reliably on windows

11:43 <antocuni> I hoped it would be easier :(

11:43 <antocuni> with all these yaks to shave, I'm losing a bit of motivation to do that right now

11:46 <antocuni> also, doing this kind of stuff inside the ScopedAlloc context manager is tricky: you need to ensure that the context manager itself is malloc-removed and that the __enter__ and __exit__ are inlined at the RPython level, else you allocate stuff which is local to the __enter__

11:46 <antocuni> what a mess

11:48 <cfbolz> antocuni: there is a hint for that

11:48 <cfbolz> it says "we must never see instances of this"

11:48 <cfbolz> but yes, it all sounds like quite a lot

11:48 <cfbolz> antocuni: what is the actual problem you are solving, the high level goal?

11:49 <cfbolz> tp_traverse?

11:49 <antocuni> yes

11:49 <antocuni> so, my plan was roughly the following

11:49 <antocuni> 1. refactor the GC so that we call gc.trace using only function-callbacks, not method-callbacks

11:50 <antocuni> 2. to do that, we need to pass an additional arg to the callback; so, RPython callbacks receive two arguments, and the tp_traverse callbacks receive only one

11:51 <antocuni> 3. to implement (2), I need some memory where to encode the two arguments and pass them around as void*

11:51 <antocuni> 4. alloca() sounded "easy", and also generally useful e.g. to speed up hpy calls
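
Step 3 of the plan above can be sketched with ctypes: two values are packed into one struct, and a one-argument callback receives only a void* that it unpacks again. All names here (CallbackArgs, two_arg_callback, one_arg_trampoline) are hypothetical, and this is plain Python with ctypes, not the RPython GC code under discussion.

```python
import ctypes


class CallbackArgs(ctypes.Structure):
    # Hypothetical struct holding the two values the real callback needs.
    _fields_ = [("first", ctypes.c_long), ("second", ctypes.c_long)]


# C-style callback type that takes a single void* argument.
VISIT = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_void_p)

seen = []


def two_arg_callback(first, second):
    seen.append((first, second))
    return 0


@VISIT
def one_arg_trampoline(raw):
    # Unpack the two arguments from the memory the void* points to.
    args = ctypes.cast(raw, ctypes.POINTER(CallbackArgs)).contents
    return two_arg_callback(args.first, args.second)


packed = CallbackArgs(11, 22)  # must stay alive for the duration of the call
rc = one_arg_trampoline(ctypes.addressof(packed))
```

Where that struct lives (stack, heap, or a global, as suggested a few messages later) is exactly the question the discussion is about.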

11:51 <cfbolz> hm

11:52 <cfbolz> it's a good plan, but indeed many steps

11:52 <cfbolz> as a first step we would be happy enough to support the new hpy features in whatever slow way though, right?

11:52 <antocuni> I think I could avoid (3) by storing the arguments in some global variable, since the callbacks should never be called concurrently

11:53 <cfbolz> because eg graalpython never calls tp_traverse somehow

11:53 <antocuni> if we don't care about speed, we can probably malloc() the area to pass the callback arguments

11:55 <antocuni> yes, I think that GraalPython keeps track of HPyFields by saving some extra info when they do HPyField_Store, but I would prefer to have a solution which uses tp_traverse, also to validate the HPy design

11:58 <cfbolz> I get that

11:58 <cfbolz> but it's not "first make it work, then make it fast" ;-)

11:59 <antocuni> this was also the approach which we followed for cpyext and it turned out not to work too well :)

12:00 <antocuni> also, I think there is no point in having a slow HPy on PyPy: being fast on pypy is probably the only reason why people would be interested in migrating their code to HPy, at this stage at least

12:02 <cfbolz> they are your yaks, you can shave them in whatever order you want ;-)

12:04 <antocuni> my secret hope is that someone steps up to say "hey, my secret dream was to implement alloca() for RPython, let me do that" :)

12:35 <cfbolz> Oops, seems netlify is maybe broken

12:41 f4at has quit [Ping timeout: 246 seconds]

13:09 <mattip> cfbolz: can you reach this link (netlify build log)

13:09 <mattip> https://app.netlify.com/sites/keen-mestorf-442210/deploys/628a267bad03c1000a548e07#L231

13:10 <mattip> I wonder why it is using python3.5

13:11 <cfbolz> No, I don't see the preview either

13:14 <mattip> I changed the default image to ubuntu 20.04, so now it is using python3.8

13:15 <mattip> but the build still fails "ModuleNotFoundError: No module named 'doit.cmd_auto'"

13:15 <mattip> which I think is nikola's fault

13:16 <cfbolz> mattip: no, wait, I think you removed that from the makefile

13:16 <cfbolz> https://github.com/pypy/pypy.org/commit/383e630d1142915464c5769c11e2be902ed41475

13:17 <mattip> right. Is your commit against main?

13:18 <mattip> I mean "can you please rebase off main"

13:23 <cfbolz> mattip: I'm only on the phone atm

13:23 <cfbolz> Will do a bit later

13:52 <cfbolz> mattip: done

13:53 <cfbolz> sorry for the confusion

13:55 <cfbolz> mattip: now only the nikola check happens 🤔

13:55 <cfbolz> ah, no

13:55 <cfbolz> just takes a while to load

14:01 <mattip> "The one one of them is called" -> "The other is called"

14:01 <mattip> it might be nice to make the connection between function decorators and jit hints explicit:

14:02 <mattip> "We implement hints useful to the JIT as decorators" or so

14:02 <mattip> just before "One of the very important ones"

14:21 <cfbolz> mattip: would you write it into the pr so I don't lose it?

14:21 <cfbolz> Thanks for reading :-)

14:21 <mattip> yeah, makes sense sorry

14:51 <Corbin> RPython question: I want to build an interned cache for a class whose constructor takes an immutable list. In order to do this, I somehow need to get that list to be a dict key. The lists are relatively short, so I was going to use codegen to build a table that tests for the length of the list and then repacks the list into a fixed-length tuple. Are there better ways to do it?

14:53 <Corbin> (The class in question is at l66 of https://osdn.net/users/corbin/pf/cammy/scm/blobs/master/cammy-rpy/cammylib/sexp.py)

14:57 otisolsen70 has joined #pypy

15:14 <cfbolz> Corbin: yes

15:14 <cfbolz> You can use an r_dict

15:15 <cfbolz> It takes a hash and an eq function

15:15 <cfbolz> So that means you can use unhashable things as keys
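
A plain-Python sketch of the idea behind a dict that takes its own hash and eq functions, used here for interning objects keyed by a list. CustomKeyDict, Node, list_eq, list_hash and intern_node are hypothetical names; this only imitates the concept and is not RPython's r_dict itself.

```python
class CustomKeyDict:
    """Hypothetical stand-in: a mapping driven by caller-supplied eq/hash."""

    def __init__(self, key_eq, key_hash):
        self._eq = key_eq
        self._hash = key_hash
        self._buckets = {}  # hash value -> list of (key, value) pairs

    def get(self, key, default=None):
        for stored_key, value in self._buckets.get(self._hash(key), []):
            if self._eq(stored_key, key):
                return value
        return default

    def set(self, key, value):
        bucket = self._buckets.setdefault(self._hash(key), [])
        for i, (stored_key, _) in enumerate(bucket):
            if self._eq(stored_key, key):
                bucket[i] = (stored_key, value)
                return
        bucket.append((key, value))


def list_eq(a, b):
    return a == b


def list_hash(items):
    # Simple order-sensitive hash over the list elements.
    h = 0x345678
    for item in items:
        h = (h * 1000003) ^ hash(item)
    return h


class Node:
    def __init__(self, args):
        self.args = args


_cache = CustomKeyDict(list_eq, list_hash)


def intern_node(args):
    node = _cache.get(args)
    if node is None:
        node = Node(args)
        # Store a copy so a later mutation of the caller's list
        # cannot change the key behind the cache's back.
        _cache.set(list(args), node)
    return node


a = intern_node(["x", "y"])
b = intern_node(["x", "y"])
c = intern_node(["x"])
```

Because the lists themselves are the keys, no repacking into fixed-length tuples is needed.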

15:16 <Corbin> cfbolz: Oh, and then I could use my existing equality function. Something about that feels strange, but maybe it's just because I'm used to requiring hashable keys.

15:19 <cfbolz> mattip: thanks for all the feedback!

15:20 <cfbolz> Does the general direction of the post make sense to you though?

15:46 <mattip> yes

15:46 <arigato> cfbolz: a question

15:46 <arigato> bytes.lower(bytes.upper(x)) == bytes.lower(x)

15:46 <arigato> how would you express that?

15:50 <arigato> def bytes_upper(b): res = _bytes_upper_helper(b); record_known_result(_bytes_lower_helper(b), _bytes_lower_helper, res)?

15:51 <arigato> and rely on various levels of optimizations to make sure the _bytes_lower_helper(b) call is not actually usually done?
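
A toy emulation of what such a hint expresses, assuming a simple lookup table in place of the JIT optimizer. toy_record_known_result and optimized_call are hypothetical helpers written for this sketch; unlike the real hint, the lower() call inside bytes_upper is executed eagerly here instead of being optimized away.

```python
_known = {}  # (func, args) -> result that the "optimizer" may assume
calls = []   # log of helper calls that really ran


def toy_record_known_result(result, func, *args):
    """Toy version: remember that func(*args) would return result."""
    _known[(func, args)] = result


def optimized_call(func, *args):
    """Toy optimizer: skip the call when its result is already known."""
    key = (func, args)
    if key in _known:
        return _known[key]
    return func(*args)


def _bytes_lower_helper(b):
    calls.append(("lower", b))
    return b.lower()


def _bytes_upper_helper(b):
    calls.append(("upper", b))
    return b.upper()


def bytes_upper(b):
    res = _bytes_upper_helper(b)
    # Express lower(upper(b)) == lower(b) without ever calling lower(res).
    toy_record_known_result(_bytes_lower_helper(b), _bytes_lower_helper, res)
    return res


upper = bytes_upper(b"AbC")
result = optimized_call(_bytes_lower_helper, upper)
```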

15:51 <cfbolz> arigato: yeah it's not ideal

15:51 <cfbolz> But it seems to work in practice

15:51 <arigato> OK, cool

15:52 <cfbolz> arigato: I'm not sure how much I should push on this

15:52 <cfbolz> Eg every rbigint function could have a whole bunch of these hints

15:54 <arigato> yes, unclear. I guess the kind of optimization that is useful (because not trivially done by the programmer already) is a little bit too subtle for just a hint

15:54 <arigato> I mean stuff like "a+100+1" => "a+101"

15:55 <arigato> which doesn't work because it parses as (a+100)+1 of course
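
The parse shape mentioned above can be checked with the standard ast module; this small sketch only inspects the tree and makes no claim about how the JIT sees the expression.

```python
import ast

# Parse the expression and look at the outermost node.
tree = ast.parse("a + 100 + 1", mode="eval").body

# The outer addition has another addition as its left operand,
# i.e. the expression groups as (a + 100) + 1.
left_is_binop = isinstance(tree.left, ast.BinOp)
inner_constant = tree.left.right.value
outer_constant = tree.right.value
print(ast.dump(tree))
```

Since the two constants never appear as operands of the same node, a fold to a + 101 needs reasoning across two operations.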

16:00 <fijal> antocuni: isn't repeatedly calling malloc()/free() with the same size kind of fast?

16:04 <arigato> cfbolz: record_known_result can be called with a "func" that is not marked elidable at all, right? thinking about a case where we know that func called with these particular arguments can still be optimized away, but in the general case it couldn't

16:06 f4at has joined #pypy

16:07 <arigato> e.g. if "func" adds an item to an externally visible set and returns None, it still is kind of idempotent in the sense that f(x);f(x) has exactly the same effects as f(x)

16:07 <cfbolz> arigato: no, it's not really implemented at this point

16:09 <antocuni> fijal: "kind of fast", but nothing near to stack allocation

16:09 <antocuni> consider this: https://paste.openstack.org/show/buKoc0KIzZ3xBL0DJI3R/

16:10 <antocuni> on my machine with -O1, malloc takes 1.418s, stack 0.221

16:10 <arigato> "it's not really possible to express that bigint_sub(x, x) == bigint(0) for arbitrary big integers ``x``"

16:11 <arigato> can't we just say "def rbigint_sub(x, y): record_known_result(zero, _sub_helper, x, x); return _sub_helper(x, y)" --- ignoring equal-valued but different-identity objects of course?

16:12 <arigato> I guess it's easier to write "if x is y: return zero" obviously

16:17 <arigato> but maybe it's an application anyway: for code where you might be tempted to add this kind of checks for strength reduction, it can be a problem because it makes more bridges in the tracing JIT, but maybe we could sometimes express it with a record_known_result(.. func ..) call just before the actual call to func()

16:17 <arigato> at least for the cases that we don't manage to write using jit.isconstant() or stuff like that

16:34 <cfbolz> Yeah, it needs a lot of care

16:35 <cfbolz> arigato: I think it would have to be driven by real benchmarks, and not just by 'it would be cool and somehow possible to do x'

16:35 <cfbolz> (even though the latter is much more fun)

17:05 <cfbolz> arigato: I am also wondering about the following: for strings something like this is quite common: s[:-1].endswith(t)

17:05 <cfbolz> can we prevent the copy somehow?

17:06 Heston has quit [Remote host closed the connection]

17:06 Heston has joined #pypy

17:07 <cfbolz> arigato: basically we could systematically add start, stop arguments to a whole bunch of string functions and teach the optimizer
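
At the Python level the copy can already be avoided by hand, because str.endswith accepts optional start and end positions. The sketch below only shows that the two spellings agree; the proposal above is about having the optimizer do this rewrite automatically.

```python
def endswith_with_copy(s, t):
    # Builds a new string for s[:-1] before testing the suffix.
    return s[:-1].endswith(t)


def endswith_without_copy(s, t):
    # Same test, but restricted to the range [0, -1) of the original string.
    return s.endswith(t, 0, -1)


samples = ["hello!", "", "a", "abc", "xx"]
suffixes = ["", "o", "hello", "ab", "a", "x"]
agree = all(
    endswith_with_copy(s, t) == endswith_without_copy(s, t)
    for s in samples
    for t in suffixes
)
```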

17:12 <Corbin> Heston: Do the answers you've gotten make sense? Both CPython and PyPy have bytecode compilers (and very similar bytecodes). They also both have interpreters for their bytecode, including debugging frames and coroutines, etc. PyPy *also* has a JIT for the bytecode interpreter, compiling bytecode to native code.

17:38 <Heston> Corbin, mostly yes. Just the interpreter is creating native code in either case, right? Unless you mean something else by native code

17:42 <Corbin> Heston: In both cases, the interpreter is *executing effects* written in native code. A bytecode interpreter looks up an effect for each instruction, and then performs the effect. But PyPy's JIT can take a sequence of bytecodes and compile an *optimized* effect, at runtime, even for sequences which it's never seen before.
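
A toy sketch of the distinction drawn above, with a made-up three-instruction stack machine: the interpreter looks up an effect per instruction, while the hand-written function stands in for what a JIT might produce for the same sequence. None of this is PyPy's actual machinery.

```python
def run(program, x):
    """Toy interpreter: look up each instruction, then perform its effect."""
    effects = {
        "PUSH": lambda stack, arg: stack.append(arg),
        "ADD": lambda stack, arg: stack.append(stack.pop() + stack.pop()),
        "MUL": lambda stack, arg: stack.append(stack.pop() * stack.pop()),
    }
    stack = [x]
    for opname, arg in program:
        effects[opname](stack, arg)  # one dispatch per instruction
    return stack.pop()


# Computes (x + 100) + 1, one instruction at a time.
program = [("PUSH", 100), ("ADD", None), ("PUSH", 1), ("ADD", None)]
interpreted = run(program, 5)


def specialized(x):
    # Stand-in for an optimized effect for the whole sequence,
    # valid here because x is an integer: no dispatch, constants folded.
    return x + 101


direct = specialized(5)
```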

17:45 <Heston> Corbin, interesting ok, thank you for the description

17:46 <Heston> it doesn't seem like pypy is creating .pyc bytecode files by default. I was hoping to have those to measure and reduce execution time

17:46 <Heston> I only see the flag to disable producing .pyc files

17:48 <Heston> perhaps they get stored somewhere other than the working directory in debian stable with pypy3?

17:51 <Corbin> I don't think that .pyc bytecode from PyPy can be used with CPython, or vice versa. You might need https://peps.python.org/pep-3147/ to find them; I'm not sure whether PyPy uses the __pycache__ format yet.
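
A sketch of how to ask a PEP 3147 interpreter where it would put cached bytecode, using only the standard library. The file example_mod.py is a throwaway made up for the demonstration. As far as I know, the cache is normally written for imported modules and not for the script named on the command line, which may be why no .pyc shows up for a directly executed file.

```python
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "example_mod.py")
    with open(source, "w") as f:
        f.write("VALUE = 42\n")

    # Where this interpreter would cache the bytecode for the source file.
    expected = importlib.util.cache_from_source(source)

    # Compile explicitly; the return value is the path that was written.
    written = py_compile.compile(source, doraise=True)
    exists = os.path.exists(written)
    print(written)
```

The printed file name includes an interpreter-specific tag, which is how caches from different interpreters can sit side by side.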

17:52 <Heston> I wasn't trying to cross use them, just to have a quicker execution time for the same file after the initial use

17:52 hexology has quit [Ping timeout: 248 seconds]

17:54 xcm has quit [Ping timeout: 256 seconds]

17:55 the_rat has quit [Quit: ZNC - http://znc.in]

17:55 the_rat has joined #pypy

17:56 <Corbin> I think that that should already happen by default.

18:02 <Heston> it could be this 9KB file is too small to show a measurable difference for parsing, but otherwise execution time is about the same if I rename the file

18:03 <Corbin> How long does your benchmark take on CPython? Are you using something like timeit for an apples-to-apples comparison?

18:05 <Heston> oh it's significantly quicker with pypy, no question. Almost 1/3 the time. I was just comparing initial execution time to second execution time where hopefully it wouldn't have to regenerate bytecode

18:05 <Heston> I wasn't using timeit so my measure of time could be off by a second

18:06 <Corbin> Sure. Even though it's not a favorite of this channel, I like to use the timeit module from the stdlib for this sort of thing; it has a customizable setup phase (-s from the CLI) so that you can do imports or other expensive work ahead of time.
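
A minimal sketch of the setup phase mentioned above, using timeit from Python code. The statement being timed (hashlib's sha256 over a 4096-byte buffer) is an arbitrary stand-in, not the sha256 code discussed in the channel.

```python
import timeit

# The setup string runs once per repeat and is not included in the timing.
setup = "import hashlib; data = b'x' * 4096"
stmt = "hashlib.sha256(data).hexdigest()"

# Run the statement 1000 times per measurement, keep the best of 3.
number = 1000
best = min(timeit.repeat(stmt, setup=setup, repeat=3, number=number))
print(f"{best / number * 1e6:.2f} us per call")
```

From the command line the same measurement would be: python -m timeit -s "import hashlib; data = b'x' * 4096" "hashlib.sha256(data).hexdigest()"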

18:07 <Corbin> Still, a 3x speedup is solidly in the expected range; I normally look for anywhere from 3x to 5x speedup.

18:08 <Heston> for what it's worth this is the sha256 code from pypy itself

18:11 <Heston> which, although almost 3x faster in pypy, is still nowhere close to the native executable

18:12 hexology has joined #pypy

18:14 xcm has joined #pypy

18:25 <cfbolz> Heston: bytecode generation is reaaaally quick

18:26 <cfbolz> it takes milliseconds, lost in the noise

18:27 <Heston> that makes sense

18:27 <cfbolz> the more interesting question would be to persist JIT generated machine code somehow, but that's rather tricky

18:29 <Heston> that would be the next goal for any significant speed increase. I thought cython was that but apparently not

18:30 <Heston> the generated machine code would differ depending on how you parse argument passed

18:31 <Heston> arguments

18:33 <Corbin> If you want monomorphic machine code, I would strongly consider writing a subprocess (or extension module, I guess~) in a language like Rust or OCaml which is still memory-safe but gives you fine control over how types are specialized.

18:33 <Corbin> But I'll tell you that, lately, I've compiled to OCaml, and it wasn't as fast as compiling to CHICKEN Scheme; the fastest option so far has been writing a custom JIT in RPython, the toolkit used to create PyPy.

18:35 <Heston> wow cool. Did you describe your work somewhere?

18:37 <Corbin> No. Individual stages are in the history of https://esolangs.org/wiki/Cammy but I don't have a document comparing the approaches. In short, CHICKEN is faster than OCaml for code which requires continuation-passing (CPS) transformation; but RPython JITs are excellent at specializing and CHICKEN is not.

19:08 <Heston> very interesting, thanks for sharing

19:37 Guest96 has joined #pypy

19:43 Dejan has joined #pypy

20:52 glyph has quit [Quit: End of line.]

20:52 glyph has joined #pypy

21:00 otisolsen70 has quit [Quit: Leaving]

21:40 Guest96 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]