cfbolz changed the topic of #pypy to: #pypy PyPy, the flexible snake | IRC logs: and | Matti: I made a bit of progress, the tests now only segfault towards the end
<nimaje> pypy is a bytecode vm the same as cpython is and generates .pyc files as well
<Heston> hmm, the description of pypy on its project page and wikipedia is a bit ambiguous then
<Heston> it's described as a JIT whereas cpython is an interpreter
<Heston> curious why it would use the same file extension if they're a different type of file
<nimaje> the jit part is runtime only (iirc there were some ideas to cache that)
<nimaje> pypy translates python source code to bytecode and interprets that, while interpreting it traces it and translates hot parts to machine code
<Heston> could you say then that the only difference is that it interprets the bytecode more efficiently? Or is the generated bytecode quite different? From what I read, the garbage collector strategy is what offers a large speed improvement
<exarkun> At what point does a difference in runtime behavior go from "interpreting the byte code is more efficient" to something else?
<exarkun> Maybe observing the byte code that is running and the values it is running on and then generating native code that skips over some interpretation overhead that is unnecessary for those cases is "interpreting the byte code more efficiently"?
<exarkun> Or maybe it is a JIT instead of an interpreter? :)
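The distinction nimaje and exarkun are drawing can be made concrete with the stdlib `dis` module, which works the same way on both interpreters (a small illustration, not from the discussion itself):

```python
import dis

def add(a, b):
    return a + b

# Both CPython and PyPy compile a function to bytecode before running it;
# dis shows the instructions their bytecode interpreters execute.
dis.dis(add)

# PyPy additionally observes this bytecode at runtime and compiles hot
# paths to machine code -- the bytecode itself is the same kind of artifact.
```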
mjacob has quit [Ping timeout: 260 seconds]
mjacob has joined #pypy
f4at has quit [Ping timeout: 246 seconds]
f4at has joined #pypy
<fijal> Heston: I think you could say that
<fijal> seems like pypy on m1 simply works?
<fijal> I need to fix _vmprof and _continuation, but modulo those, it just works
<fijal> I'm about to disappear because of missing electricity though
<tumbleweed> as it should (simply work)
<tumbleweed> supporting m1 properly means creating universal binaries, probably?
<LarstiQ> osx or Asahi? ;)
<mattip> osx
<arigato> LarstiQ: OSX. do you know if Asahi linux on M1 works just like an aarch64 linux at assembly level?
<mattip> tumbleweed: I added a comment to to provide a universal2 tarball
<antocuni> I'm considering adding support for "alloca" to RPython, but I'm unsure what's the best way to do it
<mattip> still looking for someone to contribute to the macos packaging story
<antocuni> I think that the end goal is to be able to write e.g. lltype.scoped_alloc(..., use_alloca=True)
<antocuni> I see at least three ways
<antocuni> 1. add a new flavor to lltype.malloc; e.g. lltype.malloc(flavor='alloca'). This worries me because I've seen places which do things like "if flavor!='raw': assume_its_gc()"
<antocuni> 2. keep flavor='raw' but add an option, e.g. lltype.malloc(..., flavor='raw', use_alloca=True)
<antocuni> 3. bypass malloc and add directly an llop, which will be used by scoped_alloc directly
<antocuni> also, it's unclear how to implement it before translation: after translation, alloca() doesn't need a free() (obviously), but before translation we want to emulate it using malloc, and thus we need to free it at some point
<antocuni> bonus point if we manage to design a solution in which if by chance you use alloca() wrongly before translation, you get a clear error (e.g. if you use the memory after you returned from the function); this is not strictly necessary though, because I expect that the blessed usage will be "with scoped_alloc(...)", so in that case we have a clear point where to invalidate it
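The "clear error before translation" idea antocuni describes can be sketched in plain Python. This is a hypothetical emulation (`scoped_alloca` and the dict-holder shape are not the real RPython API): the buffer is replaced by a poison object on scope exit, so use-after-return fails loudly instead of silently misbehaving.

```python
from contextlib import contextmanager

class _Invalidated:
    # any use after scope exit fails with a clear error
    def _fail(self, *args, **kwargs):
        raise RuntimeError("use of alloca()'d memory after scope exit")
    __getitem__ = __setitem__ = __getattr__ = _fail

@contextmanager
def scoped_alloca(size):
    # Before translation, emulate alloca() with an ordinary buffer and
    # invalidate it on exit; after translation alloca() needs no free().
    holder = {"mem": bytearray(size)}
    try:
        yield holder
    finally:
        holder["mem"] = _Invalidated()

with scoped_alloca(16) as buf:
    buf["mem"][0] = 42          # fine inside the scope
```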
<LarstiQ> arigato: tweeting a question at the Asahi developers should confirm that
<LarstiQ> (or a bit more digging than I did now)
<tumbleweed> afaik, yes
<antocuni> I think I'm leaning towards option (2), it seems the simplest of the three
<tumbleweed> they're using standard linux distro arm64 binaries, from what I see
<tumbleweed> all the weird shit is confined to the kernel and booting
<tumbleweed> (and graphics)
<arigato> antocuni: any way would work. we also had something like that loooong ago but dropped it, but then we have more options now
<arigato> LarstiQ: so I guess the answer to "osx or Asahi?" is "both"
<arigato> but someone would need to check if there is still some issue
<cfbolz> antocuni: we used to have a flavor=stack
<cfbolz> there are some remnants of that still there I think
<antocuni> ah, and why did we kill it?
<cfbolz> antocuni: was never useful, I think
<cfbolz> antocuni: what kind of things do you plan to use it for?
<cfbolz> "just" the callbacks?
<cfbolz> (the biggest blocker in the past was this: with the new GCs, you couldn't put gc references into structs which are alloca'd, because the GC wouldn't find those roots, so they don't stay alive)
<antocuni> I think it would be usable for all the current usages of ScopedAlloc
<antocuni> and ScopedAlloc is always flavor='raw', so the GC is not a problem
<antocuni> e.g., we currently do a lltype.scoped_alloc for every HPy call
<antocuni> using alloca() would be a good speedup, probably
<cfbolz> antocuni: ok. what's the scoped_alloc used in each hpy call atm?
<antocuni> for METH_VARARGS, you pass all the handles inside an array of HPy
<cfbolz> ok
<antocuni> the relevant code is in
<cfbolz> writing a blog post atm, can't look ;-)
<antocuni> ok :)
<cfbolz> trying to explain what record_known_result does
* antocuni digging in the history to read about malloc(flavor='stack')
<cfbolz> let me guess, it was in 2006
<antocuni> just found the commit which killed it
<antocuni> c0c5ed69969c, 2007-08-04
<cfbolz> hah, close
<antocuni> ah no, it seems that that commit didn't kill it, it was just a refactoring
greedom has joined #pypy
greedom has quit [Remote host closed the connection]
greedom has joined #pypy
greedom has quit [Remote host closed the connection]
<fijal> mattip: I'm not volunteering
<fijal> next step is figuring out the assembler on OS X
marvin has quit [Remote host closed the connection]
lazka has quit [Quit: bye]
marvin_ has joined #pypy
lazka has joined #pypy
<fijal> not today though
<arigato> antocuni: about the implementation, for fixed-size stuff, you never need to call alloca(), instead you just emit C code that has got a local struct
<arigato> calling alloca() can be dangerous, actually
<arigato> for example, if a ScopedAlloc() appears in a loop (even after some inlining), then calling alloca() repeatedly will blow up the stack
<arigato> maybe we should not call alloca() at all because of that, and instead use something like ScopedAlloc(VARSIZE, flavor='stack', limit_length=50)
<arigato> which would be implemented by generating C code that contains an inline array (or struct-with-array) of length 50,
<arigato> and logic to take either a pointer to that if the actual length is <= 50, or do a malloc() and free() if > 50
<arigato> (or, of course, just not allow flavor='stack' for varsized types, as a first step)
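The hybrid arigato sketches can be modelled in Python (hypothetical names; the real thing would be generated C): a fixed inline array of length `limit_length` serves the common small case, and only larger requests fall back to heap allocation, so a loop can't blow up the stack.

```python
from contextlib import contextmanager

@contextmanager
def stack_or_heap_alloc(n, limit=50):
    # inline_buf stands in for the C local array of length `limit`
    inline_buf = bytearray(limit)
    if n <= limit:
        # hand out a view into the "stack" buffer
        yield memoryview(inline_buf)[:n]
    else:
        heap_buf = bytearray(n)         # stands in for malloc()
        try:
            yield memoryview(heap_buf)
        finally:
            del heap_buf                # stands in for free()

with stack_or_heap_alloc(10) as mem:    # small: inline ("stack") path
    mem[0] = 7
with stack_or_heap_alloc(200) as mem:   # large: heap fallback
    mem[199] = 7
```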
<antocuni> uhm, I see the problem
<antocuni> we could also use C99 variable-length local arrays, but I think they don't work reliably on windows
<antocuni> I hoped it would be easier :(
<antocuni> with all these yaks to shave, I'm losing a bit of motivation to do that right now
<antocuni> also, doing this kind of stuff inside the ScopedAlloc context manager is tricky: you need to ensure that the context manager itself is malloc-removed and that the __enter__ and __exit__ are inlined at the RPython level, else you allocate stuff which is local to the __enter__
<antocuni> what a mess
<cfbolz> antocuni: there is a hint for that
<cfbolz> it says "we must never see instances of this"
<cfbolz> but yes, it sounds all quite a lot
<cfbolz> antocuni: what is the actual problem you are solving, the high level goal?
<cfbolz> tp_traverse?
<antocuni> yes
<antocuni> so, my plan was roughly the following
<antocuni> 1. refactor the GC so that we call gc.trace using only function-callbacks, not method-callbacks
<antocuni> 2. to do that, we need to pass an additional arg to the callback; so, RPython callbacks receive two arguments, and the tp_traverse callbacks receive only one
<antocuni> 3. to implement (2), I need some memory where to encode the two arguments and pass them around as void*
<antocuni> 4. alloca() sounded "easy", and also generally useful e.g. to speed up hpy calls
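Steps (2) and (3) of the plan can be sketched in Python (hypothetical names, not the real GC plumbing): a C-style single-argument callback can't close over extra state, so the two RPython-side arguments are packed into one context object that travels through the single void*-like parameter.

```python
def gc_trace_callback(obj, extra):        # two-argument RPython-side callback
    extra["seen"].append(obj)

def make_one_arg_callback(func, extra):
    # this tuple is what the alloca()'d (or malloc()'d) area would encode
    context = (func, extra)

    def one_arg(obj):                     # single-argument, tp_traverse-style
        f, e = context
        f(obj, e)
    return one_arg

extra = {"seen": []}
cb = make_one_arg_callback(gc_trace_callback, extra)
cb("some_gc_object")
```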
<cfbolz> hm
<cfbolz> it's a good plan, but indeed many steps
<cfbolz> as a first step we would be happy enough to support the new hpy features in whatever slow way though, right?
<antocuni> I think I could avoid (3) by storing the arguments in some global variable, since the callbacks should never be called concurrently
<cfbolz> because eg graalpython never calls tp_traverse somehow
<antocuni> if we don't care about speed, we can probably malloc() the area to pass the callback arguments
<antocuni> yes, I think that GraalPython keeps track of HPyFields by saving some extra info when they do HPyField_Store, but I would prefer to have a solution which uses tp_traverse, also to validate the HPy design
<cfbolz> I get that
<cfbolz> but it's not "first make it work, then make it fast" ;-)
<antocuni> this was also the approach which we followed for cpyext and it turned out not to work too well :)
<antocuni> also, I think there is no point in having a slow HPy on PyPy: being fast on pypy is probably the only reason why people would be interested in migrating their code to HPy, at this stage at least
<cfbolz> they are your yaks, you can shave them in whatever order you want ;-)
<antocuni> my secret hope is that someone steps up to say "hey, my secret dream was to implement alloca() for RPython, let me do that" :)
<cfbolz> Oops, seems netlify is maybe broken
f4at has quit [Ping timeout: 246 seconds]
<mattip> cfbolz: can you reach this link (netlify build log)
<mattip> I wonder why it is using python3.5
<cfbolz> No, I don't see the preview either
<mattip> I changed the default image to ubuntu 20.04, so now it is using python3.8
<mattip> but the build still fails "ModuleNotFoundError: No module named 'doit.cmd_auto'"
<mattip> which I think is nikola's fault
<cfbolz> mattip: no, wait, I think you removed that from the makefile
<mattip> right. Is your commit against main?
<mattip> I mean "can you please rebase off main"
<cfbolz> mattip: I'm only at the phone atm
<cfbolz> Will do a bit later
<cfbolz> mattip: done
<cfbolz> sorry for the confusion
<cfbolz> mattip: now only the nikola check happens 🤔
<cfbolz> ah, no
<cfbolz> just takes a while to load
<mattip> "The one one of them is called" -> "The other is called"
<mattip> it might be nice to make the connection between function decorators and jit hints explicit:
<mattip> "We implement hints useful to the JIT as decorators" or so
<mattip> just before "One of the very important ones"
<cfbolz> mattip: would you write it into the pr so I don't lose it?
<cfbolz> Thanks for reading :-)
<mattip> yeah, makes sense sorry
<Corbin> RPython question: I want to build an interned cache for a class whose constructor takes an immutable list. In order to do this, I somehow need to get that list to be a dict key. The lists are relatively short, so I was going to use codegen to build a table that tests for the length of the list and then repacks the list into a fixed-length tuple. Are there better ways to do it?
otisolsen70 has joined #pypy
<cfbolz> Corbin: yes
<cfbolz> You can use an r_dict
<cfbolz> It takes a hash and an eq function
<cfbolz> So that means you can use unhashable things as keys
<Corbin> cfbolz: Oh, and then I could use my existing equality function. Something about that feels strange, but maybe it's just because I'm used to requiring hashable keys.
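The real r_dict lives in `rpython.rlib.objectmodel` and is constructed with an eq and a hash function, as cfbolz says. A plain-Python model of the idea (the wrapper class and `RDictModel` name are illustrative, not the RPython implementation) shows how that makes unhashable lists usable as interning-cache keys:

```python
def list_eq(a, b):
    return a == b

def list_hash(lst):
    h = 0
    for item in lst:
        h = (h * 1000003) ^ hash(item)
    return h

class _Key:                       # wraps a key with the supplied functions
    def __init__(self, value, eq, hashf):
        self.value, self._eq, self._hashf = value, eq, hashf
    def __eq__(self, other):
        return self._eq(self.value, other.value)
    def __hash__(self):
        return self._hashf(self.value)

class RDictModel:
    def __init__(self, eq, hashf):
        self._d, self._eq, self._hashf = {}, eq, hashf
    def setdefault(self, key, default):
        return self._d.setdefault(_Key(key, self._eq, self._hashf), default)

interned = RDictModel(list_eq, list_hash)
first = interned.setdefault([1, 2, 3], "instance-A")
second = interned.setdefault([1, 2, 3], "instance-B")  # cache hit
```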
<cfbolz> mattip: thanks for all the feedback!
<cfbolz> Does the general direction of the post make sense to you though?
<mattip> yes
<arigato> cfbolz: a question
<arigato> bytes.lower(bytes.upper(x)) == bytes.lower(x)
<arigato> how would you express that?
<arigato> def bytes_upper(b): res = _bytes_upper_helper(b); record_known_result(_bytes_lower_helper(b), _bytes_lower_helper, res)?
<arigato> and rely on various levels of optimizations to make sure the _bytes_lower_helper(b) call is not actually usually done?
<cfbolz> arigato: yeah it's not ideal
<cfbolz> But it seems to work in practice
<arigato> OK, cool
<cfbolz> arigato: I'm not sure how much I should push on this
<cfbolz> Eg every rbigint function could have a whole bunch of these hints
<arigato> yes, unclear. I guess the kind of optimization that is useful (because not trivially done by the programmer already) is a little bit too subtle for just a hint
<arigato> I mean stuff like "a+100+1" => "a+101"
<arigato> which doesn't work because it parses as (a+100)+1 of course
<fijal> antocuni: isn't repeatedly calling malloc() free() with the same size kind of fast?
<arigato> cfbolz: record_known_result can be called with a "func" that is not marked elidable at all, right? thinking about a case where we know that func called with these particular arguments can still be optimized away, but in the general case it couldn't
f4at has joined #pypy
<arigato> e.g. if "func" adds an item to an externally visible set and returns None, it still is kind of idempotent in the sense that f(x);f(x) has exactly the same effects as f(x)
<cfbolz> arigato: no, it's not really implemented at this point
<antocuni> fijal: "kind of fast", but nothing near to stack allocation
<antocuni> on my machine with -O1, malloc takes 1.418s, stack 0.221s
<arigato> "it's not really possible to express that bigint_sub(x, x) == bigint(0) for arbitrary big integers ``x``"
<arigato> can't we just say "def rbigint_sub(x, y): record_known_result(zero, _sub_helper, x, x); return _sub_helper(x, y)" --- ignoring equal-valued but different-identity objects of course?
<arigato> I guess it's easier to write "if x is y: return zero" obviously
<arigato> but maybe it's an application anyway: for code where you might be tempted to add this kind of checks for strength reduction, it can be a problem because it makes more bridges in the tracing JIT, but maybe we could sometimes express it with a record_known_result(.. func ..) call just before the actual call to func()
<arigato> at least for the cases that we don't manage to write using jit.isconstant() or stuff like that
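The rbigint_sub pattern arigato writes out can be sketched in plain Python. `record_known_result` is the real hint name from the blog post under discussion, but here it is a no-op stub, and plain ints stand in for rbigint:

```python
def record_known_result(result, func, *args):
    # no-op stand-in: in RPython this is a JIT hint telling the tracer
    # that func(*args) would return `result`, without calling it
    pass

def _sub_helper(x, y):
    return x - y            # plain ints stand in for rbigint here

def rbigint_sub(x, y):
    # the pattern from the discussion: promise the JIT that x - x == 0,
    # so a traced _sub_helper(x, x) can be optimized away without a
    # branch ("if x is y: return zero") that would create bridges
    record_known_result(0, _sub_helper, x, x)
    return _sub_helper(x, y)
```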
<cfbolz> Yeah, it needs a lot of care
<cfbolz> arigato: I think it would have to be driven by real benchmarks, and not just by 'it would be cool and somehow possible to do x'
<cfbolz> (even though the latter is much more fun)
<cfbolz> arigato: I am also wondering about the following: for strings something like this is quite common: s[:-1].endswith(t)
<cfbolz> can we prevent the copy somehow?
Heston has quit [Remote host closed the connection]
Heston has joined #pypy
<cfbolz> arigato: basically we could systematically add start, stop arguments to a whole bunch of string functions and teach the optimizer
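The start/stop-argument idea already exists at the application level: Python's `str.endswith` accepts optional start and end parameters, which check the same range as the slice without materializing a copy (a small illustration of the shape of API cfbolz proposes for the RPython internals):

```python
s = "hello world!"
t = "world"

# s[:-1].endswith(t) builds a copy of the sliced string first ...
with_copy = s[:-1].endswith(t)

# ... while endswith's existing start/end parameters check the same
# range of s without constructing the slice:
without_copy = s.endswith(t, 0, len(s) - 1)

assert with_copy == without_copy
```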
<Corbin> Heston: Do the answers you've gotten make sense? Both CPython and PyPy have bytecode compilers (and very similar bytecodes). They also both have interpreters for their bytecode, including debugging frames and coroutines, etc. PyPy *also* has a JIT for the bytecode interpreter, compiling bytecode to native code.
<Heston> Corbin, mostly yes. Just the interpreter is creating native code in either case, right? Unless you mean something else by native code
<Corbin> Heston: In both cases, the interpreter is *executing effects* written in native code. A bytecode interpreter looks up an effect for each instruction, and then performs the effect. But PyPy's JIT can take a sequence of bytecodes and compile an *optimized* effect, at runtime, even for sequences which it's never seen before.
<Heston> Corbin, interesting ok, thank you for the description
<Heston> it doesn't seem like pypy is creating .pyc bytecode files by default. I was hoping to have those to measure and reduce execution time
<Heston> I only see the flag to disable producing .pyc files
<Heston> perhaps they get stored somewhere other than the working directory in debian stable with pypy3?
<Corbin> I don't think that .pyc bytecode from PyPy can be used with CPython, or vice versa. You might need to find them; I'm not sure whether PyPy uses the __pycache__ format yet.
<Heston> I wasn't trying to cross use them, just to have a quicker execution time for the same file after the initial use
hexology has quit [Ping timeout: 248 seconds]
xcm has quit [Ping timeout: 256 seconds]
the_rat has quit [Quit: ZNC -]
the_rat has joined #pypy
<Corbin> I think that that should already happen by default.
<Heston> it could be this 9KB file is too small to show a measurable difference for parsing, but otherwise execution time is about the same if I rename the file
<Corbin> How long does your benchmark take on CPython? Are you using something like timeit for an apples-to-apples comparison?
<Heston> oh it's significantly quicker with pypy, no question. Almost 1/3 the time. I was just comparing initial execution time to second execution time where hopefully it wouldn't have to regenerate bytecode
<Heston> I wasn't using timeit so my measure of time could be off by a second
<Corbin> Sure. Even though it's not a favorite of this channel, I like to use the timeit module from the stdlib for this sort of thing; it has a customizable setup phase (-s from the CLI) so that you can do imports or other expensive work ahead of time.
<Corbin> Still, a 3x speedup is solidly in the expected range; I normally look for anywhere from 3x to 5x speedup.
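The setup phase Corbin mentions looks like this from the timeit module's Python API (the sha256 statement and the 9 KB payload are illustrative, echoing Heston's file size):

```python
import timeit

# timeit's setup argument (-s on the command line) runs once and is
# excluded from the measurement, so import and data-building costs
# don't pollute the timing of the statement itself.
elapsed = timeit.timeit(
    stmt="sha256(payload).hexdigest()",
    setup="from hashlib import sha256; payload = b'x' * 9000",
    number=1000,
)
print("1000 sha256 calls: %.4fs" % elapsed)
```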
<Heston> for what it's worth this is the sha256 code from pypy itself
<Heston> which although is almost 3x faster in pypy, is still nowhere close to the native executable
hexology has joined #pypy
xcm has joined #pypy
<cfbolz> Heston: bytecode generation is reaaaally quick
<cfbolz> it takes milliseconds, lost in the noise
<Heston> that makes sense
<cfbolz> the more interesting question would be to persist JIT generated machine code somehow, but that's rather tricky
<Heston> that would be the next goal for any significant speed increase. I thought cython was that but apparently not
<Heston> the generated machine code would differ depending on how you parse argument passed
<Heston> arguments
<Corbin> If you want monomorphic machine code, I would strongly consider writing a subprocess (or extension module, I guess~) in a language like Rust or OCaml which is still memory-safe but gives you fine control over how types are specialized.
<Corbin> But I'll tell you that, lately, I've compiled to OCaml, and it wasn't as fast as compiling to CHICKEN Scheme; the fastest option so far has been writing a custom JIT in RPython, the toolkit used to create PyPy.
<Heston> wow cool. Did you describe your work somewhere?
<Corbin> No. Individual stages are in the history of but I don't have a document comparing the approaches. In short, CHICKEN is faster than OCaml for code which requires continuation-passing (CPS) transformation; but RPython JITs are excellent at specializing and CHICKEN is not.
<Heston> very interesting, thanks for sharing
Guest96 has joined #pypy
Dejan has joined #pypy
glyph has quit [Quit: End of line.]
glyph has joined #pypy
otisolsen70 has quit [Quit: Leaving]
Guest96 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]