cfbolz changed the topic of #pypy to: #pypy PyPy, the flexible snake https://pypy.org | IRC logs: https://quodlibet.duckdns.org/irc/pypy/latest.log.html#irc-end and https://libera.irclog.whitequark.org/pypy | Matti: I made a bit of progress, the tests now only segfault towards the end
<Corbin> https://arxiv.org/abs/2011.13127 is on the front page of lobste.rs today. I don't recall it being discussed at the time. How does this compare to RPython's generated JITs?
<Corbin> I was particularly thinking of p25. "None of [the template JITs mentioned] above supported patching the binary code to burn in literals, stack offsets, and jump addresses, so their technique only works if the binary code can be concatenated without modification."
<Corbin> "This implies that all jumps and calls are indirect, and that all constants must be retrieved from memory, resulting in inferior execution performance."
<LarstiQ> cfbolz tweeted about it recently too
<cfbolz> Great paper
<cfbolz> It's a technique for quite good baseline jits
<cfbolz> It's not a meta jit though and making it one would be an interesting research project
<cfbolz> It's also an interesting question how well this would work for a language like python
<arigato> re the copy-and-patch JIT paper, yes, it's interesting and I think the approach would work for RPython JITs too
<cfbolz> arigato: I am still thinking about the latter part
<arigato> I'm thinking, for the backend only, to make somewhat optimized machine code
<arigato> (really I just started reading)
<cfbolz> ok
<cfbolz> arigato: I am wondering whether you could use the approach as a first stage instead, to "just" get the ~2x speedup that getting rid of bytecode dispatch brings
<arigato> I haven't read far enough to be sure
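As a reminder of what that first stage would eliminate, here is a classic switch-based dispatch loop in C (opcodes and layout illustrative); copy-and-patch concatenates the handler bodies directly, so the decode and dispatch below disappear:

    #include <stdint.h>

    enum { OP_PUSH, OP_ADD, OP_RET };

    int64_t run(const uint8_t *pc, int64_t *sp) {
        for (;;) {
            switch (*pc++) {   /* the decode+dispatch copy-and-patch removes */
            case OP_PUSH: *++sp = *pc++; break;
            case OP_ADD:  sp[-1] += sp[0]; --sp; break;
            case OP_RET:  return *sp;
            }
        }
    }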
<arigato> I think someone motivated enough could use it manually to get that effect on CPython
<arigato> not sure how well it would work on PyPy without tons of hints to guide it
<arigato> it would be interesting to try to use it with RPython's JIT, as a replacement for the backend and possibly the rewrite step that occurs just before too
<cfbolz> arigato: right
<arigato> but of course the usual problems remain: (1) GC integration, (2) guard failures
<cfbolz> arigato: for GC you would need to have an indirection for the GC constants, plus a way to find roots?
<arigato> the indirection for GC constants might not be needed, because there is already an indirection for all constants (their offset in the machine code)
<arigato> I think that across calls, *all* values are spilled in this model, so it might be easier to find the roots
<arigato> (doing an indirection for GC constants might still be easier, it's what we do already for all backends anyway)
<cfbolz> ok
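A sketch of that indirection: the stencil hole is patched with a slot index rather than an object address, so the GC can move the object by updating the table. All names here are illustrative:

    #include <stdint.h>

    typedef struct Obj Obj;

    /* Table of constant references, scanned and updated by the GC as
       ordinary roots; the machine code never holds a raw object pointer. */
    static Obj *gc_const_table[1024];

    extern char SLOT_HOLE;   /* patched with a small integer: the slot index */

    Obj *load_gc_const(void) {
        return gc_const_table[(uintptr_t)&SLOT_HOLE];
    }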
<arigato> it's interesting in the paper how they can extract all templates they need from running the LLVM compiler, on any number of architectures
<cfbolz> arigato: yes
<cfbolz> arigato: I also really like the approach of using tail calls
<arigato> yes, together with the modes inside LLVM made for compiling GHC, I think? (unsure)
<cfbolz> not sure it's 'for' ghc, but ghc uses it
<cfbolz> arigato: this is the hacky non-C++ code I played with yesterday, btw: https://gist.github.com/cfbolz/3ffa8746fc44f5d1192c02028a0ce058
<cfbolz> (it's a lot more hacky and low-tech than the paper)
<arigato> "This calling convention has been implemented specifically for use by the Glasgow
<arigato> Haskell Compiler (GHC)."
<cfbolz> ah, cool
<cfbolz> fair enough
<arigato> but also "at the moment only X86 supports this convention"
<cfbolz> yes, just found that too
<arigato> they didn't talk about that point in the paper, or I missed it
<arigato> unsure I understand, because elsewhere their wording implies the technique works on x86-64, ARM and SPARC
<cfbolz> arigato: yeah, I wonder whether it actually works on say ARM
<cfbolz> yes, but the formulation is vague
<arigato> personally I will stay far away from that whole approach, because it looks like another rabbit hole of unfinished LLVM features
<cfbolz> heh, got burned once by that already with stm?
<arigato> hah, if only that was only once
<cfbolz> oh no, what else?
<arigato> I don't remember exactly, just that every time we tried to use LLVM we eventually failed for that reason
<cfbolz> :-(
<cfbolz> arigato: ok, fine, but I am not sure the paper really needs much of llvm. it mostly uses clang I think
<arigato> it more or less depends on a special calling convention, and this one is only implemented on x86 according to the docs, so... roadblock again?
<arigato> "implementing it inside llvm looks like it should not be too much work" (famouns last words that will not be mine)
<cfbolz> arigato: no, I think the calling convention "just" makes it more efficient, it works with the normal one too, I suspect
<cfbolz> but yes, I get your skepticism
<arigato> right, I think it wouldn't work with the plain calling convention because of the risk that it produces real calls and blows up the stack, but maybe there are other more portable conventions that guarantee tail calls
<arigato> yes, it seems they have a "tailcc" for precisely that purpose
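A C-level approximation of that guarantee is clang's musttail statement attribute, which forces the tail call even at -O0 (the paper itself works at the LLVM IR level with the GHC convention; the names here are illustrative):

    #include <stdint.h>

    typedef int64_t cont_fn(int64_t acc, int64_t *vstack);

    extern char NEXT_HOLE;   /* patched with the next stencil's address */

    /* One stencil in continuation-passing style: do the work, then
       tail-call the next stencil with all live values as arguments. */
    int64_t int_add_stencil(int64_t acc, int64_t *vstack) {
        acc += *vstack++;
        cont_fn *next = (cont_fn *)&NEXT_HOLE;
        __attribute__((musttail)) return next(acc, vstack);
    }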
<arigato> I like how they represent all live values as parameters and arguments, but I'm not sure I see how they handle that in practice
<arigato> e.g. if two pieces of code are both INT_ADD+INT_MUL, but one uses the result of the addition as argument to the multiplication while the other just passes the addition as live variable, how is the difference represented?
<cfbolz> arigato: I think that would be different variants of these two, probably?
<arigato> do they need to insert special instructions to move registers around so that they end up where we need them, or do they instead generate many variants of INT_ADD and/or INT_MUL to pass/get various arguments in various positions?
<arigato> OK
<cfbolz> arigato: but yes, I don't quite get whether they have lots and lots of variants for "I have 8 other live things that I pass along"
<arigato> and are there many variants like "INT_ADD(a,b,c,d,e,f) which adds b to e"?
<cfbolz> indeed
<cfbolz> arigato: the paper says this: "We cannot naively enumerate all possible combinations of function prototypes for the different types of values that may be passed through, since the total number of combinations grows exponentially. The crucial observation is that each stencil only cares about its own inputs. The contents stored in the other registers do not matter, as long as they are not clobbered by the stencil. Therefore, for those
<cfbolz> registers, it is sufficient to always represent it by the longest type (uint64_t or double), and pass it from the argument to the continuation verbatim."
<arigato> somewhat doubtful, because otherwise they wouldn't end up with just 35kb of templates (in the simple case without too many superinstructions)
<cfbolz> ah no
<cfbolz> that's about types
<arigato> yes
<arigato> or maybe they do, and really have only a small number of live variables at most?
<cfbolz> "In our current implementation, we only use registers to store temporary values while evaluating expression trees." (but those can be deep, of course)
<Corbin> Hm. Could we compare copy-and-patch to compiling-to-closures for JIT code?
<cfbolz> I don't know what compiling-to-closures is
<Corbin> Old Lisp technique. Each AST node or bytecode is turned into a call to a runtime function taking two arguments, an environment (the "closure") and the input value. It's amenable to CPS, just like in the copy-and-patch paper.
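A minimal C rendering of that technique, with the "closure" reduced to a node carrying its own eval function pointer (illustrative; Lisp versions would capture real closures):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct Node Node;
    typedef int64_t eval_fn(Node *self, int64_t *env);
    struct Node { eval_fn *eval; Node *left, *right; int64_t datum; };

    static int64_t eval_const(Node *n, int64_t *env) { (void)env; return n->datum; }
    static int64_t eval_var(Node *n, int64_t *env)   { return env[n->datum]; }
    static int64_t eval_add(Node *n, int64_t *env) {
        return n->left->eval(n->left, env) + n->right->eval(n->right, env);
    }

    /* "Compiling" builds the node graph once; running it is one indirect
       call per node, with no bytecode decoding at all.  For x + 3:
       mk(eval_add, mk(eval_var, NULL, NULL, 0),
                    mk(eval_const, NULL, NULL, 3), 0)  */
    static Node *mk(eval_fn *f, Node *l, Node *r, int64_t d) {
        Node *n = malloc(sizeof *n);
        n->eval = f; n->left = l; n->right = r; n->datum = d;
        return n;
    }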
<krono> CPS, CEK, compile-to-closures sound like they're all in the same ballpark…