cfbolz changed the topic of #pypy to: #pypy PyPy, the flexible snake https://pypy.org | IRC logs: https://quodlibet.duckdns.org/irc/pypy/latest.log.html#irc-end and https://libera.irclog.whitequark.org/pypy | the pypy angle is to shrug and copy the implementation of CPython as closely as possible, and stay out of design decisions
jcea has quit [Ping timeout: 268 seconds]
mgorny has quit [Quit: No Ping reply in 60 seconds.]
mgorny has joined #pypy
lritter has joined #pypy
[Arfrever] has quit [Ping timeout: 256 seconds]
[Arfrever] has joined #pypy
dmalcolm has quit [Ping timeout: 264 seconds]
Dejan has joined #pypy
johnny_nick has joined #pypy
johnny_nick has quit [Client Quit]
dmalcolm has joined #pypy
otisolsen70 has joined #pypy
jcea has joined #pypy
dmalcolm has quit [Ping timeout: 240 seconds]
otisolsen70 has quit [Ping timeout: 256 seconds]
dmalcolm has joined #pypy
itamarst has joined #pypy
lritter has quit [Quit: Leaving]
<itamarst> hello! I was curious about how the JIT interacted with benchmarking
<cfbolz> itamarst: hey!
<cfbolz> what exactly do you mean?
<itamarst> e.g. in Rust benchmarking frameworks there's typically a blackbox() function you feed variables into, so that the compiler doesn't optimize the code into just a constant
<itamarst> "oh hey you're doing repetitive math with constant arguments" kinda thing
<cfbolz> itamarst: does that mean you're mainly talking about microbenchmarks?
<cfbolz> for microbenchmarks this is definitely a problem with pypy, yes. if you measure a loop invariant computation in a big loop, the computation will be moved out of the loop and the times are just for an empty loop
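A minimal sketch of the pitfall cfbolz describes above: the timed loop body does not depend on the loop variable and its result is unused, so PyPy's JIT may hoist or constant-fold it and the measurement degenerates into timing an empty loop. The function names are illustrative, not from the log.

    import time

    def compute():
        # pure arithmetic on constants: nothing here depends on the loop below
        return 123456789 * 987 + 42

    def naive_microbenchmark(iterations=1_000_000):
        start = time.perf_counter()
        for _ in range(iterations):
            compute()  # result unused and loop-invariant: the JIT may reduce this to an empty loop
        return time.perf_counter() - start

    if __name__ == "__main__":
        print(naive_microbenchmark())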
<itamarst> yeah microbenchmarks I guess
<itamarst> is there an equivalent blackbox function?
<cfbolz> yesish, but using it introduces other timing distortions
<cfbolz> because then you need a non-inlinable function call, which is itself quite expensive
<cfbolz> you can achieve it with the help of `pypyjit.residual_call(callable, *args, **kwargs)`
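A sketch of how `pypyjit.residual_call` could serve as a Rust-style black box; the `black_box` wrapper and the CPython fallback are my own framing for illustration, not something posted in the log.

    try:
        import pypyjit  # available on PyPy only

        def black_box(value):
            # residual_call forces a genuine, non-inlined call, so the JIT
            # cannot carry constants or purity facts across this boundary
            return pypyjit.residual_call(lambda x: x, value)
    except ImportError:
        def black_box(value):
            # plain CPython: no optimizer aggressive enough to need the barrier
            return value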
<cfbolz> what I do instead is to make sure that the computation I'm trying to measure is not constant-foldable (eg depends on the loop counter somehow)
<cfbolz> then I check that the JIT didn't manage to cheat by looking at the JIT compiler IR (but that is not really something I can recommend in general, of course)
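A sketch of that preferred approach: feed the loop counter into the measured computation so it cannot be constant-folded, and keep the accumulated result alive. The `PYPYLOG` invocation at the end is the usual way to dump the JIT's optimized traces for inspection; the exact flag syntax is from memory, not from the log.

    def work(i):
        # the result depends on the argument, so feeding the loop counter in
        # keeps the computation from being loop-invariant or constant-foldable
        return (i * 31 + 7) % 1000

    def benchmark(iterations=100_000):
        total = 0
        for i in range(iterations):
            total += work(i)  # input varies with the loop counter
        return total          # using the sum keeps the body from being dead code

    # To check that the JIT did not cheat anyway, the optimized traces can be
    # dumped and read, e.g.:
    #   PYPYLOG=jit-log-opt:jit.log pypy benchmark.py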
<cfbolz> itamarst: do you have a concrete use case in mind right now? then we could try it
<itamarst> at the moment it's benchmarks for twisted networking framework
<itamarst> so it's not really arithmetic-heavy code, lots of method calls
<itamarst> (sorry, in and out dealing with a plumber)
<itamarst> I'm doing benchmark integration with codspeed.io, which uses Cachegrind to get CPU instruction counts
<itamarst> so by default they only run benchmarked code once
<itamarst> and that's fine for CPython, but with PyPy that means you won't get a JITed version?
<itamarst> so I was thinking of adding a warmup phase of running the code in a loop
<itamarst> but then there's the worry about the computation being moved out
<itamarst> and... sounds like residual_call() is exactly what I want for that?
<itamarst> since the function is specifically a _benchmark_, not library code
<itamarst> I guess I should also investigate if this is a problem with pytest-benchmark, and file an issue there if so
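A rough sketch of the warmup idea for an instruction-counting setup like the one itamarst describes: run the benchmarked callable enough times for PyPy's tracing JIT to kick in, then take the measured run afterwards. The function name, the parameters, and the warmup count are assumptions for illustration, not part of pytest-codspeed.

    def run_benchmark(func, *args, warmup=2000, rounds=1):
        # warmup phase: run the code enough times for PyPy's tracing JIT to
        # compile it (the default thresholds are on the order of a thousand runs)
        for _ in range(warmup):
            func(*args)
        # measured phase: on a Cachegrind-style harness this is the part whose
        # instruction counts get recorded
        result = None
        for _ in range(rounds):
            result = func(*args)
        return result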
<cfbolz> itamarst: that sounds more like a real program than super small synthetic code
<cfbolz> so it's a bit unlikely that the JIT will be able to const-fold it completely
<itamarst> well
<cfbolz> residual_call is a bit subtle. it only prevents passing information across that call boundary
<itamarst> I want to get this as a feature into pytest-codspeed and I guess pytest-benchmark
<itamarst> and they're used for a wide variety of code
<itamarst> ah but _that_ call could still be optimized into a constant
<cfbolz> I don't think there can be a one-size-fits-all solution, I fear
<itamarst> stupid compilers being too smart
<cfbolz> yeah, sorry ;-)
* cfbolz hard at work making it smarter
<cfbolz> itamarst: do you have a link to the Rust mechanism?
<cfbolz> it's definitely good to think about it, the issue will only become more relevant for cpython in the future
<itamarst> yeah I'm probably going to write an article about it
<itamarst> thank you, this is very helpful
<cfbolz> so yes, you can express something like the rust black box directly
<cfbolz> sec
<cfbolz> itamarst: I'm happy to read a draft of the post, if that would be in any way helpful
<LarstiQ> looking at that rust blackbox I thought another trick might be passing in values from the commandline so they're not known constants, but that doesn't really work for a tracing jit
<cfbolz> LarstiQ: no, that does work
<LarstiQ> cfbolz: ah, pypy doesn't at some point specialize based on the data seen?
<cfbolz> not in most cases
<LarstiQ> well, maybe I did contribute something useful then ;)
<cfbolz> I'm currently trying to mimic the example from the rust docs, but can't get pypy to cheat yet ;-)
<itamarst> so one thought is... if the benchmark framework calls the function-being-benchmarked with a different argument each time (0, 1, 2, ...)
<itamarst> and uses residual_call
<itamarst> and the benchmark author can use that to ensure it's not a const calculation
<itamarst> and then benchmark authors don't need to use residual_call() or a CPython equivalent?
<cfbolz> itamarst: yes, I think that is actually enough
<cfbolz> you must use the different arguments
<cfbolz> and that's a much better guard against distortion than anything else
<cfbolz> cpython doesn't need anything like blackbox atm
<cfbolz> itamarst: then the benchmarking framework could use the result for something (xor the hash into an accumulator or something like that). that would prevent the jit from thinking the result is unneeded
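A sketch of the scheme described in the last few messages: the harness supplies a different argument on every call and folds the result's hash into an accumulator, so the JIT can neither treat the input as a constant nor discard the output as dead. Names are made up for illustration, and the sketch assumes the benchmarked function returns something hashable.

    def measure(benchmarked, rounds=10_000):
        sink = 0
        for i in range(rounds):
            # a different argument every round, so the JIT cannot treat the
            # input as a constant...
            result = benchmarked(i)
            # ...and folding the result's hash into an accumulator keeps the
            # JIT from proving the result is unneeded
            sink ^= hash(result)
        return sink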
<itamarst> wouldn't residual_call() suffice?
<cfbolz> itamarst: the problem is that residual_call has a huge overhead
<itamarst> mmm
<itamarst> I'm not sure that's a problem? if the benchmark framework is smart enough to remove that overhead
<cfbolz> I'm not sure that's possible
<itamarst> ah
<cfbolz> basically my philosophical point is: you fundamentally cannot really automate any of this in a robust way that will prevent people from making benchmarking errors
<itamarst> :(
<cfbolz> (and I'm including myself :-P)
<cfbolz> itamarst: sorry for being pessimistic :-(
<itamarst> but maybe you can cover 90% of cases
<itamarst> and I guess in some cases this isn't an issue at all
<itamarst> so let's say... 1% badness vs 0.1% badness
<itamarst> I can imagine arguments that you actually want benchmarks to be broken 100% of the time
<itamarst> if you do them wrong
<itamarst> so it's easier to notice?
<itamarst> or I guess that's maybe another question
<cfbolz> heh, no, I am definitely not against benchmark harm reduction
<itamarst> maybe you can't _prevent_ it with automation, can you _detect_ it with automation?
<cfbolz> itamarst: basically a positive way to make my point would be: communicating the pitfalls is necessary and ultimately more important than purely technical solutions
* itamarst nods
<itamarst> harm reduction (or detection?) is probably important too though, given (a) people don't often read docs and (b) LLMs are making this worse
<itamarst> but I guess if it ends up in pytest-benchmark docs that'll help people use the correct template
<itamarst> thank you for all the feedback, this is very helpful
<itamarst> I will report back as I come up with suggestions for the relevant frameworks (docs / code changes)