mgorny has quit [Quit: No Ping reply in 60 seconds.]
mgorny has joined #pypy
lritter has joined #pypy
[Arfrever] has quit [Ping timeout: 256 seconds]
[Arfrever] has joined #pypy
dmalcolm has quit [Ping timeout: 264 seconds]
Dejan has joined #pypy
johnny_nick has joined #pypy
johnny_nick has quit [Client Quit]
dmalcolm has joined #pypy
otisolsen70 has joined #pypy
jcea has joined #pypy
dmalcolm has quit [Ping timeout: 240 seconds]
otisolsen70 has quit [Ping timeout: 256 seconds]
dmalcolm has joined #pypy
itamarst has joined #pypy
lritter has quit [Quit: Leaving]
<itamarst>
hello! I was curious about how the JIT interacted with benchmarking
<cfbolz>
itamarst: hey!
<cfbolz>
what exactly do you mean?
<itamarst>
e.g. in Rust benchmarking frameworks there's typically a blackbox() function you feed variables into, so that the compiler doesn't optimize the code into just a constant
<itamarst>
"oh hey you're doing repetitive math with constant arguments" kinda thing
<cfbolz>
itamarst: does that mean you're mainly talking about microbenchmarks?
<cfbolz>
for microbenchmarks this is definitely a problem with pypy, yes. if you measure a loop-invariant computation in a big loop, the computation will be moved out of the loop and what you end up timing is just an empty loop
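For illustration, a minimal sketch of the kind of microbenchmark meant here, where the measured expression is loop-invariant and the JIT can hoist or fold it away (the names and values are invented):

```python
import time

def broken_microbench(n=10_000_000):
    a, b = 12345, 678
    start = time.perf_counter()
    for _ in range(n):
        # loop-invariant: the JIT can hoist this out of the loop
        # (or constant-fold it entirely), so what actually gets timed
        # is essentially an empty loop
        result = a * b + 42
    return time.perf_counter() - start
```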
<itamarst>
yeah microbenchmarks I guess
<itamarst>
is there an equivalent blackbox function?
<cfbolz>
yesish, but using it introduces other timing distortions
<cfbolz>
because then you need a non-inlinable function call, which is itself quite expensive
<cfbolz>
you can achieve it with the help of `pypyjit.residual_call(callable, *args, **kwargs)`
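A minimal sketch of using it as a black box, assuming a PyPy interpreter (the `pypyjit` module does not exist on CPython); `work` is an invented placeholder:

```python
import time
import pypyjit  # PyPy-only module

def work(x):
    return x * x + 1

def microbench(n=1_000_000):
    acc = 0
    start = time.perf_counter()
    for i in range(n):
        # the JIT does not trace into or optimize across this call,
        # but the residual call adds noticeable per-iteration overhead
        acc ^= pypyjit.residual_call(work, i)
    return time.perf_counter() - start, acc
```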
<cfbolz>
what I do instead is to make sure that the computation I'm trying to measure is not constant-foldable (eg depends on the loop counter somehow)
<cfbolz>
then I check that the JIT didn't manage to cheat by looking at the JIT compiler IR (but that is not really something I can recommend in general, of course)
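A sketch of that approach: feed the loop counter into the measured computation so it is neither constant-foldable nor loop-invariant (`work` is again a placeholder):

```python
import time

def work(i):
    return (i * i + 17) % 97

def microbench(n=10_000_000):
    total = 0
    start = time.perf_counter()
    for i in range(n):
        # the argument changes every iteration, so the JIT cannot hoist
        # or constant-fold the call
        total += work(i)
    elapsed = time.perf_counter() - start
    # afterwards, inspect the optimized trace (e.g. by running under
    # PYPYLOG) to check that the JIT did not manage to cheat anyway
    return elapsed, total
```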
<cfbolz>
itamarst: do you have a concrete use case in mind right now? then we could try it
<itamarst>
at the moment it's benchmarks for the Twisted networking framework
<itamarst>
so it's not really arithmetic-heavy code, lots of method calls
<itamarst>
(sorry, in and out dealing with a plumber)
<itamarst>
I'm doing benchmark integration with codspeed.io, which uses Cachegrind to get CPU instruction counts
<itamarst>
so by default they only run benchmarked code once
<itamarst>
and that's fine for CPython, but with PyPy that means you won't get a JITed version?
<itamarst>
so I was thinking of adding a warmup phase of running the code in a loop
<itamarst>
but then there's the worry about the computation being moved out
<itamarst>
and... sounds like residual_call() is exactly what I want for that?
<itamarst>
since the function is specifically a _benchmark_, not library code
<itamarst>
I guess I should also investigate if this is a problem with pytest-benchmark, and file an issue there if so
<cfbolz>
itamarst: that sounds more like a real program than super small synthetic code
<cfbolz>
so it's a bit unlikely that the JIT will be able to const-fold it completely
<itamarst>
well
<cfbolz>
residual_call is a bit subtle. it only prevents passing information across that call boundary
<itamarst>
I want to get this as a feature into pytest-codspeed and I guess pytest-benchmark
<itamarst>
and they're used for a wide variety of code
<itamarst>
ah but _that_ call could still be optimized into a constant
<cfbolz>
I don't think there can be a one-size-fits-all solution, I fear
<itamarst>
stupid compilers being too smart
<cfbolz>
yeah, sorry ;-)
* cfbolz
hard at work making it smarter
<cfbolz>
itamarst: do you have a link to the Rust mechanism?
<cfbolz>
it's definitely good to think about it, the issue will only become more relevant for cpython in the future
<itamarst>
yeah I'm probably going to write an article about it
<itamarst>
thank you, this is very helpful
<cfbolz>
so yes, you can express something like the rust black box directly
<cfbolz>
sec
<cfbolz>
itamarst: I'm happy to read a draft of the post, if that would be in any way helpful
<LarstiQ>
looking at that rust blackbox I thought another trick might be passing in values from the commandline so they're not known constant, but that doesn't really work for a tracing jit
<cfbolz>
LarstiQ: no, that does work
<LarstiQ>
cfbolz: ah, pypy doesn't at some point specialize based on the data seen?
<cfbolz>
not in most cases
<LarstiQ>
well, maybe I did contribute something useful then ;)
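A sketch of the command-line trick, combined with the loop counter so the body is not loop-invariant either (the file name and values are made up):

```python
# bench.py -- run as e.g.:  pypy bench.py 7 13
import sys
import time

x = int(sys.argv[1])  # runtime values: the tracing JIT sees them as
y = int(sys.argv[2])  # non-constant, so it cannot fold x * y away

start = time.perf_counter()
total = 0
for i in range(10_000_000):
    # mixing in the loop counter also keeps the body from being hoisted
    total += (x * y + i) % 97
print(time.perf_counter() - start, total)
```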
<cfbolz>
I'm currently trying to mimic the example from the rust docs, but can't get pypy to cheat yet ;-)
<itamarst>
so one thought is... if the benchmark framework calls the function-being-benchmarked with a different argument each time (0, 1, 2, ...)
<itamarst>
and uses residual_call
<itamarst>
and the benchmark author can use that to ensure it's not a const calculation
<itamarst>
and then benchmark authors don't need to use residual_call() or CPython equivalent?
<cfbolz>
itamarst: yes, I think that is actually enough
<cfbolz>
you do have to actually use the different arguments, though
<cfbolz>
and that's a much better guard against distortion than anything else
<cfbolz>
cpython doesn't need anything like blackbox atm
<cfbolz>
itamarst: then the benchmarking framework could use the result for something (xor the hash into an accumulator or something like that). that would prevent the jit from thinking the result is unneeded
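Putting those two ideas together, a hypothetical harness sketch (the name `run_benchmark` and the `rounds` knob are invented, not pytest-codspeed or pytest-benchmark API):

```python
def run_benchmark(fn, rounds=10_000):
    sink = 0
    for i in range(rounds):
        # a different argument each round, so the call is neither
        # loop-invariant nor constant-foldable ...
        result = fn(i)
        # ... and the result is consumed, so the JIT cannot treat the
        # call as dead code
        sink ^= hash(result)
    return sink
```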
<itamarst>
wouldn't residual_call() suffice?
<cfbolz>
itamarst: the problem is that residual_call has a huge overhead
<itamarst>
I'm not sure that's a problem? if the benchmark framework is smart enough to remove that overhead
<cfbolz>
I'm not sure that's possible
<itamarst>
ah
<cfbolz>
basically my philosophical point is: you fundamentally cannot really automate any of this in a robust way that will prevent people from making benchmarking errors
<itamarst>
:(
<cfbolz>
(and I'm including myself :-P)
<cfbolz>
itamarst: sorry for being pessimistic :-(
<itamarst>
but maybe you can cover 90% of cases
<itamarst>
and I guess in some cases this isn't an issue at all
<itamarst>
so let's say... 1% badness vs 0.1% badness
<itamarst>
I can imagine arguments that you actually want benchmarks to be broken 100% of the time
<itamarst>
if you do them wrong
<itamarst>
so it's easier to notice?
<itamarst>
or I guess that's maybe another question
<cfbolz>
heh, no, I am definitely not against benchmark harm reduction
<itamarst>
maybe you can't _prevent_ it with automation, can you _detect_ it with automation?
<cfbolz>
itamarst: basically a positive way to make my point would be: communicating the pitfalls is necessary and ultimately more important than purely technical solutions
* itamarst
nods
<itamarst>
harm reduction (or detection?) is probably important too though, given (a) people don't often read docs and (b) LLMs are making this worse
<itamarst>
but I guess if it ends up in pytest-benchmark docs that'll help people use the correct template
<itamarst>
thank you for all the feedback, this is very helpful
<itamarst>
I will report back as I come up with suggestions for the relevant frameworks (docs / code changes)