<arigato>
cfbolz: the prospero challenge sounds cool :-)
<cfbolz>
arigato: it's really good, I'm having lots of fun
<cfbolz>
arigato: you should look into the 3d part
<cfbolz>
it sounds like it might be right up your alley
<arigato>
what do you exactly measure as the "runtime"? does it include parsing the text file?
<cfbolz>
no, I exclude that
<cfbolz>
arigato: you mean in the "parsing" sentence? maybe that is simply too confusing
<arigato>
uh, sorry, I don't understand your question
<arigato>
I was just asking when you start and stop the timer. Apparently you start it after you have turned the input text file into some bytecode format
<cfbolz>
yep
<arigato>
but it might not make a big difference in your approach
<cfbolz>
originally I had an O(ops**2) parser in C ;-)
<arigato>
OK :-) I meant mostly, as opposed to "turning the input text file into thousand lines of CUDA code and compiling that with GPU drivers"
<cfbolz>
arigato: yeah, right
<arigato>
which, OK, is very fast after you start the clock, but takes a long time if you count from opening the input text file
<cfbolz>
arigato: some people did cuda, see the "challenge" web site
<arigato>
yes, I saw
<cfbolz>
ok :-)
<cfbolz>
but yes, I personally find the "add parsing to the runtime" approach kind of more fun
<arigato>
I'm thinking about solutions where the clock starts after you open the input file---but still GPU-based solutions
<arigato>
maybe a simple interpreter that runs on the GPU instead of the CPU would be good already
<cfbolz>
arigato: I think the original author did cuda implementations of some of the interval approaches too
<arigato>
something that sounds a little strange to me: in the optimization around min/max signs, if you take the same program but add "m' = m + epsilon" at the end, I expect the final image to be almost identical, but suddenly the min/max optimization fails completely
<cfbolz>
arigato: Yeah, I'm super unsure about that part
<arigato>
I guess that's just not something that occurs in practice from the way the input text file is generated
<cfbolz>
Yes, the text file is generated from some kind of acyclic graph
<cfbolz>
With deduplication and everything
<arigato>
OK
<cfbolz>
arigato: I think to solve the 'max + epsilon' question you could introduce an explicit 'threshold' operation
<cfbolz>
Which is probably a much better idea anyway
<arigato>
right, and then there is "max * 0.5" too, etc.
<cfbolz>
Yes
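(The min/max interval pruning being discussed can be sketched roughly like this; illustrative Python with made-up names, not anyone's actual solution:)

```python
# Interval-based pruning of min/max: when the two input intervals don't
# overlap, the winner is known for the whole region, so the sub-quad's
# program can be rewritten to keep only the winning branch.

def interval_min(a, b):
    """a and b are (lo, hi) pairs; returns (result interval, branch).
    branch is 'a' or 'b' when one side always wins, else 'both'."""
    if a[1] < b[0]:
        return a, 'a'          # a is always the minimum on this region
    if b[1] < a[0]:
        return b, 'b'
    return (min(a[0], b[0]), min(a[1], b[1])), 'both'

def interval_max(a, b):
    if a[0] > b[1]:
        return a, 'a'          # a is always the maximum on this region
    if b[0] > a[1]:
        return b, 'b'
    return (max(a[0], b[0]), max(a[1], b[1])), 'both'
```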
<cfbolz>
arigato: one of the submitted solutions (in C++) does a neat trick: it computes the output intervals for all operations for all four quadrants together. That way you can use SIMD for the interval computation too
<arigato>
nice :-)
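(The quadrant-batched interval trick might look roughly like this in numpy; illustrative only, the actual submission is C++ with SIMD intrinsics:)

```python
import numpy as np

# One interval "lane" per quadrant: lo and hi are shape-(4,) arrays, so each
# interval operation is evaluated for all four quadrants in one vector op.

def iv_add(alo, ahi, blo, bhi):
    return alo + blo, ahi + bhi

def iv_neg(alo, ahi):
    return -ahi, -alo

def iv_mul(alo, ahi, blo, bhi):
    # standard interval multiplication: min/max over the four corner products
    p = np.stack([alo * blo, alo * bhi, ahi * blo, ahi * bhi])
    return p.min(axis=0), p.max(axis=0)

def iv_min(alo, ahi, blo, bhi):
    return np.minimum(alo, blo), np.minimum(ahi, bhi)
```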
<arigato>
the quad tree solution is hard to run on a GPU though
<cfbolz>
I don't have a good mental model for GPU perf at all, I fear
<arigato>
ah no, the paper you pointed me to says explicitly that it runs on the GPU
<cfbolz>
but it also seems to have been hard to make it fit, maybe?
<arigato>
but they managed, and it seems to be one of the main points of the paper
<arigato>
I've become a GPU guy since working on VR :-)
<cfbolz>
yes, I realize :-)
<arigato>
...too bad you can't have self-generating code on GPUs...
<cfbolz>
arigato: yes, I keep wondering that. but the instruction set is simply secret, right?
<cfbolz>
it's not even about self-generating code; it sounds like it's generally hard to produce "machine code"?
<arigato>
kinda, but more to the point is that it changes all the time
<cfbolz>
ok :-(
<arigato>
it's an "internal implementation detail" but it means that you need to go via CPU drivers to produce more machine code
<cfbolz>
right
<cfbolz>
arigato: in the fidget repo there are some 3d models that don't "end" with tons of min/max operations
<cfbolz>
for them, the whole interval approach kind of falls down
<arigato>
...right, I see it ends with a division by 11? but that's still an "easy" case
<cfbolz>
ok
<cfbolz>
but it also simply had a much smaller percentage of min/max operations
<arigato>
right, indeed tracing back the variables used in the last operations, there aren't many min/max
<arigato>
but that doesn't mean the interval approach fails, right?
<arigato>
or rather:
<cfbolz>
you can maybe generalize it, but the interval approach as done right now only optimizes min/max
<arigato>
the part about simplifying the expression in the sub-quads fails, but the interval analysis that drops whole quadrants works
<cfbolz>
yes
<cfbolz>
I wonder whether bear instead does something like exp(...) + exp(...) where you could then conclude that one of the terms doesn't matter if its argument value becomes too negative
<cfbolz>
could do something like that, I mean
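(The exp-sum pruning idea, as a hypothetical rule, not from any actual submission: since exp is monotonic, a term exp(g(x, y)) is provably below any chosen tolerance once the upper bound of g's interval is negative enough:)

```python
import math

# Hypothetical pruning rule for sums of exponentials: if the interval upper
# bound of a term's argument is far enough below zero, exp of it is bounded
# by a tiny constant and the term can be dropped in this region.

TOL = 1e-9

def can_drop_exp_term(arg_lo, arg_hi):
    """True if exp([arg_lo, arg_hi]) is bounded above by TOL.
    exp is monotonically increasing, so only arg_hi matters for the bound."""
    return math.exp(arg_hi) < TOL
```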
<arigato>
in the paper you pointed out, they compare three approaches: a GPU interpreter, sending the precompiled code to the GPU, and their interval approach with an interpreter. But you could also do the interval approach with precompiled code, if you're only interested in dropping quadrants and not trying to simplify the expression
<cfbolz>
right
<cfbolz>
the interpreter is just a special case of intervals anyway ;-)
<arigato>
yes :-)
<cfbolz>
so yes, bear has lots of sums of exp functions, fwiw
<arigato>
I'm pretty sure you could do the same as Keeter's paper with GPU code that needs to be precompiled (once):
<arigato>
by analysing the bytecode you know which parts could be removed in sub-quads
<arigato>
and then you emit GPU code that contains branches
<arigato>
there are restrictions on efficient branching in GPU code, but I think that this would be a very good case
<arigato>
the restrictions are just the same as with the approach of using an interpreter on the GPU, actually: the interpreter branches all the time, but neighboring computations follow the same path
<arigato>
according to their measured 19x interpreter overhead, this could give a 19x speedup over their approach
<arigato>
(...with the major disadvantage of the CPU driver needing to compile code at the beginning, of course)
<cfbolz>
arigato: interesting
<cfbolz>
so basically you would add a bunch of bools that are kind of always the same value on pixels that are close to each other?
<arigato>
yes
<arigato>
like in the paper, you compute these bools on 64x64 and then 8x8 tiles, and then within each 8x8 tile they have the same value
<arigato>
ouch, converting the input file into shader instructions for Unity works great except that compilation takes ages (~1 minute)
<arigato>
runs at about 1ms per image on my laptop (which has a good GPU), for 1110x1110
<arigato>
with no tiling at all, just computing the full expression at every pixel
<cfbolz>
arigato: yeah, GPUs can just brute-force the problem, it seems :-)
<cfbolz>
arigato: but note that the compiler or the gpu must have done something smart about min/max: 1110*1110*7000 floating point operations / 1ms = >8000 TFLOPS
<arigato>
unless I mess up my computation, I get 8 TFLOPS
<arigato>
and my GPU has a theoretical maximum of 15.62
<arigato>
yes, 1 TFLOP is really a crazy number---it's 1000 computations for each pixel of a 1000x1000 image, 1000 times per second
<cfbolz>
ah yes, T=1e12, so it checks out
<cfbolz>
I thought 1e9 for some reason
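(Spelled out, the back-of-the-envelope number above, assuming ~7000 operations for the Prospero expression:)

```python
pixels = 1110 * 1110
ops_per_pixel = 7000      # rough size of the Prospero expression
seconds = 1e-3            # ~1 ms per frame

tflops = pixels * ops_per_pixel / seconds / 1e12
# about 8.6 TFLOPS, comfortably under the GPU's 15.62 theoretical peak
```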
<arigato>
OK, the particular example of the Prospero text is rather extreme: it is actually a min of 665 different values, each of which can be computed in typically 10-20 instructions
<arigato>
even duplicating SSA operations that are used in more than one of these 665 values, the total grows by no more than ~50%
<arigato>
so trying to be clever about not recomputing them is likely not worth it
<cfbolz>
arigato: yeah, it's too easy to micro-optimize this one particular input
<arigato>
I'm thinking about computing a (short) list of indices for each tile, and then it's just an "interpreter" with 665 opcodes that computes just the values in that list and returns the minimum of them (or even jumps out early if it finds any negative number, or something)
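(That per-tile scheme, as a minimal sketch; `evaluate_branch` stands in for the 10-20 instruction subprograms, and all names here are hypothetical:)

```python
# Per-tile evaluation: each tile carries a short list of indices into the 665
# branches of the final min; compute only those, bailing out early as soon as
# any branch is negative, since that already decides the sign of the min.

def tile_min(branch_indices, evaluate_branch, x, y):
    best = float('inf')
    for i in branch_indices:
        v = evaluate_branch(i, x, y)
        if v < 0.0:
            return v        # pixel is inside the shape, no need to continue
        best = min(best, v)
    return best
```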