<arigato>
cfbolz: the prospero challenge sounds cool :-)
<cfbolz>
arigato: it's really good, I'm having lots of fun
<cfbolz>
arigato: you should look into the 3d part
<cfbolz>
it sounds like it might be right up your alley
<arigato>
what do you exactly measure as the "runtime"? does it include parsing the text file?
<cfbolz>
no, I exclude that
<cfbolz>
arigato: you mean in the "parsing" sentence? maybe that is simply too confusing
<arigato>
uh, sorry, I don't understand your question
<arigato>
I was just asking when you start and stop the timer. Apparently you start it after you have turned the input text file into some bytecode format
<cfbolz>
yep
<arigato>
but it might not make a big difference in your approach
<cfbolz>
originally I had an O(ops**2) parser in C ;-)
<arigato>
OK :-) I meant mostly, as opposed to "turning the input text file into thousand lines of CUDA code and compiling that with GPU drivers"
<cfbolz>
arigato: yeah, right
<arigato>
which, OK, is very fast after you start the clock, but takes a long time if you count from opening the input text file
<cfbolz>
arigato: some people did cuda, see the "challenge" web site
<arigato>
yes, I saw
<cfbolz>
ok :-)
<cfbolz>
but yes, I personally find the "add parsing to the runtime" approach kind of more fun
<arigato>
I'm thinking about solutions where the clock starts after you open the input file---but still GPU-based solutions
<arigato>
maybe a simple interpreter that runs on the GPU instead of the CPU would be good already
<cfbolz>
arigato: I think the original author did cuda implementations of some of the interval approaches too
<arigato>
something that sounds a little strange to me: in the optimization around min/max signs, if you take the same program but add "m' = m + epsilon" at the end, I expect the final image to be almost identical, but suddenly the min/max optimization fails completely
<cfbolz>
arigato: Yeah, I'm super unsure about that part
<arigato>
I guess that's just not something that occurs in practice from the way the input text file is generated
<cfbolz>
Yes, the text file is generated from some kind of acyclic graph
<cfbolz>
With deduplication and everything
<arigato>
OK
<cfbolz>
arigato: I think to solve the 'max + epsilon' question you could introduce an explicit 'threshold' operation
<cfbolz>
Which is probably a much better idea anyway
<arigato>
right, and then there is "max * 0.5" too, etc.
<cfbolz>
Yes
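(The min/max interval pruning being discussed can be sketched roughly like this; illustrative Python with made-up names, not anyone's actual solution:)

```python
# Interval-based pruning of min/max: when the two input intervals don't
# overlap, the winner is known for the whole region, so the sub-quad's
# program can be rewritten to keep only the winning branch.

def interval_min(a, b):
    """a and b are (lo, hi) pairs; returns (result interval, branch).
    branch is 'a' or 'b' when one side always wins, else 'both'."""
    if a[1] < b[0]:
        return a, 'a'          # a is always the minimum on this region
    if b[1] < a[0]:
        return b, 'b'
    return (min(a[0], b[0]), min(a[1], b[1])), 'both'

def interval_max(a, b):
    if a[0] > b[1]:
        return a, 'a'          # a is always the maximum on this region
    if b[0] > a[1]:
        return b, 'b'
    return (max(a[0], b[0]), max(a[1], b[1])), 'both'
```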
<cfbolz>
arigato: one of the submitted solutions (in C++) does a neat trick: it computes the output intervals for all operations for all four quadrants together. That way you can use SIMD for the interval computation too
<arigato>
nice :-)
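(The quadrant-batched interval trick might look roughly like this in numpy; illustrative only, the actual submission is C++ with SIMD intrinsics:)

```python
import numpy as np

# One interval "lane" per quadrant: lo and hi are shape-(4,) arrays, so each
# interval operation is evaluated for all four quadrants in one vector op.

def iv_add(alo, ahi, blo, bhi):
    return alo + blo, ahi + bhi

def iv_neg(alo, ahi):
    return -ahi, -alo

def iv_mul(alo, ahi, blo, bhi):
    # standard interval multiplication: min/max over the four corner products
    p = np.stack([alo * blo, alo * bhi, ahi * blo, ahi * bhi])
    return p.min(axis=0), p.max(axis=0)

def iv_min(alo, ahi, blo, bhi):
    return np.minimum(alo, blo), np.minimum(ahi, bhi)
```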
<arigato>
the quad tree solution is hard to run on a GPU though
<cfbolz>
I don't have a good mental model for GPU perf at all, I fear
<arigato>
ah no, the paper you pointed me to says explicitly that it runs on the GPU
<cfbolz>
but it also seems to have been hard to make it fit, maybe?
<arigato>
but they managed, and it seems to be one of the main points of the paper
<arigato>
I've become a GPU guy since working on VR :-)
<cfbolz>
yes, I realize :-)
<arigato>
...too bad you can't have self-generating code on GPUs...
<cfbolz>
arigato: yes, I keep wondering that. but the instruction set is simply secret, right?
<cfbolz>
it's not even about self-generating code; it sounds like it's generally hard to produce "machine code"?
<arigato>
kinda, but more to the point is that it changes all the time
<cfbolz>
ok :-(
<arigato>
it's an "internal implementation detail" but it means that you need to go via CPU drivers to produce more machine code
<cfbolz>
right
<cfbolz>
arigato: in the fidget repo there are some 3d models that don't "end" with tons of min/max operations
<cfbolz>
for them, the whole interval approach kind of falls down
<arigato>
...right, I see it ends with a division by 11? but that's still an "easy" case
<cfbolz>
ok
<cfbolz>
but it also simply had a much smaller percentage of min/max operations
<arigato>
right, indeed tracing back the variables used in the last operations, there aren't many min/max
<arigato>
but that doesn't mean the interval approach fails, right?
<arigato>
or rather:
<cfbolz>
you can maybe generalize it, but the interval approach as done right now only optimizes min/max
<arigato>
the part about simplifying the expression in the sub-quads fails, but the interval analysis that drops whole quadrants works
<cfbolz>
yes
<cfbolz>
I wonder whether bear instead does something like exp(...) + exp(...) where you could then conclude that one of the terms doesn't matter if its argument value becomes too negative
<cfbolz>
could do something like that, I mean
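(The exp-sum pruning idea, as a hypothetical rule, not from any actual submission: since exp is monotonic, a term exp(g(x, y)) is provably below any chosen tolerance once the upper bound of g's interval is negative enough:)

```python
import math

# Hypothetical pruning rule for sums of exponentials: if the interval upper
# bound of a term's argument is far enough below zero, exp of it is bounded
# by a tiny constant and the term can be dropped in this region.

TOL = 1e-9

def can_drop_exp_term(arg_lo, arg_hi):
    """True if exp([arg_lo, arg_hi]) is bounded above by TOL.
    exp is monotonically increasing, so only arg_hi matters for the bound."""
    return math.exp(arg_hi) < TOL
```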
<arigato>
in the paper you pointed out, they compare three approaches: a GPU interpreter, sending the precompiled code to the GPU, and their interval approach with an interpreter. But you could also do the interval approach with precompiled code, if you're only interested in dropping quadrants and not trying to simplify the expression
<cfbolz>
right
<cfbolz>
the interpreter is just a special case of intervals anyway ;-)
<arigato>
yes :-)
<cfbolz>
so yes, bear has lots of sums of exp functions, fwiw
<arigato>
I'm pretty sure you could do the same as Keeter's paper with GPU code that needs to be precompiled (once):
<arigato>
by analysing the bytecode you know which parts could be removed in sub-quads
<arigato>
and then you emit GPU code that contains branches
<arigato>
there are restrictions on efficient branching in GPU code, but I think that this would be a very good case
<arigato>
the restrictions are just the same as with the approach of using an interpreter on the GPU, actually: the interpreter branches all the time, but neighboring computations follow the same path
<arigato>
according to their measured 19x interpreter overhead, this could give a 19x speedup over their approach
<arigato>
(...with the major disadvantage of the CPU driver needing to compile code at the beginning, of course)
<cfbolz>
arigato: interesting
<cfbolz>
so basically you would add a bunch of bools that are kind of always the same value on pixels that are close to each other?
<arigato>
yes
<arigato>
like in the paper, you compute these bools on 64x64 and then 8x8 tiles, and then within each 8x8 tile they have the same value
<arigato>
ouch, converting the input file into shader instructions for Unity works great except that compilation takes ages (~1 minute)
<arigato>
runs at about 1ms per image on my laptop (which has a good GPU), for 1110x1110
<arigato>
with no tiling at all, just computing the full expression at every pixel
<cfbolz>
arigato: yeah, GPUs can just brute-force the problem, it seems :-)
<cfbolz>
arigato: but note that the compiler or the gpu must have done something smart about min/max: 1110*1110*7000 floating point operations / 1ms = >8000 TFLOPS
<arigato>
unless I mess up my computation, I get 8 TFLOPS
<arigato>
and my GPU has a theoretical maximum of 15.62
<arigato>
yes, 1 TFLOP is really a crazy number---it's 1000 computations for each pixel of a 1000x1000 image, 1000 times per second
<cfbolz>
ah yes, T=1e12, so it checks out
<cfbolz>
I thought 1e9 for some reason
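(Spelled out, the back-of-the-envelope number above, assuming ~7000 operations for the Prospero expression:)

```python
pixels = 1110 * 1110
ops_per_pixel = 7000      # rough size of the Prospero expression
seconds = 1e-3            # ~1 ms per frame

tflops = pixels * ops_per_pixel / seconds / 1e12
# about 8.6 TFLOPS, comfortably under the GPU's 15.62 theoretical peak
```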
<arigato>
OK, the particular example of the Prospero text is rather extreme: it is actually a min of 665 different values, each of which can be computed in typically 10-20 instructions
<arigato>
even duplicating SSA operations that are used in more than one of these 665 values, the total grows by no more than ~50%
<arigato>
so trying to be clever about not recomputing them is likely not worth it
<cfbolz>
arigato: yeah, it's too easy to micro-optimize this one particular input
<arigato>
I'm thinking about computing a (short) list of indices for each tile, and then it's just an "interpreter" with 665 opcodes that computes just the values in that list and returns the minimum of them (or even jumps out early if it finds any negative number, or something)
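(That per-tile scheme, as a minimal sketch; `evaluate_branch` stands in for the 10-20 instruction subprograms, and all names here are hypothetical:)

```python
# Per-tile evaluation: each tile carries a short list of indices into the 665
# branches of the final min; compute only those, bailing out early as soon as
# any branch is negative, since that already decides the sign of the min.

def tile_min(branch_indices, evaluate_branch, x, y):
    best = float('inf')
    for i in branch_indices:
        v = evaluate_branch(i, x, y)
        if v < 0.0:
            return v        # pixel is inside the shape, no need to continue
        best = min(best, v)
    return best
```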