<olleolleolle[m]> I had some luck this morning: https://github.com/jruby/warbler/pull/533 It was possible to get to green builds on jruby-head with Warbler - by picking the right RubyGems.
<olleolleolle[m]> So, warbler is now building green - and I hope to get a dependabot PR merged, too. The game's afoot!
<headius> olleolleolle: oh that's great!
<headius> I have been meaning to get in there as well but 9.4 issues have taken priority
<headius> @enebo so, a bit of a surprise on the recent Ruby parser library changes
<headius> The biggest change recently was splitting off string parsing into a subparser, which is now small enough for us to compile to bytecode if we bump up the IR limit
<headius> I'm running a benchmark locally and it seems like it does get compiled by C2 eventually, or much more quickly if I turn off tiered compilation
<headius> That seems like a good justification to bump our IR limit up
<enebo[m]> so how much do the bench numbers change?
<headius> I'm grabbing latest released parser now
<headius> which doesn't have these changes
<enebo[m]> I think the only reason we have the limit was that something was not JITting when larger, but who knows how much we have changed what we emit and whether we were ever close to having a method fail
<enebo[m]> Since that value was set I think some indy stuff has become default, so that would maybe give us more space
<headius> if it fails because it's too big for JVM it should error in ASM and just be rejected
<headius> but being small enough for JVM obviously doesn't guarantee it will native JIT ever
<enebo[m]> yeah
<enebo[m]> But we don't want it to bother if it is obviously too big so I guess perhaps that is the motivation
<headius> right now limit is at 1k IR instructions... the new string lexer advance is 2998 and seems to native JIT fine if I allow it to bytecode JIT
<headius> not sure how high we'd want to go but 3000 would be a good number, although right on the edge
<headius> 5000 would allow more things to compile but may get too big for JVM JIT
<headius> hard to know
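For reference, the limit under discussion is configurable at runtime via the jruby.jit.maxsize property (the -Xjit.maxsize flag used below). A minimal sketch of the kind of size gate involved; only the property name is JRuby's, the surrounding code is illustrative:

    public class JitSizeGateSketch {
        // jruby.jit.maxsize is a real JRuby property; this class is an
        // illustrative stand-in, not JRuby's actual implementation.
        static final int JIT_MAX_SIZE = Integer.getInteger("jruby.jit.maxsize", 1000);

        // Methods whose IR instruction count exceeds the limit stay in the
        // interpreter: even if ASM could emit the bytecode, HotSpot may never
        // native-JIT a huge method body, so they are skipped up front.
        static boolean shouldCompile(int irInstructionCount) {
            return irInstructionCount <= JIT_MAX_SIZE;
        }

        public static void main(String[] args) {
            System.out.println(shouldCompile(2998)); // false at the default of 1000
        }
    }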
<headius> oh parser 3.2 was released recently, so I'll go back to the November release to compare
<headius> he must have decided to push out some of the improvements already
<enebo[m]> I believe there are a bunch of tags which can get untagged
<headius> released parser gets down to about 0.291s per iteration of this bench
<headius> some time after that best result, it deoptimized to 0.38s and never came back... might have if I'd let it keep going
<headius> the tail is very long on these peaks though
<headius> parser master with jit.maxsize=1000 gets to about the same time per iteration... and also deoptimizes partway through
<headius> I'm letting it go for a bit longer
<headius> doesn't seem like it's coming back
<enebo[m]> so something is compiling but the overall speed didn't change?
<headius> well a lot more is compiling since many methods have also been outlined, but this is without any of the advance methods compiling
<headius> I'm trying a run with jit.maxsize=5000 now to allow at least the string lexer advance to JIT
<enebo[m]> but the before/after numbers are still more or less the same or is it faster?
<headius> about the same
<enebo[m]> ok
<headius> I ran a profile earlier and of course it's dominated by our interpreter
<headius> stuff is still compiling this run but it is already faster, 0.28s range
<enebo[m]> yeah, the non-compiled parts calling newly outlined compiled parts may be about the same or a small loss in many cases, since those methods may not have been doing much work to begin with
<headius> into 0.27s range occasionally
<enebo[m]> JIT verbose logging with a before/after comparison might help with tweaking sizes
<enebo[m]> If for no other reason than to see if the increased size causes more failures
<headius> yeah, before/after shows just the two advance methods being too large
<headius> the one is 2998 and the other is 9k something
<enebo[m]> what over 9000
<headius> so they have reduced a lot from these changes and he has more planned (pulling kwarg parsing out)
<enebo[m]> pattern matching would also be worth pulling if possible
<headius> yeah so 0.280 seems like stable peak with jit.maxsize=5000
<enebo[m]> since it is like a replicated part of 20% of the parser
<headius> ah yeah could be
<headius> I am using JVM +PrintCompilation to get an idea when it has finished
<headius> last run I will try for now is with JVM node limits bumped up, I think more could inline then and I saw 0.25-0.26 ranges on a run earlier
<enebo[m]> I would be surprised if warmup time is a big issue for this gem
<enebo[m]> like cold perf probably is most likely use right?
<headius> it takes much longer to get there but then it suddenly jumps from like 1s per iteration to 0.25
<enebo[m]> I am sure there is a server somewhere continually reparsing but I would expect more short runs
<headius> still chewing, 1.3s
<headius> chew chew chew
<enebo[m]> LSP or something like that could run a long time
<headius> I wouldn't have seen this speedup if I hadn't just set it aside and let it run; it takes a long time
<enebo[m]> It is good to see it improving nonetheless
<headius> boom, there it goes, probably around some C2 limit
<headius> it's hard to quantify the gain from JIT because the main advance method still bogs everything down so much
<headius> without everything compiling it just skews the numbers
<enebo[m]> yeah
<headius> hey graal jit is faster at something big finally
<headius> down as low as 0.240 after some time
<headius> still improving, 0.237
<enebo[m]> probably runs our interp better
<headius> took forever to get there but graal EE is down to 0.232
<headius> still stuff compiling and recompiling... gonna let this run while I get some tea
<headius> well after a very long run it's as low as 0.225
<headius> not sure how valuable that is but it's a number
<headius> CRuby is around 0.152
<headius> I wonder if any amount of tweaking would let the other advance method compile
<headius> "instruction count 9692 exceeds threshold of 3000
<enebo[m]> HEH we could look at making megahelpers in IRRuntimeHelpers so minimal asm instrs are emitted
<enebo[m]> not sure how much we can do that
<headius> I kinda do that for almost everything right now
<enebo[m]> It usually is just pushing a few values and calling one method but not everywhere
<headius> the bytecode is almost always load, load, load, indy, load load load, indy
<headius> but I haven't audited the bytecode JIT in a while to see if anything could be smaller
<headius> I am just eager to see what this could do if it all JITs
<enebo[m]> yeah almost none of this is anything but setup and a single call
<enebo[m]> some more complexity with some kwargs stuff but very little
<enebo[m]> I suppose some temp writes and then reads could be removed as a peephole optimization.
<enebo[m]> I thought I had added some of that somewhere but maybe that never landed
<headius> yeah it could help some
<enebo[m]> ah yeah canOmitLoadStore but very very limited
<enebo[m]> looks like perhaps only for eqq
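A sketch of the store/load peephole being discussed; the Insn record and the pass itself are hypothetical stand-ins, not JRuby's canOmitLoadStore, which does a much more limited form of this:

    import java.util.*;

    public class PeepholeSketch {
        // Minimal stand-in for a bytecode-ish instruction: op name + local slot.
        record Insn(String op, int slot) {}

        // Drop an astore immediately followed by an aload of the same slot,
        // leaving the value on the JVM stack. (A real pass must also prove the
        // slot is never read again; this sketch skips that check.)
        static List<Insn> omitStoreLoadPairs(List<Insn> insns) {
            List<Insn> out = new ArrayList<>();
            for (int i = 0; i < insns.size(); i++) {
                Insn cur = insns.get(i);
                if (i + 1 < insns.size()
                        && cur.op().equals("astore")
                        && insns.get(i + 1).op().equals("aload")
                        && insns.get(i + 1).slot() == cur.slot()) {
                    i++; // skip both; the value is already on the stack
                    continue;
                }
                out.add(cur);
            }
            return out;
        }

        public static void main(String[] args) {
            List<Insn> in = List.of(new Insn("astore", 7), new Insn("aload", 7), new Insn("indy", 0));
            System.out.println(omitStoreLoadPairs(in)); // [Insn[op=indy, slot=0]]
        }
    }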
<headius> hmm this is odd
<headius> AsString is compiled as doing a call plus calling IRubyObject.asString
<enebo[m]> rearranging the stack would itself make instrs, so doing a better job could end up with more instrs
<headius> interpreter only loads receiver and calls asString
<enebo[m]> heh
<headius> AsString Instr
<enebo[m]> interesting
<enebo[m]> it does the call if it might be refined
<headius> yeah only then
<enebo[m]> headius: the call looks like a merge mistake?
<enebo[m]> invokeOther was from a big merge from jruby-9.2?
<headius> hmm
<enebo[m]> I find these interesting because clearly there is missing coverage, but interp gets the lion's share of testing
<headius> this could be done better with a smart indy site that skips the call when it's already a string
<headius> unless refined
<headius> which is basically what the interpreter is trying to do
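A runnable sketch of that indy shape, using MethodHandles.guardWithTest to skip the dynamic call when the value is already a string; the RubyString class and helper names here are hypothetical stand-ins, not JRuby's actual types:

    import java.lang.invoke.*;

    public class AsStringSiteSketch {
        // Hypothetical stand-in for JRuby's string type.
        static class RubyString {
            final String value;
            RubyString(String v) { value = v; }
        }

        static boolean isRubyString(Object o) { return o instanceof RubyString; }
        static Object identity(Object o) { return o; }                            // already a string: no call
        static Object callToS(Object o) { return new RubyString(o.toString()); }  // stand-in for a cached to_s call

        public static void main(String[] args) throws Throwable {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodType objToObj = MethodType.methodType(Object.class, Object.class);

            MethodHandle test = lookup.findStatic(AsStringSiteSketch.class, "isRubyString",
                    MethodType.methodType(boolean.class, Object.class));
            MethodHandle fast = lookup.findStatic(AsStringSiteSketch.class, "identity", objToObj);
            MethodHandle slow = lookup.findStatic(AsStringSiteSketch.class, "callToS", objToObj);

            // If the value is already a RubyString, pass it through untouched;
            // otherwise fall back to the (cached) to_s path.
            MethodHandle site = MethodHandles.guardWithTest(test, fast, slow);

            System.out.println(((RubyString) site.invoke(new RubyString("hi"))).value); // hi (fast path, no call)
            System.out.println(((RubyString) site.invoke(42)).value);                   // 42 (via to_s path)
        }
    }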
<enebo[m]> Am I misreading this?
<enebo[m]> JIT is basically calling asString but then doing the concrete one a second time
<enebo[m]> So isn't this just filling up the stack?
<enebo[m]> oh...heh it is just doing it twice
<enebo[m]> since the second direct call is calling it on the result of the first dyn one
<headius> yes it is
<headius> it does the call to cache the to_s like a real call, and then the asString because it wants to make sure it is a RubyString
<headius> I'm not sure that is exactly right
<headius> right as in correct logic
<enebo[m]> yeah
<enebo[m]> and if it needs to do something like this then perhaps a helper would be better
<enebo[m]> (obviously indy works too)
<enebo[m]> The helper is just convenient so both jit and interp can call the same thing
<headius> it's tricky to make the helper do a cached call to to_s
<enebo[m]> ah yeah
<enebo[m]> well point stands but that is an issue
<enebo[m]> my other question about asString is where that normally occurs... "foo".freeze?
<enebo[m]> How often does this show up in a hot path
<headius> yeah I'm not sure
<headius> that commit says it is used for interpolation but I'm not sure it is anymore
<headius> I think we have a new instr for interpolating strings
<enebo[m]> oh well I do see it for dynamicPiece in IRBuilder
<enebo[m]> but only if EvStrNodes
<enebo[m]> So "foo #{bar} har"
<enebo[m]> result of bar will asString
<headius> ok
<enebo[m]> So this could be fairly hot in the right app
<headius> I see build_compound_string instr
<enebo[m]> I believe there is an improvement that could be made here
<enebo[m]> The heuristic returns an estimatedSize of 4 or the size of the last encountered StrNode
<enebo[m]> I believe this should be adding all StrNodes together
<enebo[m]> in my example above foo<sp> and <sp>har would be 8 but it will return 4 from <sp>har (last one wins)
<headius> how do you get an EvStr
<headius> your example above is not using AsString
<headius> oh it only uses AsString when refined now maybe?
<enebo[m]> jruby -S ast -e '"foo #{bar} gar"'
<headius> } else if (pieceNode instanceof EvStrNode) {
<headius> if (scope.maybeUsingRefinements()) {
<enebo[m]> %v_3 = build_compound_string(frozen:"foo ", %v_2, frozen:" gar")
<enebo[m]> hmm
<enebo[m]> interesting
<headius> otherwise it uses BuildDynamicString
<enebo[m]> yeah I did not look at what is calling this particular method
<enebo[m]> weird I see this for DStrNode which contains an EvStrNode
<headius> I got it to trigger with "using" nearby
<headius> so it is not used other than refined calls right now
<headius> maybe unnecessarily
<enebo[m]> I am still confused how this works now...buildDStr will individually call dynamicPiece on each part
<enebo[m]> One part should be an EvStr
<enebo[m]> oh I see...and you said it too
<enebo[m]> it only happens for potential refinements
<enebo[m]> Otherwise it unwraps the EvStr and just returns what is within it (e.g. Call)
<headius> yeah and then BuildDynamicString does the dynamic bits for all elements
<headius> I will look at that too
<enebo[m]> My comment earlier is wrong about estimatedSize too since this is getting called repeatedly
<enebo[m]> The signature makes it look like it is processing the whole thing
<enebo[m]> but that is to allow return of estimate AND to change out piece
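A rough sketch of that per-piece shape, with hypothetical names standing in for IRBuilder's actual API: each call hands back a (possibly replaced) piece together with its size estimate, and the caller accumulates across segments:

    import java.util.ArrayList;
    import java.util.List;

    public class DynamicPieceSketch {
        // Hypothetical stand-ins: a piece is either a static String or a
        // dynamic #{...} segment (any other Object here).
        record PieceAndSize(Object piece, int size) {}

        // One call per segment, mirroring the shape described above: it may
        // swap out the piece AND report a size estimate, so both come back
        // together (the multiple-return awkwardness noted below).
        static PieceAndSize dynamicPiece(Object node) {
            if (node instanceof String s) return new PieceAndSize(s, s.length());
            return new PieceAndSize(node, 4); // guess of 4 for a dynamic segment
        }

        public static void main(String[] args) {
            List<Object> pieces = new ArrayList<>();
            int estimate = 0;
            for (Object node : List.of("foo ", new Object(), " har")) {
                PieceAndSize p = dynamicPiece(node);
                pieces.add(p.piece());
                estimate += p.size(); // caller accumulates the running estimate
            }
            System.out.println(estimate); // 12: 4 + 4 + 4
        }
    }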
<headius> ALOAD 0
<headius> ALOAD 11
<headius> INVOKEDYNAMIC asString
<headius> ALOAD 2
<enebo[m]> I blame Java on not giving us a nicer mechanism for multiple returns
<headius> originally
<headius> so it saves the interface call and will do nothing at all if the object is already string
<headius> I might as well commit this to a branch even though it only affects refined code
<enebo[m]> headius: it is also called in any call of .freeze or -@
<headius> bleh, refinements don't use indy yet
<headius> I guess I can just put the optimization in place and it will work once we do that
<headius> getting indy support for refined sites would probably be a big step forward... there seems to be more and more refinement use out there
<enebo[m]> yeah
<headius> block_given? calls could be optimized to just look at passed-in block for normal method scopes
<headius> that would stop them deoptimizing
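A minimal sketch of that shortcut, assuming a Block stand-in (the real JRuby Block does have an isGiven check); the surrounding shape is illustrative:

    public class BlockGivenSketch {
        // Hypothetical stand-in for JRuby's Block.
        static class Block {
            static final Block NULL_BLOCK = new Block();
            boolean isGiven() { return this != NULL_BLOCK; }
        }

        // In a normal method scope, block_given? reduces to inspecting the
        // block argument passed into this very method, avoiding the frame
        // read that can force deoptimization.
        static boolean blockGiven(Block passedBlock) {
            return passedBlock.isGiven();
        }

        public static void main(String[] args) {
            System.out.println(blockGiven(Block.NULL_BLOCK)); // false
            System.out.println(blockGiven(new Block()));      // true
        }
    }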
<headius> yeah I'm down to compound array and string now and they could be shrunk down a lot with indy
<headius> assuming it's ok to build all the inserts and then do all the concatenation at the same time
<headius> given N inputs, it loops over each doing either "cat" or "appendAsDynamicString" which does an uncached "to_s" via asString
<headius> that would be a single indy instruction receiving all N inputs, or N indy instructions doing it all cached but keeping order of operations
<headius> heh well here's a big one
<headius> every heap variable access
<headius> I can get that down to just the one indy
<headius> well, and the two loads
<headius> so that's another invokevirtual bytecode removed for every access
<headius> hmm only affects blocks but it's something
<headius> method bodies do not need to check null and return nil because of liveness checking
<enebo[m]> dark matter
<headius> it will eliminate the depth and offset pushes for blocks too
<headius> ok that's another improvement
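A runnable sketch of the heap-variable idea: folding the constant depth and offset into the call site with MethodHandles.insertArguments, so the emitted bytecode shrinks to one load plus the indy. The Scope class is a hypothetical stand-in for JRuby's DynamicScope:

    import java.lang.invoke.*;

    public class HeapVarSiteSketch {
        // Hypothetical stand-in for JRuby's DynamicScope.
        static class Scope {
            Object[][] frames = { { "a" }, { "b" }, { "c", "d" } };
            Object getValue(int offset, int depth) { return frames[depth][offset]; }
        }

        public static void main(String[] args) throws Throwable {
            MethodHandle getValue = MethodHandles.lookup().findVirtual(Scope.class, "getValue",
                    MethodType.methodType(Object.class, int.class, int.class));

            // Bind offset/depth as constants at link time. The call site's
            // bytecode is then just "aload scope; invokedynamic" instead of
            // pushing two ints and doing an invokevirtual on every access.
            MethodHandle site = MethodHandles.insertArguments(getValue, 1, 1, 2);
            System.out.println(site.invoke(new Scope())); // "d": offset 1, depth 2
        }
    }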
<lavamind[m]> happy to share news that, barring any major issues, jruby 9.3.9.0 will be part of the next Debian stable release
<lavamind[m]> after being absent from the previous release
<headius> lavamind: that is excellent news!
<headius> enebo: for code accessing the 11th local variable, 3 levels down...
<headius> before:
<headius> oops
<headius> bug
<enebo[m]> The Windows irb thing is fixed by, and explained by, renaming Win32API to win32api. The classpath is case-sensitive, and whatever I was doing the other day must be wrong, because I can see it can load the mixed-case file in an installed location. 9.3 and earlier were not affected because reline, which is what calls this directly, was not used
<enebo[m]> well that was not what I expected to see :)
<headius> ok
<headius> we need to get off that file but the reline code uses fiddle and I'm not sure it's up to it
<enebo[m]> it is fixed enough to not worry about that for a point release
<headius> yeah
<headius> ok so I think I have it working
<headius> before:
<headius> after:
<headius> for the simplest case, it will be ALOAD 6, INVOKEDYNAMIC rather than ALOAD 6, INVOKEVIRTUAL
<headius> perhaps could drop the indy in that depth 0, location < 10, no nil case
<enebo[m]> I wish we had good tooling on how big things are relative to what HotSpot wants
<headius> the tooling is out there but I don't know how to use all of it
<enebo[m]> but even asking, I never got a clear answer on what the actual limits refer to
<enebo[m]> It is not necessarily incoming bytecode, but trivially dead things sound like they are included, which implies it is literally bytecode
<headius> there's a lot of hand-waving for sure
<enebo[m]> peephole does not matter until it does
<enebo[m]> unless you ask someone else and then perhaps the answer is different
<enebo[m]> I think the sands of time have complicated things where it is complicated :)
<enebo[m]> I have also wondered if anyone has a stack analysis tool which could rewrite the given instrs
<headius> we can reduce callInfo bytecode by adding a method for each combination of flags but unsure if it is worth it
<headius> since we'll eliminate this eventually
<headius> right now it is [load context, load flags int, setCallInfo]
<enebo[m]> well it probably won't need to be bytecode but it will live somewhere (likely in callsite/indy equiv)
<headius> it could be just [load context, setCallInfo for these flags]
<enebo[m]> I thought about pushing it through a helper on one side and adding it to indy on the other but it is a lot of work for something which likely will change
<enebo[m]> but the notion of flags will exist
<headius> oh I guess I also talked about rolling this into the call
<enebo[m]> you need to know if the callee and caller agree on what was passed and what it accepts
<headius> which would just setCallInfo for now but eventually the flags just go along the call path
<headius> so in that case indy would eliminate the CallInfo bytecode altogether
<headius> yeah we discussed this and the other sticking point was rolling CallInfo instr into CallBase as an operand
<headius> or just a field
<headius> so I can use it when I do the call
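A small sketch of what rolling the constant flags into the site could look like, again via insertArguments; the ThreadContext and setCallInfo here are hypothetical stand-ins for JRuby's:

    import java.lang.invoke.*;

    public class CallInfoSketch {
        static class ThreadContext { int callInfo; }  // hypothetical stand-in
        static void setCallInfo(ThreadContext ctx, int flags) { ctx.callInfo = flags; }

        public static void main(String[] args) throws Throwable {
            MethodHandle setCallInfo = MethodHandles.lookup().findStatic(
                    CallInfoSketch.class, "setCallInfo",
                    MethodType.methodType(void.class, ThreadContext.class, int.class));

            // Fold the constant flags into the handle: the emitted bytecode
            // becomes [load context, indy] instead of
            // [load context, load flags int, invoke setCallInfo].
            MethodHandle site = MethodHandles.insertArguments(setCallInfo, 1, 0b101);
            ThreadContext ctx = new ThreadContext();
            site.invoke(ctx);
            System.out.println(ctx.callInfo); // 5
        }
    }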
<enebo[m]> I have a branch where I am trying to eliminate the bifurcation so that zero kwargs is not important
<enebo[m]> It passes both suites now but it errors out with **nil
<enebo[m]> so that has to be reimagined a bit
<headius> hmm
<enebo[m]> I am trying to make the state completely static
<enebo[m]> CALL_KEYWORD_EMPTY is dynamic
<enebo[m]> so by always being capable of processing it we can eliminate it, at the cost of being able to ignore it (or dup it), which means callInfo is just a few static flags which can be used by the call infra
<enebo[m]> IRBuilder does mark empty when it is static.
<enebo[m]> I am "fixing" this: https://github.com/jruby/jruby/issues/7484
<enebo[m]> lopex pointed out the code we commented out which created the slowdown, and I am trying to figure out if anything is broken anymore
<enebo[m]> There was one test added in the revert of the code which slowed things down, and that appears to work now, so?
<headius> aha
<headius> yeah ship it
<headius> I'll get back to random
<enebo[m]> running test:mri right now locally
<headius> sidetracked with bytecode size but it's clear we can shrink a lot of stuff down
<enebo[m]> 0.005s to 40s is probably pretty disappointing :)
<headius> yeah so final verdict on compound string... if order of operations is not important, or there's only one interpolated value, then it could be reduced down to a single indy call
<headius> "foo #{bar}" would be an indy call that embeds the "foo " string contents and passes in bar and gets a string back
<enebo[m]> but order would in general be important no?
<headius> so basically load context, load bar, indy
<headius> the order is important for only one reason I can think of: if an element returns a value that can't be appended
<enebo[m]> I guess with one interpolation in a string, no
<headius> I shouldn't say order in general
<headius> I'd still evaluate the elements in order, but all at once before concatenating
<enebo[m]> foo #{bar} #{gar}
<enebo[m]> oh I see... you mean not building up the intermediate string in order
<headius> if bar.to_s returned something bogus it should error before getting to gar
<headius> or if bar.to_s just errored
<enebo[m]> yeah and you could type check and still do this
<enebo[m]> so the bogus part may not be a problem
<headius> but if it evaluates gar before bar errors that is a difference
<headius> is it a difference that matters?
<enebo[m]> definitely
<enebo[m]> if they both modify object state
<headius> JVM folks got bug reports because their indy string interpolation does it this way
<headius> evaluate all elements and then slam them together
<headius> yeah unclean methods would be a problem
<headius> still this would be a good optimization for single element interpolation at least
<enebo[m]> ruby is dirty
<headius> and for multiple elements, it would cache to_s calls better for each
<enebo[m]> those methods are covered in dirt
<headius> so it would be target string, push value, indy for to_s + concat, ...
<headius> for each interpolated element
<enebo[m]> yeah...that is how it works in MRI if you watch ObjectSpace
<headius> right now it is push target string, push value, uncached to_s, concat
<enebo[m]> crazy number of Ruby Objects made from interpolation
<enebo[m]> we are just using Java Objects
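A compact sketch of the cached per-element shape headius describes, preserving left-to-right order of operations for "foo #{bar} #{gar}"; cachedToS here is a map-based stand-in for what would really be a per-site indy cache:

    import java.util.*;
    import java.util.function.Function;

    public class InterpShapeSketch {
        // Stand-in for a cached to_s: the real thing would be an indy call
        // site caching the to_s lookup per receiver class.
        static final Map<Class<?>, Function<Object, String>> TO_S_CACHE = new HashMap<>();

        static String cachedToS(Object o) {
            return TO_S_CACHE.computeIfAbsent(o.getClass(), c -> Object::toString).apply(o);
        }

        public static void main(String[] args) {
            Object bar = 42, gar = "x";
            // Same order of operations as today, but each to_s goes through a
            // cache: bar's to_s (and any error it raises) still happens before
            // gar's, unlike an evaluate-everything-then-join shape.
            StringBuilder buf = new StringBuilder();
            buf.append("foo ").append(cachedToS(bar)).append(' ').append(cachedToS(gar));
            System.out.println(buf); // foo 42 x
        }
    }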
<headius> we could also profile this to pick a better initial string size
<headius> right now it starts at zero size so it will immediately realloc based on first element of compound string
<headius> if it could say "last time this string was 50 wide, I'll do that again" it could save a bunch of intermediate allocs and copies
<enebo[m]> the buildPiece logic uses 4 doesn't it?
<enebo[m]> Or is that not used
<headius> not sure in interp
<enebo[m]> 4 or size of all static pieces
<headius> JIT just pushes an "empty string" on the stack to start
<enebo[m]> oh well that info I think is in the instr
<enebo[m]> It is being calculated
<headius> ah yeah
<headius> estimatedSize
<headius> never got into JIT
<enebo[m]> That likely would be a decent improvement right there
<headius> yeah
<enebo[m]> Although 4 seems meager, I suppose that is virtually never used anyway
<headius> I'll do that on this branch
<enebo[m]> 4 per piece or the size of the piece when it is known
<enebo[m]> interesting. we build from 0
<headius> build from 0?
<enebo[m]> 0 length buffer
<headius> ah
<headius> well I updated JIT to use the estimated size * 1.5 for string interp
<enebo[m]> I imagine the more segments the better this plays out
<headius> yeah could be
<enebo[m]> not too hard to imagine but I do want to see the microbench :)
<headius> profiled would be cooler
<enebo[m]> oh yeah profiled could also remove waste memory
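A minimal sketch of the profiling idea, assuming a per-site record of the last observed result size; the class and names are hypothetical:

    // Hypothetical per-site profile: remember how wide this interpolation came
    // out last time and start the next buffer there, instead of growing from a
    // zero-length buffer through repeated reallocs and copies.
    public class InterpSiteProfile {
        private volatile int lastSize = 16; // floor; replaced after the first run

        public StringBuilder newBuffer() {
            return new StringBuilder(lastSize);
        }

        public void record(CharSequence finished) {
            lastSize = Math.max(16, finished.length()); // feed back for next time
        }
    }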
<enebo[m]> headius: anything about updated jcodings which would affect 9.3?
<headius> Hmm
<headius> Nothing I can think of
<headius> dregexp could be shoved into a single indy instr too