<kalenp[m]>
Good morning JRuby friends. Kalen from Looker here. We're in the process of upgrading from 9.2 to 9.3 and are investigating a potential memory issue. Looking to see if you might have some tips for investigation so we can confirm if it's an issue on our end or yours.
<kalenp[m]>
So first an error about running out of heap space, then the CodeHeap warnings. The second might just be a side effect of the first, but maybe somebody with more understanding of those systems can tell whether or not it is.
<kalenp[m]>
Are there any known issues or changes between 9.2 and 9.3 that would lead to increased memory usage?
<enebo[m]>
kalenp: there shouldn't be a difference but a lot of code has changed over time
<enebo[m]>
kalenp: I imagine 9.4 is too big a jump for your app? I only ask because there is a lot more effort on 9.4 in the last year or two
<enebo[m]>
A heap dump earlier and later may expose what is growing enough to get a clue
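[One way to get the "earlier and later" dumps described above, assuming a JDK whose jcmd is on the PATH; <pid> and the file paths are placeholders:]
    jcmd <pid> GC.heap_dump /tmp/heap-early.hprof
    # ... let the suspect workload run for a while ...
    jcmd <pid> GC.heap_dump /tmp/heap-late.hprof
[Comparing the two .hprof files by class histogram in VisualVM or Eclipse MAT shows which classes are growing.]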
<kalenp[m]>
yeah, we're not quite ready to do the 9.4 jump. even 9.3 required updating quite a few gems and addressing bugs, so we're taking it stepwise
<enebo[m]>
yeah makes sense
<enebo[m]>
Since OOME is first I don't think we can really trust what is reported later
<enebo[m]>
9.2 -> 9.3 is largely just incremental changes but so many moving parts
<kalenp[m]>
this might be a general jvm question, but I added some logging for Total and Free memory during the run and don't see it approaching 4GB at any point. are there other rough metrics I could/should be watching before trying to take a full heap dump?
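[A minimal JRuby sketch of the kind of Total/Free logging described above; illustrative only, not the actual Looker code:]
    require 'java'

    rt    = java.lang.Runtime.getRuntime
    total = rt.totalMemory          # bytes currently committed to the Java heap
    free  = rt.freeMemory           # unused bytes within that committed amount
    max   = rt.maxMemory            # the -Xmx ceiling
    puts format("heap: used=%dMB total=%dMB max=%dMB",
                (total - free) >> 20, total >> 20, max >> 20)
[Note that Runtime only reports the Java heap, so a non-heap leak (metaspace, CodeHeap) would not show up in these numbers.]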
<enebo[m]>
when you say that, do you mean it is sawtoothing up and down against the 4GB limit but collections knock it back down low, or do you mean it never gets close to 4GB?
<enebo[m]>
the JVM tends to be really lazy generally, so I expect it to use a lot of that heap in order to do fewer collections
<kalenp[m]>
Total is never reported above 2GB. Threw in a GC to get more stable numbers, because it was sawtoothing a lot and it was hard to see what was actually live.
<enebo[m]>
headius: you around?
<enebo[m]>
We have had some issues over time where we leak in a special way that is non-heap, but it is uncommon
<enebo[m]>
This would be a weird test but you could run with JIT disabled and see if you see the problem
<kalenp[m]>
Looker is good at hitting those sort of bugs historically :)
<enebo[m]>
non-heap memory issues would be us doing something wrong, like re-making the same method over and over but not losing the reference to the old one, so you would see zillions of classloaders (and their non-heap counterpart growing)
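[If that failure mode is suspected, a class histogram of the running process makes the classloader population visible; <pid> is a placeholder:]
    jmap -histo:live <pid> | grep -i classloader
[A count that keeps growing between samples, rather than one that is merely large, is the signal to look for; as noted below, JRuby legitimately uses one classloader per JITted method.]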
<enebo[m]>
If -X-C (or --dev) does not cause any issues then something in the JIT is getting tripped up, but since the interpreter is slower it will probably take more time to figure out whether it is actually ok
<enebo[m]>
If you are using -Xcompile.invokedynamic that would be a case of more generated code. I do not expect that it is buggy, just thinking out loud
<kalenp[m]>
already in slow repro land. tests take 30 minutes to start failing, plus build times in CI. so trying to get a few different ideas going in parallel
<enebo[m]>
hooking up with visualvm may show you something on one of the pages
<enebo[m]>
I am on a new laptop and have nothing helpful setup yet beyond my IDE :)
<enebo[m]>
I think some of the mbeans will show how many methods have been JITted (in Java). JRuby has some beans showing how much we JIT in Ruby. It should also show how many classloaders there are
<enebo[m]>
for JIT we use one classloader per JITted method, so it won't be a small number
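[A small JRuby sketch of reading those numbers from the standard platform MBeans; the JRuby-specific beans can be browsed the same way in VisualVM/JConsole once JRuby's JMX management is enabled:]
    require 'java'

    mf = java.lang.management.ManagementFactory
    puts "loaded classes: #{mf.getClassLoadingMXBean.getLoadedClassCount}"

    # Non-heap pools (Metaspace, Compressed Class Space, the CodeHeap segments)
    # are listed here alongside the heap generations.
    mf.getMemoryPoolMXBeans.each do |pool|
      usage = pool.getUsage
      puts format("%-35s used=%dMB", pool.getName, usage.getUsed >> 20)
    end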
<kalenp[m]>
ok, so some things to try: kick off a run with -X-C (is that an argument to jruby directly, or a java argument?); run it normally and attach visualvm to look for classloaders or other unexpected outliers being allocated
<enebo[m]>
we also have a very long tail for JITTing since we use method call count as the metric (I think 50 calls by default)
<enebo[m]>
-X-C is an argument to JRuby, not Java
<enebo[m]>
-Djruby.compile.mode=OFF is what it does
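[The same switch spelled a few different ways; "script.rb" stands in for however the app is actually launched:]
    jruby -X-C script.rb                   # disable the JRuby JIT
    jruby --dev script.rb                  # dev mode: JIT off plus faster-startup JVM settings
    jruby -Xcompile.mode=OFF script.rb     # property form of the same thing
    JRUBY_OPTS="-X-C" bundle exec rspec    # or via the environment, e.g. in CI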
<kalenp[m]>
cool. I can go get those things started and see if I get some more data. I'll keep this open in case Headius jumps in with more ideas. thanks for the tips!
<enebo[m]>
interestingly, we still compile methods at the IR level with OFF, but it does not generate Java bytecode; it just builds more complicated IR
<enebo[m]>
-Djruby.jit.threshold=-1 will disable doing that but I doubt it matters in this case
<enebo[m]>
kalenp: if you see it happen at a consistent enough point in time and it is a function of JIT then changing threshold to let's say 100 from default of 50 would cause the OOME to report later
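[For the threshold experiments above, again assuming a plain jruby launch:]
    jruby -Xjit.threshold=100 script.rb    # JIT a method only after 100 calls (default 50)
    jruby -Xjit.threshold=-1  script.rb    # -1 disables that promotion entirely (see enebo's note above)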
<enebo[m]>
but we don't know it is JIT.
<enebo[m]>
The other angle would be native exceptions but then I think it would be untracked memory and you would not see OOME but you would hit some setrlimit process size thing
<enebo[m]>
or I think so anyways
<kalenp[m]>
oh, another thing which I noticed is that this is our coverage CI job, so it's running with --debug to get coverage data. for our non-coverage tests, we're not seeing OOME, but it's also broken into smaller slices. 9.2 worked even for coverage, but it's another piece of the matrix here
<headius>
hey hey
<headius>
non-heap leak, you say?
<enebo[m]>
hmm coverage
<headius>
OOM is only heap so that's just a leak or it's using more memory than available heap
<enebo[m]>
I would say if it was coverage specifically then it would just be a heap problem, but --debug does more than just enable coverage
<headius>
the CodeHeap stuff may be related but I would not expect that unless the OOM is a symptom of the host system not having enough memory
<enebo[m]>
so they do not see much watching it while it runs, but something ends up using a lot of heap
<headius>
yeah this OOM says heap space so I'd expect a normal sort of leak on the heap for that
<headius>
it would look different if it were failure to allocate more native memory or something
<enebo[m]>
I fixed a gnarly issue on 9.4 with coverage (which I could backport to 9.3). It is at least the third time I have tweaked around calls needing proper line number separate from profiling/coverage line numbers
<headius>
so I'd say a heap dump is the next step I'd take
<enebo[m]>
it is not this problem but a coincidence
<enebo[m]>
yeah if OOME can only be from heap issue then definitely
<enebo[m]>
non-heap is so rare I did not even realize that
<kalenp[m]>
looks like we're running with -J-XX:+HeapDumpOnOutOfMemoryError, but we're not actually saving those. working with our CI team to get those saved
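[The missing piece for CI is usually a dump path plus an artifact upload; the path here is only an example:]
    jruby -J-XX:+HeapDumpOnOutOfMemoryError \
          -J-XX:HeapDumpPath=/tmp/ci-heap-dumps \
          script.rb
    # On OOME the JVM writes java_pid<pid>.hprof into that directory;
    # the CI job then needs to archive it as a build artifact.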
<headius>
coverage has improved over time to track more/better data so I would not be surprised if it's using more memory, but I wouldn't expect it to be using 100s of MB more
<headius>
kalenp: ahh yeah nice
<headius>
you can also just use visualvm or jconsole to get a dump when it's obvious a process is on its way to OOM
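[The command-line equivalent, if attaching a GUI tool to the CI host is awkward; <pid> is a placeholder:]
    jmap -dump:live,format=b,file=/tmp/pre-oom.hprof <pid>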
<enebo[m]>
it will be the filename in every line as a separate char[]
<headius>
enebo: OOM is used for lots of things but the one kalenp posted explicitly has the heap error
<headius>
other types of OOM will say failure to allocate memory or thread or whatever
<headius>
or stack...I think you can see OOM for stack size exceeded
<enebo[m]>
I guess it literally says ran out of heap so lol
<enebo[m]>
I am hoping that, for calls at least, I never have to deal with this again
<headius>
CodeHeap could be related if we are jitting too much stuff or hanging on to transient jitted methods that should go away
<enebo[m]>
this did make me ponder what IR would look like if we baked the line number into each instr like we do for AST nodes
<enebo[m]>
It has a big advantage for interp by not having those instrs but a number of disadvantages too
<headius>
you mean not emitting LineNumber instrs?
<enebo[m]>
I suppose though for JIT since all instrs are in-order that is reasonably simple
<enebo[m]>
yeah
<enebo[m]>
for IR building it is complicated because operand building is not in order, but addInstr is in order
<headius>
for JIT it would make little difference... I just have a "current line" value and whenever I see a Line Number I update that and emit the line number bytecode stuff if it changed
<enebo[m]>
so I ended up comparing a lastLineNum field or whatever, and other wackiness
<enebo[m]>
It adds 4 bytes to each instr, but then we lose the line num instrs
<enebo[m]>
and it is possible there are a number of instrs which do not need line as a field but that is complicated
<headius>
I'm going to review that launcher PR from mrnoname so we can merge it
<enebo[m]>
yeah I just want it in sooner than later
<headius>
oh so the other order of business for me this week is to wrap up the mavengem stuff
<headius>
I still need to get the "bundler API" endpoints working, which will require some exploration
<enebo[m]>
sounds like you are running all tests now?
<headius>
I think we should move this under org.jruby groupID since the old one is org.torquebox.mojo and TB is defunct now anyway
<enebo[m]>
yes
<enebo[m]>
all the things should be in the org at this point
<headius>
I am "running" all tests, but the ones related to bundler API or still dependent on the dependencies?gems=foo,bar multiple result API are failing
<headius>
other than that everything works up the stack... the failing features are not used by mavengem itself
<enebo[m]>
I sort of feel like unless we get enough community mass around something we should just put it in this org
<headius>
I could leave them in place in that library and disable tests in a pinch
<headius>
nobody will be using the new group so they would not see any breakage... but of course it will stop working for those APIs anyway on Aug 8
<enebo[m]>
ah yeah but as far as you know nothing is using the bundler api stuff?
<headius>
nothing I know of right now
<headius>
it might be used by the maven-tools gem based on this rubygems-tools Java library
<enebo[m]>
heh...I suppose you could wait for the shoe to drop or just make it work
<headius>
this stack is large... it makes me realize how much integration work was going on around TB at the time
<enebo[m]>
I have been wondering about how much magic was lost in TB after Ben stopped working on it
<enebo[m]>
He was optimizing hard for the TechEmpower benchmarks
<headius>
yeah it was good work
<headius>
enebo: waiting on one review question for launcher fixes but otherwise it seems fine
<headius>
I have mrnoname on chat now
<headius>
enebo: org.jruby or org.jruby.maven or what for groupID?
<headius>
TB isolated this stuff under org.torquebox.mojo
<enebo[m]>
ok
<headius>
Ok what
<enebo[m]>
ok
<enebo[m]>
lol when you said you had mrnoname on chat I thought he was coming on here then forgot about it
<enebo[m]>
but he wasn't so it didn't matter
<headius>
So org.jruby.maven maybe
<headius>
This shouldn't be a top level artifact in org.jruby