andrewsmedina has quit [Ping timeout: 255 seconds]
andrewsmedina has joined #pypy
<arigo_>
cfbolz: let me just say, aaaaaaaa what?
<arigo_>
this bug seems easy to reproduce and was there since the introduction of incminimark, but wasn't discovered until now?
<arigo_>
(easy once you know it of course)
<cfbolz>
arigo_: yes :-(
<cfbolz>
arigo_: in practice you need to be extremely unlucky though, things need to happen in *just* the wrong order
<arigo_>
and writing references into prebuilt objects is kinda rare
<cfbolz>
yes
<cfbolz>
it's a hypothesis of mine, but maybe it started happening more because I added some more caching mechanisms into pycode (of which there are quite a few prebuilt ones)
<cfbolz>
arigo_: it's definitely an example where a model checker would be cool :-)
<cfbolz>
or maybe i should write a hypothesis stateful test
<arigo_>
I thought we already had some random tests
<cfbolz>
for some things like the minimarkpages, yes
<arigo_>
maybe the GC is already correct except for the assert
<arigo_>
but it would require proof
<cfbolz>
arigo_: yes, that's entirely plausible, except for the fact that I find crashes that are exactly freeing objects that are still reachable (in the same programs that fail with the assert earlier)
<cfbolz>
but I didn't manage to reproduce the crash in a small test
<cfbolz>
so I'm not entirely satisfied with the current situation yet
<cfbolz>
and so far it looks like the crashes go away with my changes too. but I would still like to find a test :-(
<arigo_>
the line that says "black -> white-but-prebuilt-so-dont-care", did you add that just now?
<arigo_>
ah I see, the flag NO_HEAP_PTRS is removed, so you get a "black -> white-but-prebuit-but-no-longer-NO_HEAP_PTRS"
<arigo_>
what did you change?
<cfbolz>
arigo_: what I changed is that if NO_HEAP_PTRS is removed, I also add the object to the to-be-marked list (objects_to_trace) if we're in the MARKING phase
<cfbolz>
arigo_: yes, I'm pretty sure it's not "wrong" conceptually
<arigo_>
maybe a different fix would be to record the length of self.prebuilt_root_objects at the start of a collection, and at the end, if it grew, trace the extra objects
<cfbolz>
what I'm doing now is to write a random test that would have found the same assertion error
<cfbolz>
arigo_: because right now it's tracing them all?
<arigo_>
ah, good question. maybe or maybe not, but what I said would still fix it, no?
<cfbolz>
yeah
<arigo_>
at worst it's looking a second time unnecessarily into some prebuilt objects, but at most once per prebuilt object for the lifetime of the process
<cfbolz>
yeah. I think this is essentially what my fix does too, at the cost of adding it to objects_to_trace (instead of saving the index)
<arigo_>
but you need to fix some _debug_objects_to_trace_dict1/2 too in DEBUG mode, no?
<cfbolz>
no, why?
<cfbolz>
_debug_objects_to_trace_dict1 is the dict version of objects_to_trace
<arigo_>
I'm just confused because you're adding stuff to objects_to_trace but not _debug_objects_to_trace_dict1
<cfbolz>
the debug mode is quite inefficient, it just recreates these dicts at every collection
<cfbolz>
it does this: self._debug_objects_to_trace_dict1 = self.objects_to_trace.stack2dict()
<arigo_>
ah, and the write barrier is outside collections, of course
<cfbolz>
yes
<cfbolz>
I also had to swap back in a lot 😅
<arigo_>
"XXX do we need to add the dest_addr to objects_to_trace too?" I would say yes
<cfbolz>
yeah, I think so too
<arigo_>
it's a really rare case, and it's safe to do so anyway
<cfbolz>
I wonder who copies things into prebuilt arrays ;-)
<cfbolz>
ok, a random test finds the assert immediately, good
<cfbolz>
if I undo the change
<cfbolz>
but so far I cannot find a crash if I turn debugging off
jcea has joined #pypy
<arigo_>
cfbolz: actually... I think that prebuilt_root_objects is going to be completely walked anyway at the end of marking?
<arigo_>
a bit confused
<arigo_>
doesn't this mean that there was never any problem in the GC? apart from the debug-mode asserts triggering when they shouldn't?
<arigo_>
I think it should have no effect, apart from adding the object to _debug_objects_to_trace_dict1 earlier, which silences the assert
luckydonald has quit [Quit: Uh oh, bouncer going down! (Survived for 2y 15w 1h 30m 42s, this time.)]
luckydonald has joined #pypy
<cfbolz>
arigo_: there is definitely a problem in the GC, because the nanobind tests crash with asserts turned off
<cfbolz>
so even if it "just" fixes the asserts, that sounds useful to me, because then I can try to re-run nanobind and get an *actual* assertion error
<cfbolz>
I agree with your argument, fwiw
luckydonald has quit [Quit: Uh oh, bouncer going down! (Survived for 23m 4s, this time.)]
luckydonald has joined #pypy
<cfbolz>
(the crash is unfortunately somewhat rare, so it will take a while to repeatedly run it and see whether it reoccurs)
<cfbolz>
arigo_: or would you instead simply change the assert to also count the objects in prebuilt_root_objects as grey?