<NishanthMenon>
uggh. no clue what happened there.. trying again..
<NishanthMenon>
sjg1: I think we may have forgotten to deploy func_test.py (the following log is from a container that did python -m pip install patch-manager) https://www.irccloud.com/pastebin/gO78fFtF/
Clamor has joined #u-boot
<xypron>
sjg1: Why does bind_drivers_pass() stop the boot process if a single driver's bind method indicates that the device is not available? For instance for drivers/rng/riscv_zkr_rng.c it would be preferable to detect the presence of the device in bind() instead of probe(). I think it is preferable to continue booting even if some device is not available.
<xypron>
sjg1: My problem was that I used ENODEV and not ENOENT. And this is documented in doc/develop/driver-model/design.rst.
hanetzer has quit [Ping timeout: 260 seconds]
<xypron>
sjg1: I see a lot of drivers using ENODEV. Just run git grep -n ENODEV drivers/rng/.
hanetzer has joined #u-boot
hanetzer has quit [Ping timeout: 258 seconds]
hanetzer has joined #u-boot
hanetzer has quit [Ping timeout: 255 seconds]
hanetzer has joined #u-boot
hanetzer has quit [Ping timeout: 240 seconds]
camus has joined #u-boot
camus has quit [Read error: Connection reset by peer]
camus1 has joined #u-boot
camus1 is now known as camus
mmu_man has joined #u-boot
Clamor has quit [Ping timeout: 255 seconds]
Clamor has joined #u-boot
mmu_man has quit [Ping timeout: 240 seconds]
___nick___ has joined #u-boot
___nick___ has quit [Client Quit]
___nick___ has joined #u-boot
hanetzer has joined #u-boot
apritzel_ has joined #u-boot
mmu_man has joined #u-boot
dsimic has quit [Ping timeout: 245 seconds]
dsimic has joined #u-boot
ikarso has joined #u-boot
<rfs613>
marex: upon closer inspection, it turns out it's not really shifted by 32 bytes. Rather, a few words happened to match at a 32-byte offset.
<rfs613>
marex: I had thought to try with cache disabled as the next step, but then $KIDS got home...
<rfs613>
marex: the DRAM passes memtest and also the u-boot memory test, but of course the latter doesn't check the region where u-boot code lives.
<rfs613>
marex: the other odd thing though is the problem goes away when I stick in some printfs, or tweak the u-boot configuration slightly (turn on some unrelated feature)
<rfs613>
almost like it only happens when the branch instruction is at a certain position (or maybe a certain alignment)... and perhaps the jump target also a certain position/alignment?
* rfs613
was reviewing the A7 errata to see if anything like that appears, but no luck there.
<rfs613>
there are certainly a few "interesting" errata, but none that seem relevant...
stipa has quit [Quit: WeeChat 3.0]
stipa has joined #u-boot
flom84 has joined #u-boot
slobodan has joined #u-boot
stipa has quit [Ping timeout: 258 seconds]
stipa has joined #u-boot
<marex>
rfs613: so the branch target where the CPU lands is wrong ?
<marex>
rfs613: could it be a compiler bug ?
<rfs613>
marex: prior to the branch being taken, the memory at the branch target is correct (according to JTAG dump), matches what objdump shows.
<rfs613>
but immediately after the branch, the same memory shows different contents (again according to JTAG)
<rfs613>
I didn't believe the JTAG at first, but if I single step, the registers are updating exactly according to the "wrong" instructions now in memory
<rfs613>
obviously it goes very wrong quite soon after that, and eventually triggers a data abort, prefetch abort, etc.
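The before/after comparison described above can also be done offline, which makes it easier to see how much memory changed and where. The following is a minimal sketch, assuming two raw little-endian dumps of the same region were saved from the debugger (for example with gdb's `dump binary memory` command). The function names, the base address, and the sample words are hypothetical and only illustrate the idea.

```python
import struct


def diff_words(before: bytes, after: bytes, base_addr: int = 0):
    """Return (address, old_word, new_word) for every 32-bit word that changed."""
    n = min(len(before), len(after)) // 4
    old = struct.unpack("<%dI" % n, before[: n * 4])
    new = struct.unpack("<%dI" % n, after[: n * 4])
    return [
        (base_addr + 4 * i, o, w)
        for i, (o, w) in enumerate(zip(old, new))
        if o != w
    ]


def summarize(changes):
    """Collapse changed words into contiguous (start, end_exclusive) address runs."""
    runs = []
    for addr, _, _ in changes:
        if runs and runs[-1][1] == addr:
            runs[-1][1] = addr + 4  # extend the current run by one word
        else:
            runs.append([addr, addr + 4])  # start a new run
    return [tuple(r) for r in runs]


if __name__ == "__main__":
    # Synthetic stand-ins for two dumps; real data would come from files.
    before = struct.pack("<4I", 0x11111111, 0x22222222, 0x33333333, 0x44444444)
    after = struct.pack("<4I", 0xDEADBEEF, 0x22222222, 0x33333333, 0x44444444)
    # 0x4FF00000 is an arbitrary example base address, not taken from the log.
    for addr, old, new in diff_words(before, after, base_addr=0x4FF00000):
        print("0x%08x: 0x%08x -> 0x%08x" % (addr, old, new))
```

The run summary is the interesting part: a single changed word points in a different direction than a changed range that spans a whole line or page.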
<rfs613>
marex: I've mostly ruled out a compiler bug, because the exact same binary works on another hardware variant - which has 1GB DDR rather than 256MB.
<rfs613>
the memory size however doesn't seem to be the difference, because I have hardcoded the memory size at 256M, and so the 1GB board runs with the same layout as the 256M board. And yet it still fails on the 256M board, while the 1G board boots fine (exact same binaries on each).
<marex>
rfs613: ah
<marex>
rfs613: so, DRAM controller configuration might be weird somehow for the 1 GiB chip ?
<marex>
rfs613: which controller is this, synopsys/designware or cadence ?
<rfs613>
marex: it's working on the 1GB DRAM, when it is configured for either 1GB or 256M (not just size, but all the other DDR controller parameters)
<rfs613>
there are two sets of tuning parameters (for 256M and 1G respectively) under include/renesas/
<rfs613>
I removed the run-time DDR detection, so use the fixed 256M params on both boards, this boots without problem.
<rfs613>
on the 1GB board. But hangs on the 256M board.
<rfs613>
I think turning off caches will be my next test, maybe during $KIDS naptime this afternoon ;-)
joeskb7 has quit [Quit: Lost terminal]
<marex>
rfs613: um, doesn't this obviously point at DRAM timing issues ?
joeskb7 has joined #u-boot
<marex>
rfs613: if 1 GiB board works, 256 MiB board fails, diff the DRAM timings of both boards and crosscheck them with DRAM datasheet ?
<rfs613>
marex: there are actually not many differences between the two DRAM parameter sets... i had diffed them before, when I was thinking to store only the diffs, instead of two copies.
<rfs613>
but the 256M params have been in use for years, and we never had any issues
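Storing only the differences between the two parameter sets, as mentioned above, also gives a compact list of exactly which entries to cross-check against the DRAM datasheet. A minimal sketch, assuming both sets are available as equal-length lists of register words; the values below are made up and are not real controller settings, and both function names are hypothetical.

```python
def param_delta(base, other):
    """Map index -> value for every entry where `other` differs from `base`."""
    if len(base) != len(other):
        raise ValueError("parameter tables must have the same length")
    return {i: o for i, (b, o) in enumerate(zip(base, other)) if b != o}


def apply_delta(base, delta):
    """Rebuild the second table from the base table plus the stored delta."""
    out = list(base)
    for index, value in delta.items():
        out[index] = value
    return out


if __name__ == "__main__":
    # Made-up example values, NOT real DDR controller parameters.
    params_256m = [0x0A0A0A0A, 0x00000400, 0x00021000, 0x00000033]
    params_1g = [0x0A0A0A0A, 0x00000800, 0x00021000, 0x00000037]

    delta = param_delta(params_256m, params_1g)
    for index, value in sorted(delta.items()):
        print("word %d: 0x%08x -> 0x%08x" % (index, params_256m[index], value))

    # Round trip: base + delta must reproduce the second table exactly.
    assert apply_delta(params_256m, delta) == params_1g
```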
<marex>
rfs613: but now they do not ?
<rfs613>
the 1G is more recent. Running with the 256M config on 1G board is not really a correct test, but it does happen to work
<marex>
rfs613: what changed ?
<marex>
rfs613: the upstream rework ?
<rfs613>
marex: u-boot config got a minor tweak - we turned on CONFIG_CMD_I2C - but that specific change isn't really the cause. Turning on some other features also triggers it. So it's more likely a size/alignment thing.
<marex>
rfs613: are you able to simulate this with plain assembler ?
<rfs613>
there was no change to the toolchain (this was my first check)
<marex>
rfs613: maybe create a page of nops with asm volatile, and add single jump at the beginning somewhere into the middle ... errr, like this, sec
<rfs613>
marex: i've not yet reduced it to a simple assembly case, mostly because I don't yet understand exactly what conditions trigger it.
<rfs613>
eg. does it happen at a particular memory offset or alignment? do the previous instructions matter or not?
<marex>
rfs613: the first jump will jump across cache line boundary, and land somewhere in the next page ... either it will land on 1: , which you can see with JTAG debugger, or it will be executing onward and hang on 2: which you will also see with JTAG debugger
<marex>
just plug it into the code somewhere wrapped in asm volatile(" ... ");
<marex>
obviously you will need to recover the machine after this, the 2: b 2 and 1: b 1 will spin in a loop
<rfs613>
yup, i see what you mean. I will try some experiments like that too. But I think disabling cache is a good test too, if only to rule out that as a factor.
<marex>
maybe it is better to use 1: b 1b and 2: b 2b on line 5 and 9 respectively
<marex>
rfs613: maybe start with a simple reproducer first and then start making larger changes like disabling cache ?
<marex>
rfs613: up to you really
<rfs613>
right now the only configuration that fails is full u-boot, loaded via a first-stage boot loader, with OP-TEE in the mix as well.
<rfs613>
i've tried to reduce that, but booting straight into u-boot doesn't fail (but it also doesn't do SMC calls)
<rfs613>
of course the u-boot config in that case is slightly different, code layout/offsets not the same, etc.
<marex>
rfs613: can you reproduce it in SPL then ?
<rfs613>
all to say that reducing this, while maintaining the failure, has been a challenge
<rfs613>
the first stage bootloader in this case is not SPL, it's a schneider custom job
<marex>
rfs613: can you first-stage load raw binary then ? :)
<marex>
rfs613: one which does very little, just the nops and jumps ?
<rfs613>
yeah I could do that... if I can get it to fail...
<marex>
rfs613: or can you reproduce it in optee-os ?
<rfs613>
i tried in fact to eliminate optee-os as a factor, boot from the first-stage direct to u-boot... that works, but i have to modify u-boot to not make SMC calls, otherwise it will of course hang (since no OPTEE to handle those calls)
stipa has quit [Quit: WeeChat 3.0]
<marex>
rfs613: so maybe optee changes something ? Surely it does configure the MMU and possibly some memory protection things, and drops EL
<rfs613>
i can also boot directly into u-boot (BootROM -> SPKG -> u-boot) and that works fine too... but again u-boot isn't configured quite the same way in that case (no SMC calls, different hyp mode settings, etc)
<rfs613>
marex: i modified op-tee to do fewer things, in particular to *not* enable trustzone, etc. This did not change the behaviour.
<rfs613>
obviously this doesn't eliminate op-tee... since so far the failure only happens when booting with optee in the path...
<mps>
just noticed that I wrote incorrectly 'riscv64 S-mode targests: opensbi' in doc/build/gcc.rst (targests instead of targets) in commit id d1c758ed3163531627b663911af1506e60d643a5
prabhakarlad has quit [Quit: Client closed]
<mps>
should I create patch for this just one letter change?
<rfs613>
marex: but i think padding with nop's, as well as disabling cache, are some good experiments to try. Will hopefully get a chance this afternoon during son's naptime. For now he's demanding my attention, so thanks for suggestions, will let you know what I find...
flom84 has quit [Quit: Leaving]
<xypron>
mps: yes we need a patch to change doc/
<mps>
xypron: ok, will post it
<marex>
rfs613: SPKG ... sounds familiar
<marex>
rfs613: sorry, was afk
<marex>
rfs613: I wonder, can't you try to reproduce the error just before jumping to U-Boot, in optee already ?
<marex>
xypron: have you ever tested the wget command in u-boot ?
<marex>
Tartarus: I was wondering, do we have some basic file editor in uboot ? Or at least 'cat' ?
<Tartarus>
No? I think the closest we have is hexdump but I don't recall if that gives ascii in some columns or if it's just a little different than md
<marex>
Tartarus: so another todo entry
<sjg1>
NishanthMenon: I will hold off doing a new pip release for now...needs a little testing
mmu_man has quit [Ping timeout: 240 seconds]
Clamor has quit [Ping timeout: 260 seconds]
Clamor has joined #u-boot
mmu_man has joined #u-boot
Stat_headcrabed has joined #u-boot
vagrantc has joined #u-boot
umbramalison has quit [Ping timeout: 255 seconds]
umbramalison_alt has joined #u-boot
sigmaris has quit [Ping timeout: 255 seconds]
sigmaris has joined #u-boot
apritzel_ has quit [Ping timeout: 255 seconds]
umbramalison_alt has quit [Ping timeout: 255 seconds]
umbramalison has joined #u-boot
Stat_headcrabed has quit [Quit: Stat_headcrabed]
Stat_headcrabed has joined #u-boot
<rfs613>
marex: i guess I could have optee load a dummy payload (mostly nop's, with a few branch instructions), instead of u-boot.
<rfs613>
marex: but there seems to be more to it than just a branch. The problem occurs quite a long way into u-boot, during env_init() after relocation.
<rfs613>
many many many "bl" instructions have been executed by that point, and they've all worked normally.
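The dummy payload idea (mostly nops with a few branch instructions) could be generated on the host instead of being written by hand. The following is a rough sketch under several assumptions that are not from the log: the payload is entered in ARM (A32) state, the target is little-endian, and the sizes and offsets are arbitrary. The layout is only one possible arrangement, not the snippet that was pasted in the channel. Each outcome ends in a branch-to-self at a distinct address, so the PC read over JTAG shows where the CPU ended up.

```python
import struct

NOP = 0xE1A00000   # "mov r0, r0", a classic A32 no-op
SPIN = 0xEAFFFFFE  # "b ." (branch to self)


def encode_branch(src_off: int, dst_off: int) -> int:
    """Encode an unconditional A32 'b' from byte offset src_off to dst_off."""
    delta = dst_off - (src_off + 8)  # in A32 state the PC reads as address + 8
    if delta % 4:
        raise ValueError("branch target must be word aligned")
    if not -(1 << 25) <= delta < (1 << 25):
        raise ValueError("branch target out of range")
    return 0xEA000000 | ((delta >> 2) & 0xFFFFFF)


def build_payload(size: int = 8192, jump_from: int = 0, land_at: int = 4096 + 64) -> bytes:
    """Build a nop-filled image with one long jump and three spin traps.

    Offsets are hypothetical defaults: the jump crosses into the next 4 KiB page.
    """
    if not jump_from < land_at - 4 < size - 8:
        raise ValueError("offsets do not fit the payload")
    words = [NOP] * (size // 4)
    words[jump_from // 4] = encode_branch(jump_from, land_at)
    words[(land_at - 4) // 4] = SPIN  # reached if the branch was skipped or landed short
    words[land_at // 4] = SPIN        # the expected landing spot
    words[-1] = SPIN                  # reached if the branch landed too far
    return struct.pack("<%dI" % len(words), *words)


if __name__ == "__main__":
    with open("nop_payload.bin", "wb") as f:
        f.write(build_payload())
```

After running the payload, halting the core should show the PC at the load address plus `land_at`; any other spin address means the branch did not behave as encoded.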
apritzel has joined #u-boot
<marex>
rfs613: well do you know the exact point where it goes weird ?
vagrantc has quit [Quit: leaving]
<rfs613>
marex: yes, it is consistently happening in the same spot.. except when I add/remove code, or change the uboot configuration, etc.
camus has quit [Remote host closed the connection]
camus has joined #u-boot
<marex>
rfs613: can you add a jump just before that location, to some other buffer ?
<marex>
rfs613: if so, then you can use that other buffer for whatever experiments
<marex>
rfs613: or, even, just add hw breakpoint at that location and load your experimental payload via JTAG and then continue execution on it
<rfs613>
marex: i'm currently setting a hw breakpoint shortly before the problematic jump. And then I use "x /10i jump_target" to disassemble at the jump target. It shows the prologue of a function, storing registers on stack, etc. All is normal.
Stat_headcrabed has quit [Quit: Stat_headcrabed]
<rfs613>
then I single step ("si") right to the jump instruction... disassemble again, all still well.
<rfs613>
step once more, so now the PC changes to the jump (bl instruction)
<rfs613>
disassemble again, the contents have changed
<marex>
rfs613: uh, how is the JTAG debugger accessing the memory ?
<marex>
rfs613: is this openocd or something better ?
<rfs613>
marex: it's a Segger J-Link
<marex>
rfs613: openocd had I think two modes, one directly via DAP (?) and another by running some helper on the CPU
<rfs613>
marex: operating in SWD mode
<marex>
they had different views of the memory, the former didn't suffer cache issues because it was hanging right off the AXI bus, but the CPU caches had to be synced on each change
<marex>
the latter had cache issues iirc
<marex>
JLink with OpenOCD or their tool?
<rfs613>
marex: their tools, JLinksomething and then gdb.
<rfs613>
marex: I totally did not believe the JTAG result at first.
<rfs613>
marex: but after the jump, I looked at the "wrong" code it was showing me. Then single stepped one more instruction. And the registers updated exactly as per the "wrong" code.
<rfs613>
marex: eg. it should have been a "stmdb sp!, {...}" as the first instruction in the subroutine being called
<rfs613>
but in fact it was a mov instruction. I could see the dest register of the mov being updated.
<marex>
rfs613: where exactly does this error trigger, somewhere in board_f.c or board_r.c ?
<Forty-Bot>
marex: 2 modes depends on the processor; generally you see what the processor sees
<rfs613>
marex: it is happening in board_r, below the env_init() initcall. Specifically as there is no env, it calls env_set_default, which parses the built-in environment.
<rfs613>
marex: parsing means building a hash table of all the env variables, using himport(), which calls hsearch().
<rfs613>
marex: and within hsearch(), you will see two calls to env_callback_init() followed by env_flags_init().
<rfs613>
marex: it is on that call to env_flags_init() where things go wonky.
<marex>
Forty-Bot: no, not really
<marex>
Forty-Bot: as far as I recall, there are two modes, one which has direct access to the bus, the other which operates on the CPU core, right ?
<marex>
at least on arm that is
<Forty-Bot>
there's a dap bus, but that generally doesn't include the regular memory map
<Forty-Bot>
on some targets there can be a separate debug port with an uncached view of memory
* rfs613
points out that prior to getting the JTAG cable, I was debugging this with printf. And i was finding the same thing, that env_callback_init() completes, but env_flags_init() never returns.
<marex>
rfs613: so could you perform the nop test before that point ?
* Forty-Bot
hates jtag T_T
<rfs613>
when I put extra prints inside those functions, the problem would vanish
<rfs613>
(esp. in env_attr_lookup and env_attr_walk, which are called by both of those functions)
<marex>
Forty-Bot: such strong words ...
<rfs613>
marex: I did try putting some nop's at the beginning of env_flags_init(). But being in C, they get inserted after the function prologue. So the first statement is still "stmdb" of course.
<rfs613>
marex: I also tried adding nops to the function that comes right before (according to objdump).
<Forty-Bot>
I have been really enjoying working on sandbox recently since I can use native debug tools
<marex>
rfs613: __attribute((naked))
<rfs613>
marex: and with that, it seems the memory is still changing values, but it doesn't happen exactly on the branch instruction, but rather somewhere slightly earlier... i have to single step more carefully to find out where.
<rfs613>
marex: ah yes, I suppose I could use that
<marex>
sec, I currently hate SDIO
<marex>
and microscopic buffers with barely readable labels
<rfs613>
Forty-Bot: I did try to reproduce this on sandbox, by copying my board's default env to the sandbox config. But of course it parses just fine, no strange hang...
<Forty-Bot>
a lot of times if something corrupts .text, it will not take effect for a bit (due to caching)
<marex>
rfs613: if you can do hardware breakpoint already, then you can even start another code in memory which you upload there right after the HW breakpoint triggers, to perform whatever on-cpu test you might need to perform
<Forty-Bot>
so you may want to check code which runs before the hang
<Forty-Bot>
and definitely try to use hardware breakpoints
<Forty-Bot>
software breakpoints are generally useless
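One way to act on the advice to check code that runs before the hang is to dump the relocated .text at a few hardware breakpoints and compare each dump against the reference image, block by block. The sketch below assumes the reference bytes and the dump cover the same region at the same offset; the block size is a guess that should be replaced with the cache line size of the actual core, and the function name is hypothetical. Damage that lines up with whole blocks would hint at a cache or bus level cause, but it would not prove one.

```python
def corrupted_blocks(reference: bytes, dump: bytes, block_size: int = 64):
    """Return the starting offsets of every block_size-byte block that differs.

    block_size=64 is an assumption; use the line size from the core's manual.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    n = min(len(reference), len(dump))
    return [
        off
        for off in range(0, n, block_size)
        if reference[off:off + block_size] != dump[off:off + block_size]
    ]


if __name__ == "__main__":
    # Synthetic data: a 256-byte "image" and a dump with two damaged bytes.
    reference = bytes(range(256))
    dump = bytearray(reference)
    dump[70] ^= 0xFF
    dump[200] ^= 0xFF

    for off in corrupted_blocks(reference, bytes(dump)):
        print("block at offset 0x%x differs" % off)
```

Repeating the comparison at successive breakpoints narrows down the first point at which the region no longer matches the image.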
<marex>
Forty-Bot: we really should set up some MMU protection at some point
<Forty-Bot>
yeah
<Forty-Bot>
it would have saved me a few days of trouble a month or two ago
<rfs613>
think my son is about to wake from his nap, so I will probably pick this up on Monday...
<rfs613>
marex, Forty-Bot: I think trying this with caches disabled might still be worthwhile, just to see if it changes things?
<Forty-Bot>
it can't hurt
<rfs613>
of course it means changing configuration, which means the executable changes, functions move to different addresses, which may avoid the issue (as has been the case for the past ~4 years on the same hardware)
apritzel has quit [Ping timeout: 245 seconds]
<marex>
Forty-Bot: too much on todo already, sigh
<rfs613>
marex: easy to fix that. Just rename linux->u-boot. And then rename kubernetes->linux. All problems solved, and we are all employed forever fixing random failures like the one I am looking at ;-)
goliath has joined #u-boot
<marex>
heh
slobodan has quit [Read error: Connection reset by peer]
slobodan has joined #u-boot
mmu_man has quit [Read error: Connection reset by peer]
mmu_man has joined #u-boot
goliath has quit [Quit: SIGSEGV]
apritzel has joined #u-boot
joeskb7 has quit [Quit: leaving]
goliath has joined #u-boot
pgreco_ has joined #u-boot
pgreco has quit [Ping timeout: 240 seconds]
___nick___ has quit [Ping timeout: 255 seconds]
<jeeebz>
Hello, is it a feature ? :p I got a pinebook pro, I reuse the script for multiple arm devices. I see that for one device, there is an export BL31. At first, it wasn't exported, and the build was okay. Then I uncommented this export and the build failed. Interesting, it does not find the dtb, even though it was built and is present. I'm retrying a few things (and I need some tricks in order to get the full log)
Clamor has quit [Read error: Connection reset by peer]
<marex>
Forty-Bot: and yes, it is because of the cells, but in this specific case, I think hacking the DTO to add the cells would grow it massively and the end result would be unhelpful (I tried multiple ways of doing that)