Tartarus changed the topic of #u-boot to: SOURCE MOVED TO https://source.denx.de/u-boot/u-boot.git / U-Boot v2022.01, v2022.04-rc2 are OUT / Merge Window is CLOSED / Release v2022.04 is scheduled for 4 April 2022 / http://www.denx.de/wiki/U-Boot / Channel archives at https://libera.irclog.whitequark.org/u-boot
cJ has joined #u-boot
GNUtoo has quit [Write error: Connection reset by peer]
prabhakarlad has quit [Quit: Ping timeout (120 seconds)]
swiftgeek has joined #u-boot
GNUtoo has joined #u-boot
prabhakarlad has joined #u-boot
camus has joined #u-boot
swiftgeek has quit [Remote host closed the connection]
swiftgeek has joined #u-boot
flyback has quit [Quit: Leaving]
swiftgeek has quit [Remote host closed the connection]
swiftgeek has joined #u-boot
flyback has joined #u-boot
thopiekar_ has joined #u-boot
thopiekar is now known as Guest9315
thopiekar_ is now known as thopiekar
Guest9315 has quit [Killed (osmium.libera.chat (Nickname regained by services))]
flyback has quit [Client Quit]
flyback has joined #u-boot
<Tartarus> sjg1: does the event test have some special requirements? it runs in CI but fails on bill-the-cat.
jclsn78 has joined #u-boot
jclsn7 has quit [Ping timeout: 252 seconds]
urja has quit [Ping timeout: 250 seconds]
urja has joined #u-boot
mmu_man has quit [Ping timeout: 252 seconds]
camus has quit [Read error: Connection reset by peer]
camus has joined #u-boot
vagrantc has quit [Quit: leaving]
mwalle has quit [Quit: WeeChat 3.0]
macromorgan_ has joined #u-boot
macromorgan is now known as Guest8877
Guest8877 has quit [Killed (silver.libera.chat (Nickname regained by services))]
macromorgan_ is now known as macromorgan
michalkotyla has joined #u-boot
sbach has quit [Read error: Connection reset by peer]
sbach has joined #u-boot
swiftgeek has quit [Ping timeout: 272 seconds]
stefanro has quit [Quit: Leaving.]
stefanro has joined #u-boot
stefanro has quit [Quit: Leaving.]
stefanro has joined #u-boot
guillaume_g has joined #u-boot
Rahix has quit [Quit: ZNC - https://znc.in]
Rahix has joined #u-boot
sszy has joined #u-boot
Guest3473 has joined #u-boot
mckoan|away is now known as mckoan
<Guest3473> Good Morning @ALL
lucaceresoli_ has joined #u-boot
<Guest3473> Is it possible to change the U-Boot env so that Android opens a shell after boot when the only possible connection is over a USB-serial adapter? And if yes, how do I do this?
swiftgeek has joined #u-boot
Guest3473 has quit [Quit: Client closed]
Guest34 has joined #u-boot
frieder has joined #u-boot
ac_slater has quit [Quit: WeeChat 3.4]
milkylainen has joined #u-boot
zibolo has joined #u-boot
Harm has quit [Ping timeout: 240 seconds]
matthias_bgg has joined #u-boot
user_name has joined #u-boot
user_name has quit [Client Quit]
nacre has joined #u-boot
mwalle has joined #u-boot
cpackham[m] has quit [Quit: You have been kicked for being idle]
maxim[m] has quit [Quit: You have been kicked for being idle]
Thorn has quit [Ping timeout: 256 seconds]
Thorn has joined #u-boot
lucaceresoli__ has joined #u-boot
lucaceresoli_ has quit [Ping timeout: 256 seconds]
nacre has quit [Quit: leaving]
apritzel has joined #u-boot
mauro_anjo has joined #u-boot
mmu_man has joined #u-boot
lucaceresoli_ has joined #u-boot
lucaceresoli__ has quit [Ping timeout: 272 seconds]
tre has joined #u-boot
Guest34 has quit [Quit: Client closed]
tre has quit [Remote host closed the connection]
matthias_bgg has quit [Read error: Connection reset by peer]
matthias_bgg has joined #u-boot
sughosh has joined #u-boot
lucaceresoli__ has joined #u-boot
lucaceresoli_ has quit [Ping timeout: 272 seconds]
nacre has joined #u-boot
sughosh has quit [Ping timeout: 272 seconds]
Pali has joined #u-boot
<Pali> Hello! Is there any reason why aarch64 U-Boot masks System Error exceptions but does not mask Synchronous Abort exceptions?
<Pali> Masking System Error exceptions in U-Boot is a big issue... if (buggy) U-Boot code causes a System Error, U-Boot does not see it and continues its operations as if nothing happened.
<Pali> And once U-Boot starts booting the kernel (which unmasks System Error), the kernel is immediately killed by the pending System Error caused by U-Boot.
sughosh has joined #u-boot
<Pali> But this System Error is reported from kernel context... which makes it really hard to work out that the bootloader, not the kernel, caused it.
<Pali> On the other hand, Synchronous Abort exceptions are not masked by U-Boot, and if buggy code triggers one then U-Boot runs its do_sync() handler, which resets the CPU.
<apritzel> Pali: that is already fixed in -next, I believe
<apritzel> Pali: and yeah, we ran into the exact same issue multiple times already ...
<Pali> apritzel: Perfect, I'm going to test those patches!
<Pali> apritzel: Patches are working fine! System Error is unmasked and u-boot's handler is called, thanks!
<apritzel> Pali: great, thanks for the test! They should end up in mainline in April, I guess
<Pali> apritzel: Btw, do you know if there is some way/hack to force an aarch64 core to report aborts caused by store instructions as Synchronous External aborts?
<apritzel> if you mean to catch "bus errors" synchronously: I don't think so, at least not architecturally
<Pali> external aborts caused by load are reported as synchronous (as core has to wait until data are ready) but aborts caused by store are reported as asynchronous as abort itself is delivered later...
<Pali> I mean data aborts caused by AXI slaves
<apritzel> yes, that's a general problem, but hard to fix if you care about performance
<apritzel> you might get lucky on something simple like an A53
<sjg1> Tartarus: Not that I know of... what is the failure?
<Pali> apritzel: and if I did not care about performance on A53, is it possible?
<Tartarus> sjg1: Lost it in my scrollback, but it just doesn't like the output from the test at all
<Pali> I have tried to find something in public ARM A53 doc, but I was not able to find anything on this topic
<sjg1> Tartarus: Well one of the tests looks in the 'u-boot' elf file to find the event-spy linker list. That was the issue I had with LTO (it was dropping some entries)
<apritzel> Pali: I meant architecturally we don't guarantee synchronous aborts, because that would seriously limit implementations
<Tartarus> Hmm
<sjg1> Tartarus: So if the entry is not present it will fail
<apritzel> Pali: in my experience on an A53, in single-core U-Boot you will get those SErrors quick enough, though
<sjg1> Tartarus: Like this:
<apritzel> Pali: I don't know if there is something in some IMPDEF sysreg to wait for writes
marc2 has quit [Ping timeout: 268 seconds]
marc2 has joined #u-boot
<marex> is this yet again related to PCIe ?
<Pali> yes, I'm trying to find out how to handle these errors...
<Pali> They are common across ten different platforms, all of them aarch64
torez has joined #u-boot
<Tartarus> sjg1: No, https://pastebin.com/6cnmUBtC is the failure I see and "00000000006703a0 d _u_boot_list_2_evspy_info_2_EVT_MISC_INIT_F" is in /tmp/.bm-work/sandbox/u-boot
<Tartarus> The same toolchain as we have for CI being used
<sjg1> sughosh: Re v4 of the TPM RNG series, yes I see it, hope to get to it on the weekend
<Pali> apritzel: I looked into this table https://developer.arm.com/documentation/ddi0500/e/system-control/aarch64-register-summary/aarch64-implementation-defined-registers but seems that there is no impdef register for it
<sjg1> Tartarus: Perhaps drop the ELF file somewhere so I can look? The '?' is supposed to be the function name, i.e. 'f:sandbox_misc_init_f'
<Tartarus> sjg1: You should still be able to pop over to bill-the-cat, if you have a moment :)
<Tartarus> otherwise I'll go poke something
<sughosh> sjg1: Thanks Simon. Btw, you made a comment on not needing malloc for reading the random bytes
<kettenis> Pali: the traditional way to handle this is to do the PCIe access in a critical section with the appropriate barrier before and after the access
<sughosh> but we do need memory to hold the random bytes read from the rng device. how does it work without allocating memory, whether on heap or stack
<kettenis> if you detect an asynchronous fault while inside that critical section and the reported address matches, you can be sure the fault was caused by the PCIe access
<kettenis> of course this makes things slow
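The critical-section approach kettenis describes could be sketched roughly as below. This is a host-side simulation under stated assumptions, not U-Boot or Linux code: `guard`, `guard_claim_fault()`, `guarded_read32()` and the two fake accessors are hypothetical names, the C11 fence only stands in for the real hardware barrier, and it assumes the fault handler is actually given a fault address to compare.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state for one guarded access; not a U-Boot or Linux API. */
static struct {
	bool active;     /* true only while inside the critical section */
	uintptr_t addr;  /* address the guarded access targets */
	bool faulted;    /* set when a matching fault is claimed */
} guard;

/* Stand-in for the hardware barrier (on aarch64 this would be a DSB). */
static void access_barrier(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/*
 * Meant to be called from the asynchronous fault handler. Returns true
 * only when the fault arrives inside the critical section and the
 * reported address matches the guarded access.
 */
bool guard_claim_fault(uintptr_t fault_addr)
{
	if (!guard.active || guard.addr != fault_addr)
		return false;
	guard.faulted = true;
	return true;
}

/* Read through raw_read() inside the critical section; false on a claimed fault. */
bool guarded_read32(uintptr_t addr, uint32_t (*raw_read)(uintptr_t),
		    uint32_t *val)
{
	uint32_t v;

	guard.addr = addr;
	guard.faulted = false;
	guard.active = true;
	access_barrier();   /* barrier before the access */
	v = raw_read(addr);
	access_barrier();   /* barrier after the access, before leaving */
	guard.active = false;

	if (guard.faulted)
		return false;
	*val = v;
	return true;
}

/* Fake accessors for demonstration only: one succeeds, one "faults". */
uint32_t fake_read_ok(uintptr_t addr)
{
	(void)addr;
	return 0x12345678u;
}

uint32_t fake_read_fault(uintptr_t addr)
{
	guard_claim_fault(addr);  /* simulate the handler firing mid-access */
	return 0xffffffffu;
}
```

A fault reported outside the critical section is never claimed, so it would still be treated as a real, unrelated error.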
<Pali> kettenis: But this is not how PCIe drivers are written, neither in kernel nor in u-boot
<apritzel> Pali: well, you shouldn't encounter SErrors, normally
<apritzel> if so, it's a bug in the driver (or somewhere else)
<Pali> yes, but if PCIe IPs are broken, then SErrors are normal
<kettenis> I think the sparc64 implementation of pci_config_read/write uses this strategy
<apritzel> Pali: yes, but then your PCIe IP is broken, and you need some other workaround
nacre has quit [Quit: leaving]
<Pali> A lot of PCIe IPs incorrectly map PCIe CA or CRS responses to AXI SLVERR
<apritzel> Pali: is this about config space probes returning SErrors instead of 0xff?
<Pali> not only to config space request, but response to any PCIe request
mmu_man has quit [Ping timeout: 240 seconds]
<apritzel> Pali: yes, and this is a serious hardware issue, and should be fixed there
<sjg1> sughosh: yes, should not use malloc() willy nilly. Just use 'char buf[64]' or something like that
<apritzel> Pali: you cannot reliably catch SErrors and reason about them. Believe me, I tried, and ran this idea by quite some people
<kettenis> Pali: the idea is that once you know the device is actually there and functioning you know which registers are implemented and those should never fault
<Pali> apritzel: so what to do with the lot of hw/SoCs which are affected by these PCIe IP implementation errors?
<apritzel> indeed
<apritzel> Pali: you respin ;-)
<Pali> kettenis: This is not true, they can cause a fault again
<sughosh> sjg1: okay. will do that in v5. will wait for your comments on v4 before posting v5.
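A minimal sketch of sjg1's suggestion to use a small stack buffer instead of malloc() when reading random bytes. `fake_rng_read()` is a hypothetical stand-in for the real device read (it only fills a fixed pattern), and `read_random()` is an illustrative name, not the function from the series under review.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the device read; fills a fixed pattern. */
int fake_rng_read(void *buf, size_t len)
{
	memset(buf, 0xa5, len);
	return 0;
}

/* Fill 'out' with 'len' bytes, going through a 64-byte stack buffer. */
int read_random(uint8_t *out, size_t len)
{
	uint8_t buf[64];  /* lives on the stack: no malloc()/free() needed */

	while (len) {
		size_t chunk = len < sizeof(buf) ? len : sizeof(buf);
		int ret = fake_rng_read(buf, chunk);

		if (ret)
			return ret;
		memcpy(out, buf, chunk);
		out += chunk;
		len -= chunk;
	}
	return 0;
}
```

Requests larger than the buffer are served in 64-byte chunks, so the stack usage stays fixed regardless of how many bytes the caller asks for.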
<apritzel> or abandon the platform
GNUtoo has quit [Quit: leaving]
<kettenis> Pali: what apritzel says ;)
<Pali> kettenis: at any time when LTSSM changes state to a non-L* state
<kettenis> then you need to disable LTSSM and declare hotplugging broken
<Pali> kettenis: this does not make sense, disabling LTSSM makes the PCIe link go down and you cannot access the PCIe card.
<apritzel> Pali: in this particular case there is some halfway working workaround by catching the SErrors very early (by mgmt processor firmware), then build a table of allowed BDFs, and filter by that
<Pali> apritzel: this does not help when LTSSM changes state
<apritzel> which doesn't catch everything, for instance SR-IOV
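The allowed-BDF table apritzel mentions might look roughly like the sketch below. The table contents, `bdf_allowed()`, `filtered_cfg_read32()` and the fake reader are hypothetical; in the workaround described, the table would be built early by firmware rather than hard-coded. Filtered probes return all ones, the value a missing device is expected to produce.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Conventional 16-bit BDF encoding: bus[15:8], device[7:3], function[2:0]. */
#define BDF(bus, dev, fn) \
	((uint16_t)((((bus) & 0xff) << 8) | (((dev) & 0x1f) << 3) | ((fn) & 0x7)))

/* Hypothetical table; assumed here to be fixed for the sake of the example. */
static const uint16_t allowed_bdfs[] = {
	BDF(0, 0, 0),
	BDF(1, 0, 0),
};

bool bdf_allowed(uint16_t bdf)
{
	size_t i;

	for (i = 0; i < sizeof(allowed_bdfs) / sizeof(allowed_bdfs[0]); i++)
		if (allowed_bdfs[i] == bdf)
			return true;
	return false;
}

/* Only touch the hardware for known-good BDFs; otherwise return all ones. */
uint32_t filtered_cfg_read32(uint16_t bdf, uint32_t (*raw_read)(uint16_t))
{
	if (!bdf_allowed(bdf))
		return 0xffffffffu;
	return raw_read(bdf);
}

/* Fake config read for demonstration only; the value is arbitrary. */
uint32_t fake_cfg_read32(uint16_t bdf)
{
	(void)bdf;
	return 0x12348086u;
}
```

As noted in the discussion, a static table like this cannot cover functions that appear later, such as SR-IOV virtual functions, and it does nothing for MMIO accesses.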
<kettenis> what I mean is you have to make sure LTSSM doesn't change the state
<Pali> once it drops to config or recovery from an L* state, those SErrors are back
<kettenis> bring up the link and make sure it stays up
<apritzel> Pali: as I said: fix the hardware ;-)
<Pali> kettenis: "have to make sure LTSSM doesn't change the state" --> this is not possible by PCIe design
<Pali> other side of the link may and in some cases must change state
<kettenis> well, if you combine such hardware with PCIe IP like that, your system is broken
<kettenis> however, does mapping the relevant address space as nGnRnE make the faults synchronous?
redbrain has quit [Read error: Connection reset by peer]
GNUtoo has joined #u-boot
<apritzel> kettenis: not for writes, AFAIK
<Pali> it is already mapped as MT_DEVICE_NGNRNE
<apritzel> because the "nE" goes only so far into the interconnect
<apritzel> Pali: what CPU core is this? something ARMv8.2? You can hack something up with an "esb" instruction then to contain the SError, but that's not reliable nor upstreamable
<Pali> In my case it is A53
<kettenis> should work for config space access, but I guess mmio access is inherently posted
<Pali> But this error is a general one, which I see on a lot of different platforms
<Pali> And people periodically report these issues with different PCIe IPs
redbrain has joined #u-boot
<Pali> PCIe MEM write commands are posted
<Pali> Writes which are not posted are only IO and config
<kettenis> yes
<kettenis> I forgot about that
<Pali> Nature of posted/non-posted is on the PCIe bus.
<Pali> But those AXI errors are reported by the PCIe controller before commands are sent from the controller to the bus
<Pali> So something like armv7 "strongly-ordered" memory mapping could help with this issue...
<apritzel> I think the issue is that you should never signal an SError to the CPU side unless it's a fatal problem
<Pali> Yes, this is the issue -- the bug in PCIe IP
<apritzel> and a good part of the problem is that most so called root complexes are tweaked end points
<Pali> In my opinion PCIe IP designers misunderstood what those PCIe errors mean and that they cannot be mapped to SLVERR
<apritzel> exactly
<apritzel> the PCIe term SERR# does not help here ;-)
<Pali> But because this is not an issue of just one PCIe IP, but of at least 4 or 5 different ones from different companies, it is really a big problem.
<sjg1> Tartarus: addr2line: DWARF error: can't find .debug_ranges section.
<sjg1> Tartarus: The image seems to be missing the line-number info in the debug tables
<apritzel> Pali: at least for config space accesses there is an SMCCC firmware standard to be able to deploy workarounds in firmware: https://developer.arm.com/documentation/den0115/latest
<kettenis> for config space access you can easily implement a workaround in the host bridge driver
<apritzel> (with the major caveat that this SMC is not implemented by Linux)
<Pali> I know, but this does not help for dynamic reconfiguration of config space memory mapping, plus it was rejected by linux-pci people
<kettenis> you can't do that in linux and u-boot for mmio and io access because drivers use readl/writel directly
<apritzel> and rightly so
torez has quit [Ping timeout: 260 seconds]
torez has joined #u-boot
<Pali> (btw, in my case, I can access config space via a different PIO method which does not cause SErrors, but this does not help MMIO access, nor does it help other platforms)
sughosh has quit [Read error: Connection reset by peer]
<apritzel> Pali: so why do those other SErrors happen during normal operation? Is that because the PCIe endpoint is something special?
<Pali> it happens anytime when you issue request and endpoint is not in L* state
<Pali> if endpoint is in L1 or L2 state it has to first switch to L0 before it can accept requests... but changing from L2 to L0 is via config or recovery state
<Pali> and if you do readl() or writel() while it is in the recovery state, you get an SError
<kettenis> under what circumstances does the device decide to switch out of the L0 state?
<Pali> for example for power saving
<Pali> and also when card is buggy
<kettenis> well, you disable that
<Pali> and does its internal reset
<Pali> wifi cards are known to be buggy; their firmware often crashes and the card itself "reboots"
<kettenis> more broken hardware
<Pali> during card reboot PCIe link is down
<Pali> Or it happens if the kernel explicitly wants to reset the card (either via an in-band method, e.g. FLR or Hot-Reset, or out-of-band, e.g. Warm-Reset)
<Pali> Or also when the kernel explicitly re-issues Link Retraining
<kettenis> well, if the kernel initiates the reset, it should know not to access the device until it is back up
<Pali> "until it is back up" --> but this check is done by reading config space
<Pali> PCIe mandates that card should return CRS response
<Pali> and we are back at the beginning: broken PCIe IPs map CRS to SError
<kettenis> but you can work around that aspect
<Pali> ok, if config space is somehow worked around... there are still those problems with MMIO
<kettenis> send your pcie device back to the vendor and ask for a refund
<kettenis> between the linux pcie maintainers being somewhat unreasonable, your pcie host bridge IP being broken and your pcie devices being broken you'll have to come up with a compromise that makes most hardware work
<Pali> Show me one pcie wifi card which is not broken... I have tested a lot of them and I have not found any non-broken one.
<Pali> Just kidding...
<Pali> But situation is really bad as there is no working hw... and I'm trying to find something useful in general, not just for one PCIe controller.
zibolo has quit [Ping timeout: 256 seconds]
frieder has quit [Remote host closed the connection]
adeepv has quit [Quit: %exit%]
apritzel has quit [Ping timeout: 252 seconds]
vagrantc has joined #u-boot
sszy has quit [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
___nick___ has joined #u-boot
mckoan is now known as mckoan|away
vagrantc has quit [Ping timeout: 240 seconds]
___nick___ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
___nick___ has joined #u-boot
vagrantc has joined #u-boot
baltazar has quit [Ping timeout: 256 seconds]
baltazar has joined #u-boot
matthias_bgg has quit [Ping timeout: 272 seconds]
sobkas has joined #u-boot
<Tartarus> Sigh, https://source.denx.de/u-boot/u-boot/-/jobs/402349 means the check has been broken for a while
<Tartarus> I've figured out how to fix the check, but I need to re-migrate a ton of stuff again first
* Tartarus hunts for brown paper bags
torez has quit [Ping timeout: 240 seconds]
<marex> kettenis: pcie maintainers being unreasonable ? :)
<kettenis> somewhat unreasonable ;)
<marex> kettenis: thank you, this made me laugh
mauro_anjo has quit [Ping timeout: 252 seconds]
mthall has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
mthall has joined #u-boot
mthall has quit [Client Quit]
mthall has joined #u-boot
grgy has quit [Ping timeout: 240 seconds]
___nick___ has quit [Ping timeout: 250 seconds]
mmu_man has joined #u-boot
<sjg1> Tartarus: Ah I didn't know about that check. Should we move it to the build?
grgy has joined #u-boot
lucaceresoli__ has quit [Quit: Leaving]
grgy has quit [Client Quit]
grgy has joined #u-boot
prabhakarlad has quit [Quit: Client closed]
sobkas has quit [Quit: sobkas]
<Tartarus> sjg1: I'm fixing the test so that it works correctly, and a small series of re-migrating stuff again
<Tartarus> But no, it's not worth making part of every build
<sjg1> Tartarus: OK
Pali has left #u-boot [#u-boot]
prabhakarlad has joined #u-boot
heijligen has quit [Quit: WeeChat 3.2]