dgilmore changed the topic of #fedora-riscv to: Fedora on RISC-V https://fedoraproject.org/wiki/Architectures/RISC-V || Logs: https://libera.irclog.whitequark.org/fedora-riscv || Alt Arch discussions are welcome in #fedora-alt-arches
esv_ has joined #fedora-riscv
esv has quit [Ping timeout: 252 seconds]
esv_ is now known as esv
bkeys1 has joined #fedora-riscv
davidlt has joined #fedora-riscv
davidlt has quit [Ping timeout: 265 seconds]
davidlt has joined #fedora-riscv
<davidlt[m]> somlo: HW issues or software? Or just things working too slow and thus revealing issues?
<davidlt[m]> It's actually funny. I used to drop clock on Xeon servers to lowest possible to find new bugs :)
<davidlt[m]> Sometimes execution speed hides bugs in the code.
<davidlt[m]> Actually too fast also works, but a lot less.
smudge-the-cat has joined #fedora-riscv
smudge-the-cat has left #fedora-riscv [#fedora-riscv]
smudge-the-cat has joined #fedora-riscv
smudge-the-cat has left #fedora-riscv [#fedora-riscv]
jcajka has joined #fedora-riscv
<rwmjones> davidlt[m]: oh good, it worked
<rwmjones> it was taking a very long time to be picked up by the builders when I looked yesterday
<davidlt[m]> rwmjones: yeah, too few builders and too many tasks. In that case Koji prefers to keep producing those SRPMs before actually building anything.
<davidlt[m]> Annoying. The more builders you have the less likely that is the case.
<davidlt[m]> rwmjones: are you around? could you reconfigure kojid?
<rwmjones> davidlt[m]: sorry got to go out, can you email me the details & I can do it later
<davidlt[m]> rwmjones: ok
<davidlt[m]> Sample PSU arrived, preparing to test a new builder
<davidlt[m]> rwmjones: the final production fan on Unmatched is definitely a lot worse :D
<davidlt[m]> rwmjones: I can feel vibrations on my table from that tiny fan
<davidlt[m]> I just refreshed my knowledge of how to reprogram the FTDI chip on it (now it has a serial number in ID)
<somlo> davidlt[m]: it's a bit too early to say for sure, but I suspect some sort of software issue (uart linux driver, maybe?); The full boot log is at http://mirror.ini.cmu.edu/litex/fedora_1core_boot_crash.log
<davidlt[m]> somlo: early in the boot, you get:
<davidlt[m]> [ 0.581904] error: hwirq 0xffff is too large for :cpus:cpu@0:interrupt-controller
<davidlt[m]> [ 0.590116] WARNING: CPU: 0 PID: 1 at kernel/irq/irqdomain.c:568 irq_domain_associate+0x134/0x184
<davidlt[m]> which causes the irq affinity map stuff to fail, I think
<davidlt[m]> your final crash seems to be related to timer/irq stuff thus the initial issue might be IRQ related?
<davidlt[m]> Is your DT matching HW?
<davidlt[m]> Check your DT for plic0 configuration and in general all refs to plic0 in various devices
<somlo> yeah, that mask comes from the verilog (and dts) generated by Rocket. For some reason, linux complains about it on the 1-core version (was not complaining when I used 4 cores, but was still crashing :) )
<somlo> I need to do trial-and-error to figure out what the value is that would be "just right" (be nice if linux told us by *how much* the value is too large :) )
<davidlt[m]> Are the IRQ count and mapping the same between 1 and 4 core versions?
<somlo> yeah
<davidlt[m]> if 0xffff is IRQ number, that would be a large number :)
<davidlt[m]> FU740 has 69 IRQs IIRC
<davidlt[m]> So where did "hwirq 0xffff" come from?
<davidlt[m]> That would be too large.
<somlo> the chisel->verilog elaboration in the rocket chip sources
<davidlt[m]> More than riscv,ndev
<somlo> might have been a bug there, and I cut'n'pasted it. I should check if it's still that after I updated the rocket chip sources, thanks for pointing it out
<davidlt[m]> You can probably figure out a bit from looking at kernel source and adding some printk
<somlo> interestingly enough it's not complaining when I boot my custom kernel, and I use the same value in DT
<somlo> I remember doing the printk experiment a while back, when it *was* complaining :)
<davidlt[m]> Do you use the same kernel config?
<davidlt[m]> But this looks like it failed while init'ing stuff by looking at DT
<somlo> not the same as fedora (a whole lot fewer things enabled :)
<davidlt[m]> let me get my tea and do a quick git grep on kernel
<davidlt[m]> So that's irq_hw_number_t
<davidlt[m]> WARN(hwirq >= domain->hwirq_max
<somlo> no, I agree with you, I should try to turn it back down to a "sensible" size, but just surprised about why it isn't complaining about it (anymore) in the https://github.com/litex-hub/linux/tree/litex-rebase kernel I'm using to test (using https://github.com/litex-hub/linux/blob/litex-rebase/arch/riscv/configs/litex_rocket_defconfig)
<davidlt[m]> If you compile a debug kernel, you would get:
<davidlt[m]> pr_debug("%s(%s, irqbase=%i, hwbase=%i, count=%i)\n", __func__,
<davidlt[m]> well, only debugging can answer that :)
<davidlt[m]> new board has joined Koji party
esv_ has joined #fedora-riscv
esv has quit [Killed (NickServ (GHOST command used by esv_))]
<javierm> davidlt[m]: there's no need to compile a debug kernel, I believe pr_debug() is part of the dynamic debug infra so you could just enable that using a kernel cmdline param or sysfs entries
esv_ is now known as esv
esv_ has joined #fedora-riscv
<davidlt[m]> somlo: ^^
<davidlt[m]> javierm: nice to know!
esv_ has quit [Quit: Leaving]
takuma has quit [Quit: WeeChat 3.5]
<somlo> javierm: thanks, I'll study that, should come in handy in the future
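The dynamic debug route javierm describes can be sketched as below. This is a minimal illustration, assuming the kernel was built with CONFIG_DYNAMIC_DEBUG and that debugfs is mounted at /sys/kernel/debug; the helper names (dyndbg_query, cmdline_param, apply_query) are hypothetical and exist only for this example. The source file path is the one from the WARNING line earlier in the log.

```python
from pathlib import Path

# Usual location of the dynamic debug control file when debugfs is mounted
# (assumption: default mount point).
CONTROL = Path("/sys/kernel/debug/dynamic_debug/control")


def dyndbg_query(source_file: str, enable: bool = True) -> str:
    """Build a dynamic debug query that toggles pr_debug() output for one file."""
    flag = "+p" if enable else "-p"
    return f"file {source_file} {flag}"


def cmdline_param(query: str) -> str:
    """Wrap a query as a kernel command-line parameter."""
    return f'dyndbg="{query}"'


def apply_query(query: str, control: Path = CONTROL) -> None:
    """Write the query to the control file (needs root on a running system)."""
    control.write_text(query + "\n")


if __name__ == "__main__":
    query = dyndbg_query("kernel/irq/irqdomain.c")
    print(query)                 # string to write into the control file
    print(cmdline_param(query))  # string to append to the kernel command line
```

For messages printed early in boot, such as the irqdomain ones here, the command-line form is the one likely to matter, since the control file can only be written once userspace is up.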
<somlo> davidlt[m]: I specifically remember narrowing that down in the past (using printk's :) ) to 0x3F (instead of the default `65535` suggested by the sample .dts automatically generated by chisel)
<somlo> I'm trying again with that value in .dts (and so far I got no complaints from the kernel during that phase of booting, it's still working its way through the later stages, we'll see how it shakes out)
<somlo> but ultimately I think I'll have to remember all the chisel I didn't quite comprehend in the first place and figure out why the rocket generator suggests that wrong 0xFFFF (65535) value in the first place, and maybe suggest a fix
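The range check davidlt[m] quoted above (WARN(hwirq >= domain->hwirq_max) can be modeled in a few lines. This is only a sketch of the comparison, not kernel code: the function names are hypothetical, and the domain size of 64 is an assumption picked because it is consistent with 0x3F being accepted. It also reports by how much a value overshoots, which is the piece of information somlo wished the kernel printed.

```python
# Assumed size of the IRQ domain; illustrative only, not read from any DT.
ASSUMED_HWIRQ_MAX = 64


def hwirq_fits(hwirq: int, hwirq_max: int) -> bool:
    """Model of the check: the warning fires when hwirq >= hwirq_max."""
    return hwirq < hwirq_max


def overshoot(hwirq: int, hwirq_max: int) -> int:
    """How far hwirq lies above the largest valid value (0 if it fits)."""
    return max(0, hwirq - (hwirq_max - 1))


if __name__ == "__main__":
    for candidate in (0xFFFF, 0x3F):
        if hwirq_fits(candidate, ASSUMED_HWIRQ_MAX):
            print(f"hwirq {candidate:#x} ({candidate}): ok")
        else:
            extra = overshoot(candidate, ASSUMED_HWIRQ_MAX)
            print(f"hwirq {candidate:#x} ({candidate}): too large by {extra}")
```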
<somlo> side note: generating a matching .dts for the LiteX bitstream is still a semi-manual process, at least when using Rocket as the cpu
<davidlt[m]> Yeah
<davidlt[m]> IIRC generated DTS is not upstream quality for Linux
<somlo> and fully automating that would strongly benefit from being able to trust the sample .dts produced during chisel's elaboration of rocket sources (and then "pasting" in a bunch of extra device information from LiteX -- mmio register addresses, irq numbers, etc)
<somlo> which is obviously not the case at the moment :)
<somlo> maybe there's a bug in the LiteUART driver -- I used to be able to use the sbi console (ecall-ing into machine mode and having the hypervisor -- BBL -- do the console work on my behalf)
<somlo> wondering if that's something OpenSBI also supports...
<davidlt[m]> this time you didn't have irq thing during boot
<davidlt[m]> apart from the final crash :)
<somlo> yeah, I fixed the offending mask in DT
guerby_ is now known as guerby
<davidlt[m]> Good, but did it fail the same?
<davidlt[m]> I don't recall the 1st log.
<somlo> same-ish -- most of my crashes are some interrupt thing, most of the time related to the uart
davidlt has quit [Ping timeout: 252 seconds]
jcajka has quit [Quit: Leaving]
davidlt has joined #fedora-riscv
<davidlt[m]> nirik rwmjones djdelorie : I sent an email about kojid reconfiguration
<davidlt[m]> Feel free to kill any running job. There are none right now. Not sure if Miro will send something for the night.
<nirik> changes made here.
<djdelorie> davidlt[m]: the plugin is giving me errors: AttributeError: 'NoneType' object has no attribute 'origin'
<djdelorie> do I need to install it?
<davidlt[m]> could you check that the syntax is ok?
<davidlt[m]> I had this before, but it was wrong syntax in kojid.conf
<davidlt[m]> Has to be: plugins = rpmautospec_builder
<davidlt[m]> disk image already has everything installed
<djdelorie> yeah, the file had quotes around it
<davidlt[m]> that's an old typo :)
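Why the quotes matter can be illustrated with Python's configparser. This is a sketch under the assumption that kojid.conf is read with INI-style semantics and that the option lives in a [kojid] section; read_plugins is a hypothetical helper, not part of Koji. An INI parser of this kind keeps quote characters as part of the value, so a quoted plugin name would no longer match the real module name.

```python
import configparser
from typing import List


def read_plugins(conf_text: str) -> List[str]:
    """Return the whitespace-separated plugin names from a [kojid] section."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.get("kojid", "plugins").split()


GOOD = "[kojid]\nplugins = rpmautospec_builder\n"
BAD = '[kojid]\nplugins = "rpmautospec_builder"\n'

if __name__ == "__main__":
    print(read_plugins(GOOD))  # plain name, as davidlt[m] says it has to be
    print(read_plugins(BAD))   # the quotes survive as part of the name
```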
<djdelorie> rebooted, waiting to see if the timeout works
<davidlt[m]> ok
<davidlt[m]> You can always check Koji status/pings: http://fedora.riscv.rocks/koji/hostinfo?hostID=196
<djdelorie> journalctl is showing me what I need; it failed to login the first time, I'm watching to see if it works the second time
<djdelorie> 'Temporary failure in name resolution'
<davidlt[m]> dns?
<djdelorie> I think it's just missing some dependency on a network service and tries to start too early
<djdelorie> I think "After=network.target" is too soon; perhaps "After=network-online.target" ?
<djdelorie> also, it started up correctly after the timeout :-)
<davidlt[m]> Could be, I don't have it set to start on boot (I probably should)
<davidlt[m]> If you figure out the right dependency we should use it
<djdelorie> rebooting again :-)
<djdelorie> didn't help, but the timeout works, so "good enough for me"
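The "fails once, then works after the timeout" behavior djdelorie observes can be sketched as a retry loop around name resolution. This is not kojid's actual code, only an illustration of the pattern; wait_for_dns, make_flaky_resolver, and the host name koji.example are hypothetical. "Temporary failure in name resolution" surfaces in Python as socket.gaierror, which is what the loop catches.

```python
import socket
import time


def wait_for_dns(host, attempts=5, delay=2.0,
                 resolver=socket.getaddrinfo, sleep=time.sleep):
    """Retry name resolution; return the attempt number that succeeded.

    Re-raises socket.gaierror if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            resolver(host, None)
            return attempt
        except socket.gaierror:
            if attempt == attempts:
                raise
            sleep(delay)  # give the network time to come up


def make_flaky_resolver(failures):
    """Build a fake resolver that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def resolver(host, port):
        if state["left"] > 0:
            state["left"] -= 1
            raise socket.gaierror(socket.EAI_AGAIN,
                                  "Temporary failure in name resolution")
        return []

    return resolver


if __name__ == "__main__":
    # Simulate a network that is not ready for the first two attempts.
    tries = wait_for_dns("koji.example", resolver=make_flaky_resolver(2),
                         sleep=lambda seconds: None)
    print(f"resolved on attempt {tries}")
```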
davidlt has quit [Ping timeout: 260 seconds]