#fedora-riscv on 2022-09-21 — irc logs at libera.irclog.whitequark.org

2021-06-01 15:14 dgilmore changed the topic of #fedora-riscv to: Fedora on RISC-V https://fedoraproject.org/wiki/Architectures/RISC-V || Logs: https://libera.irclog.whitequark.org/fedora-riscv || Alt Arch discussions are welcome in #fedora-alt-arches

01:24 esv_ has joined #fedora-riscv

01:27 esv has quit [Ping timeout: 252 seconds]

01:51 esv_ is now known as esv

02:18 bkeys1 has joined #fedora-riscv

03:20 davidlt has joined #fedora-riscv

04:20 davidlt has quit [Ping timeout: 265 seconds]

06:21 davidlt has joined #fedora-riscv

06:42 <davidlt[m]> rwmjones: http://fedora.riscv.rocks/koji/buildinfo?buildID=204688

06:43 <davidlt[m]> somlo: HW issues or software? Or just things working too slow and thus revealing issues?

06:43 <davidlt[m]> It's actually funny. I used to drop clock on Xeon servers to lowest possible to find new bugs :)

06:43 <davidlt[m]> Sometimes execution speed hides bugs in the code.

06:44 <davidlt[m]> Actually too fast also works, but a lot less.

06:55 smudge-the-cat has joined #fedora-riscv

06:55 smudge-the-cat has left #fedora-riscv [#fedora-riscv]

07:47 smudge-the-cat has joined #fedora-riscv

07:47 smudge-the-cat has left #fedora-riscv [#fedora-riscv]

08:11 jcajka has joined #fedora-riscv

09:19 <rwmjones> davidlt[m]: oh good, it worked

09:19 <rwmjones> it was taking a very long time to be picked up by the builders when I looked yesterday

09:24 <davidlt[m]> rwmjones: yeah, too few builders and too many tasks. In that case Koji prefers to keep producing those SRPMs before actually building anything.

09:25 <davidlt[m]> Annoying. The more builders you have the less likely that is the case.

09:29 <davidlt[m]> rwmjones: are you around? could you reconfigure kojid?

09:54 <rwmjones> davidlt[m]: sorry got to go out, can you email me the details & I can do it later

09:54 <davidlt[m]> rwmjones: ok

10:03 <davidlt[m]> Sample PSU arrived, preparing to test a new builder

10:48 <davidlt[m]> rwmjones: the final production fan on Unmatched is definitely a lot worse :D

11:40 <davidlt[m]> rwmjones: I can feel vibrations on my table from that tiny fan

11:40 <davidlt[m]> I just refreshed my knowledge of how to reprogram the FTDI chip on it (now it has a serial number in ID)

11:44 <somlo> davidlt[m]: it's a bit too early to say for sure, but I suspect some sort of software issue (uart linux driver, maybe?); The full boot log is at http://mirror.ini.cmu.edu/litex/fedora_1core_boot_crash.log

11:45 <davidlt[m]> somlo: early in the boot, you get:

11:45 <davidlt[m]> [ 0.581904] error: hwirq 0xffff is too large for :cpus:cpu@0:interrupt-controller

11:45 <davidlt[m]> [ 0.590116] WARNING: CPU: 0 PID: 1 at kernel/irq/irqdomain.c:568 irq_domain_associate+0x134/0x184

11:46 <davidlt[m]> which causes the irq affinity map stuff to fail, I think

11:47 <davidlt[m]> your final crash seems to be related to timer/irq stuff thus the initial issue might be IRQ related?

11:47 <davidlt[m]> Is your DT matching HW?

11:48 <davidlt[m]> Check your DT for plic0 configuration and in general all refs to plic0 in various devices

11:49 <somlo> yeah, that mask comes from the verilog (and dts) generated by Rocket. For some reason, linux complains about it on the 1-core version (was not complaining when I used 4 cores, but was still crashing :) )

11:50 <somlo> I need to do trial-and-error to figure out what the value is that would be "just right" (be nice if linux told us by *how much* the value is too large :) )

11:50 <davidlt[m]> Are the IRQ count and mapping the same between 1 and 4 core versions?

11:50 <somlo> yeah

11:51 <davidlt[m]> if 0xffff is IRQ number, that would be a large number :)

11:51 <davidlt[m]> FU740 has 69 IRQs IIRC

11:51 <davidlt[m]> So where did "hwirq 0xffff" come from?

11:51 <davidlt[m]> That would be too large.

11:52 <somlo> the chisel->verilog elaboration in the rocket chip sources

11:52 <davidlt[m]> More than riscv,ndev

11:52 <somlo> might have been a bug there, and I cut'n'pasted it. I should check if it's still that after I updated the rocket chip sources, thanks for pointing it out

11:53 <davidlt[m]> You can probably figure out a bit from looking at kernel source and adding some printk

11:53 <somlo> interestingly enough it's not complaining when I boot my custom kernel, and I use the same value in DT

11:54 <somlo> I remember doing the printk experiment a while back, when it *was* complaining :)

11:54 <davidlt[m]> Do you use the same kernel config?

11:55 <davidlt[m]> But this looks like it failed while init'ing stuff by looking at DT

11:55 <somlo> not the same as fedora (a whole lot fewer things enabled :)

11:55 <davidlt[m]> let me get my tea and do a quick git grep on kernel

11:57 <davidlt[m]> So that's irq_hw_number_t

11:58 <davidlt[m]> WARN(hwirq >= domain->hwirq_max
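
The bounds check quoted above, and somlo's wish that the kernel reported by how much the value is too large, can be sketched in Python. This is only an illustration, not kernel code: `check_hwirq` is a hypothetical helper, and the limit of 64 is an assumed value, chosen because it is consistent with 0x3F being accepted later in the log.

```python
from typing import Optional

# Assumed limit for illustration only; the real value is whatever
# domain->hwirq_max holds for the interrupt controller in question.
ASSUMED_HWIRQ_MAX = 64


def check_hwirq(hwirq: int, hwirq_max: int = ASSUMED_HWIRQ_MAX) -> Optional[str]:
    """Mirror the `hwirq >= hwirq_max` test and say by how much it overshoots.

    Returns None when the number fits, otherwise a diagnostic string.
    """
    if hwirq >= hwirq_max:
        largest_valid = hwirq_max - 1
        excess = hwirq - largest_valid
        return (
            f"hwirq {hwirq:#x} is too large by {excess} "
            f"(largest valid value is {largest_valid:#x})"
        )
    return None


if __name__ == "__main__":
    for candidate in (0xFFFF, 0x3F):
        print(hex(candidate), "->", check_hwirq(candidate) or "ok")
```

Under that assumed limit, 0xFFFF is rejected and 0x3F is the largest value that passes.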

11:58 <somlo> no, I agree with you, I should try to turn it back down to a "sensible" size, but just surprised about why it isn't complaining about it (anymore) in the https://github.com/litex-hub/linux/tree/litex-rebase kernel I'm using to test (using https://github.com/litex-hub/linux/blob/litex-rebase/arch/riscv/configs/litex_rocket_defconfig)

11:58 <davidlt[m]> If you compile a debug kernel, you would get:

11:58 <davidlt[m]> pr_debug("%s(%s, irqbase=%i, hwbase=%i, count=%i)\n", __func__,

11:59 <davidlt[m]> well, only debugging can answer that :)

12:07 <davidlt[m]> new board has joined Koji party

12:08 esv_ has joined #fedora-riscv

12:09 esv has quit [Killed (NickServ (GHOST command used by esv_))]

12:09 <javierm> davidlt[m]: there's no need to compile a debug kernel, I believe pr_debug() is part of the dynamic debug infra so you could just enable that using a kernel cmdline param or sysfs entries

12:09 <javierm> https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html
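
A minimal sketch of what javierm describes, assuming a kernel built with CONFIG_DYNAMIC_DEBUG and debugfs mounted at /sys/kernel/debug. The helper names are hypothetical. The query syntax (`file <name> +p`), the `dyndbg=` boot parameter, and the `dynamic_debug/control` file come from the howto linked above.

```python
from pathlib import Path

# Standard debugfs location of the dynamic debug control file
# (assumes debugfs is mounted at /sys/kernel/debug).
CONTROL = Path("/sys/kernel/debug/dynamic_debug/control")


def dyndbg_query(source_file: str, enable: bool = True) -> str:
    """Build a query that turns pr_debug() output on or off for one source file."""
    flag = "+p" if enable else "-p"
    return f"file {source_file} {flag}"


def kernel_cmdline_arg(query: str) -> str:
    """Wrap a query for the kernel command line (applies to built-in code)."""
    return f'dyndbg="{query}"'


def apply_at_runtime(query: str, control: Path = CONTROL) -> None:
    """Write the query to the control file; needs root on a running system."""
    control.write_text(query + "\n")


if __name__ == "__main__":
    query = dyndbg_query("irqdomain.c")
    print(query)
    print(kernel_cmdline_arg(query))
```

Since the message in question is printed very early in boot, the command-line form is probably the relevant one here. The runtime write only affects messages emitted after it is applied.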

12:09 esv_ is now known as esv

12:10 esv_ has joined #fedora-riscv

12:10 <davidlt[m]> somlo: ^^

12:10 <davidlt[m]> javierm: nice to know!

12:30 esv_ has quit [Quit: Leaving]

13:43 takuma has quit [Quit: WeeChat 3.5]

14:51 <somlo> javierm: thanks, I'll study that, should come in handy in the future

14:53 <somlo> davidlt[m]: I specifically remember narrowing that down in the past (using printk's :) ) to 0x3F (instead of the default `65535` suggested by the sample .dts automatically generated by chisel)

14:53 <somlo> I'm trying again with that value in .dts (and so far I got no complaints from the kernel during that phase of booting, it's still working its way through the later stages, we'll see how it shakes out)

14:55 <somlo> but ultimately I think I'll have to remember all the chisel I didn't quite comprehend in the first place and figure out why the rocket generator suggests that wrong 0xFFFF (65535) value in the first place, and maybe suggest a fix

14:59 <somlo> side note: generating a matching .dts for the LiteX bitstream is still a semi-manual process, at least when using Rocket as the cpu

14:59 <davidlt[m]> Yeah

14:59 <davidlt[m]> IIRC generate DTS is not upstream quality for Linux

14:59 <davidlt[m]> *generated

15:00 <somlo> and fully automating that would strongly benefit from being able to trust the sample .dts produced during chisel's elaboration of rocket sources (and then "pasting" in a bunch of extra device information from LiteX -- mmio register addresses, irq numbers, etc)

15:00 <somlo> which is obviously not the case at the moment :)

15:06 <somlo> http://mirror.ini.cmu.edu/litex/fed_1c_2.log

15:07 <somlo> maybe there's a bug in the LiteUART driver -- I used to be able to use the sbi console (ecall-ing into machine mode and having the hypervisor -- BBL -- do the console work on my behalf)

15:07 <somlo> wondering if that's something OpenSBI also supports...

15:08 <davidlt[m]> this time you didn't have irq thing during boot

15:08 <davidlt[m]> apart from the final crash :)

15:08 <somlo> yeah, I fixed the offending mask in DT

15:22 guerby_ is now known as guerby

15:27 <davidlt[m]> Good, but did it fail the same?

15:27 <davidlt[m]> I don't recall the 1st log.

15:44 <somlo> same-ish -- most of my crashes are some interrupt thing, most of the time related to the uart

16:53 davidlt has quit [Ping timeout: 252 seconds]

17:23 jcajka has quit [Quit: Leaving]

17:55 davidlt has joined #fedora-riscv

18:52 <davidlt[m]> nirik rwmjones djdelorie : I sent an email about kojid reconfiguration

18:53 <davidlt[m]> Feel free to kill any running job. There are none right now. Not sure if Miro will send something for the night.

19:16 <nirik> changes made here.

19:24 <djdelorie> davidlt[m]: the plugin is giving me errors: AttributeError: 'NoneType' object has no attribute 'origin'

19:24 <djdelorie> do I need to install it?

19:24 <davidlt[m]> could you check that the syntax is ok?

19:25 <davidlt[m]> I had this before, but it was wrong syntax in kojid.conf

19:25 <davidlt[m]> Has to be: plugins = rpmautospec_builder

19:25 <davidlt[m]> disk image already has everything installed

19:26 <djdelorie> yeah, the file had quotes around it

19:26 <davidlt[m]> that's an old typo :)
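
Why quotes around the plugin name can break things can be illustrated with Python's standard `configparser`, which keeps quotes as part of the value. That kojid reads kojid.conf with an INI parser of this kind, and under a `[kojid]` section, is an assumption here. `read_plugins` is a hypothetical helper, not part of Koji.

```python
import configparser
from typing import List


def read_plugins(conf_text: str) -> List[str]:
    """Return the whitespace-separated plugin names from a kojid.conf-style text.

    Assumes the option lives in a [kojid] section.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.get("kojid", "plugins").split()


GOOD = "[kojid]\nplugins = rpmautospec_builder\n"
QUOTED = '[kojid]\nplugins = "rpmautospec_builder"\n'

if __name__ == "__main__":
    print(read_plugins(GOOD))
    print(read_plugins(QUOTED))
```

With the quoted form the parsed name includes the literal quote characters, so a lookup for a plugin with that exact name would not find the installed one.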

19:34 <djdelorie> rebooted, waiting to see if the timeout works

19:35 <davidlt[m]> ok

19:35 <davidlt[m]> You can always check Koji status/pings: http://fedora.riscv.rocks/koji/hostinfo?hostID=196

19:35 <djdelorie> journalctl is showing me what I need; it failed to login the first time, I'm watching to see if it works the second time

19:36 <djdelorie> 'Temporary failure in name resolution'

19:36 <davidlt[m]> dns?

19:37 <djdelorie> I think it's just missing some dependency on a network service and tries to start too early

19:41 <djdelorie> I think "After=network.target" is too soon; perhaps "After=network-online.target" ?

19:42 <djdelorie> also, it started up correctly after the timeout :-)

19:42 <davidlt[m]> Could be, I don't have it set to start on boot (I probably should)

19:43 <davidlt[m]> If you figure out the right dependency we should use it

19:43 <djdelorie> rebooting again :-)

19:47 <djdelorie> didn't help, but the timeout works, so "good enough for me"
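
One possible reason the After=network-online.target change did not help: in systemd, After= only orders units and does not pull the target in, so it is normally paired with Wants=network-online.target, and a wait-online service (for example NetworkManager-wait-online.service) has to be enabled. Whether that applies to this builder is not known from the log. The sketch below is a small hypothetical checker for that pairing. It uses `configparser` as a rough unit-file reader, which is an approximation and not systemd's own parser.

```python
import configparser

TARGET = "network-online.target"


def orders_without_pulling_in(unit_text: str, target: str = TARGET) -> bool:
    """True if [Unit] has After=<target> but no Wants=/Requires=<target>."""
    # interpolation=None so systemd specifiers such as %n are left alone;
    # strict=False tolerates repeated keys, which unit files allow.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # unit-file keys are case sensitive
    parser.read_string(unit_text)
    if not parser.has_section("Unit"):
        return False
    unit = parser["Unit"]
    after = unit.get("After", "").split()
    pulled_in = unit.get("Wants", "").split() + unit.get("Requires", "").split()
    return target in after and target not in pulled_in


if __name__ == "__main__":
    # Hypothetical unit contents; the real kojid.service may differ.
    sample = "[Unit]\nDescription=Koji build daemon\nAfter=network-online.target\n"
    print(orders_without_pulling_in(sample))
```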

20:13 davidlt has quit [Ping timeout: 260 seconds]