#osdev on 2021-07-08 — irc logs at libera.irclog.whitequark.org

2021-05-23 01:57 klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books

00:05 tacco has quit []

00:10 <geist> you could mask it at the GIC

00:11 <geist> but yeah i think if IRQs are disabled in EL2 and you bounce to EL1 where they're reenabled, if the GIC is still asserting it it may still fire?

00:11 <geist> what you can't do is just configure the GIC to deliver an IRQ to a particular EL. that would be lovely, but IIRC the cpu still logically has a single /IRQ and /FIQ line

00:11 <gorgonical> I don't *think* the implementation relies on GIC3 behavior, so it must be something about ARM interrupts, I think

00:11 <gorgonical> I see

00:12 <gorgonical> So the logic for how interrupts get delivered is going to rely on the ARM chip's handling logic

00:12 <geist> i think so yeah. there's some switcharoo you need to do when entering the guests and back to the dom0, etc

00:13 <gorgonical> So then the hcr_el2.imo configuration is probably so that when running in EL1 the interrupt gets delivered straight in

00:13 <geist> yah, which may be the case when you're in say a dom0 in a type 2 style

00:13 <geist> but then when you switch to a guest EL1 with no hard IRQs routed to it, you program the HCR_EL2 to route all IRQS to EL2 first

00:13 <gorgonical> Rather than the chip jumping up to el2 and using that vbar. So in the case where we are already in el2 and the interrupt fires, you are thinking that by not eoi-ing the interrupt and switching back to el1, the interrupt might re-fire

00:14 <geist> yah i bet so. which EL1 are you switching to? a guest or the dom0 (if you have one)?

00:14 <gorgonical> So this is a type-1 hypervisor. It's hafnium, where hafnium boots up and starts the "primary VM" according to ARM's FF-A spec

00:14 <geist> got it

00:15 <geist> so in that case is it that *all* irqs are routed to EL2?

00:15 <gorgonical> And the primary VM is responsible for providing scheduling for the hypervisor, whose only job really is to trap instructions and do context/world switches

00:15 <gorgonical> It seems actually that when running in secondary VMs the IRQs do get bounced into EL2, but when running the primary VM they are not bounced into el2

00:16 <geist> seems like in that case EL2 should probably claim all irqs for itself and then just manually bounce ones it wants to pass through by faking out an IRQ

00:16 <geist> ah, so that seems like the difference in programming of the hcr_el2 when running the dom0 vs guests

00:17 <geist> that sounds like a very loose interpretation of type 1 really... the fact that there's a single guest that has more privileges than the others and also takes IRQs really smells like a type 2

00:17 <geist> if type 1 and 2 even mean anything anymore, frankly

00:17 <gorgonical> so what do you mean by "faking out" an IRQ? because AFAIK the hypervisor just directly switches back into the primary VM when a "general" IRQ happens, no configuring something in the vmcs like in intel

00:17 <geist> faking out as in branching to the EL1s vbar as if an irq was delivered

00:17 <geist> setting FAR, etc etc

00:18 <gorgonical> I guess it could be doing that. It definitely could interrogate the sysregs to find out

00:18 <geist> but honestly i have not walked through how this is supposed to work. we have a vcpu thing in fuchsia but i didn't write it

00:18 <geist> i have a general knowledge of the cpu features, but haven't worked through the details of exactly what is reconfigured in EL2 when switching around

00:19 <gorgonical> But the hafnium code doesn't suggest it's doing that

00:19 <gorgonical> Hold on

00:20 <gorgonical> https://hafnium.googlesource.com/hafnium/+/refs/heads/master/src/arch/aarch64/hypervisor/handler.c#674

00:20 <bslsk05> hafnium.googlesource.com: src/arch/aarch64/hypervisor/handler.c - hafnium - Git at Google

00:20 trufas has quit [Ping timeout: 265 seconds]

00:20 trufas has joined #osdev

00:20 <gorgonical> This is linked to from here: https://hafnium.googlesource.com/hafnium/+/refs/heads/master/src/arch/aarch64/hypervisor/exceptions.S#108

00:20 <bslsk05> hafnium.googlesource.com: src/arch/aarch64/hypervisor/exceptions.S - hafnium - Git at Google

00:22 <gorgonical> And basically that function in irq_lower ends up here: https://hafnium.googlesource.com/hafnium/+/refs/heads/master/src/api.c#74

00:22 <bslsk05> hafnium.googlesource.com: src/api.c - hafnium - Git at Google

00:22 * geist nods

00:23 <gorgonical> I know you said this is a little out of your purview, but I'm doing my diligence to make sure I haven't overlooked something

00:23 <geist> yah

00:23 <geist> dunno, a bit busy right now to try to grok this super complicated topic

00:23 <geist> but i think you're on the right track

00:25 <gorgonical> My heuristic conclusion is basically that I think the interrupt *has* to re-fire if it's active and you enter a different (lower or higher? Maybe only lower?) EL

00:25 <geist> unless the HCR_EL2 always redirects it to EL2?

00:25 <geist> but it could be reprogrammed in all of that sequence

00:26 <geist> which would make sense. you're switching to dom0, want the IRQ to fire, so you turn off the 'send everything to EL2' then enter EL1 in which case it instantly fires

00:26 <geist> because the IRQ is still level asserted

00:26 divine has joined #osdev

00:26 <gorgonical> I guess what I don't understand is the logic around how you can receive the irq and just keep going

00:26 <geist> that's what i'd assume happens here

00:26 <geist> what do you mean?

00:26 <gorgonical> It seems strange to me that you can "ignore" the IRQ after getting it and simply choose to switch ELs

00:27 <geist> why? that seems totally what you want to do

00:27 <gorgonical> Oh sure, that is exactly what we want and quite useful. Just a very unfamiliar idea

00:27 <geist> as in, EL2's job is not to deal with IRQs (except maybe a virtual timer) but to route the cpu between EL1 contexts that can

00:27 <gorgonical> I don't think any sort of mechanism like this exists in intel. Once the interrupt fires it's gone and you have to figure out which one fires

00:27 <geist> think of EL2 as the microcode in intel x86 that's running during vmcall/vmexit, etc

00:28 <geist> all of that magic black box is exposed here, but you can implement something that more or less does the sameish thing

00:29 <gorgonical> I'll say it's a little cleaner in that it's the same mechanism. IRQ fires but don't want to handle it? Ignore it, switch to a different EL, and hope the handlers there deal with it

00:29 <geist> yah

00:30 <gorgonical> As opposed to intel: an interrupt fired in the guest so it exited, but the interrupt status is in the vmcs and so if you want to deal with it you have to interrogate that and mutate state that way, separate of your own idt

00:30 <geist> and it's even possible they wont. EL1 dom0 may have entered EL2 originally as part of some hypercall, with irqs disabled (at EL1)

00:30 <geist> so you bounce back, it runs some more in EL1 until it unmasks its irqs and then boom it fires

00:30 <geist> that would be the reason you wouldn't want to 'fake an irq'

00:30 <gorgonical> That is a very good point

00:31 <geist> you possibly could, but given that the IRQs delivery mechanism in arm is cheap there's probably little reason to try to short circuit that

00:34 vdamewood has joined #osdev

00:43 vinleod has joined #osdev

00:45 vdamewood has quit [Ping timeout: 240 seconds]

01:02 freakazoid333 has quit [Read error: Connection reset by peer]

01:07 ElectronApps has joined #osdev

01:09 vinleod is now known as vdamewood

01:19 _mrlemke_ has quit [Read error: Connection reset by peer]

01:19 _mrlemke_ has joined #osdev

01:20 CryptoDavid has quit [Quit: Connection closed for inactivity]

01:35 isaacwoods has quit [Quit: WeeChat 3.2]

01:46 heat has joined #osdev

01:52 gog has quit [Ping timeout: 246 seconds]

01:53 iorem has joined #osdev

02:15 aquijoule__ has joined #osdev

02:17 aquijoule_ has quit [Ping timeout: 246 seconds]

02:23 _mrlemke_ has quit [Quit: Konversation terminated!]

02:29 Brnocrist has quit [Ping timeout: 272 seconds]

02:30 Brnocrist has joined #osdev

02:42 heat has quit [Ping timeout: 268 seconds]

02:52 sts-q has joined #osdev

03:06 <gorgonical> Utterly incomprehensible. The Hafnium docs say "interrupts are owned by the primary" and so that gets switched to if one happens while running a secondary (e.g. timer), but it's not via the virtual interrupt system and I honestly have no idea how

03:06 ElectronApps has quit [Read error: Connection reset by peer]

03:08 ElectronApps has joined #osdev

03:24 <gorgonical> Wait! From a programmer's guide for the generic timer, I think I have figured it out!

03:26 <gorgonical> So once the interrupt gets delivered, it's now masking that same interrupt from being received again, as one expects so you can actually handle it. However, doing a world switch back to the primary VM requires an eret, which will clear the interrupt status, and because the timer is *level-sensitive*, it will immediately re-deliver the interrupt after the eret executes. At that point, the target

03:26 <gorgonical> exception level registers will be in effect and that's how you defer the work

03:45 iorem has quit [Quit: Connection closed]

03:54 Izem has joined #osdev

04:03 srjek|home has quit [Ping timeout: 252 seconds]

04:34 nyah has quit [Ping timeout: 240 seconds]

04:41 froggey has quit [Ping timeout: 265 seconds]

04:42 froggey has joined #osdev

05:32 ElectronApps has quit [Ping timeout: 240 seconds]

05:33 ElectronApps has joined #osdev

05:34 <doug16k> gorgonical, figuring out arm docs is more impressive than implementing code :P

05:52 sprock has quit [Ping timeout: 258 seconds]

06:05 <geist> gorgonical: right

06:06 <geist> the interrupt masking bit is banked per level, so as you drop to the lower level it just fires (if the lower level has IRQs enabled)
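
The re-fire behaviour described above can be sketched as a toy state machine. This is only an illustration under stated assumptions: the `Cpu` struct, its field names, and `defer_to_primary` are hypothetical, `route_to_el2` merely stands in for HCR_EL2.IMO, the saved mask stands in for the bit restored from SPSR_EL2 on `eret`, and EL3, FIQs, and virtual interrupts are ignored. It is not Hafnium code.

```cpp
#include <cassert>

// Toy model of a level-sensitive IRQ on an AArch64-like core.
// All names are illustrative; EL3, FIQs and virtual IRQs are ignored.
enum class EL { EL1 = 1, EL2 = 2 };

struct Cpu {
    EL   current      = EL::EL1;
    bool irq_masked   = false;     // mask bit of the running context
    bool route_to_el2 = true;      // stands in for HCR_EL2.IMO
    bool line_high    = false;     // level-sensitive source, e.g. a timer
    EL   saved_el     = EL::EL1;   // stands in for the EL saved in SPSR_EL2
    bool saved_mask   = false;     // stands in for the mask saved in SPSR_EL2
    int  taken_el1 = 0, taken_el2 = 0;

    // Deliver the pending IRQ if routing and masking allow it.
    void poll() {
        if (!line_high) return;
        if (current == EL::EL1 && route_to_el2) {
            // Routed to a higher EL: EL1's own mask bit does not hold it off.
            saved_el = current;
            saved_mask = irq_masked;
            current = EL::EL2;
            irq_masked = true;     // exception entry masks further IRQs
            ++taken_el2;
        } else if (current == EL::EL1 && !irq_masked) {
            irq_masked = true;     // taken directly at EL1
            ++taken_el1;
        }
        // EL2 is not modelled as handling IRQs itself; the line stays high.
    }

    // Return to the lower EL, restoring its saved mask bit.
    void eret() {
        current = saved_el;
        irq_masked = saved_mask;
        poll();  // the source was never serviced, so it is still asserted
    }
};

// Timer fires while a secondary VM runs; EL2 defers it to the primary VM.
inline Cpu defer_to_primary(bool primary_has_irqs_masked) {
    Cpu cpu;
    cpu.line_high = true;
    cpu.poll();                               // trapped to EL2
    cpu.route_to_el2 = false;                 // primary VM takes IRQs itself
    cpu.saved_mask = primary_has_irqs_masked; // switch to primary's context
    cpu.eret();                               // re-fires at EL1 if unmasked
    return cpu;
}
```

With `primary_has_irqs_masked` set to true the model leaves the interrupt pending after the `eret`, which matches the point above that the lower level only takes it once it has IRQs enabled.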

06:13 silverballz is now known as silverwhitefish

06:13 <doug16k> linear probe hash table with interned hashed string key makes it instantaneous compared to map<string, ...>. this code was mostly just an elaborate series of strcmp calls before :P

06:15 <doug16k> just linear search the whole thing would be fast enough, but na, I have to go and just jump straight to the right one most of the time :P

06:15 <doug16k> combined with constexpr trick to compile-time pre-hash string literals

06:15 <doug16k> so it just knows the hash already :P

06:16 <doug16k> I want to see how fast I can get map<string abuse to run

06:18 <doug16k> kind of intrusive hash table, where your key needs to have a .hash member

06:20 <doug16k> that lets you just tell it the key and saves tons of hashing execution time

06:20 <doug16k> er, just tell it the hash

06:23 <doug16k> so far the actual memcmp of the key is pointless, haven't seen a hash collision yet
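
A minimal sketch of the idea described above: a key that carries a precomputed hash, a `constexpr` hash so string literals are hashed at compile time, and a linear-probe table that never re-hashes. The names `fnv1a`, `Key`, and `ProbeMap` are hypothetical, FNV-1a is an assumed choice of hash (the log does not say which one was used), and the sketch keeps the full key comparison even though no collision had been observed. Requires C++17.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// 64-bit FNV-1a, usable at compile time (assumed hash, not from the log).
constexpr std::uint64_t fnv1a(std::string_view s) {
    std::uint64_t h = 14695981039346656037ull;
    for (char c : s) {
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ull;
    }
    return h;
}

// Key that carries its own hash, so the table never hashes it again.
// The text must outlive the table (string literals do).
struct Key {
    std::string_view text;
    std::uint64_t hash;
    constexpr Key(std::string_view t) : text(t), hash(fnv1a(t)) {}
};

// Fixed-capacity open-addressing table with linear probing.
// N must be a power of two; one slot is always kept free so probes terminate.
template <typename V, std::size_t N>
class ProbeMap {
    static_assert(N > 1 && (N & (N - 1)) == 0, "N must be a power of two");
    struct Slot { Key key{""}; V value{}; bool used = false; };
    std::array<Slot, N> slots_{};
    std::size_t size_ = 0;

public:
    bool insert(const Key& k, V v) {
        for (std::size_t i = k.hash & (N - 1);; i = (i + 1) & (N - 1)) {
            Slot& s = slots_[i];
            if (!s.used) {
                if (size_ + 1 == N) return false;  // table is full
                s = Slot{k, v, true};
                ++size_;
                return true;
            }
            // Cheap hash compare first, full text compare only on a match.
            if (s.key.hash == k.hash && s.key.text == k.text) {
                s.value = v;
                return true;
            }
        }
    }

    const V* find(const Key& k) const {
        for (std::size_t i = k.hash & (N - 1);; i = (i + 1) & (N - 1)) {
            const Slot& s = slots_[i];
            if (!s.used) return nullptr;
            if (s.key.hash == k.hash && s.key.text == k.text) return &s.value;
        }
    }
};

// The hash of a literal is computed by the compiler, not at run time.
static_assert(Key{""}.hash == 14695981039346656037ull, "empty-string FNV-1a");
```

A lookup such as `table.find(Key{"timer"})` then starts probing at the right slot without touching the string unless the stored hash already matches.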

06:38 <clever> that reminds me

06:39 <clever> internally, the haskell json library, uses hashmaps for its objects

06:39 <clever> and i have heard of an attack, where a user generates json, with keys that intentionally collide in the hashmap

06:39 <clever> and that makes the hashmap performance take a nosedive, handling keys in the same bucket
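
The collision attack mentioned above can be illustrated with a deliberately weak hash. This is a sketch only: `weak_hash` (a byte sum) and `longest_chain` are hypothetical and chosen so collisions are easy to build by hand. Real attacks target the actual hash a library uses, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Deliberately weak hash (sum of bytes): any two anagrams collide.
std::size_t weak_hash(const std::string& s) {
    std::size_t h = 0;
    for (unsigned char c : s) h += c;
    return h;
}

// Count how many keys land in each bucket and return the longest chain.
// A long chain means lookups in that bucket degrade towards a linear scan.
std::size_t longest_chain(const std::vector<std::string>& keys,
                          std::size_t buckets) {
    std::vector<std::size_t> count(buckets, 0);
    std::size_t worst = 0;
    for (const auto& k : keys) {
        std::size_t n = ++count[weak_hash(k) % buckets];
        if (n > worst) worst = n;
    }
    return worst;
}
```

Feeding it the six permutations of "abc" puts every key in one bucket, while six unrelated single-letter keys spread out. A common mitigation, stated here as general practice rather than anything from the log, is to mix a per-process random seed into the hash.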

06:46 <sahibatko> Is it just my impression or is the June 2021 revision of Intel manuals quite a lot refactored compared to the one from 2016? I mean, not just added stuff regarding PML5, but changed some "naming" too - is there perhaps a "highlighted changes" version available?

06:54 <doug16k> I have seen versions that are just the diff

06:59 <doug16k> sahibatko, there's this https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-documentation-changes.html

06:59 <bslsk05> software.intel.com: Intel® 64 and IA-32 Architectures Software Developer's Manual...

07:00 <vdamewood> Does this mean I need to order new hardcopies from Lulu?

07:00 <doug16k> shows the changes indexed by section

07:03 <doug16k> changes highlighted in that

07:15 <sahibatko> doug16k:that seems to be it, thanks for the link

07:24 Izem has left #osdev [Good Bye]

07:32 <Arsen> I have an arm server that has an empty dtb and theoretically exposes all configuration over ACPI, but it appears that, with linux, if you set arm-smmu.disable_bypass=1 linux prevents the drive controller being written to (unknown stream id 0x800)

07:32 <Arsen> my guess this is meant to come from the IORT ACPI table, but I'm not sure how ARM IOMMU works, so I figured asking here would help

07:33 <Arsen> if it does come from there, then it's probably a firmware bug, upon inspecting that table with acpidump and iasl I found nothing that'd lead me to believe the stream id is referenced anywhere, but I also lack knowledge of ACPI :P

07:34 <Arsen> also I know this isn't strictly osdev but I'm decently sure there's no more appropriate place to ask

07:36 <geist> yah i guess something has to tell it which smmu ids map to what

07:36 <geist> surprised it has no DTB. guess it was generally designed to boot windows

07:36 <Arsen> the code agrees with my guess, and yours too then

07:37 <Arsen> well, it has a dtb, it's just perfectly empty :P indicating that all info should be coming from ACPI

07:37 <Arsen> but if they didn't expose this info in ACPI I should complain to the fw vendor, right?

07:38 <geist> the only arm server i've worked with (gigabyte based board, ThunderX2 cavium) has i think a fairly complete DTB and ACPI

07:38 <Arsen> that sounds like a smarter approach than what lenovo took here

07:38 <gorgonical> ive always wanted to work on one of the thunderx2's

07:39 <geist> yah all in all it's a nice machine

07:39 <gorgonical> I like the subversion of the regular expectation that arm == sbcs and thus low power

07:39 <gorgonical> Really gets my goat when people derisively say arm can't be fast because reason xyz and intel/amd will reign forever

07:40 <geist> well, the apple M1 proved them wrong

07:40 <geist> but indeed

07:41 <geist> as much as i'm not particularly happy that it's a vertical thing like apple, i am pleased that they actually produced a competitive microarchitecture not based on x86

07:41 <Arsen> this server is quite the opposite, it was meant to serve as a hypervisor before it stopped booting because this got merged: https://patchwork.kernel.org/project/linux-arm-kernel/patch/20190301192017.39770-1-dianders@chromium.org/

07:41 <bslsk05> patchwork.kernel.org: [v2] iommu/arm-smmu: Break insecure users by disabling bypass by default - Patchwork

07:41 <Arsen> (the opposite of a low power sbc)

07:42 <gorgonical> and not to mention that fujitsu is currently in the business of building supercomputers with arm

07:47 gioyik has quit [Quit: WeeChat 3.1]

07:48 <Arsen> but yeah this looks like a fun fw bug, does that sound right?

07:59 nyah has joined #osdev

08:00 ElectronApps has quit [Read error: Connection reset by peer]

08:03 ElectronApps has joined #osdev

08:12 gareppa has joined #osdev

08:13 gareppa has quit [Remote host closed the connection]

08:27 sprock has joined #osdev

08:37 sprock has quit [Ping timeout: 272 seconds]

08:45 sortie has quit [Remote host closed the connection]

08:48 sortie has joined #osdev

08:56 ZipCPU has quit [Ping timeout: 268 seconds]

09:02 zoey has quit [Ping timeout: 246 seconds]

09:16 GeDaMo has joined #osdev

09:29 z_is_stimky_ has quit [Read error: Connection reset by peer]

09:29 z_is_stimky has joined #osdev

09:53 dormito has quit [Ping timeout: 252 seconds]

10:14 dennis95 has joined #osdev

10:14 ZetItUp has joined #osdev

10:25 dormito has joined #osdev

10:26 ZipCPU has joined #osdev

10:50 ElectronApps has quit [Remote host closed the connection]

10:55 ElectronApps has joined #osdev

11:32 ElectronApps has quit [Remote host closed the connection]

11:34 ElectronApps has joined #osdev

11:49 isaacwoods has joined #osdev

12:21 gog has joined #osdev

12:51 gareppa has joined #osdev

12:53 gareppa has quit [Remote host closed the connection]

12:55 CryptoDavid has joined #osdev

13:01 ElectronApps has quit [Read error: Connection reset by peer]

13:03 ElectronApps has joined #osdev

13:10 ahalaney has joined #osdev

13:19 iorem has joined #osdev

14:30 Brnocrist has quit [Ping timeout: 265 seconds]

14:40 ElectronApps has quit [Remote host closed the connection]

14:44 srjek|home has joined #osdev

15:05 freakazoid333 has joined #osdev

15:19 Raito_Bezarius has quit [Ping timeout: 240 seconds]

15:32 Raito_Bezarius has joined #osdev

15:43 mahmutov has joined #osdev

15:44 gioyik has joined #osdev

15:56 mahmutov has quit [Ping timeout: 268 seconds]

15:56 iorem has quit [Quit: Connection closed]

16:11 smarton has quit [Changing host]

16:11 smarton has joined #osdev

16:11 srjek|home has quit [Ping timeout: 246 seconds]

16:36 silverwhitefish has quit [Quit: One for all, all for One (2 Corinthians 5)]

16:37 silverwhitefish has joined #osdev

16:38 silverwhitefish has quit [Client Quit]

16:38 silverwhitefish has joined #osdev

16:45 tenshi has joined #osdev

16:57 theseb has joined #osdev

17:00 <theseb> newbie locking question....Imagine you tried to implement a locking mechanism so that only one process had write permission on a file.....How do you avoid race conditions for that?....e.g. Imagine 10 processes simultaneously check to see the file is not locked....then they will ALL grab write permission at the same time! How do you avoid that?

17:12 mahmutov has joined #osdev

17:21 <tenshi> test-and-set the lock
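
A minimal sketch of what test-and-set means here: the check and the grab happen as one indivisible step, so of ten racing callers exactly one sees the lock as previously free. The `WriteLock` class is hypothetical, and `std::atomic<bool>::exchange` plays the role of the test-and-set instruction. It assumes all contenders share the same memory (threads, or a kernel arbitrating on behalf of processes).

```cpp
#include <atomic>
#include <cassert>

// Hypothetical write lock for one file; name and interface are illustrative.
class WriteLock {
    std::atomic<bool> held_{false};

public:
    // Atomically set the flag and learn its previous value in one step.
    // Only the caller that observes "previously false" owns the lock.
    bool try_acquire() noexcept {
        return !held_.exchange(true, std::memory_order_acquire);
    }

    // Give the lock back so another caller's try_acquire can succeed.
    void release() noexcept {
        held_.store(false, std::memory_order_release);
    }
};
```

The broken version from the question is a separate `if (!held) held = true;`, where every caller can pass the check before any of them performs the store.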

17:27 CryptoDavid has quit [Quit: Connection closed for inactivity]

17:33 sts-q has quit [Remote host closed the connection]

17:36 Skyz has joined #osdev

17:47 <theseb> tenshi: k, thanks

17:47 theseb has quit [Quit: Leaving]

17:48 zoey has joined #osdev

18:07 tacco has joined #osdev

18:24 freakazoid333 has quit [Read error: Connection reset by peer]

18:28 dennis95 has quit [Quit: Leaving]

18:44 sprock has joined #osdev

18:50 drewlander has quit [Quit: ZNC 1.7.2+deb3 - https://znc.in]

18:52 drewlander has joined #osdev

19:00 ephemer0l has joined #osdev

19:21 Skyz has quit [Quit: Client closed]

19:38 nly has joined #osdev

19:58 Skyz has joined #osdev

19:58 nly has quit [Quit: Client closed]

20:05 nly has joined #osdev

20:06 <geist> neat. got my floppy drive controller emulator thingy in

20:08 <geist> gotek floppy emulator

20:08 <mjg> :)

20:09 <mjg> emulate atari tape recorder instead

20:09 <gog> you own systems with a floppy connector?

20:10 <mjg> probably nobody does, but their bioses likely still have code for it

20:10 <mjg> still, afair qeme likes the fdd

20:10 <mjg> qemu

20:10 <geist> yes and also these are useful for replacing the floppy drive in old industrial equipment

20:10 <geist> and i have a roland keyboard with a floppy drive

20:10 <gog> ah i see

20:11 <geist> but also figured it might be fun to load it up with a bunch of disk images and then put it on the 486 or whatnot

20:11 <geist> http://www.gotekemulator.com/P_view.asp?pid=54

20:11 <bslsk05> www.gotekemulator.com: SFR1M44-U100 GoTek 3.5Inch 1.44MB USB SSD Floppy Driver Emulator - GoTek USB Floppy Emulator Manufacturer Factory

20:12 <geist> if this works i might order a black one to replace on the roland, since its floppy disk hasn't worked in forever

20:13 Skyz has quit [Quit: Client closed]

20:18 Skyz has joined #osdev

20:24 gog has quit [Ping timeout: 258 seconds]

20:26 mahmutov has quit [Ping timeout: 252 seconds]

20:28 mahmutov has joined #osdev

20:30 GeDaMo has quit [Quit: Leaving.]

20:31 tenshi has quit [Quit: WeeChat 3.2]

20:32 nly has quit [Quit: Client closed]

20:38 CryptoDavid has joined #osdev

20:43 dormito has quit [Ping timeout: 240 seconds]

20:50 MiningMarsh has quit [Ping timeout: 246 seconds]

20:51 MiningMarsh has joined #osdev

20:52 scaleww has joined #osdev

21:18 Skyz has quit [Quit: Client closed]

21:34 srjek|home has joined #osdev

21:35 Skyz has joined #osdev

21:43 dormito has joined #osdev

21:52 <sortie> OK so I have a VM of my OS in the cloud that has frozen and become unresponsive, but I have attached a qemu monitor over VNC

21:52 <sortie> https://pub.sortix.org/sortix/crazy-os/irc.sortix.org-freeze.png

21:52 <geist> yah i was wondering what was up

21:52 <geist> it hasn't been online in a while

21:52 <sortie> 11a42e:f4 hlt

21:52 <sortie> 11a42f:eb fd jmp 0x11a42e ← %rip

21:53 <sortie> So it's midnight and I'm tired but I do want to poke a bit at this lest any unattended upgrades reboot the host Linux

21:53 <sortie> Anyone good at reading registers or knowing qemu able to spot or tell me how to tell whether this VM has triple faulted or something?

21:53 <sortie> Or if this is indeed a deadlock

21:54 <sortie> I have something of a feeling that it's correctly sleeping on hlt, but the interrupt worker thread has stalled somehow, so packets aren't being received and delivered to user-space, so the system is idle

21:54 <sortie> Keyboard input doesn't work on the VM, I imagine for the same reason

21:54 <geist> i assume you're running it with KVM?

21:54 <sortie> KVM yes

21:55 <geist> part of the issue i see there is all the registers are reading zero

21:55 <geist> which, iirc, isn't entirely impossible with KVM sometimes

21:55 <geist> there have been points in time in the past where i've gotten incomplete or incorrect register readings

21:55 <geist> if it's just idling, is that your idle loop?

21:55 <sortie> This would be the kernel idle thread btw, so it doesn't have floating point state

21:55 <sortie> Yes, rip is my hlt loop

21:56 <geist> yah sadly that's how it looks a lot. a stuck but otherwise functioning kernel many times just appears to be in the idle loop

21:56 <geist> could have missed a timer, etc

21:56 <sortie> I assume RFL is the rflags and the 0x200 bit is set so interrupts are on

21:56 <geist> oh i see the regs being zero, the main ones are off the screen

21:57 <geist> yah so maybe poke at the interrupt controller state?

21:57 <sortie> Anyone know how to, uh, paginate stuff in the qemu monitor?

21:57 Skyz has quit [Quit: Client closed]

21:57 <geist> that seems to be a problem. i was actually going to look into that in a bit. when running under gnome i remember you have to make sure you have libvde (i think libvde) for it to link in some vt100/terminal support to the consoles

21:58 <geist> i have no idea if that also applies to VNC. was going to test that theory

21:58 <geist> but side note, you telling me that you can use ctrl-alt-2, etc has changed *everything* for my whole VM solution at home

21:58 <sortie> :D

21:58 <geist> i had resigned myself to the idea that you just can't get to it in my setup

21:58 <geist> but now it's all great

21:58 <sortie> I'm happy to help :)

21:58 <geist> libvte?

21:59 <sortie> I just use a qemu from my distro

21:59 <geist> anyway i'll give that a go in a bit

21:59 Skyz has joined #osdev

21:59 <geist> otherwise yeah you can't paginate or scrollback in it, which is a bummer

21:59 <geist> same with serial0 or so

21:59 <sortie> Mind you that I have a live VM here that reproduced the crash and it takes ~5 days to reproduce it

21:59 <geist> but like i said when building it manually and using a direct ui if it has vte it can let you scroll back in serial at least

21:59 <geist> sounds like you need more vms

22:00 <geist> these sort of hung system bugs are tough. generally the best solution is to start building into it your own watchdog schemes

22:00 <sortie> I believe this also happens on my sortix.sortix.org VM although that could be a different and older issue

22:00 <geist> since a stuck cpu is hard to catch

22:00 <sortie> In any case, just knowing this is a huge hint!

22:00 <geist> we have had to add increasingly complex solutions for zircon as well

22:00 <geist> though mostly in an SMP situation, cross checking each cpu from each other cpu

22:00 <froggey> ctrl+page up/down may or may not scroll the monitor

22:01 <sortie> Good suggestion, froggey, alas, that doesn't work nor any other similar thing I can think of

22:01 Skyz has quit [Client Quit]

22:01 mahmutov has quit [Ping timeout: 255 seconds]

22:01 <sortie> Hmm it could be a deadlock in the interrupt handling

22:01 <geist> froggey: yah that's what i was thinking about with the libvte stuff

22:01 <geist> ie, you dont get it unless qemu is compiled a particular way

22:02 <geist> at least with gnome, i dunno if it applies to VNC or windows or mac uis

22:02 <geist> but may only be gnome specific, since libvte seems to be gnomeish

22:03 <sortie> I can probably fix the problem with a thorough code review for deadlocks and things stalling the interrupt worker thread

22:03 <froggey> mm, yeah. it's really hit-or-miss, which is annoying

22:03 <sortie> Is there any way to check if it triple faulted, just to rule that out?

22:03 <geist> sortie: while that's a good thing, it's the stuff you missed that you didn't think to check

22:03 <geist> some subtle race with the interrupt controller, something isn't acked, or some rare case where no timer is set where it should be, etc

22:04 <geist> once you get stable enough all the fun bugs turn into these sort of things

22:04 <sortie> Also a great suggestion

22:05 <geist> so one thing to do for example is maybe run with a no-hlt mode but then have a monotonic timer running at some rate like 1Hz

22:05 <geist> and in your idle loop check that it is incrementing

22:05 <sortie> Hopefully the VM will stay up tomorrow and Saturday cuz' I'll be busy but then it's VACAY

22:05 <sortie> Then I can debug this properly

22:05 <geist> timer irq bumps a counter, etc

22:05 <geist> this is a case where SMP actually helps, because you can have different cpus cross check each other

22:05 <geist> like have each cpu run a 1Hz timer that bumps a counter, and in the timer check that the other cpus haven't stopped, etc

22:05 <sortie> Timer interrupts should still fire

22:06 <geist> should and are are different things

22:06 <geist> that's the point, it's an assert that stuff is still ticking
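
The cross-checking scheme described above could look roughly like this. It is a sketch under assumptions: `kCpus`, `heartbeat`, `timer_tick`, and `Checker` are hypothetical names, the 1 Hz timer handler is assumed to exist elsewhere, and what to do about a stalled CPU (panic, dump state, reset) is left out.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCpus = 4;  // illustrative CPU count

// One heartbeat counter per CPU, bumped from that CPU's periodic timer IRQ.
std::array<std::atomic<std::uint64_t>, kCpus> heartbeat{};

// Called from the (hypothetical) 1 Hz timer handler running on `cpu`.
void timer_tick(std::size_t cpu) {
    heartbeat[cpu].fetch_add(1, std::memory_order_relaxed);
}

// Each CPU keeps its own Checker and runs check() from its timer handler.
struct Checker {
    std::array<std::uint64_t, kCpus> last_seen{};

    // Returns the first CPU other than `self` whose counter has not moved
    // since the previous check, or kCpus if every other CPU advanced.
    std::size_t check(std::size_t self) {
        std::size_t stalled = kCpus;
        for (std::size_t cpu = 0; cpu < kCpus; ++cpu) {
            std::uint64_t now = heartbeat[cpu].load(std::memory_order_relaxed);
            if (cpu != self && now == last_seen[cpu] && stalled == kCpus)
                stalled = cpu;
            last_seen[cpu] = now;
        }
        return stalled;
    }
};
```

On a single CPU the same counter can be checked from the idle loop instead, which is the no-hlt variant suggested earlier.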

22:06 <sortie> They don't use the interrupt worker thread, so I can likely see if the current time is still incrementing if I can find the memory location

22:06 <geist> right

22:06 <sortie> If that is the case, then I know the interrupt controller is still working correctly

22:06 <geist> also hardware watchdogs are useful here

22:07 <geist> dunno if qemu x86 has one, but if you enable it then you also set the cpu to pet the dog every second or so

22:07 <sortie> I did want to support the qemu watchdog for reliability

22:07 <sortie> Although I would want to delegate that to user-space, so an occasional self-test of availability must pass, or the vm gets destroyed

22:08 <geist> more layers is good

22:08 <geist> i'd suggest having layers that cross check each other

22:08 <sortie> Well thanks for the great conversation

22:08 <geist> yay this is the fun stuff

22:08 <sortie> It's after midnight so now imma sleep but good pointers

22:09 <sortie> The vnc monitor gave me a really good hint I can pursue and you gave me some good tests to run to learn more

22:09 <geist> yah *usually* what happens is your cpu gets in a loop with interrupts disabled, and then HW watchdog catches it

22:10 <geist> in your case it's in the idle loop with ints enabled

22:10 <geist> but, perhaps the IRQ controller is not acked so its stuck

22:10 <geist> or, also possible, a critical timer was not set

22:10 <geist> or, also possible, it's simply a high level deadlock

22:10 <sortie> Somehow I doubt the IRQ controller stuff being wrong since I haven't touched that stuff for many years and it's been stable

22:10 <geist> and the kernel is running fine, there's just no activity

22:10 <sortie> I will of course verify

22:11 <geist> yah but have you stress tested this way for days? this is where you start seeing subtle races you didn't know about

22:11 <sortie> Indeed

22:11 <geist> 1/10 million times sort of things that only shows up after days of ticking

22:11 <geist> anyway. to bed you

22:13 <sortie> :)

22:13 <sortie> Always a pleasure talking

22:30 <klange> My endless adventure in bouncing between projects and never doing what's on the task list in front of me continues as I sketch out some widget toolkit thoughts once again...

22:46 gog has joined #osdev

22:47 ahalaney has quit [Quit: Leaving]

22:49 Robbe has quit [Remote host closed the connection]

22:54 Robbe has joined #osdev

22:59 Brnocrist has joined #osdev

23:18 freakazoid333 has joined #osdev

23:49 iorem has joined #osdev

23:57 Skyz has joined #osdev

23:57 <Skyz> What's the difference between a syscall and system api?