klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
pretty_dumm_guy has quit [Quit: WeeChat 3.4]
<mrvn> vin: the first thing you should learn is that there is no best possible value.
thinkpol has quit [Remote host closed the connection]
thinkpol has joined #osdev
gog has quit [Quit: byee]
<vin> mrvn: Agreed there can't be a single value that can be the best for all workloads but there can be one for a single workload.
<moon-child> best in what respect?
theruran has quit [Ping timeout: 240 seconds]
theruran has joined #osdev
gxt has quit [Remote host closed the connection]
gxt has joined #osdev
joe9 has quit [Quit: leaving]
joe9 has joined #osdev
knusbaum has joined #osdev
Guest127 has quit [Ping timeout: 240 seconds]
<knusbaum> Hi everyone. I'm working on generating an ELF executable. Based on the docs (https://wiki.osdev.org/ELF, http://www.skyfree.org/linux/references/ELF_Format.pdf) it looks like the section header table is optional for an executable and that only a program header table is required. Is that right?
<bslsk05> ​wiki.osdev.org: ELF - OSDev Wiki
<Mutabah> That's correct
<knusbaum> I'm structuring my file like this: ELF header, Program Header, TEXT
<Mutabah> Executables (anything loadable really), use the program headers
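Editor's sketch of the layout knusbaum describes (ELF header, one program header, then code), using the glibc `<elf.h>` types. The x86-64 target, the load base, and the helper names `make_ehdr` and `make_load_phdr` are assumptions for illustration, not taken from knusbaum's linker. Mapping the single PT_LOAD segment from file offset 0 is one common choice, not the only valid one.

```c
#include <elf.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: [Elf64_Ehdr][one Elf64_Phdr][code bytes]. */
enum { CODE_OFFSET = sizeof(Elf64_Ehdr) + sizeof(Elf64_Phdr) };

/* ELF header for a static executable with one program header and
 * no section header table (e_shoff, e_shnum stay 0). */
static Elf64_Ehdr make_ehdr(uint64_t load_base)
{
    Elf64_Ehdr eh;
    memset(&eh, 0, sizeof eh);
    memcpy(eh.e_ident, ELFMAG, SELFMAG);
    eh.e_ident[EI_CLASS]   = ELFCLASS64;
    eh.e_ident[EI_DATA]    = ELFDATA2LSB;
    eh.e_ident[EI_VERSION] = EV_CURRENT;
    eh.e_type      = ET_EXEC;
    eh.e_machine   = EM_X86_64;          /* assumed target */
    eh.e_version   = EV_CURRENT;
    eh.e_entry     = load_base + CODE_OFFSET;
    eh.e_phoff     = sizeof(Elf64_Ehdr); /* phdr follows the header */
    eh.e_ehsize    = sizeof(Elf64_Ehdr);
    eh.e_phentsize = sizeof(Elf64_Phdr);
    eh.e_phnum     = 1;
    return eh;
}

/* One PT_LOAD segment that maps the file from offset 0, headers
 * included, so file offset and virtual address stay congruent. */
static Elf64_Phdr make_load_phdr(uint64_t load_base, uint64_t code_size)
{
    Elf64_Phdr ph;
    memset(&ph, 0, sizeof ph);
    ph.p_type   = PT_LOAD;
    ph.p_flags  = PF_R | PF_X;
    ph.p_offset = 0;
    ph.p_vaddr  = load_base;
    ph.p_paddr  = load_base;
    ph.p_filesz = CODE_OFFSET + code_size;
    ph.p_memsz  = ph.p_filesz;
    ph.p_align  = 0x1000;
    return ph;
}
```

With these types the code starts at file offset 64 + 56 = 0x78, so a page-aligned `load_base` such as 0x400000 gives an entry point of 0x400078.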
<knusbaum> objdump -d doesn't show me any disassembly. I'm not sure if that means my program is structured improperly or if it's because there's no .text section
<knusbaum> When running I get (from gdb) "During startup program terminated with signal SIGSEGV, Segmentation fault"
<Mutabah> Try `objdump -x` to see all the headers
<knusbaum> and it won't show me where execution failed. That indicates to me that loading failed.
<knusbaum> I'll put it in a paste.
<Mutabah> Or `readelf -l`
<knusbaum> Here's objdump -x
<knusbaum> I've been looking at the headers but can't figure out what's wrong with it.
<klange> objdump does select output based on sections, so I'm not sure what it does if there are none for an executable; my usual go to is -D and even that's "all sections"
<knusbaum> Yeah, I tried -D
<knusbaum> no dice.
<knusbaum> I examined the file and the TEXT appears at the expected offset in the file.
<klange> when you get down to just having PHDRs, there's nothing to say what's actually executable code beyond access hints
<Mutabah> Tried readelf?
smeso has quit [Quit: smeso]
<Mutabah> Also, are you manually constructing this file?
<knusbaum> Yes, I'm manually constructing this.
<knusbaum> Writing an assembler and linker.
<Mutabah> `paddr` should probably be set to `vaddr` iirc
<knusbaum> Hmm, I'm not doing that. I noticed other binaries do that, but the docs say paddr is ignored.
<knusbaum> I'll try that.
<klange> My first thought would be that while it's perfectly valid to produce executables with only phdrs, the other tools will thank you if you keep sections in your final outputs
Jari-- has joined #osdev
<Jari--> morning #osdev
<klange> Ah you're just a few minutes late for it to still be morning here.
<knusbaum> klange, Yeah I was thinking about adding a .text header just for convenience.
<knusbaum> Ok. I added paddr and same result.
<knusbaum> Hmm. Sprunge is barfing. Here's GDB output: https://pastebin.com/ebDxLtDq
<bslsk05> ​pastebin.com: During startup program terminated with signal SIGSEGV, Segmentation fault.(gdb - Pastebin.com
<knusbaum> Is it fair to say that it's not the TEXT that's bad, it's that the program failed to load?
<knusbaum> I think the TEXT should fail anyway, but I expected to get a fault at a program counter.
<knusbaum> Ok well I'll add the .text section and see what the tooling can tell me.
smeso has joined #osdev
<knusbaum> Do you know if linux can tell me more about *why* an executable failed to load?
<klange> is it dynamic? ld.so probably can offer more details; if it's not even getting past kernel load, you can even try invoking ld.so directly...
<geist> also there may be some environment vars you can set
<klange> ld.so takes some flags when called directly, or more normally there's some environment variables
<geist> whatsit, LD_DEBUG?
<geist> yeah
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
<Clockface> can NASM build position independent executables?
<Clockface> or would i have to write stuff to do that manually
<Clockface> as in, do i have to be careful to do something with absolute addresses, or can i tell NASM to only do stuff with relative addressing (and ideally complain if i use an abs address)
<moon-child> put 'default rel' at the top of your file
<Clockface> thank you
troseman has quit [Ping timeout: 272 seconds]
[_] has joined #osdev
[itchyjunk] is now known as Guest705
Guest705 has quit [Killed (molybdenum.libera.chat (Nickname regained by services))]
[_] is now known as [itchyjunk]
mahmutov has quit [Ping timeout: 272 seconds]
zaquest has quit [Remote host closed the connection]
zaquest has joined #osdev
bauen1 has quit [Ping timeout: 245 seconds]
bradd has quit [Ping timeout: 256 seconds]
Burgundy has joined #osdev
bradd has joined #osdev
masoudd has joined #osdev
ElectronApps has joined #osdev
srjek has quit [Ping timeout: 240 seconds]
the_lanetly_052 has joined #osdev
[itchyjunk] has quit [Read error: Connection reset by peer]
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
xenos1984 has quit [Remote host closed the connection]
xenos1984 has joined #osdev
bradd has quit [Remote host closed the connection]
bradd has joined #osdev
jeaye is now known as jeaye_
jeaye_ is now known as jeaye
k8yun_ has quit [Quit: Leaving]
mlombard has joined #osdev
nyah has joined #osdev
gog` has joined #osdev
the_lanetly_052 has quit [Ping timeout: 240 seconds]
GeDaMo has joined #osdev
Jari-- has quit [Remote host closed the connection]
the_lanetly_052 has joined #osdev
dormito has quit [Quit: WeeChat 3.3]
bauen1 has joined #osdev
the_lanetly_052 has quit [Ping timeout: 256 seconds]
gxt has quit [Remote host closed the connection]
gxt has joined #osdev
dormito has joined #osdev
bauen1 has quit [Ping timeout: 256 seconds]
ElectronApps has quit [Remote host closed the connection]
immibis has joined #osdev
gog` is now known as gog
Payam has joined #osdev
bauen1 has joined #osdev
bauen1 has quit [Ping timeout: 268 seconds]
dennis95 has joined #osdev
roan has joined #osdev
bauen1 has joined #osdev
ElectronApps has joined #osdev
<klange> I think I've managed to unf*ck my signals sufficiently.
<klange> Only deliver on return to userspace, store userspace context in userspace stacks, support restarting system calls...
<klange> Resizing a terminal with my editor in it works well, can still suspend/resume, ^C things, etc.
<clever> that reminds me, libpam handles EINTR very poorly
<clever> it considers that to be a fatal error, and passes an error back to the user of the library
<clever> and in my case, i was FFI'ing into pam, with SIGALRM based context switching implemented within userland
<clever> so it had a very high chance of interrupting pam, and pam then failed
<klange> I still have an issue with resizing my editor quickly on the M1 but after much digging I think it's not specifically a signal issue, and I'll look at it deeper tomorrow.
<clever> ive run into issues where screen under xterm queues up every resize event
bauen1 has quit [Ping timeout: 256 seconds]
<clever> and if i resize the window the wrong way, i have to sit and wait 15 seconds
<clever> while it repaints itself, for every size i went thru
<klange> I'm going to play some pokémon and then go to bed.
bauen1 has joined #osdev
SikkiLadho has joined #osdev
SikkiLadho has quit [Ping timeout: 245 seconds]
SikkiLadho has joined #osdev
X-Scale` has joined #osdev
X-Scale has quit [Ping timeout: 256 seconds]
X-Scale` is now known as X-Scale
roan has quit [Ping timeout: 256 seconds]
<SikkiLadho> Hi. I have built a tiny hypervisor for RPI4, it can just load a linux kernel for now. I'm trying to compare the uart logs with and without hypervisor. I see a lot of difference, How should I look for errors which might haunt me in future when I get into Stage 2 memory management? I see "CPU: CPUs started in inconsistent modes" in kernel uart with
<SikkiLadho> the hypervisor. What's causing it?
<bslsk05> ​gist.github.com: uart_log_with_hyp.txt · GitHub
<bslsk05> ​github.com: linux/virt.h at master · torvalds/linux · GitHub
<clever> SikkiLadho: did you change the cpu enable method in DT? are all cores entering linux in SVC mode?
<j`ey> at EL2 that, not SVC mode
<clever> ah right, my mind is still mostly in arm32 mode
<clever> if you have a hypervisor in EL2, then that hypervisor should own EL2 of all cores, and linux should get EL1 i believe?
<j`ey> yeah
<SikkiLadho> I have an MPIDR gate in my hypervisor to only allow the master core, to avoid race conditions. https://github.com/SikkiLadho/Leo/blob/701aa7d0566bfc657d9967f66ea66325ddcd8022/src/boot.S#L20
<bslsk05> ​github.com: Leo/boot.S at 701aa7d0566bfc657d9967f66ea66325ddcd8022 · SikkiLadho/Leo · GitHub
<j`ey> (so I meant EL1 in the previous)
<bslsk05> ​github.com: linux/smp.c at master · torvalds/linux · GitHub
<clever> which is calling is_hyp_mode_mismatched from the file j`ey linked
<bslsk05> ​github.com: linux/smp.c at master · torvalds/linux · GitHub
<clever> [ 0.075625] CPU2: Booted secondary processor 0x0000000002 [0x410fd083]
<clever> and this line, contains the cpu#(2), the mpidr, and the cpuid
<clever> from read_cpuid_id()
<bslsk05> ​github.com: linux/cputype.h at master · torvalds/linux · GitHub
<clever> which is just MIDR_EL1
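Editor's sketch of how the `[0x410fd083]` value in that boot line breaks down, following the architectural MIDR_EL1 field layout. The struct and function names are made up for illustration.

```c
#include <stdint.h>

/* Fields of MIDR_EL1 as laid out in the Arm architecture:
 * [31:24] implementer, [23:20] variant, [19:16] architecture,
 * [15:4] part number, [3:0] revision. */
struct midr_fields {
    uint8_t  implementer;
    uint8_t  variant;
    uint8_t  architecture;
    uint16_t part_number;
    uint8_t  revision;
};

static struct midr_fields decode_midr(uint32_t midr)
{
    struct midr_fields f;
    f.implementer  = (uint8_t)((midr >> 24) & 0xff);
    f.variant      = (uint8_t)((midr >> 20) & 0xf);
    f.architecture = (uint8_t)((midr >> 16) & 0xf);
    f.part_number  = (uint16_t)((midr >> 4) & 0xfff);
    f.revision     = (uint8_t)(midr & 0xf);
    return f;
}
```

For 0x410fd083 this yields implementer 0x41 (Arm), part number 0xd08 (Cortex-A72, the core in the Pi 4) and revision 3.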
<SikkiLadho> Hey clever, I have an MPIDR gate in my hyp, so only the master core is entering it while the others are in a loop, might this be causing the problem? https://github.com/SikkiLadho/Leo/blob/701aa7d0566bfc657d9967f66ea66325ddcd8022/src/boot.S#L20
<clever> SikkiLadho: and how does that function then pass the cpu off to linux?
roan has joined #osdev
<clever> looks like core0 runs kernel_main, while core1/2/3 run proc_hang?
dude12312414 has joined #osdev
<SikkiLadho> yes that is right
<clever> and proc_hang is an infinite loop with no cpu idle'ing
<clever> but, did you tell the arm stub to release the other 3 cores? or is this kernel_old=1?
<clever> it looks like a no
<clever> so core 1/2/3 never actually go into proc_hang
<clever> they are instead still running secondary_spin, from https://github.com/raspberrypi/tools/blob/master/armstubs/armstub8.S#L156-L161
<bslsk05> ​github.com: tools/armstub8.S at master · raspberrypi/tools · GitHub
<SikkiLadho> How do i tell the arm stub to release the other 3 cores?
<clever> you must write to the 3 spin addresses in the device-tree
<j`ey> hm, so it sounds like linux is actually starting them then?
<j`ey> hence the CPU mode mismatch
<clever> j`ey: yeah, the hypervisor never actually grabbed the other 3 cores, so the firmware gave linux those 3 in HYP/EL2 mode
<bslsk05> ​github.com: linux/bcm2711.dtsi at rpi-5.10.y · raspberrypi/linux · GitHub
<clever> SikkiLadho: in the device tree, there is an enable-method of spin-table, and a cpu-release-address of 0xe0, 0xe8, and 0xf0
<clever> if you write a 64bit addr to those slots, and then run the SEV opcode, then secondary_spin will jump to the 64bit addr you wrote
<clever> 0xe0, 0xe8, and 0xf0 matches up to lines 179-188 of armstub8.S
<clever> SikkiLadho: you must also modify the device-tree, to remove that enable method (or re-implement it, or replace it with PSCI)
<SikkiLadho> I have been able to "edit" the device tree with libfdt previously, so I will try to modify it.
<SikkiLadho> Thank you everyone for the direction
<clever> SikkiLadho: so, there are ~4 steps
<clever> 1: delete 3 cores from DT, so linux cant start them in EL2 mode
<clever> 2: your hypervisor starts them itself, so you control all 4 cores
<clever> 3: re-add the cpu cores back to DT (or just modify them), so linux uses something else like PSCI or spin-tables at a new addr
<clever> 4: when the spintable/PSCI gets an event from linux to wake a given core, run linux in EL1 mode on that core
<clever> the same as how you ran linux in EL1 on core0
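Editor's sketch of step 2 above, releasing the secondary cores from a spin table. It assumes the stock armstub layout clever quotes (64-bit slots at 0xe0, 0xe8 and 0xf0) and that a plain store followed by SEV is enough; a real hypervisor may also need cache maintenance if the spinning cores run with caches off. The function names are hypothetical, and the slot pointer is a parameter so the logic can be exercised off-target.

```c
#include <stdint.h>

/* Wake cores parked in WFE. Compiles to nothing on non-AArch64
 * hosts so the rest of the logic can be tested there. */
static inline void send_event(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dsb sy\n\tsev" ::: "memory");
#endif
}

/* Write the same entry point into `count` consecutive 64-bit
 * release slots, then signal the waiting cores. */
static void release_secondaries(volatile uint64_t *slots,
                                unsigned count,
                                uint64_t entry_point)
{
    for (unsigned i = 0; i < count; i++)
        slots[i] = entry_point;
    send_event();
}
```

On the Pi 4 with the stock stub the call would look roughly like `release_secondaries((volatile uint64_t *)0xe0, 3, entry)`, where `entry` is the physical address of the hypervisor's own secondary entry code.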
<mrvn> clever: doesn't the DT say whether to write 32bit or 64bit?
<clever> mrvn: i think its always native width?
<mrvn> I would assume the cell size to match
<clever> Documentation/arm64/booting.rst:- CPUs with a "spin-table" enable-method must have a 'cpu-release-addr'
<clever> arch/arm64/kernel/smp_spin_table.c: .name = "spin-table",
<knusbaum> Hmm. My executable isn't dynamic, so maybe this means nothing, but ld.so gives me this: "./out.o: error while loading shared libraries: ./out.o: ELF load command address/offset not properly aligned"
<bslsk05> ​github.com: linux/smp_spin_table.c at rpi-5.10.y · raspberrypi/linux · GitHub
<clever> mrvn: writeq_relaxed is whats used to write to the addr
SikkiLadho has quit [Ping timeout: 272 seconds]
<clever> mrvn: which i think is 64bit based
<mrvn> knusbaum: if your executable isn't dynamic then it wouldn't call ld.so
<mrvn> clever: doesn't mean that's actually "portable". :)
<mrvn> and who names their executables .o?
<knusbaum> me
xenos1984 has quit [Read error: Connection reset by peer]
<knusbaum> when I'm running a test and don't care what the file is called
<GeDaMo> I knew an engineer who would name all his programs 'fred' :P
<sham1> I suppose that's better than the default from gcc and clang which is a.out
<mrvn> ahh, back in the good old days, when an a.out actually way a.out. Nowadays the default name should be elf.
<mrvn> s/way/was/
<mrvn> knusbaum: it looks like you are trying to execute an object file, not an executable.
<mrvn> Are you aligning stuff to page boundaries?
<mrvn> padding between sections?
srjek has joined #osdev
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
sortie has quit [Remote host closed the connection]
roan has quit [Read error: Connection reset by peer]
sortie has joined #osdev
srjek has quit [Ping timeout: 240 seconds]
ElectronApps has quit [Remote host closed the connection]
X-Scale` has joined #osdev
teroshan2 has joined #osdev
buffet5 has joined #osdev
zid` has joined #osdev
DonRichie2 has joined #osdev
woky_ has joined #osdev
SikkiLadho has joined #osdev
<knusbaum> mrvn, I'm manually putting together an ELF executable, so surely I'm doing something wrong. Here's the readelf -a output: v
<mrvn> Data: 2's complement, little endian
<mrvn> Is there any system with ELF that has 1's complement?
<knusbaum> Not that I'm aware of.
<knusbaum> There are a lot of strange features in ELF.
Brnocris1 has joined #osdev
Terlisimo1 has joined #osdev
<mrvn> 0x0000000000000078 isn't page aligned. so that might be the problem.
__xor has joined #osdev
EtherNet_ has joined #osdev
vin1 has joined #osdev
pg12_ has joined #osdev
<knusbaum> mrvn, Do segments need to be page aligned in the file? That's the source offset.
xenos1984 has joined #osdev
<mrvn> knusbaum: if the ld.so wants to use mmap then the alignment of offset and physaddr must match in the lower bits.
<knusbaum> hmmmmmmmmmmmm
<knusbaum> I did not know that.
<mrvn> Not sure how you got ld to place the section at 0x78.
<mrvn> Did you set pagesize for the ld to 1?
<knusbaum> I put it at 0x78. It's my assembler and linker.
<mrvn> Well, your linker isn't compatible to the ld.so then.
<knusbaum> Yes, that's what I'm trying to figure out :)
<mrvn> The default page size for x86_64 is 2MB.
<knusbaum> wat. I thought the page size was like 8k unless using big pages
<mrvn> 4k is what you use for kernels and might work with linux ld.so but I never tried.
<mrvn> knusbaum: 8k is sparc iirc. Nearly everything has 4k as smallest.
<mrvn> And elf files are aligned to huge pages on x86_64.
X-Scale has quit [*.net *.split]
nyah has quit [*.net *.split]
blockhead has quit [*.net *.split]
vin has quit [*.net *.split]
kkd has quit [*.net *.split]
zid has quit [*.net *.split]
EtherNet has quit [*.net *.split]
Brnocrist has quit [*.net *.split]
teroshan has quit [*.net *.split]
DonRichie has quit [*.net *.split]
_xor has quit [*.net *.split]
Clockface has quit [*.net *.split]
simpl_e has quit [*.net *.split]
ephemer0l has quit [*.net *.split]
Terlisimo has quit [*.net *.split]
pg12 has quit [*.net *.split]
dh` has quit [*.net *.split]
buffet has quit [*.net *.split]
woky has quit [*.net *.split]
koolazer has quit [*.net *.split]
X-Scale` is now known as X-Scale
teroshan2 is now known as teroshan
DonRichie2 is now known as DonRichie
buffet5 is now known as buffet
<knusbaum> Ok. My memory told me linux bumped from 4k to 8k but I must be misremembering.
<mrvn> absolutely not. they support it but its not what is used on x86_64.
<mrvn> Some hacks need 8k pages, 2 consecutive 4k pages.
<knusbaum> I see.
<mrvn> ARM has 16k pages made up of 4 4k pages even in hardware.
<knusbaum> Ok so let me try aligning to 0x200000
<knusbaum> You say this needs to be both in the ELF itself *and* in virtual memory?
<knusbaum> I must be misunderstanding. There are binaries smaller than 2MiB
<mrvn> I'm not sure what is allowed, as said 4k is probably sufficient. Default is 2MB alignment in ld which trips many people up when they try to get the multiboot signature into the first 8k of the file.
<mrvn> So just try matching up the lower 12 bits and see if that solves the problem.
<mrvn> If I look at e.g. /bin/bash I see:
<mrvn> [25] .data
<mrvn> PROGBITS 0000000000125700 0000000000124700 0
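Editor's sketch of the "lower bits must match" rule. In the ELF spec the requirement is that a loadable segment's `p_offset` and `p_vaddr` are congruent modulo `p_align`; the helper names below are hypothetical, and `page_size` is assumed to be a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the file offset and virtual address agree in the bits
 * below the page size, so the loader can mmap the segment. */
static bool segment_offsets_congruent(uint64_t p_offset,
                                      uint64_t p_vaddr,
                                      uint64_t page_size)
{
    uint64_t mask = page_size - 1;
    return (p_offset & mask) == (p_vaddr & mask);
}

/* Smallest virtual address >= min_vaddr whose low bits match the
 * given file offset. */
static uint64_t vaddr_for_offset(uint64_t min_vaddr,
                                 uint64_t p_offset,
                                 uint64_t page_size)
{
    uint64_t mask = page_size - 1;
    uint64_t candidate = (min_vaddr & ~mask) | (p_offset & mask);
    if (candidate < min_vaddr)
        candidate += page_size;
    return candidate;
}
```

The /bin/bash values quoted above (address 0x125700, offset 0x124700) pass the check for 4 KiB pages, since both end in 0x700. A segment at file offset 0x78 loaded at a page-aligned address would not, and `vaddr_for_offset(0x400000, 0x78, 0x1000)` suggests 0x400078 instead.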
koolazer has joined #osdev
nyah has joined #osdev
kkd has joined #osdev
garrit has quit [Remote host closed the connection]
<knusbaum> Ok, cool. Let me try that first.
Brnocris1 is now known as Brnocrist
[itchyjunk] has joined #osdev
<bslsk05> ​bxr.su: Super User's BSD Cross Reference: /FreeBSD/lib/libc/stdtime/strftime.c
roan has joined #osdev
k8yun has joined #osdev
Payam has quit [Quit: Client closed]
roan has quit [Quit: Lost terminal]
<knusbaum> mrvn, thanks so much. I just aligned the section to 4k in the ELF and it works now.
<mrvn> knusbaum: you should use the page-size macro to get the system's page size.
<mrvn> just in case you port to other archs later.
<knusbaum> Hmm. good idea. Hard when cross-compiling. I wonder what the safest value is.
<mrvn> knusbaum: it's a runtime function.
<knusbaum> Not cross-compiling the assembler, I mean when the assembler/linker are cross-assembling/linking.
<knusbaum> The ELF is generated on a different machine than it runs.
<mrvn> ahh, yeah. You just have to know then.
<mrvn> I guess you have to use the static define then: /usr/include/x86_64-linux-gnu/sys/user.h:#define PAGE_SIZE (1UL << PAGE_SHIFT)
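Editor's sketch of the two cases discussed here: asking the running system for its page size, versus a cross-linker that has to carry the value per target. `sysconf(_SC_PAGESIZE)` is the POSIX runtime query; the `target_page_size` table and its target string are hypothetical, and only the x86_64 Linux entry (4 KiB base pages) is filled in.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Page size of the machine this code is running on. */
static long host_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Hypothetical per-target table for a cross-linker, which cannot
 * ask the target at link time. Returns 0 for unknown targets. */
static uint64_t target_page_size(const char *target)
{
    if (strcmp(target, "x86_64-linux") == 0)
        return 4096;
    return 0;
}
```

A native build can use `host_page_size()`; a cross build would consult the table and refuse to link when it returns 0 rather than guess.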
<knusbaum> Yeah, I know my target OS so I could pull headers or something like that so it stays up to date.
k8yun_ has joined #osdev
<knusbaum> At least up to date as of whenever I build the assembler and linker
<SikkiLadho> Hey clever, you wrote that "CPUs are instead running in armstub (in secondary_spin)" but I used Trusted Firmware-A instead of the default armstub. So where are the CPU cores now?
k8yun has quit [Ping timeout: 240 seconds]
<clever> SikkiLadho: in the trusted firmware version of that, same rules apply, your hypervisor must claim those cores according to the TF's rules, and then run linux in EL1
<mrvn> knusbaum: It's unlikely to ever change once you have the define.
k8yun_ has quit [Quit: Leaving]
immibis has quit [Remote host closed the connection]
immibis has joined #osdev
immibis has quit [Read error: Connection reset by peer]
immibis has joined #osdev
<mrvn> clever: have you ever measured how much heat the arm cores produce in the secondary_spin compared to sleeping?
<clever> mrvn: the official secondary_spin is using wfe, so the cpu will properly park itself in idle mode, until an sev occurs
<mrvn> and how much it slows down the primary core if you don't put them to sleep?
<clever> and the arm arm has an entire chapter on how the core behaves in wfe mode
<clever> such as if all cores go into that mode, the l2 cache also parks and shuts things off
<clever> so its already in the most optimal state by default
<bslsk05> ​github.com: tools/armstub8.S at master · raspberrypi/tools · GitHub
<mrvn> Oh, that's changed then from when I looked at the first multi-core RPi. The firmware would just spin endlessly in a busy loop causing heat and using bus cycles.
<bslsk05> ​github.com: tools/armstub7.S at master · raspberrypi/tools · GitHub
<clever> and this is the arm32 version, for any quad-core board
<bslsk05> ​github.com: armstubs: Add wfe to ARMv7/ARMv8-32 stubs · raspberrypi/tools@b23276d · GitHub
<clever> i can then just check git history, and find that WFE was added in may of 2017
<clever> and armstub7.S was created in april of 2016, but the commit msg implies this is after the pi2 release
<clever> wikipedia says pi2 came out in February 2015
<clever> mrvn: so yes, that problem was likely in existence for ~2 years
srjek has joined #osdev
immibis has quit [Read error: Connection reset by peer]
immibis has joined #osdev
<SikkiLadho> Thank you clever. You also said that core1/2/3 never actually go into proc_hang loop but are inside secondary_spin . Why is that when I explicitly branch the cores to proc_hang?
<clever> SikkiLadho: the arm trusted firmware, will only run your kernel on core0
<clever> you have to use another api (probably PSCI) to ask the ATF to start the other 3 cores up, at a location of your choosing
<clever> reading the /cpus node in device-tree will tell you what api you should use
SikkiLadho has quit [Ping timeout: 256 seconds]
ephemer0l has joined #osdev
k8yun has joined #osdev
zid` has quit [Ping timeout: 240 seconds]
xenos1984 has quit [Remote host closed the connection]
the_lanetly_052 has joined #osdev
xenos1984 has joined #osdev
zid has joined #osdev
dennis95 has quit [Quit: Leaving]
dude12312414 has joined #osdev
nur has joined #osdev
<geist> note the model of waking up the cores by writing to an address is not the usual model, but it's what RPI does
<geist> usually PSCI exists and then you make a call to it
<nur> woah I just walked in on something interesting!
<geist> and thats probably what your hypervisor should do, wait for linux to make a PSCI call
<nur> is someone writing an RPI hypervisor?
terminalpusher has joined #osdev
<geist> yah, though they left a little while ago
<j`ey> yes!
<nur> is it up somewhere
<geist> yah there should be a log link in the topic
<bslsk05> ​SikkiLadho/Leo - Leo Hypervisor. Type 1 hypervisor on Raspberry Pi 4 machine. Mailing List: https://www.freelists.org/list/leo (1 forks/8 stargazers/GPL-2.0)
<nur> ah sweet thanks
elastic_dog has quit [Ping timeout: 260 seconds]
<zid> I think my iommu was disabled in my bios, rip
<nur> why rip
<clever> geist: ah, you mean the hypervisor should emulate its own PSCI? and only when linux wants to wake a core, does the hypervisor also wake the core?, but obviously route the core into the hypervisor, so you can init it, and then run linux in EL1
<zid> not had an iommu until now
elastic_dog has joined #osdev
<zid> I had 14 pages of dram timings in the way of finding it
<geist> clever: yeah. that's pretty standard. that's i think the main reason why PSCI is specced to either be accessible via HVC or SMC
<geist> if you're an EL2 you set up the DT to say PSCI exists, and tell the guest to use HVC and then just emulate psci cpu on/off and the other mandatory ones
<geist> makes for a nice interface to bring cores on and off and shutdown the guest
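Editor's sketch of the trap-and-emulate idea geist describes: the guest issues HVC with a PSCI function ID, and the EL2 handler answers a few calls itself. The function IDs and return codes are the standard PSCI values; the `struct vcpu` bookkeeping, the four-core limit, and treating the target MPIDR as a plain core index are simplifying assumptions. Actually starting the physical core and entering the guest at EL1 is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard PSCI function IDs (SMC calling convention). */
#define PSCI_VERSION_FID   0x84000000u
#define PSCI_CPU_OFF_FID   0x84000002u
#define PSCI_CPU_ON_FID64  0xC4000003u

/* Standard PSCI return codes. */
#define PSCI_SUCCESS             0
#define PSCI_NOT_SUPPORTED     (-1)
#define PSCI_INVALID_PARAMETERS (-2)
#define PSCI_ALREADY_ON        (-4)

#define NUM_VCPUS 4 /* assumption: one guest, four cores */

struct vcpu {
    bool     online;
    uint64_t entry;      /* guest entry point from CPU_ON */
    uint64_t context_id; /* passed to the guest in x0 */
};

/* Handle one PSCI call trapped from the guest; the return value
 * goes back to the guest in x0. */
static int64_t psci_emulate(struct vcpu *vcpus, unsigned caller,
                            uint32_t fid, uint64_t x1,
                            uint64_t x2, uint64_t x3)
{
    switch (fid) {
    case PSCI_VERSION_FID:
        return 0x00010000; /* major 1, minor 0 */
    case PSCI_CPU_ON_FID64:
        if (x1 >= NUM_VCPUS)
            return PSCI_INVALID_PARAMETERS;
        if (vcpus[x1].online)
            return PSCI_ALREADY_ON;
        vcpus[x1].entry = x2;
        vcpus[x1].context_id = x3;
        vcpus[x1].online = true; /* real code would boot the core here */
        return PSCI_SUCCESS;
    case PSCI_CPU_OFF_FID:
        vcpus[caller].online = false; /* real CPU_OFF does not return */
        return PSCI_SUCCESS;
    default:
        return PSCI_NOT_SUPPORTED;
    }
}
```

The device tree handed to the guest would then advertise PSCI with `method = "hvc"` so that Linux issues these calls through the hypervisor.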
eschaton has joined #osdev
elastic_dog has quit [Client Quit]
elastic_dog has joined #osdev
<mrvn> Is there a hypervisor already that lets me run linux on 2 cores and play with my own kernel of the other 2?
<clever> mrvn: couldnt you just use /dev/kvm for that?
<mrvn> wouldn't that time share the cores?
<clever> there may be tunables to tell the linux scheduler to never schedule to that core
<clever> so only kvm gets it
<bslsk05> ​access.redhat.com: 33.8. Setting KVM processor affinities Red Hat Enterprise Linux 5 | Red Hat Customer Portal
<mrvn> pinning kvm to 2 cores would be a start but I think not a guarantee it runs all the time.
<clever> i think another thing, is how the kvm api works
<bslsk05> ​david942j.blogspot.com: Play With Capture The Flag: [Note] Learning KVM - implement your own kernel
<clever> i think when you run ioctl(vcpufd, KVM_RUN, NULL);, the core you did that on, will take a brief pop in EL1, then EL2, then back to EL1 within the guest
<clever> so you just need to pin your userland threads, that are driving kvm, to a given core
<clever> and when the guest hits a trap, it gets handled by that thread, on the same core that triggered the trap
<clever> mrvn: so you basically just need to tweak the default linux affinity, to only use core 0/1, and then pin your kvm threads to core2 and core3, one thread each
<mrvn> is there a default affinity?
<clever> and if linux never has something to schedule on core2/3, it wont pre-empt the guest
<bslsk05> ​unix.stackexchange.com: numa - Globally setting CPU affinity - Unix & Linux Stack Exchange
<clever> according to this, you can set it in systemd
<clever> and i'm assuming that just sets the affinity in pid1, before spawning the rest of the os
<clever> which then just inherits it
vdamewood has joined #osdev
<mrvn> Makes sense. But I bet that needs a reboot.
<clever> `man taskset`
<clever> -a, --all-tasks
<clever> Set or retrieve the CPU affinity of all the tasks (threads) for a given PID.
<mrvn> Bigger problem will be worker threads from kernel and user space. They should only have 2, not 4.
<clever> then you just need a list of all PID's, and pray there is no race condition
<clever> there are also hotplug tricks for that, one moment
<mrvn> I can turn off cpus in linux but that would prevent kvm running on them.
<clever> maxcpus=1 in the kernel cmdline, and it will nearly forget that the other cores exist
<clever> /sys/bus/cpu/devices/cpu1/online
<mrvn> yes. but if the cpu is offline how do I get kvm on it?
<clever> but if you `echo 1 > /sys/bus/cpu/devices/cpu1/online`, it will act like you hot-plugged an entire cpu core, and start doing things
<mrvn> it will actually power down the core.
<clever> and half of the system initialized on the assumption you only had 1 core, so it only made 1 worker
vinleod has joined #osdev
<clever> probably depends on what the drivers and core are able to do
<clever> the rpi cant actually power the cores off, only put them into an idle state
<mrvn> a WFE loop normally.
<clever> the entire arm cluster is in a single power domain
<clever> turning that off results in a loss of all arm state
<clever> so thats much more in suspend-to-ram territory, rather then parking an unused core
vinleod is now known as vdamewood
vdamewood has quit [Killed (silver.libera.chat (Nickname regained by services))]
<geist> hmm, i dont know of a hard type 1 or even a lower level thing that partitions the cores totally statically
<geist> but as clever says you can get pretty close to the effect by kicking everything off the other cores
<geist> no matter what you do you probably still need at least a thin hypervisor at EL2 so the different EL1 guests dont interfere with each other
<clever> i think xen does let you do it
<geist> and that basically by definition needs to be shared across the cores, otherwise you'd have guests interfering with each other re: tlb flushes and whatnot
<mrvn> and something to associate peripherals with one of the two kernels or emulate them.
<geist> yah
<clever> in the initial xen config, you can just say to boot dom0 on core0/1
<geist> yah i think lots of big cloud VMs and whatnot have the ability for more static configs
<geist> i'd also say esxi (which has some beta port for rpi) but i think fundamentally it's still a type 2 hypervisor. it's just not linux based
<geist> the esx kernel is just a thinner posix kernel whose job it is to be the dom0
MrBonkers has quit [Quit: ZNC 1.7.5+deb4 - https://znc.in]
<mrvn> in clouds you have the dom0 running linux doing nothing and then VMs on fixed cores.
<geist> i wouldn't say nothing though. unless you have dedicated hardware you probably are still running a bunch of network and storage stack on it
<mrvn> I could run a linux VM and my-kernel VM on 2 cores each I guess.
<geist> what the hardware fundamentally doesn't have is some way to at the physical level carve up the system into separate domains
<geist> i bet some of the z machine stuff does from IBM, but i dont think that's much of a requirement anymore
<mrvn> on the rpi most peripherals are page aligned so that could be split
<geist> sure but i'm thinking more fundamentally: you need two level paging to be able to keep two guests from touching the same ram, and some sort of virtualization of the interrupt controller to keep them from touching each others irqs, etc
MrBonkers has joined #osdev
<geist> arm and x86 hardware have this feature, but only via some amount of software virtualization
<geist> ie, a hypervisor
<mrvn> internal interrupts for me, GIC for linux?
<geist> sure but even internal interrupts involve an interrupt controller
<mrvn> How does it work with kvm? Every write to the interrupt registers faults into the hypervisor?
<geist> but at this point it's just details. you pretty quickly arrive at the same answer: run linux + KVM and then just set up the thread affinity such that you get pretty much complete access to the cpu
<geist> mrvn: for the most part yes. especially with older GICs like GICv2, etc
<geist> v3/v4 have some additional virtualization capability to reduce the amount of traps needed
<geist> but rpi4 has a GICv2, like all cheap arm socs. and prior to that rpi doesn't even have a GIC, so any GIC in any virtualization scheme on a rpi3 or whatnot by definition involves a fully virtualized interrupt controller
<geist> which just means trap-and-emulate
<geist> i was actually reading the manual the other day to see how GICv3 (or was it v4) fully virtualizes interrupts and it's kinda neat. basically has a in memory scheme to let you (the hypervisor in EL2) build a list of pending virtual irqs that will fire when you enter the guest, and vice versa
<geist> the list is limited in size, like 8 or 16 but overflow is doable in software. so in most cases most irqs are dispatched directly in hardware without any exiting
<mrvn> that's something the GIC does? Would have thought the mode change in the cpu would do that
<geist> also you can deliver irqs to a cpu that's currently running in the guest without stopping it
<geist> the mode change does it without this assist, but with the assist you can avoid a mode change even more so
<geist> basically the GIC has a context that you swap in and out as you're switching between guests, and the context itself has a list of pending irqs and whatnot
<geist> note simpler designs like GICv2 do as well, but they're basically just a bitmap of pending IRQs, so you can also do a lot of this without the list, but the list is more flexible and handles more cases
<geist> i think this design also allows for one virtual cpu to deliver an irq directly to another one without exiting, i think
<mrvn> do they have a mask register that hides bits the running VM should not read or write?
<geist> it involves at any point in time the EL2 having a translation of what virtual cpus correspond to physical ones
<nur> is there any mmio cases on rpi that addresses something beyond the 32 bit address range
<mrvn> nur: tons of them
<geist> really? i'm not sure Broadcom has put any mmio > 4GB
<mrvn> you can configure all peripheral to be in 64bit space then everything will be beyond
<mrvn> geist: the VC remaps them
<geist> oh yeah 4 has one of those config bits that moves everything?
<geist> aaaah yes. okay you're right, yep
<nur> the VC?
<mrvn> nur: the actual CPU running the raspberry Pi. The ARM is just a secondary processor running behind it.
<geist> videocore
<nur> just... trying to figure out where to _start_ writing for rpi and I figured, I know, mmio but then I realised I didn't know jack about how to do that. The examples I've seen take a 32 bit address but I figured wait what if you needed more than that
the_lanetly_052 has quit [Ping timeout: 256 seconds]
<geist> nur: it's not really any different if it's 64bit
<geist> just use a 64bit pointer. no difference
<nur> well the function header needs to be different I guess?
<mrvn> nur: the peripherals are at different addresses on every RPi model. You should parse the Device Tree to find out where.
<geist> and if you're doing that you are on a 64bit cpu so it's natural, because the cpu will be using 64bit pointers
<geist> (and yes i know PAE, etc etc lets not confuse nur)
<nur> can I just use one mmio function that takes a 64 bit pointer
<geist> or hard code what you're doing for the particular rpi
<nur> ah
<mrvn> nur: to do what?
<nur> DT
<nur> mrvn, to write bytes to memory
<geist> nur: well, first. precisely what raspberry pi are you trying to write software for?
<nur> to init hardware and stuff
<mrvn> nur: sure. But that's awfully slow.
<geist> and secondarily, which ones do you intend to ever run your software on?
<nur> geist, the one that qemu implements...rpi3 I think
<nur> I don't actually have one of those
<geist> are you intending to run it in 64bit or 32bit mode?
<nur> 64
<geist> they're pretty different environments, i suggest 64
<mrvn> nur: usually you just declare the MMIO register volatile and be done with it.
<geist> good. okay.
<nur> mrvn, okay you lost me
<geist> so if you're building 64bit then by definition all of your pointers are 64bit, so any pointer is already 64
<nur> ok
<zid> `void *` already exists and dynamically changes size, basically
<zid> no need to do anything special
<geist> `*(volatile uint32_t *)0x1234 = 99;` would already be dereferencing a 64bit pointer even if it is a 32bit value
<geist> etc
<mrvn> nur: There are 3 ways to access the MMIO registers: You can declare them external and set their addresses in the linker script, you can initialize pointers to the address or (in C++) you can use placement new.
<geist> or you can use an accessor function as you said before
<geist> mmio_write(void *address, uint32_t val); etc
<nur> ohhh
<nur> is it not better practice to say what the size of address is
<geist> sure, it is, but i slammed that out in 2 seconds
<geist> there are better styles, cleaner code, etc etc
<geist> just trying to get you aligned in the right direction :)
<nur> okay thanks sorry :)
<mrvn> you should have mmio_write8, mmio_write16, mmio_write32, mmio_write64
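A minimal sketch of the fixed-width accessor family mrvn lists, in C since that is nur's language. The matching read functions and the choice of `uintptr_t` for the address are assumptions added for illustration, not something stated in the channel. Each function is meant to compile down to exactly one load or store of the named width.

```c
#include <stdint.h>

/* One volatile access per call, of exactly the width in the name.
 * The address is passed as uintptr_t: on a 64-bit build that is already
 * 64 bits wide, so no separate "64-bit address" variant is needed. */
static inline void mmio_write8(uintptr_t addr, uint8_t val)   { *(volatile uint8_t  *)addr = val; }
static inline void mmio_write16(uintptr_t addr, uint16_t val) { *(volatile uint16_t *)addr = val; }
static inline void mmio_write32(uintptr_t addr, uint32_t val) { *(volatile uint32_t *)addr = val; }
static inline void mmio_write64(uintptr_t addr, uint64_t val) { *(volatile uint64_t *)addr = val; }

/* Read counterparts (assumed; the channel only names the write side). */
static inline uint8_t  mmio_read8(uintptr_t addr)  { return *(volatile uint8_t  *)addr; }
static inline uint16_t mmio_read16(uintptr_t addr) { return *(volatile uint16_t *)addr; }
static inline uint32_t mmio_read32(uintptr_t addr) { return *(volatile uint32_t *)addr; }
static inline uint64_t mmio_read64(uintptr_t addr) { return *(volatile uint64_t *)addr; }
```

The register addresses themselves would come from the Device Tree or from hard-coded constants for one particular board, as discussed above.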
<nur> got it
<geist> right, also depends on the language you're using, how fancy you want it, how type safe you want it, etc
<nur> that makes sense
<mrvn> not sure the RPi has anything other than mmio_write32/64 though.
<geist> but fundamentally you're trying to get the compiler to do something like
<nur> language is C
<geist> `mov register, a_pointer; ldr dest_register, [register]` to read from an address
<geist> and a str for the other direction
<nur> ohh
<nur> okay
<nur> ah that makes sense yes
<geist> and since in this case 'register' is a 64bit register, it will by definition be a 64bit pointer, etc
<nur> heck I could write it in asm
<nur> the mmio function
<geist> you could, and there are some reasons to do so, but it'd be overkill and a bit of a risk right now
<nur> right
<geist> in general when getting started i recommend the least amount of inline asm (or asm) as you can get, because all safeties are off and you can have bugs that continually bite you
<geist> since it's subtle and tricky and there are no safety nets there
<nur> juggling chainsaws whee
<mrvn> nur: You must read/write from a MMIO register in exactly the documented size or you get total garbage. So I would suggest making a "struct MMIO32 { volatile uint32_t *reg; }" and struct MMIO64 if you need it. The type for MMIO registers should be incompatible with any other type so you can't accidentally call something with the wrong thing.
<geist> yah the struct method works pretty well. i generally have moved away from it, but it's more of a style thing, and i think it's perfectly usable
<geist> a good clean solution
<mrvn> Sadly the struct is the only way to make the compiler insist on the right type.
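A rough sketch of the struct approach mrvn describes. The `struct MMIO32` layout is taken from the message above; the helper names (`mmio32_at`, `mmio32_read`, `mmio32_write` and the 64-bit equivalents) are hypothetical. The point is that the two wrapper types are incompatible, so passing a 64-bit register handle to a 32-bit accessor is a compile error rather than a silent wrong-size access.

```c
#include <stdint.h>

/* Distinct wrapper types: a struct MMIO32 cannot be passed where a
 * struct MMIO64 (or a bare pointer) is expected. */
struct MMIO32 { volatile uint32_t *reg; };
struct MMIO64 { volatile uint64_t *reg; };

/* Hypothetical constructors: turn a raw address into a typed handle. */
static inline struct MMIO32 mmio32_at(uintptr_t addr)
{
    struct MMIO32 r = { (volatile uint32_t *)addr };
    return r;
}

static inline struct MMIO64 mmio64_at(uintptr_t addr)
{
    struct MMIO64 r = { (volatile uint64_t *)addr };
    return r;
}

/* Accessors: exactly one access of the documented width each. */
static inline uint32_t mmio32_read(struct MMIO32 r)              { return *r.reg; }
static inline void     mmio32_write(struct MMIO32 r, uint32_t v) { *r.reg = v; }
static inline uint64_t mmio64_read(struct MMIO64 r)              { return *r.reg; }
static inline void     mmio64_write(struct MMIO64 r, uint64_t v) { *r.reg = v; }
```

With this in place, `mmio32_write(some_mmio64_handle, 1)` fails to compile, which is the safety net being argued for.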
<geist> yah
<geist> well or some silly super fancy C++ (like we're doing in fuchsia which i dislike but does work)
<nur> I see
<mrvn> C++ has more ways but 20:06 < nur> language is C
<nur> I'm keeping it simple
<geist> agreed
<geist> yah i'm in no way advocating super fancy C++, just saying there are options there in case some day you go that direction
<nur> thanks, got it :)
<geist> for once i think mrvn and i are on the same page and agreeing!
* geist high fives mrvn
<nur> is that uncommon
<geist> both of us like to argue each other into the ground over details
<mrvn> Even if it's a bit more complex having distinct types will save you tons of headaches. It's just too easy to misuse some value for something it shouldn't be used for and distinct type will then give compiler errors.
<nur> I feel like all this good information should go into a blog entry but I don't want to get it subtly wrong and pollute the internet :)
<geist> nur: possibly. but the main problem is everyone has a different style here, so even the fact that mrvn and i agree here means there are 10 people foaming at the mouth ready to attack us if this went to a wider group
<mrvn> nur: a best practices page for the osdev wiki would be better
<geist> but. the real meta here is there are lots of ways to do this and making forward progress is more important than doing it in some ideal way on day 1
<nur> as in life itself
<mrvn> or even just a side-by-side comparison of different methods.
<geist> i'm always a huge advocate of keeping momentum and not getting bogged down in details, as we are all prone to do
<mrvn> nur: The most important part you have to take away is that all MMIO access must be a) right size, b) volatile.
<nur> "volatile" is just "write to register addresses in C" right?
<nur> is there something in the background that happens to ensure that this is done right
<geist> or even more generic, you're trying to make sure the compiler is for every mmio access emitting precisely one load or store instruction, and that it is in the right spot of the program
<nur> like it's telling the compiler "this is a _register_ I am trying to write to"
<mrvn> nur: volatile tells the compiler that any read/write to that register has an observable effect. The compiler must not optimize them out.
<geist> hence volatile which among other things tells the compiler it *must* do the access right here
<mrvn> Also a read might not return the same value you just wrote.
<nur> why not
<GeDaMo> It might be changed by something other than your program
<nur> oh like devices!
<GeDaMo> Like a device changing state
<geist> language lawyers will tell you that there are probably a ton of edge cases etc, and there are memory order and memory barrier issues to deal with in some exotic situations, but when getting started you can generally ignore that. and double plus so when against an emulator
<GeDaMo> :D
<zid> unless you're doing threads you can in fact completely ignore memory barriers, volatile is a compiler barrier and that's plenty
<geist> zid: weeeeeelllll not really. but lets not side track it
<mrvn> nur: Take the UART as example. When you write to it it outputs a character to the serial port. But when you read from it it gives you the character received over serial. Not what you wrote.
<zid> (unless you're an alpha)
<geist> zid: or ARM
<mrvn> zid: write back buffers?
<nur> mrvn, and this is important to tell the compiler "ASSUME NOTHING"
<nur> because the compiler makes assumptions when optimizing
<nur> right?
<mrvn> nur: yes.
<mrvn> nur: volatile tells the compiler it can't do that here.
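A small sketch of the UART case mrvn uses, assuming a PL011-style register layout (data register at offset 0x00, flag register at 0x18, RXFE in bit 4 and TXFF in bit 5). The base address is left as a parameter because it differs per board, and the function names are hypothetical. Without `volatile`, the compiler would be free to load the flag register once and spin forever, or to assume a read of the data register returns whatever was last written to it.

```c
#include <stdint.h>

/* PL011-style offsets and flag bits (assumed layout for this sketch). */
#define UART_DR       0x00u        /* write: transmit a byte, read: received byte */
#define UART_FR       0x18u        /* flag register */
#define UART_FR_RXFE  (1u << 4)    /* receive FIFO empty */
#define UART_FR_TXFF  (1u << 5)    /* transmit FIFO full */

static inline volatile uint32_t *uart_reg(uintptr_t base, uint32_t off)
{
    return (volatile uint32_t *)(base + off);
}

/* Wait until there is room, then write one character to the data register. */
void uart_putc(uintptr_t base, char c)
{
    /* volatile forces a fresh load of FR on every loop iteration */
    while (*uart_reg(base, UART_FR) & UART_FR_TXFF) {
    }
    *uart_reg(base, UART_DR) = (uint8_t)c;
}

/* Return the received byte, or -1 if nothing has arrived.
 * On real hardware this is NOT the byte last passed to uart_putc(). */
int uart_getc(uintptr_t base)
{
    if (*uart_reg(base, UART_FR) & UART_FR_RXFE)
        return -1;
    return (int)(*uart_reg(base, UART_DR) & 0xffu);
}
```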
<nur> I got another one right again.
<nur> group hug.
<geist> awww
<geist> it's nice when i virtually hear lightbulbs going off above someones head
<GeDaMo> Exploding lightbulbs? :|
<geist> it's honestly what this place is all about. screwing in bulbs
<mrvn> nur: used to be volatile was for interrupts but that use case has been basically obsoleted and shouldn't be used.
<nur> :>
<geist> s/going off/turning on/!
<zid> I screwed in some screws earlier, hated it
<mrvn> geist: exploding lightbulbs is so much more fun
<zid> (my cpu cooler is really annoying to mount)
<geist> zid: we can blab about memory barriers on ARM in a bit once this is over if you're interested
<geist> zid: i dunno, i kinda like the screw in type of coolers, vs clips
<geist> i once had the plastic clips holding one of my coolers snap off spontaneously one day
<zid> the stock intel cooler had little push pin clips
<zid> they were dire
<geist> the cooler fell off the cpu, and within about 3 seconds the machine shut down
<geist> was an athlon, late 2000s
<mrvn> lucky it didn't short anything important
<geist> yah totally
<zid> I've booted a pentium 4 machine with a glass of water
<mrvn> could have fallen on the PCI cards and break them off or damage the ports.
<zid> It had like 110C tjmax so until all the water boiled away it was perfectly fine :p
<geist> mrvn: yah really what happened is it pivoted over, since the bottom clip was still holding it. it just turned and hit the outside of the case so was at about a 20 degree angle
<geist> but wasn't touching the cpu anymore
<zid> Mine is a pain to mount so I *re*mounted it to see if it'd improve temps, there was a little bald spot of thermal paste
<geist> it was a tall heavy cooler
<zid> but it also was a 2nd pain to mount so I may have just done it again
<geist> yah sometimes with the screw ones you can overtighten and bend the mobo too so it doesnt make great contact
<zid> that's why they have backplates
<mrvn> and springs
<geist> was watching a vid on a current mobo with cheapo backplates that gamers nexus guy was complaining about
<mrvn> and stops. tighten the screws till the stop and then the spring gives the right tension
<zid> same gen as mine, not looked at mine but it's probably not too different
<geist> an issue being that lots of mobos have crummy backplates
<zid> I can find gene 3 and extreme 4, but not gene 4 on GiS
<zid> My thing is just annoying to mount because it's basically on a small contact patch of butter, while you're trying to align 4 different screw holes
<zid> would be better with a long shafted screwdriver and 8 hands and no case
<geist> reminds me, i need to build an alder lake PC to test the new E and P core stuff
<geist> been putting it off. i had a parts list all set up at newegg and then a huge brouhaha started up so i decided i shouldn't use them
<zid> friendo did a new PC last week
<zid> I told him to just get a 12600k
<zid> his mobo was being dumb wrt ram though
<zid> I blame gigabyte, i hate gigabyte
<geist> yah i tend to prefer ASUS. have had good luck with them and their bioses are usually reasonable
<zid> <3 asus
<geist> yah even their cheapo PRIME series, which is my usual go to if i'm not building a high end gamer PC
<geist> aaaaaand lots of asus boards *still* come with a COM header
<zid> my 2011 mobo is still £200 second hand on ebay
<geist> most do, actually
<geist> so for osdev it's ❤️
<zid> so if you give a shit about resale at all, asus gud
<geist> you can get a $3 10 pin COM header to DE9 and you're set
<zid> I have the super fancy superio that has EVERYTHING
<zid> but not everything is wired up cus.. micro atx
<zid> got a soldering iron handy? :P
<geist> yah i think that's the key to the COM header. usually it's just a few pins off the nuvoton superio chip, which they'll have if they also have PS/2 or whatnot pins
<geist> but they probably wont stuff on an entire superio chip if com was the only thing they needed
<zid> there's a few models of the superio yea, I have the EXTREME MEGA 48003489 PIN PACKAGE one
<zid> so that they can give me ps/2 and stuff
<zid> NCT6776F
<kingoffrance> i noticed ack 16-bit pc86 target, i can int main(void) { char a; a = 5; a = 6; a = 7; return 0; } and it is simple enough not to optimize those out, despite no volatile, despite not being used anywhere; with optimized gcc they disappear from the disassembly :D
<zid> 102 pins
<zid> err 128 pins
<geist> anyway re: memory barriers and IO and ARM: you *should* memory barrier after writing to a device MMIO register bank because it forces the cpu to flush the transactions across the bus
<geist> if the mmio is mapped with 'device memory' (which you should) then the transactions are in order, not write combined, etc, but... they're fire and forget
<zid> I had a thing typed out and I cut it, but then copy pasted an image so I lost it
<geist> so the cpu moves on before the device has acked the transactions
<mrvn> kingoffrance: optimization can happen, but doesn't have to happen.
<geist> but a DSB forces the bus to flush it out
<zid> how strong is the arm memory model? does it reorder writes/writes, loads/loads, writes/loads?
<kingoffrance> mrvn, i just mean, older compilers perhaps are less trigger happy
<mrvn> geist: even for device memory you should use a barrier?
<geist> zid: it's complicated. with normal memory (which is usually how regular memory is mapped), it's weakly ordered
<zid> so it's up to the MTRR equivalents?
<geist> mrvn: yes. but its hard to construct a situation where you hit a problem, except when you do
<geist> zid: no. not at all. it really has no equivalent model to MTRR
<zid> You were just going on about 'device memory' and 'normal memory' and 'how it's mapped' so it sounded like it did
<geist> so one thing at a time. to zids question: ARM is weakly ordered, so for normal memory it is free to reorder things as it sees fit *on the wire* but the cpu must still appear to be in order relative to itself
<geist> ie, it can't hoist a load over a store such that it is inconsistent with itself
<mrvn> geist: take a revision 1 RPi and access different peripherals without barrier. The reads and writes do get scrambled randomly because the bus implementation they connected to is garbage.
<zid> x86 could just set UC on the range even if it wasn't as strong
<geist> zid: reason i say it's not the same thing as MTRR is ARM doesn't do it the same way. MTRRs refer to physical memory, ARM tags different memory types at the mmu, on the page table entries
<geist> so it's a more complicated thing. it's kinda the same idea but done at a different layer
<zid> I mean.. I'd consider that basically the same
<geist> sure depends on which part you're thinking about
<geist> anyway. so that's the 'weak memory model' that most programmers hit. means between threads without a memory barrier you cannot ensure that memory transactions appear in any particular order
<mrvn> and here I thought MMIO wouldn't need those barriers anymore with newer (fixed) rpi hardware.
<geist> mrvn: i'll get to the why in a bit
<geist> it's a different thing entirely to what i'm talking about with zid
<mrvn> device memory is never reordered on the wire, right?
<geist> so i think zid is satisfied, since i'm hearing no complaints
<geist> and they usually type faster :)
<geist> mrvn: armv8 defines memory model a bit more explicitly than armv7. in the case of ARMv8 you set up memory parameters in the form of a few bits that you set per page
<zid> so ultimately the question is, on 'device memory' (whatever that entails, page table bits to mark it UC or whatever by the sounds of it), does a single cpu need barriers to ensure write ordering?
<geist> okay so onto device memory. it operates completely differently from 'normal memory' since it's uncached
<zid> you would still on things like alpha, you don't on x86, but not sure about arm
<geist> so in armv8 you specify memory parameters as a series of bits
<geist> sometimes you'll see something like GRE or nGnRnE or nGnRE or whatnot
<geist> iirc off the top of my head: G = gathering, R = reordering, and E = early acknowledge
<geist> gathering = can merge subsequent stores and issue a single transaction
<geist> reordering = can rearrange the order of stuff across the bus
<zid> so don't set R and you can ignore barriers by the sounds of it
<geist> early acknowledge = can issue a write and then 'move on' without waiting for the device on the other end of the bus to ack it
<zid> E sounds useful for devices
<zid> assuming they're clever enough about it
<geist> so in armv8 sense as it maps to armv7 'device memory' == nGnRE
<mrvn> not sure why E would matter
EtherNet_ is now known as EtherNet
<geist> and 'strongly ordered' == nGnRnE
<geist> basically you avoid SO like the plague and only use it in very special situations, and map things like MMIO registers as 'device memory'
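To make the nGnRE / nGnRnE names concrete, here is a sketch of how those types are usually selected on ARMv8-A: the page table entry carries a 3-bit attribute index, and that index picks one of eight 8-bit slots in the MAIR_EL1 register. The device encodings below (0x00, 0x04, 0x08, 0x0C) and the 0xFF write-back normal-memory encoding are the architectural values as far as I know; the slot assignment and the helper names are purely hypothetical choices for this example, and the channel itself does not mention MAIR.

```c
#include <stdint.h>

/* 8-bit MAIR attribute encodings. */
#define MAIR_DEVICE_nGnRnE 0x00u   /* armv7 "strongly ordered" */
#define MAIR_DEVICE_nGnRE  0x04u   /* armv7 "device memory"    */
#define MAIR_DEVICE_nGRE   0x08u
#define MAIR_DEVICE_GRE    0x0Cu
#define MAIR_NORMAL_WB     0xFFu   /* normal, write-back cacheable */

/* Hypothetical slot assignment for this sketch. */
enum {
    ATTR_IDX_NORMAL = 0,
    ATTR_IDX_DEVICE = 1,
    ATTR_IDX_STRONG = 2
};

/* Place an 8-bit attribute into MAIR slot idx (0..7). */
static inline uint64_t mair_attr(unsigned idx, uint8_t attr)
{
    return (uint64_t)attr << (idx * 8u);
}

/* Value to program into MAIR_EL1 for the assignment above. */
static inline uint64_t build_mair(void)
{
    return mair_attr(ATTR_IDX_NORMAL, MAIR_NORMAL_WB)
         | mair_attr(ATTR_IDX_DEVICE, MAIR_DEVICE_nGnRE)
         | mair_attr(ATTR_IDX_STRONG, MAIR_DEVICE_nGnRnE);
}

/* AttrIndx field of a stage 1 page/block descriptor lives in bits [4:2]. */
static inline uint64_t pte_attr_index(unsigned idx)
{
    return (uint64_t)(idx & 7u) << 2;
}
```

An MMIO page would then be mapped with `pte_attr_index(ATTR_IDX_DEVICE)` OR-ed into its descriptor, and ordinary RAM with `ATTR_IDX_NORMAL`.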
<geist> what this means: when you read from a mmio it reads it full stop, doesn't reorder other memory transactions *to the same device aperture*
<geist> so 3 reads of the same register happen as 3 reads, or 3 reads from different registers on the same bus, 3 different reads, not reordered
<geist> writes are similar *except* the cpu issues the writes, in order, not combined, but doesn't wait for the device to ack it
<geist> so it can take 10s or hundreds of cycles for the device to *see* the write
<geist> and by then the cpu has moved on
<mrvn> and why would that matter?
<mrvn> (other than the MMU)
<geist> why this is a problem: if you write to some device and then move over and write to another one, you can have two write transactions in flight, and they can be reordered *relative to each other* as they go across the bus
<zid> yea I can't come up with a situation off the top of my head where that's an issue, but I bet I could construct one if I tried
<geist> so the canonical example is something like: you get an irq, go read the device that caused it, handle the situation, then go back and ack the irq at the interrupt controller
<geist> since both of them are different devices, the ack can appear to the IC before the device got its writes
<zid> involving something like IRQs, yea :P
<mrvn> Huh? I thought the reorder bit would cover that case
<zid> how does it know they're different 'devices'?
<geist> mrvn: 'early acknowledge'
<geist> setting of it means it fires and forgets the writes across the bus
<mrvn> So you are saying the CPU writes them out in-order but the bus with all its hops reorders them?
<geist> correct
<geist> but there are rules there, but the way AXI busses work on ARM think of it as a tree of point to point
<geist> as the memory transaction drills across the tree, it goes from the root (the cpu) down towards devices. as it forks and heads across branches, they can get locally buffered and appear out of order
<geist> based on contention
<mrvn> Ok, so if the device and GIC are on different branches they can get out of order in respect to each other.
<geist> so the ARM ARM is very squishy about it. basically within a 'device aperture' or whatever they call it, you can be pretty assured stuff is in order
<geist> since there's only one path to that device
<geist> but as you start crossing devices, you can't up front know if they're on the same leaf node in the tree
<mrvn> The GIC gets the interrupt cleared before the device gets it turned off and you get a spurious interrupt
<geist> mrvn: right. a DSB would force the cpu to stop and wait for all outstanding writes to be acked
<zid> now I want an acking write/read instruction and a pointer tag from C
<geist> this can also happen with things like setting up memory descriptors in uncached memory and then hitting the doorbell register in a corresponding device
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
masoudd has quit [Ping timeout: 256 seconds]
<geist> basically the ARM ARM has all sorts of squishy language about how things are in order in some cases and there are some cases where the bus can reorganize stuff, so you have to be aware of what domain various things you're interacting are on
<geist> or... you put in lots of memory barriers
immibis has quit [Ping timeout: 252 seconds]
<geist> i think the canonical thing to do is interact with a device's registers and then when you're 'done' dealing with it insert a barrier to ensure the device sees what you told it
<geist> so if you read/write 3 regs in a row and are now done (say in your irq handler). issue a DSB and then exit
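A sketch of the irq case described above, under stated assumptions: the register pointers and the function names are hypothetical, `dsb sy` is used as the conservative choice, and on anything that is not AArch64 the barrier falls back to a compiler builtin fence purely so the sketch builds on a host machine. The idea is that the device must have seen the write that turns its interrupt off before the interrupt controller sees the ack.

```c
#include <stdint.h>

/* Wait for outstanding writes to complete before moving on.
 * Non-AArch64 builds use a host-side stand-in, not a real bus barrier. */
static inline void mmio_barrier(void)
{
#if defined(__aarch64__)
    __asm__ __volatile__("dsb sy" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/* dev_irq_clear and intc_eoi are hypothetical registers living on
 * two different devices, so the bus may deliver writes to them in
 * either order unless a barrier separates them. */
void device_irq_handler(volatile uint32_t *dev_irq_clear,
                        volatile uint32_t *intc_eoi,
                        uint32_t irq)
{
    *dev_irq_clear = 1u;   /* tell the device to drop its interrupt line */
    mmio_barrier();        /* done with the device: make sure it saw that */
    *intc_eoi = irq;       /* only now ack at the interrupt controller */
    mmio_barrier();        /* done with the controller as well, then exit */
}
```

Without the first barrier, the ack can reach the interrupt controller while the device is still asserting its line, which is the spurious interrupt mrvn describes.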
<mrvn> never thought having writes to devices having to be in a certain order as long as everything ends up where it should, But you are right, with devices connected to other devices (like interrupts) the order can matter.
<zid> so it is not the cpu with the ordering issue here, but the bus itself?
<geist> or between things like setting up in memory descriptors and then linking it in a HEAD/TAIL pointer on your e1000 or whatnot
<zid> hence how it knows about 'different devices'
<mrvn> zid: yep
<geist> zid: yah the cpu itself doesn't know what it's talking to, so its a bit of the way the busses are wired up that's bleeding in
<geist> you *could* map all MMIOs as strongly ordered, but that's *really* slow
<zid> and you work around it with E bits or a manual flush
<geist> since effectively that means the cpu is stuffing in a full bus barrier between every read/write
<mrvn> geist: so we are back to having to add barriers whenever we switch between peripherals?
<geist> basically
<geist> or when crossing from memory to peripheral and back
<geist> ie in memory descriptors
<geist> or you can just stuff a memory barrier after every mmio transaction
<geist> ie, mmio_write() { ....; dsb; }
<geist> i think that's *basically* what we do in fuchsia, which is sub optimal, but probably the safest thing to do
<zid> You might be able to do some silly stuff to make it automatic to get the inter-device flushy stuff working
<geist> also one of those reasons the whole struct { volatile ; } thing falls over
<zid> something like a locking pattern
<geist> yah
<mrvn> I had some code for the RPi where the last device used is a template parameter sort of. So at the first access to a peripheral it generates a barrier because it's != the last.
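mrvn's version tracks the last device at compile time through a template parameter; since the language here is C, the following is a runtime variant of the same idea, not his code. The peripheral ids, the function names and the `barrier_count` counter are all hypothetical, and the non-AArch64 fallback exists only so the sketch builds on a host. A barrier is issued only when an access targets a different peripheral than the previous one.

```c
#include <stdint.h>

static inline void mmio_barrier(void)
{
#if defined(__aarch64__)
    __asm__ __volatile__("dsb sy" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);   /* host-side stand-in */
#endif
}

/* Hypothetical peripheral ids; PERIPH_NONE means "nothing touched yet". */
enum periph { PERIPH_NONE = 0, PERIPH_UART, PERIPH_INTC };

static enum periph last_periph = PERIPH_NONE;
static unsigned barrier_count;   /* illustration only: counts barriers issued */

/* Issue a barrier when switching to a different peripheral. */
static void periph_select(enum periph p)
{
    if (p != last_periph) {
        mmio_barrier();
        barrier_count++;
        last_periph = p;
    }
}

void periph_write32(enum periph p, volatile uint32_t *reg, uint32_t val)
{
    periph_select(p);
    *reg = val;
}

uint32_t periph_read32(enum periph p, volatile uint32_t *reg)
{
    periph_select(p);
    return *reg;
}
```

A single global `last_periph` is only reasonable on one core with interrupts accounted for; with several CPUs or an irq handler touching other devices, the tracking state would have to be per-context, which is part of why a barrier after every access is the simpler default.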
<geist> i'm not a fan of things that automatically do this, but since most people dont study the ARM ARM it's probably the right thing to do
<zid> Linter can just enforce that you do it
<zid> basically
<geist> i still dont think i fully grok the subtleties of it. especially the exact rules of switching between cached memory and device memory and what can be reordered relative to what
<geist> the manual reads like pages of legalese
<mrvn> geist: Today I would use something like pythons "with": with UART0 as uart { uart.data = 'h'; uart.data = 'e'; ... }
<zid> arm is silly news at 11? :p
<mrvn> or similar RAII like concept.
<geist> also as you can probably tell, most of this probably wont show up on a VM
<mrvn> What this really needs is a linear type system.
<geist> or when being emulated
<zid> I guess you could make this happen on any platform as long as your device took ages to deal with writes given to it
<geist> also since lots of implicit actions in ARM insert a memory barrier of some kind (though i haven't gotten into all the subtle variants of memory barrier on ARM)
<mrvn> emulated don't have the tree structure of the bus and VMs will flush or just take long enough.
<geist> lots of times you just dont notice
<zid> they just *happen* to all be fast
<zid> so ARM makes a situation that doesn't happen in practice, happen in practice
<geist> zid: honestly i dunno what the x86 memory model is with regards to io transactions
<mrvn> zid: you need 2 devices that are interconnected so the write order matters.
<geist> presumably they also just have a complicated back channel of acks and whatnot to make everything appear in order
<zid> if everybody knew acking an e1000 took 40uS, we'd already have code to deal with not acking the IRQ controller too early
<zid> it just happens to be sub-instruction instead so we don't
<geist> but sometimes i'll ask someone at work that really groks x86 at the physical level and they'll tell me it's fantastically complicated and there are ways to make things weakly ordered if you really know what you're doing
<mrvn> geist: doesn't x86 with iommu do cache snooping?
<zid> my personal bet is just that the fact threads *do* exist means those barriers always present in practice
<geist> even without
<geist> anyway, so that was my long answer to 'welll, actually' from an hour ago
<geist> me regurgitating this every once in a while refreshes my memory
<geist> i'm getting pretty good at the nGnRnE thing now
<mrvn> geist: is a barrier on every write better than the nE flag?
<geist> that's a good question that i honestly dont know
<geist> i *think* strongly ordered has more strictness that may be even worse than that though
<mrvn> I would think the barrier is the same as nE except you waste an extra opcode
<mrvn> but ARM has so many barrier types ....
<geist> i think it may have to do with other things like how all of this interacts with normal memory transactions that may be in flight
<geist> yah also note in this case you'd probably do a `DSB SY`
<geist> or maybe... `DSB OST`? i always have to think about it
<geist> whereas a strongly ordered barrier is probably closest to `DSB SY` which is the strongest
<geist> and also though i keep saying memory barrier, dont confuse it with DMB, which is different from DSB (DSB is 'stronger')
<geist> and that particular distinction i constantly have to go back to the manual for
<mrvn> and it's a real mess in the manuals for the RPi 1
<geist> yah that was simply broken. and thus i would advise just tossing it in the bin and move on
<mrvn> the barrier opcode was added in ARMv7, right?
<geist> broken hardware is just not worth fucking with
<geist> yes. i think in armv6 it was at best some control op
<mrvn> yes, lots of coprocessor ops for barrier and they are ugly to understand
<geist> fundamentally ARM arch is designed by cpu engineers for cpu engineers. when they have a choice they seem to pretty consistently make the decision to take the route that maximizes the ability for hardware to do what it wants
<geist> and makes it a SW problem
<mrvn> the RISC philosophy
<geist> i always thought there was a lot of potential to build a *really* high end cpu out of the arch (ie, apple M1) but the cost is no one fully groks it
<geist> well anyway, gotta get some work done
<mrvn> sometimes I think CPUs should be even less ordered so errors happen a lot when you use it wrong. Maybe a flag to explicitly mess up the order randomly at the cost of speed.
<geist> yah that's an issue honestly, especially with lower end ARM cores like cortex-a53. it's allowed to do all sorts of crazy stuff, but in practice it only has a limited amount OOO
<geist> so it appears largely in order memory wise, so it really doesn't expose you to the full joy of something that'll bite you
<geist> also why trying to run on something like M1 is pretty useful. it'll shake out lots of issues pretty fast
<mrvn> I would have thought larger CPUs with more busses would be more critical
<geist> that too
<geist> was just thinking about general weak memory stuff
<geist> the more stuff in flight, the more OOO the core is, the more ability it has to really take advantage of the weak memory model
<nur> I feel like I've read about os stuff for years and I don't know what _any_ of that meant.
<geist> that's pure cpu architecture stuff
<geist> which intersects with osdev, but isn't so much the main goal
dh` has joined #osdev
<nur> is this going to be on the test
<nur> :)
<mrvn> nur: did they teach you about it?
<nur> it was a joke
<nur> I'm not in school
<mrvn> Most people think of a cpu of executing one opcode after the next sequentially
<geist> also in general this channel tends to be skewed towards low level osdev. which is of course a large part of it, but i think largely because usually you start with an empty computer and build a kernel, etc
<mrvn> But all levels of hardware do things in parallel or out of order nowadays
<geist> but in the long run most of the meat of osdev is kernel and up
<geist> i'm just personally less interested in that, primarily because i like to deal with low level cpu bits
<zid> In the long run, most of osdev is tweaking your voltages and cpu frequency multipliers, silly
<geist> and dont forget memory timings
<nur> geist, I feel like we need to know everything
<zid> That's what I've been doing this week, so ergo it's true
<geist> nur: well doesn't *hurt*
<mrvn> who needs memory. I want to run my RPi in cache only.
<zid> my 1650 doesn't seem like an especially good bin :(
<geist> i personally find it fun to just learn as much as possible about everything
<geist> but whether or not that matters i dunno
<zid> geist we should buy me 10 more to test
<nur> like subtle interactions that make you tear your hair out
<nur> I'm intrigued by the RPI has a "main CPU which is the GPU" doing the "real running"
<geist> zid: how about we buy one modern thing that completely destroys your old 1650 and move on
<zid> 1390p only got a paper launch though
<geist> it'd be even more power efficient at that
<nur> the GPU bits feel opaque
<mrvn> nur: they are, except when you need to do DMA
<zid> idk about the power efficiency, have you seen the listed TDPs on modern chips
<zid> they stuff them full of cores and run them at 295W TDP out of the box
<mrvn> nur: In some places you have to set the real physical address and not the address the ARM cpu sees.
<nur> uh oh
<nur> the "all addresses are virtual addresses, relax" adage goes out the window then
<geist> zid: and they also do about 10x as much work with that
<mrvn> nur: that's not true in kernel world at all
<geist> and personally i *wouldnt* advise getting modern cpus that burn 250W (*cough* intel)
<zid> w-1390p is actually a solid upgrade to mine but doesn't actually.. exist
<mrvn> geist: if only game programmers would understand multithreading
<geist> my ryzen 5950x comfortably pulls 150 at full tilt and is probably at the minimum many times the horsepower of your older xeon
<zid> not according to benchmarks
<geist> i think i disagree on that one
<geist> though of course the fun one is the M1 cpus sip power and get pretty darn close, at least single cpu benchmarky
wand has joined #osdev
<zid> yea the M1 is what happens if you don't waste 10 years, I guess
<zid> considering my crap is at least still competitive with 10nm+++++ and it's on 32nm, intel aren't even trying
<geist> but even that the ryzens are pretty darn efficient. the very new intel stuff is starting to get closer again, because they're finally consistently making 10nm and lower stuff
<geist> but all the new ones are still generally burning a lot more power to do the same thing than a 7 or 5nm ryzen
<zid> there's no way they can't make something with 80GB/s memory bw, 40 pci-e lanes, 4-8 cores at 5GHz without it being a $4000 xeon on 10nm, when they did exactly that in 2011 on 32nm for a cheap OEM cpu
<geist> sure, but does 80GB/sec matter?
<geist> are you doing something that actually saturates that?
<mrvn> geist: factorio :)
<geist> newer DDR tech will at the minimum have much faster access times given 10 years of development just due to higher clock rates if nothing else
<zid> They actually don't
<zid> you need to go VERY expensive and high end on ddr4 to get close to ddr3
<zid> it was significantly *slower* than ddr3 for a few years
<mrvn> aren't they slower due to deeper trees?
<geist> it's common that a newer revision is worse than the high end one from before but usually that's made up over time
<geist> 'for a few years' is the key
<zid> as in, for a few years you couldn't even BUY ddr4 that fast
<geist> but then over time it usually supersedes the last in every metric, and now we're up to DDR5
<zid> yea ddr5 actually looks decent
<zid> ddr4 just sucked
<mrvn> Why isn't there QDR?
<geist> anyway, i'm not dissing your setup, i just think it's kinda a dead end to keep trying to microoptimize that thing
<zid> which was made worse by intel not releasing quad channel setups for it so you couldn't even mitigate it
<geist> but of course, i also buy vaxes and old macs and whatnot, so i have no legs to stand on
<zid> It's.. fun though?
<geist> totes
<zid> and basically free
<geist> ah yeah if you're upgrading for free, continue :)
<zid> something tells me your 5950x, which gets what, at most double the perf on a good day for the things I use it for, wasn't £20
<geist> honestly the power usage of old machines is the reason i tend not to use them even if i have them around
<geist> i have a sandy bridge 2600 and a nehalem E5520 box that i could resurrect
<geist> but at the end of the day they're 250W boxes that are a fraction of the horsepower of something modern
<geist> definitely the case with the old G5 powermac from 2005
<geist> 250W space heater
<mrvn> Other than games I never needed the power.
<zid> my cpu runs 30W watching youtube or whatever, dram uses about 5W
<zid> gpu probably uses about that to render it
<geist> i kinda doubt it's *simply* double the perf. if nothing else it has 16 cores
<zid> PCH etc use another 10
<zid> I don't use 16 cores
<zid> I use 1 core, 99.99% of the time
<geist> ah, so now we're at the meat of it.
<zid> It's a desktop, not a webserver
<geist> sure, if single cpu is your jam, then lots of stuff in the last 10 years isn't that interesting
<zid> I did say I wanted 4-6 cores :p
<zid> I don't want to pay the heat/power costs on the extra cores, I won't use them
<geist> i do enough big compilation and stuff like fpga working and whatnot that lights them all up
<zid> I'd probably just set up distcc or something if that were my jam
<geist> FWIW modern designs do a darn good job of downclocking unused stuff
<zid> I'd still play elden ring on a quad core
<zid> turbo 3.0 from intel helps
<zid> idk what amd is like wrt that
<geist> same thing, different name
<mrvn> I use 16 cores 1% of the time. But when I do it really helps.
wand has quit [Quit: leaving]
<zid> turbo 3.0 has a single core turbo option, it still drags the other cores up but not as much
<geist> it's all dynamic nowadays
wand has joined #osdev
<mrvn> make clean; make -j
<zid> you can't go core 0 5GHz + 23 cores at 1GHz so far as I know
<zid> they end up at 4.2 or whatever
<geist> yep. same thing on AMD. it's a global optimization. some cores run faster than others, and when others come up it starts downclocking everything to maintain a reasonable TDP
<mrvn> geist: don't they all run on the same voltage and then the speed is limited to a narrow range?
<geist> 'overclocking' nowadays is generally just pushing that TDP limit up so that it can push more cores longer
<zid> yea, tau timers and shit in your firmware to illegal values etc :p
<geist> mrvn: i dunno. i think it's pretty fancy
<geist> at the minimum on my machine it has two separate dies for each 8 cores so there's a possibility
<zid> I'm actually tempted to try disabling some cores to see what it does to the heat
<geist> but yeah for the ryzen stuff 16 cores is pretty overkill. an 8 core ryzen is a good gaming system. like 5800x or whatnot usually benchmarks a teensy bit better in games anyway
<geist> simply because you have less L3 cache traversals if nothing else
<zid> even 8 is massively overkill
<geist> and theres a bit more clocking headroom
<zid> you'll run out of vram or memory or something before you run out of clockspeed with 8
<geist> i wouldn't say that necessarily. newer games are getting designed for 8 core machines (PS5, xbox, etc) so you're starting to see them more used
<mrvn> from my understanding those turbo modes are driving the cores above what you can cool in the hope the cpu sleeps enough to not get too hot.
<zid> if you were trying to say, run 18 copies of WoW
<zid> yea but their cpus are 20% as good :P
<geist> they're ryzen 2s
<geist> pretty modern. high end as of 2019 or so
<mrvn> geist: With the average core count being ~5 and single core getting only 25% performance the game industry is changing. The programmers still don't understand it.
<geist> alright.
<zid> *checks what a ps5 has in it*
<mrvn> Plus game design takes years. They can easily be a decade behind the curve.
<geist> i kinda disagree there, but also depends a lot on what kinda game you're talking about
<zid> 8 core 3.5GHz apparently
<geist> zen 2
<zid> might have some cut-corners to reduce heat, like turning off memory reordering etc idk
<zid> I doubt they'll tell us
<geist> yah so pretty darn modern. the downclock is the main loss. also ps5 and xbox have pretty huge memory busses IIRC
<mrvn> Even today game designers thing a game should do this: while true { read input; game tick; render; }
<mrvn> think even
<geist> zid: not really. it's just a PC in a can. the cpu is custom, but it's not exotic
<zid> that's what they did for zen3
<zid> you know?
<zid> err typo
<zid> memory *renaming*
<geist> sure because they came up with a newer/better scheme
<zid> zen2 had memory renaming, zen3 cut it, presumably it was eating too much budget for too little reward
<geist> exactly
<zid> and it's common for console cpus to cut things like that for heat, historically
<geist> and toss a few more load/store units it gets swamped by that
<zid> cus they're ran in tiny little boxes with bad cooling
<geist> *shrug* okay. i mean i guess you have a point of view here so i dont particularly want to argue it
<zid> 360s famously all desoldered themselves
<geist> anyway, need to get back to work
<zid> I mean, neither of us knows either way
<zid> it's obviously not going to be stupidly slow
<zid> but I still get more total bogomips
<geist> but... this is why i want to get ahold of an alder lake. i've been Zen based for the last few years and have been pleased, so will be interesting to see what intel has come up with in the golden cove cores
<zid> 12600k looks seriously legit
<geist> yah
<zid> 6 core, 4.9GHz stock, accepts high end DDR4 to mitigate the dual channel, ecc supported, 20 pci-e lanes instead of 16
<zid> so now you can have an ssd AND a graphics card
<mrvn> where do you put the NIC?
<zid> it cuts down the stuff I don't *really* use (high bandwidth, more pci-e lanes than I can actually use in practice) and focuses a little on the stuff I *do* use a lot, fast single cores
<geist> (sounds like zid just described a Ryzen circa 2017)
<mrvn> 16 lanes GPU, 2 lanes M2.key, 1 lane SSD, 1 lane NIC?
<zid> lmk when a 2017 ryzen is £20
<zid> 3600x or something looked good, if that's the right gen
<geist> hmm lets see there's one for $60
<geist> what's the exchange rate now?
<zid> 1:1 probably? :P 1.1 1.2
<zid> 3600X is a 6C 4.4GHz
<geist> anyway *really* have to go now. meeting in 8 minutes
<geist> bye!
<zid> £100 on ebay
<zid> new they're still £200, same price as a 12600k
pretty_dumm_guy has joined #osdev
GeDaMo has quit [Remote host closed the connection]
terminalpusher has quit [Remote host closed the connection]
dormito has quit [Quit: WeeChat 3.3]
ZipCPU has quit [Ping timeout: 240 seconds]
ZipCPU has joined #osdev
pretty_dumm_guy has quit [Ping timeout: 240 seconds]
__xor is now known as _xor
pretty_dumm_guy has joined #osdev
wootehfoot has joined #osdev
zid has quit []
Oli has joined #osdev
wootehfoot has quit [Read error: Connection reset by peer]
dequbed has quit [Quit: bye!]
dequbed has joined #osdev
Oli_ has joined #osdev
Oli has quit [Ping timeout: 272 seconds]
dude12312414 has joined #osdev
dormito has joined #osdev
troseman has joined #osdev
pretty_dumm_guy has quit [Quit: WeeChat 3.4]
pretty_dumm_guy has joined #osdev
pretty_dumm_guy has quit [Client Quit]
pretty_dumm_guy has joined #osdev
pretty_dumm_guy has quit [Client Quit]
zid has joined #osdev