<klange>
what CPU is this on? maybe I have a bug in my vectorized alphablitter
<heat>
kabylake r
<heat>
passing -cpu host fixes it I think
<heat>
yeah for sure
<heat>
-cpu qemu64 is buggy, -cpu host/-cpu haswell isn't
gog has quit [Ping timeout: 260 seconds]
<heat>
nehalem is also buggy
<heat>
sandybridge isn't
<klange>
-cpu shouldn't do anything on kvm beyond change the strings reported by cpuid?
xenos1984 has quit [Read error: Connection reset by peer]
<heat>
nah I think it also changes the cpuid
<heat>
yeah
<heat>
kvm -cpu qemu64 doesn't report avx for instance
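(For reference, a quick guest-side check of that bit, assuming GCC's <cpuid.h>: AVX is CPUID leaf 1, ECX bit 28, exposed there as bit_AVX.)

    #include <cpuid.h>
    #include <stdbool.h>

    /* Sketch: query CPUID leaf 1 and test the AVX feature bit (ECX bit 28).
     * Under -cpu qemu64 this comes back false even on an AVX-capable host. */
    static bool cpu_has_avx(void)
    {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return false;
        return (ecx & bit_AVX) != 0;
    }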
<klange>
but I don't check any of those things, and the alphablitter is SSE2... and it looks like it's actually a rendering issue with the compositor and not an output problem
<klange>
and I don't see gcc producing any alternate functions for this stuff
<klange>
also I test natively on a nehalem regularly, really feels like maybe this is a QEMU bug somewhere, but I can't reproduce it in anything so I have no idea
ZipCPU_ has joined #osdev
Optimus has quit [Ping timeout: 272 seconds]
ZipCPU has quit [Ping timeout: 252 seconds]
<heat>
if I open a terminal and move it through the screen it bugs out as well
<heat>
sorry for the slowness but OBS software encoding was tanking my perf
<heat>
the effect is still the same without obs slowing things down
xenos1984 has joined #osdev
<klange>
that's so weird, it's like occasionally it just decides 0 is the right input for the blit...
<heat>
it's still broken in -cpu Westmere btw
<klange>
Again, I'm reasonably certain that has no actual effect on the emulation beyond the cpuid report.
<klange>
At least in KVM.
<heat>
probably
<klange>
Maaaybe it's a memory issue... source being read as all zeros... and that could actually be an emulation issue as QEMU has full control over where guest RAM comes from.
<heat>
sandybridge introduced xsave, avx
<heat>
maybe it's a context switching bug?
<heat>
hmm no you're using fxsave
<heat>
you're not using xsave and avx but this is weird
<heat>
I can reliably show it's broken on westmere and can reliably show it's working on sandybridge+
<heat>
(westmere and before)
Jari-- has joined #osdev
<klange>
ah, actually, fxsave/xsave involves the hypervisor, so maybe it's a bug in the handling of that... there's actually a crash around this that was discovered in some 6.2 versions
<klange>
can you try a newer qemu (latest git, or 7 rc?) or an older one? or a different hypervisor altogether?
<heat>
hold on let me reboot
<heat>
i'll try ubuntu's qemu under wsl2
heat has quit [Remote host closed the connection]
smeso has quit [Quit: smeso]
heat has joined #osdev
<heat>
klange, seems to work
<heat>
qemu 4.2.1 using kvm and -cpu Westmere
<klange>
yeah gonna guess it's a hypervisor bug around save/restore in 6.2, lovely - probably because test coverage for legacy fxsave is limited :)
<heat>
also works in virtualbox
<heat>
yeah but why would it work with no issues on newer CPUs?
<heat>
unless -cpu does more than just setting cpuid bits in kvm
<klange>
yes, I did look and it does have hypervisor interaction for this
<moon-child>
is there any issue in practice with not qualifying atomic variables as such, so long as all accesses to them go through __atomic_*?
eroux has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<Griwes>
in practice, and as long as it's all __atomic_*? no
<heat>
moon-child, no.
<heat>
_Atomic is only a helper if you want to do e.g. var += 10; and have that be atomic
<heat>
__atomic_* and __sync_* work on non-_Atomic variables
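A minimal sketch of that usage, assuming GCC or Clang: the counter is a plain int and every access goes through the __atomic builtins (names and memory orders are just for illustration).

    /* Plain (non-_Atomic) variable, touched only through __atomic_* builtins. */
    static int refcount;

    void ref_get(void)  { __atomic_fetch_add(&refcount, 1, __ATOMIC_RELAXED); }

    int  ref_put(void)  /* returns the new count */
    {
        return __atomic_sub_fetch(&refcount, 1, __ATOMIC_ACQ_REL);
    }

    int  ref_read(void) { return __atomic_load_n(&refcount, __ATOMIC_ACQUIRE); }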
<moon-child>
ok, thanks
<moon-child>
just checking there weren't any weird memory model problems
<Griwes>
it's how lock-free atomic<T> is implemented in C++
<Griwes>
(and approximately how lock-free atomic_ref<T> is implemented in C++)
<heat>
clang does have __c11_atomic_* which are mostly undocumented and those take an _Atomic Type*
<heat>
(why do they have it? *shrug*, it's only used in compiler-rt and possibly <atomic>)
<Griwes>
there's a bad C11 backend for atomic<T> but I think it just uses _Atomic and not the intrinsics
<Griwes>
(it's bad because unlike the __atomic_* backend, you can't reuse its code to do atomic_ref<T>)
<klange>
heat: if it's used by compiler-rt it's probably just the internal implementation that guarantees the c11 semantics?
heat_ has joined #osdev
heat has quit [Read error: Connection reset by peer]
<heat_>
I suspect compiler-rt is actually implementing __atomic_* and __c11_atomic* is the backend
<heat_>
certainly looks like it
<heat_>
the API is almost one-to-one
heat_ is now known as heat
<heat>
Also TIL that lockful atomics are done in compiler-rt
<heat>
they allocate 2 pages full of locks and hash your address
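Roughly that scheme, as a sketch (sizes and names invented; the real compiler-rt atomic.c differs in detail): a fixed table of spinlocks indexed by a hash of the object's address guards the memcpy for oversized atomics.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <string.h>

    /* Sketch of an address-hashed lock table for non-lock-free atomics. */
    #define NLOCKS 1024                        /* roughly "2 pages of locks" */
    static atomic_flag locks[NLOCKS];

    static atomic_flag *lock_for(const void *p)
    {
        uintptr_t a = (uintptr_t)p;
        return &locks[(a >> 4) % NLOCKS];      /* cheap hash of the address */
    }

    void atomic_load_large(size_t size, const void *src, void *dst)
    {
        atomic_flag *l = lock_for(src);
        while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
            ;                                  /* spin until we own the lock */
        memcpy(dst, src, size);
        atomic_flag_clear_explicit(l, memory_order_release);
    }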
<Griwes>
Yeah, that's one of the ways to do nonlockfree atomics
<Griwes>
And close to the only way to do nonlockfree atomic_ref
<heat>
i genuinely dont see a use for lockful atomics
<heat>
if you're doing atomics with stuff that big you should just use a lock explicitly
<Griwes>
they enable efficient generic code
<Griwes>
just write a concurrent hash table using the atomic interface and it'll be super fast for small types, and still functional for larger ones
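As a toy sketch of that point (names and sizes invented): the insert path below only needs a CAS on the key slot, so the same generic code keeps working whether the slot's atomic is lock-free or backed by an internal lock.

    #include <stdbool.h>
    #include <stdint.h>

    #define EMPTY 0u
    #define CAP   1024u

    /* Toy open-addressing table; slots are plain words driven via __atomic_*.
     * (Duplicate inserts of the same key can race on the value; ignored here.) */
    static uint32_t keys[CAP];
    static uint32_t vals[CAP];

    bool insert(uint32_t key, uint32_t val)    /* key must be nonzero */
    {
        for (uint32_t i = key % CAP, n = 0; n < CAP; i = (i + 1) % CAP, n++) {
            uint32_t expected = EMPTY;
            /* Claim the slot with a CAS on the key word... */
            if (__atomic_compare_exchange_n(&keys[i], &expected, key, false,
                                            __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)
                || expected == key) {
                /* ...then publish the value. */
                __atomic_store_n(&vals[i], val, __ATOMIC_RELEASE);
                return true;
            }
        }
        return false;                          /* table full */
    }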
mahmutov has joined #osdev
<heat>
i've just noticed that atomics for floating point dont exist
<Griwes>
they do
<heat>
what?
<heat>
which instructions?
<Griwes>
if you're talking about atomic<float>, it always existed, but without fetch_add and friends (and it gets fetch_add and friends in c++20)
<heat>
i was thinking about actual atomic instructions
<Griwes>
it'll boil down to a cmpxchg loop in the instruction stream
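The usual shape of that loop, sketched with the __atomic builtins on the raw 32-bit representation (pedantically this type-puns through a cast, which real implementations handle more carefully):

    #include <stdint.h>
    #include <string.h>

    /* Sketch: fetch_add for float built from a CAS loop on the value's bits,
     * roughly what atomic<float>::fetch_add lowers to. */
    float atomic_fetch_add_float(float *p, float delta)
    {
        uint32_t old_bits = __atomic_load_n((uint32_t *)p, __ATOMIC_RELAXED);
        for (;;) {
            float old, new_val;
            memcpy(&old, &old_bits, sizeof old);        /* bits -> float */
            new_val = old + delta;
            uint32_t new_bits;
            memcpy(&new_bits, &new_val, sizeof new_bits);
            if (__atomic_compare_exchange_n((uint32_t *)p, &old_bits, new_bits,
                                            true, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
                return old;                              /* value before the add */
            /* old_bits was refreshed by the failed CAS; go around again. */
        }
    }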
<heat>
yuck
<heat>
anyway gtg
srjek has joined #osdev
heat has quit [Ping timeout: 260 seconds]
srjek has quit [Ping timeout: 250 seconds]
<moon-child>
Griwes: if you write a concurrent hash table, you should use a pointer to access large types
<Griwes>
it depends
<Griwes>
the base implementation shouldn't do that
<Griwes>
because it can fairly easily murder your perf: you lose data locality, and that can be absolutely worse than having a briefly held lock
<Griwes>
a user can use an indirect type with an indirect hasher if they wish
<moon-child>
if you lock, then you should do an rwlock the whole table
<moon-child>
which is completely different from a concurrent hash table
<Griwes>
then you aren't a concurrent hash table
<moon-child>
yes, exactly
<Griwes>
with an atomic with an embedded lock, you literally just lock it for a memcpy in cmpxchg and that's it
<Griwes>
and I'm not talking out of my ass, we have people doing literally that
<moon-child>
I have difficulty believing that would be faster than locking the whole table
<Griwes>
it is
<Griwes>
we know this from running a concurrent hash table on literally hundreds of concurrently executing threads on a gpu
<Griwes>
I'm not sure if there's easy things to link to with the data yet, but here's our concurrent data structures library: https://github.com/NVIDIA/cuCollections
<moon-child>
ah, ok
<moon-child>
gpus have different performance characteristics
<moon-child>
than cpus
<Griwes>
yes and no
<Griwes>
it's not the fact it's on a gpu, it's the fact that we're running on a massive number of threads
<Griwes>
if you do the same on a many core cpu, you'll want to do pretty much exactly what you do on a gpu
<klange>
woops misfired a shortcut
<Griwes>
if you have like 4 cores doing things on the hash map, you quite possibly don't need a concurrent hash map like this
<moon-child>
I don't know gpu performance well. But a contributing factor would seem to be less ooo to hide load latencies
<moon-child>
also, compacting gc is viable on cpus, probably not on gpus, reducing overhead of indirections
<Griwes>
vOv at a scale of a hundred threads, you don't want to have congestion on a single hash map lock, and the lock granularity is good enough that you end up mostly not spinning if you have good balance factor in your map
<Griwes>
And you end up in a place where the indirection latency can visibly hurt perf
<Griwes>
Also remember that gpus are increasingly more and more similar to cpus - they'll never be the same, because they optimize throughput vs latency, but if you haven't been tracking gpu developments over the past X years, you'd probably be surprised by where we are these days
<moon-child>
curious, do branches still work like they used to? Is there a cleverer method for dealing with them now?
<Griwes>
At least for our gpus, since Volta and up, yes but also no, because you do have access to independent thread scheduling, which allows you to actually do CAS loops on the gpu without deadlocks
<Griwes>
Not sure where amd's at with that
<moon-child>
scheduling is automatic, or you have to request it?
<Griwes>
You mean independent thread scheduling? It's an intrinsic property of the architecture
<Griwes>
It's not free if you have divergence like that, but it results in a correct execution
mahmutov has quit [Ping timeout: 260 seconds]
eroux has joined #osdev
eroux has quit [Ping timeout: 246 seconds]
<mrvn>
The great thing about lockful atomics is that you can use atomic<T> without having to specialize for T being primitive or not in template code.
<mrvn>
moon-child: If your object is about cache-line size and you use atomic pointers in the table then you have false sharing, while a locked atomic could have one object per cache line. Can make a huge difference. Or the CAS loop for the lock can suck all the performance out of it and be much slower. There is a HUGE difference in speed depending on access patterns and you basically have to try all the different ways to implement something like a hashtable to find the one that works best for your use case.
<moon-child>
w/avx512 you have atomic read/write of entire cache lines
<moon-child>
not guaranteed, but in practice
<moon-child>
but no rmw :/
<moon-child>
I think arm ll/sc is per cache line though?
<moon-child>
nope looks like it's just per word
<nur>
oh it's that time of year again for PonyOS I see
dennis95 has joined #osdev
m3a has quit [Quit: leaving]
<kazinsal>
yep, it's The Worst Day On The Internet
<nur>
it's not so bad, we still get our daily cat videos
Payam69 has joined #osdev
haliucinas has quit [Quit: haliucinas]
haliucinas has joined #osdev
<Griwes>
you know what I really appreciate about PonyOS?
<Griwes>
it has a very predictable and consistent release schedule
Optimus has joined #osdev
Optimus has quit [Client Quit]
<klange>
I have missed it twice because I was moving (2016, 2019).
<klange>
But otherwise, sure.
Optimus has joined #osdev
Optimus has quit [Client Quit]
<Griwes>
it'd be fairly regular if you missed this year!
<bslsk05>
www.theregister.com: The weird world of non-C operating systems • The Register
Jari-- has quit [Ping timeout: 245 seconds]
Optimus has joined #osdev
<FireFly>
\o/
gog has joined #osdev
GeDaMo has joined #osdev
C-Man has joined #osdev
ckie has quit [Quit: off I go~]
ckie has joined #osdev
vin has quit [Remote host closed the connection]
freakazoid12345 has quit [Ping timeout: 260 seconds]
<mrvn>
oops
<mrvn>
Thinking about the atomics more I don't really get why CPUs don't have atomics for everything up to cache lines. Isn't even an atomic byte going to lock the whole cache line for the duration of the operation anyway?
<moon-child>
yeah idk
ns12 has joined #osdev
<ns12>
Hello, I am trying to understand the code of xv6 (x86 edition). I am stuck on the assembly code. What should I read to learn about that? Should I read the Intel 80386 programmer's manual? Or should I be reading something more recent?
<mrvn>
you shouldn't need any real understanding of x86 to understand their code.
<klange>
there's not much assembly in xv6, and it's all pretty well commented, imo
<mrvn>
If you know the concepts of CPU, MMU, ... you should be fine with the code and comments and the intel or amd manuals to lookup the meaning of bits and flags for the hardware.
<Ermine>
you may want to read the xv6 book along with the code
<mrvn>
Doesn't look like the MIT lectures are online for free. :(
<Ermine>
they describe what the asm parts do (like setting up the stack)
<Ermine>
xv6 book is available on their site
<Ermine>
Unfortunately, there are various editions
<bslsk05>
twitter: <MalwareMinigun> Dear hardware vendors:  Every CPU et al. needs to implement atomic operations by tweaking their memory consensus protocol to support it. The consensus protocol tends to operate in at least 64b chunks ("cache lines"). So why is the biggest atomic op commonly available 16b?
<clever>
Griwes: arm kinda has that, but only allowing a single store to be atomic
<clever>
the arm "load exclusive" will load data from ram->register, but tag the address as owned by you
<clever>
(but, an implementation detail, is that you own the whole cache line)
<clever>
"store exclusive" will then only store to the addr, if you still own it
<clever>
i can see how you might abuse that, to load-exclusive an addr, then load other addresses in the same cache-line
<clever>
and then finally store the result back to the cache line, only if you havent lost the race
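A minimal sketch of that primitive on AArch64 (GCC-style inline asm, no barriers): ldxr takes the reservation, and stxr reports 0 on success or 1 if the reservation was lost in between.

    #include <stdint.h>

    /* Sketch only, not production code: a 32-bit compare-and-swap built from
     * the load-exclusive / store-exclusive pair described above. */
    static inline uint32_t cas32(uint32_t *addr, uint32_t expected, uint32_t desired)
    {
        uint32_t old, fail;
        do {
            __asm__ volatile("ldxr %w0, [%1]" : "=r"(old) : "r"(addr) : "memory");
            if (old != expected) {
                __asm__ volatile("clrex");               /* drop the reservation */
                return old;
            }
            __asm__ volatile("stxr %w0, %w2, [%1]"
                             : "=&r"(fail)
                             : "r"(addr), "r"(desired)
                             : "memory");
        } while (fail);                                  /* lost the line: retry */
        return old;
    }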
<mjg>
huh. does someone have a usecase for such a thing though?
<mjg>
i suspect any legit use case is already covered by transactional memory extensions
mniip has quit [Ping timeout: 604 seconds]
SikkiLadho has joined #osdev
mniip has joined #osdev
<SikkiLadho>
Hi, I'm confused about forwarding PSCI SMCs to Trusted Firmware-A. Linux with a hypervisor underneath is able to successfully bring up the secondary cores only when SMC trapping is disabled. When I enable SMC trapping, everything works except that the secondary CPUs won't come online: I get "failed to come online" and "failed in unknown state : 0x0" for all three secondary CPUs. I checked whether there was something wrong with the arguments when SMC trapping is enabled, but the arguments are the same with or without SMC trapping, and the secondary cores still won't come online when trapping is enabled. What could be the reason?
<clever>
my rough understanding is that you need to emulate PSCI somewhat
<clever>
when you get a psci call from the guest, you need to schedule running the guest on that core
<clever>
and before you can do that, you need to get the hypervisor itself on that core
<clever>
so, issue your own SMC call into the real PSCI, telling it to run the hypervisor entry-point on the given core
<clever>
and then your hypervisor has to eret back to the guest, running the entry-point the guest asked PSCI to run
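A hypothetical sketch of that flow; every name below (hyp_secondary_entry, real_psci_cpu_on, forward_smc, cpu_index) is invented for illustration, and only the PSCI CPU_ON function id and argument layout come from the PSCI spec.

    #include <stdint.h>

    /* PSCI CPU_ON (SMC64) is function id 0xC4000003:
     * x1 = target MPIDR, x2 = entry point, x3 = context id. */
    #define PSCI_CPU_ON_64  0xC4000003u
    #define NR_CPUS         4

    struct pending_boot {
        uint64_t guest_entry;    /* where the guest asked the core to start */
        uint64_t guest_ctx;      /* x0 the guest expects at that entry point */
    };

    static struct pending_boot pending[NR_CPUS];

    extern unsigned cpu_index(uint64_t mpidr);       /* MPIDR -> 0..NR_CPUS-1 */
    extern void hyp_secondary_entry(void);           /* asm stub: set SP, then C */
    extern long forward_smc(uint64_t fid, uint64_t a1, uint64_t a2, uint64_t a3);
    extern long real_psci_cpu_on(uint64_t mpidr, uint64_t entry, uint64_t ctx);

    long handle_guest_smc(uint64_t fid, uint64_t x1, uint64_t x2, uint64_t x3)
    {
        if (fid != PSCI_CPU_ON_64)
            return forward_smc(fid, x1, x2, x3);      /* pass everything else through */

        /* Remember where the guest wanted this core; the new core's hypervisor
         * code will later ERET into the guest at exactly this address. */
        pending[cpu_index(x1)] = (struct pending_boot){ x2, x3 };

        /* Ask the real firmware to start the core at the hypervisor's own
         * entry point instead of the guest's. */
        return real_psci_cpu_on(x1, (uint64_t)(uintptr_t)hyp_secondary_entry, 0);
    }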
zaquest has joined #osdev
dude12312414 has joined #osdev
Vercas8 has joined #osdev
Vercas has quit [Quit: Ping timeout (120 seconds)]
<SikkiLadho>
The code has an assembly routine which checks if the SMC ID is CPU_ON and replaces the current address with a label above, where a print function would be called. I understand that I need to save the original entry point from Linux and eret to it. But for now, I'm just printing at my given entry point to check if the secondary CPUs are actually coming online to that address. Thank you for patiently answering my questions.
<bslsk05>
github.com: Leo/utils.S at 1508c19a2be51736f98077c8009b7570fdbd12af · SikkiLadho/Leo · GitHub
epony has joined #osdev
heat has quit [Remote host closed the connection]
knusbaum has joined #osdev
<clever>
SikkiLadho: but are you saving the original entry-point somewhere? i think this would be a lot more readable if it was in c
<SikkiLadho>
I first wrote it in C. Is a function pointer the equivalent of "adr x0, label"? I mean, can I replace the original entry point with a function pointer?
<SikkiLadho>
so that secondary cpus come online in that function?
<clever>
you need to keep in mind, the stack pointer wont be initialized when the other core comes up
<clever>
so you're better off using some asm to initialize the stack
<clever>
same way you initialized the 1st core before entering c
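For illustration, one way that stub can look on AArch64 (names and stack sizes invented), written as file-scope asm in a C file so the per-core stacks can live next to it:

    #include <stdint.h>

    #define NR_CPUS    4
    #define STACK_SIZE 4096

    /* One small stack per secondary core; referenced only from the asm below. */
    __attribute__((used, aligned(16)))
    uint8_t secondary_stacks[NR_CPUS][STACK_SIZE];

    void hyp_secondary_main(void);   /* C continues here once SP is valid */

    /* The first instructions a secondary core executes: derive a core index
     * from MPIDR_EL1 Aff0 (assumes simple linear numbering) and point SP at
     * the top of that core's stack before calling any C code. */
    __asm__(
        ".global hyp_secondary_entry\n"
        "hyp_secondary_entry:\n"
        "    mrs   x0, mpidr_el1\n"
        "    and   x0, x0, #0xff\n"                    /* core index = Aff0 */
        "    add   x0, x0, #1\n"                       /* want the stack's top */
        "    mov   x1, #4096\n"                        /* STACK_SIZE */
        "    mul   x1, x1, x0\n"
        "    adrp  x2, secondary_stacks\n"
        "    add   x2, x2, :lo12:secondary_stacks\n"
        "    add   x2, x2, x1\n"
        "    mov   sp, x2\n"                           /* stack is now usable */
        "    bl    hyp_secondary_main\n"
        "1:  wfe\n"
        "    b     1b\n"
    );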
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
freakazoid12345 has joined #osdev
knusbaum has joined #osdev
<SikkiLadho>
is the stack pointer in arm64 shared among all cores, or does every core have its own stack pointer?
<j`ey>
own stack pointer
Killaship34 has joined #osdev
<j`ey>
it couldn't work if they shared one!
<Killaship34>
hello
<SikkiLadho>
Thank you
<Killaship34>
Recently I've restarted development on my kernel, and I've got a solid error handler with stack tracing, but I don't really know what to add now.
<Killaship34>
Does anyone have any suggestions?
<Killaship34>
I've been thinking about a memory manager, but currently that's a bit too hard for me
knusbaum has quit [Client Quit]
atrapa has joined #osdev
kkd has joined #osdev
knusbaum has joined #osdev
Killaship34 has quit [Read error: Connection reset by peer]
Killaship34 has joined #osdev
<kingoffrance>
mov ax, 5 -> meta_mov_set_r8_to_immediate you dont want my suggestions :D
<bslsk05>
mrvn/moose - My Own Operating System Environment (0 forks/4 stargazers)
<geist>
ah good question. yeah i'm guessing static vars inside functions like that are probably still in .data
<geist>
or .bss or whatnot, unless of course you override those too
<mrvn>
Each directory has a Makefile.obj that can specify a section name to add and can optimize with LTO.
mahmutov has quit [Ping timeout: 272 seconds]
<j`ey>
so the easiest way is to probably just compile this separately into a .a
<geist>
or into a combined .o
<mrvn>
No, .a is just an ar file containing a bunch of .o
<geist>
ld -r nicely merges .os together into larger .os which you can then modify
<j`ey>
sure .a, combined .o, whichever
<mrvn>
If you want to use LTO you need .o. Otherwise it doesn't matter.
<geist>
j`ey: oh no you dont get sassy with me!
<j`ey>
geist: :P
<geist>
been doing some tcpip hackery again. talking to the sortix irc server again
<geist>
always fun
<mrvn>
Changing the section names also requires linking to .o
<j`ey>
geist: fun
<j`ey>
oh hm, with ld -r I don't get undefined references, like I do with .a
<j`ey>
I think there might be another ld flag for that, will check later
<j`ey>
--no-undefined doesn't help, hm
<geist>
yeah .a files can get you into a recursive reference issue
<geist>
between .a files definitely, since there's an implicit notion that the linker can drop references to parts of the .a file that aren't used
<geist>
but a combined .o file is simply a larger .o file
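Concretely, the combined-.o flow looks something like this (module and file names invented); objcopy's --prefix-alloc-sections can then rename the module's sections if a linker script wants to place them as a unit:

    # relocatable ("partial") link: merge one module's objects into a single .o
    ld -r -o net_module.o tcp.o udp.o arp.o

    # optional: prefix the allocated sections so the final link script can
    # place this module's code/data together (GNU objcopy flag)
    objcopy --prefix-alloc-sections=.net net_module.o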
<j`ey>
I want to error if the .o has any undefined references
<j`ey>
maybe objcopy will error, I haven't tried that
<mrvn>
so you want to build a real executable but then later link multiple of them into a kernel.img?
<j`ey>
mrvn: something like that yeah
<mrvn>
maybe do just that. Build elf executables and then use them as input for a second link pass.
<geist>
in LK i link each module into a separate .o, and then link it
Likorn_ has quit [Quit: WeeChat 3.4.1]
<geist>
not strictly necessary, but it's nice to have the separate .os to then see where the size is going
<mrvn>
At some point you might just build an initrd
dude12312414 has joined #osdev
<mrvn>
Moose does that per directory with the option to map all sections and do an LTO pass. So a module can have multiple compilation units and then gets optimized internally into one .o file.
Burgundy has quit [Ping timeout: 260 seconds]
<mrvn>
also visibility of symbols
<mrvn>
the module can specify which symbols are visible from outside the directory.
<j`ey>
yeah I'll just build a full executable i guess