<doug16k>
it suddenly has essentially infinite decode bandwidth in loops
<mrvn>
if it can fit it in the cache
<doug16k>
yeah but that is like 1000+ insns now
<doug16k>
it's amazing
<mrvn>
why is the simple "loop" slower, then, if it would just stream the decoded opcodes?
<doug16k>
loop is deliberately bad. I have been avoiding loop since pentium
<doug16k>
the story I heard was, it is used in delay loops, so intel deliberately kept it bad
<doug16k>
since pentium, the simpler instructions are faster
<doug16k>
that is a microcoded instruction
<mrvn>
count-down loops are such a common thing the cpu really should have a fast opcode for them.
<doug16k>
well they kind of do - there is macro-op fusion that fuses dec jnz to one op
<mrvn>
doug16k: dec + jnz == loop, makes no sense
<doug16k>
it's just super flexible so you just use the existing test inc dec add sub whatever followed by branch to encode one fused op
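As a sketch of the fused pair described above: a plain C count-down loop, which compilers typically lower to exactly that dec/jnz (or sub/jne) shape. The function name is invented for illustration.

```c
#include <stdint.h>

/* Count-down loop in the dec+jnz shape: the `--n` and the loop-back
 * branch typically compile to a dec/jnz (or sub/jne) pair, which
 * modern x86 cores fuse into a single macro-op.
 * Assumes n >= 1. Sums the integers 1..n. */
static uint64_t sum_to(uint64_t n) {
    uint64_t acc = 0;
    do {
        acc += n;
    } while (--n);          /* dec + jnz: one fused op */
    return acc;
}
```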
<mrvn>
it's not like a delay loop would work with all the MHz changes the cpu went through
<moon-child>
what I heard is that loop is bad because it has to deal with being interrupted
<moon-child>
you decrement rcx, then potentially jump somewhere; if that somewhere isn't mapped, you #pf, but you also need to restore the previous value of rcx
<doug16k>
you are right, by now it doesn't even slow it down that bad if it was serializing per iteration. it gradually became almost irrelevant how fast it is
<mrvn>
like an IRQ between dec and jnz?
<moon-child>
mrvn: no, you can be atomic wrt that
<mrvn>
ahh, #pf. yeah, that might be tricky.
<mrvn>
With dec + jnz the #pf would have an IP pointing at the jump.
<moon-child>
yep
<doug16k>
interrupts are serializing, so it will wait until everything is retired, then start the interrupt
<mrvn>
It's just so annoyingly complex to use dec+jnz. Adds 6 opcodes to the 6 the loop itself is.
<mrvn>
or 4 actually, dec only changes OF, not CF
<moon-child>
yeah, no problem with interrupts. Just exceptions from the loop insn itself
<mrvn>
it's stupid that x86 always changes the flags. Other archs have a bit for that.
<moon-child>
agreed
<doug16k>
because x86 instructions are small
<doug16k>
other arch have wasted bits they want to use
<mrvn>
although dec+jnz needs to change the flags. :(
<moon-child>
well. Ideally you could ask dec to only set the zero flag
<moon-child>
doug16k: what if it were a prefix you could apply, optionally, only where you could profit?
<doug16k>
you can use lea to inc dec
<doug16k>
no flag change at all
<doug16k>
then what :D
<moon-child>
then it wouldn't waste bits
<mrvn>
moon-child: dec is so nice and keeps the CF flag. But it overwrites OF.
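That flag behaviour can be checked directly with a scrap of GCC/Clang extended asm (x86-64 only; a hedged sketch, function name invented): set CF, execute dec, read CF back.

```c
#include <stdint.h>

/* Hedged x86-64-only sketch (GCC/Clang extended asm): set CF with stc,
 * run dec, and capture CF with setc. dec writes OF/ZF/SF/AF/PF but
 * leaves CF alone, so this returns 1. */
static int cf_survives_dec(void) {
    uint64_t x = 5;
    uint8_t cf;
    __asm__ volatile(
        "stc\n\t"            /* CF := 1 */
        "dec %[x]\n\t"       /* must not touch CF */
        "setc %[cf]"         /* cf := CF */
        : [x] "+r"(x), [cf] "=q"(cf)
        :
        : "cc");
    return (int)cf;
}
```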
<doug16k>
no flag inc rax -> lea 1(%rax),%rax
<moon-child>
yes but we want only-zero-flag inc rax
<doug16k>
I know - then what. how do you branch or not. you could jmp *%xxx and cmov to that
<mrvn>
doug16k: I need something that jcc can use that leaves CF/OF alone.
<doug16k>
then you need to wreck flags lol
<mrvn>
hence the wish to set the Zero flag
<doug16k>
I think you should try setc then cmp that reg at the top to get it back
<doug16k>
it should overlap your loop inc jcc for free
<mrvn>
doug16k: I need OF, dec doesn't change CF
<doug16k>
oh
<mrvn>
Or maybe the whole idea of using adcx/adox for additions is crazy and it's only useful for multiplications.
<doug16k>
if you did seto then neg that, then at top, cmp that with something that makes it OF or not?
<moon-child>
mrvn: I think the original impetus was for crypto. Not gonna have branches in crypto code anyway (not constant time), and you know the sizes, so you can just fully unroll
<doug16k>
ah crap - trashes CF. I see the problem
<mrvn>
moon-child: you aren't going to unroll a 4096 bit key.
<doug16k>
can't use lahf sahf ?
<moon-child>
use neg to set carry?
<mrvn>
"It is valid in 64-bit mode only if CPUID.80000001H:ECX.LAHF-SAHF[bit 0] = 1." Not sure.
<mrvn>
moon-child: carry is unaffected by "dec"
<doug16k>
moon-child, the idea was to make "1" from "seto" to become all 1's so it is negative, then you can cmp that with something that makes OF be set
<doug16k>
I didn't sit down and figure out the polarity, but it's useless because it would ruin CF we are trying to carry around the loop
<mrvn>
doug16k: adox -1, reg should do
<mrvn>
lahf works here
<doug16k>
3950x supports it
<doug16k>
yours is intel?
<mrvn>
that would make it: sahf; N times 6 opcodes; lahf; 4x lea; dec; jnz
<mrvn>
amd
<mrvn>
and whatever compiler explorer has
<mrvn>
I should start measuring. It might all be moot because the limiting factor is ram.
<moon-child>
most people have no use for numbers so large they don't fit in cache :P
<doug16k>
you should unroll it enough that the iteration count is low enough for the predictor to learn the pattern
<doug16k>
for the final branch
<doug16k>
if you can
<doug16k>
if you can get it to speculate right into the return, can't beat that
<doug16k>
then it won't speculate into excessive iterations and not realize until the last dependency chain completely retires
<moon-child>
isn't 'learning the pattern' mainly a function of whether your bigints actually have predictable sizes, which is an application concern?
<doug16k>
if a loop branch is taken a predictable number of times, and it's not too many, it can correctly predict the final not taken and speculate correctly and return and start speculating correctly there, instead of it speculating into the weeds and not realizing until the very last iteration retires
<doug16k>
then flushing pipeline and starting all over
<doug16k>
it can learn taken taken taken taken ... not taken and be perfect
<mrvn>
doug16k: the problem is that this will often be called from "mul", which recursively divides. So you get size 2, 4, 8, 16, 32, ...
<doug16k>
if you are lucky, the branch history that leads to that can make it have another separate history remembered for that size
<mrvn>
Sizes 2, 4, 8 should probably be special and fully unrolled. And then 16, 32, ... as loops
<doug16k>
there can be one branch that mispredicts, but it's right from then on, because that is a different branch history, and the following stuff is using different history values
<doug16k>
and it's right from then on for a while
<doug16k>
the sequence of taken/not taken that have recently occurred cause it to select a different set of history memories
<doug16k>
for example, if you had if (debug_enabled) print("stuff"); repeatedly, then when that is taken or not, it uses different history memory for the following branches, and learns the if is always false or if is always true pattern
<mrvn>
One hope with the tiny loop is that if the loop runs 1024 times the one mispredict at the end is irrelevant.
<doug16k>
and all the following ifs that go the same way predict correctly
<mrvn>
If you unroll then it's taken less often and might be worse in predicting
<doug16k>
I mean, even if you flicked debug_enabled on and off, the first if's mispredict would cause it to select the other branch histories and predict the rest correctly
<doug16k>
the mispredict at the end isn't irrelevant if it would have speculated into a load in the caller, and got started on it sooner
<doug16k>
way sooner
<doug16k>
if you keep it speculating well, it doesn't matter what the instructions do very much, it will decompose everything into a dataflow graph and start everything asap
<doug16k>
I would think of it as the pipeline having excessive integer execution pipelines, and one thread can't even saturate them with realistic instructions
<doug16k>
I picture the carry dependency chain to be the determining factor, and everything else goes through for free in other execution units
<doug16k>
and they have no effect
<mrvn>
doug16k: now rethink that again with all cores running the same loop
<doug16k>
it must be an epic amount of adding to have two dependency chains back to back on a modern amd
<doug16k>
why not avx2?
<mrvn>
doug16k: because it has no addx. Getting the carry is complex.
<doug16k>
yeah but you can probably get so many carries at once that it isn't bad
<mrvn>
only 4
<doug16k>
ok, now do 2 dependency chains of that like you did with adc/adx
<doug16k>
that would be insane
<doug16k>
carry is just compare less than, then subtract that from destination and it will subtract -1 or 0, adding one or not
<doug16k>
unsigned
<mrvn>
gcc/clang are a bit stupid there. They compare and then mask it with 1 so they can later add it.
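doug16k's flagless recipe above — recover the carry with an unsigned compare — can be written in portable C. This limb-by-limb adder is a sketch (names invented); it is the scalar equivalent of what an AVX version would do per lane.

```c
#include <stddef.h>
#include <stdint.h>

/* Little-endian multiword add r = a + b over n 64-bit limbs, returning
 * the final carry-out. No flags needed: an unsigned add overflowed
 * iff the result is smaller than one of its inputs. */
static unsigned bigint_add(uint64_t *r, const uint64_t *a,
                           const uint64_t *b, size_t n) {
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;           /* wraps only when carry==1 */
        unsigned c1 = (s < (uint64_t)carry);
        uint64_t t = s + b[i];
        unsigned c2 = (t < b[i]);
        r[i] = t;
        carry = c1 | c2;                     /* at most one can be set */
    }
    return carry;
}
```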
<moon-child>
can probably do as many as 4 at once
<mrvn>
moon-child: but will it be faster than doing them in sequence?
<moon-child>
try it and see
<doug16k>
I think you will get the adc and adx going through in the same cycle and everything else is nothing
<doug16k>
assuming cache hits
<mrvn>
mixing adc + avx is another idea.
<moon-child>
I think adc will steal execution ports from avx
<doug16k>
why?
<moon-child>
so if avx is faster, there'd be no point
<doug16k>
I don't think it using the avx opcode space means it uses fpu pipelines
<doug16k>
if it did then there would be bypass delays that aren't mentioned in bmi
<mrvn>
but if I only need to add 5 numbers I can do avx for 4 and adc for the last.
<moon-child>
doug16k: yeah but it's int ops
<moon-child>
don't scalar and vector int ops use the same ports?
<doug16k>
amd has completely decoupled integer and float
<moon-child>
there's no float here though
<doug16k>
avx would be
<doug16k>
I think we agree anyway
<mrvn>
is avx with integer a float or int operation?
<moon-child>
if you use avx to do bigint addition, then you're doing int ops
<doug16k>
adx is a 3-operand instruction isn't it?
<moon-child>
'I think we agree anyway' maybe :P
<doug16k>
it's floating point pipelines if integer avx yeah
<doug16k>
they obviously don't have 256 bit alus in the integer ones
<doug16k>
and no gigantic registers
<moon-child>
oh hmmm
<moon-child>
mrvn: what if you just repeat the adox?
<doug16k>
if one thread was avx and other was integer, it would be glorious
<moon-child>
wait no that doesn't make sense
<mrvn>
moon-child: then I can just adc, it's a shorter opcode
<moon-child>
no I meant from the previous iteration
<moon-child>
but that doesn't work
<doug16k>
avx can do a hell of a lot fewer loads/stores
<doug16k>
that alone is huge
<mrvn>
doug16k: fewer opcodes, same volume.
<mrvn>
if you are waiting for memory it doesn't matter
<doug16k>
you can fit way more bandwidth into the same amount of reorder buffer slots
<doug16k>
speculate further
<moon-child>
mrvn: a simd load and a scalar load have the same throughput
<doug16k>
one load can be a byte or 256 bits. which one is faster
<moon-child>
in terms of # loads/cycle
<moon-child>
but the former does a lot more work
<moon-child>
(ditto store)
<mrvn>
both take 200 cycles to fetch a cache line from memory
<moon-child>
if you're waiting for memory than nothing else matters anyway
<mrvn>
that's what I said.
<moon-child>
but you want to optimise
<moon-child>
optimisations only matter when you don't hit memory. So focus on that case
<doug16k>
you said it was AMD so that means it is going to hit the cache all the time, unless you have more bigints than the huge L3
<doug16k>
you have gigantic caches
<mrvn>
That totally depends on the size of the Bignums you have. If the numbers have a million bits then the cache becomes a bit limited.
<moon-child>
then you're bandwidth limited
<mrvn>
If you do 4 AVX streams in parallel that's 32MBit or 4MB of data for a million bit numbers.
<moon-child>
memory is p fast
<doug16k>
one CCX L3 is 16MB
<mrvn>
6MB for a + b instead of a += b
<gamozo>
what? memory is so slow!
<doug16k>
guessing which gen though
<heat>
i know some of these words
<heat>
computer go brrr
<moon-child>
gamozo: bandwidth
<gamozo>
that's fair!
<doug16k>
if you know for a fact that it won't fit in the cache, then you should be using non-temporal loads
<doug16k>
and stores
<moon-child>
^ that too
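For the "know it won't fit in cache" case just mentioned, a hedged x86-64 sketch (SSE2, which is baseline there; names invented): stream the destination past the cache and fence before anyone reads it.

```c
#include <emmintrin.h>   /* SSE2: baseline on x86-64 */
#include <stddef.h>
#include <stdint.h>

/* Copy `bytes` (a multiple of 16) with non-temporal stores so the
 * destination does not evict useful cache lines. Only worth it when
 * the data will NOT be read again soon. dst must be 16-byte aligned. */
static void nt_copy(void *dst, const void *src, size_t bytes) {
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
    _mm_sfence();        /* order NT stores before any later reads */
}
```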
<gamozo>
memory bandwidth got so much better
<gamozo>
tbh, non-temporal is kinda spotty? I've yet to find many good situations for it, even with streaming writes
<gamozo>
I don't understand computers
<zid>
NT's just very likely to make things worse unless it DEFINITELY makes them better
<zid>
because of prediction and caches and stuff
<zid>
it's just hard to use in real programs
<moon-child>
a colleague recently worked out how to use nt stores in matrix multiplication
<gamozo>
the main issue is that for _most_ compute you can batch results and keep them in cache rather than going to non-temporal memory
<zid>
Like, when was the last time you did a prefetchw
<heat>
is it defined if you mix nt and non-nt accesses?
<moon-child>
haven't implemented it yet, but I made a model of it. Didn't seem to help. But there are second order effects
<doug16k>
gamozo, yeah, it has to be a perfect use case for it to win. everyone uses the data too soon after and that makes it look awful
<zid>
and NT is arguably harder
<moon-child>
heat: I think you can get either the stored value or the previous value
<heat>
zid, most prefetchws are wrong anyway
<zid>
yup
<moon-child>
and if you do a write, not specified which one gets written
<moon-child>
until the next sfence
<doug16k>
gamozo, if you know that your multiword adc chain is over 16MB though
<gamozo>
:gasp:
<gamozo>
That fits in l3!
<zid>
Not in my l3 :(
<gamozo>
:(
<zid>
I have 10M or 12M available
<zid>
unless I figure out how to do dual socket with a pair of 1xxx xeons
<zid>
of different skus
<doug16k>
my gen is 16MB per 4 core CCX, so 64MB total
<gamozo>
I just got new procs with 48 KiB of l1 and it's HOT
<zid>
did you get the one with 768MB of L3 yet
<zid>
Imagine paying £7000 for a 3.8GHz turbo cpu
<doug16k>
what cpu is that? some epyc variant?
<zid>
7373x
<gamozo>
I only have epycs in my storage server, I need avx-512 :(
<zid>
no you don't
<gamozo>
YES I DO!
<doug16k>
gamozo, zen4 will have it
<doug16k>
soon
<zid>
unless you happen to be doing *exactly* an avx-512 on that cpu, all day, you don't :P
<zid>
load*
<gamozo>
zen4 won't have it right?
<gamozo>
they'll only have the 16-bit float stuff, but not even AVX-512F
<gamozo>
it's gonna be a scuffed implementation I bet
<gamozo>
(at least, that's how I read it)
<gamozo>
They've been really slippery on answering questions
<doug16k>
yeah I am just going by rumor bs
<zid>
avx-512 is a whole family of shit
<zid>
so god knows what you'd get even if they did add it
<gamozo>
yeah
<doug16k>
I am half expecting it to be 2 256-bit ops, but the mask regs would help, if it had avx512f
<gamozo>
I mainly want avx-512f and avx-512bw
<gamozo>
the mask regs are largely what I want, but I really wouold like all 512 bits
<mrvn>
If your Bignum is 16MB then the adc chain will read 32MB and write 16MB.
<doug16k>
it's a 3-operand add?
<mrvn>
doug16k: frequently.
<doug16k>
write allocate could make one load free
<mrvn>
If you multiply then some of the sub-terms you need multiple times. So you need a non-destructive add
<mrvn>
But with a+b = b+a you can probably shuffle stuff around to use the 2-operand add a lot.
<doug16k>
I think CPUs are unnecessarily fast already
<doug16k>
to the extreme
<zid>
play Dwarf Fortress and say that
* mrvn
throws a 6502 at doug16k
<doug16k>
try portal on 2K@165Hz. it's so smooth and perfect, it's distracting
<doug16k>
every time I do a 180 I am like "whoa that was sooo smooth... geez"
<Jari-->
hi all
<klys>
hi jari--
<Jari-->
klys: so hows OS business
<Jari-->
All kernels seem to have this file system; even drivers access the root file system with open/close/read/write/lseek.
<Jari-->
I am still manually poking with read/write block, getsize, etc.
<Jari-->
vfs
<Jari-->
I sometimes wonder what parts of the kernel should be using the internal LIBC and what parts should have direct access.
<Jari-->
Drivers for example would probably be better with using internal device API.
<mrvn>
Jari--: Join us in the microkernel world. None of them should have direct access.
<heat>
D:
<mrvn>
Jari--: If you are talking about firmware loading then maybe rethink the approach. Supply the firmware blob from userspace like Linux does. If you are talking about FSes then they kind of need block read/write but that usually should go through the block cache and have some protection against writing outside the partition the FS is on.
<mrvn>
or the FS is on raid or lvm and needs to access a virtual device.
<heat>
no it's defo not firmware loading
<heat>
you're overthinking this :P
<mrvn>
heat: what other than firmware loading would access files?
<heat>
who said anything about accessing files
<mrvn>
open close read write lseek.
<heat>
he's talking about the vfs
<heat>
also seems confused
<heat>
very unclear question
<Jari-->
I want to run my file system driver on Linux text console, thats why I am thinking of adding some features to my VFS.
<heat>
what's the linux text console, to you
<heat>
what features are you lacking
<Jari-->
a terminal
<heat>
the terminal is just a pipe of text
<Jari-->
lots of non-POSIX dependencies
<heat>
user process reads, user process writes
<heat>
that's how the terminal works
<heat>
the kernel just displays it
<Jari-->
heat: console instead of virtual machine
<heat>
oh so you want to run a driver as a userspace program?
<Jari-->
heat: yes
<heat>
ok, that's doable
<heat>
wrap your internal API into libc functions
<Jari-->
heat: my kernel is MS-DOS like, more than a microkernel
<Jari-->
although it is linear memory space, non-segmented
<heat>
how is it MS-DOS like?
<Jari-->
heat: well I wrote API to be MS-DOS compliant
<Jari-->
MS-DOS and C applications
<heat>
you might be screwed
<Jari-->
DJGPP really
<mrvn>
You can port your kernel to posix as "hardware", using signals, mmap, mprotect, setitimer, ... to emulate all the hardware stuff. But it's a major undertaking. Or add a qemu-user-your-kernel backend to qemu.
<mrvn>
having drivers access the hardware directly will make it basically impossible to do any of it though. You want to go through the API.
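mrvn's "POSIX as hardware" idea in miniature — a hedged Linux/POSIX sketch (names invented, page size assumed 4096): the host's SIGSEGV plays the part of the MMU's page fault.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static void *page;

/* The host SIGSEGV handler stands in for the kernel's page-fault path:
 * if the fault is on our page, grant access and let the CPU retry. */
static void on_segv(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    char *p = (char *)si->si_addr;
    if (p < (char *)page || p >= (char *)page + 4096)
        _exit(1);                                /* not ours: bail out */
    mprotect(page, 4096, PROT_READ | PROT_WRITE);
}

/* Map a page with no permissions and touch it. The first access traps
 * into on_segv, which "maps it in"; the retried access then succeeds. */
static int touch_protected_page(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
    page = mmap(NULL, 4096, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    *(volatile int *)page = 42;                  /* faults exactly once */
    return *(volatile int *)page;
}
```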
<Jari-->
Sorry guys, I get migraine attacks so my talking is probably not the most consistent ever right now.
<mrvn>
coding with a migraine is a bad idea. makes it worse and produces crap. better sleep it off.
<Jari-->
mrvn: I keep rewriting the same functionalities, so it is sort of spaghetti code at worst.
<Jari-->
Especially writing interpreters is difficult.
<Jari-->
mrvn: I want my OS able to run Commodore Basic token binary programs.
<Jari-->
Basically what I am now writing on kernel is it to be Linux like as much as possible.
<Jari-->
UN*X OS does not have to be enormous to function, like 386BSD kernel f.e.
<heat>
386BSD was already pretty complex
<heat>
same with all the previous BSDs
<Jari-->
heat: if I drink coffee, my migraine vaporizes instantly
<Jari-->
must be lack of caffeine
<dostoyevsky2>
isn't linux just like 250 syscalls?
<heat>
400 and something but yeah
<Mutabah>
plus ioctl/etc
<heat>
plus ioctl, plus pseudo fses, setsockopts, etc
<heat>
and probably more that I can't think of right now :P
<heat>
glorified eBPF interpreter? :P
<mrvn>
L4 has 6 syscalls
<mrvn>
just for comparison :)
<dostoyevsky2>
if you have a C program that implements a couple of syscalls, how difficult is it to get that C program boot up in qemu? Do you need to write your own boot code in asm, or could you just reuse something?
<mrvn>
use the multiboot format and you can use qemu --kernel mykernel.elf
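mrvn's suggestion, sketched (Multiboot 1; the `.multiboot` section name is a common convention, not mandated by the spec): three 32-bit words — magic, flags, and a checksum making them sum to zero — within the first 8 KiB of the image let `qemu --kernel mykernel.elf` load it.

```c
#include <stdint.h>

/* Minimal Multiboot 1 header. The loader scans the first 8192 bytes of
 * the image for the magic and verifies magic+flags+checksum == 0 (mod
 * 2^32). A linker script must place the ".multiboot" section early. */
#define MB1_MAGIC 0x1BADB002u
#define MB1_FLAGS 0x00000000u

const uint32_t multiboot_header[3]
    __attribute__((section(".multiboot"), aligned(4), used)) = {
    MB1_MAGIC,
    MB1_FLAGS,
    (uint32_t)0u - MB1_MAGIC - MB1_FLAGS,   /* checksum */
};
```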
<heat>
dostoyevsky2, there's significant code behind loading a program
<heat>
even more significant if you're doing it properly with the vfs and all that
<dostoyevsky2>
heat: couldn't you just compile a -fPIE/PIC program and thereby be able to simply load that blob into your memory and just jump to the code without any fancy loading?
<mrvn>
with or without -fPIE/PIC makes no difference
<heat>
those still need to be loaded
<zid>
problem imo is cpu modes
<heat>
a PIC program isn't just a blob you can run directly
<mrvn>
And you need to setup the C runtime environment, meaning you need a stack.
<dostoyevsky2>
if you don't have position independent code you'd need to setup proper virtual memory addresses, no?
<mrvn>
dostoyevsky2: -fPIC is not position independent code
<zid>
It's just as easy with or without, if you can specify the load address
<zid>
I do wonder what you intend to provide the syscalls for though if you're not expecting to be loading stuff properly
<mrvn>
and what will the syscalls do without malloc or printf or anything else
<mrvn>
it's odd, 0x400000000 is 16GB. So it didn't just punch a hole where the PCI region is but remapped the ram. But then it should go up to 0x43fffffff. So something is stealing from that.
<mrvn>
heat: you have a huge hole there in the lower 4GB
<ddevault>
I bet I can port doom fairly soon
<ddevault>
without audio, that is
<mrvn>
ddevault: do you have pong? Snake? frogger?
<ddevault>
no, but why walk when you can run
<doug16k>
neat, I didn't know my system had RAM all the way up to 0xdfffffff. I wonder if the hole would be bigger if I booted with CSM
<geist>
also if you're using a discrete GPU it tends to be bigger, since 'stolen' graphics ram tends to be just off the top of what appears to be the end of ram
<geist>
sometimes you can probe past it and actually find the framebuffer
<geist>
er, smaller. okay, to rephrase: integrated graphics tend to steal ram off the top of where ram appears to stop
<geist>
ie, it'll say ram goes up to 0xb000.0000 but actually there's a chunk at b... to 0xc000.0000 that's just not accounted for
<geist>
but it actually ends up being a chunk of ram that is given to the graphics card
<qookie>
do dgpus actually steal any CPU memory?
<geist>
not that i know of, aside from whatever stuff the driver might allocate locally
<geist>
dgpus have their own little address space in their universe, and their own mmu to see their own stuff
<qookie>
ah I misunderstood what you said
<geist>
yah that's cause i wrote it backwards on the first line. igpus are the one that steals cpu memory
<qookie>
yeah
<qookie>
but these days afaik they don't steal much, most memory is mapped in via the GTT
<qookie>
on my system (integrated AMD Vega 6 or 7) only 512M is stolen, but games can use way more
<geist>
yah, though usually something like 64-256MB in my experience
<geist>
yah exactly. enough that it causes your end of lower ram to appear to stop shorter than it should
<geist>
considering where PCI space starts up, etc
<qookie>
yeah
<geist>
i found a lot of this by fiddling with the TOLUD MSRs and whatnot on AMD and the intel equivalent
<geist>
it's how some of that sausage is made basically
<geist>
the registers that control where the cpu stops trying to decode DRAM and starts trying to decide mmio space
<geist>
awww crap. a week after installing the new motherboard: server locked up, exactly the same way
<zid>
Turns out once a week, the cleaner comes past the server, plugs a vacuum cleaner into the same outlet, and vacuums the floor
<geist>
pretty much
<geist>
this mobo is kinda neat though: it has the build in aspeed thing so i can log into a web page and see the console and reset it and whatnot
<geist>
nothing interesting on the event log though
<zid>
oh yea I've seen those controllers
<zid>
It's a bit like ME but vendor specific I guess
<zid>
I happened to have looked at this one last week
<geist>
yah i think they're fairly ubiquitous. has a little OS on it that is running something. yep exactly that one
<zid>
I think the main reason is that if you're adding a 2MB VGA framebuffer from matrox, you might as well get this thing instead
<geist>
if you have a DGPU on board it doesn't show up on PCI, but somehow the bios has some sort of knowledge to enable it's VGA feature if it doesn't see a DGPU
<geist>
100% and you can then be actually headless with it and still get a console
<geist>
though the little web page it serves leaves a bit to be desired, and i would not expose it to anything from a security point of view
<zid>
yea nor ME, as it turns out
<zid>
it's had exploits before in its various stacks, which always ends up making the news
<geist>
so now i'm starting to believe that zen 2s are simply not stable as a long term system. i've now seen a 3900x and 3950x fail the same way. they're running kinda hot but not super hot
<zid>
I wonder how it gets its firmware and stuff, I guess the bios is just adapted to knowing its there and blats it in at device discovery time or something
<geist>
i suppose it could be the PSU or memory though, so i guess i can start popping pairs of ram and see. need to establish what the new MTBF is in the new regime. ran for 8 days before locking up this time
<zid>
You could still blame your PSU if you like
<zid>
or ram yea
<geist>
it just seems so unlikely
<zid>
eh my machine is perfectly stable until you load enough of the cores for long enough
<zid>
which on a "random server setup" is actually very unlikely
<zid>
(the 3.3V rail issue I discussed before)
<geist>
in this case it doesn't seem to be load related, or even related to being a VM host or not. seems to fail equally fast if i am running a bunch of qemu instances or not. usually fails when it's not loaded at all
<zid>
Could be failing to come out of a sleep state then?
<zid>
asks the VRMs for more juice, they try to draw it from the psu, psu ramps it too slow
<geist>
i suppose it's possible it could simply be a linux bug, but that seems highly unlikely
<zid>
and everything is undervolted for a bit
<geist>
maybe?
<zid>
I can't rule it out, at least
<zid>
so a psu is a thing you could certainly try
<geist>
yah, actually plan on moving to another case tomorrow anyway, so i'll switch to another equal but different PSU at the same time
<geist>
have basically the same case for a test machine that has more vents, so going to move it there and install some more fans so it can hopefully run cooler
<geist>
it's a nice cheapo case. corsair 100r case. nice solid cheap case
<geist>
lots of drive bays and can hold a full atx
<zid>
My case is a bit of cheap whatever that I took the window out of to make the cooler fit :P
<zid>
the ssds are hanging down from their psu cables
<geist>
yeaaaaaaah
<heat>
geist, TOLUD is a thing on chipsets for intel
<heat>
you'll see it in your chipset docs and there are a bunch of references to it in i915 docs
<geist>
yeah there's some AMD equivalent
<geist>
like many things i think the AMD one is more straightforward, but its called something vaguely similar
<geist>
ah TOPMEM
<heat>
i've seen those "hidden" devices for my chipset
<heat>
mine has a device that gets hidden after booting (somewhere in the SEC phase IIRC)
<bslsk05>
fuchsia.googlesource.com: zircon/kernel/platform/pc/pcie_quirks.cc - fuchsia - Git at Google
<heat>
completely stops responding to PCI accesses
<heat>
it's fascinating
<geist>
thats te AMD equivalent. basically trying to read where the PCI allocation space starts, for both the 32 and 64bit regions
<heat>
what for?
<geist>
i think the idea at the time is if we have to allocate space for PCI devices we need to compute what aperture space is available to us
<geist>
i dont think it's really used, but we went ahead and wrote the code anyway, in case it was needed
<geist>
but looking at the proper TOLUD/TOPMEM was needed because of the stolen graphics memory thing i was talking about before
<geist>
RAM may appear to stop at some address, but it may actually extend past it in stolen graphics that appears as a unused chunk in the e820 stuff
<geist>
so if you check TOLUD/TOPMEM you can find where the proper end of DRAM is
<heat>
is TOPMEM just the lower part?
<geist>
yah and there's a TOPMEM2 that is the end of the 64bit mapping
<geist>
i dont know how you find that second spot on intel hardware. this is where AMDs is much more straightforward
<geist>
just a pair of MSRs that tell you precisely what you want to know
<heat>
btw I've seen AMD is adding some stuff to the EFI memory map on their new supercomputer platform
<heat>
basically the gpu mem gets put in the memory map as well
<geist>
yah, wonder if that generally starts >4GB above TOPMEM2?
<geist>
that would be fair game, since that's not decoded as ram
<heat>
does TOLUD/TOPMEM include SMRAM?
<geist>
dunno what SMRAM is
<heat>
system management ram
<geist>
dunno
<geist>
does SMRAM even show up in the cpu's address space at all?
<heat>
the fw steals a good chunk of memory to have as smm state and smm code
<heat>
I believe so, you just can't touch it
* geist
nods
<geist>
in that case it's probably contained within TOLUD since the idea is that's where the cpu stops trying to decode these as memory controller addresses
<heat>
fun fact: it grows based on the number of cpus
<geist>
(640k hole nothwithstanding)
<geist>
i remember on AMD at least there's a set of MSRs that control the 640k hole. iirc a bitmap of 64k chunks. you can configure it such that the 640k hole doesn't exist if you want
<heat>
how are those MSRs?
<heat>
those should be chipset details afaik
<geist>
yes, but all modern x86s are chipsets as well
<geist>
this is the SOC side of the world
<heat>
intel exposes everything on the PCI bus
<geist>
AMD has fully embraced this and simply created a bunch of MSRs to configure stuff like this, root pci bus stuff, even the memory controller itself
<heat>
just pci device registers everywhere
<geist>
intel sticks to the old model and exposes it as pci stuff
<geist>
yeah this is where AMD's solution is far more straightforward
<heat>
actually shouldn't you catch any exceptions when reading those?
<heat>
how do you know they're there?
<geist>
when you look at it through that lens an AMD SOC looks a hell of a lot like a standard ARM SOC. a cpu with a bunch of system control registers to set up the world
<geist>
intel at least pretends that there's some chip on the other side of the bus that configures everything
<geist>
the pci bus that is
<heat>
you're just checking if its an AMD cpu. how do you know if that specific platform has it?
<geist>
you are the bios, you probably simply know
<geist>
this is bios level stuff
<geist>
you can read the cpuid and see what cpu it is
<heat>
I mean, this particular fuchsia code
<geist>
checks for vendor AMD
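The vendor check geist describes can be sketched like this (x86 only; the helper names are illustrative, but the CPUID leaf-0 register ordering EBX, EDX, ECX for the vendor string is architectural):

```c
/* Sketch: identify an AMD CPU via the CPUID leaf-0 vendor string. */
#include <cpuid.h>
#include <string.h>

/* Fills out[] with the 12-byte vendor string, e.g. "AuthenticAMD"
 * or "GenuineIntel". Returns 0 on success, -1 if CPUID leaf 0 is
 * unavailable. */
static int cpu_vendor(char out[13]) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return -1;
    memcpy(out + 0, &ebx, 4);   /* vendor bytes come back in */
    memcpy(out + 4, &edx, 4);   /* EBX, EDX, ECX order */
    memcpy(out + 8, &ecx, 4);
    out[12] = '\0';
    return 0;
}

static int is_amd(void) {
    char v[13];
    return cpu_vendor(v) == 0 && strcmp(v, "AuthenticAMD") == 0;
}
```

As discussed below, a vendor check alone doesn't prove a given model actually implements a given MSR, which is why a trap-catching fallback still matters.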
<heat>
and all AMDs have it?
<geist>
all AMDs we care about, but we also have a safe msr routine that catches the trap
<geist>
it's not foolproof for sure. but i also am not sure this code is even called anymore since we moved PCI into user space
<geist>
this particular routine may be vestigial
<geist>
the general concern in my mind is when you boot on an virtual machine that exposes AMD but acts like something else
<geist>
but in general not having a safe msr routine is annoying. i think down in the exception code we have some sort of mechanism for that
<geist>
like if the #GP address is a particular non-inlined msr instruction, set an error code and return
<geist>
annoying but basically necessary at some point
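The mechanism geist sketches (a #GP handler that recognizes known safe-MSR instruction addresses, skips them, and reports an error) might look like this. The struct and function names are illustrative, not from any particular kernel; the addresses in the table would in practice be filled in by the linker from labels around each non-inlined rdmsr.

```c
/* Sketch: exception-fixup table for a "safe rdmsr" routine. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct msr_fixup {
    uintptr_t fault_ip;   /* address of the rdmsr instruction */
    uintptr_t resume_ip;  /* where to continue after the fault */
};

/* One entry per non-inlined safe-MSR access in the kernel.
 * Hypothetical addresses, for illustration only. */
static const struct msr_fixup fixups[] = {
    { 0x1000, 0x1002 },
};

/* Called from the #GP handler: if the faulting IP matches a known
 * safe-MSR instruction, patch the saved IP past it and flag an error
 * for the caller; otherwise fall through to normal #GP handling. */
static bool gp_try_msr_fixup(uintptr_t *saved_ip, bool *msr_error) {
    for (size_t i = 0; i < sizeof(fixups) / sizeof(fixups[0]); i++) {
        if (fixups[i].fault_ip == *saved_ip) {
            *saved_ip = fixups[i].resume_ip;
            *msr_error = true;
            return true;
        }
    }
    return false;
}
```

Linux's `rdmsr_safe` uses the same general idea via its `_ASM_EXTABLE` exception tables.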
eroux has quit [Ping timeout: 244 seconds]
eroux has joined #osdev
<qookie>
geist: instead of poking at platform regs, i think you can just ask acpi about the root bus resources if you need to allocate bar space for devices etc
<geist>
yah that's the actual correct way, i think. problem is of course that involves the complex, bytecode-parsing parts of ACPI
<geist>
which we dont want in the kernel
<geist>
but now that we moved pci driver into user space it's possible to get that stuff
<qookie>
and besides without driver support for having bars move somewhere else, you're at the mercy of the fw for how much space it assigned to the bridge if you're allocating to a device behind one
<qookie>
linux just bails if it's not enough
gorgonical has joined #osdev
<gorgonical>
Porting this OS is a lot of work
<gorgonical>
When do I get my certificate of genuine hackerman from geist
<geist>
whatcha porting?
<gorgonical>
A linux-ish kernel used in hpc stuff
<gorgonical>
The ARM64 port that I half-did is also wildly incomplete in some areas. E.g. I'm 80% sure processes cannot receive signals on the arm64 port lol
<geist>
ah
<gorgonical>
We're doing some in-house risc-v design and having a kernel we can modify simply would be nice
<gorgonical>
Linux works on a lot of the boards but good luck modifying any major subsystem
<geist>
yah makes sense
<gorgonical>
It's obvious to any of you I'm sure but there's just a lot of small details to resolve -- where does the TLS pointer go? How is the kernel stack arranged? What does context save/restore look like? What about traps/exceptions? All the hand-asm for things like atomics, etc. And they all vary by architecture
<qookie>
speaking of TLS, is there any concrete docs on how it's supposed to work on aarch64? i only found a fuchsia.dev page about it and got the rest of the info i needed from looking at musl, guessing, and looking at our existing code for x86
<qookie>
(the last part to figure out how TLS works in general :P)
<gorgonical>
my understanding is that the TLS ptr is stored in tpidr_el0
<gorgonical>
I don't know of any concrete docs. Is it just agreed-on convention between libc and the kernel? "Let's use tpidr_el0 and both of us agree not to clobber it?"
<geist>
yah that's to me the fun part. figuring out all the arch specific details
<geist>
seeing how one thing maps to another thing
<qookie>
yeah that much i figured, i mean the docs on how userspace expects it to be laid out
<geist>
qookie: pretty sure it's well documented in the arm docs github
<geist>
thats where the official ELF specs and whatnot exist
<qookie>
i haven't found anything there about that in the elf abi supplement
<gorgonical>
geist: it's definitely very exciting and I love the feeling of programming the machine itself. But each file I have to adapt and don't get to test makes me more afraid to run it
<qookie>
for example, userspace expects that tls blocks start at TPIDR_EL0+0x10
<geist>
i think it may be a similar one
<geist>
ah yes. that may be where the ELF spec stops and the OS specific spec begins
<geist>
or at least libc specific spec begins
<qookie>
and the linker will sometimes hardcode an offset based on that assumption, for example in the local-exec model
<geist>
yep
<gorgonical>
and there's just a certain amount of work until the kernel will build at all, much less run
<qookie>
but i have not found an official document from arm (nor GNU or anyone who makes toolchains) about that layout requirement
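The layout qookie is describing is the AArch64 "variant I" TLS scheme: TPIDR_EL0 points at a 16-byte thread control block, and the executable's TLS block is assumed to start at tp + 0x10, which is exactly the offset the linker bakes into local-exec accesses. A minimal sketch of that arithmetic (the TCB field names here are illustrative; only the 16-byte reservation is the actual contract):

```c
/* Sketch: AArch64 TLS variant I layout assumed by local-exec code. */
#include <stddef.h>
#include <stdint.h>

struct aarch64_tcb {
    void *dtv;        /* dynamic thread vector (impl-defined slot) */
    void *reserved;   /* second pointer-sized slot */
    /* the executable's .tdata/.tbss begins here, at tp + 0x10 */
    unsigned char tls[];
};

/* A local-exec access to a TLS variable at offset `off` compiles to
 * roughly: mrs x0, tpidr_el0 ; add x0, x0, #0x10 + off */
static uintptr_t tls_var_addr(const struct aarch64_tcb *tp, size_t off) {
    return (uintptr_t)tp + 0x10 + off;
}
```

So if the kernel or libc places the thread pointer anywhere other than 16 bytes before the initial TLS block, statically linked local-exec accesses read the wrong memory.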
gorgonical has quit [Quit: Client closed]
gorgonical has joined #osdev
<gorgonical>
Very rude network
<geist>
but yes, FWIW TPIDR_EL0 is the user space TLS root
<geist>
there's a TPIDRRO_EL0 which as far as i know has no real use anywhere. not even sure linux uses it for anything
<geist>
may do something like put the cpu # in it or whatnot
<qookie>
linux uses it to stash a register in the interrupt handler in some code path
<geist>
TPIDR_EL1 is of course up to the kernel to use, but generally it holds a pointer to either the current cpu structure or the current thread structure
<qookie>
but yeah, we have tls more or less working in our libc, but i'm just annoyed i couldn't find any documentation about it
<geist>
also note that x18 is free for the platform to use, in both kernel and user space. the abi says either it's temporary or platform use
<j`ey>
qookie: its in the stack overflow path
<gorgonical>
klange was saying you have to pass a compiler flag to make sure the compiler doesn't use it, right?
<gorgonical>
about x18
<geist>
-ffixed-x18 yes. otherwise it's up to the triple to default it to whatever use it has
<gorgonical>
Ah which may be just a gpr
<geist>
ya if the platform doesn't use it then it's another temporary
<geist>
since x16, x17 are otherwise just interprocedural temporaries
<geist>
x18 is too if it has no other use. it's basically the highest temporary, since x19 has some use
<geist>
it's the first of the callee saved ones
<heat>
I also want a certificate of genuine hackerman from geist
<heat>
what's the final test
_xor has quit [Ping timeout: 246 seconds]
<kingoffrance>
run on vax :D
<heat>
"get a vax" sounds pay2win to me
<kazinsal>
relax, send a fax to a vax; certified hax
<geist>
noice
<mjg_>
vaxcine
<mjg_>
get it
<heat>
hahahahahaha
<heat>
omg
<heat>
so funny
<mjg_>
ikr
<heat>
😂😂😂😂😂😂
<heat>
i should use emojis more
<heat>
see if your shitty systems break
<mjg_>
i'm on ubuntu so you are not far off
<heat>
🚫Ubuntu, 👍👌Arch linux, which I do use
<j`ey>
btw
<psykose>
btw
<Ermine>
btw
<klange>
< geist> there's a TPIDRRO_EL0 which as far as i know has no real use anywhere. not even sure linux uses it for anything ← it's the thread pointer on macOS, contrary to everyone else :)