klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
<moon-child> they do fuse things, but riscv wants much bigger fusion windows
* geist nods
<geist> microblaze deals with the delay slot in a silly way: two jmp instructions, one with and one without
<geist> pick which one. the one without a delay slot takes an extra cycle. have a nice day.
skipwich has quit [Remote host closed the connection]
skipwich has joined #osdev
thinkpol has quit [Remote host closed the connection]
<moon-child> I think it was even proposed that jump over a mov should get fused and turned into a cmov
<moon-child> or something like that
<heat> isn't fuse everything the main prerequisite for fast risc?
<heat> there's no way a load-do_op-store arch could be faster than add $10, (mem) arch otherwise
thinkpol has joined #osdev
<heat> also the weird hoops you need to jump through to load immediates into addresses
<moon-child> isn't there?
<geist> yes but it's based on the idea that the add $10, (mem) is really just a load-do_op-store inside anyway
<geist> the fusing thing is more of a pragmatic approach to not quite internally riscifying *everything* that modern designs do
<moon-child> I mean, bottlenecks are branches and cache. If you can stuff in better branch predictors and more cache in exchange for simpler design, you win
<geist> yah and i *do* think the 16bit instruction stuff in riscv is a real win
<moon-child> I don't think this has paid off, but that's the idea
<geist> it seems to be well utilized by the compiler and keeps the instruction size near x86 size
<heat> do you need to opt in?
<moon-child> 16bit instructions are neat, but intel/amd demonstrate that you _can_ do full-on variable width with ok performance
<moon-child> even if it is a pita
<heat> i don't think i've seen those instructions
<geist> you probably have, it's just transparent. disassemble something you compiled and you may have seen it
<heat> the 'c' extension right?
<geist> the assembler is even allowed to substitute .c versions
<geist> moon-child: sure but the 16 bit stuff is a good compromise. it gets pretty close to the same density and is still much easier to decode
<moon-child> yes
<geist> i dont think an x86 machine has yet gotten to the same level of parallel decode as risc machines have
<geist> POWER, Apple M1, etc are 8+ at this point
<geist> and that's a full cache line at a time
<geist> i think the new tiger lake or zen3 is 5 or maybe 6 now?
<moon-child> well, they do have uop cache. Though I think agner said refilling uop cache was a bottleneck on some core
<geist> yah
<geist> the earlier zens i know were i cache bottlenecked i think. could only pull in 16 bytes at a time, so you could get a 5 way decode but only for very simple instructions
<heat> geist, are the 16-bit instructions suffixed or something?
<geist> heat: yep. first bits of the instruction (low bits) basically tell what the instruction size is
<geist> low 2 bits 00, 01, 10 are 16 bit; 11 is 32 bit (unless bits [4:2] are all 1s); 011111 is 48 bit, 0111111 is 64 bit and so on
<geist> i dont think there are any 48 and 64 bit instructions yet defined
<moon-child> huh I didn't know they had left space for bigger instructions
<geist> yah it *does* mean you burn 3 bits for a 32bit instruction, as a compromise
<geist> and it's clever that it burns more bits in larger instructions
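The length rule geist describes can be sketched in a few lines of C, assuming the encoding from the base spec (the helper name `insn_length` is made up for illustration):

```c
#include <stdint.h>

/* Sketch of the RISC-V instruction-length rule: the low bits of the
 * first 16-bit parcel tell you how long the instruction is. */
static int insn_length(uint16_t parcel)
{
    if ((parcel & 0x3) != 0x3)
        return 2;               /* low bits 00/01/10: compressed, 16-bit */
    if ((parcel & 0x1c) != 0x1c)
        return 4;               /* bits [4:2] != 111: 32-bit */
    if ((parcel & 0x3f) == 0x1f)
        return 6;               /* low bits 011111: 48-bit */
    if ((parcel & 0x7f) == 0x3f)
        return 8;               /* low bits 0111111: 64-bit */
    return -1;                  /* longer/reserved encodings */
}
```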
<heat> llvm-objdump -d boot/vmonyx | grep '\.c.*' <-- should work right?
<geist> but having this mixed in the stream is far more useful than the arm32/thumb2 switch-on-branch scheme
<heat> cuz i don't see any compressed instructions in my binary
<geist> can you pastebin it?
<geist> (i'm sure you're right, just curious)
<heat> the whole binary? it's 1MB xD
<geist> hah okay.
<moon-child> hmm, what if you made an out-of-band rle scheme for instruction lengths?
<geist> well, how are you compiling it?
<heat> geist, see "Onyx boot image (riscv64) (llvm)" https://github.com/heatd/Onyx/actions/runs/2204888603
<bslsk05> ​github.com: ide: Fix bug in DMA code · heatd/Onyx@1a9e494 · GitHub
<moon-child> '5 32-bit instructions; 2 16-bit instructions; 1 64-bit instruction; ...'
<heat> march=rv64imac
<moon-child> bet that would save some space
<moon-child> jump would have to take two addresses, though, or have some out-of-band translation
<heat> and yes that's a zip of a .tar.zst of a directory
<heat> that's the github actions life
<heat> everything is a zip
<geist> oh you're not specifying anything at all on the command line
<geist> so it's whatever default the compiler is
<heat> hm?
<geist> well, i didn't see it looking at a gcc compile in your risc thing
<geist> didn't see a -march or whatnot
<bslsk05> ​github.com: Onyx/make.config at master · heatd/Onyx · GitHub
<geist> already closed the window though (see previous discussion about me always too quickly closing windows)
<geist> then it should. see the 'c' extension
<geist> but i looked at your build and didn't see any of those in the compile line
<geist> maybe the build system isn't working right?
<geist> or it was building user space differently?
<heat> it's building userspace differently
<geist> well, i dunno. you should figure it out
<heat> the kernel stuff has [CC] or [CXX]
<geist> also if you're using clang it may not support 'c' extension
<heat> you were probably seeing musl's compiles
<geist> i think the RV support in clang is pretty far behind
<geist> okay. anyway, it's your problem to debug :)
<geist> it *should* be basically totally transparent
<heat> lets see gcc
<geist> and a nice code size decrease
<heat> well, if that regex is correct then it's not using compressed instructions
<geist> basically except for a few academic or hobby riscv cores everything seems to support the compressed extension and it's well designed such that it shouldn't be any slower
gog has joined #osdev
<geist> perhaps the disassembler doesn't put it there. look at it with a text editor and just look at the size of the opcode
<heat> ah yes I think I'm seeing two byte instructions
<heat> if those are all compressed then yeah, it's using them
<geist> probably just llvm-objdump which i've found to be far behind binutils's objdump in the disassembly department
<heat> 9512 add a0,a0,tp <-- is this compressed?
<geist> yup
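That `9512` halfword can be pulled apart by hand; a sketch assuming the C.ADD layout from the spec (funct4 in bits [15:12], rd/rs1 in [11:7], rs2 in [6:2], op in [1:0]; the accessor names are hypothetical):

```c
#include <stdint.h>

/* Field accessors for a 16-bit compressed instruction in the
 * register-register format: funct4 | rd/rs1 | rs2 | op.
 * 0x9512 decodes as C.ADD a0, tp (funct4=1001, op=10). */
static unsigned c_op(uint16_t insn)     { return insn & 0x3; }
static unsigned c_rs2(uint16_t insn)    { return (insn >> 2) & 0x1f; }
static unsigned c_rd_rs1(uint16_t insn) { return (insn >> 7) & 0x1f; }
static unsigned c_funct4(uint16_t insn) { return (insn >> 12) & 0xf; }
```

Note the full 5-bit register fields here, which is why tp (x4) can appear even though it's outside the 8 "compressed" registers.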
<heat> ok wonderful
<heat> you can't see the c. in the disassembler
<heat> both llvm and gcc are using it
<geist> the disassembler?
<heat> yup
<geist> how are they using the disassembler?
<geist> you mean the assembler?
<heat> llvm-objdump/riscv64-onyx-objdump
<heat> I was looking for compressed instructions in the executable
<geist> yes. and it didn't show it. i know.
<geist> the *assembler* should know when to emit compressed vs non based on some rules
<geist> but then clang has its own built-in assembler, and it should know to
<geist> and it appears to, unless you've explicitly told it not to use its builtin assembler
<heat> yes it's working
<geist> ideally the compiler picks instructions based on whether or not it knows it'll be compressed or not
<heat> i was just looking for the wrong thing
<heat> plenty of 16-bit instructions
<geist> and favors those, etc. it means there are at least two classes of registers, etc
<geist> since the compressed instructions can only access 8 of the 32 registers
<geist> it's the main reason the way the a/s/t instructions are split up in weird ways on riscv, they're packed in such that there's a medley of them in the 8 registers compressed gets, which iirc is r8-r15
<geist> a/s/t registers, not instructions
<heat> ah geez after 7 years my kernel is still smaller than the kernel-resident ACPICA
<heat> in LOC
<heat> we'll see in 2029
<geist> https://en.wikichip.org/wiki/risc-v/registers has a nice table that also shows the compressed numbering
<bslsk05> ​en.wikichip.org: Registers - RISC-V - WikiChip
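The 3-bit compressed register fields map onto x8-x15 as that table shows; a minimal sketch (helper names are made up):

```c
/* Most compressed-register fields are 3 bits wide and map to
 * x8..x15, i.e. s0, s1, a0..a5 in ABI terms. */
static unsigned creg_to_xreg(unsigned creg3)
{
    return creg3 + 8;           /* 3-bit field 0..7 -> x8..x15 */
}

static const char *creg_abi_name(unsigned creg3)
{
    static const char *const names[8] = {
        "s0", "s1", "a0", "a1", "a2", "a3", "a4", "a5"
    };
    return creg3 < 8 ? names[creg3] : "?";
}
```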
<heat> gdb and objdump are insconsistent in the way they show riscv disasm
<geist> also why sometimes you see the compiler really try to stick with a0-a5 when doing calculations and whatnot, when it really could expand out and use t0-t6 and whatnot
<heat> inconsistent*
<geist> depends on which objdump you're talking about
<heat> binutils
<geist> binutils objdump i'd expect to be much more consistent with gdb
<geist> ah
<heat> in this case llvm is more consistent with gdb than binutils
<geist> ah
<geist> might be some switches you can pass it
<heat> heh all my context switching instructions are compressed
<heat> neat
<heat> (the register load/store ones)
<geist> at least a lot of them i assume, the ones that line up with the compressed registers
nyah has quit [Ping timeout: 276 seconds]
<geist> the neat key thing they did in riscv that's nice is every compressed instruction is hard defined as having a 32bit equivalent
<bslsk05> ​gist.github.com: riscool · GitHub
<heat> all of them
<nomagno> How do they manage to compress RISC-V into 2 bytes?
<nomagno> Is that one byte for opcode and another for data, or what?
<heat> no
<geist> heat: oh huh, there must be a 5 bit register form for the load/store instructions
<geist> been a while since i looked at it
<geist> nomagno: no it's pretty tightly packed. usually the first thing that goes is 3 address instructions
<geist> ie, no more `add a, b, c` because you can't fit 3 operands in
<geist> and then the second thing that goes is access to all 32 registers, since that's 5 bits a register (two operands would use up 10 of the 16 bits)
<geist> so they usually can only encode 3 bits, (r8-r15)
<geist> oh wait, scratch the 3 operand stuff
<geist> they can, because they use 3 bits
<geist> anyway it's well specced out. but basically you cant do as much with the compressed ISA, but it always matches 1:1 with an uncompressed instruction
<nomagno> So essentially the assembler figures out how to translate your code into an insane register pressure, 100% PC-relative-indexed version?
<geist> so if you can as a compiler favor a subset of the ISA the assembler is free to substitute a smaller form
<geist> nomagno: assembler can only really replace one instruction at a time. it's not really that complicated
<geist> basically if the compiler generates an instruction that has a compressed version, the assembler uses that instead
<nomagno> geist: you'd be surprised what modern assemblers can optimize away.
<geist> sure, but in this case they do not.
<nomagno> Fair
<geist> it's a simple substitution
<nomagno> Yeah yeah, it's on the assembly writer, not on the assembler
<jimbzy> Yeah, that's the same thing my calculus professor said, geist. "It's simple substitution..."
<geist> heh
<geist> nomagno: yah that's right the compiler is aware of what will and will not go to compressed and chooses accordingly
<heat> this is why riscv is a fake arch
<heat> no 15-byte nops? really?
<geist> heat: aaaah yes i just double checked. it's *stack pointer* based load/stores
<geist> those have a special compressed form that omits the base register but has 5 bits of target register
<heat> unfortunately there's no load/store compressed form for tp
<geist> right, because tp is outside of the 8
<geist> table 16.1 in the spec is great, it is a list of all the compressed forms
<geist> the compressed instructions are pretty irregular, but that's what you get when you compress things to less bits
sonny has quit [Ping timeout: 252 seconds]
<geist> and also yeah. everything is at best 2 registers
<geist> so you could only substitute a 3 register add if it were something like `add s0, s0, a0`
<heat> i find opcodes really confusing still
<heat> probably because I never really look at them
<geist> this is good, it means you're not a computer
<heat> beep boop send me your credit card information beep boop
<heat> dear humans what is your favourite prefix
<geist> the 32bit instructions are somewhat more regular. the part of riscv that's a bit unclean to me (though there's a reason for it) is the way immediates are spread around
<geist> also i cant not think of the game when i see R-Type
<heat> oh wow GAS macros can start with a .
<heat> geist, do you know off the top of your head how large an offset can be in riscv?
<heat> 12 bits?
sonny has joined #osdev
<geist> 12 bits i think
<heat> that sounds reasonable
<geist> usually the I type instruction in the above image
<geist> yah it actually slots in with the way the address computation stuff works
<heat> i really wanted to find a way to get a nice instruction sequence for tp accesses
<heat> unfortunately inline assembly doesn't want to play along
<geist> also keep in mind, and this is a little funny with riscv, all immediates are signed
<geist> so that's 12 bit signed. ie +/- 2048ish
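A 12-bit signed immediate covers -2048..+2047; a common C trick to sign-extend such a field (assumes arithmetic right shift of signed values, as gcc and clang provide):

```c
#include <stdint.h>

/* Sign-extend a 12-bit immediate field, e.g. from an I-type
 * instruction: shift the sign bit up to bit 31, then shift back
 * down arithmetically so it smears across the high bits. */
static int32_t sext12(uint32_t imm)
{
    return (int32_t)(imm << 20) >> 20;
}
```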
<sonny> interesting
<sonny> any rationale?
<geist> there's some reason for it, like *all* immediate computation in the cpu itself goes through the same logic
<sonny> ohhhh
<sonny> that makes sense
<geist> also the way the immediates are split up in the above table is arranged such that bit 31 in the instruction is always the sign extend bit
<geist> ie, the high bit of the encoded immediate always lies in bit 31
<heat> all my percpu accesses in riscv have 3 instructions and this doesn't sit right with me
<geist> to make the muxing logic a bit more consistent apparently
<heat> ahh what my add_per_cpu is even worse
<heat> i should've optimised this
<geist> anyway, i do recommend reading through the spec, and there's i think an online version of the 'riscv primer' i believe is the name
Likorn has quit [Quit: WeeChat 3.4.1]
<heat> i just did get_per_cpu() and then write_per_cpu(value + n)
<geist> it's a short read, goes through the architecture and has a lot of asides about why this or that decision was made
<heat> i've read the spec but mostly skipped through the instructions
<geist> http://riscvbook.com/ is the book
<bslsk05> ​riscvbook.com: The RISC-V Reader: An Open Architecture Atlas
<geist> i think you can get a free version, but it's a nice read
<heat> i think the only thing I don't have yet is SIMD and quad extension support
<geist> http://riscvbook.com/greencard-20181213.pdf looks like a good thing to have around
<heat> oh nice find!
<sonny> patterson at it again :-)
<heat> geist, do you think there's any reasoning behind simd != floating point or is it standard in most non-x86 archs?
<geist> probably just so you can include a smaller version
<geist> ie make a single precision cpu, double + single, or vector + double + single
<heat> you also have quad
<geist> also i think the vector stuff has just now been properly ratified. i'm not sure the vector bits on that greencard pdf are up to date since it seems to be from late 2018
<geist> also i'm waiting for dh` to get on my case for spewing nonsense
<heat> they got clz!
<heat> also max and min instructions which is pretty interesting
<moon-child> simd != floating point on x86 too
<moon-child> also x86 has simd max and min, and clz in avx512
<geist> yeah atomic min/max too. i think ARM64 just added those too which i find interesting. not a thing i've needed but i haven't built any algorithms behind it
<heat> if you're using x87 fp you should stop immediately :P
<moon-child> geist: yeah, atomic or/and seems more useful (assuming you have add already)
<heat> the p extension seems to only have SIMD for 64-bit (even for rv32)
<moon-child> (amusingly, min/max _are_ or/and on booleans)
<geist> heat: where are you seeing the clz stuff? is it in the 'b' extension?
<geist> not sure what the status of that extension is to be honest
<geist> aah yeah
<heat> chapter 6
<heat> clz and max/min are also in the b extension now that i'm seeing
jhagborg has joined #osdev
<heat> actually chapter 4 lists the duplicated instructions
<heat> also, subextensions????????
<geist> what i dont know is how far along a proposal has to be to have gotten this far
<geist> ie, is the fact that it exists in the official repo mean it's in some sort of final ratification or is it still just an idea someone is tossing around
<heat> i know that linux doesn't merge riscv code for extensions that haven't been ratified yet
<klys> subextensions exist in the repo? this question is probably simpler than it looks
heat has quit [Quit: Leaving]
<klys> what's new
<geist> aww klys you ran heat off
<geist> ice cold
<klys> sorry about that, I don't have many predilections about temperaments
<geist> klys is entropy in action
<geist> anyway not much, drinking some coffee, should go take a walk
<klys> well I've been working
<klys> on something not technical
<geist> earning some cheddar?
<klys> yeah I have to decide what to put it on now
<klys> anyways just a bit curious
<klys> and I now own a fun domain just working on database frontends and javascript too
<geist> noice
<klys> and I might buy an epyc
<klys> with my huge tax return
<geist> i was thinking of trying to get a real(ish) server board for my ryzen
<geist> not sure its worth it though. asrock rack has something kinda like this
<klys> socket?
<geist> AM4
<klys> right I was looking at SP3
<klys> so eh, are there new changes to lk?
<geist> let's see, ported to 68k. ported to a board i got
<geist> added pluggable network support, tcp outgoing sockets, a cheesy IRC client (To talk to sortix)
<geist> working on spiffying up the FAT driver to properly support RW
<klys> wew fun I hope sortie is enjoying this
<geist> added e1000 driver and the start of an AHCI driver a while back
<geist> need to finish up the AHCI driver
<klys> neat
<klys> ahci is for sata correct?
<geist> yes
<klys> so you've been doing cluster math? how's that coming for ye?
<klys> with the FAT
FragmentedCurve has left #osdev [#osdev]
<klys> and I guess the board you got is that m68010 board I saw last week
<geist> heh cluster math
<geist> found a better version of the FAT spec from MSFT later than the 1.03 version where the writer is being an ass the whole time
<geist> someone cleaned it up to include it in the SD card spec i think
<klys> link
<geist> made it not so condescending
<geist> oh i forget where
<klys> okok
<geist> i dont save links to things, i make my own copy so i never have to
<klys> I've had a lot of new browser fun since purchasing this 32GB-RAM dell in november
<geist> ie, the first hit on google
<geist> i had an older version of that which is written in this overtly condescending way
<klys> oh, not too long either
<geist> like 'all you idiots that dont understand FAT here's how it is and listen up and quit fucking it up'
<klys> right hehe
<geist> but yeah i was just reading it closely last night and i think i finally understand cluster math. what a PITA
<klys> I'd been pushing for changes to the qemu docs to include lcyls= lheads= lsecs=
<klys> even though that's a seabios feature
<klys> those are still qemu options
sonny has quit [Ping timeout: 252 seconds]
<klys> and qemu should document them
<klys> you must have an equation then involving "reserved sectors"
<geist> sure
<geist> the spec actually describes it pretty precisely, it's just non intuitive
<klys> yeah on page 29/37
<geist> as in cluster 2 is actually the first data block after the reserved sectors after the fats after the root dir (if on fat 12/16)
<geist> and the first 2 clusters of the FAT are used for other purposes, but instead of burning two clusters of allocation, they offset everything from that
<geist> *eyes roll*
<geist> like you can just see how it's hack upon hack
<klys> so do the reserved sectors come after the FAT table?
<geist> and every step of the way someone could have designed something a little cleaner
<geist> before
<klys> I was somehow under the impression it was before, yeah
<geist> they're basically there so you can pad out the first FAT such that your first cluster arrives on a proper boundary
<klys> ah cool
<geist> FAT12 stuff tends to pack it all in with no wasted space, which makes sense for floppy disks, but then i think you'd never use anything but 512 byte clusters on a floppy disk
<geist> so it doesn't matter if the clusters are unaligned
<klys> so that kind of math exists in the format tool, to cluster align the FAT
<klys> does that sound right?
<geist> right
<klys> okay and the FAT32 is different then too?
<geist> also since the FAT length in sectors is a field in the BPB you can pad out your FATs such that they align if you want
<geist> no. FAT32 is fundamentally the same as FAT16 in layout *except* the root dir is a regular file now, instead of being a fixed length right after the fat
<geist> so that means cluster 2 (the 1st cluster) starts immediately after the fats since there is no reserved root dir space
<klys> ah, so the root takes FAT entries, and \ is cluster two.
<klys> er
<geist> yah, though it's actually specced such that it doesn't have to be cluster 2
<geist> there's a field that says what the starting cluster is for the root dir
<klys> oh because that field exists
<klys> except you add that field to the reserved sectors
<geist> but i think the spec says it should be 2, unless that's a bad sector, in which case pick the first available cluster that doesn't have a bad sector in it
<geist> no. you dont. reserved sectors are before the FAT(s) and in sectors
<klys> oh then I see thanks
<geist> cluster2 is the first available Data Cluster, which starts after the last FAT and after the root dir sectors
<geist> but in the case of FAT32 the root dir sectors == 0
<geist> anyway it's silly, but once you grok it it makes sense. the pdf from above has a decent graphic pretty early on
<klys> because cluster two is at location zero
<geist> cluster two is at data block 0. and data block zero is defined as ...
<klys> that is, the math forces cluster two to the start, block zero
<geist> data block zero, to be precise. not block zero of the device/partition
<klys> yeah
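The cluster math being discussed follows the MS FAT spec: a sketch using the spec's BPB field names (the struct and helper names are made up):

```c
#include <stdint.h>

/* FAT cluster math: cluster 2 is the first data cluster, which
 * lives after the reserved sectors, after the FATs, and (on
 * FAT12/16) after the fixed root dir. Field names follow the
 * MS FAT spec's BPB. */
struct fat_layout {
    uint32_t reserved_sectors;    /* BPB_RsvdSecCnt */
    uint32_t num_fats;            /* BPB_NumFATs */
    uint32_t fat_sectors;         /* FATSz: sectors per FAT */
    uint32_t root_dir_sectors;    /* 0 on FAT32: root dir is a normal file */
    uint32_t sectors_per_cluster;
};

static uint32_t first_data_sector(const struct fat_layout *l)
{
    return l->reserved_sectors + l->num_fats * l->fat_sectors +
           l->root_dir_sectors;
}

static uint32_t cluster_to_sector(const struct fat_layout *l, uint32_t cluster)
{
    /* clusters 0 and 1 are reserved, so everything is offset by 2 */
    return first_data_sector(l) + (cluster - 2) * l->sectors_per_cluster;
}
```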
<geist> iirc ext* does a more regular job of this and defines the first block to be block 0 of the device, and simply marks those things as occupied in the bitmap
<geist> since that's where the bootsector/etc is
<geist> so it's at least consistently numbered from the 0th byte of the volume itself
<geist> which is i think how most sane things do it
<klys> well is your code looking fairly stable from learning this?
<geist> i'm starting with some existing, mostly broken code, so it's not yet morphed into a good place
<geist> but this part of the code was mostly okay
<geist> but i'm basically rewriting it more or less completely as i go
<klys> okay this is enlightening, do you often start with code that has become messy and forgotten?
gog has quit [Ping timeout: 246 seconds]
<geist> not really, but in this case it's pretty clean and forgotten code, just incomplete
<klys> ah is it thinfs?
<geist> so it was a reasonable starting point, since it basically works, just needs to be redone
<geist> hmm? no. it's fat.
<klys> I remember thinfs
<klys> it didn't work
<geist> https://github.com/littlekernel/lk/tree/master/lib/fs/fat32 just some implementation someone tossed in LK years ago
<bslsk05> ​github.com: lk/lib/fs/fat32 at master · littlekernel/lk · GitHub
<geist> it basically worked for fat32 for simple 'read file all at once'
xenos1984 has quit [Read error: Connection reset by peer]
<geist> but i've started to redo it, so it'll probably morph into something else by the time i'm done
<geist> i haven't pushed my changes into it
sonny has joined #osdev
mahmutov has joined #osdev
<klys> oh yeah that source tree looks trivial
<klys> I think some cameras hardwire the location of FAT data structures
<bslsk05> ​github.com: lk/fat.cpp at master · littlekernel/lk · GitHub
sonny has quit [Ping timeout: 252 seconds]
<klys> mounting FAT without running a r-o dosfsck on it could potentially result in corruption in some cases
<klys> or at least having some sanity checks (eg. the FATs match)
sonny has joined #osdev
xenos1984 has joined #osdev
sonny has quit [Quit: Client closed]
mahmutov has quit [Ping timeout: 246 seconds]
<wxwisiasdf> osdev unit testing? :)
<wxwisiasdf> is there any os that has unit testing?
<wxwisiasdf> like i don't know, `kdebug` for a micro and then like some kind of stress tester hypervisualizer that brings drivers to their limits or so
<Mutabah> I use the rust unit testing framework for a little bit of testing
<zid> turns out we have really good integration testing
<zid> you boot it then run doom
<zid> and wait for it to crash
jhagborg has quit [Remote host closed the connection]
jhagborg has joined #osdev
Ali_A has joined #osdev
<vdamewood> Does an x86(_64) CPU need any features to support UEFI, or is that all in the startup code in the firmware?
<moon-child> I expect the latter
<vdamewood> I as well.
Ali_A has quit [Quit: Connection closed]
<geist> yah nothing other than whats in the base x86_64 feature set
<geist> like paging, 64bit, etc
<vdamewood> So, no magic instructions found only in extension foo, or anything like that?
<geist> no
<geist> wouldn't make any sense for it to do so
<geist> i suppose it could test for and decide to use 1GB pages, for example, but it could always test for it
Ali_A has joined #osdev
<geist> OTOH if it has to test for it it implies it has the fallback code in the implementation, so since it's not intended to be performant, may as well just assume the feature isn't present
<geist> and things like AVX512 or whatnot it has no need for
heat has joined #osdev
<heat> vdamewood, no.
<heat> UEFI firmware boots the same as BIOS firmware
<heat> same address, same everything
<heat> it's everything after switching to protected mode (SEC phase in UEFI) that's different
<clever> in the case of coreboot, you have a seperate "init the system" and payload
<heat> yeah
<clever> the payload can be seabios (legacy bios api), tianocore (uefi api), or just raw grub!
<heat> but coreboot's UEFI payload is way different
<heat> after SEC nothing's initialised except for really basic stuff so you're in 32-bit mode and possibly TDX
<heat> it ends by finding a firmware volume that has the pre-efi initialisation (PEI phase) code in SPI flash
<heat> the SPI flash is memory mapped of course
<heat> you don't even have RAM yet, all cache as ram
<geist> none of this means you couldn't compile an implementation of UEFI to require some newer cpu features if you wanted to, but then that particular binary wont run on older stuff
<geist> but that may be fine. however it's not baked into the API or whatnot
jhagborg has quit [Remote host closed the connection]
<heat> yup
<clever> i read something about how 17h? family amd chips use a PSP (like intel me) to bring ram up
jhagborg has joined #osdev
sonny has joined #osdev
<clever> and that copies the "bios" to the dram, and maps it to the reset vector
<heat> it's probably safe to assume that the platform's chipset you're going to run on can have a base feature set but afaik that code is totally generic
<clever> so those dont have to deal with the whole cache-as-ram stuff, but the ram config is outside of the "bios"'s control
<geist> yah, no particular reason to specialize it since it's not performance critical
<clever> the only time ive ever seen the rpi firmware using vector opcodes during init, was for the cache-as-ram setup
<clever> using a vector store, to write to an entire cacheline at once, so it doesnt try to fetch the missing bits from ram
<heat> some firmware engineer on the edk2 mailing list was saying that he's seen memory reference code with like a megabyte of debug code that prints histograms and whatnot
<clever> it feels less like proper cache-as-ram, and more like just avoiding a cache miss/eviction
<clever> and there are a lot of places where i might have used vector opcodes, but the official firmware doesnt
<geist> yah if nothing else the uefi stuff might be optimized for space, since flash chips aren't free
<geist> but then i think it's probably not too tight on the average mobo implementation, especially if there's lots of gui code in the bios setup stuff
<geist> which would most likely dominate the flash usage
<heat> i've seen some people with concerns yes
<heat> i think it's tighter than you'd think
<heat> might be possible that we get dynamic linking in edk2 this gsoc
* geist nods
<heat> right now the whole build statically links libraries which is just wasted space
<heat> except the boot services and protocols and all that
<heat> cuz like everything's super modular and all in separate .efi executables everywhere, even before you have RAM up
<clever> damn
<clever> i would have assumed it would be a bit of a race to get ram up first
<heat> no
<heat> i think you even have a heap before ram is up
srjek has quit [Ping timeout: 240 seconds]
<geist> clever: it's also entirely possible all of that ram up stuff happens before the bulk of the big stuff
<geist> ie, the AMD AGESA may start and do stuff then pass handoff to something else maybe
<geist> though i guess heat just says something that counteracts that so dunno
<geist> thinking of the fairly standardized notion in ARM world of the whole BL1, BL2, BL3x stuff, etc
<heat> i'm talking about intel platforms
<geist> yah i know. thinking that stuff would be done in a fairly similar way
<geist> but i guess there's no reason to think that'd be the case really
<geist> probably all sorts of history there
<heat> ram init is done in the intel FSP
<heat> which is intel's AGESA
<geist> okay, so that'd be kinda like BL1 or so in arm world
<clever> or SPL?
<geist> ie, the highly machine specific stuff that gets you into a runnable space
<geist> clever: what is SPL?
<clever> secondary program loader
<clever> ive seen that on a number of arm boards
<geist> SPL would be a bit later, hence 'secondary'
<clever> you typically prepend the SPL to the uboot binary, and write the combined pair at a fixed offset on the SD card
<geist> BL2 or one of the BL3s in arm parlance
<clever> the rom loads the SPL, the SPL brings ram online and runs the "secondary" program(uboot)
<geist> yah that's some rpi nonsense
<clever> this is done on a lot of non-rpi boards
<geist> that's all BL31 to arm
<clever> maybe more the armv7 era
<geist> sure. again thats 'application level bootloader' after stuff is brought up if you're following ARM's world
<geist> and yes, you dont have to follow it, but you also can't have a secure boot environment if you dont
<heat> i've been looking more closely and I think that they prelink PEI core to run at a specific image base
<heat> because they can't actually relocate the image in SPI flash
<geist> i've seen *tons* of implementations of this stacked stuff in ARM, but since v8 ARM has tried to standardize it, and they're largely okay
<clever> i can see how the secondary program could also be ATF, but it sounds like ATF has its own names
<geist> clever: yes. BL1, BL2, BL3x, etc
<geist> it's all specced out as generic names for 'blob of code here that has this responsibility'
<clever> yeah, that at least makes it easier to talk about each phase
<geist> so someone can call it something like uboot or uefi, etc but it ends up being a phase
<geist> which i think makes sense once you grok it
<clever> similar with SPL, the secondary program can be either uboot or uefi
<bslsk05> ​ohwr.org: Arm trusted firmware (atf) · Wiki · Projects / SoC Course with Reference Designs · Open Hardware Repository
<clever> and i should probably yoink a few of those names for my rpi firmware
<geist> basically by the time you get to BL33 you're in EL2 or EL1 and you're running in non secure world, and then you can build whatever stack of firmware you want there
<clever> ive just been calling things bootcode.bin and lk.elf
<geist> but you've already gone through at least BL1 and BL2
<clever> but lk.elf is confusing, given that every project creates one
<geist> yah
<clever> BL31 most closely fits what ive been calling lk.elf
<geist> anyway, didn't want to hijack the conversation
<geist> yah normally this is where you insert the ATF binary, which is designed to run at EL3 and stick around
<geist> and then the apps bootloader (uboot, uefi, etc) would be BL33
<heat> btw PEI is going to run in 64-bit mode in new intel platforms
<clever> but in my case, its not even on the arm core
<geist> BL32 is when you have some secure OS that you run on the side
<heat> which means that at least their CPUs can have page tables in cache as ram
<geist> heat: hmm, what's PEI?
<geist> besides prince edward island
<heat> pre-efi initialisation. it's totally efi but before the standard UEFI spec environment and it's the part of the firmware that inits the platform, PCI, ram, etc
* geist tosses one out for the canadians
<clever> :D
<heat> part of PEI still runs in temporary cache-as-ram
<heat> it running in 64-bit mode means that at least new intel CPUs can have their page tables exclusively in cache, which is pretty interesting
<heat> guess it works the same as the IDT and GDT which always could be in cache-as-ram
<geist> i'd think PEI runs on potatoes if nothing else
<heat> it runs on top of cpu magic like everything else
<bslsk05> ​wiki.osdev.org: Creating a 64-bit kernel - OSDev Wiki
<geist> yeah i just dont think it's explained very well
<clever> i think thats a side-effect of the cpu having a limited number of usable addr bits?
<heat> it's definitely not true now
<geist> it seems to be a roundabout way of saying you can put a physmap at the bottom of the kernel, but you can also start off by easily unity mapping 0-2GB to -2GB
<clever> address sizes : 48 bits physical, 48 bits virtual
<geist> and then just run the kernel out of where it was loaded
<clever> so you cant freely use the entire 64bit range
<geist> ie, if the kernel got loaded to 1MB physical, then you could link it to run at (-2GB + 1MB) and then the simple map at startup would Just Work
<clever> and to allow for a kernel in the upper half, you have 47 bits that function normally, and then everything else acts as 1 massive BIT
<geist> i think linux did something like this for a while, though it's probably more sophisticated now
<heat> whoever wrote this seems to have confused linux i386 with linux x86_64
<geist> yes, i think so
<heat> they totally ditched high memory in x86_64, and the -2GB mapping isn't linear
<geist> yeah most likely
<geist> it still is a convenient way to bootstrap paging if you're okay with the kernel being loaded at a fixed spot physically though
<geist> just set up a simple unity map and then get going
<geist> but a little bit more effort and you can be more flexible
<geist> i was probably just pontificating at the time some simple way to bootstrap, not necessarily that it was The WAy
<heat> ah wait it does indeed still have a linear map of 512MB in -2GB
<heat> if the docs are up to date that is
<geist> actually kinda easy to verify with qemu
<heat> I was looking for a random claim in the osdev wiki that I remember reading when I started out with x86_64 that said you couldn't go from long mode to protected mode without rebooting
<heat> which is definitely false
Ali_A has quit [Quit: Connection closed]
<heat> can't find it, hopefully it's not there anymore
jhagborg has quit [Ping timeout: 276 seconds]
jhagborg has joined #osdev
Ali_A has joined #osdev
<wxwisiasdf> i just disabled mutexes and my os stopped crashing
<wxwisiasdf> woah it's magic
<CompanionCube> heat: 64-bit PEI sounds cool, do you know if intel ever made good on their intention to kill off the CSM for new things in 3030?
<CompanionCube> *2020
<gorgonical> I really need a USB hub. I have two serial USBs plugged in, headphone DAC, sd card reader USB. Too many things
sonny has quit [Quit: Client closed]
papaya has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
<heat> CompanionCube, well, I can't actually find the CSM code
<bslsk05> ​www.ebay.com: 7 Port Aluminum USB 3.0 HUB 5Gbps High Speed +AC Power Adapter For PC Laptop Mac | eBay
<heat> so it looks like it was scrapped around 2019
<heat> OVMF still has a copy though
<gorgonical> What a strange design
papaya has joined #osdev
<gorgonical> But very enticing
<gorgonical> Too enticing. I got one
<gorgonical> thanks klys
<klys> yw!
<klys> main reason I favored the design in the first place was to accommodate some of those: https://www.ebay.com/itm/194988974944
<bslsk05> ​www.ebay.com: HDMI to USB Video Capture Card 1080P 60fps Recorder Phone Game Live Streaming 4K | eBay
<CompanionCube> 2019 would make sense, yes
<heat> yeah it looks pretty dead, only OVMF still has CSM unless there's a considerable amount of CSM code that is maintained by vendors off the tree
<gorgonical> I see. Yeah you definitely need the space. I had too many of those 4-ganged plugs where even thicker USB drives will jam
<klys> and then I found out there is a delay in the signal so I can't keep up in real time
<CompanionCube> i could see AMI or the like maintaining CSM if they wanted/needed, i guess
<wxwisiasdf> usb toothpaste, nom nom
<wxwisiasdf> gimme some universal toothbrush, ehci controller for max dental care
<CompanionCube> xhci, because 'x' makes anything sound cooler
john has joined #osdev
heat has quit [Ping timeout: 260 seconds]
zaquest has quit [Remote host closed the connection]
zaquest has joined #osdev
jhagborg has quit [Remote host closed the connection]
jhagborg has joined #osdev
john has quit [Ping timeout: 276 seconds]
Ali_A has quit [Quit: Connection closed]
vdamewood has quit [Ping timeout: 246 seconds]
vdamewood has joined #osdev
jimbzy has quit [Quit: ZNC 1.7.5+deb4 - https://znc.in]
jimbzy has joined #osdev
Reinhilde is now known as MelMalik
jhagborg has quit [Ping timeout: 240 seconds]
mniip_ has quit [Ping timeout: 620 seconds]
<corecode> is there a channel on simd/bit twiddling optimization?
<zid> ask fuz in ##asm
wxwisiasdf has quit [Quit: Lost terminal]
GeDaMo has joined #osdev
nyah has joined #osdev
eroux has joined #osdev
marshmallow has joined #osdev
Burgundy has joined #osdev
eroux has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
gog has joined #osdev
eroux has joined #osdev
dennis95 has joined #osdev
gog has quit [Ping timeout: 272 seconds]
gog has joined #osdev
gog has quit [Ping timeout: 240 seconds]
eroux has quit [Ping timeout: 256 seconds]
Ram-Z has quit [Ping timeout: 256 seconds]
nyah has quit [Quit: leaving]
nyah has joined #osdev
nyah has quit [Ping timeout: 246 seconds]
nyah has joined #osdev
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
nyah has quit [Ping timeout: 246 seconds]
Ram-Z has joined #osdev
srjek has joined #osdev
Ali_A has joined #osdev
nyah has joined #osdev
Teukka has quit [Read error: Connection reset by peer]
Teukka has joined #osdev
les has quit [Quit: Adios]
les has joined #osdev
terminalpusher has joined #osdev
Ram-Z has quit [Ping timeout: 246 seconds]
diamondbond has joined #osdev
vdamewood has joined #osdev
Ram-Z has joined #osdev
Likorn has joined #osdev
pg12 has quit [Ping timeout: 240 seconds]
diamondbond has quit [Quit: Leaving]
knusbaum has quit [Ping timeout: 246 seconds]
pg12 has joined #osdev
knusbaum has joined #osdev
knusbaum has quit [Quit: ZNC 1.8.2 - https://znc.in]
gxt has quit [Quit: WeeChat 3.4.1]
knusbaum has joined #osdev
gxt has joined #osdev
wootehfoot has joined #osdev
wxwisiasdf has joined #osdev
<wxwisiasdf> hello people :>
<wxwisiasdf> & good morning
sonny has joined #osdev
gog has joined #osdev
knusbaum has quit [Ping timeout: 240 seconds]
dennis95 has quit [Quit: Leaving]
sonny has quit [Ping timeout: 252 seconds]
heat has joined #osdev
Ali_A has quit [Quit: Connection closed]
kingoffrance has quit [Ping timeout: 250 seconds]
wootehfoot has quit [Quit: Leaving]
sonny has joined #osdev
heat has quit [Remote host closed the connection]
kingoffrance has joined #osdev
sonny has quit [Ping timeout: 252 seconds]
sonny has joined #osdev
dennis95 has joined #osdev
knusbaum has joined #osdev
knusbaum has quit [Quit: ZNC 1.8.2 - https://znc.in]
knusbaum has joined #osdev
wxwisiasdf has quit [Ping timeout: 246 seconds]
knusbaum has quit [Ping timeout: 240 seconds]
Matt|home has quit [Ping timeout: 256 seconds]
jhagborg has joined #osdev
knusbaum has joined #osdev
knusbaum has quit [Ping timeout: 272 seconds]
wxwisiasdf has joined #osdev
<wxwisiasdf> hmmm
knusbaum has joined #osdev
<sonny> wxwisiasdf: what are you working on?
knusbaum- has joined #osdev
knusbaum has quit [Ping timeout: 272 seconds]
<wxwisiasdf> sonny: separated the shell from the kernel
<sonny> oh nice
<wxwisiasdf> and added spooler manager for spooling devices
mahmutov has joined #osdev
<wxwisiasdf> and added more japanese translations that i totally understand (sarcasm)
<GeDaMo> It's a long time since I even heard of spooling :|
<wxwisiasdf> my os is for mainframe
<wxwisiasdf> i've heard stuff like "storage load" -> "load thing from ram"
<kingoffrance> yes, primary storage
<wxwisiasdf> GeDaMo: abend, punch card line number, JES2, virtual storage, storage, 3270 data stream
<wxwisiasdf> etc etc ibm terminology
<GeDaMo> I just associate spooling with printers :P
knusbaum- is now known as knusbaum
<wxwisiasdf> spooling is basically
<wxwisiasdf> some random userland thing minding it's own business:"hey yo, queue me this device request",kernel: "k sure bud"
<kingoffrance> sounds a bit like batch :)
<wxwisiasdf> it is a fancy name for queue+polling
<wxwisiasdf> well not so much polling, you can just yield until an interrupt comes
* geist yawns
Matt|home has joined #osdev
Likorn has quit [Quit: WeeChat 3.4.1]
<jimbzy> yo
* kingoffrance gives jimbzy coffee
<jimbzy> Cheers
Likorn has joined #osdev
knusbaum has quit [Ping timeout: 240 seconds]
diamondbond has joined #osdev
* mjg burps
knusbaum has joined #osdev
<geist> was productive last night. felt nice. hacked a lot of FAT code
<geist> one of those rare times when you just get in the zone for a few hours and bash out code
<jimbzy> Yeah those are good times
<mjg> the zone == $$
<geist> yah it's pretty much impossible to get into the Zone for work stuff anymore. too many distractions, or dependencies on tools/processes that break the flow
<mjg> let's be real though, everyone would be more productive if they had better life/work balance, rest, nutrition etc.
<geist> oh totally i've actually been much more focused and productive the last month or so
<mjg> hacking until 2 am sounds fucking great on paper
<mjg> but it is detrimental to actual productivity
<mjg> 's my point
<geist> depends on *whos* productivity you're dipping into
<jimbzy> Yeah. I almost have to go on a strange schedule due to family stuff.
<geist> ie, is this for work or for your own personal stuff
diamondbond has quit [Quit: Leaving]
<j`ey> geist: LK FAT?
<Griwes> mhmmmm, fat code
<geist> yah. i think i pretty much fully grok FAT now
<jimbzy> I did pick up a new soldering iron today, tho, so that feels productive ;)
<geist> not that it's that difficult but there were little details that i hadn't fully grokked
<jimbzy> Well, ordered one rather.
<geist> cool, a big one or one of the smaller battery powered ones? the latter looks pretty great, i've seen a few
<bslsk05> ​www.weller-tools.com: WE 1010NA Soldering Station | Weller
<geist> i have a pretty good workhorse Hakko that i've been using for years but thinking of getting one of the battery ones for smaller work
<mjg> geist: fair point
<geist> oh nice. weller makes good stuff
<mjg> i got a semi-vanity project at work right now
<mjg> kind of forgot the distinction :D
<geist> The Flow for work i think i've decided it never was a good idea to waste it on work. best thing you can do there is slow and steady progress, all the time, inefficiently
<geist> and then find a way to be happy with that
<jimbzy> I almost got a Hakko FX888 but decided to go with the Weller instead. I've had good luck with them.
<geist> jimbzy: yah i have the 888
<geist> been happy with it, but i think the weller is just as good
<jimbzy> As long as it works I'm not too picky.
<geist> yah and has a nice array of replaceable tips. but i think it's all semi standard at that size anyway
<geist> my only complaint with the hakko (depends on the exact model) is one of them doesn't have a light that says its on
<geist> it only has a led if it's heating
<geist> so it's really easy to accidentally leave it on. a common mod is to add another led for that
<jimbzy> That's kinda strange.
<zid> A common mod is to your own hands with severe burns.
<geist> burned my finger with it exactly once
<zid> I managed somehow to to stand on a hot soldering iron once
<geist> jimbzy: not so sure about the weller, aside from the toggle switch it may be hard to tell it's on
<zid> I was doing a quick fix on a torn off wire on the floor rather than cleaning up my desk
<jimbzy> I'm used to working with extreme heat, so I should be ok.
<geist> anyway, just a thing i've found is useful. if it's not in obvious sight, i've found it easy to accidentally leave it on
<jimbzy> that sounds terrible zid.
<mjg> geist: checked out bare minimum worker? i can dig that
<mjg> geist: had my own period :-P
<geist> re: burning myself with the soldering iron, i do use a binocular scope for most soldering work. really convenient and makes for good results, but.... easy to accidentally your finger
<geist> since you only see a narrow field of view you dont see the tip until you get it right in the right spot
<geist> so you have to be doubleplus careful and be very aware of where the iron is at all time
<jimbzy> Yeah I can see that being an issue for sure.
<geist> but otherwise i highly recommend, especially for people with shitty vision
<geist> or doing SMT work
<geist> and not really that expensive. I think i have a fairly low end AmScope binocular thing, got it like 15 years ago so might be more pricey now
<jimbzy> I have shitty vision! I can look through a keyhole with both eyes.
<geist> 30x zoom i think?
<zid> 30x zoom is pretty good for a soldering iron
<geist> yah just about the sweet spot
<zid> could kill some pretty far away animals with that
<zid> I do wonder if the 30x zoom is more range than you'd be able to sufficiently use without some kind of high pressure tip delivery system, maybe even black powder
<geist> yah looks like it's a bit pricier now, but it's basically something like https://smile.amazon.com/AmScope-SM-3BZ-80S-Microscope-Magnification-Ring-Style/dp/B006QN5T5G
<bslsk05> ​redirect -> www.amazon.com: Amazon.com: AmScope SM-3BZ-80S Binocular Stereo Microscope, WF10x Eyepieces, 3.5X-90X Magnification, 0.7X-4.5X Objective Power, 0.5X and 2.0X Barlow Lenses, 80-Bulb Ring-Style LED Light Source, Single-Arm Boom Stand, 110V : Electronics
<geist> i think i got it back when it was like $250
<jimbzy> That's pretty wild. I was just gonna get one of those big magnifiers with the LED lights.
<geist> ah https://smile.amazon.com/dp/B004TOZ6AW is more reasonable
<bslsk05> ​redirect -> www.amazon.com: Amazon.com: AmScope SW-3T24Z Trinocular Stereo Microscope, WH10x Eyepieces, 20X/40X/80X Magnification, 2X/4X Objective, Single-Arm Boom Stand, Includes 2.0x Barlow Lens : Electronics
<geist> though you'll want to get a led ring for it
<geist> problem being that amscope makes like 150 versions of everything so it's really hard to tell
<geist> and yeah the round magnifiers with a led ring are pretty good too
<geist> i dont really like em, but my dad has one and it works well
<jimbzy> That's what I learned with
<geist> i think i learned with the scope because we always had a lab at a lot of the startups i worked at early on and they always had a similar setup
<jimbzy> I used to use one for sharpening tools, too, so I could see the edge clearly.
<geist> oh yah for non electronics or fine work the big things are definitely more useful
<geist> and the scopes use up so much desk space
<jimbzy> Gonna be building one of these soon for 20m https://qrp-labs.com/qcxp.html
<bslsk05> ​qrp-labs.com: QCX+ 5W CW transceiver kit
<jimbzy> The SMD components come mounted, so I should be good with the rest.
<geist> oh looks nice
<jimbzy> Yeah, it looks like a fun project.
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
elastic_dog has quit [Ping timeout: 240 seconds]
elastic_dog has joined #osdev
GeDaMo has quit [Remote host closed the connection]
vdamewood has joined #osdev
<gorgonical> I have been silently wondering for a long time geist if you are a "i'm getting on irc now" yawner or "i'm getting out of bed now" yawner
<geist> mostly the 'i'm outta bed and prepared to interface with the world' yawner
<geist> i had to get up super early today because one of my fire alarms started chirping at 6:30 this morning
<geist> so had to get up on a ladder and fix that while half asleep then went back to bed and didn't reset the alarm
<geist> so got up extra late today
<gorgonical> I see
<gorgonical> Reasonable
<kazinsal> what's up fire alarm chirpin all night buddy
<kazinsal> except in my case it's the one in the unoccupied apartment below mine so I had to ping building management
<geist> yah this one is 'smart' too so it's even worse
<geist> it chirps and then says "battery low" every minute
<kazinsal> oh god
<geist> in a pleasant female voice, but it still is some voice in the other room
<kazinsal> this one just pings every 30 seconds, but because it's down a floor it echoes
<geist> and since it's hard wired you can't just remove the battery and deal with it later
<geist> since the battery is simply the backup
<geist> i am vaguely curious what protocol it uses to talk to the other fire alarms. i saw at least 3 wires going into it. so 2 are power/gnd i'm sure, so probably at least one loop or shared bus
<jimbzy> Probably something like HART if I had to guess.
<geist> hmm, interesting. didn't know about it
sonny has quit [Quit: Client closed]
<geist> this looks a bit master/slave though, so may be something even simpler where it's just someone broadcasting their status on the bus
<geist> but i do have an action item now to find where the main plug is for it. unless the batteries of each of them power all the rest of them, there's probably a power brick somewhere. probably in the attic
epony has quit [Ping timeout: 260 seconds]
<mjg> huh. i wonder if there is any real-world hardware in actual use where linux fails the X86_FEATURE_REP_GOOD test
<mjg> not to be confused with a troll box in a garage kept only for this purpose
<geist> hmm what does it test?
<geist> like there's some x86 hardware that has a broken rep prefix?
biblio has joined #osdev
<geist> hard to tell exactly when the feature showed up since it has moved around a bit in the source tree
<geist> yah there is various places in various cpu detection logic in the kernel that sets that bit
<geist> usually between stepping X and Y of some particular thing
<geist> doesn't seem to mean that rep prefix is broken as much as it's not as good as a non rep version for memcpy
<geist> and that's distinctly different from ERMS
<geist> one quick example for example, is some range of steppings on K8 hardware
<mjg> ye, ye, but is that really used today?
<geist> what is really used today?
<mjg> you know, in that spirit, i can tell you that some variants of amd athlon cpus have a bug where they require an explicit fence after atomic ops
<geist> that the feature flag? looks like it basically defaults to on, and is unset in particular situations
<mjg> solaris hotpatches this on boot
<mjg> no other system that i know of gives a shit today
<geist> sure. it's like that. whether or not it's used today is whether or not folks have hardware that the feature helps with, and linux seems to go pretty far back all in all
<mjg> my point being that the above is probably something which should have gotten removed years ago
<geist> if they're just now dropping 486 in the last few years, etc then they have a long way to go until they get to mid 2000s era AMD hardware
<geist> i dunno, K8 is only like 15 years old
<geist> for example
<geist> but anyway, it's clear linux has the policy of maintaining this stuff until it eventually becomes too difficult to maintain, and this switch seems to be pretty low maintenance
<mjg> well see above, in that spirit, they literally have a critical bug for some cpus
<geist> whereas i can see how keeping 486 around would be a burden. if nothing else maintaining the whole soft x87 emulation stuff could go possibly
<mjg> all while the rep stuff is some perf in the worst case
<geist> no no, that's my point. the few places i just looked in the kernel didn't seem like it was critical at all
<CompanionCube> iirc they only dropped 386?
<geist> it was simply 'its faster if you avoid using rep'
dennis95 has quit [Quit: Leaving]
<geist> CompanionCube: yeah i dunno.
<mjg> geist: i mean the amd athlon cpus which require a fence after atomic for locking primitives, which the linux kernel does not bother with
<geist> mjg: okay. i dunno what we're arguing about to be honest
<kazinsal> quick lxr look at how it works, it does seem that if REP_GOOD is set it just NOPs out a `jmp memcpy_orig` at the entry point of `memcpy`, or if ERMS is set it changes it to `jmp memcpy_erms`
<kazinsal> so that's clever
wxwisiasdf has quit [Ping timeout: 240 seconds]
<geist> i mean because it is the way it is doesn't mean there's some hard core logic governing it
<mjg> my point is that X86_FEATURE_REP_GOOD should probably get whacked as it adds complexity elsewhere
<mjg> i happen to be writing a patch right now which has to fuck with it
<kazinsal> honestly it doesn't seem like it has much overhead
<geist> yah it just fiddles with selecting a third variant of memcpy, one that seems to avoid all forms of rep
<mjg> there is more
<CompanionCube> as of last august linux still had 486
<CompanionCube> so seems they didn't drop that
<geist> CompanionCube: ah okay.
<mjg> i would argue freebsd has this solved better -- you ifunc to the expected variant
<CompanionCube> also 386 was dropped all the way in 2012
<geist> yah i was thinking the big drop would be to ditch the old x87 emulation code and require linux run on x86s that have a fpu
<mjg> and it does not use any indirect calls either, all callsites get relocated
<mjg> so there is literally 0 overhead from existence of numerous variants
<mjg> not even a nop sled
* geist nods
wxwisiasdf has joined #osdev
* CompanionCube has no ideas which distros still do 486, but i expect it's a very small list.
<mjg> btw, interestingly
<geist> yah dunno where i remember reading it. probably made it up
<mjg> SYM_FUNC_START(copy_page)
<mjg>     ALTERNATIVE "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD
<mjg>     movl $4096/8, %ecx
<mjg>     rep movsq
<mjg>     RET
<mjg> SYM_FUNC_END(copy_page)
<geist> there were also a bunch of 486 equivalents at the time that the kernel still has support for
<geist> nexgen, etc
<mjg> that is, they don't use ERMS for page copying
<mjg> i asked intel once what's up with that, they told me about some uarchs being fucked and requiring an sfence afterwards
<geist> question there is does a rep movsb work better even with erms for a page aligned copy
<geist> or is it really that rep movsb is better in on the average most cases with ERMS
<geist> and ah, what mjg just said
<geist> also reminds me, i was thinking i should see about putting together a more optimal page copy for ARM
<mjg> afair they used to claim (apart from the nugget above) that using erms is always better
<geist> with full alignment like that can factor out at least a few levels of memcpy
gog has quit [Ping timeout: 272 seconds]
<mjg> makes me wonder if that's true for newer uarchs
<mjg> as long as the size is a multiple of 8
<geist> but it may be that it's always better or at least parity with rep movsq. but in the case of a fully aligned copy address that's a multiple of 8 rep movsq is just as good
<geist> and thus doesn't really require a switch
<mjg> intel optimization manual used to claim that rep stosb + size of 4096 + eax == 0 is special-cased
* geist nods
<geist> reminds me, i should finally add the clzero feature for AMD. just cause
<mjg> it is plausible rep movsb with 4096 is also special-cased, but they fucked it up above
<mjg> geist: ha, clzero is something i benchmarked in one real case: the venerable kernel build test
<geist> OTOH while it is generally a good assumption that whatever is in linux is the optimal case, many times you will be disappoint, as i'm sure you're aware
<mjg> key point being clzero uses non-temporal stores
<geist> mjg: yah i believe it was you telling me in a multi socket/numa AMD machine it's not necessarily faster
<mjg> it is slower
<geist> because of the cross-node zeros
<mjg> not only cross-node
<mjg> you just end up generating more traffic to the memory controller
<geist> but it doesn't trash the cache, so it *may* be faster in some case
<geist> but yeah, i think you were the one that talked me out of it
<mjg> oh sure, i'm confident there are some cases where it ends up being faster
<geist> it also raises an interesting point: what about the ARM equivalent of it. clzero is clearly the equivalent of `dc zva` on ARM and the backend support must have fallen out of the K12 development
<geist> and ARM basically suggests using `dc zva` extremely aggressively. they even use it opportunistically in their memset implementation
<mjg> to be clear, in the kernel build test you had less system time, but got way more cache misses and consequently more user time
<geist> but i do question that sometimes, because of precisely wehat you just said
<mjg> and did not get faster in total
<geist> since dc zva is also specced as being cache bypassing
* geist nods
<mjg> the way i see it, the kernel tries to reuse pages as much as possible
<mjg> and when you zero them out with nt-stores, you actively combat that behavior
<geist> right if the page is just about to be used then it makes sense to go ahead and cache allocate in the L1
<mjg> as for linux doing stuff reasonably fast, i used to think that';s true
<mjg> but i know for a fact it is not :)
<mjg> and i mean some cases are weirdly bad
<geist> i guess it may depend on the situation. allocating a new zero page for some user space page fault. does it make sense to bring the page into the cache? you dont know what user space is about to do with it
<geist> in the case of say a page table, are you about to overwrite most of it?
<geist> if so, bringing it into the L1 makes sense
<mjg> if there was a way to tell how much of the page happens to be cached already, i would agree
<geist> a lot of the other situations like allocating a page for COW, you dont zero it because you're about to overwrite it
<geist> yah and the cost of flushing the page is almost certainly much more than just dealing with it
<mjg> linux folks did some benchmarks years ago, i don't remember what was tested apart from building stuff, but the conclusion was that nt stores on page zeroing suck
<mjg> i did my own benches, got the same result fwiw :-P
<geist> yah. it also may be that it's a bad implementation of a good idea, especially since ARM is extremely bullish on it
<geist> but i guess AMD ruined it forever
<mjg> afair all the nt store zeroing is a mid-2000s ideas
<mjg> idea
<geist> especially if AMD say sorts the NT stores as super high priority that starves out other things
<mjg> i find it mildly plausible it was ok at the time as caches were smaller
<geist> yah, i think that has a fair amount to do with it. relative cache sizes nowadays change where these tuning points come in for various things
<mjg> so the current effect of evicting shit was less likely to be happening
<mjg> but i'm also confident people made the change mostly based on their DEEP BELIEF as opposed to serous measurements
<mjg> which used to be a plague
<geist> i *do* suppose the whole background zeroing thing could be *more* plausible with clzero, because then you're at least not trashing the regular cache for it
<geist> except of course the NT stores probably get in the way of other things and generate unnecessary traffic
<geist> if you could hint that it was low priority maybe
<mjg> i don't know how background zeroing came to be. i suspect it was the result of "idle loop" only burning cpu
<mjg> so they came up with shit to do instead
<geist> probably same reason: L1 caches were smaller, so you didn't want to tolerate a memset on every page allocation
<mjg> i did some tests, anything simple like building the kernel shreds whatever reserve of pre-zeroed pages you might have within seconds
<geist> whereas now the 4K isn't that big of a deal
<mjg> and in fact you would not want to keep going to the pre-zeroed list either
<geist> yah also that too, i suspect modern systems are expected to be able to chew through more percentage of physical pages per unit time
<geist> and workload on older systems was relatively different, perhaps
<geist> that's sort of a thing lost in time, but i do wonder about that
<mjg> the metric fuckton of forks + execs results in gigabytes of memory chewed through
<geist> OTOH, that may also be an indicator that that's not necessarily the Best Test for this stuff
<geist> its a valid benchmark, but it benchmarks a particular corner of the envelope
<mjg> no argument here
<geist> but it's the easiest one to run, so it tends to be the one that folks fall back to
<mjg> it highlights the crucial point though
<mjg> let's say you need a new page. say you grabbed a pre-zeroed one, so you got it faster
<geist> also my guess is the background zeroing is just less useful nowadays with modern (ie, last 20 years) style file caching where in the steady state pretty much all pages should be full of something eventually
<geist> so there's really not a lot to background zero
<mjg> but now you freed a page and need a new one zeroed out
<mjg> is it faster to grab the next pre-zeroed or zero out the one you just freed?
<mjg> factor smp and i think we have a winner
<geist> i do think about it when running a bunch of VMs on my overcommitted server. linux host can do KSM page merging and zero page detection, but it only helps if the VMs are actually holding onto zeroed pages
<geist> which they dont, because you can't tell linux/other oses to do that
<geist> so i can tell a guest to dump the file cache, but i can't force linux to then fill those pages with zeros
<geist> you can of course write code in the guest to attempt to fill most of guest ram with zeros, but it's a hack at best (though it does appear to work)
<mjg> i have to note that zeroing on demand is slow enough that it does show up as a significant factor on both linux and freebsd
<mjg> [or perhaps other parts of the respective kernels suck least enough to make it stand out]
<geist> it's possible that at the end it's just more Fair in the sense that the user that initiated the page allocation pays for it
<geist> which makes sense
<mjg> makes you wonder how much faster things would be if pages reused in the same security domain were allowed to be allocated without zeroing
fkrauthan has quit [Quit: ZNC - https://znc.in]
<mjg> i mean we can't do that now without everything exploding in userspace
<geist> would be cool if you could, for example, store a bitmap of all pages in the memory controller that simply says 'this is zeroed'
<geist> then you could just flip bits as you zero the pages
<geist> basically like TRIM but for DRAM
<mjg> ye, weird there is no dedicated support. i guess OS fuckery is niche enough even for processor vendors
<mjg> :-P
fkrauthan has joined #osdev
<mjg> "here is the bare minimum to get teh thing running and go fuck yourself"
<geist> i know that AXI at least has the ability to push through transactions where the data is explicitly zeroed, and i'd assume at some point intel/amd has the same thing in their private busses to the mem controller
<geist> but i guess if they did do that it'd be just some private thing, and then suddenly things like writing zeros appears to be even faster than before, but only if you do it to an entire page... so that'd be tricky, since they'd have to observe an entire page clear
<geist> it's one thing to do it at a cache line granularity, but i guess page level would be difficult
<geist> or at least more difficult
sonny has joined #osdev
<mjg> oh also note amd did not patch linux to zero_page with clzero
<mjg> i tried asking them about it but got some weird non-response and dropped the subject
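[editor's note: a sketch of the `clzero`-based page zeroing mjg mentions. CLZERO (AMD Zen and later) zeroes an entire cache line addressed by RAX without reading it first; support is advertised in CPUID leaf 0x80000008, EBX bit 0. The 64-byte line size is assumed, and the code falls back to `memset` elsewhere.]

```c
#include <string.h>
#if defined(__x86_64__)
#include <cpuid.h>
#endif

#define PAGE_SIZE 4096
#define CL_SIZE   64    /* assumed cache-line size on Zen */

static int have_clzero(void)
{
#if defined(__x86_64__)
    unsigned a, b, c, d;
    if (__get_cpuid(0x80000008, &a, &b, &c, &d))
        return b & 1;   /* EBX bit 0 = CLZERO */
#endif
    return 0;
}

static void zero_page(void *page)
{
#if defined(__x86_64__)
    if (have_clzero()) {
        /* Zero one cache line per iteration without first
         * pulling the line into the cache. */
        for (char *p = page; p < (char *)page + PAGE_SIZE; p += CL_SIZE)
            __asm__ volatile("clzero" : : "a"(p) : "memory");
        return;
    }
#endif
    memset(page, 0, PAGE_SIZE);   /* portable fallback */
}
```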
tds has quit [Read error: Connection reset by peer]
terminalpusher has quit [Remote host closed the connection]
tds has joined #osdev
vdamewood has quit [Remote host closed the connection]
vdamewood has joined #osdev
tds has quit [Read error: Connection reset by peer]
hodbogi has joined #osdev
tds has joined #osdev
<hodbogi> It is time for fun
<hodbogi> fun means getting vscode + gdb to work or something with qemu
wxwisiasdf has quit [Ping timeout: 256 seconds]
Burgundy has quit [Ping timeout: 272 seconds]
dude12312414 has joined #osdev
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
toulene has quit [Quit: The Lounge - https://thelounge.chat]
dude12312414 has quit [Client Quit]
<nomagno> I just implemented a quasi stack macro to do nested looping in my VM. 33 instructions when setting up loop, 5 instructions when exiting, 1 instruction to do the actual looping since it's just jcz
<nomagno> Hm... I'd say it's worth it over my previous 6-instruction-to-setup pseudoinstruction, you lose more time looping than pre-looping usually, don't you :P
toluene has joined #osdev
<hodbogi> I found this after I had it half working https://wiki.osdev.org/User:TheCool1Kevin/VSCode_Debug
<bslsk05> ​wiki.osdev.org: User:TheCool1Kevin/VSCode Debug - OSDev Wiki
gog has joined #osdev
jhagborg has quit [Ping timeout: 240 seconds]
hodbogi has quit [Quit: Lost terminal]
Likorn has quit [Quit: WeeChat 3.4.1]
<geist> mjg: boo on team AMD
<geist> but like i said i bet it was a freebie because they already implemented it for K12, since it looks *precisely* like the `dc zva` instruction
<geist> the same way their new global TLB shootdown stuff looks exactly like ARM's
<geist> somewhere i found a more or less official quote from the main AMD designer guy that said yeah Zen was absolutely a late in the game retarget of the existing K12 work
jhagborg has joined #osdev