#osdev on 2022-07-25 — irc logs at libera.irclog.whitequark.org

2021-05-23 01:57 klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books

00:00 <griddle> should those be something the kernel cares much about?

00:00 <wikan> yes if i use it for /etc :D

00:00 <wikan> if i will

00:00 <griddle> on linux /etc/ is parsed by userspace applications and pushed up to the kernel by various means

00:00 <griddle> not sure if you plan on a userspace of some kind

00:01 <wikan> well I learnt Linux a lot. Used BSD, Windows, BeOS.

00:02 <wikan> I just want my os to work "my way"

00:02 <wikan> don't want to write unix-like or dos-like

00:02 <wikan> so it is hard to explain what I have in my head

00:03 <griddle> well good luck to you friend

00:04 <wikan> it is just small os idea :)

00:04 <griddle> they all start that way

00:04 <griddle> mine started as a "kernel" I used when playing around with the KVM api directly

00:05 <griddle> 3 years ago ;)

00:05 <wikan> i am much more stupid yet

00:05 <wikan> i fell like a 14yo kid

00:05 <griddle> I doubt it

00:06 <griddle> starting an os is a lot of work and is a huge design space

00:06 <wikan> fill/fell/dunno

00:06 <griddle> makes sense to think about it for a long time, as it's hard to change it significantly 2 years down the line

00:06 <wikan> i think about it for 10 years

00:06 <wikan> but working on idea for maybe few months

00:07 <griddle> most of my time in my kernel was spent getting stuff to work at the very start. Once stuff works (memory management, disk access, etc) iterating is really swift

00:07 <griddle> even then, it took 2 months to get RISC-V working on top of the existing system

00:07 <klange> 64-bit port over a lunch break, aarch64 in a weekend...

00:07 <griddle> design is the hardest part of programming :)

00:09 <wikan> good i don't need windows :D

00:09 <wikan> no X system required

00:10 <wikan> that would be next thing to write lol

00:10 <griddle> klange: reading the list of arm MSRs takes more than a weekend

00:10 <griddle> too many letters

00:10 <griddle> TPIDRRO_EL2

00:10 <klange> You don't need too many to get your foot in the door on aarch64.

00:10 <wikan> is osdev a source of your knowledge/docs/specs etc?

00:11 <klange> And I had a good reference as a friend of mine is rather well known in the space of Apple M1 bringup ;)

00:11 ripmalware_ has joined #osdev

00:11 <griddle> :)

00:12 <griddle> yeah I imagine knowing an arm expert helps

00:13 <griddle> that arm expert in particular :0

00:13 ripmalware__ has quit [Ping timeout: 240 seconds]

00:14 <PapaFrog> Have you ever forgotten writing code so badly you run it through a plagiarism checker just to be sure?

00:14 <griddle> i forget writing my entire x86_64 boot process

00:15 <clever> i once remembered writing an entire feature, but couldnt find a trace of the code on any of my systems

00:15 <clever> my only explanation is that i wrote it in my dreams, lol

00:16 <wikan> hmmm. maybe i would use lucid dream to design something

00:16 <griddle> work/life balance is important folks :)

00:17 <wikan> am I the only one without degree?

00:18 <griddle> klange: if I already have a portable kernel (I support x86_64 and rv64) how much of a pita is getting aarch64 working?

00:18 * kof123 <--

00:19 <griddle> I just want to dev on my M1 w/o the TCG slowdown

00:20 * wikan send thanks to everybody and waves to say goodbye

00:20 wikan has quit [Quit: WeeChat 3.0]

00:21 knusbaum has quit [Quit: ZNC 1.8.2 - https://znc.in]

00:23 knusbaum has joined #osdev

00:32 <geist> oh hey, just noticed qemu-x86_64 with -cpu max supports 5 level page tables

00:32 <geist> or at least it reported 57 bits for vaddr

00:32 <griddle> !

00:33 <griddle> has intel/amd mentioned upcoming hardware with it?

00:33 <geist> oh absolutely. it's already shipped on intel hardware

00:33 <moon-child> intel has already released hardware with it, haven't they?

00:33 <geist> a year or two ago

00:33 <moon-child> yea

00:33 <griddle> oh is it on 12th gen or is it restricted to xeons?

00:34 <geist> i forget, to be honest. probably maybe and yes

00:34 <griddle> https://lwn.net/Articles/717293/ related?

00:34 <bslsk05> lwn.net: Five-level page tables [LWN.net]

00:34 <geist> easy enough to test, the feature bit is....

00:34 <geist> probably

00:35 <moon-child> '5-level paging is implemented by the Ice Lake microarchitecture'

00:35 <geist> looks like cpuid leaf 7, ECX[16]

00:35 <moon-child> so maybe also laptops?

00:35 <griddle> what is the CRX bit called?

00:35 <griddle> PAEE? :)

00:35 <geist> CR4.LA57 it seems
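
The feature test discussed here can be sketched in C. This is a minimal sketch, not code from anyone in the channel: the function names are made up for illustration, and the raw register values are passed in as parameters so the decoding can be exercised without executing CPUID. It assumes the bit positions from Intel's documentation, where CPUID leaf 7 (subleaf 0) reports 5-level paging in ECX bit 16 and the control bit is CR4 bit 12, named LA57 in the manuals.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.(EAX=07H,ECX=0):ECX bit 16 advertises 5-level paging (LA57). */
#define CPUID7_ECX_LA57 (1u << 16)
/* CR4 bit 12 (LA57) selects 5-level paging in long mode. */
#define CR4_LA57 (1ull << 12)

/* leaf7_ecx: the ECX value returned by CPUID leaf 7, subleaf 0.
 * Hypothetical helper name, for illustration only. */
static inline bool cpu_has_la57(uint32_t leaf7_ecx) {
    return (leaf7_ecx & CPUID7_ECX_LA57) != 0;
}

/* cr4: the current contents of control register CR4. */
static inline bool la57_enabled(uint64_t cr4) {
    return (cr4 & CR4_LA57) != 0;
}
```

A kernel would feed these from its own CPUID and CR4 accessors; support being reported does not mean the mode is switched on, which is why the two checks are separate.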

00:35 <mrvn> What ever would you have that in laptops for?

00:35 <geist> mrvn: oh silly. you know that everyone always asks that question

00:35 <griddle> very few practical reasons

00:36 <griddle> outside of wanting to address remote memory, that many bits isn't useful practically

00:36 <mrvn> I mean I definitely would want it to play with. But practically it would just waste power.

00:36 <griddle> https://www.usenix.org/conference/osdi20/presentation/ruan something like this might use it

00:36 <bslsk05> www.usenix.org: AIFM: High-Performance, Application-Integrated Far Memory | USENIX

00:36 <geist> insert <why would anyone ever want $NEW_THING> answer

00:37 <mrvn> griddle: again: laptop? You plan to carry around a unch of laptops all interconnected for far memory?

00:37 <griddle> the answer in this case is virtualizing "infinite" PGAS

00:37 <geist> anyway finally got around to adding proper feature bit detection, and actually reading all these things out of various machines

00:37 <mrvn> s/unch/bunch/

00:37 <klange> geist: My understanding was it was actually tested _in QEMU_ before showing up in real hardware?

00:37 <geist> was surprised to notice that

00:37 <geist> yah, probably so

00:38 <geist> finally decoding all the xsave size bits and whatnot. it's not as bad as i thought

00:38 <griddle> the practical reason it's in laptops is probably because intel is lazy and doesn't want 2 mmu impls

00:38 <mrvn> klange: officially by intel or just some user?

00:38 <griddle> linux will probably not enable

00:38 <griddle> for power reasons or whatever

00:38 <griddle> since it doesn't seem to give a new huge page size

00:38 <moon-child> I don't know if it's actually in laptops, just guessed it might be

00:38 <geist> the practical reason those things get killed in consumer hardware right now is the big.LITTLE nonsense they have now

00:38 <moon-child> since there were some icelake mobile processors

00:38 <geist> ie, AVX512 getting killed in performance cores because the goldmont efficiency cores dont have it

00:38 * geist rolls eyes

00:38 <griddle> yeah

00:38 <geist> come on intel, do a better job

00:39 <griddle> they managed to put avx512 in KNL cores

00:39 <moon-child> yeppp :/

00:39 <klange> mrvn: Supposedly, officially by Intel.

00:39 <griddle> idk what they are smoking over there

00:39 liz has quit [Ping timeout: 244 seconds]

00:39 <mrvn> geist: we need more work for supporting big.LITTLE with different feature sets.

00:39 <griddle> not like anyone wrote code to use avx512 anywho

00:39 <geist> they're smoking 'some VP says cram these two pieces of tech together now'

00:39 <moon-child> griddle: why do you think no one wrote code for avx512?

00:39 <griddle> cause nobody paid for icc :^)

00:39 <moon-child> I mean, m1 does fine for itself. 14 pipes on firestorm, just 6 or 7 on icestorm, but same semantics

00:39 <griddle> but also, nobody likes 16 character op codes

00:40 <griddle> p sure the pipeline is smaller on the intel cores as well

00:40 <griddle> don't think that matters from a programmer perspective though

00:40 <mrvn> I've been thinking along the line of compilers supporting function overloads with march=avx512. You could put those overloads into different physical pages at identical offsets and then map the right pages into the address space when swapping cores.

00:40 <geist> i was a little disappointed to see that M2 didn't apparently pick up SVE yet

00:40 <geist> yeah M1s are doing okay, but the NEON 128 bit fpu is looking a bit weak right now

00:40 <griddle> M2 is based on A15 I think

00:40 <geist> though apple is good about just diverting attention elsewhere

00:40 <moon-child> m2 seems like a pretty marginal upgrade over m1. Might be in m3 or 'm2 max'

00:41 <griddle> they are about 2y out dated

00:41 <griddle> p sure the only arch change is nested virt?

00:41 <geist> yah, it's whenever they officially pick up ARMv9, where it's mandatory to support SVE

00:41 <moon-child> geist: m1 is similar performance to avx2. avx2 is double the vector width, but m1 has twice the ports, so it balanced

00:41 <mrvn> So before migrating the kernel would wait for the code to leave the march= pages and then remap them for big or LITTLE.

00:41 <moon-child> *balances

00:41 <mrvn> insane?

00:41 <geist> i have been meaning to look at the new cpuid leaf that describes big.little

00:41 <geist> probably documented in intels stuff now

00:42 <griddle> yeah, also, big.LITTLE's usage is incredibly boring

00:42 <geist> may be https://www.sandpile.org/x86/cpuid.htm#level_0000_001Bh ?

00:42 <bslsk05> www.sandpile.org: sandpile.org -- x86 architecture -- CPUID

00:42 <griddle> i don't have an answer, but "putting daemons on the little" cores is not intellectually interesting to me

00:42 <geist> ah no PCONFIG is some trusted execution nonsense

00:43 <klange> https://patchwork.kernel.org/project/qemu-devel/patch/20161215001305.146807-1-kirill.shutemov@linux.intel.com/

00:43 <bslsk05> patchwork.kernel.org: x86: implement la57 paging mode - Patchwork

00:43 <klange> ^ Intel contributed the 5-level page table support to QEMU, so that it could be used to test later patches to Linux.

00:44 <geist> makes sense. they almost certainly have some internal emulator/simulator. i remember fiddling with it once for work

00:44 <geist> but it's expensive and whatnot, so they need open source versions before real people play with it

00:44 <griddle> i feel bad for intel engineers using pintool

00:44 <griddle> off topic

00:44 <griddle> :^)

00:45 <griddle> Why is it that IPIs are just so slow on all systems

00:45 <griddle> coherence messages are fast, why are IPIs not

00:45 <mrvn> does qemu have any support for instrumentation so it can be run against real hardware for comparison? I remember bochs had this.

00:45 <moon-child> coherence messages are fast??

00:45 <griddle> comparatively

00:45 <moon-child> contended ops are hundreds of clocks

00:46 <griddle> IPIs are thousands somehow

00:46 <griddle> they may be going over MSI on x86

00:46 <griddle> but on my sifive unmatched board, they are still dog slow

00:46 <griddle> (outside of the slowness of that board)

00:47 <mrvn> griddle: because they have to get the actual opcode stream into a consistent state?

00:47 <griddle> Is there a fundamental limitation that IPIs run into?

00:47 <griddle> in theory, does an IPI need to abort the speculative execution?

00:47 <griddle> or could it allow it to finish

00:48 <griddle> they're inherently async, so the actual delivery point doesn't quite matter

00:48 <mrvn> the IPI could sleep for a minute and only then trigger, if you want the delay.

00:48 <moon-child> griddle: does it matter? Either way, you're waiting for things to resolve

00:49 <griddle> I mean, it'd be nice if TLB shootdowns were slower

00:49 <griddle> s/slower/faster

00:50 <geist> much nicer if they just dont exist, but alas riscv still has the same problem x86 has (without the new amd extensions)

00:50 <griddle> I've been told by hardware folks that implementing tlb shootdown in hardware is "dumb", because the "kernel can do that for us"

00:50 <griddle> which amd ext?

00:50 <mrvn> just handle shootdowns in hardware. no need to stop the opcode stream at all.

00:50 <geist> ah yes the age old tension between hardware and software

00:50 <geist> griddle: AMD has new cross TLB shootdown extensions in Zen3+

00:50 <griddle> I mean, tlb shootdowns are a *significant* bottleneck in parallel systems

00:51 <geist> basically more or less a clone of the ARM64 ones with a few extensions

00:51 <clever> i was just thinking, could you have an async tlb shootdown, that you can poll for completion of?

00:51 <griddle> https://dl.acm.org/doi/10.1145/3342195.3387518

00:51 <bslsk05> dl.acm.org: Don't shoot down TLB shootdowns! | Proceedings of the Fifteenth European Conference on Computer Systems

00:51 <geist> maybe not poll, but you can have async yes. that's intrinsically how amd64 works

00:51 <griddle> clever: on topic

00:51 <clever> so you can issue a shootdown, sleep all threads that depend on it, and then schedule something entirely different

00:51 <clever> so the cpu doesnt stall waiting for it
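
clever's idea of an asynchronous shootdown that the initiator polls for completion can be sketched with a simple acknowledgement counter. This is a rough sketch under stated assumptions, not how any particular kernel does it: `struct shootdown` and the three functions are hypothetical names, sending the IPIs and doing the actual TLB invalidation are left out, and C11 atomics stand in for whatever primitives a kernel would use.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One in-flight shootdown request (hypothetical structure). */
struct shootdown {
    atomic_uint pending; /* remote CPUs that have not acknowledged yet */
};

/* Initiator: record how many CPUs must respond. A real kernel would
 * send the IPIs right after this and then go schedule other work. */
static inline void shootdown_begin(struct shootdown *s, unsigned remote_cpus) {
    atomic_store_explicit(&s->pending, remote_cpus, memory_order_release);
}

/* Remote CPU: called from its IPI handler after invalidating its TLB. */
static inline void shootdown_ack(struct shootdown *s) {
    atomic_fetch_sub_explicit(&s->pending, 1, memory_order_acq_rel);
}

/* Initiator: non-blocking poll. Threads that depend on the unmap stay
 * asleep until this returns true. */
static inline bool shootdown_done(struct shootdown *s) {
    return atomic_load_explicit(&s->pending, memory_order_acquire) == 0;
}
```

The scheduler would check `shootdown_done()` at some convenient point (for example on the next tick) instead of spinning, which is the "schedule something entirely different" part of the idea.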

00:52 <geist> arm64 even

00:52 <griddle> I feel like shootdowns are in the awkward latency area of "too long to spin" but also "not long enough to reschedule"

00:52 <mrvn> clever: why stop other threads?

00:52 <moon-child> this is why we should have \inf hyperthreads

00:52 <moon-child> make the cpu do the scheduling!

00:52 <clever> mrvn: any thread within that virtual space, where you did munmap

00:53 <moon-child> (kidding, mostly)

00:53 <mrvn> clever: unless they have a synchronization point the time when the unmap (aka shootdown) happens is undetermined.

00:53 <griddle> didn't intel try to do the hardware context switching

00:53 <griddle> but nobody liked it

00:53 <griddle> (intel did the typical limited table size thing they often do)

00:53 <mrvn> griddle: only because it was slower

00:54 <griddle> im sure intel also limited the entries to 16 or some other useless number

00:54 <griddle> like they did with MPX

00:54 <mrvn> griddle: you can have as many task gates as you like

00:55 <griddle> ah

00:55 <griddle> my bad. It's also only around on IA32 :)

00:55 <griddle> which honestly would have been an awesome architecture

00:55 <griddle> outside of the intel-isms

00:56 <griddle> (mixed up IA32 and IA64 in my head again)

00:59 <geist> hardware tasking was more or less effectively dead on the vine. probably by 386 and definitely by 486 i dont think too many systems used it if they wanted to be efficient

00:59 <geist> aside from the 'you need it for #NMI or #DF' stuff

00:59 <geist> and yeah i think early linux used it, but then it wasn't yet optimized

00:59 <mrvn> geist: I wonder if modern CPUs wouldn't have made it faster than switching manual.

01:00 <griddle> I also feel like it's a problem because the kernel might want some feature that the hardware doesn't provide

01:00 <geist> i did actually implement some code to use it a while back and tested it on some modern hardware. it was *incredibly* slow

01:00 <geist> but then they explicitly say not to use it so it's clearly unoptimized

01:00 <griddle> like, sure the hardware can do it faster than software, but it's also hardware

01:00 <mrvn> griddle: they solved that for xsave

01:00 <geist> funny i'm literally writing that code right now to parse all of those feature bits

01:01 <geist> https://www.irccloud.com/pastebin/ANcw8wBd/

01:01 <bslsk05> IRCCloud pastebin | Raw link: https://irccloud.com/pastebin/raw/ANcw8wBd

01:01 <geist> etc

01:01 <griddle> how did xsave solve it?

01:01 <geist> that's when i noticed the 5 level paging, since the cpuid leaf that reports the vaddr size says 57
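
The address-width readout geist mentions can be sketched as follows. This is an illustrative sketch rather than the code from the pastebin: `decode_addr_widths` and its struct are hypothetical names, and the EAX value is passed in so nothing here executes CPUID. It assumes the documented layout of CPUID leaf 0x80000008, where EAX[7:0] is the physical address width and EAX[15:8] is the linear (virtual) address width.

```c
#include <stdint.h>

/* Address widths reported by CPUID leaf 0x80000008 (hypothetical struct). */
struct addr_widths {
    unsigned phys_bits; /* EAX[7:0]  */
    unsigned virt_bits; /* EAX[15:8] */
};

/* eax: the EAX value returned by CPUID leaf 0x80000008. */
static inline struct addr_widths decode_addr_widths(uint32_t eax) {
    struct addr_widths w;
    w.phys_bits = eax & 0xff;
    w.virt_bits = (eax >> 8) & 0xff;
    return w;
}
```

A reported virtual width of 57 rather than 48 is the hint that 5-level paging is available.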

01:02 <griddle> put professional reverse engineer on your resume

01:02 <mrvn> griddle: you configure what to save in XCR0 and EDX:EAX
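
What mrvn describes can be written down in one line: XSAVE only touches the state components that are enabled in XCR0 and also requested in the EDX:EAX instruction mask. A minimal sketch, with a hypothetical helper name and the register values passed in as plain integers:

```c
#include <stdint.h>

/* Returns the set of state components an XSAVE with the given EDX:EAX
 * mask would operate on: the bitwise AND of XCR0 and the mask.
 * (xsave_requested_features is a made-up name for illustration.) */
static inline uint64_t xsave_requested_features(uint64_t xcr0,
                                                uint32_t eax, uint32_t edx) {
    uint64_t mask = ((uint64_t)edx << 32) | eax;
    return xcr0 & mask;
}
```

For example, with XCR0 = 0x7 (x87, SSE and AVX enabled) and an all-ones mask, the result is 0x7; components the OS never enabled in XCR0 are not saved no matter what the mask asks for.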

01:02 <griddle> Sure it works in that sense

01:03 <geist> yep and that cpuid leaf i'm parsing right now tells you of all the optional things you can save in it how much space they occupy and what offset into the saved bits
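
The per-component lookup geist describes can be sketched like this. It is not the code from the pastebin: the struct and function names are hypothetical, and the register values are parameters so the decoding is testable off-target. It assumes the documented layout of CPUID leaf 0xD for subleaf i >= 2: EAX is the component's size in bytes, EBX its offset in the standard (non-compacted) save area, ECX bit 0 marks a supervisor component and ECX bit 1 requests 64-byte alignment in the compacted format.

```c
#include <stdbool.h>
#include <stdint.h>

/* One XSAVE state component as described by CPUID.(EAX=0DH,ECX=i), i >= 2. */
struct xsave_component {
    uint32_t size;       /* bytes occupied by the component */
    uint32_t offset;     /* offset in the standard-format save area */
    bool     supervisor; /* managed through IA32_XSS rather than XCR0 */
    bool     align64;    /* 64-byte aligned in the compacted format */
};

static inline struct xsave_component
decode_xsave_component(uint32_t eax, uint32_t ebx, uint32_t ecx) {
    struct xsave_component c;
    c.size = eax;
    c.offset = ebx;
    c.supervisor = (ecx & 1u) != 0;
    c.align64 = (ecx & 2u) != 0;
    return c;
}

/* First byte past the component in the standard format. */
static inline uint32_t xsave_component_end(struct xsave_component c) {
    return c.offset + c.size;
}
```

As a sanity check, the AVX component (subleaf 2) is 256 bytes at offset 576, directly after the 512-byte legacy area and the 64-byte XSAVE header.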

01:03 <griddle> but lets say you want to optimize some IPC or something between two processes

01:03 <griddle> letting the hardware do the switching takes a lot of control away from the kernel

01:03 <mrvn> geist: the task gate should do a burst write/read of all the state as opposed to the software doing lots of little push/pop. Should give some benefit.

01:03 <griddle> and it's the 1% percentiles that kill performance

01:03 <geist> mrvn: sure, except it doesnt. but thats because they have defacto killed the feature

01:04 <mrvn> geist: sure. I doubt it does any bursts at all.

01:04 <geist> worse it seems to probably dump the pipeline and run through a series of slow ass microcode

01:05 <geist> and who knows if/how it's interruptable or whatnot

01:05 lkurusa has quit [Quit: I probably fell asleep (or went out). Who will ever know.]

01:05 <geist> when i was timing it i was seeing like orders of magnitude worse performance using HW task switching than the equivalent instruction sequence on modern hardware

01:05 <geist> on classic hardware (ie, a 386) it was merely slower

01:05 <mrvn> geist: what wonders me a bit is that you have XSAVE. Which is basically the same but as opcode.

01:05 <geist> yes and no. xsave has a bunch of saved internal state to optimize what it writes out

01:06 <mrvn> why doesn't a task gate fall into the xsave code path?

01:06 <geist> ie, doesn't write out or load bits if it knows the registers are zeros and whatnot

01:06 <geist> go ask intel.

01:06 <griddle> i'm sure it's MMU related

01:06 <geist> but again the task gate is dead, so it's an irrelevant discussion

01:06 <geist> as in it's literally unimplemented for 64bit code, etc etc

01:06 <griddle> the complexity of the MMU went through the roof when they moved past i386

01:07 <griddle> from a hardware perspective, you probably also don't want to have to maintain and test 2 codepaths to do context switching

01:07 <griddle> (interrupts and task switch)

01:07 <mrvn> It could be nice to have a hardware scheduler, something trivial like round-robin. Make a circular linked list of tasks and the CPU runs them as hyperthreads.

01:07 <griddle> though the other question is why this can't just be crammed into SMT

01:08 <griddle> mrvn: isn't this kinda what gpus do?

01:08 <mrvn> whenever a task blocks it could switch to the next

01:08 <mrvn> griddle: I believe so.

01:09 <griddle> a problem might come up with how intel does ASID in the TLB

01:09 <griddle> pretty sure hyperthreads share a TLB

01:09 <griddle> so if you cram a bunch of hyperthreads into a single HART, you'll need to extend ASID

01:09 <mrvn> it's also a lot of registers if you have more hyperthreads

01:10 <griddle> what does the core do if all threads are blocked on memory ops?

01:10 <mrvn> griddle: I wouldn't assume you could have more hyperthreads than ASID.

01:10 <mrvn> griddle: wait for memory.

01:10 <griddle> last time intel had more than 2 hyperthreads was KNL

01:10 <griddle> they needed to do barrel processing

01:10 <clever> barrel processing?

01:11 <griddle> round-robin each cycle between threads

01:11 <griddle> each hyperthread is effectively 4x slower

01:11 <clever> ah, thats very much like the shader core on the rpi

01:11 <mrvn> But these would be virtual hyperthreads. So even with 2 real ones you would put 128 threads in the ring.

01:11 <clever> where the pipeline is 4 clocks long, and is ALWAYS running 4 different threads

01:11 <griddle> but, if your application is memory bound (each KNL core had avx512, so that was the desired memory model)

01:11 <griddle> s/memory/usage

01:12 <mrvn> griddle: there are still opcodes that don't use memory.

01:12 <griddle> true

01:12 <clever> but the shader core is more restricted, in that all 4 threads must be running the same opcode, and any stalling is shared across all 4 threads

01:12 <griddle> those are expensive

01:12 <griddle> it's not every cycle, it's more like every N cycles

01:12 <griddle> not sure what N is

01:13 <mrvn> griddle: those would become cheap. There is no waiting for a branch condition or speculative execution. By the time the thread runs the next opcode the last one will have finished no matter what.

01:13 <griddle> I have access to a few KNL boxes thru my lab. Running coremark on it

01:13 <mrvn> If you have 64 virtual threads and 4 hyperthreads then anything that finished in 16 cycles has no wait.
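
mrvn's arithmetic can be checked with a tiny model. This is only a sketch of the idea as stated in the channel, under the assumption of a strict round-robin where every hardware thread slot rotates evenly through its share of the virtual threads; the function names are invented for illustration.

```c
/* Cycles between two consecutive issue slots of the same virtual thread,
 * assuming hw_threads is non-zero and divides virtual_threads evenly. */
static inline unsigned issue_interval(unsigned virtual_threads,
                                      unsigned hw_threads) {
    return virtual_threads / hw_threads;
}

/* Extra cycles a thread would have to wait if its previous operation
 * takes `latency` cycles but its next slot comes after `interval`. */
static inline unsigned stall_cycles(unsigned latency, unsigned interval) {
    return latency > interval ? latency - interval : 0;
}
```

With 64 virtual threads on 4 hardware threads the interval is 16 cycles, so anything that completes within 16 cycles is finished before the thread's next turn and costs no stall.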

01:14 <griddle> 4406, which is what my kernel inside a RISC-V VM, inside an aarch64 vm gets

01:14 <griddle> mrvn: what's the policy on switching to a new virtual thread?

01:14 <griddle> s/what's/what might be/

01:14 <mrvn> griddle: as you suggested every cycle.

01:15 <griddle> I think part of the reason why we haven't seen something like this is that the kernel can make better sense of the system than the hardware can, right?

01:15 <clever> one trick barrel processing gives with the QPU, is that you never need to stall because 2 opcodes depend on each other too early

01:15 <griddle> it's pretty rare that you have that much CPU over-provisioning

01:16 <clever> because each opcode takes 4 clock cycles to run, and has fully cleared the pipeline before the next begins

01:16 <mrvn> griddle: don't think so. The problem is that to switch 4 threads every cycle you have to have all the state of 64 threads in the CPU at all time.

01:17 <mrvn> 64 sets of regular registers, 64 sets of AVX512 registers. That gets big.

01:17 <griddle> would probably pull an intel and kill avx512 again :)

01:18 <clever> mrvn: i think the QPU has 192 copies of all of the "scalar" registers, the vector stuff is shared

01:18 <mrvn> clever: yeah, but that's probably less than 64x AVX512

01:18 <griddle> but like, the kernel can determine usage patterns of an app, priority levels (which may be a new representation for your OS), etc

01:18 <clever> mrvn: yeah, its only something like 192 x 32 x 32bit regs
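
The size comparison can be made concrete. A back-of-the-envelope sketch: the QPU figure is clever's own rough number (192 x 32 x 32-bit), and the AVX-512 side assumes mrvn's hypothetical 64 resident threads with the 32 ZMM registers of 512 bits each that 64-bit mode provides, ignoring mask registers and all other state. `regfile_bytes` is a made-up helper.

```c
#include <stddef.h>

/* Bytes needed to hold `contexts` copies of `regs` registers that are
 * `bits_per_reg` wide each. */
static inline size_t regfile_bytes(size_t contexts, size_t regs,
                                   size_t bits_per_reg) {
    return contexts * regs * (bits_per_reg / 8);
}
```

`regfile_bytes(192, 32, 32)` is 24576 bytes (24 KiB), while `regfile_bytes(64, 32, 512)` is 131072 bytes (128 KiB), so the vector state alone would be more than five times the QPU figure.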

01:18 <griddle> hardware can know about stall times and whatnot, but if you want to encode something new, you need new hardware

01:19 <griddle> forgive me, but what is the QPU again?

01:19 <mrvn> griddle: I always found that that is mostly pointless and when you actually need it more wrong than right

01:19 <clever> griddle: QPU is the shader core on all rpi models

01:19 <griddle> ahh okay

01:20 <griddle> what is the Q?

01:20 <clever> quad processing unit

01:20 <clever> its got a lot of quad'ing going on

01:20 <clever> 4 threads/reg-banks, sharing 1 pipeline, in a barrel processing manner

01:20 <griddle> ah so its a marketing name

01:20 <clever> with the restriction that all 4 threads are running the same opcode

01:20 <griddle> not particularly descriptive :)

01:20 <mrvn> 4 values in a vector?

01:20 <clever> mrvn: 16 i think

01:21 <clever> i'm fuzzy on the exact details

01:21 <mrvn> 4 squared, double the power

01:21 <griddle> basically, the "warp" size is 4

01:21 h4zel has quit [Ping timeout: 268 seconds]

01:21 <clever> nearly all of the docs treat it as a scalar core

01:21 <mrvn> Note 16 is exactly what you need for 3D graphics.

01:21 <clever> but behind the scenes, its a 16 lane vector, and runs your program 16 times in parallel

01:22 liz has joined #osdev

01:22 <griddle> 16 b/c 4x4 projection matrix?

01:22 <mrvn> that would be the one

01:22 <clever> i dont think its doing that

01:22 <clever> each lane is computing a separate vertex

01:23 <clever> so for every 4 clock cycles, it makes 1 opcode of progress, on 4 separate vertices, and then things get a bit fuzzy

01:24 <clever> the way i originally learned it, is that 4 pipelines are sharing 1 opcode decoder, taking turns decoding an opcode

01:24 <clever> because each only has to decode an opcode every 4 cycles

01:24 <clever> but that doesnt fit with the 16 lane nature, and running the same opcode in all 16 lanes

01:26 <griddle> gpus are scary to me

01:26 <griddle> you often need so much support around them to do anything

01:27 <griddle> a gpu driver often requires you have some kind of compiler as well

01:27 <griddle> or at least some kind of abstract command queue you can compile

01:27 <clever> griddle: https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/v3d/v3d.c#L143-L153

01:27 <bslsk05> github.com: lk-overlay/v3d.c at master · librerpi/lk-overlay · GitHub

01:28 <clever> griddle: this file is a complete driver for the rpi's v3d core, and can render a single polygon

01:28 <clever> the linked sub-section is a pre-assembled shader, with no compiler being involved

01:28 <clever> extending it to support more polygons is easy enough

01:28 <griddle> but then if you want to add a shader or something

01:29 <griddle> using a GPU for its 2d accel is easy enough right?

01:29 <clever> yeah

01:29 <griddle> but having some kind of abstract "I want to render a triangle using this shader from userspace" is quite the effort

01:29 <clever> but the rpi also has a dedicated 2d core

01:29 <moon-child> well, 2d _acceleration_ pretty much just goes through the same 3d pipeline

01:30 <moon-child> (on mainstream hardware, at least)

01:30 <clever> not on the rpi

01:30 <clever> https://www.youtube.com/watch?v=suswjbpR1HU

01:30 <bslsk05> 'HVS scaling animation test' by michael bishop (00:00:26)

01:30 <griddle> merging them makes sense if you build silicon

01:30 <moon-child> clever: yea but in general

01:30 <clever> the 3d core has its own power domain and can be turned off to extend battery life

01:31 <griddle> auxiliary question

01:31 <griddle> I wish I had the patience to develop for real hardware

01:31 <griddle> but I end up waiting so long for netboot or PXE or whatever

01:31 <griddle> is that just a given

01:32 <clever> moon-child: if you want 2d accel on the 3d core, you just need to generate 3 tris, a total of 4 vertices, and pass it UV's for the texture

01:32 <clever> the harder part, is using multiple textures in a single frame, ive not figured that out yet

01:32 <clever> a texture atlas solves that

01:32 <moon-child> why 3 tris?

01:32 <griddle> hard to give a user app a region of an atlas :)

01:32 <clever> 2 tris, typo
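
The "2 tris, 4 vertices, plus UVs" recipe looks roughly like this as data. This is a generic sketch, not the vertex format the V3D hardware or clever's driver actually uses: the struct layout, the unit-square coordinates and the winding order are all assumptions chosen for illustration.

```c
/* One vertex of a textured 2D quad: position plus texture coordinate.
 * Hypothetical layout; real hardware formats differ. */
struct quad_vertex {
    float x, y; /* position */
    float u, v; /* texture coordinate */
};

/* Four corners of a unit quad, with UVs covering the whole texture. */
static const struct quad_vertex quad_vertices[4] = {
    { 0.0f, 0.0f, 0.0f, 0.0f }, /* 0: top-left */
    { 1.0f, 0.0f, 1.0f, 0.0f }, /* 1: top-right */
    { 0.0f, 1.0f, 0.0f, 1.0f }, /* 2: bottom-left */
    { 1.0f, 1.0f, 1.0f, 1.0f }, /* 3: bottom-right */
};

/* Two triangles sharing the 1-2 diagonal: 6 indices into 4 vertices. */
static const unsigned char quad_indices[6] = { 0, 1, 2, 2, 1, 3 };
```

Pointing the UVs at a sub-rectangle instead of the full 0..1 range is how a texture atlas would be addressed.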

01:33 <moon-child> ok

01:33 <clever> griddle: yeah

01:33 <griddle> some GPUs have quads right?

01:33 <clever> yeah

01:33 <griddle> do those "desugar" to tris

01:33 <clever> on the rpi, yeah

01:33 <griddle> cool

01:34 <clever> griddle: on the rpi 3d core, it supports points, lines, line_loop, triangles, triangle_strip, and triangle_fan

01:35 <clever> strip and fan are a form of dedup, where you are reusing some vertex data from the previous line/tri
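
The dedup clever mentions can be illustrated by counting triangles and by expanding a fan back into a plain triangle list. A small sketch that follows the usual OpenGL conventions for these primitive types; whether the V3D core orders fan vertices exactly this way is an assumption, and the function names are invented.

```c
#include <stddef.h>

/* Triangles produced from n vertices for each primitive type. */
static inline size_t tri_count_list(size_t n)  { return n / 3; }
static inline size_t tri_count_strip(size_t n) { return n >= 3 ? n - 2 : 0; }
static inline size_t tri_count_fan(size_t n)   { return n >= 3 ? n - 2 : 0; }

/* Expand a fan of n vertices into triangle-list indices. Every triangle
 * reuses vertex 0 and the previous triangle's last vertex.
 * `out` must have room for 3 * (n - 2) entries.
 * Returns the number of indices written. */
static inline size_t fan_to_list(size_t n, unsigned *out) {
    size_t written = 0;
    for (size_t i = 1; i + 1 < n; i++) {
        out[written++] = 0;
        out[written++] = (unsigned)i;
        out[written++] = (unsigned)(i + 1);
    }
    return written;
}
```

Four vertices give one triangle as a list but two as a strip or fan, which is where the savings come from.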

01:35 <moon-child> no indices?

01:35 <moon-child> cf opengl ebo

01:36 <clever> it does have index support

01:36 <clever> https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/v3d/v3d.c#L408-L411

01:36 <bslsk05> github.com: lk-overlay/v3d.c at master · librerpi/lk-overlay · GitHub

01:37 <clever> moon-child: this says to use vertex 0, 1, and 2, from the vertex data on line 335-386

01:37 <clever> triangle_fan would let you reuse indices

01:37 <clever> it also has a non-index mode, where it just uses every vertex in the list

01:39 <clever> with that example code, you just want to increase the length of the vertex data, the primitiveList, and the sizes passed in ~3 places, and boom, more polygons

01:40 <clever> https://github.com/cleverca22/gl/blob/master/core.c#L231-L244

01:40 <bslsk05> github.com: gl/core.c at master · cleverca22/gl · GitHub

01:40 <clever> moon-child: an example of implementing opengl ontop of this

01:42 <griddle> you're telling me you didn't write a glsl compiler

01:42 <griddle> smh

01:42 <clever> griddle: i never got to that step, all shaders were hand written in asm, and then pre-assembled

01:42 <griddle> :)

01:43 <griddle> impressive regardless

01:45 <clever> griddle: originally, i was trying to implement all of opengl from scratch, in a race, and hit a brick wall when i couldnt figure out vertex shaders

01:45 <clever> i later re-visited that code to get 3d working in baremetal

01:46 <griddle> decided i'm gonna get my aarch64

01:46 <griddle> ... port working

01:47 <geist> ugh. intel really screwed up the big.little thing in cpuid. it looks like it's cpuid leaf 0x1a

01:47 <geist> instead of describing which cores are big and little, they simply defined a 23 bit field in eax that says 'this is an atom' or 'this is a core'

01:47 <geist> with yet another long ass future list of what the bits are supposed to mean. goddamnit intel

01:47 <clever> geist: ive heard that big.little has screwed with anti-cheat software, which thinks your VM is leaking and bans you

01:49 <geist> it doesn't even seem to be laid out interestingly: 10h is reserved, 20h means atom, 30h is reserved, 40h is core

01:49 <geist> not defined as everything else is reserved, no documentation as to what they're going to do with that

01:49 <geist> ie, not future proof at all. does this mean in gen 13 cores they're going to use 0x41? or 0x50? why 20h and 40h?
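
The decoding geist is complaining about can be sketched as follows. This is an illustration, not Fuchsia's code: the enum and function names are hypothetical and the register values are passed in as parameters. It assumes the layout in Intel's documentation, where CPUID leaf 7 (subleaf 0) EDX bit 15 flags a hybrid part, and CPUID leaf 0x1A returns the core type in EAX[31:24] (20h Atom, 40h Core) with a native model ID in EAX[23:0].

```c
#include <stdbool.h>
#include <stdint.h>

enum core_type {
    CORE_TYPE_UNKNOWN, /* reserved or not-yet-defined encoding */
    CORE_TYPE_ATOM,    /* 20h: efficiency core */
    CORE_TYPE_CORE,    /* 40h: performance core */
};

/* leaf7_edx: EDX from CPUID leaf 7, subleaf 0. Bit 15 = hybrid part. */
static inline bool cpu_is_hybrid(uint32_t leaf7_edx) {
    return ((leaf7_edx >> 15) & 1u) != 0;
}

/* leaf1a_eax: EAX from CPUID leaf 0x1A, read on the CPU being queried. */
static inline enum core_type decode_core_type(uint32_t leaf1a_eax) {
    switch (leaf1a_eax >> 24) {
    case 0x20: return CORE_TYPE_ATOM;
    case 0x40: return CORE_TYPE_CORE;
    default:   return CORE_TYPE_UNKNOWN;
    }
}

/* Low 24 bits of leaf 0x1A EAX: the native model ID. */
static inline uint32_t native_model_id(uint32_t leaf1a_eax) {
    return leaf1a_eax & 0x00ffffffu;
}
```

Because the leaf only describes the logical processor that executes it, a kernel has to run this on every CPU and collect the results, which is the loss of "ask any one CPU" that comes up below.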

01:51 <geist> https://www.intel.com/content/www/us/en/developer/articles/guide/12th-gen-intel-core-processor-gamedev-guide.html#:~:text=CPUID%20Hybrid%20Function%20Table

01:51 <bslsk05> www.intel.com: Game Dev Guide for 12th Gen Intel Core Processor Hybrid Architecture

01:51 SpikeHeron has quit [Quit: WeeChat 3.6]

01:52 <geist> gosh i can't think of a more half assed implementation

01:52 <geist> sigh.

01:52 SpikeHeron has joined #osdev

01:53 <griddle> ah intel

01:53 <griddle> all of that implies things can be multiple

01:53 <griddle> atom and core

01:53 <geist> yeah. and that now breaks the implicit assumption that you've been able to make up until now that from *any* cpu in the system you can determine the topology of the rest of the system

01:54 <geist> which i guess was always doomed to fail, but it had a good 30 year run i guess

01:54 <geist> the arm and riscv folks are laughing on the sidelines, because they've never had this luxury to be taken away

01:54 <griddle> CPUID is kinda a goofy feature anyways

01:54 <geist> you wanna know big.little on arm? you gotta be told, in the form of something outside of the cpu

01:55 <griddle> DTB isn't better

01:55 <griddle> but still

01:55 <geist> right

01:55 <griddle> it's better for actual external hardware

01:55 <griddle> it's something that could be burned into rom

01:55 <geist> i was hoping leaf 0x1a would have something like 'there are N clusters, here's how you iterate through them'

01:55 <heat> you need machine mode or the dtb for a cpuid-like thing in riscv

01:56 <geist> and then a mechanism to describe each cluster of cores and what their apic id range is

01:56 <geist> would be perfect

01:56 <griddle> "sounds like something that could go in the acpi tables" - intel

01:56 <geist> there is already a bunch of precedent for this on other cpuid leaves, alas

01:56 <geist> so that's a question: are there more acpi tables for this?

01:56 <griddle> idk

01:56 <griddle> ask microsoft

01:56 <heat> great question

01:56 <griddle> they got special treatment w/ 12th gen

01:57 <geist> i'm bitching about this because i will have to add this exact nonsense to fuchsia as soon as i get ahold of a 12th gen intel

01:57 zaquest has quit [Remote host closed the connection]

01:57 <geist> and so far we've been relying on 'cpu 0 + acpi is enough info to discover everything about topology'

01:58 <griddle> fuchsia doesnt support machines big enough for numa yet right?

01:58 <geist> at least they defined another feature bit elsewhere that says 'this is a hybrid cpu'

01:58 <geist> so you at least know if you should be checking leaf 0x1a

01:58 <geist> yeah not really. we parse the SRAT and SLIT table but dont currently do anything with it

01:58 <geist> but we do make a good attempt to discover the full topology, which we then feed into the scheduler

01:59 <heat> geist, no, not ACPI

01:59 <heat> https://docs.kernel.org/x86/intel-hfi.html

01:59 <bslsk05> docs.kernel.org: 14. Hardware-Feedback Interface for scheduling on Intel Hardware — The Linux Kernel documentation

01:59 <geist> yeah that is a thing too

02:00 <geist> but you have to at least start by assuming you know what the topo is for inter-cpu scheduling decisions

02:00 <griddle> if the cores are homogeneous, it's still conservatively correct to assume everything is the same, no?

02:01 <griddle> this would be strictly for performance

02:01 <griddle> ?

02:01 <geist> in general yes. topology you can use to inform decisions about migrating threads between cores

02:01 <geist> by assigning particular costs to inter-cpu transfers
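
One way to picture "assigning costs to inter-cpu transfers" is the sketch below. The struct layout, the three topology levels, and the cost numbers are all made up for illustration; this is not how fuchsia or any other scheduler actually weighs migrations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-CPU topology record. */
struct cpu_topo {
    uint8_t package;  /* socket */
    uint8_t cluster;  /* group of cores sharing a cache */
    uint8_t core;     /* physical core; SMT siblings share this value */
};

/* Illustrative costs only: farther apart means more expensive. */
enum {
    COST_SAME_CORE     = 1,   /* SMT sibling */
    COST_SAME_CLUSTER  = 2,
    COST_SAME_PACKAGE  = 4,
    COST_CROSS_PACKAGE = 16
};

/* Cost of migrating a thread from one CPU to another. */
unsigned migration_cost(const struct cpu_topo *from, const struct cpu_topo *to)
{
    assert(from && to);
    if (from->package != to->package)
        return COST_CROSS_PACKAGE;
    if (from->cluster != to->cluster)
        return COST_SAME_PACKAGE;
    if (from->core != to->core)
        return COST_SAME_CLUSTER;
    return COST_SAME_CORE;
}
```

With homogeneous cores and no topology information, every pair simply gets the same cost, which matches griddle's point that assuming "everything is the same" stays correct and only costs performance.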

02:01 <heat> https://uefi.org/specs/ACPI/6.4/ PSA: uefi.org now has ACPI docs in html form

02:01 <heat> love it

02:01 <bslsk05> uefi.org: Advanced Configuration and Power Interface (ACPI) Specification — ACPI Specification 6.4 documentation

02:04 h4zel has joined #osdev

02:04 <geist> heat: oh that's nice. also huh didnt notice the MSI frame table in the MADT

02:04 <geist> that describes how the GICv2 MSI works.

02:04 zaquest has joined #osdev

02:05 <geist> as a side note the GIC CPU fields for ARM in the acpi spec *does* have a field that describes big.little

02:05 <geist> though it's not that meaningful. basically just an 8 bit number whose relative value is meaningless, but tells you which cpus are more uber than other ones

02:06 <geist> https://uefi.org/specs/ACPI/6.4/05_ACPI_Software_Programming_Model/ACPI_Software_Programming_Model.html#multiple-apic-description-table-madt:~:text=Processor%20Power%20Efficiency%20Class

02:06 <bslsk05> uefi.org: 5. ACPI Software Programming Model — ACPI Specification 6.4 documentation

02:07 <geist> oh hey, there's a new MADT table entry for 'multiprocessor wakeup structure'. looks like precisely a description of how to use the raspberry pi 4 style wakeup

02:09 <griddle> think we will ever get aarch64 cores which are socketed?

02:09 <geist> well, technically they exist, but i assume you mean in a consumer format

02:09 <griddle> I feel like so much of arm is designed in a way that assumes you won't change the CPU on a motherboard

02:10 <griddle> My lab just bought one of the big ampere machines, but still

02:10 <geist> i have a https://www.servethehome.com/cavium-thunderx2-review-benchmarks-real-arm-server-option/ workstation right here with two socketed arm cores. and i'm sure there are server ones that go through a socket

02:10 <bslsk05> www.servethehome.com: Cavium ThunderX2 Review and Benchmarks a Real Arm Server Option

02:10 <geist> yah

02:11 <griddle> they all boot on UEFI right

02:11 <geist> yup

02:11 <griddle> I guess since you wont be able to mix ampere and tx2 chips, they can control all this

02:12 <griddle> but still, self-discovery w/o dtb will probably be hard on arm for a while

02:12 <heat> geist, that multiprocessor wakeup thing looks pretty cool

02:12 <geist> yeah looks like it's useful even on x86, saves you the trouble of resetting and doing the whole bootstrap

02:12 <heat> you wake it up (presumably, the other cores are mwaited on that address)

02:12 <geist> yah

02:12 <heat> then it puts you straight in long mode

02:13 <heat> have you seen https://uefi.org/sites/default/files/resources/Platform%20Runtime%20Mechanism%20-%20with%20legal%20notice.pdf ?

02:14 <heat> PRM is the new thing intel and microsoft are pushing to replace some good chunks of SMM

02:14 lkurusa has joined #osdev

02:16 <griddle> I'm gonna head out for the night. Been nice chatting yall

02:17 <heat> <3

02:18 <geist> kk!

02:20 <gog> hi

02:20 <heat> greetings gog

02:34 nick64 has quit [Quit: Connection closed for inactivity]

02:39 gog has quit [Ping timeout: 268 seconds]

02:46 pretty_dumm_guy has quit [Quit: WeeChat 3.5]

02:48 griddle has quit [Ping timeout: 268 seconds]

02:50 Matt|home has quit [Ping timeout: 268 seconds]

02:57 [itchyjunk] has quit [Remote host closed the connection]

03:01 lkurusa has quit [Quit: I probably fell asleep (or went out). Who will ever know.]

03:35 heat has quit [Ping timeout: 264 seconds]

04:07 liz has quit [Quit: Lost terminal]

04:44 opal has quit [Remote host closed the connection]

04:44 opal has joined #osdev

05:04 clever has quit [*.net *.split]

05:04 ozarker has quit [*.net *.split]

05:04 ornxka has quit [*.net *.split]

05:04 mcfrdy has quit [*.net *.split]

05:04 night has quit [*.net *.split]

05:04 nether has quit [*.net *.split]

05:04 LambdaComplex has quit [*.net *.split]

05:04 hl has quit [*.net *.split]

05:04 night has joined #osdev

05:04 ornxka has joined #osdev

05:04 hl has joined #osdev

05:04 mcfrdy has joined #osdev

05:04 ozarker has joined #osdev

05:05 LambdaComplex has joined #osdev

05:07 CYKS has quit [*.net *.split]

05:07 andreas303 has quit [*.net *.split]

05:07 eau has quit [*.net *.split]

05:07 klange has quit [*.net *.split]

05:07 ccx has quit [*.net *.split]

05:07 energizer has quit [*.net *.split]

05:07 k4m1 has quit [*.net *.split]

05:07 dminuoso has quit [*.net *.split]

05:07 codez has quit [*.net *.split]

05:07 amj has quit [*.net *.split]

05:07 klange has joined #osdev

05:08 andreas303 has joined #osdev

05:09 amj has joined #osdev

05:10 ccx has joined #osdev

05:11 dminuoso has joined #osdev

05:12 energizer has joined #osdev

05:12 k4m1 has joined #osdev

05:26 elderK has quit [Quit: Connection closed for inactivity]

05:32 clever has joined #osdev

05:45 eau has joined #osdev

06:21 aejsmith has quit [Remote host closed the connection]

06:22 alpha2023 has quit [Read error: Connection reset by peer]

06:23 aejsmith has joined #osdev

06:23 chartreuse has quit [Ping timeout: 272 seconds]

06:24 alpha2023 has joined #osdev

06:46 the_lanetly_052_ has joined #osdev

07:01 the_lanetly_052_ has quit [Ping timeout: 245 seconds]

07:03 m3a has joined #osdev

07:07 bauen1 has quit [Ping timeout: 244 seconds]

07:13 the_lanetly_052_ has joined #osdev

07:26 the_lanetly_052_ has quit [Ping timeout: 245 seconds]

07:43 CYKS has joined #osdev

07:44 h4zel has quit [Quit: WeeChat 3.0.1]

07:56 <ddevault> when I unmap a page table, do I need to invalidate its entire address space?

07:57 <ddevault> yes, or reload cr3

07:57 <ddevault> reload cr3 sounds better

08:23 mavhq has joined #osdev

08:47 <geist> what precisely do you mean by unmap a page table?

08:47 <geist> you mean because there are no more pages in a range, remove the table that held it, but otherwise keep the page table *structure*?

08:47 <geist> (sometimes folks say page table when they mean the whole tree of tables, and vice versa)

08:55 <ddevault> I mean any of the page tables (PDPT, PD, PT)

08:56 <ddevault> and yeah, empty or not, the whole address range they describe becomes invalid

08:57 pretty_dumm_guy has joined #osdev

08:59 <geist> yah you dont have to dump the entire thing. Basically on x86 you simply have to invalidate a page that the page table covered

09:00 <geist> it's mentioned in the manual, but effectively what's going on is the cpu is allowed to cache the page table walker cache (caches inner 'pointers' between page tables) but it must invalidate the path leading to the page table when you invalidate anything that it 'covers'

09:00 <geist> so if you are unmapping something and you unmap the last thing in a page table if you invlpg that address you're free to just remove the page table from the layer above

09:00 <geist> since the invlpg will throw away the page table walker cache as well
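
A rough user-space model of the sequence geist describes: unmap one page, invalidate that address, and drop the now-empty page table without reloading cr3. Everything here is an assumption made for illustration: `struct page_table`, `unmap_page` and `tlb_invalidate_page` are hypothetical names, the invalidate hook only counts calls instead of executing `invlpg`, and the sketch ignores other CPUs (a real SMP kernel would also need a shootdown). It clears the parent entry before invalidating, which is a conservative ordering chosen for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PT_ENTRIES  512
#define PTE_PRESENT UINT64_C(0x1)

/* Hypothetical bookkeeping for one leaf page table. */
struct page_table {
    uint64_t entry[PT_ENTRIES];
    unsigned present_count;
};

/* Stand-in for the real instruction. In an x86-64 kernel this would be
 *   asm volatile("invlpg (%0)" :: "r"(vaddr) : "memory");
 * Here it only counts calls so the sketch runs in user space. */
unsigned invlpg_calls;
void tlb_invalidate_page(uintptr_t vaddr)
{
    (void)vaddr;
    invlpg_calls++;
}

/* Unmap one 4 KiB page. Returns true when the table became empty, in which
 * case the parent entry has been cleared and the caller may free the table. */
bool unmap_page(struct page_table *pt, uint64_t *parent_entry, uintptr_t vaddr)
{
    size_t idx = (vaddr >> 12) & (PT_ENTRIES - 1);
    bool now_empty;

    if (!(pt->entry[idx] & PTE_PRESENT))
        return false;               /* nothing mapped, nothing to flush */

    pt->entry[idx] = 0;
    pt->present_count--;
    now_empty = (pt->present_count == 0);
    if (now_empty)
        *parent_entry = 0;          /* unlink the table from the level above */

    /* One invalidation of the unmapped address; per the discussion this also
     * drops the cached walk that led to this table. */
    tlb_invalidate_page(vaddr);
    return now_empty;
}
```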

09:01 <geist> AMD has some optional features where you can turn off that behavior so you can hypothetically take more control into your own hands and explicitly invalidate the page table walker cache, but i wouldn't recommend it

09:01 fwg has joined #osdev

09:20 heat has joined #osdev

09:27 GeDaMo has joined #osdev

09:49 bauen1 has joined #osdev

09:51 the_lanetly_052_ has joined #osdev

10:10 bauen1 has quit [Ping timeout: 268 seconds]

10:12 bauen1 has joined #osdev

10:14 arch_angel has quit [Remote host closed the connection]

10:15 arch_angel has joined #osdev

10:15 arch_angel has quit [Remote host closed the connection]

10:16 arch_angel has joined #osdev

10:16 arch_angel has quit [Remote host closed the connection]

10:18 fwg has quit [Quit: .oO( zzZzZzz ...]

10:23 the_lanetly_052 has joined #osdev

10:25 gog has joined #osdev

10:25 the_lanetly_052_ has quit [Ping timeout: 252 seconds]

10:30 gog` has joined #osdev

11:01 opal has quit [Remote host closed the connection]

11:03 opal has joined #osdev

11:10 fwg has joined #osdev

11:14 lkurusa has joined #osdev

11:17 sikkiladho_ has joined #osdev

11:22 the_lanetly_052 has quit [Remote host closed the connection]

11:24 gildasio has joined #osdev

11:25 <mrvn> So INVLPG will always throw away the L4 table?

11:25 <mrvn> or just the one entry to the page?

11:26 fwg has quit [Quit: .oO( zzZzZzz ...]

11:27 <mrvn> ddevault: if you unmap a range then at some point reloading cr3 is faster than INVLPG every page.

11:27 <ddevault> aye

11:28 <ddevault> this was my conclusion

11:32 <mrvn> beware of global pages, you have to INVLPG them as reloading CR3 doesn't wipe them.

11:32 <ddevault> global pages?ack

11:32 <ddevault> ? ack*

11:33 <mrvn> Entries with the global bit set

11:33 <mrvn> So basically everything in kernel space.
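
The trade-off mrvn and ddevault settle on could be sketched like this. The threshold value and the names are hypothetical; the crossover point depends on the CPU and would have to be measured.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum flush_strategy {
    FLUSH_INVLPG_EACH,  /* one invlpg per page in the range */
    FLUSH_RELOAD_CR3    /* drop all non-global TLB entries at once */
};

/* Hypothetical tunable: above this many pages a full flush is assumed cheaper. */
#define FLUSH_ALL_THRESHOLD_PAGES 32

enum flush_strategy choose_flush(size_t npages, bool range_has_global_pages)
{
    assert(npages > 0);

    /* Reloading cr3 leaves entries with the global bit in the TLB,
     * so a range containing global pages must be flushed page by page. */
    if (range_has_global_pages)
        return FLUSH_INVLPG_EACH;

    return npages > FLUSH_ALL_THRESHOLD_PAGES ? FLUSH_RELOAD_CR3
                                              : FLUSH_INVLPG_EACH;
}
```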

11:38 <ddevault> aye

11:38 Matt|home has joined #osdev

11:52 <mrvn> do you use PCID?

11:54 <ddevault> not yet

12:12 gildasio has quit [Write error: Broken pipe]

12:12 gxt_ has quit [Write error: Broken pipe]

12:12 opal has quit [Remote host closed the connection]

12:12 foudfou_ has quit [Remote host closed the connection]

12:13 foudfou has joined #osdev

12:13 gildasio has joined #osdev

12:13 opal has joined #osdev

12:14 gxt_ has joined #osdev

12:15 eroux has quit [Ping timeout: 272 seconds]

12:17 eroux has joined #osdev

12:31 gxt_ has quit [Remote host closed the connection]

12:37 gog has quit [Ping timeout: 240 seconds]

12:37 Dyyskos has quit [Quit: Leaving]

12:39 gildasio has quit [Remote host closed the connection]

12:39 gxt_ has joined #osdev

12:40 gildasio has joined #osdev

12:40 elastic_dog has quit [Ping timeout: 240 seconds]

12:42 mzxtuelkl has joined #osdev

12:44 xenos1984 has joined #osdev

12:46 elastic_dog has joined #osdev

12:55 gog has joined #osdev

13:00 nyah has joined #osdev

14:30 foudfou has quit [Remote host closed the connection]

14:31 foudfou has joined #osdev

14:36 fwg has joined #osdev

14:39 the_lanetly_052 has joined #osdev

14:41 the_lanetly_052 has quit [Client Quit]

14:41 foudfou has quit [Remote host closed the connection]

14:41 the_lanetly_052 has joined #osdev

14:41 foudfou has joined #osdev

14:43 ThinkT510 has quit [Ping timeout: 268 seconds]

14:44 ThinkT510 has joined #osdev

14:44 sikkiladho_ has quit [Quit: Connection closed for inactivity]

14:47 andydude has joined #osdev

14:53 andydude has quit [Quit: andydude]

15:09 bauen1 has quit [Ping timeout: 245 seconds]

15:11 bauen1 has joined #osdev

15:16 tsraoien has quit [Ping timeout: 268 seconds]

15:20 bauen1 has quit [Ping timeout: 245 seconds]

15:24 LittleFox has quit [Quit: ZNC 1.8.2+deb2+b1 - https://znc.in]

15:25 LittleFox has joined #osdev

15:28 tsraoien has joined #osdev

15:29 fwg has quit [Quit: .oO( zzZzZzz ...]

15:39 lkurusa has quit [Quit: I probably fell asleep (or went out). Who will ever know.]

15:42 bauen1 has joined #osdev

15:47 bauen1 has quit [Ping timeout: 252 seconds]

15:59 <freakazoid333> Not so Common Desktop Environment: https://github.com/NsCDE/NsCDE

15:59 <bslsk05> NsCDE/NsCDE - Modern and functional CDE desktop based on FVWM (30 forks/998 stargazers/NOASSERTION)

15:59 bauen1 has joined #osdev

16:07 mzxtuelkl has quit [Quit: Leaving]

16:22 exec64 has quit [Remote host closed the connection]

16:22 alethkit has quit [Remote host closed the connection]

16:22 gjnoonan has quit [Remote host closed the connection]

16:22 sm2n has quit [Remote host closed the connection]

16:22 jleightcap has quit [Remote host closed the connection]

16:22 patwid has quit [Remote host closed the connection]

16:22 tom5760 has quit [Remote host closed the connection]

16:22 milesrout has quit [Write error: Connection reset by peer]

16:22 noeontheend has quit [Write error: Connection reset by peer]

16:22 ddevault has quit [Remote host closed the connection]

16:23 ajr has joined #osdev

16:26 tom5760 has joined #osdev

16:27 sm2n has joined #osdev

16:27 patwid has joined #osdev

16:27 alethkit has joined #osdev

16:27 jleightcap has joined #osdev

16:27 noeontheend has joined #osdev

16:27 gjnoonan has joined #osdev

16:34 patwid has quit [Remote host closed the connection]

16:34 tom5760 has quit [Remote host closed the connection]

16:34 gjnoonan has quit [Remote host closed the connection]

16:34 sm2n has quit [Remote host closed the connection]

16:34 alethkit has quit [Remote host closed the connection]

16:34 jleightcap has quit [Remote host closed the connection]

16:34 noeontheend has quit [Remote host closed the connection]

16:36 ddevault has joined #osdev

16:36 gjnoonan has joined #osdev

16:36 tom5760 has joined #osdev

16:36 exec64 has joined #osdev

16:36 milesrout has joined #osdev

16:36 patwid has joined #osdev

16:36 sm2n has joined #osdev

16:36 jleightcap has joined #osdev

16:37 noeontheend has joined #osdev

16:37 alethkit has joined #osdev

16:49 ebrasca has joined #osdev

16:58 Raito_Bezarius has quit [Ping timeout: 240 seconds]

17:07 lkurusa has joined #osdev

17:11 Raito_Bezarius has joined #osdev

17:21 alethkit has quit [Remote host closed the connection]

17:21 patwid has quit [Remote host closed the connection]

17:21 tom5760 has quit [Remote host closed the connection]

17:21 jleightcap has quit [Write error: Broken pipe]

17:21 sm2n has quit [Write error: Connection reset by peer]

17:21 milesrout has quit [Write error: Broken pipe]

17:21 exec64 has quit [Remote host closed the connection]

17:21 noeontheend has quit [Remote host closed the connection]

17:21 gjnoonan has quit [Remote host closed the connection]

17:21 ddevault has quit [Remote host closed the connection]

17:22 exec64 has joined #osdev

17:22 tom5760 has joined #osdev

17:22 sm2n has joined #osdev

17:22 milesrout has joined #osdev

17:22 gjnoonan has joined #osdev

17:23 ddevault has joined #osdev

17:23 patwid has joined #osdev

17:23 alethkit has joined #osdev

17:23 noeontheend has joined #osdev

17:24 jleightcap has joined #osdev

17:41 [itchyjunk] has joined #osdev

17:43 tsraoien has quit [Ping timeout: 252 seconds]

17:48 vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]

17:48 foudfou_ has joined #osdev

17:49 foudfou has quit [Ping timeout: 268 seconds]

17:49 gxt_ has quit [Ping timeout: 268 seconds]

17:51 gxt_ has joined #osdev

18:00 ptrc has quit [Remote host closed the connection]

18:05 ptrc has joined #osdev

18:06 tsraoien has joined #osdev

18:10 the_lanetly_052 has quit [Ping timeout: 240 seconds]

18:19 <GeDaMo> I don't like when things are described as "modern", what does that even mean? :|

18:19 <zid`> Shit.

18:21 tsraoien has quit [Ping timeout: 252 seconds]

18:21 <gog> i think it's meant to be like contemporary, fitting in with the style of the time and season

18:22 <GeDaMo> https://www.youtube.com/watch?v=nxTnJPIl20U

18:22 <bslsk05> 'Dedicated Follower Of Fashion' by The Kinks - Topic (00:03:01)

18:22 <zid`> But what it actually means is shitty.

18:23 <zid`> it has ads, loads and operates slowly, and breaks in 2 months

18:25 foudfou_ has quit [Remote host closed the connection]

18:25 foudfou has joined #osdev

18:27 <gog> yes

18:28 <gog> the web browser becoming a universal application platform was a mistake

18:28 <zid`> DHTML is the future!

18:28 <gog> return to the fragmented, competitive past

18:29 <GeDaMo> I wonder if whatever replaces the web will first be implemented in the browser :|

18:29 <zid`> yea, I prefer the broken 2000s web to the broken 2020s web

18:29 <zid`> At least nobody was *demanding* I used IE5, because using websites was optional

18:29 <zid`> and my machine didn't struggle to at least run the software at all

18:30 <gog> i want the 80's back with the 20 different CPU architectures and 35 operating systems

18:31 <zid`> That's why we should all listen to golden brown by the stranglers on repeat

18:31 <GeDaMo> s/operating systems/BASICs/ :P

18:31 <gog> yes

18:31 <gog> also yes

18:35 <zid`> golden brown, texture like sun. Lays me down, with my mind she runs.

18:36 <zid`> Throughout the night, no need to fight. Never a frown with golden brown.

18:36 <zid`> Really makes you want to try heroin doesn't it

18:37 <gog> not really

18:37 <gog> but you do you

18:37 <zid`> But, never a frown!

18:41 <PapaFrog> May be a stupid question. Is Intel HD audio related to Realtek HD Audio in any way?

18:41 <zid`> yes

18:41 <zid`> They're both HDA, the spiritual successor to ac'97

18:41 <zid`> one using an intel chip and one using a realtek chip

18:41 <PapaFrog> If I implement a driver for one, how far am I from supporting the other?

18:41 <zid`> it makes the drivers *similar* but not hugely

18:42 <zid`> as in they both support the same basic features

18:42 <PapaFrog> I really just want PCM16.

18:42 <gog> afaik there's the bus and then codecs

18:42 <zid`> https://en.wikipedia.org/wiki/Intel_High_Definition_Audio

18:42 <bslsk05> en.wikipedia.org: Intel High Definition Audio - Wikipedia

18:43 <gog> each codec can implement support for different types of streams and exposes them to the bus

18:43 <PapaFrog> I wish there was a damn consensus when it comes to some hardware.

18:43 <zid`> It turns out though that it's not as portable as you'd hope, as the chips support 'extras' each, and have different pin wirings and stuff

18:43 <zid`> but it's much better than them all being randomly totally different

18:43 <zid`> (soundblaster era)

18:45 <PapaFrog> I'm tempted to play with the speaker. lol

18:48 wootehfoot has joined #osdev

18:48 <GeDaMo> 1 bit should be enough for anyone :P

18:51 <gog> yes the pit and cpu-bound I/o is all the sound hardware you need

18:52 <gog> and very careful timing

18:52 <mrvn> GeDaMo: welcome to cmov computing.

18:53 <geist> mrvn: invlpg will throw away the page, and all of the page table cache leading up to it

18:53 <geist> thus why it's safe to just throw away a page table as long as it has no entries

18:55 <geist> that's why i was asking questions about what is precisely going on

18:55 <geist> if you want to en-masse toss a bunch of page table entries and not individually invlpg them, then yes you can toss the entire cr3 (though as you say global pages are different)

18:56 <geist> but *generally* aside from special cases most mmu code does one page at a time, and thus when you're removing a page table it's because you just unmapped the last page from it

18:56 <geist> and if you invlpged it, then you can simply toss the page table, and you dont have to reload cr3

18:57 <mrvn> geist: but what does it cache? The whole 4k page of each level or individual entries?

18:57 <geist> neither. you're talking about the page table cache?

18:57 <geist> more specifically the page table walker cache?

18:57 <mrvn> the page walker cache

18:57 <geist> it stores basically 'this virtual address's range is at page table X physically'

18:57 <geist> basically short circuits walking the entire page table by jumping directly to the end of the walk

18:58 <geist> so when you invlpg it also throws away any page table cache entry that refers to the page table leaf you just invlpged

18:59 <geist> thus forcing the page table tree to be fully walked next time it loads a TLB

18:59 <geist> in that space

18:59 <mrvn> I would have thought it's levels of cache. If it can't find the L4 table in cache it looks for the L3 table then the L2 table and last it lookups the L1 table entry from scratch.

18:59 <geist> nah the whole point is to short circuit the entire walk

19:00 <geist> note the x86 manual doesn't describe how it works, so it's also possible it does what you say, but the rules are written: you dont have to worry about evicting it if you use invlpg

19:00 <geist> ARM64 has the same thing and the manual describes it in fairly intricate detail

19:00 <mrvn> It would. But when you looked at 2MB the next access fails and then has to walk from scratch. My way it would still have the L3 table in cache and only had to walk the last step.

19:00 <geist> so i'm assuming the x86's work basically the same way

19:00 * geist shrugs

19:01 <geist> it would be less useful because of the mechanism i just described: evicting the page table walker cache

19:01 <geist> would have to walk up the tree and evict everything in the walker cache

19:01 <mrvn> That's why I wondered how much it would evict.

19:01 <geist> such that the inner nodes would be far less useful, they'd get tossed all the time

19:01 <geist> if it just stores leaf nodes in the walker cache it only has to evict at most 1

19:01 <mrvn> probably why they aren't doing it that way.

19:01 <geist> right

19:02 <mrvn> Drawback is that you get a full page table walk every 2MB.

19:02 <geist> indeed, however it can store as many walker entries as it has cache for it

19:02 <geist> so it may track a large amount

19:04 <mrvn> Every entry caches 2MB. Except when you have huge pages then it's 1GB or 512GB.
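
The 2MB / 1GB / 512GB figures follow from x86-64 4-level paging, where every table holds 512 entries and each level multiplies the covered range by 512. A small sketch of the arithmetic (the function names are made up):

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES_PER_TABLE 512

/* Bytes mapped by one entry at a given level of x86-64 4-level paging:
 * 1 = PT entry (4 KiB), 2 = PD entry (2 MiB),
 * 3 = PDPT entry (1 GiB), 4 = PML4 entry (512 GiB). */
uint64_t entry_coverage(unsigned level)
{
    assert(level >= 1 && level <= 4);
    return UINT64_C(4096) << (9 * (level - 1));
}

/* Bytes mapped by one whole table at that level, which is what a cached
 * pointer to that table lets the walker skip straight to. */
uint64_t table_coverage(unsigned level)
{
    return ENTRIES_PER_TABLE * entry_coverage(level);
}
```

A cached pointer to a PT covers `table_coverage(1)`, i.e. 2 MiB. With 2 MiB huge pages the last table in the walk is a PD (1 GiB), and with 1 GiB pages it is a PDPT (512 GiB).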

19:06 <geist> right

19:07 <geist> only real reason i know about it is like all things on ARM64 this is fully exposed. software is responsible for maintaining it

19:07 <geist> also recent AMD cpus have a feature bit you can hit that enables the explicit maintenance of it, but AFAICT linux doesn't use that feature, so it's probably dead on the vine

19:08 <geist> idea is if you know you aren't invlpging the last entry in a page table, dont evict the walker

19:08 <geist> or conversely, only evict the walker when you know you're removing the last entry, which is what you have to do on ARM64

19:08 lkurusa has quit [Quit: I probably fell asleep (or went out). Who will ever know.]

19:08 <mrvn> And since you need to track that anyway to free the PT that is basically free.

19:09 <mrvn> can you reload CR3 without evicting the walker?

19:10 <geist> i dont think so. also i *assume* it gets evicted even when you have global pages

19:10 ZipCPU has quit [Quit: ZNC 1.7.5+deb4 - https://znc.in]

19:10 <geist> since it'd only be able to mark the cache walker as global if it somehow knew that every single page in the page table is also global, etc

19:10 <mrvn> Every time you change a page from RW to RO or vice versa you don't have to change the walker.

19:10 <geist> right

19:11 <mrvn> Or populate a mmaped region

19:11 <geist> obviously if you did something (this would be odd) like copy a page table from one to another for the purposes of moving the page table to somewhere else physically

19:11 <geist> you'd have to evict the page table walker cache and not the pages themselves

19:11 ZipCPU has joined #osdev

19:11 <geist> say you were defragging physical ram

19:12 <mrvn> if you replace a page table or free large amounts of memory you have to kill the walker. But that is really the minority.

19:12 <geist> anyway PT walker caches are i think a really big reason the page tables themselves aren't *that* much of an overhead

19:12 <geist> would be interesting to see what kinda hit rate they get. i dunno if that's exposed in any of the profile vars

19:12 jafarlihi has joined #osdev

19:12 <mrvn> totally. Doing 4 (soon 5) entry lookups would be much slower.

19:13 <jafarlihi> geist: Hey! Do you know if there's DHCPv4 client impl in Fuchsia? I can find DHCPv4 server and DHCPv6 client but no DHCPv4 client.

19:13 <mrvn> Hmm, why aren't page table lookups in the L1 cache? :)

19:13 <geist> there should be

19:13 <geist> fuchsia gets ip addresses from dhcp servers all the time

19:13 <mrvn> or are they?

19:14 <geist> i think the cpu cache also is free to cache the page tables yes

19:14 <geist> on ARM it's a bunch of control bits in the TTBR and whatnot

19:14 <mrvn> Would be hard to do with virtual address indexing

19:17 <geist> but on ARM at least that's why there's some memory order issues when dealing with modifying page table entries and whether or not the page table walker 'sees' it

19:18 <geist> since in ARM at least the walker is considered basically its own cpu. from the point of view of the data cache, it is independently walking through it, in more or less the same way a cpu core does

19:18 <geist> and thus you need memory barriers and whatnot at particular spots for the walker to see what you just wrote, etc

19:19 <mrvn> geist: that says the walker isn't using the cache, so you have to flush writes out of it to become visible.

19:19 <geist> no memory barriers are not a cache flush

19:20 <geist> they're simply forcing the cpu to drain its write buffer and get stuff into the cache subsystem

19:20 <geist> weakly ordered cpus are a bitch!

19:20 <geist> it gets even more complicated when you enable the A and D bit feature on ARM. now you have to do everything with careful atomics, since the page walker can be modifying page table entries in a weakly ordered way

19:21 <geist> i haven't yet written that code for fuchsia, but it's on my list of things to wade through this year

19:22 <mrvn> only if you have threads or are modifying another thread's tables.

19:22 wootehfoot has quit [Ping timeout: 252 seconds]

19:23 <mrvn> I never turned on the hardware A/D bits since I believe the RPi don't (all?) have it.

19:23 <geist> no, not at all. think of the walker itself as another cpu

19:23 <geist> it's also asynchronously modifying page table entries

19:24 <geist> there are a huge pile of rules about it but it makes the page table code much more complicated, since you have to use atomics for all of them
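
To illustrate why atomics become necessary once the walker may write to page table entries, here is a sketch using C11 atomics. The bit positions are invented for the example and are not the real ARM64 descriptor layout, and `pte_write_protect` is a hypothetical helper. The point is only that a plain load, modify, store could overwrite a bit the walker set in between, while a compare-exchange loop retries instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen for the sketch only. */
#define PTE_VALID    (UINT64_C(1) << 0)
#define PTE_WRITABLE (UINT64_C(1) << 1)
#define PTE_ACCESSED (UINT64_C(1) << 2)  /* may be set by the walker */
#define PTE_DIRTY    (UINT64_C(1) << 3)  /* may be set by the walker */

/* Clear the writable bit without losing accessed/dirty updates made
 * concurrently. Returns the value seen just before the successful update,
 * so the caller can tell whether the page was dirty. */
uint64_t pte_write_protect(_Atomic uint64_t *pte)
{
    uint64_t old = atomic_load_explicit(pte, memory_order_relaxed);
    uint64_t updated;

    do {
        updated = old & ~PTE_WRITABLE;
        /* On failure 'old' is refreshed with the current value and we retry. */
    } while (!atomic_compare_exchange_weak_explicit(
                 pte, &old, updated,
                 memory_order_release, memory_order_relaxed));

    return old;
}
```

A real kernel would still need the architecture's barriers and a TLB invalidation after this; the sketch covers only the read-modify-write itself.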

19:24 <mrvn> geist: is there a barrier the walker respects?

19:24 <geist> well, it's less of the barrier respecting it and more making sure the cpu is coherent relative to the walker at points where it matters

19:25 <mrvn> or rather one that respects the walker and waits for it to finish writing

19:25 <geist> same as any other weakly ordered shared data accesses, basically

19:25 <mrvn> I assume the walker is in the inner domain, right?

19:25 <geist> yes

19:25 <geist> the ARM ARM defines this in fairly intricate detail, precisely what semantics the walker uses when it accesses or modifies page tables

19:26 <geist> based on that the cpu has a series of rules to follow to keep itself in sync. without A/D bit the walker is RO, and never modifies entries

19:26 <geist> so the rules are far simpler. that's the gist of it

19:26 <mrvn> So barrier, modify tables, barrier. The walker can't modify any (user) tables while I'm in kernel.

19:26 <mrvn> atomics maybe for kernel tables

19:27 <geist> sure it can. other cpu's walkers can

19:27 <mrvn> geist: no threading here. The process can't run on another core.

19:27 <geist> okay then.

19:27 <mrvn> With threading it's hell, yes.

19:28 <mrvn> Do you know of any kernel that has non-shared thread local storage? Memory only mapped in one thread.

19:29 <geist> not that i know of

19:29 fwg has joined #osdev

19:29 <mrvn> Seems like it would be a useful thing as many page table operations would get a lot cheaper.

19:29 tsraoien has joined #osdev

19:30 <mrvn> if (addr >= 0xC00..00) { /* no TLB shootdown for thread private mappings

19:31 <gorgonical> Okay I think I'm losing my mind

19:31 <gorgonical> User space processes should, in general, have interrupts enabled, right?

19:32 <dh`> having to switch the MMU context to switch threads makes that much more expensive and defeats a lot of the point of having threads

19:32 <dh`> so ~nobody does that

19:33 <mrvn> https://www.youtube.com/watch?v=PD1vkhFO4Dg&t=805s

19:33 <bslsk05> 'How Many Glass Panes Will a Bullet Go Through? - The Slow Mo Guys' by The Slow Mo Guys (00:15:17)

19:33 <zid`> gorgonical: depends if you want that cpu to be able to service IRQs or not

19:33 <zid`> (you probably do)

19:33 <j`ey> gorgonical: yes because how else to preempt?

19:34 <gorgonical> Right. So I'm trying to understand exactly when that should be enabled. Looking through Linux's (and netbsd's a little) I can't figure out when they actually set the user process to have interrupts

19:34 <zid`> iret

19:34 <zid`> set it in eflags

19:34 <zid`> well, iretq and rflags I guess

19:34 <gorgonical> On riscv its sret and the SPIE flag anyway

19:34 <gorgonical> But the same idea

19:34 <gorgonical> But I don't see anywhere that SPIE is actually set in the status reg it gets restored from

19:34 <mrvn> dh`: I thought the point of threads was to run the code on multiple cores. :)

19:34 <geist> yep. it's in eflags, and the I bit is not writable by ring 3 code (though it's visible)

19:35 <geist> gorgonical: so you're specifically interested in the riscv solution?

19:35 <gorgonical> Yes

19:35 <gorgonical> I know it *should* be set, but I want to understand where/how Linux actually sets it, since we're aiming for broad compatibility

19:35 <mrvn> gorgonical: you set the flag when you first create the threads context and then it's always saved on kernel entry and restored on exit.

19:35 <zid`> It's probably set when it drops into init, re the linux kernel?

19:35 <geist> so the sstatus register has a saved copy when it enters an exception. moves the bits

19:36 <geist> so when setting up the cpu for entering into user space, its much like x86 in that you arrange for interrupts to be enabled when you eret to it

19:36 <geist> in the case of riscv it's not on the stack, it's in the sstatus register itself

19:36 <geist> in another field

19:36 <gorgonical> mrvn: yes so in copy_thread when they create kthreads it does get explicitly set there, but otherwise it's cloned from parent regs

19:36 <dh`> mrvn: just that much you can do with separate processes

19:36 <gorgonical> geist: right, in spie

19:37 <gorgonical> so in theory the init_process should have it set

19:37 <mrvn> dh`: My point was that with different cores you just load the page table on each. Makes no difference if it's the same or slightly different ones.

19:37 <geist> maybe not even that. depends on how the OS initially switches into user space

19:37 <mrvn> dh`: You only pay a price if you switch a core from one thread to another thread of the same process.

19:37 <geist> for the very first switch it may just set things up manually and then eret

19:37 <geist> here's my code, for example: https://github.com/littlekernel/lk/blob/master/arch/riscv/arch.c#L153

19:37 <bslsk05> github.com: lk/arch.c at master · littlekernel/lk · GitHub

19:37 <gorgonical> hmm

19:37 <dh`> oh, true.

19:38 <mrvn> gorgonical: I think you have to set it every time you create a thread.

19:38 <geist> it just sets up sstatus so that when it eventually srets (at the bottom of the function) interrupts get flipped on

19:38 <gorgonical> geist: I am thinking that whatever gets set has to be spie due to the swapping semantics on exception/sret

19:38 <geist> yes.

19:39 <mrvn> gorgonical: you don't copy the parents eflags as they are somewhere lost on the kernel stack when you do the thread creating.

19:39 <geist> mrvn: it doesn't work that way on riscv

19:40 <geist> this is riscv, things are slightly different here. though functionally it's the same thing, you're probably just confusing them

19:40 <gorgonical> oh I'm real dumb, I think I just found it

19:40 <geist> found what where?

19:40 <gorgonical> Kitten combines the create/start code and linux breaks them up

19:40 <dh`> riscv is like mips, there's one register that masks sources and another with a master switch

19:40 <mrvn> geist: you don't have the flags in a banked register and save it to the stack?

19:41 <gorgonical> start_thread does explicitly set SR_PIE

19:41 <geist> mrvn: not necessarily. that's my point, your suggestions are assuming that the hardware mechanism works a certain way

19:41 <gorgonical> mrvn: every priv level has a register to bank in

19:42 <geist> gorgonical: yep. so on the initial switch to the thread there's no existing stack frame to return from, so you simply set the sstatus's PIE and then sret

19:42 <mrvn> I guess you could not save it assuming the banked register doesn't get touched and you won't switch tasks. Then it's still the right value on exit.

19:42 <dh`> according to 4.1.1 supervisor-level interrupts are always enabled when in user mode

19:42 <geist> from then on out when you enter the kernel from that thread the sstatus is *probably* saved on the stack and you dont have to do it again

19:43 <geist> dh`: yeah that's probably true

19:44 <geist> it may be that it's just implicitly enabled on riscv and setting PIE does nothing when switching from supervisor to user

19:44 <geist> i always found this part of the riscv spec to be especially confusing.

19:44 <geist> not because it's complicated, but because it's poorly written

19:44 <dh`> according to my riscv spec there's SPIE and UPIE

19:45 <dh`> since the user interrupt stuff has been withdrawn, I assume the PIE bit you're talking about is the SPIE bit

19:45 <gorgonical> I am not aware of a board that actually supports u* regs

19:45 <dh`> no

19:45 <gorgonical> I wasn't aware they withdrew it

19:45 <dh`> the scheme they invented doesn't really work so it got punted

19:45 <dh`> i think that's the state of things

19:45 catern has joined #osdev

19:46 <dh`> "when a SRET instruction is executed, SIE is set to SPIE, then SPIE is set to 1"

19:46 <dh`> so when you trap from supervisor mode to supervisor mode, that bit controls the master interrupt switch

19:47 <dh`> but in user mode the master switch is apparently always on

19:48 <geist> so that begs the question: before entering user space does it make sense (or matter) to set SPIE or does it not matter?

19:48 <dh`> I think it doesn't matter

19:48 <gorgonical> It definitely seems to matter in qemu. I was getting bug_on triggers for thread migration

19:48 <geist> i think the answer is it probably doesn't matter because it's about to be implicitly enabled, and then when it comes back from user mode SIE will get cleared

19:50 <geist> mrvn: anyway the big difference in riscv vs x86 or ARM is the saved state of the previous interrupt disable flags is saved into the same control register you already have. there's no backup copy. it's the bits that are copied from one field to another

19:50 <gorgonical> The board manual for the sifive doesn't say that sret sets spie to 1, fwiw

19:50 <dh`> but in general the trap handler should restore what was there in the previous state, because if the trap came from the kernel and you mess with it things will go off the rails

19:50 <geist> so it's slightly different than x86 pushing eflags on the stack, or arm copying things into SPSR

19:50 <geist> right, so in general a trap handler should push *status on the stack and restore it before *retting

19:50 jafarlihi has quit [Ping timeout: 245 seconds]

19:50 <dh`> yes

19:51 <geist> [ms]status [ms]ret

19:51 <gorgonical> dh`: Oh you are right though the priv spec says spie *should* be set to 1. wtf lol

19:51 <geist> but when entering user space for the first time you dont have to set up a stack frame, you can simply set up sstatus and then sret

19:51 <dh`> and for entering usermode the first time, the best thing to do usually is initialize a trap frame and call the return path

19:51 <geist> can do that too

19:51 <geist> it's a bit more annoying if you're already on that stack, etc, which is why i dont do it that way in LK

19:52 <dh`> helps to avoid either forgetting things or leaking kernel data

19:52 <geist> indeed

19:53 <dh`> in OS/161 typically you initialize a trapframe on your stack, point at it, and jump to the return code even though that frame won't be in the same place as one generated by a trap

19:53 <geist> yeah, makes sense
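
The approach dh` describes (build a trap frame, then take the normal return path) might look roughly like this on RISC-V. This is only a sketch: the frame layout and the name `trapframe_init_user` are assumptions for illustration, and OS/161 itself targets MIPS with its own layout. Zeroing the whole frame first is what keeps stale kernel data out of user registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SSTATUS_SPIE (1u << 5)   /* previous interrupt enable */
#define SSTATUS_SPP  (1u << 8)   /* previous privilege: 1 = S, 0 = U */

/* Hypothetical trap frame layout; real kernels define their own. */
struct trapframe {
    uint64_t gpr[31];   /* x1..x31, so x2 (sp) lives at index 1 */
    uint64_t sepc;      /* where sret will resume */
    uint64_t sstatus;   /* status value restored before sret */
};

/* Fill in a frame so the ordinary trap-return path drops into user mode. */
static void trapframe_init_user(struct trapframe *tf, uint64_t entry,
                                uint64_t user_sp)
{
    assert((user_sp & 0xf) == 0);   /* RISC-V ABI: 16-byte aligned stack */
    memset(tf, 0, sizeof *tf);      /* no stale kernel data in any register */
    tf->gpr[1] = user_sp;           /* x2 = sp */
    tf->sepc = entry;
    tf->sstatus = SSTATUS_SPIE;     /* SPP stays 0: return to U-mode */
}
```

The frame can live on the current kernel stack, as in the OS/161 description; the return code only needs a pointer to it.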

19:53 <dh`> this is to some extent up to what students decide to do, but there are limits

19:53 <mrvn> geist: interrupts don't get disabled on interrupt entry?

19:54 <dh`> e.g. you can't malloc the frame because there's no useful way to avoid leaking it :-)

19:54 <geist> they do

19:54 <mrvn> but that would overwrite the bit in the register so you can't know the previous state

19:54 <dh`> mrvn: on a trapa the SIE bit is copied into the SPIE bit and then the SIE bit is set to 0

19:54 <dh`> s/trapa/trap/

19:55 <geist> right. it saves the previous interrupt state (and the cpu mode it was in) but it just moves it into the same register
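
The SIE/SPIE shuffle described here can be modelled in a few lines of C. This is a sketch, not kernel code: `struct hart`, `trap_enter`, `sret`, and `s_interrupts_enabled` are hypothetical names, and the model tracks only the SIE, SPIE, and SPP bits of sstatus (bit positions as given in the RISC-V privileged spec).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SSTATUS_SIE  (1u << 1)   /* supervisor interrupt enable */
#define SSTATUS_SPIE (1u << 5)   /* previous interrupt enable */
#define SSTATUS_SPP  (1u << 8)   /* previous privilege: 1 = S, 0 = U */

enum priv { PRIV_U = 0, PRIV_S = 1 };

/* Hypothetical model of one hart; not a real kernel structure. */
struct hart {
    uint32_t sstatus;
    enum priv mode;
};

/* Trap into S-mode: SPIE <- SIE, SIE <- 0, SPP <- previous mode. */
static void trap_enter(struct hart *h)
{
    if (h->sstatus & SSTATUS_SIE)
        h->sstatus |= SSTATUS_SPIE;
    else
        h->sstatus &= ~SSTATUS_SPIE;
    h->sstatus &= ~SSTATUS_SIE;

    if (h->mode == PRIV_S)
        h->sstatus |= SSTATUS_SPP;
    else
        h->sstatus &= ~SSTATUS_SPP;
    h->mode = PRIV_S;
}

/* sret: SIE <- SPIE, SPIE <- 1, mode <- SPP, SPP <- U. */
static void sret(struct hart *h)
{
    assert(h->mode == PRIV_S);   /* this model only covers sret from S-mode */
    if (h->sstatus & SSTATUS_SPIE)
        h->sstatus |= SSTATUS_SIE;
    else
        h->sstatus &= ~SSTATUS_SIE;
    h->sstatus |= SSTATUS_SPIE;
    h->mode = (h->sstatus & SSTATUS_SPP) ? PRIV_S : PRIV_U;
    h->sstatus &= ~SSTATUS_SPP;
}

/* S-level interrupts are always deliverable in U-mode; SIE only gates S-mode. */
static bool s_interrupts_enabled(const struct hart *h)
{
    return h->mode == PRIV_U || (h->sstatus & SSTATUS_SIE) != 0;
}
```

In this model, an `sret` to user mode with SPIE clear still leaves interrupts deliverable, but the next trap copies SIE = 0 into SPIE. Kernel code that inspects the saved SPIE would then see "interrupts were off", which is one way a missing SPIE on first entry could become visible.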

19:55 <geist> [ms]status. it's fairly close to CPSR on arm

19:55 <mrvn> sounds exactly like SPSR on ARM.

19:55 <dh`> the riscv supervisor stuff has nothing like the design quality of the riscv base :-|

19:55 <geist> except instead of making a copy into SPSR, it just copies into fields within CPSR

19:55 <geist> so it's not exactly like SPSR

19:55 <dh`> it is much more like mips than arm

19:56 <geist> that's the point i was trying to make, it's not copying the register into a saved one, it copies from one part of the register to the other

19:56 <dh`> but not that much like mips either

19:56 <geist> so the saved state is always 'live'

19:56 <mrvn> On ARM64 you have 4 banked status registers, right?

19:56 <mrvn> one per EL

19:56 <geist> yes. on riscv you have 2 (or 3 if you think of the virtualization extensions)

19:57 <geist> because user mode doesn't have a '*status' register

19:57 <dh`> if you remember the r3000 status register, there's three bit pairs for interrupt and user/kernel state at the bottom of the register, and traps/returns shift them left/right respectively

19:57 <mrvn> Who has interrupts enabled inside their kernel?

19:57 <geist> that's the part where the *status register is *not* like CPSR on arm. it's closer to SCTLR in the sense that it's not user visible

19:57 <dh`> so basically there's the user state, the state after trapping into the kernel, and a third set for a nested trap within the TLB refill handler
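
The R3000 scheme dh` describes is a three-deep stack of (KU, IE) bit pairs in the low six bits of the status register. A minimal model, assuming the usual layout (IEc/KUc in bits 0-1, IEp/KUp in bits 2-3, IEo/KUo in bits 4-5, KU = 1 meaning user mode); the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define SR_STACK_MASK 0x3fu   /* KUo IEo KUp IEp KUc IEc in bits 5..0 */
#define SR_IEC        0x01u   /* current interrupt enable */
#define SR_KUC        0x02u   /* current mode: 1 = user, 0 = kernel */

/* Exception entry: current -> previous -> old, current becomes kernel/ints off. */
static uint32_t r3000_exception(uint32_t sr)
{
    uint32_t stack = ((sr & SR_STACK_MASK) << 2) & SR_STACK_MASK;
    return (sr & ~SR_STACK_MASK) | stack;
}

/* rfe: previous -> current, old -> previous, old pair left unchanged. */
static uint32_t r3000_rfe(uint32_t sr)
{
    assert((sr & SR_KUC) == 0);   /* rfe is only executed in kernel mode */
    uint32_t low4 = (sr & SR_STACK_MASK) >> 2;
    return (sr & ~0x0fu) | low4;
}
```

Starting from user mode with interrupts on (0x03), two nested exceptions followed by two `rfe`s bring the current pair back to 0x03, which is the push/pop behaviour described above.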

19:57 * geist raises hands

19:57 <dh`> it is more like that

19:58 <geist> yah

19:58 <dh`> mrvn: unless your kernel is very micro you need to have at least some interrupts on to avoid dropping some

19:58 <mrvn> not afraid to get too many interrupts and running out of stack?

19:59 <geist> this is a solved problem like 50 years ago mrvn

19:59 <geist> you can control the amount of nesting you get

19:59 <geist> enabling interrupts within the kernel does not mean you *always* enable interrupts within the kernel

19:59 <geist> just in most of the code

20:00 <dh`> mrvn: usually only one at a time, though if you have interrupt priority levels you might allow one per level at a time

20:00 <geist> also keep in mind most modern arches are pretty stupid in terms of having exactly 2 or maybe 3 interrupt levels. more sophisticated arches of yore were much more friendly with multiple nested interrupts via different irq levels

20:00 <geist> 68k, VAX (where 68k copied it from most likely), etc

20:00 <geist> though cortex-m class hardware looks suspiciously similar to VAX

20:01 <dh`> traditionally, if you don't react quite rapidly to serial port interrupts you drop characters

20:01 <mrvn> geist: Do you enable interrupts only in code that will block or always up to a certain limit?

20:02 <geist> more like enable it by default and disable it in code where you can't take another one, including interrupt handlers themselves

20:02 <geist> that's a fairly standard reentrant, preemptive kernel design

20:02 <dh`> or even nonpreemptive

20:02 <dh`> traditional unix is nonpreemptive but enables everything else in the kernel

20:02 <mrvn> dh`: serial? Seriously? That generally has 16 byte fifo and you can get an interrupt at 7/8th to 1/8th full. At 115200 BAUD that's forever.

20:03 <geist> oh my god dont get me started about the woes of serial ports

20:03 <dh`> mrvn: "traditionally"

20:03 <dh`> fifos on UARTs only started to appear in the 90s

20:03 <mrvn> 3 decades ago :)

20:03 <geist> and it's still common for arm soc makers to put 1 byte fifos

20:03 <geist> *right now*

20:04 <dh`> kernel design hasn't changed much in the past 25 years

20:04 <geist> with the assumption that 'software will just get to it quickly enough' or 'use dma'

20:04 <mrvn> wow, I've never used anything but a 16650 clone.

20:04 <geist> then you dont know what you're missing

20:04 <geist> a) 8250 derivatives suck. it's a terrible design

20:04 <geist> and b) there are worse

20:04 <mrvn> must have been lucky with the ARMs I bought.

20:05 <geist> there are far nicer uarts to work with out there that aren't based on 8250 designs

20:05 <geist> and there are even worse ones. the 'console uart' on the raspberry pi is a good example of an even worse one

20:05 <geist> it's intentionally stupid because it's supposed to just be used for slow transfers

20:05 <dh`> anyway, if you want to run 19200 bps on a 1992 machine you can't fuck around with your interrupt latency

20:05 <geist> it even shares an irq with some other hardware because broadcom couldn't be bothered

20:06 <mrvn> 1 byte FIFO is 0.039ms. Still not too bad.

20:06 <geist> yah i remember back in the 386 days you actually wanted to go buy a 16450 card or your cpu couldn't keep up

20:06 <geist> but like i said even on modern hardware it can be a challenge if you have an exceptionally dumb serial port

20:07 <geist> enough that you gotta be careful, and extended irq disablement can cause you to miss windows

20:07 <mrvn> geist: yeah, and I was laughing at you 386 users and doing 230400 BAUD on my serial.

20:07 <geist> that's a common thing btw: running serial ports at a few megabits

20:07 <mrvn> AmigaOS hardly ever disabled interrupts.

20:07 <geist> then even 16 byte fifos start to look pretty small

20:08 <geist> mrvn: it's also possible they used 68k's native irq level stuff

20:08 <mrvn> geist: obviously. :) m68k is a lot better there than 386.

20:08 <geist> 68k is not a particularly good cpu for bit banging, but it does have a decent irq handling mechanism

20:08 <dh`> wait a sec. 19200 bps is 2400 bytes/sec and that's 0.4 ms per character

20:08 <mrvn> dh`: my number was for 115200

20:09 <dh`> ah oops

20:09 justmatt has joined #osdev

20:09 <geist> my general rule of thumb is 115200 is about 10 characters per ms

20:09 <geist> since it's approx 10k chars/sec

20:09 <mrvn> At 19200 BAUD (0.4ms) that's nearly half the time slice each process gets on AmigaOS + multiuser/dynamic priorities patches

20:09 <dh`> anyway on a 1992 machine you've got say 25000 cycles per ms

20:10 <geist> also remember lots of embedded things are say 25Mhz or so now

20:10 <geist> though a cortex-m at 25mhz would run rings around a 486 or 68k at 25mhz

20:10 wootehfoot has joined #osdev

20:10 <mrvn> I was so used to 1000Hz timer for the scheduler with AmigaOS and then I tried Linux with its 100Hz. *blahh*

20:11 <geist> omg amigaos is so amazing!

20:11 <geist> why did any of us ever survive?

20:11 <mrvn> It was.

20:11 <geist> dont trigger me or i'll start going off on how great VAX is again

20:11 <mrvn> and don't ask me how we are still living

20:11 <geist> and then you'll be sorry!

20:11 <dh`> so for 19200 you get about 10000 cycles, and to not drop a character you have to finish whatever you're doing, reenable interrupts, and take the interrupt before that budget runs out
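
The arithmetic in this exchange is easy to check in code. A small sketch with hypothetical helper names: dh`'s figures count 8 bits per character, while an 8N1 line actually spends 10 bit times per character (start bit, 8 data bits, stop bit), so both are left as a parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Microseconds between characters at a given line rate (rounded down).
   bits_per_char: 8 for raw data bits, 10 for 8N1 framing. */
static uint32_t char_time_us(uint32_t bps, uint32_t bits_per_char)
{
    assert(bps > 0);
    return (uint32_t)(((uint64_t)bits_per_char * 1000000u) / bps);
}

/* CPU cycles available per character before a 1-byte receiver overruns. */
static uint64_t cycle_budget(uint32_t cpu_hz, uint32_t bps,
                             uint32_t bits_per_char)
{
    assert(bps > 0);
    return ((uint64_t)cpu_hz * bits_per_char) / bps;
}
```

With these, `char_time_us(19200, 8)` is 416 and `cycle_budget(25000000, 19200, 8)` is 10416, matching the "0.4 ms" and "about 10000 cycles" figures. `char_time_us(115200, 10)` is 86, i.e. roughly 11.5 characters per millisecond, close to geist's rule of thumb.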

20:11 <mrvn> I watched a VAX running BSD scrolling its console once.

20:11 <mrvn> blink, blink, blink

20:12 <mrvn> dh`: yeah, definitely want to enable interrupts in kernel, at least for some levels.

20:13 <mrvn> On the other hand you had hardware flow control. So if you are late the serial just pauses.

20:13 <geist> mrvn: well which vax was it?

20:13 <geist> there was a huge array of them. how was the terminal connected?

20:14 <mrvn> geist: no idea, it was like 15 years ago.

20:14 <geist> i once saw an amiga in the dumpster

20:14 <geist> ergo all amigas are dumpsters

20:14 <dh`> the serial _might_ just pause if all the flow control bits actually work

20:15 <dh`> I once had an rs232 cable that was down to one wire

20:15 <dh`> (by luck, the wire that had remained attached was the data line)

20:15 <mrvn> so GND was floating?

20:15 <dh`> yeah

20:15 <dh`> eventually it got resoldered, or maybe just thrown out

20:15 <geist> keep in mind i think 8250 doesn't do full hw flow control. one of the reasons why i dont like them

20:15 <mrvn> but you need recv and send

20:15 <geist> iirc they just allow you to use hw flow, but doesn't fully implement it

20:16 <mrvn> geist: like many ARMs and all the USB serial dongles I've used.

20:16 <heat> >dont trigger me or i'll start going off on how great VAX is again

20:16 <geist> huh? usb serial dongles aren't 8250

20:16 <heat> dont trigger me or i'll start going off on how great itanium is again

20:16 <geist> see!

20:16 <dh`> well, gnd was probably connected to the outer shield around the rim, right? so it may have also been connected

20:16 <dh`> but certainly that cable did not support RTS/CTS

20:17 <mrvn> dh`: I think it wouldn't have worked otherwise.

20:17 <dh`> dunno, sometimes things work even when they have no right to whatsoever

20:17 <mrvn> dh`: It must have never had RTS/CTS. If you cut those lines and don't short the pins the serial never works.

20:17 <dh`> it probably never had rts/cts

20:17 <dh`> most RS-232 cables didn't

20:18 <mrvn> geist: I have no idea what's inside the USB dongles but I only get 4 wires out: 5V GND, send, recv.

20:18 <geist> sure, but that's not 8250

20:18 <geist> 8250 is the programming interface

20:23 <heat> what's the difference between a 8250 and a 16550?

20:23 <heat> i get the two confused

20:24 <mrvn> heat: the fifo?

20:25 GeDaMo has quit [Quit: There is as yet insufficient data for a meaningful answer.]

20:26 <heat> i dunno, you tell me

20:28 <mrvn> The 16450(A) UART, commonly used in IBM PC/AT-series computers, improved on the 8250 by permitting higher serial line speeds.

20:28 <mrvn> With the introduction of multitasking operating systems on PC hardware, such as OS/2, Windows NT or various flavours of UNIX, the short time available to serve character-by-character interrupt requests became a problem, therefore the IBM PS/2 serial ports introduced the 16550(A) UARTs that had a built-in 16 byte FIFO or buffer memory to collect incoming characters.

20:28 <dh`> double the 82, obviously it must be an improved model

20:29 <mrvn> So 8250: 1 byte fifo, 16450 -> more speed, 16550 -> 16 byte fifo

20:31 <heat> ah ok so all "8250" drivers that touch the fifo are in reality 16550 drivers
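
One common way drivers tell these parts apart is to enable the FIFO through the FIFO control register, read back the top two bits of the interrupt identification register (IIR), and fall back to a scratch-register write/read test when no FIFO is reported. The sketch below encodes only that decision. The register I/O is left out, the enum and function names are hypothetical, and later parts such as the 16650/16750 are not distinguished.

```c
#include <stdbool.h>
#include <stdint.h>

enum uart_kind { UART_8250, UART_16450, UART_16550, UART_16550A };

/* iir:        IIR value read back after writing a FIFO-enable to the FCR.
   scratch_ok: whether a value written to the scratch register read back intact. */
static enum uart_kind classify_uart(uint8_t iir, bool scratch_ok)
{
    switch (iir >> 6) {
    case 3:
        return UART_16550A;   /* FIFO enabled and usable */
    case 2:
        return UART_16550;    /* FIFO reported but not usable */
    default:
        /* No FIFO: the original 8250 is assumed to lack a scratch register. */
        return scratch_ok ? UART_16450 : UART_8250;
    }
}
```

A driver that only ever sees `UART_8250` or `UART_16450` has to service every character individually, which is the case the interrupt-latency discussion above is about.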

20:31 <mrvn> There is also something about the BAUD generator. iirc the 16650 can generate its own BAUD.

20:31 wootehfoot has quit [Ping timeout: 252 seconds]

20:32 wootehfoot has joined #osdev

20:39 <mrvn> The BAUD rate generator was an 8250 vs. 8251 thing: the 8250 has one, the 8251 doesn't.

20:52 <mrvn> Hey, rejoice, Intel EVO is better at virus protection. You know, because everything runs faster (except viruses apparently) and, well, stuff.

20:53 <mrvn> snake oil V2, only with intel evo

20:54 <heat> my laptop is INTEL EVO POWERED BY CORE VPRO CERTIFIED

20:54 <mrvn> windows vista ready?

20:54 <heat> VT-d VT-g IGD GVT VT-x IA32e INTEL 64 64-BIT READY

20:55 <geist> also there are de facto 16650s and 16750s and whatnot

20:55 <geist> but basically i call them all 8250s, the newer ones are extensions to it

20:56 <Bitweasil> I like big serial FIFOs...

20:56 amine has joined #osdev

20:56 <Bitweasil> Makes life far easier.

20:56 <mrvn> Bitweasil: like 16 bytes?

20:56 <mrvn> or DMA capable serials?

20:56 <Bitweasil> I was thinking more of the FTDI USB ones, I think they've got 256 or 512 byte FIFOs.

20:56 <Bitweasil> I was shoving a lot of data around at 3Mbaud for a while, and that was quite useful.

20:56 <dh`> why not just use arcnet?

20:57 * dh` hides

20:57 <Bitweasil> I was talking to firmware stuff on a Minnowboard Max or Pi4, they had serial ports...

20:57 <Bitweasil> And far easier to deal with serial than anything fancier, especially when I didn't really tell the OS I'd taken over the serial port from it.

20:58 <heat> i want a minnowboard

20:58 <mrvn> clever: can you have the serial start DMA on the RPi?

20:58 <heat> mostly just to hack on firmware

20:58 <clever> mrvn: the uart fifos have a dreq signal, that can turn the dma on/off

20:58 <mrvn> thought so

20:58 <Bitweasil> heat, Max or the base one?

20:58 <clever> mrvn: you then program the dma to copy to/from the fifo register, with addr inc disabled

20:59 <clever> and set an axi burst size that fits whatever the dreq trigger is

20:59 <heat> Bitweasil, something newer

20:59 <heat> apparently it's the turbot now

20:59 <clever> as far as i know, the dma cant detect an over/underflow, and the axi port cant stall

20:59 <Bitweasil> I think that's the Max with a slightly faster chip on it.

20:59 <clever> so if the dma reads/writes too much, bytes will be lost/faked

20:59 <Bitweasil> ... if you're US based, I can toss mine in a box, I'm not using it for anything anymore.

21:00 <mrvn> clever: but you know exactly how much to read write by the trigger level you set

21:00 <Bitweasil> I was using it as a light desktop, but I've got other boards for that now.

21:00 <clever> mrvn: i think the dreq is using a hard-set trigger level, not the irq fifo level

21:00 <Bitweasil> brb, coffee underflow error.

21:01 <clever> mrvn: the rp2040 dreq is far better than the broadcom dreq

21:01 fwg has quit [Quit: .oO( zzZzZzz ...]

21:01 xenos1984 has quit [Read error: Connection reset by peer]

21:01 <heat> Bitweasil, I'm not :(

21:01 <clever> rp2040 dreq, will hold the dreq line active, for one clock cycle, for every byte that is added to the fifo, and the dma block then counts how many cycles dreq has been active

21:01 <clever> so dma knows exactly how many bytes can be read, and can fire off a perfect axi burst

21:01 <mrvn> clever: not surprising with the customary good-enough-to-work-around-it Broadcom quality.

21:02 <clever> mrvn: i think it was more about axi burst size and acceptable latencies

21:02 <clever> broadcom dreq is just a "level is over X" signal, and you must then do a burst of X reads

21:02 <heat> the minnowboards are all kinda old anyway

21:02 <heat> i want an open firmware machine to play around with :(

21:03 <heat> not literal open firmware though

21:03 <heat> i'm not interested in that crap

21:03 <mrvn> clever: that should be enough. DMA should do that in no time, even before the next char is recv/send.

21:03 gildasio has quit [Remote host closed the connection]

21:03 opal has quit [Remote host closed the connection]

21:03 <heat> i'm a crap connoisseur, UEFI only

21:03 <clever> mrvn: but if you want to allow an 8 byte dma burst, you cant enable the dma until 8 bytes are in the fifo

21:03 opal has joined #osdev

21:04 <clever> mrvn: triggering with 1 byte in the fifo, would result in 7 bytes of junk, due to a fifo underrun

21:04 gildasio has joined #osdev

21:04 <clever> and thats where the rp2040 dma is better, its aware of how much can actually be read

21:04 <mrvn> clever: but if you don't expect 8 bytes to come in why are you setting up DMA? :)

21:04 <clever> but the rp2040 design, doesnt deal with clock domains

21:05 gildasio has quit [Remote host closed the connection]

21:05 <clever> mrvn: what if you want to receive 1234 bytes, and your dma is configured to do 8 byte bursts?

21:05 <mrvn> clever: the rp2040 can trigger DMA after a timeout with less bytes, right?

21:05 <clever> rp2040 dma doesnt use timeouts, it will trigger a dma copy with even 1 byte in the fifo

21:05 gildasio has joined #osdev

21:05 <mrvn> clever: then at the end you poll the last few bytes.

21:05 <clever> yeah
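
The split just agreed on (DMA moves whole bursts, the CPU polls the tail) is simple arithmetic. A sketch with hypothetical names, assuming one byte per FIFO read:

```c
#include <assert.h>
#include <stddef.h>

struct dma_plan {
    size_t bursts;       /* number of full bursts the DMA engine performs */
    size_t dma_bytes;    /* bytes moved by DMA */
    size_t poll_bytes;   /* tail the CPU reads by polling the FIFO */
};

/* Split a transfer for a level-triggered dreq that only fires once a
   whole burst's worth of data is sitting in the FIFO. */
static struct dma_plan plan_transfer(size_t total, size_t burst)
{
    assert(burst > 0);
    struct dma_plan p;
    p.bursts = total / burst;
    p.dma_bytes = p.bursts * burst;
    p.poll_bytes = total - p.dma_bytes;
    return p;
}
```

For clever's example, `plan_transfer(1234, 8)` gives 154 bursts covering 1232 bytes and leaves 2 bytes to poll.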

21:06 <mrvn> isn't 1 byte DMA rather wasteful?

21:06 <clever> i think it usually operates in 32bit chunks on the rp2040

21:06 <mrvn> bus width?

21:06 <clever> 32bits on the 2040

21:06 <clever> i think the bigger problem, is the clock domains

21:07 <clever> on the rp2040, everything is in a single clock domain, so if the uart holds dreq high for 5 clocks, the dma counts +5, and knows it can read 5 times

21:07 <clever> but on the broadcom SoC's, the uart and dma are in different clock domains

21:07 <clever> so its harder to give an exact count like that
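
The two dreq styles clever contrasts can be put side by side in a toy model. Everything here is illustrative: the names are hypothetical and the behaviour is only what the log describes (one dreq cycle per byte on the rp2040, a "level is over X" signal followed by a fixed burst on the Broadcom parts).

```c
#include <stddef.h>

/* Credit style: the DMA engine counts dreq cycles, one per byte pushed. */
struct credit_dma {
    size_t credits;
};

/* Called once for every clock cycle in which dreq is seen high. */
static void credit_on_dreq_cycle(struct credit_dma *d)
{
    d->credits++;
}

/* Issue exactly as many reads as were announced; returns that count. */
static size_t credit_drain(struct credit_dma *d)
{
    size_t n = d->credits;
    d->credits = 0;
    return n;
}

/* Level style: a burst of `burst` reads is issued unconditionally.
   Returns how many of those reads would hit an empty FIFO. */
static size_t level_burst_underrun(size_t fifo_level, size_t burst)
{
    return fifo_level >= burst ? 0 : burst - fifo_level;
}
```

`level_burst_underrun(1, 8)` returns 7, the "7 bytes of junk" case. The credit scheme depends on every dreq cycle being observed, which is why it gets harder once the UART and the DMA engine sit in different clock domains.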

21:08 <mrvn> you could toggle the dreq or pull it down for a few cycles between chars.

21:09 <mrvn> Use an edge trigger and you only have to pulse it every time a char comes in

21:09 <mrvn> So no, I don't think the clock domains are a real problem.

21:09 <clever> what if the dma is in a slower clock domain

21:09 <clever> and it misses a pulse because its clock is too slow?

21:09 <Bitweasil> heat, fair enough. Yeah, export is a pain.

21:09 <mrvn> clever: slower than the BAUD rate?

21:10 <clever> mrvn: dreq is also used by internal things, like the 2d compositor

21:10 <mrvn> or even 1/4 BAUDrate if you raise the signal for half a char.

21:10 <mrvn> clever: now that is a bigger problem

21:11 <mrvn> clever: a edge trigger is damn fast though regardless of the clock. Even a mini pulse would latch the trigger high till you read it and reset.

21:12 <mrvn> You just can't send a second pulse before it's reset.

21:12 <clever> yep, but you can only count 1 edge per clock

21:12 <clever> exactly

21:12 <clever> so the dma will lose track of how many reads it should issue

21:12 <mrvn> but again BAUD rate speed vs DMA speed. No contest.

21:12 <clever> DSI is one of the dreq sources

21:12 <clever> thats 4 lanes of DDR 500mhz traffic

21:12 gog has quit [Ping timeout: 252 seconds]

21:12 <clever> 4000mbit

21:13 <clever> what were you saying about baud rate?

21:14 gog` has quit [Remote host closed the connection]

21:14 gog` has joined #osdev

21:14 <mrvn> you can also feed the edge into a clock-less adder directly. You can add GHz pulses easily and the DMA then only has to read out the adder before it overflows.

21:15 <clever> yeah, that could potentially be done

21:15 <clever> i feel like the PLL's are using that kind of hw

21:15 <clever> drive the PLL output directly into a clockless adder, and when the count hits $divider, reset and emit 1 pulse

21:16 <clever> then phase-compare that slower clock, with the reference, and feedback loop

21:16 gog has joined #osdev

21:16 <clever> but, even that, has limits

21:16 <mrvn> I've done that for some arduino project. Use the pulse to generate a clock signal going into a counter and the carry out pin on the counter is connected to a pin on the arduino with IRQ set up.

21:16 <clever> the adder/compare stage cant run over 3ghz on the rpi

21:16 <clever> there is a dedicated /2 that is much dumber logic, for >3ghz speeds

21:16 <mrvn> Only trigger an interrupt every 512 pulses.
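
The software side of mrvn's counter trick is small. A sketch under stated assumptions: the external counter wraps every 512 pulses and raises its carry-out once per wrap, and `residual` stands for a hypothetical read of the counter's current value (the log does not say whether the outputs were wired to the microcontroller at all).

```c
#include <stdint.h>

#define COUNTER_MODULUS 512u   /* pulses per carry-out interrupt, per the log */

struct pulse_counter {
    uint32_t overflows;   /* bumped by the carry-out interrupt handler */
};

/* Interrupt handler body: one call per 512 pulses. */
static void carry_irq(struct pulse_counter *pc)
{
    pc->overflows++;
}

/* Total pulses seen so far. Pass residual = 0 if the counter cannot be read;
   the result is then only accurate to within one modulus. */
static uint32_t total_pulses(const struct pulse_counter *pc, uint32_t residual)
{
    return pc->overflows * COUNTER_MODULUS + (residual % COUNTER_MODULUS);
}
```

The CPU takes one interrupt per 512 pulses instead of one per pulse, so the pulse rate can far exceed what the interrupt path alone could keep up with.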

21:17 <clever> so for low speeds, you do PLL/divider==crystal

21:17 <clever> but for high speeds, you instead do PLL/2/divider==crystal

21:18 <clever> the bcm2835 datasheets also mention, it has 2 sets of dma controllers

21:18 <clever> the full dma controllers, have a 256 bit bus

21:19 <clever> while the lite dma, is only a 128bit bus

21:19 <clever> they also differ in fifo depth

21:19 <mrvn> but the uart only has 32bit, right?

21:19 gildasio has quit [Ping timeout: 268 seconds]

21:19 xenos1984 has joined #osdev

21:19 <clever> and it warns that if you do too big of an axi burst read, the fifo can fill, and the reads will stall and jam that entire axi path up

21:19 <clever> and if the writes conflict with that path, the whole system will deadlock

21:20 <clever> mrvn: yeah, peripherals are on a dedicated 32bit only bus, and it says you can do a 4x burst on peripherals, and it will happily fill the 128bit fifo without issue

21:21 <clever> > The Lite engine will have about half the bandwidth of a normal DMA engine, and are intended for low bandwidth peripheral servicing.

21:21 gildasio has joined #osdev

21:26 <clever> mrvn: i really need to get around to actually testing out the dma engines, linux using dma on the rpi fails, when under my open firmware

21:26 opal has quit [Ping timeout: 268 seconds]

21:26 <clever> and that greatly hampers performance

21:28 psykose has joined #osdev

21:30 <dzwdz> is there a name for the subset of libc that doesn't interact with the kernel? memcpy, strlen, snprintf, etc

21:30 <mrvn> not syscall?

21:31 <Bitweasil> Not sure either...

21:31 psykose has quit [Remote host closed the connection]

21:31 psykose has joined #osdev

21:32 <gog> you can't necessarily guarantee that those never syscall

21:32 <dzwdz> but they can be implemented without syscalls, their point isn't interacting with the OS

21:32 <mrvn> right, memcpy calls the DMA syscall :)

21:32 <dzwdz> as opposed to e.g. fread

21:34 <heat> freestanding?

21:34 <dzwdz> as in "freestanding libc"?

21:34 <heat> it's not quite a term for that but it's close enough I think

21:35 <heat> freestanding parts of the libc

21:35 <mrvn> since when does freestanding have strlen?

21:35 <dzwdz> i suppose that works, but that's a mouthful

21:35 wootehfoot has quit [Quit: Leaving]

21:35 <heat> why would you care?

21:36 <dzwdz> because i'm considering splitting my libc in two

21:36 <dzwdz> well it already kinda is, but i'm considering making it more explicit

21:37 <mrvn> I have a libstring basically

21:37 <clever> dzwdz: newlib is designed like that, with the libc half being entirely free-standing, and then libgloss deals with the syscall half

21:37 <clever> and the user i think is supposed to replace gloss with their own thing

21:37 <heat> i'm not a fan of that

21:38 <heat> i'm a strong believer that a kernel should have its own libc

21:38 <heat> sharing code is not trivial

21:38 <heat> you'll find yourself doing #if __STDC_HOSTED__ #else #endif

21:38 <clever> yeah, LK also has its own libc, and people have asked me before why it doesnt just use newlib

21:38 <dzwdz> heat: dumb question: why not?

21:39 <heat> dzwdz, harder to read, harder to reason with

21:39 <heat> usefulness is kinda questionable

21:39 <gog> I'm a strong believer that the kennel should implement its own special API and not any of the standard other than freestanding

21:39 <netbsduser`> you only really need a portion of it anyway

21:39 <j`ey> and I guess kernel is way more restricted

21:39 <netbsduser`> the mem* and str* functions being most of it

21:39 <mrvn> gog: vdso to the rescue

21:40 <dzwdz> well i didn't ask only because i want to reuse it in the kernel

21:40 <gog> yes

21:40 <heat> gog, oh no, not kennels!

21:40 <gog> lol

21:40 <gog> autocorrect fail

21:40 <heat> dzwdz, if you wanna reuse stuff, reuse string functions

21:40 <dzwdz> i'm not linking my init binary against libc either, because i need a custom entrypoint

21:40 <heat> that's probably the best

21:41 <dzwdz> actually

21:41 <heat> and with string functions for instance, you'll still need to ifdef the kernel

21:41 <heat> because SSE, AVX can't be used in the kernel

21:41 <mrvn> heat: that kind of gets harder and harder

21:42 <mrvn> heat: you should ask geist about recent troubles with gcc vectorizing his kernel code.

21:42 <dzwdz> s/$/ nvm

21:42 <netbsduser`> dzwdz: in any case in netbsd there is a library `libkern' which incorporates files shared with libc, called `common libc sources'

21:42 <heat> mrvn, that's a different problem

21:42 <heat> i'm aware of it, I reported the original issue :)

21:43 <dzwdz> netbsduser`: thanks

21:43 <heat> lk has that issue because it doesn't disable SSE, AVX, etc codegen

21:44 <heat> it can't, as to support applets which do floating point, etc

21:44 <heat> it's not a traditional kernel which can disable it

21:44 <clever> LK (at least on arm) also uses lazy FPU context switching

21:44 <mrvn> heat: nah. the problem is that when you do you get ABI incompatible objects to the parts that need SSE/AVX equivalents on ARM/riscv

21:44 <clever> but LK also bans all FPU use in irq handlers

21:45 <netbsduser`> funny you should mention this, i built LittleKernel recently to see what all the fuss was about but it didn't go anywhere

21:45 <clever> so when it does a context switch, it just turns the FPU off, and leaves the state of a random thread in the FPU regs

21:45 <heat> mrvn, huh?

21:45 <clever> the FPU exception handler, then forces the FPU context switch, only when needed
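
The lazy scheme clever describes can be modelled without any real FPU. In this sketch a single `double` stands in for the whole FPU register file and all names are hypothetical; it is not LK's actual code.

```c
#include <stdbool.h>
#include <stddef.h>

struct thread {
    double fpu_state;   /* stand-in for the thread's saved FPU registers */
};

struct cpu {
    bool fpu_enabled;
    double fpu_regs;            /* stand-in for the live FPU registers */
    struct thread *fpu_owner;   /* whose state is in fpu_regs, or NULL */
    struct thread *current;
};

/* Context switch: leave the FPU contents alone, just turn the unit off. */
static void context_switch(struct cpu *c, struct thread *next)
{
    c->current = next;
    c->fpu_enabled = false;
}

/* FPU-disabled exception: swap state only if someone else owns the registers. */
static void fpu_trap(struct cpu *c)
{
    if (c->fpu_owner != c->current) {
        if (c->fpu_owner != NULL)
            c->fpu_owner->fpu_state = c->fpu_regs;   /* save old owner */
        c->fpu_regs = c->current->fpu_state;         /* load new owner */
        c->fpu_owner = c->current;
    }
    c->fpu_enabled = true;
}
```

A thread that is switched out and back in without anyone else touching the FPU pays no save/restore cost at all. Banning FPU use in irq handlers means the live registers are only ever touched by their owning thread.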

21:45 opal has joined #osdev

21:45 <netbsduser`> there was an invalid opcode exception, i got as far as identifying that it was some sse operation

21:45 <heat> netbsduser`, yes, been there done that

21:45 <mrvn> heat: you can't link against the applets that use fpu/simd

21:45 <heat> the build with gcc 12.1.0 is broken

21:45 <heat> well, no shit

21:45 <heat> <heat> it can't, as to support applets which do floating point, etc

21:46 <heat> netbsduser`, https://github.com/littlekernel/lk/issues/331

21:46 <bslsk05> github.com: GCC 12.1.0 x86-64 build is broken · Issue #331 · littlekernel/lk · GitHub

21:46 <mrvn> I have the same problem because I have no ELF loader in my kernel. Everything is just linked together even if some of it is user space apps.

21:47 <heat> anyway, that was not the original point

21:47 <heat> disabling SSE, AVX codegen is trivial

21:47 lkurusa has joined #osdev

21:47 <heat> it's less trivial when you add optimized versions of your routines

21:47 <heat> i.e memcpy which uses SSE and AVX

21:47 <j`ey> heat: cant you just disable avx or whatnot for the core kernel code?

21:47 <mrvn> or dumping the state of a thread including fpu context

21:47 <heat> you will not be able to use it in the kernel so you'll end up maintaining two routines

21:48 <heat> j`ey, as per the travis of the g "define some way in the build system to mark modules as 'may use fpu' and 'no fpu' and then segregate modules accordingly. Would work well except for shared bits like libc (printf for example)."

21:49 <mrvn> I really don't want to add a full soft-float implementation to the kernel just to printf() the FPU registers in a crash dump.

21:50 <mrvn> heat: If I segregated the modules into fpu and no-fpu then how do I link them together or load them?

21:50 <heat> why would you print your fpu registers in floating point format?

21:50 <mrvn> heat: so I can see it's 1.024

21:50 <heat> that's not useful

21:51 <heat> how can you know its 1.024 and not a random SIMD bit

21:51 <mrvn> heat: I show hex and decimal

21:52 <heat> mrvn, applets would run with fpu saving and restoring, core code would run with no fpu

21:52 <heat> boom, problem solved

21:52 <heat> core code never uses the fpu, applets use the fpu

21:53 <mrvn> heat: that doesn't help with 23:47 < heat> i.e memcpy which uses SSE and AVX

21:53 <heat> of course it doesn't

21:53 <mrvn> heat: or do you want a kernel memcpy and user memcpy?

21:53 <heat> yes

21:53 <heat> the solution isn't "enable the FPU inside the kernel"

21:53 <heat> it's "don't use SSE code inside the kernel"

21:53 <heat> thus making code sharing a bit dubious
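
What "a kernel memcpy and a user memcpy" sharing one source file could look like, as a minimal sketch. The function name `shared_memcpy` is hypothetical, and gating on `__STDC_HOSTED__` is just one possible switch; a project-specific macro would work the same way.

```c
#include <stddef.h>
#if __STDC_HOSTED__
#include <string.h>
#endif

static void *shared_memcpy(void *dst, const void *src, size_t n)
{
#if __STDC_HOSTED__
    /* Hosted build: defer to the platform's (possibly vectorized) memcpy. */
    return memcpy(dst, src, n);
#else
    /* Freestanding/kernel build: plain byte loop, no SIMD intrinsics or asm. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
#endif
}
```

The source is shared but it still has to be compiled twice, and the kernel copy would additionally need SIMD code generation turned off (for example `-mno-sse` with GCC or Clang on x86), since the compiler may otherwise vectorize the loop itself. With two bodies behind one name, very little is actually shared, which is heat's point.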

21:54 <mrvn> I have one rare case where I want a fast memcpy. When looking for huge pages I have to copy up to 2MB of memory. Might even be worth using DMA for that.

21:54 griddle has joined #osdev

21:54 <heat> at the end of the day, with a "proper libc", how much code will you share?

21:54 <mrvn> heat: at the moment the whole STL.

21:55 <mrvn> strings, lists, heaps, arrays, vectors, sort, ...

21:55 <heat> if you speed up string ops, you'll speed everything up

21:55 <mrvn> The really problematic part to share is printf.

21:55 <heat> oh that's also funny: g++/clang++ can't compile inline C++ code with floating point in no-FPU mode, even if you don't reference it

21:56 <heat> so sharing the STL is also problematic

21:56 <mrvn> heat: even with "if constexpr"?

21:56 <heat> if constexpr (what?)

21:56 <mrvn> if constexpr (HAVE_FPU == 1) or something

21:56 <heat> it spits itself when it sees a float in an argument

21:56 <heat> i have really funny hacks around that

21:57 <mrvn> oh yeah. if you do hard-float no-fpu you are screwed.

21:57 <mrvn> On ARM rpi-1 I did soft-float for the kernel.

21:57 <heat> https://github.com/heatd/Onyx/blob/master/kernel/Makefile#L25

21:57 <bslsk05> github.com: Onyx/Makefile at master · heatd/Onyx · GitHub

21:58 <heat> all of this because I wanted <type_traits>

21:58 <mrvn> heat: What I would like for a shared printf would be some "#pragma enable-fpu" and "#pragma disable-fpu"

21:59 <griddle> Came in late, are you talking about using the same code for the kernel and libc's printf and string routines?

21:59 <mrvn> if you reach the "%f" part of printf do the fpu things there. kernel never goes there.

21:59 <clever> that reminds me, LK's printf has a global float support flag

21:59 <mrvn> griddle: as an example, yes

21:59 <griddle> Hmm

21:59 <clever> GLOBAL_DEFINES += WITH_NO_FP=1

21:59 <clever> if i disable printf fpu support, then %f just prints a literal %f

21:59 <clever> and i think it sanely skips that entry in the va_args

22:00 <clever> but the gcc still does all of the float math

22:00 <griddle> for shared code I have a macro `_KERNEL` that I just check for. Still have to compile shared code twice :^)

22:00 <mrvn> clever: va_args does FPU stuff when you get a double from VA at any point in the function.

22:00 <mrvn> clever: if that is #ifdef-ed out then no fpu stuff

22:00 <clever> *looks*

22:00 <geist> something like that

22:01 <mrvn> i.e. the %f case gets replaced by print a literal %f

22:01 <clever> #if WITH_NO_FP

22:01 <clever> #define FLOAT_PRINTF 0

22:01 <geist> it calls into an inner function that generates the double string, but it ifdefs out the call to it, etc

22:01 <griddle> I mean, printf could print floats w/o float hardware right

22:01 <geist> that's still an issue in a mixed float/no float build. haven't decided what to do about that

22:01 <geist> griddle: that's the *real* answer. and it's of course a total bitch

22:01 <clever> s = double_to_string(num_buffer, sizeof(num_buffer), d, flags);

22:01 nyah has quit [Ping timeout: 240 seconds]

22:02 <clever> geist: yeah, this function is skipped

22:02 <mrvn> The problem is the argument parsing though. You could easily print float/double in the hex format if you can get at the value.

22:02 <geist> but it still has to deal with the calling convention

22:02 <clever> double d = va_arg(ap, double);

22:02 <clever> as is this one

22:02 <griddle> I think the real answer ought to be that the fpu registers should be read only in the kernel

22:02 <griddle> imo

22:02 <clever> so i think its not eating the va_arg things, and everything desyncs?

22:02 <geist> which may involve passing floats via floating point registers, and then even if you do all the work outside of the fpu you still have to deal with varargs and marshalling float args (or not)

22:02 <griddle> allowing the kernel to use xmmN or whatever means you have to include the old state in your trapframe

22:03 <mrvn> clever: if you have va_arg(ap, double) in the function then (on x86) it checks ax for the float bit and saves fpu regs to the buffer. Different things happen on ARM but it blows up at the function entry.

22:03 <clever> mrvn: in the past, i did have linux userland printf blowing up at function entry, because the FPU was disabled when linux started

22:03 <mrvn> So it doesn't matter if the format string actually has "%f" in it. It always saves fpu regs and fails
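
The approach clever and mrvn describe (compile the `%f` path out so the function never contains `va_arg(ap, double)`) can be sketched roughly as follows. This is a hypothetical mini formatter, not LK's printf: `mini_snprintf` and `put` are made-up names, only the `WITH_NO_FP` flag name comes from the discussion, and libc `snprintf` is borrowed for number conversion purely to keep the sketch short.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

#ifndef WITH_NO_FP
#define WITH_NO_FP 1   /* assumption: 1 means "build without float support" */
#endif

/* Append one character, always leaving room for the terminating NUL. */
static void put(char *buf, size_t cap, size_t *len, char c)
{
    if (*len + 1 < cap)
        buf[(*len)++] = c;
}

/* Hypothetical tiny formatter: understands %d, %s, %f and %% only. */
int mini_snprintf(char *buf, size_t cap, const char *fmt, ...)
{
    va_list ap;
    size_t len = 0;
    char tmp[64];

    va_start(ap, fmt);
    for (; *fmt; fmt++) {
        if (*fmt != '%') {
            put(buf, cap, &len, *fmt);
            continue;
        }
        fmt++;
        if (*fmt == '\0')
            break;                      /* lone '%' at end of format */

        const char *s = tmp;
        tmp[0] = '\0';
        switch (*fmt) {
        case 'd':
            snprintf(tmp, sizeof tmp, "%d", va_arg(ap, int));
            break;
        case 's':
            s = va_arg(ap, const char *);
            break;
        case 'f':
#if WITH_NO_FP
            /* No va_arg(ap, double) is compiled in, so the function body
             * never names a double. The argument is NOT consumed, so any
             * conversion after a %f would read the wrong slot. */
            s = "%f";
#else
            snprintf(tmp, sizeof tmp, "%f", va_arg(ap, double));
#endif
            break;
        default:                        /* covers "%%" and unknown letters */
            tmp[0] = *fmt;
            tmp[1] = '\0';
            break;
        }
        while (*s)
            put(buf, cap, &len, *s++);
    }
    va_end(ap);
    if (cap)
        buf[len] = '\0';
    return (int)len;
}
```

With `WITH_NO_FP=1`, `mini_snprintf(buf, sizeof buf, "x=%d %s %f", 42, "ok")` leaves `x=42 ok %f` in `buf`. Whether the compiler still emits FPU register saves at function entry depends on the target ABI and flags, which is the part the channel is arguing about.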

22:04 <mrvn> clever: exactly.

22:04 <clever> mrvn: LK's lazy FPU context switching, had left the FPU disabled, when i exec()'d linux, and linux then just assumed the FPU doesnt work

22:04 <clever> and then userland tried to use the FPU, and SIGILL!

22:04 <heat> what?

22:04 <griddle> yeah the var args calling convention requires saving all registers into memory in the order of register usage in the base calling convention

22:05 <clever> heat: LK disables the FPU when context switching, and leaves the FPU regs in a random state, from whatever thread last used it

22:05 <mrvn> Isn't there some hardware mode that allows reading FPU regs?

22:05 <clever> heat: upon an FPU exception, it then does the context switch for the FPU state, and retries that op
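
The lazy scheme clever describes (disable the FPU on context switch, swap FPU state only when a thread actually faults on it) can be modelled in plain C. This is a simulation under stated assumptions, not LK's code: `struct cpu`, `struct thread`, `fpu_trap` and the 4-register file are all invented for illustration, and the "trap" is an ordinary function call.

```c
#include <stdbool.h>
#include <stddef.h>

struct fpstate { double r[4]; };          /* toy register file */

struct thread { struct fpstate saved; };  /* per-thread saved FPU context */

struct cpu {
    bool            fpu_enabled;  /* models the hardware enable bit */
    struct fpstate  live;         /* models the physical FPU registers */
    struct thread  *fpu_owner;    /* whose values are currently in 'live' */
    unsigned        traps;        /* how many lazy faults were taken */
};

/* Context switch: do not touch the registers, just turn the FPU off. */
void context_switch(struct cpu *c)
{
    c->fpu_enabled = false;
}

/* Models the exception taken on the first FPU use after a switch. */
void fpu_trap(struct cpu *c, struct thread *cur)
{
    c->traps++;
    if (c->fpu_owner != cur) {
        if (c->fpu_owner)
            c->fpu_owner->saved = c->live;  /* save previous owner's state */
        c->live = cur->saved;               /* restore current thread's state */
        c->fpu_owner = cur;
    }
    c->fpu_enabled = true;                  /* then the faulting op is retried */
}

void fpu_write(struct cpu *c, struct thread *cur, int reg, double v)
{
    if (!c->fpu_enabled)
        fpu_trap(c, cur);
    c->live.r[reg] = v;
}

double fpu_read(struct cpu *c, struct thread *cur, int reg)
{
    if (!c->fpu_enabled)
        fpu_trap(c, cur);
    return c->live.r[reg];
}
```

The model also shows the hazard from the discussion: after `context_switch()` the FPU is left disabled with another thread's values in the registers, so whatever runs next (a chainloaded kernel, for instance) must either re-enable it or cope with the trap.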

22:05 <griddle> lazy fpu?

22:05 <heat> how is disabled = not work ?

22:05 <clever> heat: linux assumes that if its disabled upon entry, its disabled for a reason

22:05 <mrvn> griddle: that faults and then turns the FPU fully on

22:05 <clever> and just leaves it disabled

22:06 <mrvn> clever: not on ARM

22:06 <heat> that's so cursed

22:06 <clever> mrvn: yes, ive run into this exact problem on arm32

22:06 <mrvn> clever: but the FPU is off after the bootloader

22:06 <clever> mrvn: there are 2 separate flags, an on/off, and a trap/dont-trap, i believe

22:07 * clever gets link

22:07 <mrvn> clever: might also differ pre and post neon

22:07 <clever> https://github.com/littlekernel/lk/blob/master/arch/arm/arm/arch.c#L376-L377

22:07 <bslsk05> github.com: lk/arch.c at master · littlekernel/lk · GitHub

22:07 <clever> mrvn: this line of code must be called before you chainload linux, or the fpu just never works

22:08 <clever> https://github.com/littlekernel/lk/blob/master/arch/arm/arm/fpu.c#L38

22:08 <bslsk05> github.com: lk/fpu.c at master · littlekernel/lk · GitHub

22:08 <clever> write_fpexc(enable ? (1<<30) : 0);

22:08 <clever> __asm__ volatile("mcr p10, 7, %0, c8, c0, 0" :: "r" (val));

22:08 <geist> gosh that was such a long time ago

22:09 <clever> > Enable bit. A global enable for the Advanced SIMD and Floating-point Extensions:

22:09 <heat> the good old days where fpu code wasn't written

22:10 <heat> them youngins now have their fpus all configured for them

22:12 <clever> https://github.com/librerpi/rpi-open-firmware/blob/master/arm_chainloader/start.s#L121-L130

22:12 <bslsk05> github.com: rpi-open-firmware/start.s at master · librerpi/rpi-open-firmware · GitHub

22:12 <clever> mrvn: that is separate from this fpu enable code, which is touching NSACR and CPACR

22:12 <clever> and also fpexc

22:14 <heat> mrc p15, 0, r0, c1, c1, 0

22:14 <heat> you sure ARM isn't CISC? lol

22:14 <clever> ok, so first the NSACR bit, is enabling co-processors 10, 11, and 12, from non-secure state

22:14 <clever> heat: yes, i also hate that co-processor syntax

22:14 <clever> i think aarch64 still has it at the binary level, but the register names are known by the assembler

22:14 <clever> so its hidden from you

22:15 <j`ey> clever: you can still use that form

22:15 <clever> j`ey: but if you disassemble again, which form does it decode as?

22:15 <j`ey> clever: linux kernel does, because older binutils dont know about newer registers

22:15 <gog> the nascar bit enables left-turn-only mode

22:16 <j`ey> yeah it'll disassemble to a known name

22:16 <clever> j`ey: ah, as i expected, so the names are just pretty aliases, to hide the co-processor mess

22:16 <clever> and if objdump knows of the reg, it gives a pretty name

22:16 <clever> and at the binary level, its still the same mess

22:17 <clever> next, my code writes to CPACR, Coprocessor Access Control Register

22:17 <clever> 4 bits are set, starting at bit 20, which is something to do with cp10 and cp11

22:18 <clever> > Full access. The meaning of full access is defined by the appropriate coprocessor.
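
The three enable flags mentioned so far can be written down as bit masks. This is a host-side sketch that only computes register values; the helper names are hypothetical, and the actual coprocessor writes (MCR/VMSR on a real ARMv7 core) are deliberately left out. The bit positions used are CPACR[21:20] and CPACR[23:22] for cp10/cp11 (0b11 = full access), NSACR bits 10 and 11, and FPEXC.EN at bit 30, matching the `1<<30` quoted above.

```c
#include <stdint.h>

/* NSACR: allow non-secure access to cp10/cp11 (the VFP/NEON coprocessors). */
#define NSACR_CP10       (UINT32_C(1) << 10)
#define NSACR_CP11       (UINT32_C(1) << 11)

/* CPACR: two-bit access field per coprocessor, 0b11 = full access. */
#define CPACR_CP10_FULL  (UINT32_C(3) << 20)
#define CPACR_CP11_FULL  (UINT32_C(3) << 22)

/* FPEXC: global enable for the floating-point / Advanced SIMD extensions. */
#define FPEXC_EN         (UINT32_C(1) << 30)

/* Hypothetical helpers: take the old register value, return the new one. */
uint32_t nsacr_allow_vfp(uint32_t nsacr)
{
    return nsacr | NSACR_CP10 | NSACR_CP11;
}

uint32_t cpacr_allow_vfp(uint32_t cpacr)
{
    /* "4 bits are set, starting at bit 20" */
    return cpacr | CPACR_CP10_FULL | CPACR_CP11_FULL;
}

uint32_t fpexc_set_enable(uint32_t fpexc, int enable)
{
    return enable ? (fpexc | FPEXC_EN) : (fpexc & ~FPEXC_EN);
}
```

Only the last of these is what a lazy context switch would toggle; the first two are set once at boot.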

22:18 <geist> fpu on arm32 is such a nightmare too

22:18 <geist> i'm glad i forgot most of this

22:18 <geist> worse is you have to actually parse the invalid instruction exception to determine if it's a fpu or not

22:19 <clever> yeah, it looks like there are 3 separate enable flags, and LK is turning just one of them on/off when it context switches

22:19 <clever> and it will fault out when a thread first touches the FPU

22:19 <clever> and if any of those are disabled upon entering linux, it just leaves it disabled

22:19 <geist> okay, i guess it's not *that* bad: https://github.com/littlekernel/lk/blob/master/arch/arm/arm/faults.c#L119

22:19 <bslsk05> github.com: lk/faults.c at master · littlekernel/lk · GitHub

22:20 <clever> id still say thats ugly, that its trapped via the undefined opcode exception, just because an enable flag was turned off

22:20 <mrvn> geist: supporting soft float and hard float is worse.

22:20 <geist> yep. arm64 of course cleans all of this up and you get a nicely broken out exception syndrome

22:21 <clever> my rough understanding, is that hardfloat can pass floats via FPU regs

22:21 <mrvn> fpu isn't optional on arm64, right?

22:21 <clever> while softfloat can pass floats via the stack only

22:21 <clever> but, softfloat can still use the hw fpu, if you choose to, on a per-.o basis

22:21 <mrvn> clever: hard to pass args in float registers if you don't have an FPU

22:21 <geist> mrvn: it is, but not in practice. it's possible to build an extremely low end armv8 core with fpu optional, but i've never seen one in practice

22:21 <mrvn> geist: armv6

22:21 <clever> mrvn: thats the special bit, you can still use the FPU with soft-float!

22:22 <geist> mrvn: yes?

22:22 <mrvn> clever: sure, if you have one you can.

22:22 <clever> (i believe)

22:22 <mrvn> geist: ARMv6 frequently has no fpu

22:22 <geist> sure, but you explicitly asked about arm64

22:22 <clever> mrvn: so you could compile one .o file with the fpu enabled, and another with the fpu disabled, and then have a runtime if() to call the right variant of the functions

22:22 <mrvn> geist: oh, sorry

22:22 <geist> and arm64 == armv8. and thus my answer re: armv8

22:22 <clever> mrvn: and the rest of the codebase assumes no fpu, and uses floats on the stack

22:23 <mrvn> clever: that's what I did on the RPI1

22:23 <clever> hardfloat instead puts args in the fpu regs, so the ABI is different, and every function has to agree on the new ABI

22:23 <mrvn> RPi2 I think had neon FPU

22:23 <clever> pi1 also has an FPU

22:23 <clever> its just smaller, half the number of regs

22:23 <mrvn> clever: but an older one

22:23 <clever> yeah

22:23 <clever> and most people building for armv6 disabled FPU usage

22:23 <mrvn> neon was a big step forward in speed

22:23 <clever> because it was rare for v6 to have an FPU

22:24 <clever> thats what made the pi1 weird, and needing a special build

22:24 <mrvn> Raspian used soft-float

22:24 <mrvn> Debian ARM used neon

22:24 <mrvn> So even though you had an FPU it didn't have enough registers for the Debian ARM ABI.

22:25 <clever> yeah

22:25 <clever> i think armv7 had twice the fpu regs

22:25 <mrvn> I need one of those black-light fly buzzers.

22:26 <geist> there were a ton of variants of vfp, with varying levels of floating point regs, it was a real nightmare

22:26 <geist> by the later ends of v6 and v7 the defacto standard fpu was vfp3, and the defacto subset you could compile for was called vfpv3-d16

22:26 <geist> which explicitly limited itself to 16 double precision floating point regs. code compiled for that would work on all vfpv3s, even if they had say -d32 implemented

22:26 <clever> this is one area, where i think the VPU is far more sane, compare to arm/x86, there are no dedicated float registers!

22:27 <clever> any 32bit register can be either an int32 or a float32

22:27 <clever> the type, just depends on which opcode you use to interact with it

22:27 <geist> the calling convention explicitly threw out d16-d31 on calls, so code that didn't know it existed was okay, etc

22:27 <clever> load/store doesnt care about the type

22:27 <mrvn> clever: something they learned from SIMD

22:28 <geist> yep. i remember the SPU processors on a cell processor were like that too

22:28 <clever> one thing i still dont fully understand, where is the line drawn between vector and float stuff in arm?

22:28 <geist> simply 128 128-bit vector registers, integer or float. and there were scalar versions of all the instructions

22:28 <clever> VFP implies both vector and float?

22:29 <geist> yes. there was some previous floating point standard thing on arm called FPE i think, but it was the early days

22:29 <geist> VFP was an early attempt at vector bits on ARM. however, it wasn't vector in the simd sense

22:29 <clever> so is VFP both vector and float?

22:29 <geist> iirc it was vector in the 'repeat this N times' sense

22:30 <geist> i had rarely seen it used, not sure most compilers knew how to do it except maybe arms, but i'm sure one could write some asm to make use of it

22:30 <clever> ive looked at some arm64 vector code in gnuradio, and it was basically just support for loading a float[4] i think

22:30 <geist> yah arm64 is basically an extension of NEON, and NEON in the armv7 days extended and largely replaced VFPv3

22:30 <clever> so you could then do `float a[4], b[4], c[4]; for (int i=0; i<4; i++) { a[i] = OP(b[i], c[i]); }`

22:31 <clever> but there was so few vector registers, that its very load/store heavy

22:31 <mrvn> clever: even better if you mark them as vectors and then do a = b + c;

22:31 <mrvn> pre vectorizer solution

22:31 <geist> that's why i was saying the vfpv3-d16 was the common subset of all of them arm32 modern stuff. that calling convention and whatnot was the minimum standard hard fpu, so it was fairly common to use it as the baseline

22:31 <clever> mrvn: they were all wrapped up in intrinsics

22:32 <geist> and then individual code could be aware of additional registers and/or instructions

22:32 <clever> i just wrote it as scalar, to make it more understandable

22:32 <clever> let me find the source...

22:32 <geist> and d0-d15 were the only registers used to pass args and be saved

22:32 <geist> vfpv3-d32 was extended by NEON, which i think added even more registers

22:33 <geist> the v0-v31 regs, iirc, or maybe v0-v15? i forget

22:33 <moon-child> 32 64-bit regs

22:33 <moon-child> and they get paired up into 16 128-bit regs

22:34 <geist> there is still a mess of subset of fpu contexts in the cortex-m world, but in that universe it's much more common to compile the whole thing for a given subset

22:34 <clever> mrvn: https://github.com/gnuradio/volk/blob/main/kernels/volk/volk_32f_x2_dot_prod_32f.h#L892-L929

22:34 <bslsk05> github.com: volk/volk_32f_x2_dot_prod_32f.h at main · gnuradio/volk · GitHub

22:34 <clever> float32x4x4_t is just a struct, that holds a float[8] i think

22:34 <geist> moon-child: yeah arm32 floating point regs had a bad pairing thing that was confusing as heck. a thing they fixed in arm64, where they are a strict subset

22:34 <geist> ie, s0 = d0 = q0

22:34 <clever> or was it float[16]

22:34 <clever> i think 16

22:34 <geist> vs [s0,s1] = d0, [s2,s3] = d1 like you got with arm32

22:35 <clever> but at the hardware level, its only float[4]

22:35 scaleww has joined #osdev

22:35 <clever> and the float[16] is purely to give you a bit of pipelining, for when you dont use too many regs

22:35 <clever> so you can create a virtual vector that is bigger

22:36 <clever> the other thing i notice here, is the lack of a vectorized sum

22:36 <clever> so this winds up creating 4 sums, from each lane

22:36 <clever> and then has to finish the job in scalar mode
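
The shape clever is describing (one running sum per lane, then a scalar finish) looks roughly like this in portable C. It is a scalar emulation with an assumed lane count of 4, not the volk kernel and not NEON intrinsics; `dot_lanes` and `LANES` are illustrative names.

```c
#include <stddef.h>

#define LANES 4   /* assumption: float[4] hardware vectors */

float dot_lanes(const float *a, const float *b, size_t n)
{
    float acc[LANES] = {0};   /* one partial sum per lane */
    size_t i = 0;

    /* "Vector" part: each pass handles LANES elements, lane l only ever
     * accumulates elements whose index is congruent to l mod LANES. */
    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            acc[l] += a[i + l] * b[i + l];

    /* No vectorized sum: fold the per-lane sums in scalar code. */
    float sum = 0.0f;
    for (size_t l = 0; l < LANES; l++)
        sum += acc[l];

    /* Scalar loop for the leftover n % LANES elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Because the additions happen in a different order than in a plain scalar loop, results can differ in the last bits for general floating-point inputs.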

22:37 <mrvn> clever: ARM64 has that

22:37 <clever> SVE right?

22:37 <clever> i think thats hw support for virtual vectors of a larger size

22:38 <clever> while this code is purely software level support

22:38 <geist> yeah SVE is a new extension to NEON/ASIMD that lets you define up to i think 2048 byte vectors

22:38 <geist> and then somewhat dynamically declare how wide you want to do your work in

22:38 <clever> and in theory, future chips can run the same job in fewer clocks

22:38 <clever> without having to rewrite your asm every time

22:38 <geist> that's exactly right

22:39 <clever> the volk code for example, is only using 1/4th of the vector registers, because it only has float[4] at the hw level

22:39 <geist> i think 256 is the minimum, and probably what most will implement, but iirc the fujitsu cpu for some supercomputer is using 768 byte hardware vectors or something

22:39 <clever> and its trying to let the pipeline do its job, by operating on a float[4][4]

22:39 <clever> but its just doing 4 vector loads, 4 vector mults, and then 4 vector stores

22:40 <clever> so when the vector core doubles in size next year, you have to rewrite this to work on chunks of float[4][4][2]

22:40 <mrvn> changing vector size was costly though I think.

22:41 <clever> as an example:

22:41 <clever> float a[n], b[n]; float sum = 0; for (int i=0; i<n; i++) { sum += a[i] * b[i]; }

22:42 <mrvn> clever: it reminds me of "rep" on x86

22:42 <clever> mrvn: for this code, the vector width doesnt really matter, just do as many mults in parallel as you can, sum them all up, and your done

22:43 <clever> and the best the above volk code can do, is load 16 a's, load 16 b's, load 16 accumulators, then do a fused mult+add, acc += a*b;

22:43 <clever> then store 16 accumulators, and repeat

22:43 <mrvn> clever: with the virtual vectors can you keep anything in registers between operations? Like a+b+c+d+e+f would store 4 temporaries to ram

22:44 <clever> ive not read the SVE specs yet

22:45 <mrvn> I don't want each "+" to go over the whole vector but do the 5 "+" on one register load and then repeat for the vector size.

22:45 <clever> VPU vectors, are instead always 16 lanes wide, and has a REP flag, that can repeat it a power of 2 times between 1 and 64, or use a scalar reg (no power of 2 limit)

22:45 <mrvn> clever: yeah, but that needs to store the temporaries

22:46 <clever> for the VPU, it doesnt, the vector registers can hold the entire int16_t[1024] at once

22:46 <clever> 2 of them infact

22:46 <clever> its got a whole 4096 bytes of vector registers

22:47 <mrvn> clever: that allows doing a simple accumulate like my example. 2 regs are rather limited for more complex cases though.

22:47 <clever> and you can do the entire load of the int16_t[1024] in a single opcode, which can axi burst properly and saturate the dram bus

22:48 <mrvn> clever: I think the virtual vector size extension allows you to implement a loop over the vector size, advancing by the actual vector size of the hardware each loop. So this years cpu adds 16 each loop, next years cpu adds 32 for the same opcodes.

22:49 <clever> yeah, i could see how that might work

22:49 <clever> just query how big the vector is, and allow the reg# to come from a scalar

22:49 <clever> so you can index into register[n]

22:49 <mrvn> and load/store in the native register size, whatever that is

22:50 <clever> that is also possible on the VPU, you can use immediate+register as a coord into the vector register bank

22:52 <clever> mrvn: if true, then SVE is less of a virtual vector size, and more of a way to query how big the VFP really is, and iterate over the registers based on a scalar reg, so you can use more regs

22:52 <clever> so instead of hard-coding it to do something 4x, you instead have a loop within a loop

22:53 <mrvn> clever: yeah, more of a self adjusting loop size
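
The "self adjusting loop size" idea can be sketched as a loop that asks how many lanes the hardware has and advances by that amount, so the same code takes fewer iterations on a wider machine. This is a plain C model of the idea as mrvn guesses at it, not SVE intrinsics: `hw_lanes` stands in for whatever the hardware would report, and the clamped `active` count plays the role of a predicate for the final partial vector. All names are hypothetical.

```c
#include <stddef.h>

#define MAX_LANES 64   /* assumption: upper bound for this model only */

float dot_vla(const float *a, const float *b, size_t n, size_t hw_lanes)
{
    float acc[MAX_LANES] = {0};

    if (hw_lanes == 0)
        hw_lanes = 1;
    if (hw_lanes > MAX_LANES)
        hw_lanes = MAX_LANES;

    /* Advance by the hardware vector length on every iteration. */
    for (size_t i = 0; i < n; i += hw_lanes) {
        /* Lanes past the end of the data are simply switched off,
         * so no separate scalar tail loop is needed. */
        size_t active = (n - i < hw_lanes) ? (n - i) : hw_lanes;
        for (size_t l = 0; l < active; l++)
            acc[l] += a[i + l] * b[i + l];
    }

    float sum = 0.0f;
    for (size_t l = 0; l < hw_lanes; l++)
        sum += acc[l];
    return sum;
}
```

Calling it with `hw_lanes` of 4, 8 or 16 runs the same source with 3, 2 or 1 outer iterations for a 10-element input.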

22:53 <clever> yep

22:53 <mrvn> but that's what you want.

22:53 <clever> volk is doing similar

22:53 <clever> its creating a virtual 16 lane vector, by running the 4lane vector 4x

22:53 <mrvn> You have 8 regs and the size is variable

22:54 <clever> and then it dynamically changes how many loops of the 16lane vector it runs

22:54 <clever> and then does a scalar loop at the end, to deal with the leftover

22:54 <mrvn> They should have done that with AVX and AVX512 would run the same code twice as fast

22:55 <clever> part of what you need there though, is for opcodes to be far more async

22:55 <mrvn> how do you mean?

22:55 <clever> so you can issue 4 vector loads back2back, and the cpu wont stall out on just the first one

22:55 <clever> while the 1st is doing a fetch, the 2nd, 3rd, and 4th should add to the fetch queue

22:55 <clever> to keep the bus saturated

22:56 <mrvn> clever: the cpus loop unrolling (branch predictor) does that

22:56 <clever> yeah

22:56 <mrvn> does anyone prefetch anymore?

22:58 <heat> yes

22:58 <heat> micro-optimized string.h code for instance

22:58 <clever> i saw a blog post before, about how a memcpy loop, had a prefetch opcode in it

22:58 <mrvn> I remember on alpha they had extra opcodes just for saying "I'm going to load <address> in the next loop"

22:58 <clever> so as its copying some bytes, the cpu can be pre-fetching future bytes

22:59 <clever> yeah, it was that kind of thing

22:59 justmatt has quit []

22:59 <clever> a purely async opcode, that produces no results, but primes the d-cache
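
A copy loop with a prefetch hint of the kind being discussed might look like the sketch below. It assumes GCC or Clang (for `__builtin_prefetch`), and the chunk size and prefetch distance are arbitrary illustrative numbers, not tuned values from any real memcpy.

```c
#include <stddef.h>

#define CHUNK           64    /* assumption: roughly one cache line */
#define PREFETCH_AHEAD  256   /* assumption: how far ahead to hint */

void copy_with_prefetch(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;

    while (i < n) {
        size_t chunk = (n - i < CHUNK) ? (n - i) : CHUNK;

        /* Hint only: produces no result, just asks the core to start
         * pulling a future source line into the data cache. Guarded so
         * the address stays inside the source buffer. */
        if (n - i > PREFETCH_AHEAD)
            __builtin_prefetch(src + i + PREFETCH_AHEAD, 0 /* read */, 0);

        for (size_t j = 0; j < chunk; j++)
            dst[i + j] = src[i + j];

        i += chunk;
    }
}
```

The hint never changes the result of the copy; it can only change how long the loads take.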

22:59 <clever> however, there was a fatal hw bug, if there was a TLB miss, the prefetch just gives up

22:59 <clever> and the fetch happens later, at dcache miss, which stalls the cpu

22:59 <mrvn> is that still a thing with AVX or just for regular register sizes data?

23:00 <clever> the blog post i read, was about one of the xbox models

23:00 <clever> and they "fixed" the problem by just using larger pages

23:00 <mrvn> ppc?

23:00 <clever> so a TLB entry covers more bytes

23:01 <mrvn> clever: it always sucks when your code suddenly drops to half speed because your data lands on a page boundary.

23:02 <clever> yep

23:02 <clever> and they fixed it by just having fewer page boundaries

23:03 <clever> mrvn: https://randomascii.wordpress.com/2022/01/12/5-5-mm-in-1-25-nanoseconds/

23:03 <bslsk05> randomascii.wordpress.com: 11 mm in 1.25 nanoseconds | Random ASCII – tech blog of Bruce Dawson

23:03 <geist> i dont see the prefetch stuff as much on modern arm64 stuff, but i think it's assumed that the more modern cores are better at prefetching on their own anyway

23:04 <griddle> Is it best to read other kernels to learn how arm64 works? The docs aren't fantastic for what I can tell

23:04 <geist> dumber arm32 bits from 10-15 years ago i remember it being quite essential to get good looping performance for memcpies and whatnot

23:04 <clever> geist: what happens if you do say a load opcode, but then never use the resulting register?

23:04 <mrvn> geist: they just fill the pipeline with the next loops fetches before the first loop finishes. Add to that the fetch predictor...

23:04 <geist> griddle: perhaps, but it's complicated enough that i doubt you'll learn from just reading code

23:04 <griddle> lots of the docs kinda feel like I should already know how it all works

23:04 <clever> geist: could the cpu core keep on going without stalling for that cache miss?

23:04 <heat> griddle, arm64 docs are garbage but mindlessly reading code won't get you anywhere

23:04 <clever> and that is effectively a prefetch opcode

23:04 <heat> well, not garbage

23:04 <geist> clever: probably

23:04 <heat> but written for hardware engineers maybe

23:05 <griddle> well I'd be implementing the port on my end as well

23:05 <geist> sure, it's a combination of everything. also we can help if you have questions

23:05 <mrvn> clever: if you avoid a register register dependency, or load into the zero register

23:05 <geist> though we'll generally refer you to the manual for lots of things after pointing you in the right direction

23:05 <griddle> Appreciate it

23:05 <griddle> Yeah I'll hack at it for a bit

23:06 <mrvn> geist: is load into zero register allowed?

23:06 <geist> on arm64?

23:06 <mrvn> yep

23:06 <geist> depends on if the xzr encoding is reused for something else

23:06 <griddle> the kernel already builds for arm64, but doesn't work :) I set that up to test if my kernel was "portable" after also abstracting for a risc-v port

23:06 <mrvn> seems like the perfect place to hide a pre-fetch. Load and throw away.

23:06 <geist> in some instructions it is, acting as the stand-in for the PC or SP register. possibly here the xzr register encoding refers to SP

23:07 <geist> i do vaguely remember some discussion on the riscv irc channel or forum or whatnot as to whether or not a cpu is allowed to elide a load into the x0 register there. i forget the answer

23:08 <griddle> rv doesn't have side effect regs, so I figure it's fine to do that

23:08 <geist> but i suspect the answer on arm is even if it was allowed, it probably wouldn't encode as such

23:08 <mrvn> geist: pre-fetch could be totally optional.

23:08 <geist> since loads *do* have side effects, in the sense that it can page fault, etc

23:09 <mrvn> geist: yeah. If you have speculative execution you would pre-fetch to probably use the same path as speculative load

23:09 <mrvn> (and add one more side channel vector)

23:10 <geist> re: prefetching there are actually instructions for that with a fair amount of flexibility

23:10 <geist> so you'd almost certainly just use those

23:10 <clever> mrvn: and thats the cause of another bug from the randomascii blog

23:10 <clever> mrvn: there was a special opcode, that would prefetch in a non-coherent way, disabling the write-back, so it wouldnt steal the cacheline from another core

23:11 <clever> and speculative execution would still do the non-coherent load, and poison the resulting cache-line

23:11 <clever> and then the whole cache-line is just lost upon eviction

23:11 <mrvn> ... and what if it's a speculative pre-fetch?

23:11 <clever> so, despite that being gated behind an if statement, it still had an effect

23:11 griddle_ has joined #osdev

23:12 <clever> mrvn: yeah, a speculative execution of a non-coherent prefetch, resulted in a non-coherent cache line

23:12 <clever> so basically, it was: bool noncoherent=false; if (noncoherent) non_coherent_prefetch(foo);

23:12 <clever> and despite the fact that it was never true, it was still non-coherent!

23:12 griddle has quit [Ping timeout: 268 seconds]

23:13 griddle_ has left #osdev [#osdev]

23:13 griddle_ has joined #osdev

23:13 griddle_ has left #osdev [#osdev]

23:13 <clever> the nasty part, is that this was corrupting the malloc state, causing it to assert() out

23:13 <clever> but the core dump then read ram, and claimed the assert couldnt have possibly fired

23:13 griddle has joined #osdev

23:13 griddle has left #osdev [#osdev]

23:13 griddle has joined #osdev

23:14 <geist> anyway FWIW i just checked and indeed you can load into xzr. there's no special encoding on the target/source register. there *is* special encoding on the base register, the 'xzr' encoding (31) is interpreted as the SP instead of a xN register

23:14 <clever> mrvn: so the coredump says the assert cant have fired, but the stack trace says it did!!

23:16 <mrvn> clever: much more fun is the MIPS cpu resetting on a certain opcode sequence and binutils emitting that sequence with a minor update for common code.

23:16 <clever> mrvn: ouch

23:22 lkurusa has quit [Quit: I probably fell asleep (or went out). Who will ever know.]

23:32 scaleww has quit [Quit: Leaving]