Bitweasil has quit [Remote host closed the connection]
Bitweasil has joined #osdev
flx-- has quit [Remote host closed the connection]
flx-- has joined #osdev
dude12312414 has joined #osdev
Piraty has quit [Ping timeout: 240 seconds]
dude12312414 has quit [Remote host closed the connection]
Piraty has joined #osdev
<kazinsal>
forgive me #osdev for I have sinned
<kazinsal>
I wrote python for work today
<papaya>
ngl I don't see a sin in that
<kazinsal>
I wasn't told to do something in Python. I just thought, maybe it'd be easiest to write a script in Python instead of doing all of this by hand
<heat>
i rate python
<eryjus>
nothing like a pissed off snake....
<klange>
kaz i wrote _a_ python for my os, ain't nothin evil about writing python!
<heat>
the only thing I like more than writing C/C++ is not writing C/C++ because I can do it in python
<GreaseMonkey>
i work on embedded software, and i never, ever call it "C/C++"
<GreaseMonkey>
they're two different languages, i did end up liking C++ temporarily but hated it again once i realised i had a mess i revisited that i couldn't clean up without a lot of drudgery
<gog>
i'm sinning right now and wrote a fake plugin for a unit test because i thought it would satisfy the gitlab ci service
<gog>
it did not
<gog>
WONTFIX:WORKSFORME
<heat>
GreaseMonkey, they're two very similar languages in a lot of aspects
<GreaseMonkey>
in practice, however, they're quite different
<papaya>
I work mostly with Java/Spring on web applications
kkd has quit [Remote host closed the connection]
<heat>
if you go into template hell? sure, super different
<papaya>
I agree C and C++ have diverged too much to call it "C/C++"
<GreaseMonkey>
C also doesn't use the C++ smart pointers thing
<GreaseMonkey>
and quite frankly for personal projects i prefer to use Zig these days
<moon-child>
C/C++ is undefined behaviour
<heat>
neither does C++?
<moon-child>
because / is not a sequence point
<papaya>
C++ is a mess I like C more
<moon-child>
and C++ mutates C
<GreaseMonkey>
C++11 has the shared_ptr thing or whatever's current
<heat>
C code is a mess, I like C++ more
<moon-child>
ergo, if you write C/C++, I get to put demons in your nose
<heat>
raii is highly superior to goto hell
<kazinsal>
I am a mess, therefore I enjoy C
<gog>
i'm sharpening my C++ teeth again
<gog>
enjoying it
* papaya
hands gog a tooth sharpener.
eddof13 has joined #osdev
<gog>
^w^
<no-n>
uwu
<heat>
weird that your teeth are C++
<heat>
mine are based and rustpilled
<gog>
cat::teeth
<klange>
oops i had a nested log issue in kuroko and it didn't show up on linux because glibc's pthread locks support nesting...
<heat>
don't you need to enable them explicitly?
<heat>
oh, TIL EDEADLK
<heat>
I was thinking of PTHREAD_MUTEX_RECURSIVE
<klange>
not mutexes
Mutabah has quit [Ping timeout: 246 seconds]
Mutabah has joined #osdev
<heat>
hmm, EDEADLK in glibc/musl is only returned for PTHREAD_MUTEX_ERRORCHECK mutexes
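[A minimal sketch of the behavior heat describes, using only standard pthreads calls: an error-checking mutex reports a same-thread relock with EDEADLK instead of deadlocking. Default-type mutexes make no such promise.]

    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>

    int main(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutex_t lock;

        pthread_mutexattr_init(&attr);
        /* Error-checking type: relocking reports EDEADLK instead of hanging. */
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
        pthread_mutex_init(&lock, &attr);

        pthread_mutex_lock(&lock);
        if (pthread_mutex_lock(&lock) == EDEADLK)
            printf("relock detected: EDEADLK\n");

        pthread_mutex_unlock(&lock);
        pthread_mutex_destroy(&lock);
        pthread_mutexattr_destroy(&attr);
        return 0;
    }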
nyah has quit [Remote host closed the connection]
<klange>
this whole thing is probably broken anyway, need to fix a lot of thread support stuff in kuroko that plays it too fast and loose
<klange>
also need to write better reader-writer locks for toaru anyway
eddof13 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<heat>
reject fine grained locking
<heat>
embrace the GIL like a good python interpreter
<klange>
I refuse to take the (GI)L.
<gog>
welp
<gog>
i got the unit test to work
<gog>
on the CI server
<gog>
all i needed to do was put the plugins in a build directory where the test binary could find them
tomaw_ has joined #osdev
bauen1_ has joined #osdev
bauen1 has quit [Ping timeout: 240 seconds]
Dreg has quit [Quit: Dreg]
smeso has quit [Ping timeout: 272 seconds]
ckie has quit [Ping timeout: 272 seconds]
ckie has joined #osdev
Dreg has joined #osdev
smeso has joined #osdev
tomaw has quit [Ping timeout: 604 seconds]
tomaw_ is now known as tomaw
gog has quit [Quit: byee]
<moon-child>
who the fuck is outside scraeming global interpreter lock. show yourself coward. i will never lock the interpreter
<heat>
GLOBAL INTERPRETER LOCK
<klange>
IF PYTHON BANS ME FOR NOT LOCKING THE INTERPRETER I WILL FACE GUIDO AND WALK BACKWARDS INTO HELL.
Arthuria has quit [Ping timeout: 260 seconds]
<kazinsal>
what guido doesn't want you to know is that p-langs work on highlander rules so if you successfully overtake python in popularity, even for a moment, you steal his powers
<gorgonical>
and from that point on only you have the power to write PEPs
<klange>
we call them KEKs - Kuroko Enhancement Krokosals
<heat>
>kek
* heat
keks in uefi
<klange>
oh right let's rebuild that
<klange>
I should do module builds for that...
<heat>
is KEK-001 an interface for manipulating the KEK?
<kazinsal>
proposal for how to submit proposals
<heat>
no
<heat>
the key exchange key
<heat>
the best key ever
<heat>
the keyest of keys
<kazinsal>
(ten minutes later) we regret to inform you that the key exchange key has been compromised
<heat>
top kek
srjek has quit [Ping timeout: 240 seconds]
<Clockface>
are linux signals implemented as an interrupt?
<Clockface>
well, likely not now that i think about it?
<Mutabah>
yes and no...
<Clockface>
does it jump somewhere else next-timeslice?
<Mutabah>
Some signals are triggered by an interrupt (e.g. timer based ones)
<Clockface>
that makes sense, i guess its indirectly based on an interrupt since everything shares a hardware interrupt
<Clockface>
well, sometimes its a hardware interrupt
<Mutabah>
I've not looked at how linux does it, but I'd implement signals by having a check in the syscall stub that checks if a signal is pending, and returns to a different userland location if it is
<Clockface>
im not implementing them yet, i just have been using them recently and started thinking how linux does it
<heat>
you check for pending signals at specific moments
<klange>
While signals and interrupts are conceptually similar, they have nothing to do with each other in implementation.
<heat>
inside/around interruptible waits, in irq entries and exits, in all other kernel entry points and exits, etc
<heat>
you can theoretically IPI another processor to reduce signal latency but I don't think anyone does that
<klange>
Mutabah: Yeah, that's basically how Linux does it, though any return to userspace can trigger a signal, not just syscall returns - means you can signal something that isn't making syscalls, eg that broken thing doing an infinite loop.
<heat>
the only way a signal triggers an interrupt is if it wakes an interruptible thread that has a higher prio than the current thread (so it wants to schedule it out)
<heat>
the canonical way to do that scheduling out is through an IPI
<Mutabah>
klange: Yeah, I was too lazy to add in "and interrupt handlers"
<klange>
I just redid my signal stuff recently.
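[A sketch of the dispatch point heat and Mutabah describe, with invented names throughout (exit_to_user_mode, push_signal_frame, etc. are illustrative, not from any particular kernel). The point: signal delivery is a check on every kernel-to-user transition, syscall return, IRQ exit, fault return, not an interrupt of its own.]

    #include <stdint.h>

    struct regs { uint64_t ip, sp; /* saved user registers, trimmed */ };

    struct thread {
        uint64_t sig_pending;   /* one bit per signal, as Linux keeps it */
        uint64_t sig_blocked;
    };

    /* Real kernel services; stubs so the sketch is self-contained. */
    extern struct thread *current_thread(void);
    extern int  pick_next_signal(struct thread *t);
    extern void push_signal_frame(struct thread *t, int sig, struct regs *r);

    /* Called on every path back to userspace. */
    void exit_to_user_mode(struct regs *regs)
    {
        struct thread *t = current_thread();

        while (t->sig_pending & ~t->sig_blocked) {
            int sig = pick_next_signal(t);
            t->sig_pending &= ~(1ull << (sig - 1));
            /* Rewrite the saved user registers so the thread resumes in
             * its handler, with siginfo_t/mcontext_t pushed on its stack. */
            push_signal_frame(t, sig, regs);
        }
    }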
terrorjack has joined #osdev
<heat>
the only big thing I'm missing is restartable syscalls
<heat>
re: restartable, something cool I wanted to add: restartable sequences
<klange>
I've got those for a couple of cases. Haven't done the fancy time-based stuff Linux supports.
knusbaum has quit [Ping timeout: 260 seconds]
knusbaum has joined #osdev
skipwich_ has quit [Quit: DISCONNECT]
<heat>
i should revisit my scheduler
skipwich has joined #osdev
<heat>
it's just a round robin with priorities
kkd has joined #osdev
<klange>
I don't even have priorities.
<klange>
Granted, I don't really want to add them or improve it, as at this point it exists as a demonstration of "round robin with no priorities is actually good enough".
<klange>
It's my response to multiple semesters of operating systems / systems design courses spending way too much time on scheduling back in my uni days.
<klange>
The _one_ improvement I would consider is something with core affinity.
<heat>
i was thinking of something like a multilevel feedback queue
<heat>
it would be a simple-ish thing
<heat>
and better than what I have
<heat>
i could also look into fuchsia's fair scheduler
<klange>
oh there is one other thing I should probably fix which is that my time slicing for preemption is totally yolo
<klange>
on x86 at least, it's not "you get 'n' units of time before you're getting scheduled out", it's "every 'n' units of time, something gets scheduled out" because it's periodic timers that aren't getting reset
<heat>
hm?
<heat>
my scheduler's heartbeat just decrements the current quantum
<heat>
on a switch it's reset to 10ms
<klange>
that sounds better than what I do
<klange>
I don't even remember off-hand what my periodic timers are set to
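[heat's heartbeat in miniature; names are invented and the 10ms figure is from the chat. Unlike a free-running periodic timer, the tick only decrements, and the quantum is reloaded on every switch, so a freshly scheduled thread always gets a full slice.]

    #define QUANTUM_TICKS 10   /* 10ms at a 1kHz tick, per heat's numbers */

    struct thread { int quantum; /* ... */ };

    extern struct thread *current_thread(void);
    extern void reschedule(void);   /* picks the next runnable thread */

    /* Timer IRQ handler: the scheduler's heartbeat. */
    void sched_tick(void)
    {
        struct thread *t = current_thread();
        if (--t->quantum <= 0)
            reschedule();
    }

    /* Called by reschedule() once a new thread is chosen. */
    void sched_switch_to(struct thread *next)
    {
        next->quantum = QUANTUM_TICKS;   /* reset on every switch */
        /* actual context switch elided */
    }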
smeso has quit [Quit: smeso]
smeso has joined #osdev
<kazinsal>
I think my PIT is set to 60 Hz despite my OS not having any concept of graphics output...
<geist>
yah i do something pretty similar to heat in LK. works pretty well
<geist>
depending on the platform it may only have periodic timers or dynamic ones, but it's the same result
<heat>
all my timers have periodic support but its never used anywhere
<heat>
mostly because all of the timers I've found have oneshot
<geist>
yah i've used periodic from time to time on various platforms. generally very low end ones where the cost of recomputing the next event is generally more expensive than just a periodic timer
<geist>
but it's a build option, so easy enough to switch back and forth
freakazoid12345 has quit [Ping timeout: 256 seconds]
heat has quit [Ping timeout: 252 seconds]
Oli_ has quit [Quit: leaving]
Oli has quit [Quit: leaving]
no-n has left #osdev [#osdev]
Likorn has joined #osdev
sonny has joined #osdev
Arthuria has joined #osdev
skipwich has quit [Ping timeout: 260 seconds]
skipwich has joined #osdev
Likorn has quit [Quit: WeeChat 3.4]
kkd has quit [Remote host closed the connection]
Arthuria has quit [Killed (NickServ (GHOST command used by guest2795))]
Arthuria has joined #osdev
Arthuria has quit [Ping timeout: 246 seconds]
sonny has quit [Ping timeout: 250 seconds]
nur has joined #osdev
the_lanetly_052_ has joined #osdev
<mrvn>
klange: I found that round-robin improves a lot when you add move-to-front. Like, when some task that's been sleeping longer than one round gets woken up, it gets put to the front.
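[A sketch of mrvn's move-to-front rule on a singly linked run queue; the names and the round-length bookkeeping are invented. A wakeup that has slept longer than one full round jumps the queue; everyone else goes to the back as usual.]

    #include <stddef.h>
    #include <stdint.h>

    struct thread {
        struct thread *next;
        uint64_t slept_since;   /* tick when the thread went to sleep */
    };

    struct runqueue {
        struct thread *head, *tail;
        uint64_t now;           /* current tick */
        uint64_t round_ticks;   /* how long one full round takes right now */
    };

    static void rq_push_tail(struct runqueue *q, struct thread *t)
    {
        t->next = NULL;
        if (q->tail) q->tail->next = t; else q->head = t;
        q->tail = t;
    }

    static void rq_push_head(struct runqueue *q, struct thread *t)
    {
        t->next = q->head;
        q->head = t;
        if (!q->tail) q->tail = t;
    }

    /* mrvn's rule: slept longer than one round -> front of the queue. */
    void rq_wake(struct runqueue *q, struct thread *t)
    {
        if (q->now - t->slept_since > q->round_ticks)
            rq_push_head(q, t);
        else
            rq_push_tail(q, t);
    }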
<mrvn>
geist: Not sure I ever had a case where "computing" the time till the next event is costly. That's just looking at the top of the heap. But reprogramming a timer to fire at the right time might be costly or, like the PIT, plain impossible.
<mrvn>
Well, strike that, just thought of a case. :) For the network code I have a timeout algorithm with bins: 2 ticks, 4 ticks, 8 ticks, 16 ticks. Every tick one of them is processed fully and each item put into the bin for the remaining time. No way to know when the next event happens there except it's longer than a tick.
<mrvn>
It's kind of a bucket sort amortized over time.
<mrvn>
The idea is that most timers will be removed before they even get processed once, >99% get removed before they expire. So add/remove is O(1) and expiring is O(log duration).
<mrvn>
Linux has pretty much the same.
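[A sketch of mrvn's binned timeouts, with invented names and one simplification: here bin k is swept every 2^k ticks (exactly one bin per tick), survivors are re-binned by remaining time, and an entry can fire up to one sweep period late; real timer wheels cascade more carefully.]

    #include <stddef.h>
    #include <stdint.h>

    #define NBINS 4                 /* bins cover 2, 4, 8, 16 ticks */

    struct timeout {
        struct timeout *next;
        uint64_t deadline;          /* absolute tick, must be in the future */
        void (*fire)(struct timeout *);
    };

    static struct timeout *bins[NBINS];
    static uint64_t now;

    /* Bin k holds entries due within roughly 2^(k+1) ticks. */
    static int bin_for(uint64_t remaining)
    {
        int k = 0;
        while (k < NBINS - 1 && remaining > (2ull << k))
            k++;
        return k;
    }

    /* O(1) insert; with a doubly linked list removal is O(1) too, which is
     * the common case mrvn cites (>99% removed before expiring). */
    void timeout_add(struct timeout *t)
    {
        int k = bin_for(t->deadline - now);
        t->next = bins[k];
        bins[k] = t;
    }

    /* Once per tick: sweep exactly one bin, chosen by the low zero bits of
     * the tick counter, so bin k is visited every 2^k ticks. Survivors get
     * re-binned by remaining time -- the bucket sort amortized over time. */
    void timeout_tick(void)
    {
        now++;
        int k = 0;
        for (uint64_t v = now; k < NBINS - 1 && !(v & 1); v >>= 1)
            k++;

        struct timeout *t = bins[k];
        bins[k] = NULL;
        while (t) {
            struct timeout *next = t->next;
            if (t->deadline <= now)
                t->fire(t);         /* expired: O(log duration) touches total */
            else
                timeout_add(t);
            t = next;
        }
    }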
bauen1_ is now known as bauen1
gog has joined #osdev
Likorn has joined #osdev
Likorn has quit [Client Quit]
no-n has joined #osdev
C-Man has quit [Ping timeout: 260 seconds]
wand has quit [Ping timeout: 240 seconds]
GeDaMo has joined #osdev
Vercas has quit [Quit: buh bye]
Vercas has joined #osdev
gog has quit [Quit: byee]
bauen1 has quit [Ping timeout: 252 seconds]
the_lanetly_052_ has quit [Ping timeout: 245 seconds]
dennis95 has joined #osdev
Vercas has quit [Remote host closed the connection]
Vercas has joined #osdev
wereii has quit [Ping timeout: 240 seconds]
Burgundy has joined #osdev
C-Man has joined #osdev
Payam19 has joined #osdev
wereii has joined #osdev
srjek has joined #osdev
Payam19 has quit [Quit: Client closed]
Likorn has joined #osdev
heat has joined #osdev
atrapa has joined #osdev
nomagno has joined #osdev
X-Scale` has joined #osdev
X-Scale has quit [Ping timeout: 256 seconds]
X-Scale` is now known as X-Scale
heat has quit [Ping timeout: 252 seconds]
pretty_dumm_guy has joined #osdev
MiningMarsh has quit [Ping timeout: 240 seconds]
MiningMarsh has joined #osdev
gildasio has quit [Remote host closed the connection]
gildasio has joined #osdev
Jari-- has quit [Ping timeout: 250 seconds]
Likorn has quit [Quit: WeeChat 3.4]
freakazoid343 has joined #osdev
wand has joined #osdev
amazigh has quit [Quit: WeeChat 2.8]
amazigh has joined #osdev
atrapa has quit [Quit: atrapa]
Oli has joined #osdev
vancz_ has quit [Quit: vancz_]
pie__ has quit [Quit: pie__]
Arthuria has joined #osdev
vancz has joined #osdev
pie_ has joined #osdev
vancz has quit [Quit: vancz]
pie_ has quit [Quit: pie_]
pie_ has joined #osdev
vancz has joined #osdev
mahmutov has joined #osdev
bauen1 has joined #osdev
Likorn has joined #osdev
gdd has quit [Ping timeout: 272 seconds]
elastic_dog has quit [Ping timeout: 260 seconds]
gdd has joined #osdev
elastic_dog has joined #osdev
gog has joined #osdev
heat has joined #osdev
Jari-- has joined #osdev
Bonstra has quit [Ping timeout: 252 seconds]
<catern>
are there any operating systems on any architectures that can do a process switch *while the process is stalled mid-instruction waiting on a load from main memory*?
<Bitweasil>
Nothing I'm aware of, it would be false savings.
<Bitweasil>
That's what SMT/Hyperthreading does, though.
<Bitweasil>
While one thread is stalled on a main memory load, it will use the execution units to go make progress on the other thread.
<Bitweasil>
(among other things it can do)
<Bitweasil>
If you're stalled on RAM, blowing your cache out with a kernel trip is the wrong thing to do.
gdd has quit [Ping timeout: 260 seconds]
<catern>
indeed indeed, I partially ask this from inspiration by hyperthreading
<Bitweasil>
A quick trip to the kernel is a few thousand cycles, more if you have to do things like schedule another task.
gdd has joined #osdev
<catern>
yes, probably any OS/arch which did this would need to have an extremely fast transition to the kernel and scheduler
<geist>
yah i think SMT is the closest to what you're asking for
<Bitweasil>
And it's almost certain that after some period of operation, the kernel entry/scheduler/etc will not be in L2 cache.
<catern>
you already want to schedule processes that mostly share their cache on pair hyperthreads though :) so this would just be an extension of that desire in the scheduler
<Bitweasil>
So now you're right back to waiting on memory.
<catern>
geist: oh for sure, I just am curious if anyone ever, basically, did "SMT in software"
<Bitweasil>
Or you want to disable hyperthreads because of all the impossible-to-plug leaks between them. ;)
<Bitweasil>
I've stopped buying HT processors, I just disable them anyway.
<geist>
yah i dont think so, since there's not a good way for software to work between the instructions
<geist>
unless it's, say, some sort of microcoded thing
<catern>
yeah you'd need an extremely fancy architecture
<heat>
is SMT done in microcode?
<Bitweasil>
And an i5 without HT is usually a lot cheaper than an i7 with, for ~the same single core, and disturbingly close to the same on multithreaded loads, unless you're heavily stalled.
<geist>
ie, the microcode decides that its going to block so it switches tasks
<Bitweasil>
It's the scheduler.
<Bitweasil>
You just have two threads (or more) feeding into a common set of execution units.
pie_ has quit [Quit: pie_]
vancz has quit [Quit: vancz]
<Bitweasil>
It's not really "switching between threads" so much as "Just running both threads at the same time, and if one stalls, the other will have more EUs to execute on for a while."
<geist>
but there are cpus that have existed where the user facing 'cpu' was just microcode running an emulator
<Bitweasil>
True, that... uh...
<Bitweasil>
Linus was involved in it.
<Bitweasil>
Laptop CPU from the early 2000s?
<Bitweasil>
Transmeta?
<geist>
and the microcode itself can context switch. the Xerox Alto is the usual early example
<geist>
and then things like transmeta or i think nvidia Denver
<Bitweasil>
I've *still* never figured out what Denver was/is.
<Bitweasil>
It sounded like "Bolt a couple beefy ARM cores to the GPU."
<Bitweasil>
But everyone seems to think it's something slightly different.
<geist>
same. i *think* some transmeta folks ended up there and convinced nvidia to have another go at it
<Bitweasil>
I don't think it's ARM emulated on GPU, though.
<geist>
the soc maybe, but the Denver core is its own thing
<geist>
but it's just got some microcode going on
<Bitweasil>
For a bunch of compute tasks that are GPU heavy, nVidia just needs to bolt some ARM cores and high bandwidth networking on the GPU, presto, standalone node.
<Bitweasil>
Netboot it.
<geist>
sure
<Bitweasil>
Oh, I didn't realize Denver was specifically a core.
<geist>
side note, they just announced their Grace ARM cpu
<geist>
goes with their new Hopper GPU arch
<geist>
cute!
<Bitweasil>
I saw something like that, hadn't ... OH!
<Bitweasil>
lol.
<Bitweasil>
Hadn't paid much attention.
<Bitweasil>
More interested in the new Rockchip SoC lately.
<geist>
oh? which one?
<Bitweasil>
RK3588. Should be good!
<geist>
oh neat
<Bitweasil>
Quad A76, quad A55, up to 32GB RAM, NVMe, couple GPU ports, and a HDMI in.
<catern>
geist: so interesting point about the microcode context switching. I guess it would be possible in theory for such a system to context switch on memory stall
<catern>
the question is whether any of them ever actually did...
<Bitweasil>
catern, again, SMT does that in practice, very well.
<Bitweasil>
Spending thousands of cycles going to the OS instead of "Oh, hey, I'll just run from this queue over here..." doesn't seem to make any sense.
<Bitweasil>
Memory stalls aren't *that* slow.
<Bitweasil>
And a kernel trip requires a ton of memory accesses that aren't likely to be cached.
<Bitweasil>
Unless you start locking OS stuff into cache.
<catern>
Bitweasil: well, alternative idea: maybe if there was a way to dynamically scale up SMT, haha - ask the core to give you another 10 register sets and start SMT-ing between them
<Bitweasil>
Now you've just eliminated a bunch of cache that could have prevented the stall in the first place.
<Bitweasil>
There are some SMT4 and I think a few SMT8 cores.
<Bitweasil>
At some point, the gains stop happening for almost all workloads, and, again, you're splitting cache.
<Bitweasil>
SMT2, generally, you've got about half the L2 per thread. SMT4, 1/4. So you hit diminishing returns in a hurry.
<Bitweasil>
Meanwhile, Apple bolted insane giant gobs of L1 on their cores instead, and seems to be doing a solid job with it.
<Bitweasil>
It's... 192kb L1I and 128kb L1D *per* core? Or maybe flip those.
<Bitweasil>
(on the M1)
<geist>
yah the SMT4 in the cavium i can say is clearly diminishing returns
<Bitweasil>
Plus gigantic L2s, and a huge L3 system cache, on top of fantastic latency to DRAM.
<geist>
i can configure it to SMT2 in firmware and really it performs just as well
<geist>
but the novelty of SMT4 is too great
<Bitweasil>
There are plenty of workloads in which SMT2 is slower, total system throughput-wise, than no SMT, because the bigger effective cache per core is more useful than filling the bubbles.
<catern>
(i'm not too concerned about losing cache because I was thinking about this in the context of explicitly-software-managed scratchpad memory anyway, where you could just make the tradeoff of "larger scratchpad" vs "more threads" explict to software)
<Bitweasil>
I improved wall clock time on some Java stuff by reducing the total threads involved in stop the world GC, because it was thrashing the cache.
<geist>
and there are a few SMT like microcontroller things. propeller
<geist>
and there was a network processor i remember years ago bumping into that was kinda neat
<geist>
had 8 threads, and you coud bind IRQs to threads if you wanted, so it had real time guarantees of task switching
<Bitweasil>
I think a Defcon badge from years back was a propeller?
<geist>
yah somewhere i have a propeller dev board that i never futzed with
<Bitweasil>
High end NICs seem like something that could make use of that too, there's not a lot of processing, but when you get into the virtual NIC splitting, having "separate cores" would be useful.
<Bitweasil>
Well, a lot of processing, but most of it is hardware accelerated stuff.
<Bitweasil>
Checksum calculations and the like.
MiningMarsh has joined #osdev
<Bitweasil>
I know it gets crazy complex, I've never had network links where it mattered.
<Bitweasil>
I still think gigabit is pretty cool.
pie_ has joined #osdev
vancz has joined #osdev
rustyy has quit [Quit: Lost terminal]
Bonstra has joined #osdev
rustyy has joined #osdev
heat_ has joined #osdev
heat has quit [Read error: Connection reset by peer]
<bslsk05>
catern.com: Your computer is a distributed system
Oli_ has joined #osdev
Oli has quit [Ping timeout: 260 seconds]
<Bitweasil>
> This is especially painful on NUMA architectures, where different memory accesses can have radically different relative costs.
<Bitweasil>
It's not *that* bad between sockets.
<Bitweasil>
Extra couple cycles, typically.
<Bitweasil>
I make my money in the weeds of all that complexity and have no interest in anything much higher level than C, except for my small board ARM hobby of "discovering what doesn't build on AArch64 because it believes that x86 is the only 64-bit processor out there so downloads x86 binaries to run."
<Bitweasil>
And the reality is that a lot of the stuff you're talking about is well hidden, and... you seem to have an exceedingly pessimal view of the latency of cache misses.
<clever>
Bitweasil: i recently found that even a single-socket design can be numa
<bslsk05>
randomascii.wordpress.com: 11 mm in 1.25 nanoseconds | Random ASCII – tech blog of Bruce Dawson
<clever>
Bitweasil: the bus connecting the cores to the L2 cache, has a longer path for some cores
<Bitweasil>
Ok, and at 4GHz, 1.25ns is 5 cycles.
<clever>
the thing i find a bit odd, is that nearly all logic in a cpu is deterministic
<clever>
always triggering on a clock edge
<catern>
Bitweasil: lol I added that line because someone else complained I didn't mention NUMA
<clever>
so its not simply a matter of the wire being longer, and having a speed of light imposed latency
<clever>
somebody had to choose to add flip-flops into the datapath, that would delay it for a clock cycle
<Bitweasil>
I don't recall how many of the late Netburst era pipeline stages were literally just drive stages, pushing the signal on, but it was a non-trivial number.
<catern>
Bitweasil: and was like "well SMP is not NUMA so this doesn't apply!!!"
<Bitweasil>
... I never said that.
<Bitweasil>
I said the latency difference between sockets isn't that significant compared to DRAM latency in general.
<catern>
Bitweasil: no no I know I'm just saying why I added it
<Bitweasil>
And SMP is *typically* single socket, per-core L2 vs shared global L2.
<catern>
nothing about you
<Bitweasil>
Ok.
<catern>
(really I'm just thinking aloud about how to remove it or edit it...)
<Bitweasil>
L3 will see that sort of behavior, depending on how many ring stops away it is.
<catern>
I think I'll just remove the mention of NUMA
<catern>
still! if that's the only complaint of someone as knowledgeable as you, I think I'm pretty set
<Bitweasil>
Oh, I think you're whining about problems that mostly don't meaningfully exist without offering solutions, but it's not *wrong.*
<Bitweasil>
I just don't play in those spaces where it matters.
<catern>
that's good enough for me
the_lanetly_052_ has joined #osdev
<Bitweasil>
You just need to spend some time doing latency tests and understanding how things relate to each other.
<Bitweasil>
The Anandtech review articles are a good start for memory vs cache latency.
<Bitweasil>
Log scale charts, usually, but you're on the order of ~1ns for cache, ~100ns for DRAM.
<Bitweasil>
That's only 400 cycles, and a syscall/sysret pair isn't free. Plus, kernel not being in cache if the thread has been running for a while.
<Bitweasil>
So you're back to waiting on DRAM for the kernel, to save time... waiting on DRAM. You probably could design a chip that it would work on, but I don't think that's the best use of design and transistor resources.
<Bitweasil>
What I would *love* to see, and don't think exists, is when hyperthreading is disabled, the core gets double the L1 cache...
<clever>
and that leads back into a second bug from the page i linked above
<clever>
the cpu had a prefetch opcode, so you could tell the cpu what data your going to want
<clever>
so the latency is hidden, and its already in the L1 when you do need it
<clever>
but, if there is a TLB miss, it just doesnt prefetch
<clever>
so you would randomly have to pay that cache-miss cost
<Bitweasil>
*Proper* processors will let you pin TLB entries, and you can often do something sane with them if you care, but... yeah.
<clever>
the solution in this case, was switching from 4k pages to 64k pages
<Bitweasil>
Of course, Intel prefetch will fail depending on where in the process it loses the rails, so you can probe things with it.
<clever>
so a single TLB entry covered more data
<Bitweasil>
Yeah, large pages are nice.
<clever>
and large consecutive memcpy's would need fewer TLB slots
<Bitweasil>
IIRC Apple uses 16kb pages on most of their ARM stuff.
<j`ey>
YRC
<clever>
another factor ive discovered on the rpi, ive got access to an opcode that can load 4096 bytes in a single shot
<clever>
i suspect thats making use of burst transfers, so while you may have a high latency to start the transfer, you only pay that latency once per 4kb
<clever>
and that lets me nearly saturate the dram, without any cache involvement
Arthuria has quit [Read error: Connection reset by peer]
<Bitweasil>
Where's it load the 4k to?
<clever>
the vector registers
<Bitweasil>
Oh, it's got enough space for that?
<clever>
yeah
Brnocrist has quit [Ping timeout: 256 seconds]
<Bitweasil>
I've not touched the vector stuff yet. Neat!
<clever>
its only on the VPU side of things, the arm cant access it
<Bitweasil>
... which would be why it didn't sound familiar, got it. :D
<clever>
basically, its an uint8_t[64][64]
<Bitweasil>
I was doing the math on the ARM vector registers in my head thinking I was coming up an awful lot short.
<clever>
at the cost of columns, you can join 2 or 4 vectors of 16, to form a 16bit or 32bit field
<clever>
so it can also act as an uint32_t[64][16]
<Bitweasil>
*nods*
<clever>
you then specify an xy coord, a bit width (8/16/32) and a direction (horizontal or vertical) to make a vector of T[16]
Brnocrist has joined #osdev
<clever>
and you can optionally specify it to repeat an operation a power of 2 times (1 to 64), while incrementing either x or y
<clever>
so the most extreme limit, is copying a uint32_t[1024] into/out of the vector core
<catern>
(that's almost big enough to be scratchpad memory)
<clever>
but i dont think you can do any scalar access into the vector file
<clever>
the only vector->register options you even have, is to store the sum of a vector into a scalar reg
<clever>
otherwise, its almost entirely vector<->ram or scalar->vector
<clever>
or vector<->vector of course
Ali_A has joined #osdev
atrapa has joined #osdev
Arthuria has joined #osdev
k8yun has joined #osdev
nyah has joined #osdev
dude12312414 has joined #osdev
<Bitweasil>
Sounds like a GPU to me!
<clever>
Bitweasil: yep, but its entirely seperate from the 3d core
<Bitweasil>
It's a Raspberry Pi. Of course it makes no sense. :p
<clever>
i suspect this feature pre-dates the 3d core being on the chip
<clever>
the official rpi firmware uses it for complex FFT operations
<clever>
and i just had an idea!, a sampling cpu profiler, for the closed firmware
<clever>
knowing the hot-spots in the code, would point towards what code is doing what tasks
<CompanionCube>
does sound cool
<clever>
basically, there are 4 32bit wide compare registers in the rpi's timer
<clever>
i think the official firmware only uses 1
<clever>
and with the arm having its own generic timer, the other 3 channels are unused
<clever>
the VPU's vector table also allows a unique handler for every irq
<clever>
so i could configure 2 timer interrupts, that fire at whatever delay i want (resolution of 1 uSec), and record the return addr i just interrupted
Ali_A has quit [Quit: Connection closed]
xenos1984 has quit [Read error: Connection reset by peer]
freakazoid343 has quit [Ping timeout: 256 seconds]
GeDaMo has quit [Remote host closed the connection]
Ali_A has joined #osdev
Oli_ has quit [Ping timeout: 272 seconds]
xenos1984 has joined #osdev
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
jimbzy has joined #osdev
Ali_A has quit [Quit: Connection closed]
heat_ has quit [Remote host closed the connection]
heat_ has joined #osdev
asocialblade has quit []
dude12312414 has joined #osdev
heat_ is now known as heat
bliminse has joined #osdev
asocialblade has joined #osdev
atrapa has quit [Quit: atrapa]
immibis has quit [Remote host closed the connection]
immibis has joined #osdev
Burgundy has quit [Ping timeout: 272 seconds]
asocialblade has quit []
<Clockface>
when a signal handler is invoked in linux, where does it store where to return to when its done signal handling?
<zid>
the stack
<Clockface>
nice
<Clockface>
ty
<Clockface>
so ill find it at the top of the stack?
<zid>
don't forget to add 128 to rsp on x86_64
<Clockface>
why?
<zid>
redzone
<Clockface>
thank you
Oli has joined #osdev
<heat>
Clockface, linux (and others) store the whole state in the stack
<heat>
the siginfo_t, the mcontext_t, etc
pretty_d1 has joined #osdev
pretty_d1 has quit [Client Quit]
<heat>
you can actually change the return state by changing those and sigreturning
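[A userspace illustration of what heat means, assuming Linux/glibc on x86-64 for REG_RIP (printing from a handler is only safe here because the signal is raised synchronously): the third handler argument is the very frame the kernel pushed on the stack, and sigreturn(2) later restores register state from it, so writes to it change where the thread resumes.]

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <ucontext.h>

    static void handler(int sig, siginfo_t *info, void *ctx)
    {
        ucontext_t *uc = ctx;       /* the kernel-pushed signal frame */
        (void)sig; (void)info;
    #ifdef REG_RIP                  /* x86-64 glibc */
        /* Reading is shown here; storing to this slot before returning
         * would make sigreturn resume the thread at the new address. */
        printf("interrupted at rip=%#llx\n",
               (unsigned long long)uc->uc_mcontext.gregs[REG_RIP]);
    #endif
    }

    int main(void)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, NULL);
        raise(SIGUSR1);
        return 0;
    }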
immibis has quit [Remote host closed the connection]
immibis has joined #osdev
pretty_dumm_guy has quit [Ping timeout: 272 seconds]
<mrvn>
since when do signal handlers return anything? typedef void (*sighandler_t)(int);
<zid>
nobody said they did
C-Man has quit [Ping timeout: 272 seconds]
<mrvn>
oops, read that as "what to return"
<mrvn>
Does the stack have the return address on it when you have a signal stack?
<heat>
yes
<heat>
as I said, not just the return address but the whole state is stored there
<mrvn>
one that isn't in the crt.o of the libc?
<heat>
hm?
<zid>
it's a software exception handler
<zid>
so it does what hardware exception handlers do, but in software
<mrvn>
heat: the return address on the stack will be the libc function that switches the stack back to non-signal context.
<heat>
that's not how signals work
<heat>
the libc doesn't dispatch signals
<heat>
the kernel dispatches signals
* kingoffrance
watches catern slowly get a migraine as "what is an interrupt?" takes hold again
<zid>
does the kernel do the filteirng and stuff too?
<zid>
I never looked at it
<mrvn>
heat: and you are sure that doesn't go through the libc first?
<heat>
mrvn, yes
<heat>
zid, filtering of what?
<zid>
My naive assumption would have just been that posix registers the 'default' handler and does the processing in software
<zid>
heat: which handlers are registered or ignored etc
<heat>
the only fancy thing the libc does is it sets sa_restorer to a nice function that just syscalls sigreturn
<heat>
zid, no, that's all kernel
<zid>
fair enough, I never looked
<mrvn>
zid: mine too. especially when the libc has to posix-ify the kernel interface
<zid>
I guess linux and posix just match in this case
<zid>
on purpose
<zid>
there's nothing saying you couldn't do a non-posix kernel interface and then need a trampoline to make it posixy
<heat>
the kernel keeps two signal masks (blocked and pending) plus signal queues (for rt signals) and essentially an array of struct sigaction
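[The shape heat is listing, as a sketch with invented field names; compare the kernel's struct sigpending and struct sighand_struct for the real thing.]

    #include <stdint.h>

    #define KNSIG 64                   /* kernel sigsets are one 64-bit word,
                                          despite the 1024-bit userspace type */

    struct ksigaction {
        void (*handler)(int);
        uint64_t mask;                 /* extra signals blocked in the handler */
        int flags;                     /* SA_SIGINFO, SA_RESTART, ... */
    };

    struct rt_sigqueue {               /* rt signals queue with payloads;  */
        struct rt_sigqueue *next;      /* classic signals just coalesce    */
        int sig;
        /* siginfo payload elided */
    };

    struct signal_state {
        uint64_t blocked;              /* edited by sigprocmask() */
        uint64_t pending;              /* bit set when a signal arrives */
        struct rt_sigqueue *queue;
        struct ksigaction action[KNSIG];
    };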
<mrvn>
zid: often the kernel interface can do more or less than POSIX or the format changed between kernel versions and the libc unifies that into a stable interface.
<clever>
[root@amd-nixos:~]# grep Sig /proc/self/status
<clever>
SigBlk: 0000000000000000
<clever>
SigIgn: 0000000000000000
<clever>
and this lets you view those states for any thread on the system
<heat>
fun fact: even though userspace sigsets have space for 1024 signals, the kernel just uses 64 bit ints
<heat>
that's why the max rt signal is 64
<catern>
kingoffrance: lol
* mrvn
wonders where the "struct sigaction *oldact" comes from when the previous handler was set through signal() instead of sigaction().
<clever>
mrvn: i believe the kernel has a copy of the current handler for each signal
<clever>
so when you tell the kernel to set, the kernel can just give you the old value
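[clever's answer is visible from userspace with nothing beyond POSIX: a handler installed with signal() reads back through sigaction(), because both edit the same kernel-side table, and a NULL new action queries without changing anything.]

    #include <signal.h>
    #include <stdio.h>

    int main(void)
    {
        signal(SIGINT, SIG_IGN);          /* install via the old interface */

        struct sigaction old;
        sigaction(SIGINT, NULL, &old);    /* query the kernel's copy */
        printf("SIGINT ignored: %s\n",
               old.sa_handler == SIG_IGN ? "yes" : "no");
        return 0;
    }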
<bslsk05>
libera.irclog.whitequark.org: #osdev on 2022-03-29 — irc logs at whitequark.org
<nomagno>
aah
<mrvn>
The Amiga RKRMs are great. Big book with all the interfaces, big book explaining how to use them and the concepts behind them and so on.
<clever>
mrvn: the more i learn about the amiga, the more dos looks like a step backwards
<clever>
but they are also pretty close in age
<mrvn>
clever: dos is a joke
<clever>
exactly
<mrvn>
clever: AmigaOS in 85 was better than window3.1 or even later.
<clever>
and it even had auto-configuration of cards like pci, in the form of the zorro expansion cards
<clever>
back when x86 was still on isa i think
<nomagno>
Can't find it...
<mrvn>
You can take an Amiga 500, plug in a modern IDE drive (if you have a controller) and have it boot from ZFS by providing the filesystem driver in the harddisk.
<nomagno>
It didn't send, great
<clever>
emu68 is using a virtual zorro3 card, to inject a rom into the initialization sequence
<nomagno>
Is it more RISCy to have fixed-size instructions where space might be wasted depending on opcode, or more optimal variable-size instructions with length depending on opcode?
<clever>
mrvn: that then injects sdcard.device into the driver array, so tagged SD card partitions appear as whole drives in amiga
<heat>
i feel like fixed size instructions are more in the spirit of risc
<heat>
as it simplifies decoding
<heat>
vs the crapshoot you have in x86
<mrvn>
heat: x86 is probably the worst there
<clever>
i dont want to even think about how i would do an x86 decoder in verilog
<nomagno>
Issue is these instructions are 5-bytes long
<nomagno>
Which is pretty big for an 8-bit system with 16-bit address space
<mrvn>
m68k has one 16bit opcode followed by optional immediates. None of those prefix bytes where you have to parse each before you even get the opcode.
<nomagno>
clever: U don't cuz it's patented :O
<heat>
no it's not
<nomagno>
Depends on which instruction set you take to be x86
<heat>
32-bit x86 isn't
<mrvn>
nomagno: why would you have 5 bytes?
<heat>
and decoding x86 definitely isn't since emulators exist
<mrvn>
heat: something can be patented and not be inforced
<nomagno>
mrvn: Well operands may be addresses or literals, and instructions may have up to two operands. It's not very RISCy I agree, but that's besides the point
<nomagno>
I explicitly rejected load/store architecture
<mrvn>
nomagno: so you want to make CISC except for the variable opcode length?
<clever>
nomagno: what is a load/store architecture ?
<mrvn>
clever: the part that I would say makes a RISC system
<clever>
as-in, only a load/store opcode can access ram, and mov/alu-stuff cant?
<heat>
if youre skipping load/store that kinda stops being a RISC as instructions get complex as hell
<mrvn>
clever: other than load and store opcodes operate on registers only.
<clever>
ahhh
<heat>
fixed size is also impossible there
<clever>
but half of that, could just be naming
<heat>
because instructions get very big
<clever>
what if i patched the x86 objdump, so that `mov a, [b]` gets disassembled as `ld a, b` ?
<clever>
now its a load/store arch!
<nomagno>
heat: Well yeah I figured. I'll just deal with the 5 bytes per instruction, honestly
<mrvn>
clever: no, because you have add r, #mem.
<nomagno>
It gets too complex to implement otherwise
<nomagno>
Or well, it gets too complex to read the bytecode
<clever>
yeah, so its the alu directly from memory, that makes it not load/store
<mrvn>
nomagno: 5 bytes is a horrible number
<mrvn>
clever: jmp #mem
<clever>
and i do agree that alu from memory, is a bad idea
<nomagno>
mrvn: Because it's big or because it's odd?
<clever>
you have no control over when the fetch happens, and it harms your latency when it cache misses
<mrvn>
clever: usually when you have one component like the ALU with mem access then all parts have mem access
<clever>
with load/store stuff, you can load it ahead of the alu op, and hope out-of-order cores run it faster
<mrvn>
nomagno: the later
<nomagno>
mrvn: there are no real drawbacks to 4 vs 5 aside from aesthetics. No system ever is going to have issues with it
<mrvn>
clever: that hardly matters in modern pipelines.
<clever>
mrvn: what about an opcode like `switch r0`, where r0 is an int, and the opcode is immediately followed by an int16_t[] of pc-relative offsets?
<mrvn>
nomagno: bus size, cache line size, DRAM sizes
<clever>
thats not exactly `jmp #mem`, because the table is at a predictable offset, and likely in your i-cache
<nomagno>
mrvn: it's really just for the byte code. The first byte gets split into two internally very easily
<mrvn>
clever: up to which size?
sonny has joined #osdev
<mrvn>
clever: is the int16_t[] part of the opcode? Or do you do the ARM jump table thing where each entry is a jump instruction on its own?
<clever>
mrvn: ive yet to see an upper limit, but it can only jump up to +64kb forwards, so if the table goes over ~32768 entries, all forward jumps become impossible
<mrvn>
switch r0 == add PC, 4 * r0
<clever>
mrvn: its not really `add PC, 4 * r0`, its more of a `ld pc, pc + (r0 * 2)`
<clever>
hmmm no
<mrvn>
clever: lea
<mrvn>
clever: On ARM it's 32bit per opcode so 4*r0 there.
<clever>
ld t, r0 * 2; t2 = t*2; pc = pc + t2;
<clever>
thats the best way to describe it
<clever>
all opcodes must be a multiple of 16bits on here, so bit0 is assumed to be 0, and the offset table contains bits 16:1
<clever>
but it also has negative offsets
<clever>
so each slot is an offset +/- 64kb, forwards or backwards
<clever>
let me get an example
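[A host-side sketch of the switch opcode as clever describes it, assuming the int16_t table sits immediately after the opcode, offsets are pc-relative and halfword-granular, and r0 has already been bounds-checked (the cmp/branch-if-higher he mentions).]

    #include <stdint.h>
    #include <string.h>

    /* pc points just past the switch opcode, i.e. at the int16_t offset
     * table. Returns the new pc. */
    static const uint8_t *switch_op(const uint8_t *pc, uint32_t r0)
    {
        int16_t slot;
        memcpy(&slot, pc + 2 * r0, sizeof slot);   /* table[r0] */

        /* Each slot stores bits 16:1 of the offset; bit 0 is implied zero
         * since every opcode is a multiple of 16 bits, so doubling gives a
         * byte offset with +/- ~64KB of reach, signed for backward jumps. */
        return pc + (int32_t)slot * 2;
    }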
<mrvn>
clever: you can always design something that breaks any precondition
<nomagno>
I mean, I can't do much about the 5 bytes. I need 4 bits for the complex addressing mask, 4 for the opcode (yes, I have 16 instructions. Surprisingly nice to work with), and 16 for each operand
<mrvn>
nomagno: not every opcode needs the same number of operands or the same addressing modes.
<nomagno>
mrvn: Which is why it's really variable length
<nomagno>
But the implementation cost of that doesn't really outweigh the simplicity of 5 bytes
<nomagno>
I prefer 5 byte chunks to a singly linked list.
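[A sketch of nomagno's fixed 5-byte format as stated (4-bit addressing mask, 4-bit opcode, two 16-bit operands); the nibble order and the little-endian operand byte order are assumptions here.]

    #include <stdint.h>

    struct insn {
        uint8_t  mode;          /* 4-bit complex-addressing mask */
        uint8_t  op;            /* 4-bit opcode: one of 16 instructions */
        uint16_t arg1, arg2;    /* addresses or literals, per the mode bits */
    };

    static struct insn decode(const uint8_t code[5])
    {
        struct insn i;
        i.mode = code[0] >> 4;
        i.op   = code[0] & 0x0f;
        i.arg1 = (uint16_t)(code[1] | (code[2] << 8));
        i.arg2 = (uint16_t)(code[3] | (code[4] << 8));
        /* Fixed size keeps fetch trivial: the next instruction is always
         * at pc + 5, no length decoding needed. */
        return i;
    }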
<clever>
mrvn: the 1e on line 6, is an offset to pc + (0x1e * 2), and with line 6 being the 2nd index (0 based), it will jump to 80001582 if r2==2
<mrvn>
nomagno: why not 4 bytes?
<nomagno>
mrvn: what am I supposed to shave off?
<mrvn>
nomagno: one of the later 4 bytes
<nomagno>
4 bits for complex addressing, 4 bits for opcode, 16 bits for arg1, 16 bits for arg2
<mrvn>
why 16bit?
<mrvn>
you don't have 65536 registers, do you?
<nomagno>
16 bit address space, 8-bit words, no load-store architecture. You just read that though
<clever>
mrvn: in the example i linked, there is a compare against 0xe, and a "branch if higher" right before the switch, so its enforcing an upper limit on the offset-table size
<mrvn>
nomagno: most cpus only allow one memory operand per opcode.
<nomagno>
mrvn: Well, uh... Good for them.
k8yun has quit [Ping timeout: 260 seconds]
asocialblade has joined #osdev
<nomagno>
It's a programmer-oriented machine code, not a compiler oriented machine code
<mrvn>
nomagno: normaly machine code is hardware oriented
<nomagno>
It's P-code
<nomagno>
No need
<heat>
if you want risc and easier programming, just add pseudo instructions
<clever>
mrvn: gist updated to annotate things better
the_lanetly_052 has joined #osdev
<clever>
does it make sense now?
<mrvn>
clever: it made sense an hour ago. Everybody knows what a jump table is
<clever>
but its also pc-relative, and immediately after the opcode
<clever>
so its more predictable than `jmp #r0` from x86
<clever>
and also PIC friendly
the_lanetly_052_ has quit [Ping timeout: 260 seconds]
<mrvn>
clever: ARM uses that in hardware
<nomagno>
PC-relative addressing feels to me like you have to be actually superhuman to be able to do many memory operations without messing up
<mrvn>
nomagno: that's why you have an assembler and not just a hex editor.
<clever>
mrvn: how did arm do this?
<mrvn>
clever: the jump table, or interrupt vector or whatver is an uint32_t[] where each entry is a branch instruction.
<clever>
ah, but thats different, its more setting PC to the addr of that slot
<clever>
and that slot must contain a branch opcode
alpha2023 has quit [Ping timeout: 272 seconds]
<clever>
vs the slot just being an offset relative to the source
<nomagno>
mrvn: 'assembly is 1-to-1 coding of machine code' goes brrr
<mrvn>
clever: just a different encoding
<clever>
it starts to differ, when you introduce the switch.b opcode
<clever>
which is followed by an int8_t[] array of offsets
<clever>
which now allows denser packing, when you only need short offsets
<mrvn>
clever: still not anything conceptually new
<nomagno>
Is the relationship between time and memory price more pronounced than Moore's law?
<heat>
assembly is not 1-to-1 coding of machine code
<clever>
ldr could be either a mov or a ld i think?
<clever>
depending on the size of the immediate
* mrvn
wonders how "#include" translates into machine code
<nomagno>
heat: Well did I make a programming language and not a virtual machine then?
<mrvn>
nomagno: you made whatever you made. C++ has a virtual machine too.
k8yun_ has quit [Quit: Leaving]
<nomagno>
Then that statement was kinda void. Assembly that is 1-to-1 coding still needs to be assembled
alpha2023 has joined #osdev
<mrvn>
nomagno: Assembly that is 1-to-1 coding is a pretty stupid assembly and assembler.
<Griwes>
nonono, C++ has an *abstract* machine :P
<mrvn>
Griwes: that too
<mrvn>
nomagno: last time I had a 1-to-1 coding was the machine code editor on my C64.
<mrvn>
I wouldn't even call it an assembler.
C-Man has joined #osdev
onering has quit [Ping timeout: 250 seconds]
Beato has joined #osdev
<klange>
Hot take: In the academic sense, assembly languages are not programming languages. Sure, they're "languages" that you can "program" in, but they aren't "PLs".
<moon-child>
why?
<moon-child>
they have well-defined semantics, and formal models have been constructed of them
<moon-child>
probably _more_ formal models are made of asm (or, machine code) than of any other sort of language, by cpu vendors
<Bitweasil>
I would agree with klange here. Assembly, in almost all cases, is just mnemonics for the machine opcodes, with a few nice bits added (macros).
<Bitweasil>
I would generally define a programming language as something more abstracted and human-focused that can be converted to something the machine can execute.
<moon-child>
per church-turing hypothesis, anything can be converted to something the machine can execute
<sonny>
assembly is just a low level programming language
<moon-child>
'abstracted and human-focused' is not an interesting definition in an 'academic' sense
Likorn has quit [Quit: WeeChat 3.4]
<Bitweasil>
Yeah, good thing I'm not in academia. ;)
<Bitweasil>
My desire for a PhD has dropped off very rapidly after my Masters.
<klange>
Academia is full of squares. That's why the hats are shaped the way they are.
<graphitemaster>
Everyone knows the only language that actually classifies as a programming language is HTML
<klange>
/kickban graphitemaster
<heat>
html bad
<heat>
css bad
<heat>
vote for generating pages using javascript
<klange>
generate your pages with kuroko
<heat>
kuroko DOM when?
<klange>
when wasm has dom stuff
<klange>
quite a bit of kuroko-lang.github.io is written in kuroko running in your browser
Likorn has joined #osdev
<klange>
especially the IDE, that's almost entirely kuroko with just a few little JS bridges to do DOM stuff
vdamewood has quit [Read error: Connection reset by peer]
vdamewood has joined #osdev
Arthuria has quit [Read error: Connection reset by peer]
Oli has quit [Ping timeout: 272 seconds]
HardWall has quit [Read error: Connection reset by peer]