<immibis>
takes a moment to compile and disassemble
<immibis>
for some reason 10000 causes the loop to unroll by 5000, but if you put in a prime number, it defaults to unrolling by 8 with some extra iterations at the beginning
<immibis>
if you put in 9999 it unrolls by 3333
<doug16k>
seems like gcc is making it possible for the branch predictor to select a history where a mispredict selects branch history values that are correct due to correlation
<doug16k>
gcc throws more predictor at it
<doug16k>
but yeah, with a perfectly balanced tree, clang would win
<doug16k>
it has been a while since I used clang a lot. maybe gcc isn't always faster anymore
<vai>
hi
<doug16k>
I have seen so many times where the clang codegen looked really good, like what a human would write, and gcc generates weird code that looks bad but runs fast
<bslsk05>
reviews.llvm.org: ⚙ D112921 [clang] Enable sized deallocation by default in C++14 onwards
<vai>
Hey guys, I have an idea for you to get funding for your OS: make it Artificial Intelligence, hellish slow in all modules... like the file system would need to look up 10 gigs worth of tables just to find a suitable location for a file
<vai>
:)
<vai>
poor monkey brains
<vai>
RE in 20 years, everyone is doing 20 gigs worth of table look-ups routinely
<moon-child>
doug16k: yeah, ime they are fairly matched; sometimes one will win, sometimes the other will win
<moon-child>
I find gcc has better diagnostics, standards support, and compile time for c, though, so I usually stick with that
<doug16k>
it's a luxury to have two completely different compilers that are so compatible you can almost pretend clang is gcc
<moon-child>
indeed
<moon-child>
what else is in that position? Browsers I guess
<moon-child>
common lisp sorta is, but there's really only one good-quality compiler
<moon-child>
(good-quality = good code quality; wrt behaviour all are fine)
<doug16k>
yeah, the analogy is almost exact with chrome and FF
<doug16k>
V8 and spidermonkey
<doug16k>
all really good, varying winner
<doug16k>
you can have a hundred objects flying around, doing collision detection and moving elements every frame at 165Hz. way way overkill power for UI scripting
<doug16k>
if you do a lot of js and no native for a while, you become accustomed to it, and it feels instant, you have essentially infinite cpu as long as you use decent algorithms for large item counts
<doug16k>
then you go back to C++ and coffee spray how fast it is
<doug16k>
where you are like, did it really measure it... yep
<doug16k>
cpus are utterly overkill now, come on
<doug16k>
we are seriously reaching a point where I wonder if bloat will keep up, or we will plateau because it becomes too hard to bloat it enough
<moon-child>
ha
<moon-child>
I mean, gpus are still wayy faster than cpus
<doug16k>
how long until unicode has a codepoint which requires the system to show a video of a moon launch for the glyph
<doug16k>
then we will extend utf8 so you can put more bits in, and add the URP to unicode, the utterly ridiculous plane, values > 0x101000
<doug16k>
sure, gpu is like a dragster. it goes straight with ease, and is awful at turns. cpus can suddenly change direction rapidly, but they are not as good on the straightaway
<doug16k>
if you only want to do one, gpus are terrible. cpu can do one in no time. gpu can do millions in "no time", cpu would be forever
<doug16k>
it
<clever>
i saw a video recently, talking about how a dragster engine is pushed so hard, that it will basically self-destruct if they drive for even 1 second longer than normal
<doug16k>
oops, it's perfect as an accelerator. it is unbelievably good at certain things
<doug16k>
yeah, it wears out in 1-3 runs, depending how hard you push it, then needs rebuild
<doug16k>
rebuild every run would need massive funding
<clever>
the video i saw, claimed a rebuild every single run
<doug16k>
those superchargers put massive load on the engine, so the power you see at the back wheels is actually the engine minus the extremely large supercharger drag
<bslsk05>
'Why Dragster Engines Only Last 4 seconds' by Donut Media (00:15:58)
<doug16k>
so it's just utterly generalized? every team burns it in one run. ok
<clever>
its been a few weeks since i watched it, i may be generalizing it too much
<doug16k>
you are talking about top competitors
<doug16k>
yeah, they are lucky it doesn't split the engine in two, they push it so hard
<clever>
that would explain it
<doug16k>
"hydraulic-ing" the engine they call it
<doug16k>
the crank and bearings and heads are mounted so well, it will just tear a rip through the steel
<clever>
and there was something about them rebuilding the engine in under an hour, so they can do several races in a day
<doug16k>
yeah they are amazing
<doug16k>
so many steps can be screwed up and not know
<doug16k>
so they need to be making very few mistakes
<clever>
doug16k: theres also a freelance trucker (think uber but for freight) that i watch on youtube, and a few months ago he did an engine rebuild, and one crazy fact i noticed, is that the main body of the engine, and the crank shaft, never got removed
<geist>
neat, the 4 state fpu state in riscv is kinda handy
<clever>
he basically stripped the entire engine down, and replaced all of the wear surfaces, without removing the main engine block or crank shaft
<geist>
though a little unweildy to switch states, since it's 2 bits in a status register you gotta fiddle with
<doug16k>
like 4 banks of registers and you pick active set?
<geist>
no. 4 states that you control: off, initial, clean, dirty
<doug16k>
ah for eliding
<geist>
off causes the cpu to fault, but the other ones have a simple state diagram that's kinda helpful
<doug16k>
do you usually do lazy fpu on that?
<geist>
especially clean vs dirty. clean lets you load a thread's state, and then will remain in that state unless the code *writes* to a register
<geist>
then it goes to dirty
<doug16k>
ah it affects what happens when you restore and save state
<geist>
in this case i'm not yet doing full lazy, but you can do a half lazy by not bothering to save the state if it hadn't moved from clean -> dirty in the last context switch
<geist>
initial is less useful, but it lets you track if you have fully zeroed out registers or not. it doesn't otherwise do anything different than clean
<doug16k>
yeah, initial can instantly restore state without reading memory, clean and initial can not bother saving state
<doug16k>
it knows it would just be writing back values already there in the save area
<geist>
yah, it's pretty cute. to make it easy to test it has three sets of these registers (float, vector, and 'extension'), and there's another bit that's hard wired to be if any of the 3 2 bit states are in the dirty state
<geist>
so they added a one bit test to let you see if you need to do more work. kinda nice
<doug16k>
xsave tries to do that for you transparently if you save back to the same place you restored and other crap
<doug16k>
xsaveopt*
<doug16k>
on x86
<geist>
yah, this is basically the sort of stuff that xsave does behind the scenes
<doug16k>
I do the "off" state by setting the TS flag in my kernel (task switched flag traps fpu use)
<doug16k>
the lazy fpu thing, I make like it is lazily another thread's state, but really I am just blocking fpu in that thread
<doug16k>
I know intel leaks, it's meant for bug catching primarily
<geist>
yeah
<doug16k>
is that 2 bit thing part of the general registers context?
<doug16k>
or do you have to specially do something with it?
<doug16k>
ah probably not now that I think about it. there aren't even flags
<geist>
you specifically have to do something with it
<geist>
it's part of the sstatus register (for supervisor mode)
<geist>
bits 13 and 14
<geist>
sstatus is like the cr0 of riscv. all of the control bits are there, though not available to user space
<doug16k>
ah, so it's a hybrid of the kernel assistance deciding whether to restore fpu state or save the fpu state, and it deciding whether to trap fpu
<doug16k>
setting it to initial instantly zeros the fpu?
<doug16k>
or does initial mean trap and handler goes, "oh initial" and zeroes it, and returns?
<doug16k>
fault would be the word
<doug16k>
so really only dirty and clean don't trap, off and initial traps? handler assist sets initial state?
<doug16k>
seems more risc to fault on initial
<geist>
yah initial does not zero things, nor does setting it to off
<doug16k>
almost sounds like the hardware ignores the low bit - that's the kernel's hint whether to load and save to actual memory, and the cpu only looks at the upper bit really?
<geist>
initial is more of a helper, so you can track whether or not you ever loaded anything into it
<geist>
i was thinking that, but really it doesn't seem to quite be that laid out bitwise. really the hardware cares about 00 (off) and software should really care about 11
<geist>
all other transitions 01, 10 go to 11 when you write to fpu
<geist>
so functionally there are 3 hardware states, but the fourth one (initial vs clean) is for your own use
<doug16k>
ah I thought it was off initial clean dirty. yeah I see. ok xor the low bit with high bit for whether you access memory :P
<doug16k>
on save you skip saving the fpu registers if not dirty, right? and on restore, you skip restoring the state if off or initial, otherwise you do restore from the save area, and it doesn't even make sense to resume the thread with dirty, it's guaranteed clean after you read the save area
<mrvn>
immibis: That "#pragma clang loop unroll(full)" basically changes the compiler options. Kind of cheating.
<geist>
oh damn, my whole house of cards is falling down
<geist>
the riscv toolchain is extremely strict about mixing float and no float objects
<geist>
much more so than any other arch i know about. goddamn.
<geist>
i can't simply set an march= without float because then that implies a mabi without floats and then the linker wont link because the abi is different
<geist>
and you can't force one or the other
* kof[away] use the source geist, the source...
<kof[away]>
isnt there a way to cheat?
<geist>
well i'm sure i can force it via objcopy or something, but i'd rather play by the books
<mrvn>
does that affect -mregular-regs-only?
<geist>
not recognized
<geist>
-mgeneral-regs-only, etc. i couldn't find any switches on the riscv toolchain for that sort of thing
<mrvn>
sounds like they want you to use the fpu in kernel too
<geist>
in general the riscv toolchains rely on march and mabi to specify pretty much everything
<geist>
or not at all. it's the intermixing that's an issue
<geist>
oh whatever i'll fiddle with it later
<geist>
for the moment i'll just leave it full float or not. in the case of riscv the compiler is not throwing in float instructions where it isn't asked, so it seems to be well behaved
<geist>
(ie, no vector fpu yet so not really any gains using it)
<doug16k>
are there callee saved fpu registers?
<doug16k>
I wonder if there are scenarios where it saves them, but it can't just do it because fpu might be off
<doug16k>
longjmp or something
<moon-child>
'All x87 registers are caller-saved'
<doug16k>
so you can use emms to rapidly wipe MMX out of FPU so caller can do floating point
<doug16k>
it breaks the register stack
<doug16k>
mmx does
<doug16k>
and for sse, can just vzeroupper and not care
<doug16k>
sometimes an ABI will allow you to wreck the upper part but low easy part is callee saved
<doug16k>
since MMX was already wiping the FPU on call, it made sense to just leave it
<doug16k>
windows added callee saved sse regs though
<doug16k>
it's too bad amd dropped 3dnow. we might have mmx and sse at once optimizations by now and gain registers. oh well
<moon-child>
idk 3dnow; how was it?
<doug16k>
it's was just mmx for single precision float. pair of floats
<doug16k>
it was perfect
<doug16k>
could do 3d game math really well
<moon-child>
my opinion, as far as vectors go, is: the bigger, the better
<doug16k>
it's quite a bit easier when it's just one or two tricks for dealing with things being in high or low half, instead of 4 or something
<doug16k>
you hardly shuffled
<clever>
moon-child: one issue ive seen with a lot of vector stuff, is that when the cpu supports bigger vectors, you have to re-write (or re-compile) your program, to take advantage of the wider vector
<clever>
moon-child: but there is some new armv8 stuff that i think tries to solve that, where you instead have a virtual vector size, and the hw will repeat for you
<moon-child>
well, sure, obviously the _ideal_ is infinite ilp and then you don't need vectors
<clever>
better hw just has to repeat the job less often
<moon-child>
but the whole point of vectors is to eliminate dispatch. The bigger the vectors are, the more dispatch you can eliminate
<moon-child>
variable-size vectors are good, not fully general but good where they work
<doug16k>
using vectors doesn't make your code simd
<clever>
the rpi's VPU also has a repeat tag on all vector opcodes, so while its fixed at 16 lanes, you can repeat an operation between 1 and 64 times, while incrementing the row/col
<doug16k>
real simd code can handle whatever width vector you have, because you never shuffle across lanes, every lane is a "thread" that is utterly independent
<moon-child>
sure
<clever>
i ran into that exact kind of problem when trying to vectorize some x265 stuff
<clever>
the inner loops had wildly different sizes, and access orders, and just didnt fit
<clever>
but, the outer loop ran 32 times, on 32 completely independant datasets
<doug16k>
exactly
<clever>
so rather than try to vectorize the inner loop, i vectorized the outer loop
<moon-child>
could you transpose and gather?
<moon-child>
yeah exactly
<clever>
each step was a single vector opcode, that did a single step of the loop, 16 times in parallel
<clever>
then i just unrolled every inner loop, and manually did all register allocations
<clever>
and ran right into total data corruption, because i thought it could do 32bit*32bit->64bit mults
<clever>
but instead, it only has 16bit*16bit->32bit mults
<doug16k>
so you just did 4?
<doug16k>
lo*lo + (lo*hi)<<x + (hi*lo)<<x + (hi*hi)<<(2*x) I think
<clever>
i didnt know what the cause was at the time, and gave up on that x265 project
<clever>
i did need 32bit mults again later, for implementing vectorized soft-float
<doug16k>
+ are with carry across the upper pieces
<clever>
but all of those bit-shifts, and a 64bit add, was a nightmare to implement
<clever>
and the clock cycle count was beginning to exceed 16 scalar float operations
<doug16k>
yeah that part where <<x means you are spanning two 32 bit things? yeah you add it << 16 to low part and add it >> 16 to high part
<clever>
doug16k: yep, but that now means doing 3 adds before even bringing carry into the mix
<clever>
and 2 shifts
<doug16k>
but you are doing so many
<doug16k>
integer adds are mind blowing speed though
<clever>
i checked the math on it already, and its already nearing 16 times the cost of a scalar float mult
<clever>
if i go over that limit, then regular hw scalar float ops would be faster than vectorized soft-float
<clever>
fixed-point math would be far far simpler
<clever>
though, i would still have some limitations, but they could be checked at compile time i think
<doug16k>
what's the latency of the multiply? you might have a window of might as well add
<doug16k>
or shift or whatever
<doug16k>
would mitigate the cost some
<clever>
at least for back2back mults, its 2 clock cycles per batch of 16
<clever>
ive not tried mixing mult+add yet, to see if it behaves differently
<mrvn>
it would be quite odd if hw float ops were slower than software floats.
<clever>
but mult+accumulate i think is still 2 cycles
<mrvn>
not fmadd opcode?
<clever>
mrvn: i was comparing a scalar hw float, to a vectorized (16 lane) software float
<clever>
thinking the vector mode may make it faster, but its coming in to be slightly slower
<clever>
i dont think there is a fused mult+add in the VPU
<clever>
but the QPU's dual ALU nature lets you manually pipeline it
<clever>
the QPU is a VLIW, where you specify independent mult and add operations, so `a=b*c; d=e+f;` is a single opcode
<mrvn>
For adding the intermediates you basically want 2 carry bits
<clever>
and then the manual pipelining, is just doing `a=b*c; anything;` on the first cycle, then `anything ; d=a+e;` on the second cycle
<clever>
and you can then overlap that with itself, to do multiple of them back2back
<mrvn>
is there a 32bit * 32bit = 32bit and 32bit * 32bit >> 32 = 32bit?
<clever>
on the VPU, no
<mrvn>
So basically you have to do everything in blocks of 16bit
<clever>
there is 16bit*16bit -> 16bit, (16bit*16bit)>>16 -> 16bit, and 16bit*16bit -> 32bit
<mrvn>
16bit + 16bit >> 16 -> 16bit?
<clever>
if you only care about the upper 16bits of the 32bit product
<clever>
uint16_t highside_mult(uint16_t a, uint16_t b) { return (a*b) >> 16; } i believe
<clever>
maybe throw in a cast or 2, so the intermediates dont clip
<mrvn>
Does 16bit*16bit -> 16bit, (16bit*16bit)>>16 -> 16bit get folded into a single multiplication internally if you have one after the other?
<clever>
i think internally, its a 32bit product, and then automatic shifting, to crop it down
<mrvn>
clever: (a*b) is UB. :) You have to cast to uint32_t.
<clever>
uint16_t highside_mult(uint16_t a, uint16_t b) { return (((uint32_t)a*(uint32_t)b) >> 16) & 0xffff; } then?
<clever>
uint16_t lowside_mult(uint16_t a, uint16_t b) { return (((uint32_t)a*(uint32_t)b)) & 0xffff; }
<mrvn>
I still don't understand why uint16_t * uint16_t casts to int instead of unsigned int. Would be harder to spell in the standard but at least give reasonable results.
<clever>
uint32_t mult(uint16_t a, uint16_t b) { return (((uint32_t)a*(uint32_t)b)); }
<clever>
mrvn: thats because i omitted a minor detail, some flags
<doug16k>
because uint16_t fits in int so it promotes to int
<mrvn>
What I wonder is this: Should I shuffle the 16bit vector into 2 and do two 16*16=32 mults or do 16*16=16 and 16*16>>16=16
<bslsk05>
github.com: VideoCore IV Programmers Manual · hermanhermitage/videocoreiv Wiki · GitHub
<clever>
and give concrete examples of the input and output
<doug16k>
I also wish unsigned anything promoted to unsigned whatever it is, but I am at peace with that not being the case
<clever>
Rev. Subtract with saturation and carry R_d[i] = R_b[i] - R_a[i] - C[i]
<mrvn>
x86_64 has this nice adox/adoc pair for doing long multiplications. Gives you two carry chains so you can do res = res + a * b + c
<clever>
oh, reverse subtraction, thats another weird one
<clever>
why not just swap a/b?
<clever>
maybe there is some limitation in the opcode encoding, that i'm not aware of
<mrvn>
clever: you have vector - reg/imm but sometimes you want reg/imm - vector
<clever>
yeah
<clever>
and the question is, can both be encoded? or is R_a always a vector?, and you need a reverse mode to swap them
<mrvn>
doug16k: that makes no sense
<mrvn>
clever: pretty certain the later
<clever>
yeah, that would explain why they even have a reverse sub
<mrvn>
clever: does div have the same? is there even a div?
<clever>
no vector div
<clever>
only scalar div, int/float
<mrvn>
vector inverse?
<clever>
dont see that either
<clever>
the vector core has mov, shift, rotate, and, or, xor, pop-count, add, sub, mult, and some that are harder to describe
<clever>
"make mask", "pluck elements at {even,odd} index", "interleave", most sig bit, distance, sign, signed clamp
<mrvn>
doug16k: I just whish the integer and arithmetic promotions where such that no operation is UB.
<clever>
uint16_t a,b; uint32_t c; c = a*b;, its obvious that the product will be 32bits, just extend it
<mrvn>
clever: does 16x16=32 have a mask to pluck the even/odd for each input?
<GeDaMo>
Is this ARM?
<clever>
GeDaMo: the VPU on the rpi, seperate from the arm core
<mrvn>
clever: harder to do in compound expressions
<GeDaMo>
Ah
<clever>
mrvn: yeah, then you would risk something like a*b*c*d turning into a potential 128bit int
<clever>
and at some point, the compiler just has to throw UB at you
<clever>
or do everything in bigint
<mrvn>
clever: no, just mod 2^n
<doug16k>
all you have to do is promote small things to unsigned, but yeah that sucks in C++ where you don't even know what it is
<clever>
which reminds me, haskell has types to deal with this
<clever>
Int in haskell, spans the range of [-2^29 .. 2^29-1]
<mrvn>
doug16k: can't promote a signed to unsigned
<clever>
but Integer in haskell, is just a bigint
<clever>
and will never clip or overflow
<doug16k>
I mean small unsigned like your uint16 case
<mrvn>
it could be nice to promote everything to the range of the result
<mrvn>
like clever said: uint16_t a,b; uint32_t c; c = a*b; ==> c = (uint32_t)a * b;
<mrvn>
except with range and only convert to unsigned after.
<doug16k>
it would be nice if we were silicon based life. we would be almost indestructible
<clever>
and like mrvn said, what happens if you do `(a*b)+c` with 16bit vals, that can potentially overflow a 32bit int, do you just auto-expand to 64bit, and then clip when you discover the dest is only 32bits?
<clever>
throw a warning?
<mrvn>
clever: if you do uint64_t d = ... you expand to 64bit.
<clever>
yep
<clever>
that does sound sane
<mrvn>
int d = ... expand to 32bit and overflow
<clever>
i would want the compiler to track the potential range of every variable, and throw warnings if they can overflow
<clever>
and auto-expand as needed
<mrvn>
every a*b would warn
<clever>
> (0xffff * 0xffff).toString(16)
<clever>
'fffe0001'
<clever>
based on this, for mults, you can basically just add the bit size of each input
<clever>
and the compiler could track that a is only ever a 5bit int, or a 11bit int, from previous math done
<clever>
so it may not be a full 16+16->32
<mrvn>
that only works with whole program optimization
<mrvn>
and gcc/clang are quite bad at it even just for multiples of 16/32.
<clever>
i dont see why you cant also do it for functions, just assume any arg from "unknown" code is the full width
<doug16k>
you can have better than that. you can built with -fsanitize=undefined and it will actually check every time
<mrvn>
(uint64_t)a * b does a full 64bit * 64bit multiplication.
<clever>
if i take an `uint16_t a` as an input, assume it can span the entire 0 to 0xffff when computing potential outputs
<clever>
but if i do `a >> 4`, then that can only ever reach `0xfff` as a max
<mrvn>
clever: int foo(int x); how many bits does the result have? Any function call you would get full bits making it basically useless.
<clever>
and then the `(a>>4) * b` wont be a 32bit product, but just a 28bit product
<clever>
compound expressions
<clever>
if i'm mixing a dozen vars in a bit math expression, it can track things within that, expanding up&down, potentially
<clever>
and warn if it ever exceeds the final return type of the func, or a var i store it in
<mrvn>
still mostly useless. You would want that to be included in the function signature.
<clever>
yeah, thats where whole-program becomes needed
<mrvn>
int:12 foo(int:16);
<clever>
or just let me have my uint48_t :P
<mrvn>
int:('a-4) foo(int:'a);
<clever>
store it in a 64bit int maybe, but treat it as a 48bit int for all bounds checking
<mrvn>
struct { int64_t x : 48; }
<mrvn>
I also want struct { Foo *bla : 56; }
<mrvn>
or even more useful struct { Foo *bla : 56:2; }
<mrvn>
lower bit of Foo pointers must be 0
<clever>
another thing that i think i'll have trouble explaining to gcc, is the row/column modes
<clever>
basically, you can have an uint16_t matrix[16][16], and then your vector can be made up of either matrix[constant][i], or matrix[i][constant]
<doug16k>
I made a macro generator that reads bitfield definitions from a DSL in my build to use bitfields safely
<doug16k>
you are right, the built in bitfields are bad
<clever>
doug16k: the official rpi headers have macro's for doing exactly that
<clever>
for a given REGISTER, you then have FIELDS within it, you then have a REGISTER_FIELD_LSB constant, that tells you the LSB of the field
<clever>
along with a _MSB, _SET, and _CLR,
<dzwdz>
hi again o/
<doug16k>
yeah that plus a bunch of other like, a mask with that field at bit 0, and a mask with it in place
<doug16k>
its width, lots of things
<mrvn>
doug16k: bits can be must-be-zero, must-be-one, read-as-zero, read-as-one, must-write-zero, must-write-one
<clever>
_SET is 0xc0000000, _CLR is 0x3fffffff, _LSB=30, _MSB=31, and there is the syntax-error of a sibling, _BITS = 31:30
<GeDaMo>
Hi dzwdz :)
<clever>
i'm not sure how you would really use the _BITS constant
<clever>
for most things, you might do (oldval & _CLR) | (newval << _LSB)
<mrvn>
doug16k: and even worse: if bit0 == 0 all bits are available, otherwise constraints apply.
<doug16k>
just make a field that overlaps the others
<clever>
or (val & _SET) >> _LSB to read it out
<dzwdz>
so i've uploaded all the stuff that i think is relevant to my linker issue from yesterday to https://ttm.sh/wvG.txt
<dzwdz>
i still haven't figured it out
<clever>
but, the names are rather long
<clever>
so it turns into (oldval & SCALER_DISPCTRL_IRQ_EN_CLR) | (newval << SCALER_DISPCTRL_IRQ_EN_LSB)
<doug16k>
yeah and RW1C
<doug16k>
read write one to clear
<clever>
i prefer the style linux does, #define SCALER_DISPCTRL_IRQ_EN(x) ((x << 0) & 0x7f)
<GeDaMo>
Somebody was talking about Galois Field instructions in another channel the other day, this sounds similar, arbitrary bit operations on a vector register
<clever>
and then you can just do `SCALER_DISPCTRL_IRQ_EN(foo) | SCALER_DISPCTRL_DSP1_IRQ_CTRL(bar)`
<clever>
clearing isnt really covered by that style though
<clever>
doug16k: though, i have seen something far more fancy in Fuchsia i think it was, where you have a c++ class for every single register, with a load/store method, and i think some bitfield structs or something
<clever>
so you can copy mmio->temp, then use bitfields to twiddle it, then store temp->mmio
<doug16k>
yeah, I resisted making it all fancy on purpose
<doug16k>
I totally get carried away with C++ if I don't hold back
<clever>
:D
<zid`>
dzwdz: so from last night, .rela.text is getting discarded at link time for some reason but I also have no idea why
<clever>
also, you cant just use helper "set one field" macros blindly, because several peripherals require a password
<dzwdz>
welcome to the club :D
<clever>
you must `| 0x5a000000` every value you write to the peripheral
<clever>
or it just silently ignores the write
<mrvn>
doug16k: I want a C++ class that can be used as temporary but would be real hard to store permanently. A rvalue-only class. Ideas?
<doug16k>
I really didn't want cute C++ tricks all through my driver modules
<clever>
doug16k: i think you could do the same fancy tricks with plain c, using a union of an int32 and a bitfield struct
<doug16k>
I hate aliasing through unions. I won't, ever
<clever>
i would still want to throw some c++ namespaces in though, to make the names easier to manage, given that you need a type for every single register
<clever>
how else do you re-pack the bitfield into a 32bit int?
<mrvn>
std::bit_cast
<mrvn>
using unions is UB
<doug16k>
mrvn, if you made it private constructor, they have to use your static factory, then you could give them a meaningless id and use that to locate it in memory in static wrappers.
<clever>
mrvn: ah, unions are only meant for a tagged union, where you have a type-tag + union, and only ever access the correct variant?
<mrvn>
doug16k: they should never ever be in memory. Only
<doug16k>
doesn't mean you automatically have to call allocator either. can have pool
<zid`>
dzwdz: just for reference, your -lgcc is in the wrong place
<mrvn>
clever: or where the members of the union have a common prefix.
<zid`>
it needs to be on the right, else it's going to get processed, none of its symbols will be needed, and it will be tossed
<zid`>
then ps2.c.o will appear, need something from lgcc, not find anything, and fail
<doug16k>
mrvn, how can they not be in memory?
<zid`>
link order is left to right
<dzwdz>
so i should put it after the object files?
<zid`>
yes
<mrvn>
doug16k: example: a = b * c + d. I want a Product class so that T * T -> Prod<T>, Prod<T> + T -> fmadd T,T,T = T
<doug16k>
if it fit in one register, you could tell gcc to not use it, then keep your thing there. if you call something, they might push it, though
<dzwdz>
also til that link order matters
<doug16k>
have to ban it altogether
<mrvn>
doug16k: but storing a Prod<T> in memory is not desired.
<doug16k>
of course. but what you do is just write the code as memory and the optimizer makes it all registers
<mrvn>
doug16k: something like Prod<t> t = a * b; should give an error
<doug16k>
when you make a __m128i, picture a register. it probably will be
<doug16k>
if it is right there hot in the evaluation of some operator overloading
<mrvn>
stop thinking of the backend. whether it's a register or spilled on the stack is compiler business.
<doug16k>
even if it took the intrinsic literally, it's parameters in registers and register return
<mrvn>
I want something not to be an lvalue
<doug16k>
ah
<doug16k>
delete operator= ?
<zid`>
I was just fiddling with my kernel tree to look at some stuff.. and I can't remember how to link it :D
<bslsk05>
github.com: boros/elf.h at master · zid/boros · GitHub
<zid`>
same :p
<gog>
aayy
<mrvn>
zid`: is that x86_64 or ARM64?
<dzwdz>
gog: thanks!
<moon-child>
former, presumably
<dzwdz>
my elf loader worked on the first attempt
<dzwdz>
i feel like a god
<mrvn>
dzwdz: you did something wrong. Stuff never works on the first attempt. That just shows you didn't add enough test cases.
<dzwdz>
i mean, i definitely did
<dzwdz>
i'm ignoring the program header flags
<dzwdz>
and so far i've only tested this on an elf which only has one phdr
<dzwdz>
but still, it worked \o/
<mrvn>
I wonder if those "Learn C++ by solving interview questions online" sites randomize their test cases. What if you write just `cout << "1\n" << "17\n" << "-1\n" << "23\n";` adding each expected result when the submission fails?
<zid`>
neato
<zid`>
Did you see my elf loader yet?
<zid`>
It's about 4 lines
<zid`>
because I don't do relocations
<mrvn>
just mmap all sections with LOAD flag?
wand has quit [Remote host closed the connection]
<mrvn>
don't you need more than 4 include files already?
<zid`>
ke = (struct elf_header *)kernel_start; p = (struct program_header *)(u32)(ke->e_phoff + kernel_start); for(i = 0; i < ke->e_phnum; i++)
<zid`>
In other news, surprisingly nice today for 35C
<mjg_>
dzwdz: nice
<mjg_>
dzwdz: and it's hard to claim the usual of "are you sure that's what you tested"
<mrvn>
mjg_: Here was my Makefile: testing: test \n test 23
<mrvn>
somehow all my "make testing" worked perfectly.
Bonstra has joined #osdev
<ddevault>
oh good
<ddevault>
interrupt 18 on my hardware
<mjg_>
mrvn: solid
<gog>
aaay my headers are in a success
<gog>
unlike my project
gildasio has joined #osdev
wgrant has quit [Quit: WeeChat 2.8]
wgrant has joined #osdev
gog has quit [Quit: byee]
<ddevault>
what page allocation schemes are people using here? I was going to do a free list but that may not be a great idea
toluene has quit [Read error: Connection reset by peer]
xenos1984 has quit [Read error: Connection reset by peer]
toluene has joined #osdev
<mrvn>
ddevault: a free list is pretty much perfect.
<ddevault>
well
<ddevault>
when writing a word (later to be a pointer) to every page on my system
<ddevault>
I got a machine check exception
<ddevault>
but memtest86+ reports no errors
<ddevault>
really not looking forward to opening this can of worms to find out more
<mrvn>
ddevault: if you don't have physical memory mapped all the time you might use a list of stacks, each stack using 1 page - next pointer.
<mjg_>
just a free list does not let you do huge pages
<mjg_>
but that's not necessarily something you want to invest into at this stage
<mrvn>
mjg_: sure, you just have to search the list for a page.
<mrvn>
mjg_: My view on this is that shortly after boot your memory will be fragmented to the point where you don't have any huge page fully free. At that point you need something to swap physical pages to assemble a huge page. And then a free list works again even without searching.
<mjg_>
cmon man, people have gigabytes of ram even on laptops for over a decade now
<mjg_>
2MB contiguous range is not that hard to find if you make provisions for it
<mrvn>
total used free shared buff/cache available
<mrvn>
I bet there is no 2MB page free.
<mjg_>
to counter your argument, freebsd is doing transparent huge page promotion no problem
<mjg_>
but since it does not do any defragmentation that ability goes away after a few days of uptime
<mjg_>
but the crux is that it most definitely works fine when freshly booted
<mrvn>
see. Even allocating and freeing pages in some careful and complicated manner only helps for a time.
<mjg_>
that's because they did not provision for making it work in the long run
<mrvn>
mjg_: when freshly booted you can take huge pages from the end of the free list.
<mrvn>
return huge pages to the end as well so they can be reused.
<mjg_>
for example there is a leftover concept from the old unix systems where you actually *never* free pages backing certain objects
<mrvn>
yeah. Those would get allocated to the front though.
<mjg_>
as the page allocator is oblivious to it, it just hands out pages which ultimately make it impossible to have a contiguous range
<mjg_>
and which will *never* get freed until you reboot
<mjg_>
basically it's their fault
<mjg_>
in contrast linux can do huge pages ok-ish
<mrvn>
If you avoid using physical memory for any object that problem is solvable.
<mjg_>
of course it is solvable. the easiest thing to do is to collect all these allocations in the same set of pages
<mjg_>
then you can in fact even promote the crap at some point
<mrvn>
if they are virtual you can swap the physical pages
<mjg_>
they host object which don't tolerate faults on access
<mrvn>
stop all cores but this one, swap, resume.
<mrvn>
not exactly nice on a 1024 core system but how often will you do that?
<mjg_>
well ye, if you tolerate temporarily halting the box, that is an option
<mjg_>
but this only weakens your take on non-applicability of huge pages in the long run
xenos1984 has joined #osdev
<mrvn>
mjg_: my point is that in the long run it doesn't matter what free thing you use, it will run out of huge pages and you need to defrag.
<mrvn>
same in the short run, anything works. might as well keep the simplest way.
<mrvn>
A clever way gets you from hours to days to a week. But it's all limited.
xhe has quit [Ping timeout: 244 seconds]
gog has joined #osdev
<mjg_>
mrvn: but contrary to what you stated, fresh at boot it *is* perfectly feasible to get tons of huge pages
<mjg_>
mrvn: as for long run viability, there are ways to mitigate fragmentation but perhaps more importantly you may boot with a huge page user
<mrvn>
mjg_: nothing contrary there. A free list will give you huge pages there.
<mjg_>
mrvn: which stays up for the duration
xhe has joined #osdev
<mjg_>
for a user space process? there can be others allocating before it gets to the 2MB range and then you are screwed. similarly it may have happened to start allocating one page past the initial boundary
<mrvn>
mjg_: take the last 512 pages from the free list and you have a huge page.
<mjg_>
basically if you don't make provisions to make huge pages happen, they wont unless by accident
<mrvn>
mjg_: all you need for that is a doubly linked list.
<mrvn>
alternatively have 2 free lists: 4k and 2M.
<mrvn>
but that's getting into clever territory.
<mjg_>
well say you have a process which mmaps BIGNUM megabytes and starts with faulting 10 pages from the beginning
<mjg_>
then it gets preempted and some other process faults 2 extra pages
<mjg_>
how does your free list cope with it vs huge pages
<mrvn>
then 12 pages get used from the front of the free list
<mrvn>
the end still has huge pages to use
<mjg_>
but did not you just "steal" 2 pages from the 2MB range from the first process
<mjg_>
make it impossible to promote to huge page for that one
<mrvn>
mjg_: it's pretty much a given that that happens. When the process allocates pages 11-512 you might have to copy the data to a huge page.
<mrvn>
You can avoid that by marking mmaps of HUGE ranges to use huge pages from the start or something.
<ddevault>
ah, my machine check exception might have something to do with writing to the address *after* the memory region I'm working with
<mjg_>
or the kernel could make provisions to hand out "sufficiently spread" pages for mappings
<ddevault>
const addr = phys + (i * arch::PAGESIZE): uintptr; // right
<mjg_>
for example it knowns the mmap area is big and may be viable for huge pages, so it starts from physically aligned to 2MB
<mjg_>
for the other process say mmap is only for 4 pages, then it does not care apart from not messing with the potential range for the first one
<mjg_>
consequently giving out pages from elsewhere
<mrvn>
mjg_: that's what I said
<mjg_>
but then it's not just a free list
<mrvn>
sure it is.
<mjg_>
i don't think this is worth continuing
<mjg_>
so how about we agree to disagree
<mrvn>
mjg_: The fault handler checks if the mapped region is SMALL or HUGE. If it's HUGE it takes 512 pages from the end, otherwise 1 page from the front.
<mrvn>
mjg_: Even if you don't map it as huge page instantly you still have to reserve the other 511 pages so they shouldn't be in the free list.
arch_angel has joined #osdev
arch_angel is now known as arch-angel
noeontheend has joined #osdev
arch_angel has joined #osdev
arch_angel has quit [Remote host closed the connection]
<bslsk05>
docs.kernel.org: Booting AArch64 Linux — The Linux Kernel documentation
noeontheend has joined #osdev
X-Scale` has joined #osdev
X-Scale has quit [Ping timeout: 240 seconds]
X-Scale has joined #osdev
X-Scale` has quit [Ping timeout: 240 seconds]
frkzoid has quit [Ping timeout: 240 seconds]
hello-smile6 has quit [Remote host closed the connection]
hello-smile6 has joined #osdev
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
ajr has joined #osdev
<zid`>
bytefire: So potentially a warm boot then, being my point.
<zid`>
i.e if I hit the reset switch on a running system, the first instruction the kernel sees is still the same
<zid`>
you can't *know* you're in a cold boot
<mrvn>
Even in cold boot you can't know if the cache hasn't been active before.
frkzoid has joined #osdev
matt__ has joined #osdev
matt__ is now known as freakazoid333
frkzoid has quit [Ping timeout: 244 seconds]
Raito_Bezarius has quit [Ping timeout: 240 seconds]
blockhead has joined #osdev
freakazoid333 has quit [Ping timeout: 244 seconds]
Raito_Bezarius has joined #osdev
cor3dev has joined #osdev
cor3dev has left #osdev [#osdev]
cor3dev has joined #osdev
Brnocrist has quit [Ping timeout: 240 seconds]
mzxtuelkl has quit [Quit: Leaving]
ZipCPU has quit [Ping timeout: 255 seconds]
<geist>
bytefire: the cache disable is pretty complicated
<geist>
and yeah mrvn is onto it
<geist>
basically the cache may be disabled however it may have stale entries in it
tsraoien has quit [Ping timeout: 272 seconds]
<cor3dev>
hi :)
<geist>
howdy
ZipCPU has joined #osdev
the_lanetly_052 has quit [Ping timeout: 272 seconds]
GeDaMo has quit [Quit: There is as yet insufficient data for a meaningful answer.]
frkzoid has joined #osdev
frkzoid has quit [Ping timeout: 240 seconds]
matt__ has joined #osdev
matt__ is now known as freakazoid333
Patater has quit [Quit: Explodes into a thousand pieces]
vdamewood has joined #osdev
Patater has joined #osdev
<zid`>
guys, is honzuki v4p9/v5p1 1/8 scheduled for this week or next?!
<immibis>
mrvn: since it was compiled that way. I believe PIE is something your distribution enables or disables, generally. Of course, I run gentoo so for me it's a system-wide config file option
<immibis>
or at least I assume it is
<zid`>
yea pie has been global on for a loong time in gentoo
tsraoien has joined #osdev
<immibis>
global configurable
<zid`>
the selinux profile is going to want it anyway so why bother maintaining two sets of gcc ebuilds I suppose :p
<geist>
yah that used to be somewhat more controversial, especially in the x86-32 days, but nowadays i think most stuff is just PIE so it can aslr if nothing else
<immibis>
it's not even a different compiler, it's a use flag
<zid`>
use flags are oneshots for compile time
<geist>
i do remember doing the whole global prelink for all the libs and whatnot back in the day
<immibis>
here's a fun fact about CPUs: reset is usually instantaneous. You are perfectly entitled to reset the CPU multiple times per second as part of normal operation. (Dunno whether x86 PC architecture lets you do it)
<immibis>
some old microcontrollers use reset as an interrupt
<geist>
immibis: it's tricky in x86, yes
<zid`>
so you could toggle it sure, but it's sorta 'set' once you pick, and the profile has a default in it
<immibis>
zid`: you can recompile any time
vinleod has joined #osdev
<geist>
yep, it's common to simply power off the core in microcontrollers when you have nothing to do
<geist>
and set some wakeup event (RTC, etc) to just bounce the cpu through the reset vector
<geist>
and you can simply arrange for your kernel to keep running in a few microseconds
<geist>
i think the deeper C states in x86 are probably functionally doing this, it's just hidden behind a few layers so the kernel doesn't have to see it
<geist>
but some ARM machines you end up doing this sort of thing too, for low power states
<zid`>
I've stopped rebooting PCs
<zid`>
It almost never fixes whatever driver issue I've had that's caused me to want to
<zid`>
bios skips too much initialization if the magic_warm_boot_flag is still valid or whatever
<immibis>
IIRC triple faulting to reset the CPU was the only way to exit 286 protected mode
vdamewood has quit [Killed (mercury.libera.chat (Nickname regained by services))]
vinleod is now known as vdamewood
<zid`>
it still is isn't it?
<immibis>
why are you in 286 protected mode
<zid`>
s/286/
<zid`>
otherwise you end up in unreal mode, so you have to manually set up selectors that would match how real mode would behave
<zid`>
some nonsense anywwho
<clever>
immibis: "You are perfectly entitled to reset the CPU multiple times per second", that sounds like using bios realmode drivers in win 95
<immibis>
i thought when you loaded selectors in unreal mode you exited it
<immibis>
clever: and i thought you could go to real mode by a cr bit
<clever>
i suspect such tricks are far more costly on an soc like the rpi, because the dram controller is also killed
<clever>
so you have to re-init the dram after every reset
<zid`>
probably also messes with pci-e on x86 then?
<clever>
at least, for the global reset
<clever>
there is probably an arm-only reset, and an x86-only reset for those chips
<clever>
just the cpu core itself, and nothing else
<geist>
indeed. to power the cpu core off but keep dram running is probably not worth the effort on a rpi
<geist>
but on a microcontroller it may be a net win. its all about considering the cpu to just be one part of many of silicon, of which you usually have separate controls over power gating
<geist>
so depending on how far down you're trying to get your low power state you turn off the things you dont need but also factor in how much work you need to do to get out of it
<zid`>
Yea this is just fractal power gating
<zid`>
You turn the device off when you don't need it, the device turns the cpu off when it doesn't need it, the cpu turns regions off when it doesn't need them, etc
<clever>
for the rp2040, there is a special power-down mode for the sram, that reduces the voltage to the point where it can preserve data, but any access will destroy data
<zid`>
how relevent each of those is depends on the device
<geist>
lots of microcontrollers also have multiple banks of sram, of which yo can power them off individually
<clever>
and that is used to further reduce power when in sleep modes
<geist>
makes it a little harder to write your code since you have to consider separate banks differently
<clever>
and yep, the rp2040 lets you turn each bank on/off as well
<clever>
so if you dont need it all, you can kill some of it
<geist>
but if you can, for example, squirrel away all your recovery bits in say a 1k sram bank (out of 64k) then you can power it off, kill the cpu, and when you come out of reset if you see the bank is up and has some data, short circuit booting your system and pick up where you left off
<geist>
yep. for a project that never shipped we had LK running on a litle microcontroller off a cr2032. actually would run for months like that since it spent nearly 100% of its time completely powered off except an RTC + something like 256 bytes of backup sram
<clever>
the rp2040 doesnt really need to do that, because the arm registers are preserved across a sleep state
<geist>
but upon wakeup could get LK up and running within a millisecond or so
<geist>
and a lot of that was waiting for the PLL to latch up
<immibis>
not in SRAM - it might power up in a valid-looking state. You need some register to remember the reason the CPU went to sleep. Usually there's a power-on-reset bit which tells you the CPU came online because the chip got powered up
<geist>
immibis: sure, you can check to see if the bank was powered up on the reset vector
<immibis>
of course, you have a very good chance that it won't power up in a valid-looking state... should you risk it?
<geist>
yes. if that's the only solution, you have to risk it
<geist>
easy to fix though, put a crc or whatnot over your structure in memory
<geist>
astronomically low that it'll come up vcalid
<immibis>
it's not the only solution since you have peripheral registers and such. As you just mentioned, the power state of the memory banks could be how you discover it
<geist>
valid
<geist>
indeed. that's the fun part of some of these microcontroller things outside of just hobby board stuff
<immibis>
one in a billion is next tuesday
<geist>
actually really getting everything out of the tinkertoy set you have in front of you
<geist>
easy to extend it way past one in a billion. that sort of 'keep a thing in memory that's checksummed' is really really common
<geist>
frequently even how systems like android or whatnot communicate with themselves across reboots
<geist>
you can put down some pretty complicated structure with magic values and whatnot and then hash/checksum it i think it's pretty much not a problem
Brnocrist has joined #osdev
<zid`>
My bios just writes a magic value into ram to detect warm reboots
<geist>
but as you say there are usually peripheral registers that at the minimum usually signal if the soc had cold booted or reset or if was woken from one of N external sources (RTC, gpio wiggling, etc)
<immibis>
just make sure your checksum is really long
<geist>
and that's already the first major decision point your reset code takes
<immibis>
I made a little toy on a PIC12F505(IIRC) with 27 bytes of memory and long sleep delays using the watchdog timer on its longest setting
<immibis>
(that was the smallest microcontroller in my box of microcontrollers and you're right, it is pretty cool to use)
<zid`>
It's actually the first instruction in one of the roms I looked at
<zid`>
it does mov eax, [sidoisodisods]; cmp eax, 0x37483asdui
<zid`>
which I assume is both magic and polyglot and other things
<mats1>
O_O
<immibis>
magic is definitely how they got an 's' into the number
freakazoid333 has quit [Ping timeout: 244 seconds]
frkzoid has joined #osdev
frkzoid has quit [Ping timeout: 272 seconds]
gildasio has quit [Remote host closed the connection]
gildasio has joined #osdev
<mrvn>
immibis: some gcc binaries are PIE and some are not.
<geist>
hmm, networking question: do modern nics do segmentation offload by fragmenting tcp packets as if they were fragmented ip or do they actually re-send and combine at the TCP level
<geist>
*probably* the latter, but it seems like it'd be easy enough to fragment at the ip layer
<geist>
the latter obviously requires more computation, but would be cleaner
Matt|home has quit [Ping timeout: 244 seconds]
vdamewood has quit [Quit: Life beckons]
<immibis>
since it's called segmentation offload and not fragmentation offload I am going to guess the latter
<immibis>
don't you just love redundant multi-layer features? IPv6 removing fragmentation is a good thing
<clever>
> EL2 actually has a bit to alias the EL2 system registers onto EL1 system registers
<clever>
geist: heard this just now, that would allow running an aarch64 LK in both EL1 and EL2, without any major code changes
<immibis>
wait it didn't remove fragmentation. I'm thinking of fragmentation by routers
<geist>
immibis: exactly. it's an extension header in v6
<mrvn>
segmentation offload is when you send a 64k frame to the NIC and it sends 1500 byte frames?
<geist>
clever: that sentence does not parse, retry
<geist>
mrvn: yes, and vice versa. the nic sees a buch of subsequent packets and reassembles them into a larger one to pass to the stack
<geist>
that's why on a modern machine if you do a tcpdump or so you'll probably see up to 64k TCP packets, since thats what the net stack sees
<mrvn>
I guess with fragmentation some stupid router on the internet would try to reassemble the frame and send it on with a different fragment size or something and choke.
<geist>
clever: oh. i see. it's a statement. yes, that's an v8.1 feature, though it's a bit more complicated than that
<geist>
mrvn: i think that's explicitly forbidden in both v4 and v6
<immibis>
mrvn: Actually the problem with fragmentation was the exact opposite. When routers had to generate fragments it forced those packets to use a slow path
<geist>
and v6 routers cannot fragment
<mrvn>
immibis: the problem with fragmentation in IPv4 is that routers don't generate ICMP fragmentation-needed signals, especially DSL bridges.
<immibis>
inside a really fast router 99.999% of packets get processed by dedicated silicon that looks up the next destination and forwards the packet that way. Fragmentation is one of the things the special chip can't do.
<geist>
i think it's basically standard practice nowadays to send the do-not-fragment bit on v4 packets, but yeah that's relying on routers downstream to do the right thing and send back the ICMP
<immibis>
when you have this hundred terabit router that can theoretically generate fragments, how is it actually going to generate a hundred terabits of fragments? it's not
<clever>
there are also 2 forwarding strategies ive heard of
<immibis>
all those fragmented packets are going to be routed to a CPU and processed at a couple of gigabits. It's a bottleneck and it's totally unnecessary
<clever>
either store&forward, or the other one
<immibis>
cut-through
<immibis>
the routers I worked on did not do cut-through forwarding
<clever>
store&forward, requires the entire packet in memory, before it forwards it onward, so it always adds a latency of packetsize / bitrate
<mrvn>
immibis: the CPU doesn't have to look at all the bits. Only the header, and then slice the frame. You can pretty much zero-copy that.
<immibis>
although they did split the packet up into a linked list of data cells, so I wonder if they were smart enough to start computing the routing after the header cells only
<geist>
bunch of recomputatios of checksums
<clever>
but cut-through i think, would just parse the ethernet dest in the first 6 bytes, then begin forwarding, so your latency is just 6 bytes / bitrate, best case
<clever>
assuming the dest isnt busy when the packet arrives
<mrvn>
So you would go from 6 byte latency to 64byte latency to see the header
<mrvn>
But horribly complex silicon for something you just shouldn't do and nobody does.
<geist>
i'd be a little surprised if anything gets that sort of latency anyway
<mrvn>
clever: if you only look at 6 bytes you don't even see if you need to fragment. But with in and out using the same MTU that's OK.
<clever>
yep
<clever>
it cant fragment or merge packets, only ever forward
<mrvn>
no checksum checking either
<clever>
yep
<mrvn>
I guess if you have bad cables you want a store&forward design.
<geist>
there are a pile of extensions that might need to be looked at, but possible they're designed such that routers never have to look at extension headers?
<geist>
or the implicit order of the headers is router ones go first?
<geist>
(thinking v6 here)
<geist>
never wrote a router, so haven't gone through that exercise yet
<mrvn>
But if you do store&forward you could just have one of those smart NICs on each port, send it the big frame and then the port itself fragments and sends it out. All nice a pipelined.
<mrvn>
Basically one CPU per port to do it all in parallel at full throughput.
[itchyjunk] has quit [Ping timeout: 272 seconds]
[itchyjunk] has joined #osdev
<bytefire>
mrvn: geist: the cache most likely has been active before but linux expects it to be cleaned to the point-of-coherency.
<mrvn>
bytefire: aparently not if it cleans it
<bytefire>
hmm... trying to think this through
<bytefire>
btw kernel is invalidating it. clean does happen as a side effect but clean should be a no-op otherwise the data it wrote will be overwritten
<mrvn>
some bootloader probably didn't follow the rule and rather than trashing a bunch of boxes with the bootloader in rom they added the clean cache op to work around it.
<mrvn>
or cache could be in some random uninitialized state on cold boot.
<bytefire>
if a bootloader didn't follow the rule, i.e. didn't clean the cache, then the code in question will lead to a bug
xenos1984 has quit [Read error: Connection reset by peer]
<bslsk05>
lore.kernel.org: [PATCH v5 8/8] arm64: enforce x1|x2|x3 == 0 upon kernel entry as per boot protocol - Mark Rutland
<immibis>
geist: IIRC the "hop-by-hop extension" has to be first
<immibis>
Hop-by-Hop Option Header*
<immibis>
this is signalled in the "next protocol" field. The intent is, of course, to optimize the slow path test by consolidating a bunch of slow path features under the same check (is hop-by-hop option header present)
<immibis>
mrvn: yes, designing a fast router seems like a neat project, if you have the time and inclination. Maybe see how much throughput you can get from a stock CPU and stock NICs
vin has joined #osdev
xenos1984 has joined #osdev
<mrvn>
immibis: too high-level. I build my own CPU. At some point I have to design my own NIC.
<immibis>
at the router company we had $XXX,XXX traffic generator tools which were incredibly useful but you can only have that if you're a company
<immibis>
may have been $X,XXX,XXX, idk, I wasn't in the accounting department
<mrvn>
And maybe a 10MBit <-> GBit bridge with stock components because I don't think I can make my own GBit NIC.
<mrvn>
immibis: probably something you can do with a $1000 Linux box and some man years idle time to think of all the traffic patterns.
<immibis>
I think you can feasibly do 100Mbps, the signal processing doesn't look hard
<immibis>
for the NIC
<mrvn>
With a 1kHz - 1MHz system?
<immibis>
mrvn: Nah, you have to specify the traffic patterns, at least in the modes I remember using. Yes, a Linux box could probably do it.
<immibis>
mrvn: oh idk. Put an FPGA in the path and you can do 100
<immibis>
mrvn: I do remember using a free tool called ostinato to generate traffic from my laptop a few times
<mrvn>
immibis: FPGAs have network interfaces ready made, you just need the physical stuff. I would want to maybe do both myself and not use some storebought.
<mrvn>
just for the experience.
<mrvn>
Otherwise I can just buy one of the Arduion ethernet shields.
Matt|home has joined #osdev
Dyyskos has joined #osdev
gildasio has quit [Remote host closed the connection]
gildasio has joined #osdev
Dyskos has quit [Ping timeout: 272 seconds]
gildasio has quit [Remote host closed the connection]
<mrvn>
When you ask google / alexa / whoever to flip a coin does it ever land on the rim?
gildasio has joined #osdev
<cor3dev>
pretty sure it will land on the rim if anton chigurh is the one who asks
<gog>
call it, friendo
<cor3dev>
i need to know what we're calling for here