klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
<mrvn> clever: no, you need to 0 out the ints.
<mrvn> moon-child: same thing if you have longs in your code.
<clever> mrvn: all ints are 64bit ints
<clever> even on a 32bit platform
<mrvn> clever: but they are likely <4GB, which is where all the allocated memory is.
<klange> my ints are 48-bit, please send help
<clever> *looks*
<moon-child> mrvn: I use a tiny fraction of the 64-bit address space
<moon-child> and it starts well above the 4gb mark
<clever> $ grep heap /proc/self/maps
<clever> 009a8000-00a13000 rw-p 00000000 00:00 0 [heap]
<mrvn> moon-child: lucky you. Linux doesn't.
<mrvn> (except on alpha)
<clever> moon-child: ah you're right, my heap is very low in the virtual space, so it could collide with an arbitrary 32bit int
<clever> haskell searches in a similar manner, but every object has a pointer to its type description, that says exactly where to find the pointers
<clever> so it cant mis-interpret an int as a pointer
<mrvn> in ocaml pointers have bits 0/1 == 0, ints have bit 0 == 1 and are only 31/63bit.
<moon-child> http://ix.io/3RrX seems pretty high
<mrvn> 10 and 11 are used for other special values.
<bslsk05> ​github.com: nix/eval.cc at master · NixOS/nix · GitHub
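The tagging scheme mrvn describes (integers carry bit 0 == 1, pointers have their low bits clear) can be sketched in C roughly as follows. This is only an illustration of the idea, not OCaml's actual runtime code; `value_t`, `tag_int`, `untag_int`, `is_int` and `is_pointer` are made-up names, and the sketch assumes that right-shifting a negative signed value is an arithmetic shift, as it is on gcc and clang.

```c
#include <stdint.h>
#include <stdbool.h>

/* One machine word holding either a tagged integer or an aligned pointer. */
typedef uintptr_t value_t;

/* Integers are stored shifted left by one with bit 0 set,
 * so they lose one bit of range (31 or 63 usable bits). */
static inline value_t tag_int(intptr_t n) {
    return ((uintptr_t)n << 1) | 1u;
}

/* Assumption: >> on a negative intptr_t is an arithmetic shift. */
static inline intptr_t untag_int(value_t v) {
    return (intptr_t)v >> 1;
}

/* A scanner can tell the two apart without any type information. */
static inline bool is_int(value_t v) {
    return (v & 1u) != 0;
}

/* Pointers to word-aligned objects have both low bits clear. */
static inline bool is_pointer(value_t v) {
    return (v & 3u) == 0;
}
```

With this layout a collector never mistakes an integer for a pointer, which is the property a conservative scanner lacks.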
<clever> mrvn: boehm has its own allocation routines, which could enforce alignment, so bit0/1 are always 0, and it may have extra metadata before the object
<clever> so a pointer into the middle of an object wont count
eddof13 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<moon-child> sbcl also has a tagging system, but does conservative stack scanning. Cuz sometimes you wanna spill unboxed values, and making stackmaps is a pain
<mrvn> clever: but it doesn't have control about where you point to. You can point to the middle of a string.
<moon-child> (however, scanning of everything aside from the stack is precise)
Oli_ has quit [Ping timeout: 272 seconds]
<klange> y'all seem to know a lot about gcs
<klange> why y'all ain't helpin me make mine suck less
<mrvn> moon-child: DWARF2 already has the stackmap made by the compiler.
Burgundy has quit [Ping timeout: 260 seconds]
<moon-child> sbcl does not use DWARF
<mrvn> klange: I don't think there is a GC for multi-core that doesn't suck. They just manage to suck about the same as malloc/free.
Oli has joined #osdev
<moon-child> just compacting wins you major points vs malloc/free
<mrvn> moon-child: but costs you time.
<moon-child> it pays for itself
<moon-child> but also: lots of gcs do parallel mark, and there are a few concurrent gcs
<mrvn> and they all suck in different ways
<moon-child> yes. Everything sucks! Such is life :)
<moon-child> (also, on the malloc/free front, interesting recent developments in mimalloc and snmalloc)
<mrvn> the biggest problem is how to modify values, because you have to somehow atomically flag them to the GC to be scanned again.
<moon-child> https://twitter.com/stevemblackburn/status/1494240906006110209 quotes 1% overhead from barriers
<bslsk05> ​twitter: <stevemblackburn> Some of the things I learned: ␤ ␤ Overhead now as low as 0.8% on a modern AMD CPU (was 1.8%). ␤ Intel CPUs are less able to hide the overhead. ␤ Field barrier overheads ~= object barriers on all but the AMD CPU. ␤ ␤ I'm so proud of these students. 🥰 ␤ ␤ 2/2
<mrvn> Luckily in functional languages nearly all values are immutable and mutable the rare exception. So you pay little for the overhead. Trying to GC something like C/C++ you pay a lot for it.
<moon-child> indeed
FatAlbert has quit [Ping timeout: 256 seconds]
<mrvn> Did boehm GC use mprotect to catch writes to already scanned memory?
<clever> mrvn: at least in the way nix uses it, its single-threaded, so scanning cant be interrupted or raced against
<mrvn> urgs, horrible. Stop the world for ages technology. :)
<mrvn> Modern GC do a little bit of work frequently and then have a tiny stop-the-world part on multi-core where all cores need to synchronize
<clever> boehm might be cheating, and doing scans in another thread, while nix does everything from 1 thread
<clever> ive not reviewed its source
<mrvn> unlikely. boehm is quite old.
<mrvn> I've seen other GC use generational heaps and mprotect. The idea is that you create a new heap every now and then and make the old ones read-only. Then if no page fault occurred you know the old heap has no pointer into the new heap and doesn't have to be scanned.
<clever> haskell uses a copy collector for its stuff, with a nursery section as well
<clever> recently created objects are in the nursery, which is a smaller region, and gets scanned more often
<clever> if the object survives infant-mortality, it gets moved into the big-boys heap, where its scanned less often
<clever> the copy collector works by just copying every object in the entire heap, while chasing references, and somehow rewriting all pointers as it copies
<clever> and then it just marks the source heap as free
<clever> any objects it didnt copy, were not referenced
<clever> that also improves locality, an object is going to be closer to its siblings in the reference graph, because they got copied together
<mrvn> that's generational heaps with compaction. But haskell knows about modify operations from the compiler and doesn't have to use mprotect()
<clever> nix and haskell are also both functional, objects almost never get modified, at a value level
<clever> but thunks can mutate from a function-ptr+arg, into a concrete value, when you look at them
<mrvn> When you talk to c++ people about performance they always say: avoid indirections, never ever use linked lists. They will kill you. Ocaml, haskell, ... are full and full and full of indirections and still the code runs comparable to optimized c/c++. The linked lists and indirections are just so localized that they are basically always in cache.
<clever> ive seen a talk before about how to not have fps jitter in android games, without native code
<clever> and it basically boiled down to a few things
<mrvn> run GC::sweep() after every frame
<clever> 1: dont create any garbage in your render routines, reuse field members on a constant object, instead of local vars
<clever> 2: disable gc entirely during rendering, and force a gc scan when the fps doesnt matter, like a loading screen
<mrvn> urgs, horrible.
<clever> if you keep garbage creation to a minimum, you can go an entire level without a single gc
<clever> i think the root problem, is that a gc may take more than 1/60th of a second
<mrvn> I create tons of garbage while rendering in anything with fps. All in the nursery if I set it up right and at the end of rendering the GC frees it all in one sweep.
Oli has quit [Read error: Connection reset by peer]
<clever> and if it does, you miss a frame, and the frame rate now has jitter
<clever> the old dalvik engine may not have had a nursery
<mrvn> All the GC has to do is copy the 10 or so objects that are still alive (e.g. new rockets fired by the player) from the nursery into the long term heap and done.
<clever> pretty sure that talk also predates the dalvik->native llvm converter
Oli has joined #osdev
<mrvn> clever: For games you want a GC where you set it up to use a fixed (or given) amount of time to do work each frame. E.g. use 1/60 - <time spent rendering>.
eddof13 has joined #osdev
<mrvn> So if one frame takes longer to render you do less GC work and next frame you catch up again when less happens.
<clever> yeah, being able to tell the gc to do less work could do
<mrvn> You could even set up the GC to run until the vblank IRQ fires.
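The per-frame budget mrvn suggests (1/60 s minus the time spent rendering) could look something like this in C. It is a sketch under assumptions: `gc_budget_ns`, `gc_run_for`, `clock_fn` and `gc_step_fn` are hypothetical names, and the collector is assumed to expose its work as small slices through a callback.

```c
#include <stdint.h>
#include <stdbool.h>

#define FRAME_NS UINT64_C(16666667)  /* roughly 1/60 s in nanoseconds */

/* Hypothetical callbacks: a monotonic clock and one bounded slice of GC work. */
typedef uint64_t (*clock_fn)(void);
typedef bool (*gc_step_fn)(void);    /* returns false when no work is left */

/* Time left for the collector after rendering; zero if the frame overran. */
uint64_t gc_budget_ns(uint64_t render_ns) {
    return render_ns >= FRAME_NS ? 0 : FRAME_NS - render_ns;
}

/* Run GC slices until the budget is used up or the collector is idle.
 * Returns the number of slices executed. */
unsigned gc_run_for(uint64_t budget_ns, clock_fn now, gc_step_fn step) {
    uint64_t deadline = now() + budget_ns;
    unsigned slices = 0;
    while (now() < deadline && step())
        slices++;
    return slices;
}
```

A frame that takes too long to render gets a budget of zero, so the collector simply catches up on a later, cheaper frame.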
<mrvn> Some years back they profiled and benchmarked malloc/free in C/c++ and the GC in ocaml and found that programs generally spend the same amount of time in malloc/free as they do in the GC.
<mrvn> So you just have to make sure the GC work is cut into small enough pieces to not get jitter.
<clever> i feel like free() should be pretty darn fast
<clever> you just need a structure at a negative offset from the object being freed, and maybe a doubly linked list to similar structures forward/back
<clever> so you just need to flag that slot as free space, and optionally merge it with contiguous free space blocks
vdamewood has joined #osdev
<clever> and maybe update some free-space lookup tables, if you have them
<clever> malloc feels like the costly one, having to search for a hole that is big enough, and then dividing it down into an object+hole pair
<mrvn> free needs to merge holes. It's ~400 lines of code in glibc from what I read.
<clever> !?
<clever> merging holes seems pretty simple, just check the previous and next item
<clever> you only have 4 cases to deal with, object/hole/object, hole/hole/object, object/hole/hole, and hole/hole/hole
<mrvn> check canaries, check if you could do sbrk() to free memory or munmap(), ...
<clever> and the only difference is if your new hole is 1/2/3 objects long
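The neighbour-merging that clever outlines can be sketched as below, assuming every block header is linked to its address-order neighbours. This is not glibc's or miniheap's code; `struct block`, `absorb_next` and `block_free` are illustrative names, and none of the canary checks, locking or sbrk/munmap handling mrvn lists is included.

```c
#include <stddef.h>
#include <stdbool.h>

/* Header placed in front of every block, free or allocated. */
struct block {
    struct block *prev;   /* neighbour at the lower address, or NULL */
    struct block *next;   /* neighbour at the higher address, or NULL */
    size_t size;          /* payload bytes following this header */
    bool is_free;
};

/* Fold b->next into b. The caller guarantees b->next exists and that
 * the two blocks are adjacent in memory. */
static void absorb_next(struct block *b) {
    struct block *n = b->next;
    b->size += sizeof(struct block) + n->size;
    b->next = n->next;
    if (n->next)
        n->next->prev = b;
}

/* Mark b free and merge with free neighbours on either side.
 * Returns the header of the resulting hole. */
struct block *block_free(struct block *b) {
    b->is_free = true;
    if (b->next && b->next->is_free)
        absorb_next(b);
    if (b->prev && b->prev->is_free) {
        b = b->prev;
        absorb_next(b);
    }
    return b;
}
```

Both merges are O(1) because each header can reach its neighbours directly; the resulting hole spans one, two or three of the original blocks.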
<milesrout> rubbish collection is much less necessary when you have less rubbish to collect, yes.
<mrvn> check if the address is in a heap of malloc/free at all
<mrvn> locking to protect against other threads
<clever> now i'm wondering how lk does things...
<clever> all of my builds are using miniheap
<mrvn> doesn't glibc also have extra heaps for small objects?
<bslsk05> ​github.com: lk/miniheap.c at master · littlekernel/lk · GitHub
<milesrout> the glibc malloc/free implementation needs to work for basically every program out there. of course you could come up with an algorithm that will work better for your program, but can you come up with an algorithm that will work better for every program? thankfully you don't need to when you're writing your own allocator
<mrvn> milesrout: I can, others have too.
eddof13 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<milesrout> there are good general purpose allocators out there and some of them may even be better than glibc malloc, i won't deny that
<mrvn> milesrout: especially in C++ you can gain a lot for allocations of known size.
nyah has quit [Ping timeout: 245 seconds]
<milesrout> but often you see people say things like "oh well I replaced malloc with my own memory allocator for this program and it's faster therefore glibc malloc is bad" which is obviously not true
<clever> lets see, miniheap has a linked list of free space, and initially just adds the entire defined heap range as one free chunk
<clever> heap_insert_free_chunk also deals with merging with neighboring chunks automatically
<milesrout> plus the whole malloc/free interface where free doesn't take a size is inherently less efficient
<mrvn> milesrout: not really. You can't trust the size given by bad code anyway.
<milesrout> and that's another reason: the standard library allocator needs to be more defensive, while most custom allocators are less defensively written :)
<clever> size_t is usually the same size as void* right?
<mrvn> clever: usually but not necessarily
<clever> LK's miniheap uses a 3 pointer wide struct to track free space, and just casts the ptr of your free'd object into that i think
<clever> so that implies objects have a minimum size of 3 pointers
<clever> reading more...
<mrvn> clever: isn't it 3 pointer + canary? Or does it not care about alignment rules?
<klange> I need to improve my malloc, it is very bad at freeing. It doesn't release memory at all.
<klange> It's pretty good on benchmarks otherwise.
<milesrout> to be honest that is probably a good thing for some programs...
<mrvn> clever: You have to store the size of every block before the block and looking at x86 memory must be 16 byte aligned. So a minimum block would be 32 bytes where 16 bytes are usable.
<milesrout> there are some C++ programs out there that hang when they quit because they call delete a million times on all those RAII objects only for the kernel to free all that memory when they quit anyway
<klange> It's surprisingly less important to release memory from free'd objects than one might think, but I have a lot of potentially long-running stuff that could end up holding onto megabytes of RAM it's not using and I want to deal with that.
<mrvn> milesrout: if they have RAII then they do something more than free in the destructor.
<milesrout> yes RAII manages resources other than memory
<mrvn> klange: I don't free memory, I buy more. Works a long long time.
<clever> mrvn: let me double-check what miniheap_alloc does exactly...
<mrvn> void *miniheap_alloc(size_t size, unsigned int alignment) {
<milesrout> they also call close on file descriptors that will be closed automatically by the kernel on process exit too :P of course SOME things do need to be done even on quit, like deleting lock files
<mrvn> clever: looks like it can place chars closer together than avx512 registers.
<klange> It's not that I don't have a free(), I have a very reasonable free(), and if your allocation patterns involve lots of the same size stuff my malloc is _excellent_.
<milesrout> but most of what is done in the average destructor of a C++ object doesn't actually need to be done atexit
<mrvn> milesrout: AmigaOS does none of that. Write good code and close/free everything properly please.
<klange> It's when you ask for a big thing... then ask for a _bigger_ thing, then free the smaller thing, and then never ask for anything of similar size again that there's problems.
<klange> Which I do sometimes when... resizing windows!
<klange> Some merge + divide for large allocations will help, too.
<milesrout> yeah no thanks, i have no interest in my processes literally hanging my computer when I quit them for seconds so they can do a whole lot of stuff that will be done anyway, just for the sake of the possibility of it one day being ported to a long-dead operating system that is fundamentally broken
<mrvn> milesrout: the average destructor in c++ does nothing. Destructors that do something are the exception.
<clever> mrvn: ah, line 58/63/64, neat, the heap will just omit all of the safety fields if you build without debug support
<milesrout> i am including in the work done by the destructor the automatic destruction of all the object's members
<clever> so, assuming no debug, that is 2 pointers of overhead prepended to each object, with debug it becomes ~5 pointers?
<mrvn> clever: what's the void *ptr for?
<clever> ah, but certain debug also adds a 64 byte padding between objects
<mrvn> clever: previous block?
<clever> mrvn: i think thats the hole its inspecting, and possibly returning back to the user
<mrvn> clever: no, the ptr is for alignment. It gets a chunk, adds a bit to get aligned and then ptr points back to the start of the chunk: line 253
Matt|home has quit [Quit: Leaving]
<mrvn> Line 118 is the merging code. bad bad bad
<moon-child> mrvn: re nursery, I wish I could tell gc when I blit, and it can tune the nursery size to 99%ile (or w/e) per-frame allocation size
<moon-child> combine with pretenuring and you get the same performance as manual regions, and way better safety
<moon-child> same applies to e.g. web programming
<mrvn> moon-child: bigger doesn't hurt. It knows how much is used.
<moon-child> it doesn't know when the ticks happen, though. Best-case scenario, gc happens every tick, and copies almost no objects between semispaces
<mrvn> moon-child: In ocaml the GC has a public interface with stats. You can read out the amount of allocations each frame and calculate the delta. Then set the nursery so 1.5*delta.
<moon-child> if you gc in the middle of a frame, you're gonna do a bunch of extraneous copying
mahmutov has quit [Ping timeout: 240 seconds]
<moon-child> ah that's cool
* moon-child hasn't really used ocaml
<mrvn> moon-child: and you call Gc.minor () to force a nursery collection
<mrvn> I think such a public interface to the GC is essential if you want to tune it dynamically to a game.
<clever> mrvn: ah, line 183 is a bit tricky, if debug is off, the alloc_struct_begin is only 2 pointers in size, but free_heap_chunk is 3 pointers in size, so an malloc(1) gets bumped up to malloc(sizeof(void*))
<mrvn> pathological case. You have to alloc something less than 4, set alignment to less than 4 and the roundup to sizeof(void*) needs to not increase the size. So the only case I can see where free_heap_chunk won't fit is malloc(0, 1)
<mrvn> s/4/8/ for 64bit archs.
<clever> another topic, is mixing both malloc and relocatable heaps
<clever> palmos and the official rpi firmware both do that
<mrvn> what else would you do?
<clever> small objects go to the normal malloc heap
<clever> large objects go to a special relocatable heap, where alloc returns an opaque token, not an addr
<clever> lock returns the current addr, and unlock unlocks!
<clever> while unlocked, the os is free to move objects around, to fix free space fragmentation
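A toy version of the handle-based relocatable heap clever describes might look like this. It is not the PalmOS or Raspberry Pi firmware API; every `rh_` name is invented for the sketch. It uses a fixed pool with bump allocation, ignores alignment, and for simplicity only compacts when nothing is locked (a real implementation would move objects around the locked ones).

```c
#include <stddef.h>
#include <string.h>

#define RH_MAX_HANDLES 16
#define RH_POOL_SIZE   1024

struct rh_slot {
    size_t off;      /* current offset of the object inside the pool */
    size_t size;     /* object size in bytes */
    unsigned locks;  /* outstanding rh_lock() calls */
    int used;
};

static unsigned char rh_pool[RH_POOL_SIZE];
static struct rh_slot rh_slots[RH_MAX_HANDLES];
static size_t rh_top;  /* first unused byte in the pool */

/* Returns an opaque handle (>= 0), not an address, or -1 on failure. */
int rh_alloc(size_t size) {
    if (size == 0 || size > RH_POOL_SIZE - rh_top)
        return -1;
    for (int h = 0; h < RH_MAX_HANDLES; h++) {
        if (!rh_slots[h].used) {
            rh_slots[h] = (struct rh_slot){ rh_top, size, 0, 1 };
            rh_top += size;
            return h;
        }
    }
    return -1;
}

void rh_free(int h) { rh_slots[h].used = 0; }

/* The address is only valid until the matching rh_unlock(). */
void *rh_lock(int h) {
    rh_slots[h].locks++;
    return rh_pool + rh_slots[h].off;
}

void rh_unlock(int h) { rh_slots[h].locks--; }

/* Slide live objects towards the start of the pool, in address order.
 * Returns 0 without moving anything if any object is locked. */
int rh_compact(void) {
    for (int h = 0; h < RH_MAX_HANDLES; h++)
        if (rh_slots[h].used && rh_slots[h].locks)
            return 0;
    size_t dst = 0;
    for (;;) {
        int next = -1;  /* lowest-addressed object not yet moved */
        for (int h = 0; h < RH_MAX_HANDLES; h++)
            if (rh_slots[h].used && rh_slots[h].off >= dst &&
                (next < 0 || rh_slots[h].off < rh_slots[next].off))
                next = h;
        if (next < 0)
            break;
        memmove(rh_pool + dst, rh_pool + rh_slots[next].off, rh_slots[next].size);
        rh_slots[next].off = dst;
        dst += rh_slots[next].size;
    }
    rh_top = dst;
    return 1;
}
```

Because callers hold handles rather than addresses, the heap can fix fragmentation without an MMU; the price is that objects stored here must not contain raw pointers into the same heap.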
<moon-child> huh neat
<mrvn> oh, that's a totally different interface then.
eddof13 has joined #osdev
<mrvn> so relocatable objects must not have any pointers, only indexes into arrays.
<clever> yeah
<clever> its typically used for image buffers
<clever> the pointers and object-tokens are all in the standard malloc heap, where things are small
<clever> the relocatable heap then holds multi-mb images
<moon-child> lock/unlock also smells a lot like gpu apis
<moon-child> so makes sense
<clever> palmos also did it, because there was no mmu, and every active process had to share an address space
<mrvn> glibc malloc uses mmap at a certain size giving you individually mapped regions of memory.
<clever> mrvn: but the VPU on the rpi also lacks an mmu, so it cant just cobble random pages together in a random order
<mrvn> In functional languages you have copying/compacting GCs where the GC will move objects around and fix up all the pointers automatically. That's really neat.
<moon-child> (also java)
<mrvn> Going back to miniheap. I would add a back pointer to allocations so that on free() you can access the chunk before and after and do an O(1) merge.
<clever> ah, now that i look at it, yeah, ouch
<clever> the header on any object, only has a ptr and a size, but what is that ptr for...
<moon-child> mrvn: funny, I just rewrote my allocator, using that design
<clever> line 248...
<clever> ptr is just "chunk" ...
<mrvn> clever: to undo the alignment.
<clever> ah
<clever> is the alignment always going to just waste enough space to align?
<mrvn> Looks like it. So alignment=4096 wastes a lot of space
<mrvn> it should check if the hole at the front is large enough for struct free_heap_chunk
<clever> i would have chosen to slice a free-chunk up, so it ends at alignment-sizeof(alloc_struct_begin)
<mrvn> size = ROUNDUP(size, sizeof(void *));
<clever> so the unused space remains free
<mrvn> The free chunks are 4/8 byte aligned but memory is 16 byte aligned per default.
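The rounding and alignment arithmetic under discussion can be written out as below. The `ROUNDUP` macro here is an assumed power-of-two definition, not copied from the LK source, and `align_up` and `align_waste` are hypothetical helpers that show how much space an aligned allocation can throw away.

```c
#include <stdint.h>
#include <stddef.h>

/* Round a up to a multiple of b. Assumption: b is a power of two. */
#define ROUNDUP(a, b) (((a) + ((b) - 1)) & ~((b) - 1))

/* First address at or above addr that satisfies the alignment. */
static inline uintptr_t align_up(uintptr_t addr, uintptr_t alignment) {
    return ROUNDUP(addr, alignment);
}

/* Bytes skipped in front of the object to reach that address. */
static inline size_t align_waste(uintptr_t addr, uintptr_t alignment) {
    return (size_t)(align_up(addr, alignment) - addr);
}
```

For a chunk starting one byte past a page boundary, a 4096-byte alignment skips 4095 bytes, which is the waste mrvn points out unless the allocator returns the skipped prefix to the free list.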
<clever> the 3d code is the only time ive had to align things at runtime: https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/v3d/v3d.c
<bslsk05> ​github.com: lk-overlay/v3d.c at master · librerpi/lk-overlay · GitHub
pretty_dumm_guy has joined #osdev
<clever> memalign() is the keyword to search for
<clever> 16 and 256 are my only alignment requirements at runtime currently
<mrvn> the code is odd. Check out line 193-199
<clever> the shader (line 143) must also be 32 byte aligned, but the linker can do that
FatalNIX has quit [Quit: Lost terminal]
<clever> the binner and render control-lists are a custom bytecode
<mrvn> Without alignment the size is rounded up to sizeof(void*). With alignment it's at least 16 bytes.
<Jari--> good morning friends
<clever> and thats what the original hello-world did to generate it
<mrvn> Then on line 245 the alignment happens.
<clever> line 198 of v3d.c, you have a 112, so you look that up in the pdf, page 71
<Jari--> Enjoy OSDEV: Physical Memory Mapped I/O Memory Space Windows !!!
<mrvn> So miniheap_alloc(32, 0) will give memory that is 4/8 byte aligned and miniheap_alloc(32, 1) will give memory that is 16 byte aligned and wastes 16 extra bytes.
<clever> opcode 112 is "tile binning mode configuration", it is then followed by 120 bits of config (15 bytes)
<Jari--> Can you talk to the PCI bridge to realign the memory mapped devices to some other more proper memory space?
<Jari--> Does todays display adapters support 32-bit PCs?
<mrvn> Jari--: yes and no
<Jari--> mrvn, would you draw filled polygons with HW acceleration or actually with directly poking to the VGA RAM?
<Jari--> Whether which is faster.
<mrvn> yes
<clever> mrvn: i could use a struct with bit-fields, to create that entire opcode-112, and its 120 bits of payload, but is there then any elegant way to dynamically create a blob containing a random mix of structs?
<Jari--> mrvn, so many Radeons on the market.. buying a VGA driver for Radeon would make so many devices functional
<mrvn> clever: a factory returning a pointer to a base type.
<mrvn> clever: or templated factory
<clever> mrvn: but how would i create a byte-array, which just has the raw bytes of 4 different structs concated back2back, where the types are only known at runtime?
<bslsk05> ​github.com: lk-overlay/v3d.c at master · librerpi/lk-overlay · GitHub
<mrvn> by returning a pair of pointer and size.
<mrvn> or std::string
<clever> in this example, i need a 7 byte object (line 303), then a variable number of 3byte + 5byte objects + 1byte objects
<mrvn> should that be returned by value or emplaced in some bigger arrays?
<clever> emplaced into a bigger array, so you dont waste memory bandwidth copying it there later
<mrvn> so pass in the pointer to the current position as reference and modify to point after the opcode on exit.
<clever> and ideally, that bigger array should have some pre-allocation logic, like std::string does
<mrvn> void emplace(void **pos, enum Op op, ...) { }
<clever> yep, and thats exactly what the addbyte(&p, 123) is doing
<clever> about all i would be changing, is making it into a nicer tileBinningConfig(&p, arg1, arg2, arg3);
<mrvn> If you want it safe add a pointer to the end or use a struct { void *start, *pos, *end; }
<clever> i can just increment by sizeof(foo) i think
<mrvn> basically a bounds checked iterator of the big array.
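The bounds-checked cursor mrvn proposes could be sketched like this. The `cl_` names are invented here and are not the `addbyte` helpers from v3d.c; the sketch also assumes multi-byte arguments go into the stream in little-endian order.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Write position inside a pre-allocated control-list buffer. */
struct cl_cursor {
    uint8_t *start;
    uint8_t *pos;
    uint8_t *end;   /* one past the last writable byte */
};

static bool cl_has_room(const struct cl_cursor *c, size_t n) {
    return (size_t)(c->end - c->pos) >= n;
}

bool cl_add_byte(struct cl_cursor *c, uint8_t v) {
    if (!cl_has_room(c, 1))
        return false;
    *c->pos++ = v;
    return true;
}

/* Assumption: the consumer expects little-endian words. */
bool cl_add_u32(struct cl_cursor *c, uint32_t v) {
    if (!cl_has_room(c, 4))
        return false;
    for (int i = 0; i < 4; i++)
        *c->pos++ = (uint8_t)(v >> (8 * i));
    return true;
}

/* Hypothetical opcode helper: one opcode byte plus one 32-bit argument,
 * written either completely or not at all. */
bool cl_emit_op_u32(struct cl_cursor *c, uint8_t opcode, uint32_t arg) {
    if (!cl_has_room(c, 5))
        return false;
    cl_add_byte(c, opcode);
    cl_add_u32(c, arg);
    return true;
}
```

Each opcode then gets its own small wrapper with a readable argument list, and the room check up front keeps a half-written opcode from ever reaching the hardware.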
<clever> in its current state, the only thing that is actually dynamic though, is the resolution
wolfshappen has joined #osdev
<clever> but if i want support for a variable number of shaders, or changing shader type, it will get more complex
<mrvn> If you have a struct Op121 { uint8_t op{121}; uint16_t arg1{0}; ...} then watch the padding and alignment.
<clever> the hardware doesnt expect any special alignment on these, it just uses however many bytes it uses
<clever> but opcode-1 is a nop, so it could be used as padding
<mrvn> but the compiler will align it and pad it. and BOOM
<mrvn> probably a use case for attribute packed.
<clever> i could memset the entire buffer to 1 initially, so the padding turns into nop's
<clever> or that
<mrvn> The above would have 1 byte bapdding between op and arg1. That can't work.
<mrvn> s/bapdding/padding/
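The padding problem is easy to check at compile time. In this sketch the struct names are made up, and `__attribute__((packed))` is the GCC/Clang extension rather than standard C.

```c
#include <stdint.h>
#include <stddef.h>

/* Natural layout: the compiler inserts a padding byte after op
 * so that arg1 lands on a 2-byte boundary. */
struct op_padded {
    uint8_t  op;
    uint16_t arg1;
};

/* Packed layout (GCC/Clang extension): fields are laid out back to back,
 * which is what a byte-oriented command stream needs. */
struct __attribute__((packed)) op_packed {
    uint8_t  op;
    uint16_t arg1;
};

_Static_assert(offsetof(struct op_packed, arg1) == 1, "no padding after op");
_Static_assert(sizeof(struct op_packed) == 3, "opcode plus 16-bit argument");
```

On common ABIs `offsetof(struct op_padded, arg1)` is 2 and the struct is 4 bytes long, so copying it into the stream verbatim would shift every following byte.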
<clever> mrvn: but look at page 71 of https://docs.broadcom.com/doc/12358545
<clever> youll see that its not actually a big 120bit blob, but rather, a 32, 32, 32, 8, 8, 1, 1, 1, 2, 2, and 1bit field
<mrvn> I would just make helper function to emplace the different opcodes with nice argument lists and then place each byte into the stream manually.
<clever> yeah
<mrvn> Unless you are going to read out those lists a lot it's just not worth it.
<clever> these control lists are write-only
<clever> once i create the list, i pass it off to hardware, wait for an irq, then free it
<mrvn> You can define bitfields internally to move the arguments into place but just shifting things around is probably simpler.
<clever> ah, correction, i dont free it, i just overwrite it on the next frame
<mrvn> some years back I tried defining bitfields for the ARM page tables and MMIO registers for peripherals. But the compiler wouldn't optimize them nicely so that setting a bunch of bits would get combined into a single 32bit write all the time.
<mrvn> Just shifting and masking gets optimized all the time.
<clever> that topic also came up on the subject of MMIO a few days ago
<clever> some of the more sensitive registers, require you to OR 0x5a000000 into every write you do
<clever> if you try to modify a single bit using bitfields, the read/modifiy/write wont put the 5a back in
wolfshappen_ has joined #osdev
<clever> so the hardware then ignores the write entirely
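One way around the lost password is to route every write through a helper that ORs it back in. The value 0x5a000000 comes from the discussion above; the names `CM_PASSWORD_EX`, `cm_with_password`, `cm_write` and `cm_set_bits` are hypothetical, and the sketch assumes the password occupies the top byte of the register.

```c
#include <stdint.h>

/* Assumption: the top 8 bits of the register are the password field. */
#define CM_PASSWORD_EX 0x5a000000u

/* Replace whatever is in the top byte with the password. */
static inline uint32_t cm_with_password(uint32_t value) {
    return (value & 0x00ffffffu) | CM_PASSWORD_EX;
}

/* Every store goes through here, so the password cannot be forgotten. */
static inline void cm_write(volatile uint32_t *reg, uint32_t value) {
    *reg = cm_with_password(value);
}

/* Explicit read-modify-write instead of a bitfield assignment. */
static inline void cm_set_bits(volatile uint32_t *reg, uint32_t bits) {
    cm_write(reg, *reg | bits);
}
```

Unlike a bitfield store, the read, the modification and the single 32-bit write are all spelled out, so nothing is left to the compiler's choice of access width.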
wolfshappen has quit [Ping timeout: 256 seconds]
<mrvn> yeah, but there we talked about the need to volatile them. volatile kills any read/write combining.
<clever> fuchsia i think, uses a c++ class system, where you control the load/store
<clever> so you can load from mmio->local var, but then use setters/getters to mutate it
<clever> and then store it back to mmio when you choose to
<mrvn> I have an abstraction layer for that somewhere using templates. I have Bit<i>, Bits<i0,i1,i2,i3,...>, MBZ, MBO, ... templates and a register is a collection of those. That then all gets shifted, masked, ored together with a volatile read/write for hardware access.
<geist> yah that's basically the scheme that fuchsia uses
<bslsk05> ​github.com: rpi-open-firmware/cpr_clkman.h at master · librerpi/rpi-open-firmware · GitHub
<geist> it's like MakeReg().GetValue().SetBit().ClearBit().SetField().Write();
<clever> the originally released rpi headers, just use a big heap of #define's
<geist> i dont particularly like it, but it does control precisely when you read/write the register
<clever> to give you a starting bit, and some pre-made bit masks
<mrvn> The Bit and Bits types are views into the register. There isn't actually a local copy of the value unless you make one and every access picks out the right bits.
<clever> so you have 0xfffff9ff to clear a value, 0x00000600 to select the value, and 9 to << or >> for inserting or extracting
<clever> all named as CLR, SET, and LSB
<clever> but you then need to: CM_GP1CTL = (CM_GP1CTL & CM_GP1CTL_MASH_CLR) | ((mash << CM_GP1CTL_MASH_LSB) & CM_GP1CTL_MASH_SET);
<clever> and now it gets ugly :P
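The CLR/SET/LSB triple can be wrapped in two small helpers so the expression above does not have to be repeated at every call site. The three constants are the values quoted in the discussion; the `EX_` macro names and the `field_insert`/`field_extract` functions are made up for the sketch and work on a plain value rather than a live register.

```c
#include <stdint.h>

/* Values quoted above for a 2-bit field starting at bit 9. */
#define EX_MASH_LSB 9
#define EX_MASH_SET 0x00000600u   /* selects the field */
#define EX_MASH_CLR 0xfffff9ffu   /* clears the field */

/* Return reg with the field replaced by v; bits of v that do not fit are dropped. */
static inline uint32_t field_insert(uint32_t reg, uint32_t clr, uint32_t set,
                                    unsigned lsb, uint32_t v) {
    return (reg & clr) | ((v << lsb) & set);
}

/* Return the field's value, shifted down to bit 0. */
static inline uint32_t field_extract(uint32_t reg, uint32_t set, unsigned lsb) {
    return (reg & set) >> lsb;
}
```

A caller would read the register once into a local, apply `field_insert` as often as needed, and write the result back in a single store.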
<mrvn> geist: how about: Reg << ~GLOBAL << PRESENT << KERNEL_RO;
<mrvn> streams get combined into one read, modify, write.
<mrvn> With your code what happens if you forget the .Write() at the end?
<mrvn> [[nodiscard]] error?
wolfshappen_ has quit [Ping timeout: 256 seconds]
ElectronApps has joined #osdev
<mrvn> clever: those masks and shifts break down on x86_64 with pagetables where the address is split up and stored at different locations.
<clever> yeah
Matt|home has joined #osdev
<mrvn> for x86_64 I had Bit, Bits, Span and Spans
<mrvn> Bits being random bits, Span a sequence of bits and Spans random Span combined into one value.
<mrvn> But with what geist mentioned the other day about needing to have barriers between access to different peripherals I think I will refactor my code to use a lock_guard approach. First you lock a peripheral into use and then registers if you want multiple accesses to a reg.
<clever> mrvn: thats a thing on the bcm2835/pi0/pi1, something about how the arm axi master cant deal with replies coming back out of order
<mrvn> clever: not just there. Back there you get corrupt read writes, which is worse. But in general you get reordering of writes.
<clever> my understanding of the bug, is that the re-ordering causes values to be swapped
<mrvn> yes, that was the big bug.
<clever> i assume the pi2 and up has fixed it, because with SMP, you cant control what the other cores are doing
<mrvn> But on any ARM the peripherals on the AXI are in a tree as geist described it. When you write the data goes to the root and then gets pushed onto the right child along the tree. Now say you write 20 things to the UART they all get pushed down the left side of the root AXI node and backlog there. Then you write to the GIC which goes to the left of the root AXI node and gets written out immediately.
<clever> yeah
<mrvn> aeh, GIC on the right side.
<clever> same even happens in the rp2040 MCU
<mrvn> Each register does get the correct content (unlike with the bcm2835/pi0/pi1 bug) but the timing is out-of-order.
<clever> to limit the size of the axi's matrix, all of the "slow" peripherals are on a second axi tree, under a single slave port
<clever> and because of how slow uart is, you only need a write every n clocks
<clever> so contention between multiple masters fighting over that 1 slave, isnt as much of an issue
<mrvn> Now consider this: You turn off IRQs in the UART. Then you turn on IRQs in the GIC. If the GIC gets the write first you might suddenly get an IRQ from the UART because it hasn't turned off yet.
<clever> mrvn: https://i.imgur.com/9RQET8Z.png the rp2040 axi tree
<clever> the main crossbar, is a full 4:10, and can allow all 4 masters to be doing a transfer in the exact same clock cycle, as long as its to 4 different slaves
<clever> but that kind of thing is expensive, and making it a full 4:21 would probably be costly
<clever> so all of the "slow" stuff was shoved onto a secondary 1:11, far simpler with only 1 master
pretty_dumm_guy has quit [Quit: WeeChat 3.4]
<clever> but its still running at the same clock speed
<mrvn> And I challenge you to write multithreaded code that can access the same slave (other than APB Bridge) at the same time and not blow up
<clever> another kind of funky thing, is the flash xip, is on 2 slave ports
<clever> the second port under that other splitter, is only for the bulk flash->ram copying fifo
<mrvn> is one for boot and one for later?
<clever> the direct axi-slave, is the main XIP flash window, a 16mb address range, directly mapped to spi flash
<clever> that way, one master can have a cache hit on the XIP flash window (via the flash-xip block)
<clever> while a second master is doing a bulk copy from flash->ram, with the dedicated bulk-copy fifo (which has a lower priority, and wont trash the cache)
<clever> during boot, the XIP isnt enabled, the rom just drives the QSPI controller manually, to copy things into ram
<clever> there is also a priority system for the axi masters
<clever> so you can choose who winds when there is contention over a given slave
<mrvn> And flash doesn't change at runtime so reading either way and caching never gets a bad result.
<clever> wins*
<clever> flash can change at runtime, but you have to disable the XIP first
<clever> and there are routines to flush the cache as well
<bslsk05> ​github.com: pico-sdk/flash.c at master · raspberrypi/pico-sdk · GitHub
<clever> __no_inline_not_in_flash_func tells the linker to place this function in ram, so it can survive XIP being turned off
<clever> flash_init_boot2_copyout will pre-copy the XIP configuration block from flash->ram
<clever> flash_exit_xip turns XIP off, so you can issue plain SPI commands to the chip
<clever> flash_range_program and flash_flush_cache do what the name says
<clever> flash_enable_xip_via_boot2 then turns xip back on, using the boot2 that was previously copied out
wolfshappen has joined #osdev
masoudd has joined #osdev
<clever> its getting late, i should get off to bed
kingoffrance has joined #osdev
<mrvn> it's past late, it's getting early.
<mrvn> n8
* kingoffrance stares at 10h ago 2h and 3.7G. I wonder if the code will ever compile or if the source is a halting problem.
<gog> noite
<kingoffrance> i can either check the logs and assume it finished, or assume it took 10h and still going
eddof13 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
[itchyjunk] has quit [Ping timeout: 260 seconds]
<moon-child> AGHHHH
<moon-child> I thought __builtin_clz was polymorphic like __builtin_popcnt and co
<moon-child> apparently nope
<moon-child> need to explicitly __builtin_clzll
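A pair of width-explicit wrappers avoids picking the wrong builtin by accident. `__builtin_clz` takes an `unsigned int` and `__builtin_clzll` an `unsigned long long` (GCC and Clang); both are undefined for an input of zero, so the hypothetical helpers `clz32` and `clz64` below handle that case themselves. The sketch assumes the usual 32-bit `unsigned int` and 64-bit `unsigned long long`.

```c
#include <stdint.h>

/* Leading zeros of a 32-bit value; returns 32 for zero. */
static inline unsigned clz32(uint32_t x) {
    return x ? (unsigned)__builtin_clz(x) : 32u;
}

/* Leading zeros of a 64-bit value; returns 64 for zero.
 * Passing a 64-bit value to __builtin_clz would silently truncate it. */
static inline unsigned clz64(uint64_t x) {
    return x ? (unsigned)__builtin_clzll(x) : 64u;
}
```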
[itchyjunk] has joined #osdev
<gog> consistent interfaces? idk what that is
<gog> :D
[itchyjunk] has quit [Ping timeout: 240 seconds]
[itchyjunk] has joined #osdev
Oli has quit [Ping timeout: 272 seconds]
<Clockface> i was thinking that libs could "hook" a string in my kernel, where a program submits a string, and if anything "hooked" it, the program is returned the address to call to access that library
<Clockface> is this a decent way of doing things?
Oli has joined #osdev
<Clockface> how do dynamically linked libs normally tell the program what to call?
<moon-child> Clockface: why do you want to be able to do that?
<Clockface> i want to get a way to build my kernel out of modules early on
<Clockface> since a modular kernel sounds nice
<Clockface> and its about time i decide on a way to link stuff together
<moon-child> a kernel module is not generally considered a program
<moon-child> do you want to use this as a mechanism for kernel modules to communicate with each other, or for userspace to communicate with userspace, or for userspace to communicate with the kernel?
<Clockface> kernel-kernel
<Clockface> module calls module manager, gives it a pointer to "mydependancy"
<Clockface> mydependancy is hooked to the string "mydependancy"
<Clockface> so it is returned the address of mydependancy
<Clockface> so it can call it
<moon-child> ok. So the idea is that 2 modules can independently define an interface for communication, which the kernel knows nothing about?
<Clockface> yeah
<Clockface> the programmer of the module defines how to work with it
<moon-child> ok. That is doable, but somewhat fraught
<moon-child> basically recreating the issue of service discovery :P
<Clockface> exactly!
<moon-child> I think time has shown unified architecture works better
<moon-child> if you really want it, though, I suggest uuids rather than strings
<moon-child> (cf uefi)
<moon-child> to avoid collisions and permit versioning
<Clockface> im hoping strings result in less collisions
<Clockface> since then modules can be given a unique name
<Clockface> so 2 people dont decide on the same number for 2 different things
<moon-child> ermm no
<moon-child> rngs are very good. 128-bit uuid is not gonna have collisions
<moon-child> names will totally have collisions
<CompanionCube> there's a standard form of UUID, you can try those
<moon-child> (no one is picking uuids by hand :P)
<Clockface> oh cool
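The lookup scheme discussed above could be sketched roughly as follows, using fixed 16-byte IDs compared with `memcmp`. Everything here (`service_id`, `service_register`, `service_lookup`, the table size) is a hypothetical illustration, not code from any of the kernels mentioned, and it ignores locking and module unloading.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SERVICES 32            /* assumption: small fixed table */

typedef struct { unsigned char bytes[16]; } service_id;

struct service_entry {
    service_id  id;
    const void *interface;         /* module-defined table of entry points */
};

static struct service_entry services[MAX_SERVICES];
static size_t service_count;

/* Look up an interface by ID; returns NULL if nothing registered it. */
const void *service_lookup(const service_id *id)
{
    assert(id != NULL);
    for (size_t i = 0; i < service_count; i++)
        if (memcmp(services[i].id.bytes, id->bytes, sizeof id->bytes) == 0)
            return services[i].interface;
    return NULL;
}

/* Register an interface under an ID; returns 0 on success, -1 if the
 * table is full, the interface is NULL, or the ID is already taken. */
int service_register(const service_id *id, const void *interface)
{
    if (interface == NULL || service_count == MAX_SERVICES)
        return -1;
    if (service_lookup(id) != NULL)
        return -1;
    services[service_count].id = *id;
    services[service_count].interface = interface;
    service_count++;
    return 0;
}
```

The module manager stays ignorant of what an interface contains; the two modules agree on the struct behind the pointer, and a new ID can be minted whenever that struct changes incompatibly.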
<CompanionCube> moon-child: iirc the GPT partition uuid for GRUB was hand picked
<moon-child> :<
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
<Clockface> well, 2^128 possible ID's and one billion already existing unique things has an acceptably low collision chance of 1 in 340282366921000000000000000000 according to my calculator
<Clockface> im convinced
<CompanionCube> it's a memorable example because in ascii its 'Hah!IdontNeedEFI'
vdamewood has joined #osdev
<Clockface> that has a suspect number of 0's in it
<Clockface> but its still massive
<Clockface> i dont need an exact number to be convinced
<moon-child> that's not quite right
<moon-child> due to birthday paradox it is lower
<moon-child> but still high enough
<Clockface> what is the birthday paradox?
<moon-child> it's not enough to consider the probability that I collide with an existing uuid when I generate a new one
<moon-child> you have to consider the probability that _any_ newly created uuid collides with _any_ existing uuid
<moon-child> it is so-called because of the apparently-paradoxical fact that, in a room of 23 people, there is a roughly 50% chance that 2 of them have the same birthday
<moon-child> s/-//
<Clockface> ah
<Clockface> well bigger number than i can count to on my fingers regardless
<Clockface> im convinced
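The difference between "my new ID hits an existing one" and "any two IDs collide" can be checked numerically. This is a small sketch with hypothetical function names. It treats all 128 bits as random, as the chat does; a version 4 UUID has only 122 random bits, so its odds are somewhat worse but still tiny.

```c
#include <assert.h>

/* Exact probability that at least two of n uniform samples from a space
 * of `space` equally likely values coincide. */
double collision_probability(unsigned n, double space)
{
    double p_all_unique = 1.0;
    if ((double)n > space)
        return 1.0;                       /* pigeonhole: guaranteed */
    for (unsigned i = 1; i < n; i++)
        p_all_unique *= 1.0 - (double)i / space;
    return 1.0 - p_all_unique;
}

/* Approximation n*(n-1) / (2*space), usable while the result is far
 * below 1 (it avoids a billion-step loop for the UUID case). */
double collision_probability_approx(double n, double space)
{
    return n * (n - 1.0) / (2.0 * space);
}
```

With 365 days, the exact form crosses 50% between 22 and 23 people. For one billion IDs in a 2^128 space the approximation gives a probability on the order of 1e-21, far more likely than the single-draw figure of roughly 3e-30 and still negligible.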
<Clockface> acshually, i can do a hybrid approach
<Clockface> because everyone loves doing both solutions at the same time despite common sense!
<Clockface> i will SAY you should have it be a 16 byte long randomly generated integer
<Clockface> but it will be null terminated
<Clockface> so you can use a normal string as well if you're stupid
<Clockface> >:)
<gog> i can count to 1023 on my hands
<CompanionCube> inb4 instant null pointer exploit
<gog> 1048575 if i use my feet too
<moon-child> Clockface: if you do not force people to not be stupid, they will be stupid
<moon-child> and if you do not care whether people are stupid, then it doesn't seem worth bothering at all
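One concrete hazard of the "random but null-terminated" hybrid: a random 16-byte ID can contain a 0x00 byte, and string functions stop there. A sketch of the failure, with hypothetical helper names and an assumed fixed length of 16:

```c
#include <assert.h>
#include <string.h>

#define ID_LEN 16

/* Compares IDs as C strings: stops at the first 0x00 byte in either ID,
 * so everything after an embedded zero is ignored. */
int id_equal_cstring(const unsigned char *a, const unsigned char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp((const char *)a, (const char *)b) == 0;
}

/* Compares all ID_LEN bytes regardless of content. */
int id_equal_fixed(const unsigned char *a, const unsigned char *b)
{
    assert(a != NULL && b != NULL);
    return memcmp(a, b, ID_LEN) == 0;
}
```

Two IDs that share a prefix up to an embedded zero byte compare equal under the string version but not under the fixed-length one. The generator would have to reject zero bytes to make the hybrid safe.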
Oli has quit [Ping timeout: 272 seconds]
<Clockface> but some might be good too
xenos1984 has quit [Read error: Connection reset by peer]
<Clockface> its a hybrid approach
<Clockface> :)
<Clockface> hybrid and modular everything
<Clockface> but yes, thank you for suggesting UUID's
<bslsk05> ​en.wikipedia.org: Phantom OS - Wikipedia
<Jari--> "Persistence – Application code does not see OS restarts and could live forever—this makes the concept of a file obsolete and any variable or data structure could be stored forever and at the same time be available directly through a pointer. Differently from hibernation that is done in other OSs, persistence lies in the very core principles of the Phantom OS core. It is done transparently
<Jari--> for applications; in most cases it requires no reprogramming of an application. Persistence stays even if the computer crashes."
<gog> that's pretty cool
Oli has joined #osdev
<moon-child> EROS&co did something like that too iirc
<Jari--> It really changes a lot of the implementation too, right?
<Jari--> Application-wise.
<gog> said it's transparent
<moon-child> hmmm probably depends
<moon-child> like, say you wanna make a word processor
<moon-child> you no longer have to make a custom serialization format ... unless you wanna be able to send documents to other people
<moon-child> (this is why I go referentially transparent; no pointers, so you can redistribute all the objects with no problem)
<gog> and while i can think of a number of sticky points involved with restoring state from a poweroff, it seems no different than when a process' context is suspended during a task switch
<moon-child> I have also been thinking about durability guarantees wrt ipc/transactions. But I don't know if they aim for fault tolerance
xenos1984 has joined #osdev
<moon-child> (if you don't force sequentially consistent ordering of storage, and fault/restore, maybe app A says to app B 'can you give me a token' and app B says 'but I gave you the token already')
<Clockface> my goal for mine is that everything that interfaces with non-kernelspace is kind of like a DOS extender: no syscalls or interrupts are hooked unless you install something that does so
<moon-child> (presumably you maintain coherence within a given application, but it could be prohibitive to do so across multiple apps)
<Clockface> my DOS extender analogy is mostly intended to convey that the OS itself yields complete control to the "extender", which runs programs of its own
<Clockface> everything runs in kernelspace until you run something to run userspace stuff
<Clockface> thats why i consider kernel modules a high priority
epony has quit [Read error: Connection reset by peer]
epony has joined #osdev
<gog> i'm working towards having modularity. its already feasible with a little more work
<Clockface> is it much trouble to port real mode assembly to 32 bit assembly?
<Jari--> Miss OS/2, but hellish, this is Microsoft's OS/2 I am using. Windows 10!
<Clockface> seems like it should not be, ill just be wasting some room in the registers
<Clockface> well to be honest, i plan on using virtual 8086 mode for the protected mode version, and my assembly interpreter for long mode
<Clockface> this is getting ugly
* Jari-- dreams his OS will be as stable as Windows 95 one day
<Jari--> Only ceremonial intended errors.
<Clockface> i plan on making windows look like a good option for a mainframe running something involving banking
<CompanionCube> huh?
<Clockface> windows: stable as a rock
<Clockface> comparatively at least
<Clockface> my design has slowly morphed into a 16/32/64 bit monstrosity
<Jari--> Clockface, remember FARPTR* ?
<Jari--> The joy of DOS.
<graphitemaster> Everyone talks about nearptr and farptr but no one has had to deal with hereptr or thereptr
* gog slaps roof of memory
<gog> this bad boy can hold so many pointers
<CompanionCube> for sheer reliability, something like VMScluster or IBM parallel sysplex would be interesting if probably v difficult.
<Clockface> i misread that as FARTPTR* admittedly
<Clockface> i proceeded to look that up to see if there was some sort of joke surrounding it
<graphitemaster> All farts have fartptrs, it's when someone points their finger and says "it was them"
<Clockface> lol
epony has quit [Remote host closed the connection]
<CompanionCube> huge fartptrs
<Jari--> enormous
<graphitemaster> stink up your code today, add a bunch of fartptrs
<graphitemaster> this bad boy has so many cache misses
<kingoffrance> hither thither whither hence thence....fseek() fseek() lseek() BUGS this document's use of whence is incorrect english, but is maintained for historical reasons. there is a whence...
<kingoffrance> "from what place" ...kind of like come from :)
<gog> from whence you came you shall remain until you are complete again
<kingoffrance> s/fseek/&o/
epony has joined #osdev
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
FatAlbert has joined #osdev
<FatAlbert> what's the difference between stty and termios.h? i guess the latter provides more granularity?
<klange> stty is a utility, and termios.h is one of a couple of headers defining the values used in interfaces that stty, among other things, would use to configure terminal stuff
<FatAlbert> i know the definitions .. i guess most of the guys wouldn't write a C program to define terminal stuff unless they need a very specific set of values ( which i guess can't be handled with stty ) ?
gog has quit [Remote host closed the connection]
gog has joined #osdev
Jari-- has quit [Remote host closed the connection]
GeDaMo has joined #osdev
<klange> `stty` is generally implemented as little more than a call to tcgetattr+tcsetattr, maybe ioctl(..., TIOCGWINSZ, ...) and TIOCSWINSZ, and then some string parsing / generation to understand / explain settings
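For reference, the calls klange lists look roughly like this from C. This is a sketch assuming a POSIX system where `TIOCGWINSZ` comes from `<sys/ioctl.h>` (as on Linux). The wrapper names `tty_set_echo` and `tty_get_size` are made up for illustration and are not taken from any stty implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <termios.h>

/* Roughly what `stty echo` / `stty -echo` boils down to: read the
 * current settings, flip one flag, write them back.
 * Returns 0 on success, -1 if fd is not a usable terminal. */
int tty_set_echo(int fd, int enable)
{
    struct termios t;
    if (tcgetattr(fd, &t) != 0)
        return -1;
    if (enable)
        t.c_lflag |= ECHO;
    else
        t.c_lflag &= ~(tcflag_t)ECHO;
    return tcsetattr(fd, TCSANOW, &t) == 0 ? 0 : -1;
}

/* Roughly what `stty size` reads. Returns 0 on success, -1 on error. */
int tty_get_size(int fd, unsigned short *rows, unsigned short *cols)
{
    struct winsize ws;
    assert(rows != NULL && cols != NULL);
    if (ioctl(fd, TIOCGWINSZ, &ws) != 0)
        return -1;
    *rows = ws.ws_row;
    *cols = ws.ws_col;
    return 0;
}
```

A program would typically pass `STDIN_FILENO` as `fd`; both helpers fail cleanly when the descriptor is not a terminal.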
<FatAlbert> gog: don't hear you ..
<gog> mew
<FatAlbert> that's my girl
<gog> o:
<klange> I used to ship minix's stty prior to 2018, for a "real" example: https://github.com/klange/toaruos/blob/v1.0.0/userspace/extra/stty.c#L132-L190
<bslsk05> ​github.com: toaruos/stty.c at v1.0.0 · klange/toaruos · GitHub
<klange> and my own implementation these days: https://github.com/klange/toaruos/blob/master/apps/stty.c
<bslsk05> ​github.com: toaruos/stty.c at master · klange/toaruos · GitHub
Oli has quit [Ping timeout: 272 seconds]
[itchyjunk] has quit [Read error: Connection reset by peer]
zaquest has quit [Remote host closed the connection]
zaquest has joined #osdev
mahmutov has joined #osdev
amazigh has quit [Ping timeout: 240 seconds]
amazigh has joined #osdev
epony has quit [Ping timeout: 240 seconds]
epony has joined #osdev
<klange> Another weekend has gone by in which I did a bunch of stuff _except_ the thing I was "planning" on doing.
<gog> don't feel bad, i planned on doing things and i ended up playing factorio
<gog> ¯\_(ツ)_/¯
<klange> At least I did do useful things....
<klange> Active window resizing, reimplemented signals, and made my userspace malloc actually release memory. All of these were related.
<FireFly> usually my weekends go by planning to deal with $personal_infra_technical_debt or work on $hobby_project or so, and just end up watching youtube videos and lazing about
<FireFly> so I mean, comparatively you're more productive :p
<klange> I also played a lot of pokémon
<FireFly> neat
<FireFly> this weekend for me is just trains
<moon-child> I reimplemented most of my allocator
<gog> choo choo
<moon-child> also played a bunch of celeste
<klange> I deleted a bunch of my allocator.
<moon-child> and watched hbomberguy video
<gog> the deus ex one?
<moon-child> yeah
<gog> same
<moon-child> well, I watched part of it
<moon-child> really long!
<gog> i watched all of it. i uh, i haven't slept lol
<moon-child> :D
<klange> > 3h33m33s
<FireFly> oh there's a new hbomberguy video?
<FireFly> cool I guess I have something to watch during the train ride :p
<FireFly> and now I'm even in the part of the ride with decent network uplink, heh
<gog> yeah it's about deus ex: human revolution
<moon-child> youtube-dl while you have the chance
<gog> primarily
* FireFly nods
<gog> but he talks about the original game a lot too which i appreciated because that was a good game
<FireFly> moon-child: yeah, fired up a ytdl, we'll see if it takes forever
<gog> i should reinstall it
<moon-child> I have never played any deus ex game
<gog> i highly recommend the original
<gog> very well made, top-tier worldbuilding
<moon-child> might take a look later. Though I look at the list of games I have yet to play and ... ...
<gog> yeah i have a whole steam library of games i have never even played. waiting for my wife to get sick of skyrim before i try to take a turn :p
<gog> she spent a solid 3 weeks playing new vegas nearly every moment of her free time lol
<gog> finally got bored of that, took a run at life is strange, then got into skyrim just as i thought she was getting gamed out
<moon-child> I tried to play skyrim and fallout, never could get too deeply into them
<moon-child> probably played maybe 10 hours of each
<moon-child> it was fun, but didn't really grab me
<gog> fair
<gog> idk what i want to play next. i wish rimworld had a non-steam version i could download because factorio does and it plays so much better without steam
<GeDaMo> «If you want to run RimWorld without Steam, I think you can delete 'SteamAPI.dll' from the game's directory, and all Steam functionality will cease.» https://old.reddit.com/r/RimWorld/comments/86z9yb/how_can_i_play_this_game_without_steam_and_offline/dw95ph1/
<bslsk05> ​old.reddit.com: DrCubed comments on How can I play this game without steam and offline
<gog> hm
<graphitemaster> whence should've been named relative_to
<graphitemaster> But the C people were terrified of underscores for some reason
<FireFly> it's on gog as well (heh), so should be playable without steam
<FireFly> I should play rimworld at some point, it's just that I know it'd nerdsnipe me and distract me for a while and idk if I want to spend the time for that yet
<gog> yeah it'll do that for sure
<gog> it's a very complicated game and it's frustrating and engaging and just a wild ride
Starfoxxes has quit [Ping timeout: 240 seconds]
Starfoxxes has joined #osdev
X-Scale` has joined #osdev
X-Scale has quit [Ping timeout: 256 seconds]
X-Scale` is now known as X-Scale
Starfoxxes has quit [Ping timeout: 240 seconds]
valerius_ has joined #osdev
<geist> been playing a bunch of horizon forbidden west
<geist> thus far a solid sequel to an already excellent original
<clever> geist: ive been reading the miniheap code, and the alignment thing bothers me a bit, what about cutting the free chunk into 3 chunks, a before/target/after, where target is aligned correctly and has no waste?
<clever> is the mini refering to the metadata or the .text size of the implementation?
<geist> uh
<geist> lemme see
<bslsk05> ​github.com: lk/miniheap.c at master · littlekernel/lk · GitHub
<clever> at this point, it can turn a chunk that is too large, into a target+after pair, and put after back into the free_list
<geist> yes
<clever> but alignment works by just making the allocation too big (the alignment size), and then shifting the object within that allocated area, to align it
<geist> that's what it does
<geist> yes
<clever> what if you also shifted the alloc_struct_begin object as well, and put the waste from before it, into a new free_heap_chunk
<clever> and then you dont allocate more than you needed
<geist> yes
<geist> these are all things. yes.
<geist> miniheap is designed to be minimal. simple. does the basics, and nothing more
<geist> it's not optimal, but it's small and simple
<geist> and what you are talking about only really happens when using allocations with large alignments
<clever> yeah
<clever> 256 is the largest alignment i use
<geist> it simply overallocates and then aligns within the result. basically memalign
<clever> and thats plenty of room to create a free_heap_chunk
<geist> yes.
<geist> it could, you are correct. but that's what the other heaps do better
<geist> cmpctmalloc and dlmalloc are more complex, and handle this sort of thing
<clever> and lk also has cmpctmalloc, i should study it as well, you did mention i should try changing heaps a while back
<geist> miniheap is designed to be extremely simple, not very much code gen
<geist> most of the 'big' arches default to cmpctmalloc
<geist> zircon actually uses it, almost unmodified
<clever> ah, i already see improvements only 20 lines in, multiple free lists!
<geist> miniheap is basically useful for lower end embedded stuff that doesn't really use a heap that much. maybe a handful of allocs and nothing more. on some cortex-m style environments it works great since it allocates little chunks from the novm in units of 1k or so
<geist> basically minimum overhead
<clever> ah yeah, i have seen convos on the rpi forums about the mcu line, and how you should basically never use malloc
<clever> statically allocate everything, then you cant run out of ram
<geist> yah
<clever> and my bootcode.bin case is even more restrictive, 128kb for me, 264kb for the rp2040 mcu
<clever> and i must share that limit between both data and code, so its more like 20kb
<clever> while the rp2040 has a full 16mb xip window into spi flash
<clever> i need to dig into l1/l2 stuff more, things get unstable if i try to do anything new
<bslsk05> ​github.com: lk/cmpctmalloc.c at master · littlekernel/lk · GitHub
<clever> ah, and i can kinda see a trick here, ->left is the previous, but maybe this + ->size == next?
<klange> After that conversation the other day, I'm thinking of putting some effort into improving my memory management systems.
<clever> geist: oh, another interesting thing i note, miniheap is using LK's linked list code, while cmpctmalloc is managing the prev/next directly
<clever> its more self-contained, and can work outside of lk
pretty_dumm_guy has joined #osdev
Burgundy has joined #osdev
<bslsk05> ​github.com: lk/cmpctmalloc.c at master · littlekernel/lk · GitHub
<clever> geist: ah, thats simpler than i was imagining, it over-allocates by both the alignment size, and the structs for managing free space, using the regular malloc
<clever> then it re-grabs the mutex, and messes with internal metadata, to cut it up into smaller pieces, after having allocated, and return the extra back to free
<clever> so instead of finding a hole that covers a region meeting my requirements, it just finds a hole thats too big, and then sub-divides it after allocating
ElectronApps has quit [Remote host closed the connection]
elastic_dog has quit [Ping timeout: 250 seconds]
elastic_dog has joined #osdev
Starfoxxes has joined #osdev
Burgundy has quit [Ping timeout: 272 seconds]
nyah has joined #osdev
elastic_dog has quit [Ping timeout: 252 seconds]
wootehfoot has joined #osdev
elastic_dog has joined #osdev
the_lanetly_052 has joined #osdev
the_lanetly_052_ has quit [Ping timeout: 272 seconds]
bauen1 has quit [Ping timeout: 256 seconds]
[itchyjunk] has joined #osdev
Oli has joined #osdev
<mrvn> geist: concerning miniheap I see some inconsistencies: 1) If you use alignment=0 then you get 4/8 byte aligned data because you always keep the size a multiple of void*. This is insufficient to do a double register load.
<mrvn> 2) If you ask for alignment then the alignment gets bumped up to 16. No way to get 8 byte aligned data. And alignment always wastes alignment many bytes no matter what.
<clever> mrvn: cmpctmalloc doesnt have that waste issue
<mrvn> 3) You DEBUG_ASSERT len % sizeof(void *) and len > sizeof(struct free_heap_chunk) but then in the code you deal with those cases.
<mrvn> 4) alignment should I think use ROUNDUP(size, alignment), not size += alignment.
<clever> mrvn: if the user wants a 40 byte object, that is 10 byte aligned, and you round-up, that will remain as a 40byte size, the malloc can then pick addr 35, and it has no wiggle room to fix things
<mrvn> clever: DEBUG_ASSERT len % sizeof(void *)
masoudd has quit [Ping timeout: 272 seconds]
<clever> the idea there, is to add 10 to the size, so you now have a 50 byte window, where that 40 byte object can land
<clever> allowing you to shift its addr by 10 bytes forward
<clever> so moving it from 35 to 40, creates a 5 byte hole on either side of the actual payload
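clever's 35/40/10 example can be worked through in code. The sketch below is not the miniheap implementation: `round_up` uses a modulo so that it also handles the non-power-of-two alignment from the example, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct aligned_fit {
    uintptr_t payload;      /* aligned address handed to the caller */
    size_t    front_waste;  /* unused bytes before the payload */
    size_t    back_waste;   /* unused bytes after the payload */
};

/* Round addr up to the next multiple of alignment (any nonzero value). */
static uintptr_t round_up(uintptr_t addr, size_t alignment)
{
    uintptr_t rem = addr % alignment;
    return rem ? addr + (alignment - rem) : addr;
}

/* Over-allocate size + alignment bytes starting at `hole`, then slide
 * the payload forward inside that window until it is aligned. */
struct aligned_fit place_in_window(uintptr_t hole, size_t size, size_t alignment)
{
    struct aligned_fit fit;
    uintptr_t window_end;
    assert(alignment != 0);
    window_end      = hole + size + alignment;
    fit.payload     = round_up(hole, alignment);
    fit.front_waste = (size_t)(fit.payload - hole);
    fit.back_waste  = (size_t)(window_end - (fit.payload + size));
    return fit;
}
```

For a hole at 35, a 40-byte object and 10-byte alignment, the payload lands at 40 with 5 wasted bytes on each side. The two waste values always sum to the alignment, which is the space a smarter allocator could hand back to the free list.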
<mrvn> clever: you are half right. it might have to waste space at the front and back.
<clever> yeah
<clever> but you cant know how much you need to waste on the front, until youve already picked a hole in the free space
<mrvn> Yeah, I take that part back.
<clever> cmpctmalloc solves that problem, by mutating the heap metadata AFTER youve allocated, to turn the before/after waste back into free space objects
<mrvn> The wastage problem comes later. When the chunk is too big the unused part is returned to the free list. But because the original size is lost the extra waste reserved for alignment purposes can't be returned too.
<mrvn> In line 219 the code should compute how much space is left after the allocation after alignment and using the original size.
<mrvn> but maybe that counts as too complex.
<mrvn> anyway, the important case is 1) as that could unexpectedly fault.
vdamewood has joined #osdev
<mrvn> what should miniheap_alloc(0, 0) return?
<mrvn> or general malloc(0)?
<mrvn> Specs say: "If size is 0, then malloc() returns either NULL, or a unique pointer value that can later be successfully passed to free()."
masoudd has joined #osdev
<mrvn> Am I remembering that issue with double register loads wrong: https://godbolt.org/z/ds6531q8j
<bslsk05> ​godbolt.org: Compiler Explorer
<mrvn> Line 16 shows Pair being only 4 byte aligned but bar(Pair&) does a double register load.
<mrvn> And what is the compiler doing in the asm in lines 2,3,4,6?
<mrvn> Make space on the stack, get a pointer to the original address, store the Pair and never ever use it.
<j`ey> use clang its a betterer compiler!!
dude12312414 has joined #osdev
<bslsk05> ​godbolt.org: Compiler Explorer
<mrvn> better in some, worse in others
<j`ey> weird it doesnt do the ldrd
<j`ey> in bar
<mrvn> clang does two ldr, which I believe is required.
<j`ey> required for?
<mrvn> alignment. iirc ldrd will fault when the address isn't a byte aligned.
<mrvn> 8 byte aligned
<mrvn> let me check that ...
<bslsk05> ​developer.arm.com: Documentation – Arm Developer
<mrvn> Look what happened when I tune it for the RPi1: https://godbolt.org/z/Yo16hYnnT
<bslsk05> ​godbolt.org: Compiler Explorer
<mrvn> It's odd. Now gcc uses ldmia in bar but 2 ldr in baz while clang uses ldr in bar but ldmib in baz.
<mrvn> Is there a tool that shows cpu cycle times for ASM code?
Ermine is now known as Santurysim
Santurysim is now known as Ermine
Ermine is now known as Santurysim
Santurysim is now known as Ermine
<sortie> My new ports system prototype is now fully functional and merged to my volatile builds :)
<sortie> It's BSD ports style. /src/ports contains a subdirectory for each port containing metadata with links to upstream releases and patches to apply to them. The build system automatically downloads the appropriate files when first used
<sortie> The ports system is part of the main repository and versioned together. All ports are built by default but can be overridden using the PACKAGES variable
<sortie> In other words: You can literally just boot up my latest volatile iso and type "cd /src && make" and it will rebuild EVERYTHING natively OUT OF THE BOX
<sortie> This is how easy I wanted to make developing my OS. Just download it. Bam instant dev environment. No need to build cross-compilers. It's all there. You can optionally install it if you want a persistent environment.
<mjg> congratulations. you reached templeos level :)
<mrvn> he still lacks god
<zid> now rename it to stage3.tar.gz
<zid> and you're at gentoo level
<sortie> I probably exceed the other systems because my WHOLE ports system cross-compiles cleanly out of the box too
<mjg> templeos recompiles everything it can run afair
<mjg> so..
<sortie> It's quite impressive how big a tech loop I've built :)
<sortie> It's even connected to IRC for some reason
underscoreanne has joined #osdev
Burgundy has joined #osdev
the_lanetly_052 has quit [Ping timeout: 240 seconds]
the_lanetly_052 has joined #osdev
brynet has quit [Quit: leaving]
the_lanetly_052 has quit [Ping timeout: 256 seconds]
not_not has joined #osdev
<Clockface> GNU/purgatory is GNU/heaven you must compile from source
<not_not> Ahahaha
<not_not> Im doing lfs right now actually
xenos1984 has quit [Read error: Connection reset by peer]
<not_not> But its heaven if U dont like sleep
<not_not> And if you truly hate the Sun and everyone who walks below it
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
<Clockface> eh, i decided ill do randomly generated null-terminated sequences
<Clockface> so its variable length, in case i feel like saving memory or 600 trillion pieces of software end up getting made
xenos1984 has joined #osdev
underscoreanne has quit [Ping timeout: 256 seconds]
brynet has joined #osdev
<zid> TIL nvidia is open source
<gog> lol
<gog> yes
<geist> mrvn: you're right, the miniheap stuff may not deal with minimum alignment properly, but i think it always gets you at least a word alignment
<geist> which on the 32 bit arches it's used for is sufficient
<geist> but i dont think that's good for x86-64 or whatnot
<geist> which i think has a general 16 byte alignment (2 words)
<geist> like i said it's only really used on microcontrollers or whatnot
<geist> the bigger arches default to cmpctmalloc or dlmalloc
<geist> but... i should at least write some docs at the top of the file or whatnot saying this
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
<geist> also while looking at it, can probably simplify it a bit and use a single linked list for free, since it always walks the free list in order anyway
<geist> i honestly haven't looked at that code much in the last 12 years or so. it's been a workhorse so i dont think there's any overt bugs with it, just optimizations not taken
<mrvn> geist: I was under the impression that ldrd would fault when the address is not aligned but the docs say it just does 2 32bit loads in that case. So void* alignment is sufficient.
<geist> yes, it does void alignment
<geist> but you're right, the ABI may want you to do 2 word alignment
<geist> or, sort of more specifically, the compiler may *assume* the heap does a particular alignment and optimize accordingly
<mrvn> geist: gcc and clang both do and don't depending on $random.
<geist> i think we found that on cmpctmalloc in zircon. it was giving us 8 byte aligned pointers, but the compiler was assuming 16
<geist> and then if you really wanted something 16 byte aligned, called memalign, it would actually call malloc
<geist> etc
<geist> this is actually worth double checking in LK with all the arches
<geist> note there's a compiler switch to override the assumed heap alignment
<mrvn> What I would change is the check for minimum alignment >= 16. Nothing wrong with asking for alignment = 8.
<mrvn> or 4
<geist> well depends on 32 vs 64
<geist> and arch, so actually i think a really good solution is a bit more complicated
<mrvn> not really. If you can't handle 4 byte alignment then don't ask for it.
<geist> but, in this case, with miniheap it basically takes any align < sizeof(void *) and rounds it up
<geist> eh, i disagree. if you ask for lower alignment than the minimum, then give the minimum
<mrvn> the roundup to sizeof(void*) is a different case. That's ok and for the internal sanity.
<geist> especially since it's potentially arch specific
<mrvn> Which means you already get 4/8 byte aligned depending on the arch.
<geist> right
<mrvn> But when the user specifically asks for an alignment then you bump that up to 16.
<geist> hmm why?
<mrvn> lines 193-195
<geist> hmm, trying to grok that
<mrvn> I see no reason to increase the alignment beyond that sizeof(void*) requirement. If the user asks for 1 byte aligned data then there is no reason to align to 16 byte.
<geist> hmm, trying to look through the git history to see why that's the case
<mrvn> Just seems odd to force something on the user when they specifically asked for less.
<geist> hmm, drilling through history that line was last touched in 2008
<mrvn> As for "the compiler was assuming 16". That's also an issue. Depending on the arch malloc() is assumed to return aligned data for good speed. E.g. on x86 you get 16 byte aligned for SSE registers to work.
<geist> 'initial commit' so yeah there's no prior history
<geist> yes. most likely that's what i was observing at the time and simply set the minimum alignment to 16 in that function
<mrvn> maybe add a DEBUG_ASSERT to see if it even triggers anywhere in the code.
<geist> and then literally haven't thought about it
<geist> since min alignment 16 is pretty much the max min alignment i've observed on all arches
<mrvn> then you messed that up. :) The default is alignment == 0
<geist> alignment == 0 simply means 'i have no opinion'
<geist> the fact that the heap code itself then rounds it up to 16
<geist> is an implementation detail. does the heap need to do that? probably not, depends on the arch
<mrvn> except in that case it doesn't.
<mrvn> I would change the code to: if (alignment == 0) alignment = 16;
<geist> ah i see. yes it de facto aligns to sizeof(void *) because that's the alignment of free slots
<geist> ah i see. no it's more complicated than that. i think it WAI, but it's implicit
<geist> alignment = 0 is okay, it just then goes through a slightly implicit path in the allocator
<geist> it's just not that clear
<mrvn> if alignment==0 then in line 244 you don't align. So you get the natural void* alignment of the free block.
<geist> right, which is WAI
<mrvn> And I think the compiler would assume 16 byte aligned on many arches in that case.
<geist> because later on when it's searching for a block it doesn't then try to round up the pointer
<geist> well, *that* is a different problem, yes
<geist> and you're right, except this is only ever run on arches where void * alignment is okay
<geist> though i should double check that
<geist> and/or statically assert it
<geist> *or* override the compiler flag for minimum alignment if this heap is selected
<mrvn> So my implementation would be: alignment == 0 ==> 16 byte, alignment < sizeof(void*) ==> sizeof(void*), else as requested.
<geist> okay
<mrvn> or instead of 16byte whatever the arch has as assumed alignment.
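mrvn's proposed policy is small enough to write down. This is a sketch of the rule as stated in the chat, not what miniheap does today. `HEAP_DEFAULT_ALIGN` is a hypothetical per-arch constant standing in for "whatever the arch has as assumed alignment", set here to 16.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: alignment the compiler expects from malloc on this arch. */
#define HEAP_DEFAULT_ALIGN 16u

static_assert((HEAP_DEFAULT_ALIGN & (HEAP_DEFAULT_ALIGN - 1u)) == 0,
              "default alignment must be a power of two");

/* alignment == 0 means "no opinion": use the arch default.
 * Anything below pointer size is raised to pointer size, which the
 * free-list bookkeeping needs anyway. Everything else is honoured. */
size_t normalize_alignment(size_t alignment)
{
    if (alignment == 0)
        return HEAP_DEFAULT_ALIGN;
    if (alignment < sizeof(void *))
        return sizeof(void *);
    return alignment;
}
```

Under this rule a caller asking for 8 on a 64-bit target gets 8 rather than being bumped to 16, while a plain malloc-style call still gets the alignment the compiler assumes.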
<geist> i dont really agree, but i think there's a solution in there
<geist> yes. i think the per arch thing is important
<mrvn> Is the miniheap used a lot in LK or just during bootstrap?
<geist> it's used in embedded builds of it
<geist> usually cortex-m. stuff where the heap is in the 10s of K
<geist> it doesn't scale at all for larger heaps, or lots of alloc/frees since it's entirely O(N)
<mrvn> it has O(n) performance but with 10s of K that's probably still ok.
<geist> exactly
<geist> it performs *horribly* in a highly alloc/free environment (C++ with lots of new/frees)
<geist> new/deletes
<mrvn> on the other hand with that little memory a best-fit strategy might be useful.
<geist> possibly? best-fit has a tendency to create the smallest possible holes
<geist> also it first fits which *probably* generates pretty good GCing
<geist> since it can return unused blocks to the system allocator
<mrvn> yeah, there was some flavour that used perfects holes and otherwise picked holes to split smarter.
<geist> it has a self trim mechanism
demindiro has joined #osdev
<geist> for builds like cortex-m what LK has is a 'novm' VM. basically it carves up all the free space after the text/bss/etc to the end of RAM into fixed sized pages. up to the build, but maybe 512 bytes or 1K or whatnot. uses a bitmap (because small)
MiningMarsh has quit [Quit: ZNC 1.8.2 - https://znc.in]
<geist> the heap grabs chunks out of that to expand itself, but otherwise code is free to allocate directly from the novm
<geist> basically lets larger allocs skip the heap if it wants
<geist> or run some sort of second heap (like some garbage collected, language heap)
eddof13 has joined #osdev
<geist> so miniheap is pretty good about being frugal with its free list (stores the free list in the free pages itself) and fairly aggressive about returning free chunks to the novm
<mrvn> If you have a lot of malloc/free it really helps to have separate free lists for 8, 16, 32, 64 byte objects.
<geist> so i think in general being an aggressive first-fit would tend to cluster allocs at the start of the heap
<geist> mrvn: yah that's precisely what cmpctmalloc does
<geist> it's a standard binning heap
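The separate-free-lists idea for 8, 16, 32 and 64 byte objects can be sketched like this. It is a minimal illustration of binning in general, not cmpctmalloc's actual implementation; `bins`, `bin_index`, `bin_alloc` and `bin_free` are hypothetical names, and the caller is assumed to know the size class when freeing.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BINS 4                          /* size classes: 8, 16, 32, 64 bytes */

/* A free block stores the link to the next free block inside itself. */
struct free_node { struct free_node *next; };

static struct free_node *bins[NUM_BINS];

/* Map a request size to a bin index, or -1 if it is too large for the bins. */
static int bin_index(size_t size)
{
    size_t bin_size = 8;
    for (int i = 0; i < NUM_BINS; i++, bin_size <<= 1)
        if (size <= bin_size)
            return i;
    return -1;
}

/* Pop a block from the matching bin in O(1); NULL means the caller
 * should fall back to the general-purpose heap. */
static void *bin_alloc(size_t size)
{
    int i = bin_index(size);
    if (i < 0 || bins[i] == NULL)
        return NULL;
    struct free_node *n = bins[i];
    bins[i] = n->next;
    return n;
}

/* Push a block back onto the bin for its size class in O(1). */
static void bin_free(void *p, size_t size)
{
    int i = bin_index(size);
    assert(i >= 0 && p != NULL);
    struct free_node *n = p;
    n->next = bins[i];
    bins[i] = n;
}
```

The trade-off matches the discussion: more lists and bookkeeping than a single O(N) free list, but allocation and free for small objects no longer walk the heap.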
<mrvn> but speed vs. complexity :)(
<geist> more overhead because more lists, etc, but the performance is far better
<geist> so for builds like arm64, x86, riscv, etc usually use cmpctmalloc
<geist> or even dlmalloc, which is also in the tree, but my testing is cmpctmalloc does as good of a job and is at least a little bit easier to read
<geist> but i should also write some unit tests for this stuff, etc
<mrvn> can I make qemu-system-arm print all cpu exceptions and interrupts?
<geist> `-d int` probably?
<gog> yes
* geist pets gog
<gog> if it's enabled in your build
* gog prrrrs
<mrvn> thanks.
<geist> hmm that's true, it may be possible to disable it, but i think you'd have to have explicitly done so
<mrvn> any way to get a register dump when one happens?
<gog> it dumps every time
<mrvn> Exception return from AArch32 mon to svc PC 0x448
<gog> you're gonna get a buttload of terminal spam
<mrvn> doesn't tell me much.
<gog> oh
<gog> okay then
<geist> i think it may be per arch
<gog> yeah x86_64 spits out the whole state
<geist> if you have something that's crashing very early you can get away with mega spam with
<geist> `-d int,cpu,exec`
<geist> one of cpu or exec, i forget, dumps the state
<mrvn> thx
<geist> but it dumps it also on every block of code, so it's more than just interrupts
<geist> but since the int is a forced jmp essentially i think you can line up the nearby cpu state block with the irq
<geist> useful if you have an early crash, or you're trying to debug your x86 real mode to long mode code or whatnot
<mrvn> trying to see why my code stops after enabling the MMU
<geist> yah and sadly there's no `info mmu` for arm
<mrvn> Taking exception 4 [Data Abort]
<mrvn> ...with DFSR 0x7 DFAR 0x3f201018
<mrvn> That's the UART0 status register
<geist> oooh, trying to write to a register that's no longer mapped?
<j`ey> is it mapped?
<mrvn> if only I had "info mmu" to check
knusbaum has quit [Ping timeout: 256 seconds]
MiningMarsh has joined #osdev
<geist> gosh i wish apple would fix this. seems that on my M1 mini every few days the 'systemstats' process grows and grows, leaking ram such that it fills up the swap
<geist> it deals with it admirably, but eventually i think starts slowing the system down
<geist> in this case it was 30GB virtual space, 29.5GB of compressed memory, using 8GB of swap
eddof13 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<mrvn> ups, I set the size of the peripherals to map in r5 instead of r6. So it mapped total garbage.
FatAlbert has quit [Quit: WeeChat 3.4]
<mrvn> I'm still annoyed that "qemu -M raspi2" loads the kernel at 0x10000 instead of 0x8000 where a real one does.
<mrvn> or is that some old firmware vs. new firmware thing?
rwb is now known as rb
<geist> i have vague recollections that the firmware changed somewhere, it broke LK at some point too
sonny has joined #osdev
<sonny> floppy disk boot sounds so much easier :-)
<geist> than what?
<gog> i watch retro computer youtubers just to hear the seek test
<gog> bwwwmm-bwm-brrrrmp
<geist> hah yes
<geist> was watching some guy fix a RX02, but that's a bit more noisy
<gog> o:
<geist> with a nice CLICK sound every time the head goes inactive
<gog> nice
<gog> did you see adrian's latest
<gog> i'm disappointed in our man
<geist> there's apparently a little solenoid to pull the head back
<gog> he said 28mins was a "long" video
<geist> hmm, which one? i'm on his patreon and he just posted a vid of his basement
<geist> he has a pretty rad man cave
<gog> he rescued a pet
<gog> it needs rehab
<geist> ah yes, haven't watched the latest
<gog> i wanted to know what was wrong with it and he stopped the video!
<geist> oh i thought he just fixed it? haven't watched that one yet
<gog> nah brand new pet
<gog> 4120
<geist> the one with the bad memory detect
<gog> yes
<geist> ah
<gog> wait did he post a new one? /me checks
<geist> maybe patreon
<geist> usually posts them there a few days early, i lose track
<gog> ah fair
<gog> the only patreon i subscribe to is wtypp
<geist> this one is pretty reasonable. $4 gets you the goods
<geist> but yeah i patreon a few of em
<gog> i pay $2 to listen to jaded people explain why everything is awful and we can't have nice things
<gog> it's therapeutic
<geist> heh, if that works for you!
<gog> :D
* geist is defragging a few egregiously fragged files
<gog> oops
<geist> not really therapeutic though, with no sound
<gog> if a file gets fragmented in the forest but nobody hears it did it really get fragmented?
<geist> yah and does it make any performance difference? no. but in MY DAY we cared about fragmentation
<geist> i can just think about all the spillover FILE records to track a file with 50k extents
<geist> does it matter? nope.
<gog> lol
<gog> i mean it did matter
<geist> and technically the cpu is doing a bit more work there to seek through the extent list
<FireFly> oh I just finished the latest wtypp vid, heh
<gog> i've been listening to true crime shows all day
<sonny> geist uefi
<geist> sometimes just listening to Steve at gamers nexus blab on about benchmarks for 30 minutes is good background noise
<geist> i think his brain works too fast for his mouth
<sonny> I just watched https://youtu.be/xD14SLU2u2k
<bslsk05> ​'Pushing the limits of floppy disk boot sectors: sectorLISP' by The Oldskool PC (00:14:23)
<gog> nice
<sonny> yeah, I didn't think it would be so simple
<geist> sonny: ah yes. UEFI is definitely a designed by committee mess. OTOH despite its ergonomics, i do rather like the overall structure
<geist> ie, giving you an api/environment to write your 'whatever you want to do' bootloader is fairly nice
<sonny> yeah that is cool
<geist> and, and this is really important to me, it's portable across arches
<gog> ILikeUefiButTheFunctionNamesAreSoBad()
<sonny> lol
<geist> so for things that support it, you end up with a single loader which is very nice
<sonny> more than amd64 support uefi?
<geist> yah, arm, riscv
<gog> ^
<sonny> oh cool
<sonny> yeah that's very helpful
<geist> yah same API, etc. just compile your code differently
<gog> and ia32
<geist> now, obviously the issue is different ARM boards, etc dont *have* to support UEFI, but i think the trend is slowly moving that way
<sonny> so no more device tree stuff?
<j`ey> lots of those boards can get uefi via uboot
<geist> device tree can coexist with UEFI
<j`ey> (but note: UEFI+DT, not acpi)
<geist> i have to actually try it, but there's a UUID to get the DT blob out of UEFI
<geist> GetSystemTable or something
<gog> SystemTable->ConfigurationTable
<gog> same as acpi
<gog> you get a pointer to the blob with the UUID
<geist> yah
<geist> so based on how you wanna roll, you can use ACPI or DT (or both) to configure your drivers
<sonny> I looked at the uefi hello world and the string "hello world" wasn't there D:
<geist> windows uses ACPI, and i think linux tends to favor DT, though it's unclear precisely what uses ACPI and what uses DT when both are present
<geist> that's because it's probably stored as UTF16
<geist> so look for h\0e\0....
<geist> because MSFT
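Searching a binary for a string stored as UTF-16 (the "h\0e\0..." pattern geist describes) could look roughly like this. It is a sketch under the assumption that the text is plain ASCII encoded as little-endian 16-bit units; `find_utf16le` is a hypothetical helper, not part of any UEFI toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Look for an ASCII string stored as UTF-16LE (each character followed by
 * a zero byte) inside buf. Returns the byte offset, or -1 if not found. */
static long find_utf16le(const unsigned char *buf, size_t len, const char *ascii)
{
    size_t n = strlen(ascii);
    assert(n > 0);
    if (len < 2 * n)
        return -1;
    for (size_t off = 0; off + 2 * n <= len; off++) {
        size_t i = 0;
        while (i < n &&
               buf[off + 2 * i] == (unsigned char)ascii[i] &&
               buf[off + 2 * i + 1] == 0)
            i++;
        if (i == n)
            return (long)off;
    }
    return -1;
}
```

A plain byte-wise search for "hello" misses such a string because every other byte is zero, which is why the hello world text does not show up when looking for the 8-bit spelling.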
<bslsk05> ​github.com: edk2/HelloWorld.c at master · tianocore/edk2 · GitHub
<geist> yeah?
<geist> i dunno what that PcdGetPtr stuff is, but that's what documentation is for
<geist> maybe some way to build per language strings or something
<geist> but indeed does not seem like a very friendly way to build a hello world
<sonny> ok, I'll investigate that later
<geist> but the real gist is it calls Print() with a string
<geist> so thats hello world
<sonny> fair
<bslsk05> ​github.com: efitutorial/loader at main · adachristine/efitutorial · GitHub
<gog> the best uefi hello world
<sonny> ah, thanks a lot!
<kingoffrance> dont see it on sonny's link, i think one can gather most of the C "hello world"s into an "unnecessary use of printf()" page, since puts() suffices for most of them
<geist> heh well the compiler will probably replace it anyway
<sonny> yeah, just that there puts() seemed convoluted
<sonny> s/there/their/
<geist> so can counter that as an 'unnecessary microoptimization that the compiler will do for you'
masoudd has quit [Ping timeout: 272 seconds]
<kingoffrance> eh, that doesn't bother me, just seems like messing with noobs right from the start
<geist> what i dont like about puts is it puts in the \n at the end
<geist> which i think folks should generally be aware of right off the bat
<geist> though other languages tend to have two variants of their printf that inserts or not
<geist> so that may just be old skool me
<gog> i'm the 83rd most active dev known to be in iceland :D
<sonny> congrats, is that stat via github?
<bslsk05> ​github.com: top-github-users/iceland.md at main · gayanvoice/top-github-users · GitHub
<gog> 212 if you count all contributions rather than public :p
<mrvn> Urgs, my kernel puts() is broken, it doesn't add a newline. :)
<gog> i gotta up my game here
<gog> a commit for every changed line
<geist> also fputs *doesnt* so you can't just implement puts as an alias of fputs(stdout)
<geist> annoying it is
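The difference geist points out is that standard puts() appends a newline while fputs() does not, so a puts built on top of fputs has to add it explicitly. A minimal sketch, with `my_puts` as a hypothetical name for a kernel or libc routine:

```c
#include <assert.h>
#include <stdio.h>

/* puts() semantics on top of fputs(): write the string, then a newline. */
static int my_puts(const char *s)
{
    assert(s != NULL);
    if (fputs(s, stdout) == EOF)
        return EOF;
    if (fputc('\n', stdout) == EOF)
        return EOF;
    return 0;   /* any non-negative value signals success */
}
```

This matters once the compiler rewrites `printf("Hello, world!\n")` into a `puts("Hello, world!")` call: a puts that omits the newline silently changes the output.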
<mrvn> kingoffrance: printf("Hello, world!\n") or printf("%s\n", "Hello, world!")?
<sonny> I don't use any web service long enough to get on those lists :(
<gog> i do not like implicit \n if i wanted \n i would have put it in the string
<geist> mrvn: if your puts doesn't do it and you have builtins enabled (no -fno-builtin), you'll find out pretty fast, since the compiler will happily replace printfs with puts for you
<geist> gog: exactly
<mrvn> I have putc, puts, puti, putx from the days before printf.
<gog> nice
<not_not> Nice
<not_not> Gog same
<sonny> computer memory doesn't have a [signed] representation right? It's just bits?
<GeDaMo> Yes, just bits
mahmutov has quit [Ping timeout: 272 seconds]
<mrvn> the signed bit comes from the opcodes
<gog> it's an array of (minimum addressable unit)
<mrvn> .oO(9 bit)
<gog> hehe
pretty_dumm_guy has quit [Quit: WeeChat 3.4]
Teukka has quit [Read error: Connection reset by peer]
xenos1984 has quit [Remote host closed the connection]
xenos1984 has joined #osdev
Teukka has joined #osdev
<vin1> :q
vin1 has quit [Quit: WeeChat 2.8]
vin has joined #osdev
GeDaMo has quit [Remote host closed the connection]
<kingoffrance> mrvn: first one should be puts() :) second is unnecessary unless the output is configurable/changeable
<kingoffrance> < a commit for every changed line of course, that's how you get persistency and rollback features
<mrvn> run "make" in the pre-commit hook and reject on error :)
<kingoffrance> sonny, there was sign and magnitude, but i'm not sure how that was defined i.e. could programmer toggle it, or was it a "side effect" of instructions. nevertheless, at some point, was probably still just another bit
<mrvn> that's called 1's complement and is just a different interpretation of the bits in memory.
<sonny> kingoffrance I'm looking at the 704 and that stuff seems to only be defined for numbers
FreeFull has joined #osdev
<bslsk05> ​en.wikipedia.org: Signed number representations - Wikipedia
eddof13 has joined #osdev
<kingoffrance> sonny: Fixed-point numbers are stored in binary sign/magnitude format. :) from https://en.wikipedia.org/wiki/IBM_704
<bslsk05> ​en.wikipedia.org: IBM 704 - Wikipedia
<sonny> yeah
<mrvn> The format is no longer supported in the latest C/C++ standards.
<kingoffrance> ^ believe it is all twos complement
<mrvn> sign/magnitude is ones complement
<mrvn> i believe
<sonny> sign/mag doesn't use a complement iirc
<kingoffrance> ^ wikipedia example: 10000010 is −125 in 8 bit "ones complement"; for sign & mag, should just i believe be 1 <sign bit somewhere> 1111101
<kingoffrance> i just dont know if say from a c program, you can access that magic "sign bit"
<kingoffrance> or is it "transparent"
<kingoffrance> somewhere i have in my notes such a system c compiler, but have not used it
<kingoffrance> *system +
<mrvn> you can always access it.
eddof13 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<mrvn> bool negative = x < 0;
<kingoffrance> yeah, but thats what seems to me "transparent"
<kingoffrance> i mean with a mask, say a uchar
<mrvn> or (unsigned)x >> bits;
<mrvn> I highly doubt the cpu would hide the sign bit for unsigned ints.
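The two ways of getting at the sign that mrvn mentions can be written out as below. This is a sketch: `is_negative`, `top_bit` and `sign_examples` are hypothetical names, and the shift version assumes `unsigned` has no padding bits.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Value-level test: independent of how the machine stores negative numbers. */
static bool is_negative(int x) { return x < 0; }

/* Bit-level test: convert to unsigned and shift the top bit down.
 * The conversion is defined modulo 2^N, so every negative value ends up
 * with the top bit set. Assumes unsigned has no padding bits. */
static unsigned top_bit(int x)
{
    const unsigned width = (unsigned)(sizeof(unsigned) * CHAR_BIT);
    return (unsigned)x >> (width - 1);
}

/* Small usage example. */
static void sign_examples(void)
{
    assert(is_negative(-125) && top_bit(-125) == 1u);
    assert(!is_negative(125) && top_bit(125) == 0u);
}
```

Inspecting the raw object representation (for example with memcpy into an unsigned char array) would be the way to look at a sign/magnitude encoding directly, since the cast above works on values rather than stored bits.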
<kingoffrance> can you toggle padding bits?
JanC has quit [Remote host closed the connection]
JanC has joined #osdev
<mrvn> geist: Are ARMv8 with only a single VA range common?
<geist> VA range how?
<j`ey> you mean PA range?
<mrvn> "For a translation stage that supports a single VA range, a 48-bit VA width gives a VA range of 0x0000000000000000 to 0x0000ffffffffffff."
<geist> physical address range?
<geist> aah no. that's referring to EL2 and whatnot
<mrvn> No, virtual address
<geist> in EL2 without the type 2 extensions enabled (i forget the acronym) EL2 only supports a single VA range
<geist> hence why that verbiage exists
<j`ey> VHE
<geist> yah VHE
<j`ey> with !VHE there's only TTBR0_EL2
<mrvn> Can you explain the address tagging in AArch64 state in a few words?
<geist> the ASID stuff?
<geist> (there are a few things with tagging and whatnot)
<mrvn> If that's ASID then yes. I thought it might be but the wording is horrible.
<geist> there's also a bit you can set that lets you put 8 bits of whatever you want in the top 8 bits of the address
<geist> i think that's also called tagging
<mrvn> that's the one I'm reading
<kingoffrance> the good news is, if clang, llvm, whatever else, keeps abstracting into arbitrary-sized integers, eventually they will abstract far enough these things will be a thing again
eddof13 has joined #osdev
<geist> that's a different bit. lets you (per address space) tell the cpu to ignore bits 63:56 i think
<geist> and then bit 55 becomes the one that determines user and kernel
<geist> can use it for various schemes
<geist> we're actually enabling it in zircon for user space, like right now
<mrvn> Only thing the docs say is that the top bits get ignored for <list of things>. Are those bits used for anything?
<j`ey> TBI top byte ignore
<geist> yah TBI
<geist> when TBI is set, no
<geist> that's the point, lets code use the top bits for their own purposes
<mrvn> What's the use case for that?
<j`ey> MTE uses the top bits
<geist> we're enabling it in fuchsia for similar stuff. i forget the exact thing
<geist> but basically user space pointer tagging
<j`ey> (MTE = memory tagging extension)
<mrvn> j`ey: No, MTE is something else.
<j`ey> "MTE is built on top of the ARMv8.0 virtual address tagging TBI (Top Byte Ignore) feature"
<bslsk05> ​www.kernel.org: Memory Tagging Extension (MTE) in AArch64 Linux — The Linux Kernel documentation
eddof13 has quit [Client Quit]
sonny has quit [Quit: Client closed]
lkurusa has joined #osdev
knusbaum has joined #osdev
<mrvn> ARM always has so many config registers and options. The docs are nearly unreadable. I'm trying to see the page table structure and I'm 10+ pages of "if this then that else that" into the chapter and not one bit about what a page table looks like.
<mrvn> but I have 4 levels.
<klange> haha yes, it's utter garbage
<mrvn> I want a "How do I use this" document instead of a "How do I build this".
<geist> yah guess i kinda have a stockholm syndrome with it
<bslsk05> armv8-ref.codingbelief.com: D4.3.1 VMSAv8-64 translation table level 0 level 1 and level 2 descriptor formats · ARM Architecture Reference Manual for ARMv8-A
<geist> been using that doc so long it's hard to remember how hard it is
<mrvn> geist: In my mind the whole document is backward. It first describes all the millions of exceptions before giving even the base format. The docs should start with a graphic showing the required layout all cpus must have.
<geist> yah i suppose. trouble is there really are a bunch of variants
<sortie> I just merged a links(1) port made by a contributor :)
<geist> there really isn't a base format, because the base format is 'you chose these options, this is what you get'
<sortie> I can now $LC_SEARCH :)
<mrvn> In the AMD docs it has 3 pictures of address translation for the various options. You look at it and then you can understand what the text below each is talking about.
<geist> yeah. the pic you seek is there, you just have to know which variant it is
<geist> it's pretty much what klange linked
<mrvn> yeah, it's there somewhere. But by the time you find it you are utterly confused.
<geist> it's the fact that it's so configurable that makes it hard. ie, there's no 4 level paging. its paging can go up to 4 levels, dynamically
<geist> for example
<geist> i'm certain this is why info mmu doesn't exist on qemu. the amount of work to parse all the options is a lot more than x86 and especially riscv which is almost comically simple compared to the ARM versions
<mrvn> That's bad code design in qemu. The MMU should have an abstraction layer so that the implementation of the MMU also generates the info mmu output for free.
<klange> but qemu _implements_ the mmu, so surely the work was already done and it's just laziness...
<klange> yeah what mrvn said
<geist> right, yeah
<geist> i looked at the existing code for these and it's in no way shared with the rest
<mrvn> Each CPU should just define "this is how a pagetable looks like" and then generic page walking code and info mmu takes over or similar.
<mrvn> With a few exceptions there should be a ton of code different cpus can share there.
<geist> well... that's probably not a good idea for speed purposes
<geist> trying to make a generic page walker routine that works on all arches is a recipe for a slow implementation
<mrvn> .oO(Isn't that what the TLB is for :)
<geist> sure but the TLB is also per arch too
<mrvn> qemu has a TLB?
<geist> i actually dunno how qemu handles translations
<klange> TCG has a generic TLB
<geist> it almost certainly has some sort of fast cache yes
<mrvn> I don't think it emulates the actual TLB.
<klange> last I checked it was not per-arch
<klange> and does not emulate the arch-specific TLB behaviors
<geist> right. some sort of generic translation cache indeed
<geist> and since it's compiled again and again it can have some #defines to change its behavior
<mrvn> In the page table format a "block entry" is a huge page, right?
nyah has quit [Quit: leaving]
<geist> yah
<geist> basically a terminal page table entry
<mrvn> what about 4k pages? Are those blocks too?
<geist> not at the last level
<geist> they are, but i think they're just not called block entries
<geist> probably for legacy reasons. i *think* they act exactly identically if it's an entry at L0 or higher up the stack
<mrvn> Look identical apart from the number of reserved bits
<geist> i think in arm32 they were called block entries yes
<geist> also there are 'contiguous pages' which are different
<geist> that's when you mark say 8 or 16 pages in the final page table entry as combined with the nearby ones
<geist> ie, how you get 64k pages in a 4K page granule table
<mrvn> hah, not the same. A block ends in 01. A page ends in 11 (like a table entry)
<geist> well, yeah not *identically*. they're marked differently for sure
<mrvn> And 01 on the lowest level is reserved. Why did they do that?
* geist shrugs
<mrvn> Now I need 2 descriptions for the same bit. Level 0,1,2 it's Block vs. Table and level 3 it's reserved vs. page.
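The level-dependent meaning of the low two descriptor bits, as worked out above (block ends in 01, table and page end in 11, 01 at the last level is reserved), can be captured in a small decoder. This is a sketch with hypothetical names; it assumes bit 0 is the valid bit and ignores which levels actually permit block entries for a given granule.

```c
#include <assert.h>
#include <stdint.h>

enum desc_kind {
    DESC_INVALID,   /* bit 0 clear */
    DESC_BLOCK,     /* 0b01 at levels 0-2 */
    DESC_TABLE,     /* 0b11 at levels 0-2 */
    DESC_PAGE,      /* 0b11 at level 3 */
    DESC_RESERVED   /* 0b01 at level 3 */
};

/* Classify a translation table descriptor by its low two bits and level. */
static enum desc_kind classify_desc(uint64_t desc, int level)
{
    assert(level >= 0 && level <= 3);
    if ((desc & 1) == 0)
        return DESC_INVALID;
    int bit1 = (int)((desc >> 1) & 1);
    if (level == 3)
        return bit1 ? DESC_PAGE : DESC_RESERVED;
    return bit1 ? DESC_TABLE : DESC_BLOCK;
}
```

Keeping the level as an explicit argument is one way around the "two descriptions for the same bit" problem, and it avoids the bitfield struct approach geist warns about.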
<geist> right, was going to say if you're going to try to build some sort of bitfield struct for it, you're going to have a bad time
<geist> there's a fair amount of 'if this then interpret this as that'
<geist> in the upper PTE attributes especially
nyah has joined #osdev
<geist> i know it's not your style but if you need some reference i've already flattened most to https://github.com/littlekernel/lk/blob/master/arch/arm64/include/arch/arm64/mmu.h#L172
<bslsk05> ​github.com: lk/mmu.h at master · littlekernel/lk · GitHub
<mrvn> I guess nobody at ARM likes the fractal/recursive page mapping trick from x86.
<geist> yah does not work, i do not think
<geist> but then i dont either, so no loss for me
<klange> just seems fragile and inflexible to me
<geist> agreed and as i've pointed out before it starts to get very hard to manage in a multi-cpu environment
<geist> also has some issues with seeing data before it's available, etc. basically in production stuff it gets hard to deal with
<klange> Really need to focus on USB stuff this week... I'm starting to face a deadline...
<mrvn> You always have that on the kernel level. When you map a page it's there but uninitialized.
<geist> right which is why you can't map a page before initialized
<geist> also means you can't add a page to the page table structure before it's initialized and memory barriered
<mrvn> I don't see where a fractal mapping is any different there. You map it to a temp location, initialize, map it to the real location.
<geist> you just hit it right on the head
<geist> you map it to one place, then unmap it and map it again
<geist> that's a *lot* of extra work
<mrvn> same amount of extra work with and without fractal mapping. you always do that.
<geist> not if you keep all pages mapped
<geist> then you dont have to map/unmap to access the page tables
<mrvn> same with fractal mapping.
<geist> anyway i've had this discussion like 27 times over the years, dont want to get into it now
<mrvn> you can have fractal mapping and all pages mapped if you want.
<geist> there are a lot of reasons it becomes sub-optimal such that the utility of it does not outweigh the disadvantages
<geist> but they usually only show up once the design has reached enough sophistication that it causes trouble
<mrvn> Where I found it got complex was when copying data between different address spaces, e.g. 2 processes.
<geist> so it's great for hobby stuff, but over time becomes a crutch
<geist> and anyway it's x86 specific anyway, which is just one grain of sand on the beach
<mrvn> What I like about is that you can define a page table entry once and then use it at every level.
<geist> sure, it's nice while it works
<geist> not saying there's not a utility with it, but that's moot. a thing that's nice that becomes a hindrance you have to just move past
<geist> it's like a simple library/API that you hang onto far past when you should have moved onto something more complicated and powerful
<geist> human nature is to keep trying to fit the square peg you're used to into the round hole
vdamewood has joined #osdev
<mrvn> I need an App for the page tables. Something that lets me set all the bits in control registers and then shows me just the docs pertaining to that config.
<geist> anyway, sorry about being salty about this one
<geist> i've just had this argument before
<geist> mrvn: yeah years ago someone i know wrote a python script to decode page tables and it was really neat
<mrvn> no problem. I'm just testy because I'm having information overload here.
<geist> yah
<gog> now what about using only the page table recursively
<geist> i totally get that
<geist> it's like every time i try to dig into memory barriers or memory order on ARM it's like <sigh> and then half the day is wasted digging through the manual
<mrvn> If I could just mask out all the bits about EL2/EL3 the docs would be half the size already.
<klange> much ugh, very bleh
<klange> I still have lingering issues that I think are related to instruction caches on return from fork()
<mrvn> on fork()? Not exec()?
<klange> the child process on return from fork() _without_ Cow.
<klange> Bovines might fix it.
<klange> Or at least work around it.
<mrvn> but the instructions should be identical for parent and child.
<klange> But the child starts on a core that was running something else previously.
bradd has quit [Ping timeout: 256 seconds]
<geist> are you using ASIDs when context switching?
<klange> what's an asid
<geist> so you're not. okay. so lemme think
<mrvn> ahh. So you get instructions from a previous random process.
vinleod has joined #osdev
<mrvn> Doesn't reloading the page table register flush the instruction cache?
<klange> doesn't seem to, no
<geist> hmm, that is strange
<geist> so I and D cache on arm64 is supposed to be VIPT, or at least behave that way
<mrvn> fork() and multitasking should be the same there. If fork() fails the multitasking should fail too.
<geist> so unless you're reusing the same physical pages (which may be your problem) you shouldn't get a bad alias
vdamewood has quit [Killed (erbium.libera.chat (Nickname regained by services))]
vinleod is now known as vdamewood
<geist> however. if you had a physical page A with some instructions in it that the cpu ran some time in the past
bradd has joined #osdev
<geist> and it gets recycled, mapping a *different* piece of code that you then run without dumping the icache
<geist> you can run old code that was there
<klange> ^ The physical pages are almost definitely reused things with other code
<geist> so possible during fork you're cycling through some pages
<mrvn> Are you copying the pages for the .text segments on fork()?
<geist> okay, so basically what you need to do here is whenever you allocate a ppage for the purposes of mapping code and go to map it you need to dump the icache on it
<geist> this can happen even without fork, if you were demand faulting in stuff from .text and pulling in one page at a time
<geist> you have to make sure there's no stale icache entries that cover it
<geist> of course you also need to flush the icache whenever you modify the data in the page too
<mrvn> or when migrating a process to another core
<geist> which you can kinda think of is also what happens when you copy or fill a page with code
<geist> no, not when migrating
<geist> or at least if you follow the rules above, then multicore doesn't matter
<mrvn> if you haven't flushed the other core's icache when loading the .text then you have to do it on migrate.
<geist> but that's not how flushes work on armv8
<geist> you haven't gotten to that particular pile of complexium yet
<geist> TL;DR if you always use global (broadcast) flushes then when you flush locally you flush on all cores simultaneously
<geist> and there's almost no reason to not do that all the time
<geist> *especially* for icache flushes
<mrvn> ahh, nice. that's one headache gone.
<geist> yah TLB and cache maintenance is much more straightforward (though of course complex) on armv8
<geist> than v7 and especially x86
<geist> klange: so yeah back to the basic first principle: before you allow a process to run code on a page that has been modified since the last time code *may* have run on it, you must flush the icache for that page (or globally)
<geist> if you always follow that rule you should be okay
Oli has quit [Quit: leaving]
<klange> I think I see what I did wrong. I was actually trying to clear icache, but I was calling a function I had set up for exec + ld.so and it does a page walk to check if the address given is accessible to userspace...
<klange> And I was running it on a temporary kernel mapping
<geist> yah for now i'd do a global i cache flush
<geist> it's a single instruction iirc
<geist> there are Reasons for that (VIVT vs VIPT)
<geist> doing a global flush is sub optimal but it ensures that all aliases to the page are got
<geist> (there's a bit in CTR_EL0 that tells you if you can do this or not, but a global flush is safer for now)
<geist> these are all things i've learned the hard way over the last few years
<mrvn> Are there ARMv8 that don't support 4kB granule?
<geist> not that i know of. but 16K and 64k are not guaranteed
<geist> if there's gonna be a core that ditches the 4k granule it'll be apple with their M2 or whatnot
<geist> because OSX runs with 16k granules. i suspect they only have 4k around because of x86 emulation
<geist> most newer ARM cores, post about cortex-a75 or so support all 3
<klange> I have no solid way of testing this beyond running `sysinfo` (does a bunch of forks to call shells that call utilities) over and over and see if anything crashes
<klange> so far so good
<geist> yah that's the hard part. this is a toughy to test, or even reproduce reliably
<klange> unrelated, but ever since I moved a couple of cables around on my desk my serial console has been stable :)
<clever> when i was getting linux to boot on the pi3 i think it was, i had a real nasty problem, where as soon as i got into userland, i had major data corruption
<clever> i eventually tracked it down to linux lacking permission to flush the arm L2 cache
xenos1984 has quit [Read error: Connection reset by peer]
<klange> I think it was electrical interference from charger.
<clever> so any dma was not being coherent with the caches
<clever> i think it was ACTLR that controlled that?
<clever> but linux ran surprisingly well in that mess, until you hit userland
<klange> My startup process also launches about 60 or so processes before getting to a desktop, so booting at all is at least somewhat of a stress test.
<geist> yah though fresh processes will never trigger a stale thing
<geist> it's when you start cycling through things
<mrvn> if you don't reuse pages immediately you can run quite a bit before a page gets reused and by then the cache is probably clear.
bauen1 has joined #osdev
<klange> I think my current physical page allocation pretty aggressively reuses pages.
<mrvn> or if you have ASID that probably extends the time till you can get a collision.
<geist> indeed, though cache entries arent tagged by ASID
[itchyjunk] has quit [Remote host closed the connection]
<geist> ASID just lets you keep from excessively flushing TLB entries
<mrvn> my page allocation is FILO or LIFO too.
<klange> I have a dumb bitmap allocator with a floating index to the next place it thinks has available space. Whenever something is freed that is lower than that index, the index will be moved down; whenever any allocation happen, the index moves to the page after that; the bitmap gets walked from that index until an available page is reached; if it hits the end, it resets the index to the bottom and tries again
<klange> before panicking.
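The bitmap allocator with a floating index that klange describes might look roughly like this. It is a sketch built only from that description, not the actual kernel code: the pool size and all names are hypothetical, and where the real allocator would panic this version returns `NO_PAGE`.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PAGES 64u                       /* hypothetical tiny pool */
#define NO_PAGE   UINT32_MAX

static uint8_t  used[NUM_PAGES / 8];        /* one bit per page, 1 = in use */
static uint32_t hint;                       /* where the next search starts */

static int page_used(uint32_t p)
{
    return (used[p / 8] >> (p % 8)) & 1;
}

static void page_mark(uint32_t p, int in_use)
{
    if (in_use)
        used[p / 8] |= (uint8_t)(1u << (p % 8));
    else
        used[p / 8] &= (uint8_t)~(1u << (p % 8));
}

/* Walk the bitmap from the hint; on reaching the end, restart once from
 * the bottom before giving up. */
static uint32_t page_alloc(void)
{
    for (int pass = 0; pass < 2; pass++) {
        for (uint32_t p = hint; p < NUM_PAGES; p++) {
            if (!page_used(p)) {
                page_mark(p, 1);
                hint = p + 1;               /* next search starts after this page */
                return p;
            }
        }
        hint = 0;
    }
    return NO_PAGE;
}

/* Free a page and pull the hint down if the page lies below it. */
static void page_free(uint32_t p)
{
    assert(p < NUM_PAGES && page_used(p));
    page_mark(p, 0);
    if (p < hint)
        hint = p;
}
```

Because the hint drops to the lowest freed page, recently freed frames are handed out again almost immediately, which fits klange's remark that the allocator reuses pages aggressively and makes stale icache contents more likely to surface.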
<mrvn> I just have a stack. No memory wasted.
<klange> I can fairly easily find arbitrarily sized contiguous segments.
<mrvn> assuming you have any. :)
<mrvn> I decided that when I need it I can defrag physical memory.
<mrvn> do you support huge pages?
<klange> Not for anything I give to userspace.
<mrvn> for anything that may allocate later or just during boot?
<klange> Just during boot.
heat has joined #osdev
<mrvn> That's easy then. You always have long ranges free at boot.
<heat> updog
eddof13 has joined #osdev
<mrvn> My thought process was like this: Assume the system has been running a month. Now something wants a 2MB page. What's the chance the physical memory is not totally fragmented and there is a 2MB contiguous chunk?
<mrvn> Probably 0.
<zid> what's up dog
<mrvn> wuff wuff
<heat> it's all good
<heat> wbu
<zid> friendo isn't here to play scrabble with me tonight, sadge
sonny has joined #osdev
<mrvn> I once tried to make a scrabble AI that played optimal. It's rather hard (as in takes ages to make a move).
xenos1984 has joined #osdev
<heat> friends that miss scrabble night are bad friends
<mrvn> friends should play scrabble online to avoid covid.
<nanovad> good news, covid isn't a valid scrabble word
<mrvn> Note to self: Build a pair of scrabble bots that synchronize a board over the internet.
<mrvn> nanovad: lol
<kingoffrance> consider all the words that slowly change meaning, and spelling ^ nanovad i think it trends towards halting problem
<mrvn> and by scrabble bot I mean: https://www.youtube.com/watch?v=pCBufhnrbDE
<bslsk05> ​'ITRI's Scrabble-bot first look at CES 2018' by Engadget (00:01:49)
<kingoffrance> or at least, some kind of fractal jurassic park thing, chaos theory
<zid> of course it is
<kingoffrance> if php is a fractal of bad design, what is english?
<kingoffrance> words cannot describe
<mrvn> minus the AI. just placement of tiles.
eddof13 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<klange> kingoffrance: English is a pickpocketer that has pilfered phonetics and philology from parents and pals
<zid> English corners languages in dark alleys for loose words and grammar
<clever> english isnt a language, its 5 languages in a trench-coat :P
<heat> tbf english is pretty simple when compared with other european languages for instance
<clever> "i before e except after c" was it?
<zid> no
<clever> how many exceptions are there to that rule?
<zid> that has never once been a rule, just a bad joke
<zid> English is complicated by just.. baggage, none of the vowels line up with how they're *presently* pronounced (which will again change), has more vowels than it has vowel letters by approximately 25, variously re-spells and doesn't respell foreign origin words, which eliminates things like phonotactical rules etc
<kazinsal> English is the Habsburg jaw of languages
<mrvn> I think I will use my favourite way to write asm to implement aarch64 boot.S: gcc -S
<mrvn> Urgs, that's a new warning: kernel/main.cc:27:10: warning: array subscript 0 is outside array bounds of ‘volatile unsigned int [0]’ [-Warray-bounds] 27 | while (*UART0_FR & (1 << 5) ) { }
<zid> incredibly free form grammar also makes incredible nuance often present
FreeFull has quit []
<zid> but at least it isn't japanese.
<mrvn> Since when can't I dereference a pointer?
<geist> dunno, how did you declare the UART0_FR?
<geist> if it's some [0] stuff that's probably not okay anymore
<mrvn> volatile unsigned int *UART0_FR = (volatile unsigned int *)(UART0_BASE + 0x18);
<zid> oh no, french uarts
<geist> huh. that's odd
<mrvn> zid: I think it's the Flag Register
<heat> mrvn, you can always use C
<heat> no need to -S
<mrvn> heat: boot.S. Have to setup the stack and MMU first.
<klange> stack yes, mmu not really I do my mmu setup in C
<heat> yes, C still works
<heat> all you need to give it is a stack
<heat> and that's trivial
<geist> not really. no
<mrvn> heat: nope, the code isn't 100% position independent. Have to map it to a fixed address first.
<geist> in this case there's a fair amount of setup before you can do arm64
<heat> how?
<geist> stack, control registers, position independent code, etc
<clever> PIC is the biggest problem i can see there
<geist> but you don't have to write a *lot* of asm, you just can't get away with doing it in pure C
<mrvn> heat: any static initialization involving an address of something is not PIC
<geist> there are some control regs you need to set up pretty quickly because of assumptions of C
<geist> SCTLR bits basically
<heat> well, yes, I meant that you write a small _start that pretty much starts calling C code
<geist> not a big deal, just gotta do it before getting into C
<mrvn> heat: and I do that in C, g++ -S, check and fix PIC issues.
<geist> yah or godbolt, good way to learn the arch