klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
<heat> sorry
<mjg> + if (sysctlbyname("debug.flame_graph_data", buf, &bytes, NULL, 0) < 0) {
<mjg> + perror("debug.flame_graph_data");
<mjg> + exit(1);
<mjg> + }
<mjg> this is how userspace gets it
<heat> they're not using kmem, but it's kmem compatible
<heat> UNIX(r) Backwards Compatibility(tm)
<mjg> look i don't know why the macro is there
<mjg> i suspect a leftover
<mjg> uh
<mjg> + bytes = sizeof(size_t) +
<mjg> + sizeof(struct flame_graph_pcpu) +
<mjg> + sizeof(struct flame_graph_entry) * FLAME_GRAPH_NENTRIES;
<mjg> that is defo crap
<mjg> but the kinks of the interface can be worked out later
<mjg> if you don't have a general exporter from the kernel you can just temporarily add a fcntl or something
<heat> i'm adding a device file
xenos1984 has quit [Read error: Connection reset by peer]
<mjg> just to get this off the ground
<heat> as a true unix fan
<mjg> oh a man of culture
<heat> almost as good as /dev/poll
<mjg> or /dev/sliding_doors
<mjg> all this talk however reminds me to implement cheap lock profiling for freebsd
<mjg> with stacktraces and hookers
<mjg> there is this funny bit where there are 2 mechanisms, one turbo expensive based on dtrace, another kind of ok, but unable to grab stacks
<mjg> making it rather limited
<mjg> and by cheap i mean on the cpu, i defo expect to hog some ram
<mjg> and may even need to dedup stacks :[
<heat> do I need dtrace?
demindiro has quit [Quit: Client closed]
<heat> how does that even work
<mjg> you want something for instrumentation *at some point*
<heat> i'm only familiar with ebpf
<mjg> dtrace provides a dsl, but oversimplifying one could argue they provide similar features
<mjg> [i know ebpf is more powerful :>]
<mjg> so not sure what you mean by how
<mjg> for example there is func entry/return tracing and that patches standard func prolog/exit with int3
freakazoid332 has quit [Ping timeout: 244 seconds]
<mjg> which i suspect ebpf is also doing
<mjg> (or maybe they have nops now)
<heat> is dtrace also bytecode based?
<mjg> internally it has a vm of sorts afair, but you are not expected to know something like that as its user
<mjg> key design decision behind dtrace was to make sure it is safe to use in production
<mjg> to that end they implemented a dsl without loops and whatnot
<mjg> so they can prove the code finishes
<mjg> personally i like it to an extent, but it does come with limitations which are sometimes a problem
<mjg> you want to port something, port ebpf
<mjg> :>
<mjg> but i'm pretty sure that's not a weekend project, while flamegraph generation definitely is
<mjg> well if you are asking if dtrace scripts compile to direct cpu asm, then no
<heat> yeah but they compile to some bytecode yeah?
<mjg> perhaps you have been warped by systemtap et al which create c -> asm -> actual cpu binary code
<heat> or do you just pipe strings to the kernel?
<mjg> no it has some form of a bytecode, but i never looked into it
<mjg> and i don't know who is doing the compilation
<mjg> plausibly the kernel while trying to elide or combine probes
<mjg> anyway
<mjg> + bytes = sizeof(size_t) +
<mjg> + sizeof(struct flame_graph_pcpu) +
<mjg> + sizeof(struct flame_graph_entry) * FLAME_GRAPH_NENTRIES;
<mjg> style or layering aside, this is a bug
<mjg> you most likely want to use offsetof
<heat> yeah
<mjg> or better yet sizeof(struct fg_export_entry) * entries
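For example, assuming the per-cpu header is immediately followed by a flexible array of entries (struct flame_graph_entry and FLAME_GRAPH_NENTRIES are defined elsewhere; the field names here are guesses):

    #include <stddef.h> /* offsetof */

    struct flame_graph_pcpu {
        size_t nentries;
        struct flame_graph_entry entries[]; /* flexible array member */
    };

    /* let the compiler account for any padding instead of hand-summing sizeofs */
    size_t bytes = offsetof(struct flame_graph_pcpu, entries) +
                   sizeof(struct flame_graph_entry) * FLAME_GRAPH_NENTRIES;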
<mjg> i would just write a short tool which grabs how many seconds to sample for, tells the device and everyone is happy
<mjg> once it's all done the ioctl returns and your user buf is populated
<mjg> well you need to know how much memory to alloc for it, so you could ask the device
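Something like the following, with the device path and ioctl numbers invented purely for illustration:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define FG_IOC_GET_BUFSIZE 0x4600 /* hypothetical */
    #define FG_IOC_SAMPLE      0x4601 /* hypothetical: blocks until sampling ends */

    int main(int argc, char **argv)
    {
        int secs = argc > 1 ? atoi(argv[1]) : 10;
        int fd = open("/dev/flamegraph", O_RDONLY);
        if (fd < 0) { perror("/dev/flamegraph"); return 1; }

        size_t bufsize;
        if (ioctl(fd, FG_IOC_GET_BUFSIZE, &bufsize) < 0) { perror("ioctl"); return 1; }

        /* returns once the sampling window has elapsed */
        if (ioctl(fd, FG_IOC_SAMPLE, &secs) < 0) { perror("ioctl"); return 1; }

        char *buf = malloc(bufsize);
        if (!buf) { perror("malloc"); return 1; }
        if (read(fd, buf, bufsize) < 0) { perror("read"); return 1; }
        fwrite(buf, 1, bufsize, stdout); /* decode/collapse downstream */
        return 0;
    }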
demindiro has joined #osdev
xenos1984 has joined #osdev
<heat> legit the most horrifying code I've written
<heat> i'm taking all the shortcuts
<heat> like a true unix
<mjg> btw what would be great is lock contention flamegraphs
<zid> sizeof (struct fg_export_entry[entries]) for life
<zid> even if it does need VLA support for the syntax
<mjg> as in you have lock1 and wait for lock2
demindiro has quit [Quit: Client closed]
<heat> mjg, i eventually want an actual tracing system like what perfetto can do
<mjg> indicate this is what happened, so that when you see crazy lock1 wait times, you know where they came from
<bslsk05> ​ui.perfetto.dev: Perfetto UI
<heat> see the android example
<heat> it's so sweet
<mjg> never played with it
<mjg> meh
<mjg> i'm blanking on a name, but there are tools which do flamegraph-y stuff with a time axis
<mjg> one thing you can easily add to have a leg up on unix
<mjg> is to track off cpu time waiting on i/o 'n shit
<mjg> and add it to time(1)
Iris_Persephone has quit [Ping timeout: 244 seconds]
<mjg> i wanted to add it on freebsd, but you can't without screwing the abi
<mjg> or adding a new variant of waitpid which exports the bigger struct
<mjg> and i don't think doing that just to get the extra numbers i can arguably obtain with dtrace is justifiable
<mjg> hmmm
<mjg> now that i said it
<mjg> wait4 takes an options flag
<mjg> i could add one which indicates "the target rusage area is actually the extended stuff"
<mjg> ye i think i'm gonna do it
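From userspace the idea could look like this; the flag bit and the extended struct are invented here to illustrate it, nothing of the sort exists in FreeBSD:

    #include <stdint.h>
    #include <sys/resource.h>
    #include <sys/wait.h>

    #define WRUSAGE_EXT 0x08000000 /* hypothetical option bit */

    struct rusage_ext {
        struct rusage ru;      /* existing ABI prefix, unchanged */
        uint64_t ru_offcpu_ns; /* new: off-cpu time spent waiting on i/o etc. */
    };

    int main(void)
    {
        int status;
        struct rusage_ext rux;
        /* the flag tells the kernel the rusage pointer is really the bigger struct */
        wait4(-1, &status, WRUSAGE_EXT, (struct rusage *)&rux);
        return 0;
    }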
<heat> that seems controversial
<mjg> have you ever seen perf stat?
<heat> yes
<bslsk05> ​dpaste.com: dpaste: 9LYHTL6R8
<mjg> freebsd has an equivalent
<mjg> now imagine extending that with off cpu stuff, which is blatantly missing
<mjg> ... and which i know for a fact is a major factor
gxt has quit [Read error: Connection reset by peer]
<bslsk05> ​reviews.freebsd.org: ⚙ D24217 amd64 pmap: fine-grained pv list locking
gxt has joined #osdev
Matt|home has quit [Quit: Leaving]
Iris_Persephone has joined #osdev
<heat> ok i have something
<heat> lets see how it does
<mjg> did it crash? :)
elastic_dog has quit [Ping timeout: 260 seconds]
<heat> yes
<heat> my "is stack pointer out of bounds logic may be broken"
<mjg> i would say first iteration should just sample IP
<mjg> and fuck everything else
<mjg> once you know general machinery works you can start unwinding
<mjg> but you do you
<mjg> now that i mention it
<mjg> you can check if ip falls within kernel range
<mjg> and if not, ignore the sample
<mjg> or write a placeholder
<mjg> as you sample
<mjg> storing a magic value like 0 could then be post processed to '[userspace]'
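As a sketch, with KERNEL_BASE and the recording helper standing in for whatever Onyx actually uses:

    static void sample_ip(uintptr_t ip)
    {
        if (ip >= KERNEL_BASE)
            record_frame(ip); /* genuine kernel sample */
        else
            record_frame(0);  /* magic 0, rewritten to "[userspace]" in post-processing */
    }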
<mjg> bailing for the day, cheers
elastic_dog has joined #osdev
<mjg> fg repo https://github.com/brendangregg/FlameGraph.git see stackcollapse.pl
<heat> yeah I see
<heat> thanks for the help
<mjg> now i wonder if a nasty user which just jumps to a kernel address would fuck this up
<heat> how?
<mjg> they literally jmp $something_in_kernel
<mjg> but before they crash
<mjg> your sampling func finds it
<mjg> i would say something to worry about later
<heat> but they're crashing immediately
<mjg> if there is no time window for this to get a false positive that's fine with me
<mjg> haha wtf, see aix-perl.pl in the repo
<mjg> holy shit
<mjg> this has to be SO BAD
<heat> what's it doing?
* heat can't read perl
<mjg> + foreach my $pid (@proc){
<mjg> + my $command = "/usr/bin/procstack $pid";
<mjg> + print `$command 2>/dev/null`;
<mjg> + }
<mjg> you don't need to know perl for this one
<mjg> and this happens in a loop
<heat> getting stacks for userspace processes?
<mjg> getting stacks for everything i would think
<mjg> but this is so many forks and execs it has to be disfiguring the shit out of everything
<mjg> also how often can you sample
<mjg> defo not 1000 per second
Persephone has joined #osdev
Iris_Persephone has quit [Ping timeout: 252 seconds]
vdamewood has quit [Read error: Connection reset by peer]
vdamewood has joined #osdev
gog has quit [Ping timeout: 246 seconds]
frkzoid has joined #osdev
srjek has quit [Ping timeout: 244 seconds]
elastic_dog has quit [Ping timeout: 244 seconds]
elastic_dog has joined #osdev
nexalam_ has joined #osdev
Persephone is now known as Iris_Persephone
nexalam__ has quit [Ping timeout: 260 seconds]
Iris_Persephone has quit [Quit: Leaving]
freakazoid332 has joined #osdev
frkzoid has quit [Ping timeout: 244 seconds]
freakazoid332 has quit [Ping timeout: 260 seconds]
epony has quit [Ping timeout: 252 seconds]
saltd has quit [Remote host closed the connection]
saltd has joined #osdev
[itchyjunk] has quit [Remote host closed the connection]
vdamewood has quit [Read error: Connection reset by peer]
vdamewood has joined #osdev
<heat> mjg, flamegraphs need a stable frequency right?
<heat> I can't just plop a sampling point
<heat> yeah that sounds about right
<heat> ...or does it
<heat> yeah probably
<heat> else it wouldn't really correlate to time
<kof123> flamegraphs, like winamp ? j/k...but not really...
<mrvn> heat: you could divide the amounts by the time interval to normalize it. But the flames wouldn't fade out right unless you tell it the time passed too
<mrvn> you lose resolution on the flame when you add samples too slowly.
epony has joined #osdev
moberg has quit [Quit: Disconnecting]
moberg has joined #osdev
Andrew is now known as haxcpu
GeDaMo has joined #osdev
<heat> https://gist.github.com/heatd/07ac7ba0be21e5e90a5ae1b56e969148 last flamegraph of the night (now day)
<bslsk05> ​gist.github.com: onyx-flame.svg · GitHub
<heat> without the staggering 70% idle
<heat> now I just need a way to get info about locks and sleeping processes :v
<heat> some of this stuff is making me real worried though
<heat> particularly pselect6 being 50%+ of the samples
<heat> and malloc + spinlocks inside malloc popping up there
<heat> I know musl's malloc was bad but this bad? just 4 threads :|
heat has quit [Ping timeout: 260 seconds]
<mrvn> it might say it's in pselect6 when it's sleeping there.
<mrvn> or rotating in a spin lock.
<mrvn> This is what a flame graph should look like: https://www.youtube.com/watch?v=jUJiULU4i0k
<bslsk05> ​'XFlame: From the XScreenSaver Collection, 1999.' by yesthatjwz (00:02:00)
wootehfoot has joined #osdev
wootehfoot has quit [Quit: Leaving]
vdamewood has quit [Read error: Connection reset by peer]
vdamewood has joined #osdev
bauen1 has quit [Ping timeout: 264 seconds]
saltd has quit [Remote host closed the connection]
saltd has joined #osdev
CryptoDavid has quit [Quit: Connection closed for inactivity]
vdamewood has quit [Read error: Connection reset by peer]
vdamewood has joined #osdev
gog has joined #osdev
lkurusa has quit [Quit: I probably fell asleep (or went out). Who will ever know.]
bauen1 has joined #osdev
<bslsk05> ​faultlore.com: Compiler Optimizations Are Hard Because They Forget - Faultlore
<kazinsal> me: man the early research unix filesystem sucks, I can probably design something better off the top of my head
<kazinsal> also me: [accidentally shits out a design that looks eerily close to the minix filesystem]
<clever> > And really at what point does the cost outweigh the benefit and aaAAaaAAAaAAAAA!!!!!
<clever> lol :D
<MelMalik> kazinsal, hehe
<zid> I already designed the best possible filesystem, dw
<sham1> kazinsal: thus is the hubris
<MelMalik> you probably can, though, just don't think that you can
<kazinsal> the research unix fs(5) optimizes for fast search of free block and inode lists through the use of a block list in the superblock + additional indirect blocks whereas the minix fs(5) -- and mine, apparently -- just allocates a contiguous block bitmap on disk and uses the contiguous list of inodes themselves to signal free/used status
<MelMalik> i wish you great luck
<gog> meow
* kazinsal gives gog headpats
* gog prrr
<gog> i'm doing inadvisable stuff with c++ and it's fun
<gog> namely, using c++ at all
<clever> kazinsal: zfs doesnt store the actual state of the free space map, but rather has a log, where every allocation/free is recorded
<clever> and when loading a spacemap, it creates a set of buckets, with holes of 2^n long in ram
<kazinsal> this filesystem is intended to run on an 8088 lol
<clever> ah, if you're low on ram, yeah
qookie has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
Arsen has quit [Quit: Quit.]
qookie has joined #osdev
Arsen has joined #osdev
vdamewood has quit [Read error: Connection reset by peer]
vdamewood has joined #osdev
StoaPhil has quit [Quit: WeeChat 3.6]
[itchyjunk] has joined #osdev
SpikeHeron has joined #osdev
dude12312414 has joined #osdev
srjek has joined #osdev
demindiro has joined #osdev
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
demindiro has quit [Ping timeout: 252 seconds]
epony has quit [Remote host closed the connection]
epony has joined #osdev
<vin> Is it possible to make writes slower (by injecting some delay) than reads to memory? For example memcpy(region2, region1, size) should be 2x faster than memcpy(region1, region2, size).
<sortie> vin, there's some caching options available e.g. through the Page Attribute Table on x86, which may be interesting or useful to you. Can you tell us more about your use case?
<sortie> E.g. write combining is usually turned on for RAM by default, so writes happen asynchronously in a deduplicated, combined manner. For video memory, e.g., you would often want to turn it on to make writing to it more efficient. Meanwhile you definitely don't want it for memory mapped registers.
<vin> Thanks I'll check it out sortie. I am trying to emulate a device where reads are faster than writes.
wootehfoot has joined #osdev
wootehfoot has quit [Remote host closed the connection]
wootehfoot has joined #osdev
<vin> Actually it's not the latency that I want to change but rather the bandwidth, where write bandwidth should be worse than read.
<vin> I am thinking of running parallel workloads to keep the memory controller busy when write operations are happening (inducing the slowdown)
frkazoid333 has joined #osdev
<sortie> If you turn off write combining, you might slow them down considerably
<sortie> But idk feels a bit futile
<jimbzy> Hey sortie!
<jimbzy> How have you been?
<sortie> eyyo jimbzy!
<sortie> Oh man been out there traveling the world, partying to the best music :)
<jimbzy> Nice!
<jimbzy> Sounds like time well spent
<sortie> Plus you know working a bunch and even doing some osdev :)
<sortie> I just finished up my new init system, going through code review now!
<sortie> Only took me like half a decade to finish it
<jimbzy> Sweet!
<jimbzy> Hey, it's just sands in the hourglass, amiright?
<sortie> Wait who told you that's privileged information!
<jimbzy> XD
gxt has quit [Ping timeout: 258 seconds]
xenos1984 has quit [Ping timeout: 246 seconds]
xenos1984 has joined #osdev
wootehfoot has quit [Ping timeout: 246 seconds]
<geist> woot
<gog> woot
xenos1984 has quit [Ping timeout: 246 seconds]
<zid> ..w-woot?
<zid> I am like the dog wagging its tail when the humans get good news
<zid> just happy to be included
<GeDaMo> I bought some RAM, you can 'woot' about that if you want :P
<zid> yay, did you buy me any?
<GeDaMo> Certainly didn't
<zid> I need 2x8GB PC3-14900
<zid> I don't have any paypal dollars atm
<zid> and amazon wants triple the price
wootehfoot has joined #osdev
<zid> GeDaMo: you're supposed to brag about what you got
<GeDaMo> I bought 16GB, now I have 20GB! :P
<zid> sick
<zid> dual channel is just a scam by big ram to sell you more dimms afterall
<gog> i added 8 to my machine last week
<zid> I'll trade you 32GB of DDR4 for 4x8GB DDR3 1866MHz URDIMMs
<gog> i don't have
<zid> yea nobody does unless they happen to own a mac pro 2013
<gog> sorry :(
<bslsk05> ​www.amazon.co.uk: 503 - Service Unavailable Error
<zid> 12800? didn't even go for the 14900?
<zid> also that's a fucking lot of money or that kit
xenos1984 has joined #osdev
<bslsk05> ​www.ebay.co.uk: SK Hynix 2x8GB (16GB Total) 2Rx8 PC3L 12800U Desktop Memory | eBay
<gog> £22.33 delivery
saltd has quit [Quit: joins libera]
<zid> you're in iceland
<zid> so.. I bet
<gog> oh right
<gog> might try to convince my boss to let me build a workstation
<gog> on his dime
<zid> Your workstation needs 32GB of 32GB URDIMMs btw
<gog> yes
<zid> for the.. monitor
<gog> yes the monitor memory
<zid> but you should definitely let me do a professional inspection first
<gog> so it can show images faster
wootehfoot has quit [Read error: Connection reset by peer]
<zid> free of charge
<zid> My monitor needs ECC tbh, it snows for a bit when cold
<zid> The CRT next to it is still perfect though :P
frkazoid333 has quit [Read error: Connection reset by peer]
demindiro has joined #osdev
isaacwoods has joined #osdev
heat has joined #osdev
<heat> whats up bozos
<mjg> heat: the fuq were you doing to get pselect high on the profile
<mjg> is that from a gmake run?
kof123 has quit [Ping timeout: 268 seconds]
<heat> lmao
<heat> make -j4
<heat> on will-it-scale, output all piped to devnull
<heat> there's a lot of shit I know is slow (like mmap and munmap)
<heat> my vm rb tree isn't great
<heat> needs some love
<mjg> that pselect shit is terribad
<mjg> note gmake uses a shared pipe between all children
<mjg> as in all gmake procs are pounding on it
bauen1 has quit [Ping timeout: 246 seconds]
<heat> i did find some good optimization on it
<heat> that svg is better than what I had lmao
<heat> I think malloc is really struggling there
<bslsk05> ​github.com: Onyx/spinlock.cpp at master · heatd/Onyx · GitHub
<mjg> you want this instead: do { cpu_relax(); } while (....);
<mrvn> mjg: it's a token system. There are 4 tokens in the pipe and every make fork waits to read a token, compiles and writes it back.
<mjg> mrvn: i'm painfully aware
<mjg> this does not scale for shit on bigger systems
<mjg> should not be visible with 4 workers tho:-P
<mrvn> usually make jobs don't just last for milliseconds. :)
<bslsk05> ​gist.github.com: onyx-flame-vfsmix.svg · GitHub
<heat> mjg, think that makes a difference?
<mjg> at this scale it wont, your spinlocks don't scale anyway
<mjg> but that's the idiom
<zid> hey, I'm not an idiom, YOU'RE an idiom
<mjg> you literally just failed the op, there is 0 reason to instantly load again
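Spelled out, the idiom is a test-and-test-and-set loop; a generic sketch with C11 atomics rather than Onyx's actual spinlock type:

    #include <stdatomic.h>

    struct spinlock { atomic_uint lock; };

    static inline void cpu_relax(void)
    {
        __asm__ volatile("pause"); /* x86; be polite to the sibling thread */
    }

    static void spin_lock(struct spinlock *l)
    {
        while (atomic_exchange_explicit(&l->lock, 1, memory_order_acquire))
        {
            /* failed the op: spin on plain loads, don't instantly retry the RMW */
            do
            {
                cpu_relax();
            } while (atomic_load_explicit(&l->lock, memory_order_relaxed));
        }
    }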
<mjg> wow, wtf
<mjg> @ vfsmix fg
<mjg> do you have a global rwlock for lookups or something?
<heat> no
<heat> rwlock for each dir
<heat> they're contending all on /tmp
<mjg> ok, all these use the same dir anyway
<mjg> but it would be of note if they also fuck each other up when reaching /tmp
<mjg> ok i see
<heat> it's really worrying to me how much it's spinning while dequeuing itself when waking up
<mjg> you instantly go to sleep when faced with contention
<mjg> this alone is a huge problem
<mjg> in rw_lock_tryread:
<mjg> do
<mjg> {
<mjg> l = lock->lock;
<mjg> this load should 1. use atomics 2. be moved prior to the loop
<mjg> the cmpxchg op updates the value for you
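Applied to the try-read loop above (the flag bits and field names are assumed, not Onyx's real layout):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define RW_WRITER 1UL /* assumed: low bit marks a writer */
    #define RW_READER 2UL /* assumed: reader count increment */

    struct rwlock { atomic_ulong lock; };

    static bool rw_lock_tryread(struct rwlock *lock)
    {
        /* one atomic load, hoisted out of the loop */
        unsigned long l = atomic_load_explicit(&lock->lock, memory_order_relaxed);

        do
        {
            if (l & RW_WRITER)
                return false;
            /* on failure, cmpxchg refreshes l with the current value for us */
        } while (!atomic_compare_exchange_weak_explicit(&lock->lock, &l,
                                                        l + RW_READER,
                                                        memory_order_acquire,
                                                        memory_order_relaxed));
        return true;
    }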
<mjg> that's minor, what you really need to change is the going off cpu stuff
<heat> i have some spinning in mutex.cpp
<heat> might be able to port it to rwlocks too
<zid> My suggestion is tr/.*/return -EAGAIN/g
<mjg> do you have means to safely access lock owner?
<mjg> to check whether it is off cpu
<mjg> oh shit
<heat> hrm, no
<mjg> rwlock_prepare_sleep
<mjg> that's what you are leading with?
<heat> mutex does tho
<mjg> that's openbsd quality man
<mjg> :)
<heat> i do a fast path?
<mjg> oh, you sneaked it on top, my bad
<mjg> fast path should not be mixed with the slow path man
<mjg> you are partially defeating the purpose
<mjg> anyway, general mechanism aside, this will mostly clean itself up if you add adaptive spinning
<heat> something that's worrying is how sched_is_preemption_disabled is so expensive
<mjg> now waiting for *writers* is easy since you can know what they are
<heat> it's a fucking trivial function
<heat> 0xffffffff8019de20 <+0>: push %rbp
<heat> 0xffffffff8019de21 <+1>: mov %rsp,%rbp
<heat> 0xffffffff8019de24 <+4>: mov %gs:0x7fe624fc(%rip),%rax # 0x328 <preemption_counter>
<heat> 0xffffffff8019de2c <+12>: test %rax,%rax
<heat> 0xffffffff8019de2f <+15>: setne %al
<heat> 0xffffffff8019de32 <+18>: pop %rbp
<mjg> the real problem is readers, since you have no idea if all of them are on cpu
<heat> 0xffffffff8019de33 <+19>: ret
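The paste corresponds to a trivial per-cpu read, something like this reconstruction from the disassembly (the per-cpu accessor name is a guess):

    #include <stdbool.h>

    bool sched_is_preemption_disabled(void)
    {
        /* one %gs-relative load of the per-cpu counter, then a test */
        return get_per_cpu(preemption_counter) != 0;
    }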
<mjg> is that the real func tho?
<mjg> i presume you are not really tracing interrupts here
<heat> how is it not the real func?
<heat> wdym tracing interrupts?
<mjg> the way i see it you may be doing something really nasty in there and your code will mistakenly sample code which got irq'ed
<heat> I trace whenever my timer fires
<mjg> do you handle interrupts on the same stack?
<heat> yes
<mjg> hm
<mjg> then it is indeed peculiar
<mjg> are you using kvm here? it is plausible perf from the host will be able to shed some light
<heat> yes I am and no I can't use it for some reason
<mjg> it is a hit or miss whether it works, when it does, it can provide you with ips
<mjg> right, i never found out why it sometimes did not work and then it stopped being a problem for me :)
<heat> heard about intel pt, maybe it can give me more data?
<mjg> i would say the lock stuff is a problem and has a known path forward
<mjg> so i would take care of it first
<mjg> maybe the above will make itself clear later
<heat> anyhow, I like how this case is mostly action, no bad blocking problems
<heat> the make -j4 + gcc one is nasty
<heat> 70% idle :|
<mjg> it's probably idle from you going off cpu
<mjg> :)
<mjg> with lock waiters
<heat> yea
<heat> which is why I kinda want to trace that
<mjg> i would say should not be very hard
<mjg> the general mechanism you used for sampling can be repurposed
<mjg> you store stacktraces + off cpu time
<mjg> and maybe lock addr
<mjg> then again you export this into a flamegraph
<mjg> except instead of counts you get cumulative time
<heat> + off cpu time? what do you mean
<mjg> before = timestamp(); go_off_cpu(); off_cpu_time = timestamp() - before;
<heat> yeah but where would I store this?
<heat> not in this current system I have for sure
<mjg> you copy-patse the code you have for on cpu sampling
<mjg> you can sacrifice space for 1 frame and just replace it with the time
<mjg> and have your tooling know about
<mjg> it
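A sketch of that, with timestamp(), go_off_cpu(), unwind_stack() and the frame layout all standing in for whatever the real code uses:

    #define FG_FRAMES 32 /* assumed frames per sample */

    struct flame_graph_entry {
        uintptr_t frames[FG_FRAMES];
    };

    static void trace_off_cpu(struct flame_graph_entry *e)
    {
        uint64_t before = timestamp();
        go_off_cpu(); /* block until woken up */
        uint64_t slept = timestamp() - before;

        /* same buffer as on-cpu samples; the sacrificed last slot carries time */
        unwind_stack(e->frames, FG_FRAMES - 1);
        e->frames[FG_FRAMES - 1] = (uintptr_t)slept;
    }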
<heat> hrm
<heat> i'll have a think about it
<heat> I was thinking about getting Real Tracing(tm)
<mjg> that sounds like several weeks of work
<mjg> :->
<heat> is it?
<mjg> i would say best bang for the buck right now is the above, but you do you
<heat> get a ring buffer and write events to it
<mjg> i thought you want ebpf-esque soution
<heat> noooooo
<mjg> being real tracing 'n shit
<heat> that's too fancy
<mjg> well my proposal is above, i think very simple to tackle onto your existing code
<mjg> but it's your pick
<heat> btw I don't know if you're looking at the vfsmix svg still but most of my write(2) time is in malloc
<heat> :v
<heat> I may have outgrown this garbage malloc
<mjg> i'm kind of path lookup biased
<mjg> ;)
<heat> if I went with your approach, I couldn't use flamegraphs right?
<mjg> yep looks pretty crap
<mjg> why not?
<mjg> you do realize all the flamegraph stuff is just a fancy presentation of whatever you stacked up in there
<heat> would replacing the frequency with the time work?
<mjg> the numbers are whatever the fuck you please
<mjg> there are even flamegraphs for file sizes
<mjg> at the bottom i added a special 'frame' "all", the value is total sleep time
<mjg> above that is wait channel
<mjg> again, value is sleep time on that fucker
<mjg> and above that are stacktraces
<heat> that graph screams "prince"
<mjg> by "convnetion" you add --color=io and that's how itl ooks like
<heat> where do you add that? flamegraph.pl?
<mjg> the --color arg to flamegraph.pl
Terlisimo has quit [Quit: Connection reset by beer]
<mjg> all the rest is just the input file
<mjg> lemme generate an example real quick
<mjg> well will be 5, have to boot my test box
<mjg> heat: perl ~/mjg/FlameGraph/stackcollapse.pl | perl ~/mjg/FlameGraph/flamegraph.pl --color=io > out-off.svg
<heat> what's that name above the number?
<mjg> in most cases lock name
<mjg> in others it is wait channel
<mjg> you can just prop your lock address for the time being
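So a folded input line for the off-cpu graph would look something like this (made-up stacks; the values are sleep time rather than sample counts, bottom frame first):

    all;rwlock:0xffff80001a2b3c40;sys_open;namei;rw_lock_write 184732
    all;pipe_wait;sys_read;pipe_read 912044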
<heat> also really important question
<heat> why are lots of vfs functions in freebsd all-caps
<heat> like VOP_LOCK1_APV
<mjg> i don't know how that came to be, i suspect they started as macros
<heat> why were they not changed?
<mjg> that's like the smallest problem about them
<heat> yes but erm
<heat> is weird
<mjg> i did not even think to do something like that
<mjg> instead of talking shit about vfs
<heat> maybe a genius
<mjg> vmocol
<mjg> kernel`vm_object_collapse+0x13b
<mjg> the crap i complained about already visible
<mjg> the trace is from building the kernel
<mjg> heat: a safe bet is lsd
GeDaMo has quit [Quit: Physics -> Chemistry -> Biology -> Intelligence -> ???]
Terlisimo has joined #osdev
<mjg> re sched_is_preemption_disabled, i wonder if the func is fine, but you are just calling it a metric fuckton
<mjg> also why does it exist to begin with
<mjg> as in the code should know
<mjg> hm now that i said it, are you sure you are allocating per-cpu memory in a manner which avoids cache bouncing with other threads?
<mjg> bare minimum you want per-cpu bufs with sizes rounded up to a multiple of 128
<mjg> oh man you use linker sets for this fuckery?
<mjg> i would argue waste of cpu time
<mjg> [not that freebsd is better :]]
<mjg> heat: ok man, your cred code needs a revamp
<mjg> heat: stock standard approach is to have copy-on-write objs
<mjg> heat: your rw lock around them is a serious problem
<mjg> heat: well will be once you fix the current stuff :)
<zid> we don't take kindly to people who know what they're talking about mjg
demindiro has quit [Ping timeout: 252 seconds]
<mrvn> Is there something worse than ordering food online, getting an ETA with a countdown clock, confirmation that the food is on the way, and yet nothing arrives? Do they just fake all the status updates, seriously?
<gog> grubhub?
<mrvn> lieferando. They even have a marker on the map where the delivery bike is supposed to be.
<gog> hm
<gog> they might be faking it
<mrvn> Oh, the status has changed: "Moaz ist mit Deiner Bestellung bei Liki Burger auf dem Weg." ("Moaz is on the way with your order from Liki Burger.") Was just confirmed to be on the way before.
<gog> auf dem Weg
<gog> possible that the driver jumped the gun on confirming pickup
<gog> or bicyclist
<mrvn> now the bike is moving....
<mrvn> If only the food were free if they take too long to deliver.
bauen1 has joined #osdev
freakazoid332 has joined #osdev
C-Man has joined #osdev
gxt has joined #osdev
freakazoid332 has quit [Ping timeout: 244 seconds]
frkzoid has joined #osdev
buffet has left #osdev [The Lounge - https://thelounge.chat]
frkzoid has quit [Ping timeout: 244 seconds]
<mjg> mrvn: these statuses are mostly fake
vdamewood has quit [Read error: Connection reset by peer]
xenos1984 has quit [Read error: Connection reset by peer]
opal has quit [Ping timeout: 258 seconds]
vdamewood has joined #osdev
scoobydoo has quit [Ping timeout: 244 seconds]
scoobydoo has joined #osdev
xenos1984 has joined #osdev
freakazoid332 has joined #osdev
DanDan has quit [Ping timeout: 252 seconds]
scoobydoo_ has joined #osdev
scoobydoo has quit [Ping timeout: 265 seconds]
scoobydoo_ is now known as scoobydoo
freakazoid332 has quit [Ping timeout: 244 seconds]
<heat> mjg, sorry for ignoring you, had to sleep what I didn't sleep last night
<heat> linker sets?
<heat> you mean a linker section?
<mjg> and all this time i thought you were furiously coding off cpu tracking
* mjg is disappointed
<heat> i was furiously coding on-cpu tracking last night til 8am
<mjg> on a serious note i wrote some genuine feedback
<mjg> with the cred stuff being lowest priority, but definitely to-be-fixed
<heat> think that will screw something up?
<heat> what's the alternative, refcounted struct cred + cmpxchg on write?
<heat> i thought about an rw lock because it's genuinely a case where you're probably not writing much
<heat> hence no need to worry about writer starvation
<mjg> copy-on-write
<mjg> free access at all times, no need to synchro squat
<mjg> apart from one special case
<heat> ...which sounds exactly like what I'm thinking
<mjg> typical approach is you check at user<->kernel boundary if your creds are current
<heat> hm
<mjg> performance problems of rw locking aside, you are establishing a lock ordering
<mjg> which, if you add lock ordering verification, i guarantee will eventually show deadlocks
<mjg> for example if you happen to hold this across i/o which writes to something which gets a page fault
<mjg> an improvement on checking for creds specifically is recognizing there may be other COW structs to sync
<mjg> and instead having a struct cow_objs { ... } thing you have a pointer to
<heat> why deadlocks?
<mjg> or some form of a generation counter
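In sketch form, the COW pattern being described (all names illustrative, not existing Onyx code; a real version needs a proper refcount type and careful publication):

    struct cred {
        unsigned long refcount;
        uid_t euid;
        gid_t egid;
        /* ... never mutated after publication ... */
    };

    /* reader fast path: the thread's private snapshot cannot change under it */
    static inline struct cred *current_cred(struct thread *t)
    {
        return t->cred;
    }

    /* writer: copy, modify, then publish a fresh object */
    void commit_creds(struct process *p, struct cred *newc)
    {
        spin_lock(&p->cred_lock); /* only updaters contend here */
        struct cred *old = p->cred;
        p->cred = newc;
        spin_unlock(&p->cred_lock);
        cred_put(old); /* freed once the last snapshot holder drops it */
    }

    /* at the user<->kernel boundary: resync the snapshot if it went stale */
    void cred_resync(struct thread *t, struct process *p)
    {
        if (t->cred != p->cred)
        {
            spin_lock(&p->cred_lock);
            struct cred *newc = cred_get(p->cred); /* takes a reference */
            spin_unlock(&p->cred_lock);
            cred_put(t->cred);
            t->cred = newc;
        }
    }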
<mjg> ok, just trust me on this one, if you get a big enough kernel, all possible lock orderings which show up are pretty funny
<heat> oh i know
<mjg> did you know SOLARIS, world-famed SMP kernel does not have a lock ordering verification facility?
<mjg> while linux, a hippie-written kernel, does?
<heat> lmao
<heat> damn communists
<mjg> i'm guessing solaris kernel devs just don't write deadlocks
<mjg> 's why
<heat> world-renowned SMP experts
<heat> oh yea
<heat> something I want to ask
<heat> what's with unix and genunix?
<bslsk05> ​www.illumos.org: Bug #13243: deadlock on ZFS during concurrent rename and mkdir - illumos gate - illumos
<heat> are they still separate in modernish bsds?
<mjg> i don't know what's up with the split
<mjg> in bsd you just have kernel
<mjg> so apparently it was already reported 6 years ago by another freebsd dev
<mjg> 2 years ago i ran into it and gave them a reproducer
<mjg> they reprod 1 year ago
<mjg> no fix in sight, at least none mentioned
<mjg> :>
<mjg> would be funny to check if solaris proper still has the problem
<heat> in the example dtrace in the flamegraph repo there's still a separation
<mjg> i know
<mjg> i don't know why they roll with it
<heat> solaris proper? does that even handle concurrency? :P
<heat> btw thanks for the help and tips
<heat> really helpful :)
<mjg> i can rant, but it is getting late
<mjg> :[
<heat> lmao
<mjg> i know of some actual fixups to smp in solaris
<mjg> after it diverged from illumos
<mjg> some of it is stuff they should have done years ago
<mjg> and other is a combination of a smart idea implemented in a stupid manner
<mjg> :[
hmmmm has joined #osdev
<mjg> heat: want a rant here it is https://www.illumos.org/issues/13057
<bslsk05> ​www.illumos.org: Bug #13057: pessimal mutex behavior - illumos gate - illumos
<heat> you have rants for days don't you
<geist> yeah and honestly.. i dunno. there's more to life than speed
<vin> sortie: Replying to you late. Wouldn't disabling write combining make all writes slower? Any chance writes to a particular memory region can be made slower?
nexalam__ has joined #osdev
<geist> everyone has their thing, but performance performance performance scaling, etc is not everything
<vin> For context: I want to limit just the write bandwidth to far numa node memory.
nexalam_ has quit [Ping timeout: 246 seconds]
<mjg> geist: well enjoy your openbsd manuals in a vm man :)
* geist shrugs
<geist> just sayin
<mjg> i have my kinks, you have yours
<mjg> look i'm happy to shut up about solaris if it is seen as a problem
<mjg> believe it or not :->
<geist> well anyway
<geist> it's late!
<heat> noooo
<heat> i like this channel's diversity
<geist> okay, sorry never mind
<heat> what have you been up to geist?
<geist> oh just general home maintenance, etc
<geist> thinking of doing some of FS hackery here in a sec
<heat> nice nice
<heat> still fat32?
<geist> yeah gotta finish it up
<geist> had write working the other day, now have to wire up all the remaining ops, run some stress tests and declare it v1
<mjg> got fsx on it?
<heat> do you have some go-tos for stress tests?
<geist> what is fsx?
<bslsk05> ​github.com: lk-overlay/cksum.c at master · librerpi/lk-overlay · GitHub
<heat> apple's shtick
<heat> i linked it the other day
<geist> oh. then no.
<clever> geist: if you pop this module into your build, and change the fstype on 262, it should help to stress test fat
<clever> it uses psci to shutdown qemu automatically when done
<geist> sure. but.. uh frankly that's a really lame stress test
<vin> what changes are you making to the fs geist ?
<geist> oh, just implementing it
<geist> a driver, that is
<heat> i stole fsx from apple and fsstress from ltp to stress my stuff
scoobydoo_ has joined #osdev
<vin> oh a driver for a new device?
<heat> a driver for the filesystem
<geist> no, a FS driver for FAT*
scoobydoo has quit [Ping timeout: 260 seconds]
scoobydoo_ is now known as scoobydoo
<clever> geist: what might you do to improve the stress test, maybe fire up threads and sha256 in parallel?
<geist> implement write
<clever> ah yes
<geist> that's not a stress test, that's simply a validation test
<clever> yep
<geist> its useful, but not really what i'd consider something that's stressing a fs
<mjg> want to seriously stress this -- xfstests
<heat> I've thought about that but it has too many deps
<mjg> has a barrage of tests, but i don't remember how portable it is
<heat> fsstress is something I can trivially add to my src/ and make it run on every CI
<clever> yeah, i had totally forgotten about write support, and that is something i also want to get working
<geist> huh i kinda wonder if fsx is a derivative of some fs code i wrote when i was at apple on the fs team
<heat> take a look
<bslsk05> ​github.com: fstools/fsx.c at master · apple/fstools · GitHub
<geist> i remember there being basically zero stress tests, so i started building a thing and then handed it off when they pulled me into iphone
<heat> also you were on the fs team?
<heat> you've been everywhere! :P
<vin> FAT32 did not have a driver? I am trying to understand the purpose behind the driver
<geist> ah no that predates it
<heat> "gcc -arch ppc -arch i386 -arch ppc64 -arch x86_64" what are these -arch switches?
<heat> are they for gcc-disguised clang?
<geist> ah actually yeah i did have a bit to do with this
<geist> i was there in 2005, wrote a standalone tool that did something similar. looking at the history of it they rolled some of the machinery into this in 2006
<geist> yeah this is sort of a distant derivative of some stress test code i wrote at the time
<heat> very cool
<geist> well portions of
<geist> most of it predates it, if nothing else because the style is not mine at all
<mjg> heh solid
<mjg> well your old code can come back to haunt you in a new way then
<heat> lk isn't posix
<heat> might be hard to do
<geist> but basically what i wrote is fairly simple. had something like it at Be: spawn a crapton of threads, each thread goes and does a bunch of random stuff, run until something falls over
<geist> oh gosh no, i am not using any apple code in my projects
<geist> *that* is a stress test. try to find all the edge conditions
<heat> :D
<geist> i remember writing this tool in 2005 and it would almost instantly fatally corrupt HFS+. a bug was found and fixed
<geist> i remember the testing folks were like 'oh this is great!' and they added it to their stuff
<geist> but it's the usual issue i have with unit testing vs stress testing. lots of places unit test, but don't stress test, because the latter takes time and has harder to define end states
nexalam__ has quit [Quit: Leaving]
<heat> tbf fsx.c is widely considered more of a unit-test thing these days
<geist> it's easy to run a unit test every time someone makes a CL, but much harder to say run a bank of machines, beating up on the software, trying to find edge cases
<mjg> fwiw freebsd has a stress testing machinery and it is pretty good
<bslsk05> ​google/file-system-stress-testing - A tool that can be used to stress test POSIX filesystems. (26 forks/88 stargazers/Apache-2.0)
<mjg> found tons of bugs
<heat> mjg, ever used that? ^^
<heat> I don't know how good it is but it seems to be made for freebsd
<geist> mjg: yeah i think OS projects are generally more amenable to stress testing, since there's essentially an infinite number of monkeys
<geist> vs a company where time is money
<mjg> heat: no, looking
<mjg> i wanted to note though that the existence of instrumentation + things like syzkaller definitely changed the landscape
<mjg> heat: heh it even has a manual for freebsd
<geist> yeah. what i like to look for in a FS implementation is all the internal locking issues, and pushing all of the ops to MT collisions, etc
<mjg> wait, that's a stale repo
<mjg> what's the current one
<geist> also generating crazy long journal transactions that blow up internals
<geist> rename() in particular is hell
<heat> mjg, seems to have been frozen, I don't know
isaacwoods has quit [Quit: WeeChat 3.6]
<geist> ie, via a crapton of random stuff end up with huge, incredibly fragmented files, then simultaneously rename A onto B onto C while C is renamed into A, etc
<geist> that sort of stuff finds all sorts of edge cases and failure cases
<mjg> ye, i even linked to [redacted] system deadlocking with it
<heat> [TOP SECRET]
<geist> anyway, that kinda stuff you can't easily unit test, since it relies on lots of heavily threaded slams on a machine
<geist> sometimes newer/faster hardware even hides the problems. sometimes handy to test on slow ass hardware, where races are wider
<mjg> there is the funny technique where you inject delays to artificially expand race windows
<mjg> i mean tooling is doing it for you
<geist> (qemu using TCG is actually fairly good for severe SMP racy stuff)
<mjg> heat: so does onyx survive https://www.netbsd.org/~riastradh/tmp/dirconc.c ? :)
<geist> yah single threaded qemu does by default, since it context switches between emulated cpus. has gigantic race windows as a result
<vin> https://github.com/utsaslab/crashmonkey this is a reasonably good crash testing tool that was published recently
<bslsk05> ​utsaslab/crashmonkey - CrashMonkey: tools for testing file-system reliability (OSDI 18) (27 forks/176 stargazers/Apache-2.0)
<geist> anyway yeah i should try to port one of these things to LK, or at least take inspiration. passing that is what i'd consider a V1 release of a FS
DanDan has joined #osdev
<heat> mjg, haven't tried yet
<heat> i'm tackling the budget-ass wait tracing rn
<mjg> +1
<heat> i can't stress how much I love looking at flamegraphs and tracing shit
<heat> it just looks soo good
<mjg> my man!
<heat> these svgs are particularly satisfying
<heat> you can even click shit
<vin> +1
<mjg> this is not a safe space for such claims though
<mjg> heat: you should read up on brendan gregg's blog then
<mjg> in particular their graphs for sleep and wakeup stuff
<bslsk05> ​www.brendangregg.com: Flame Graphs
<mjg> how you go off and how you get back
<bslsk05> ​www.brendangregg.com: Linux Wakeup and Off-Wake Profiling
<vin> For visualizing low latency measurements I like dick sites graphs more https://queue.acm.org/detail.cfm?id=3291278
<bslsk05> ​queue.acm.org: Benchmarking "Hello, World!" - ACM Queue
<mjg> oh ye the guy is great
<mjg> tracing is such a rabbit hole though
<mjg> may i also recommend gil tene talking about latency
<vin> I am reading his book right now, so much of what he says is something I want to follow "measure and then build"
<heat> booooring
<heat> build and then measure 3 years later
<heat> or in my case, a solid 7 years
<vin> Haha
<mjg> overpromise and under deliver
<mjg> work for a corp for 1 year and you will know what i mean
<vin> My research has been mostly studying things carefully rather than coming up with some idea and hammering it in all the places
SpikeHeron has quit [Quit: WeeChat 3.6]
<vin> Also, regarding my previous question. Is it possible to make just writes to far NUMA memory slow? I have a tmpfs on the other numa node and would like to make all writes to a mmaped region slow by some delay.
SpikeHeron has joined #osdev
dude12312414 has joined #osdev
<vin> I really want to mimic asymmetric read/write bandwidth of a device. Any thoughts?
<mjg> if you strictly control everything, i would just patch the kernel to add an artificial delay
dude12312414 has quit [Remote host closed the connection]
<heat> that seems... complicated
<vin> where exactly will I add this delay mjg
<heat> just add that in user space
<mjg> provided all the writes happen through write(2) et al, somewhere in tmpfs write
<heat> tmpfs write isn't a thing
<mjg> you spot you are in the "screw with it" area and just artificially wait
<mjg> things get harder if you mmap
<heat> rather tmpfs_writepages (but it got renamed now, it's readahead iirc?)
<heat> and also that's useless, writepages is only called when flushing
<heat> and we'd need to go into the details of linux vfs and... yeah
<heat> fuck that
<vin> I would love to just use "wait" in userspace but the thing is I am calling a library that moves data around in this mmaped region.
<vin> There are other places where I memcpy from a malloced area to the mmaped region where I can add a wait to induce this delay
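At those call sites the wait could be a busy-wait sized to the bytes copied; a sketch where the extra-ns-per-byte factor models whatever read/write asymmetry is being emulated:

    #include <stdint.h>
    #include <string.h>
    #include <time.h>

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    /* emulate lower write bandwidth: don't return until the "device time" elapsed */
    static void slow_write_memcpy(void *dst, const void *src, size_t n,
                                  double extra_ns_per_byte)
    {
        uint64_t start = now_ns();
        memcpy(dst, src, n);
        uint64_t deadline = start + (uint64_t)(n * extra_ns_per_byte);
        while (now_ns() < deadline)
            ; /* spin out the remaining budget */
    }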
<mjg> hm
<mjg> is this something you need to execute on bare metal?
<mjg> if you were playing with valgrind you could inject delays at that level
<vin> mjg: Unfortunately yes. I am doing benchmarks on how something would work on future byte addressable storage devices.
<mjg> welp good luck :)
<vin> haha, I will figure something out. Thanks for the discourse :)