klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
LittleFox has quit [Quit: ZNC 1.8.2+deb2+b1 - https://znc.in]
<jimbzy> Have you ever read The Blue Nowhere?
<jimbzy> By Jeffery Weaver, or maybe it was Deaver?
LittleFox has joined #osdev
LittleFox has quit [Remote host closed the connection]
<geist> no! should i?
LittleFox has joined #osdev
<geist> looks interesting!
thinkpol has quit [Remote host closed the connection]
<jimbzy> I thought it was pretty good.
thinkpol has joined #osdev
<bslsk05> ​devblogs.microsoft.com: Why load fs:[0x18] into a register and then dereference that, instead of just going for fs:[n] directly? - The Old New Thing
<heat> thoughts?
outfox has quit [Ping timeout: 255 seconds]
<klys> lea bx,[fs:0x18] ; cli ; mov eax,[bx] ; mov [bx],edx ; sti ; ret
<geist> watching a few recent bjork videos (new upcoming album)
<geist> unable to hold a coherent thought
<klys> lea bx,[fs:0x18] ; cli ; mov eax,[fs:bx] ; mov [fs:bx],edx ; sti ; ret
<klys> int 6: cpu-generated (80186+): invalid opcode
gog has quit [Ping timeout: 252 seconds]
[itchyjunk] has quit [Ping timeout: 264 seconds]
outfox has joined #osdev
[itchyjunk] has joined #osdev
<zid> oh that post made *no* sense until the comments clarified that fs+0x18 holds the fsbase
elastic_dog has quit [Ping timeout: 248 seconds]
<geist> i started to read it but it was pretty confusing and using the syntax i dont read much, and just below the threshold of interesting
<heat> yeah I took some time to figure out what 0x18 was
elastic_dog has joined #osdev
<geist> i guessed what it was (usually it's canonical to put it at offset 0) but just didn't feel like diving into that set of x86 minutiae today.
<heat> is x18 on arm64 tp?
<geist> the abi says that x18 can be used by the OS to do what it wants, otherwise its another temporary reg
<geist> but yeah, usually x18
<heat> ahh
<heat> what's ffixed-x18 for then?
<geist> no idea what windows does, but iirc their abi is fairly close to the standard ELF one
<heat> do gcc/clang default to x18 = temp reg?
<geist> -ffixed-x18 is precisely for that: tell the compiler not to use the register for anything
<geist> probably. or it's defined in the triple
<geist> but for -elf it may define it as a temp
<heat> windows seems to use it for tls
<heat> which is interesting
<geist> yah, i think linux does too
<zid> cute that fs:18 and x18 go together
<zid> wonder how that happened
<heat> is there a big disadvantage in reading tpidr instead of x18?
<geist> illuminati confirmed
<geist> oh oh yeah, duh, tpidr. sorry, my head is a little frazzled
<geist> yeah linux uses tpidr_el0, not sure it uses x18 in user space.
<geist> kernel has of course its own use for those things, but that's a different can of worms
<geist> windows, i have no idea
<heat> I think linux uses ffixed-x18 in the kernel
<geist> yah fuchsia absolutely does too
<geist> for user space fuchsia uses x18 for the safe call stack too
<geist> as well as the kernel
<heat> yeah I think that's mandated by the sanitizer?
<geist> at least is mandated by safe stack, since you need a ready to use register holding the pointer or it's not usable
<heat> yeah
<geist> i guess you could move it from tpidr, load the SS pointer, use it, store it back
<geist> but it makes a generally free security thing somewhat more expensive
<heat> what I'm wondering is: why x18 for tls? is mrs reg, TPIDR_ELx slower?
<geist> no idea
<geist> that's what windows does?
<heat> well, you do that
<heat> so does the linux kernel
<geist> uhuh?
<geist> wait wait. now i'm paying attention. what do you think is going on?
<geist> clean slate here.
<heat> you're using x18 to store the per-thread/per-cpu stuff right?
<geist> no. not at all. that's tpidr
<geist> x18 holds the safe stack
<heat> ah!
<zid> ah, sounded like you were, think you meant "we are reserving x18" earlier, not "we are using x18 for tls"
<geist> yes and then i corrected myself but i wasn't clear enough
<zid> It's a magic trick, indirection
<geist> we are reserving x18 but it's for another use
<geist> tpidr is used for the TLS bits in both linux and fuchsia. i have no idea if linux does something with x18 in user space. (i think it doesn't)
<geist> for kernel, well it's all different, but that's how kernel is.
<zid> Talk about tls then confirm you are reserving x18 also, but not why. Throw them off the scent but don't give the lawyers any ammo to go after you for perjury. Smart.
<geist> fine. anyway yes just to be clear that's what's going on
<zid> yea we know
<heat> the linux kernel seems to use x18 to get the current task
<zid> we moved on to jokes
<geist> the arm ABI says that x18 can be a temp or used by the OS. ie it has no other purpose
<heat> but the switch was fairly recent
<geist> and it being the last temp means it is the last to be allocated
<heat> I assume there's no penalty in mrs reg, tpidr
<geist> x19 is the first of the saved regs
<geist> also to be more confusing there's two of them in user space: tpidr_el0 and tpidrro_el0
<geist> we (fuchsia) dont use the latter in user or kernel, and i think linux uses the latter for kernel purposes, so same thing in user space
<heat> what's tpidrro used for?
<heat> i assume it's read-only from the title
<geist> that's right
<geist> it's the same as tpidr except read only to EL0
<geist> but RW to EL1+
<heat> so how is that useful?
<geist> beats me
<geist> you could put something in it that you dont want user space to fuck up, like, say the current cpu #
<geist> and then not context switch it
<heat> ah shit, I think you meant darwin?
<geist> or, i dunno, maybe a pointer to the current vdso or something
<geist> i dunno is that what darwin does with it?
<heat> "In Apple open source, TPIDRRO_EL0/TPIDRURO is used to save the CPU number,"
<geist> ah okay there you go
<bslsk05> ​opensource.apple.com: cswitch.s
Ram-Z has quit [Ping timeout: 252 seconds]
<geist> makes sense. you set it up once and then just leave it that way
<geist> i dunno, i dont look at darwin source
<geist> so yeah i guess that's a reason for it
<geist> i dont think we use it at all in fuchsia, though maybe the need would arise at some point
<geist> i think linux hard fixes it at 0 when in user space, and uses it somewhat like the SSCRATCH register in riscv, where it's used for temporary bits when the kernel is in kernel space
<geist> someone had mentioned it can hold an anchor when in EL1 so the cpu can detect a recursive stack overflow, but then set it to 0 when in user space
<geist> or something like that
<geist> but i honestly dunno off the top of my head
<geist> or use it as a temporary scratch space to move a local reg into when taking an EL1 -> EL1 exception so you get a reg to do some stack work with
<geist> much like how you use the sscratch reg in riscv in general
<geist> only requirement there with tpidrro is you'd have to make sure you zero it out before switching to EL0 or you'd leak kernel info
<heat> yeah
<geist> i think holding the current cpu number for user space doesn't really get you anything IMO. in the case of rdtscp where it simultaneously returns the time stamp and the cpu number, that makes total sense,
<geist> you are guaranteed that both things happened on the same cpu
<geist> but in the case of having a separate, preemptable instruction to read the cpu # i dont think that'd be particularly helpful
<heat> why? you skip a syscall or something
<geist> sure but the current cpu # is only really meaningful if it corresponds to something like a time stamp
<geist> which you can't gather atomically
<geist> otherwise it's just a piece of info that probably isn't worth burning a whole register on
<mrvn> and the timestamp is meaningless if the cpu number changed since the last time you checked
<heat> struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts); unsigned long cpunr; __asm__ __volatile__("mrs %0, tpidrro_el0" : "=r"(cpunr)); printf("Event X at time %lld.%09ld, cpu %lu\n", (long long)ts.tv_sec, ts.tv_nsec, cpunr);
<heat> you don't need the timestamp at the same time because time isn't supposed to shift
<geist> yah but the os can preempt between the clock and the tpidr
<mrvn> heat: but that's something different than the rdtscp
<heat> i know
<heat> but you get the same thing, the time + cpunr
<mrvn> if the time you read is global then the core you get it on is irrelevant
<heat> yes
Vercas6 has quit [Write error: Connection reset by peer]
gxt has quit [Write error: Connection reset by peer]
opal has quit [Write error: Connection reset by peer]
<geist> like, it's nice to have the current cpu number, but it's not really that helpful if you didn't get it atomically with the time stamp
gxt has joined #osdev
<geist> and even the latter isn't that useful in a constant TSC world
opal has joined #osdev
<mrvn> geist: is the tsc constant even if a core was powered down a bit?
<heat> what's the cpu number useful for?
<geist> but since constant/invariant TSC wasn't always the case
Vercas6 has joined #osdev
<geist> which is back when rdtscp was added
<mrvn> heat: to see if the core changed since the last measurement
<geist> a) assume the TSC doesn't tick the same on every core
<geist> b) then read the TSC + cpu number atomically, now you can use this to measure time between two points for benchmarking purposes
<geist> and if the two values dont have the same cpu # you can toss it
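The benchmarking scheme geist outlines (read counter and CPU number together, discard the pair if the CPU changed) can be sketched in portable C. This is only an illustration of the discard logic: `struct tsc_sample` and `tsc_delta` are hypothetical names, and how a sample is actually captured (for example with rdtscp on x86) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sample type: a counter value plus the CPU it was read on,
 * assumed to have been captured together by a single instruction. */
struct tsc_sample {
    uint64_t ticks;
    uint32_t cpu;
};

/* Store end - start in *delta and return true only if both samples came
 * from the same CPU; otherwise the measurement should be tossed. */
static bool tsc_delta(struct tsc_sample start, struct tsc_sample end,
                      uint64_t *delta)
{
    assert(delta != NULL);
    if (start.cpu != end.cpu)
        return false;   /* migrated between samples: counters not comparable */
    *delta = end.ticks - start.ticks;
    return true;
}
```

A caller would take one sample before and one after the code under test and simply retry the measurement whenever `tsc_delta` returns false.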
Ram-Z has joined #osdev
<mrvn> how does constant TSC work in big.LITTLE world?
<geist> mrvn: there are cpuid bits that say yes or no on that
<geist> if it is constant + invariant TSC then they tick at the same rate, or those cpuid bits are lying
<heat> mrvn, the TSC ticks based on the FSB
<heat> FSB frequency, I mean
<geist> well, not precisely. it's complicated
Iris_Persephone has joined #osdev
<Iris_Persephone> hia, don't mind me
<geist> but basically a modern x86 says 'i have a constant & invariant TSC which ticks at <rate to be determined via a number of ways>'
<Iris_Persephone> just lurking a little
<mrvn> anyway the TSC used to be different per core so any change in core # makes measuring differences meaningless
<geist> but if that's the case, then it *usually* ticks at some fairly high rate similar to base cpu frequency
<geist> but correct, yes. if the TSC doesn't tick at the same rate then the core # is important
<geist> mrvn: re: big.LITTLE on x86, specifically alder lake i have personally confirmed that TSC ticks at the same rate on all cores
<geist> which makes sense, or it'd totally fuck everything up
<geist> for ARM the big.LITTLE stuff and the tick rate of the time stamp counter is *always* global and constant
<geist> it's simply defined as such
<geist> re TSC and apic tick rate, i was just a few days ago fixing a couple bugs in fuchsia here, so it's fresh on my mind
<heat> I was $today years old when I found out MSR_PLATFORM_INFO is a thing and has the frequency
<geist> exactly
<geist> i think section 19.7.3 or something. have a CL up on fuchsia for review to tighten that up
<geist> here's the problem (and the bug i'm trying to fix): all of that is fine and dandy until you're in a hypervisor, in which case all of that fixed frequency nonsense is completely gone. so the fix is, if in the presence of a hypervisor, to fall back to calibration for the apic and tsc *unless* cpuid 15h/16h is present
<geist> which tells you exactly what the freq is
<geist> anything post about ice lake just fills that in on intel, and AMD still doesn't
<JerOfPanic> hi
<geist> the problem with MSR_PLATFORM_INFO is it's very specific to what gen intel core you have, so you already need a bunch of code to detect precisely which core you're on, etc
<geist> it's all an annoying mess
<geist> ARM just straight up has a register that firmware is supposed to fill in that tells you the tick rate. thank you arm
<heat> geist, what's the problem with hypervisors?
<bslsk05> ​fuchsia-review.googlesource.com <no title>
<geist> their APIC tick rate may be something else
<geist> (which is precisely the case here)
<geist> so the falling back to assuming it's 24 or 25Mhz or so, according to the intel manual, is invalid
<heat> so you can't derive the APIC tick rate from the FSB frequency you measure?
<geist> you have to measure it
<heat> yeah
<geist> vs assuming it's 24/25/100/etc mhz, like the manual says
<heat> AHA yes
<heat> this is the fuckery I was thinking of
<geist> so it's about short circuiting the code we have there that says 'ah i know this is a sandy bridge/skylake/etc, and cpuid 15h doesn't say what the FSB is but i know it's X'
<heat> KVM hardcodes the FSB frequency
<heat> as in, the APIC frequency
<geist> the TL;DR is the code we have here was assuming the APIC was ticking at 25mhz, when it was actually ticking at 1000mhz because KVM
<heat> which is why
<geist> so we were firing timers at 40x rate
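The 40x figure follows directly from the two frequencies mentioned: a count computed for a 25 MHz timer expires 40 times sooner when the timer really ticks at 1000 MHz. A small sketch of that arithmetic, with hypothetical helper names and a 1 ms period chosen purely for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Ticks to program into a timer for a desired period, given the frequency
 * the kernel believes the timer runs at. */
static uint64_t timer_count_for(uint64_t assumed_hz, uint64_t period_ns)
{
    return assumed_hz * period_ns / 1000000000ull;
}

/* Period actually observed when that count elapses at the real frequency. */
static uint64_t actual_period_ns(uint64_t count, uint64_t actual_hz)
{
    assert(actual_hz != 0);
    return count * 1000000000ull / actual_hz;
}
```

With an assumed 25 MHz, a 1 ms period becomes a count of 25000; at an actual 1000 MHz that count elapses in 25000 ns, so the timer fires 40 times as often as intended.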
<heat> gEfiMdePkgTokenSpaceGuid.PcdFSBClock|1000000000 per OVMF
<heat> hardcoded
<heat> OVMF doesn't even attempt to calibrate anything
<geist> so i short circuited it so that if we weren't explicitly told in cpuid 15h, return 0 here, which causes it to calibrate elsewhere
<geist> (and we're in a hypervisor)
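The "return 0 so the caller calibrates" convention can be sketched against CPUID leaf 15h, where EAX and EBX give the denominator and numerator of the TSC to crystal clock ratio and ECX gives the crystal frequency in Hz (0 when not enumerated). This is not the Fuchsia change under review, just a minimal illustration; `struct cpuid15` and `tsc_hz_from_cpuid15` are hypothetical names, and the register values are passed in rather than read with the cpuid instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register values as returned by CPUID leaf 15h. */
struct cpuid15 {
    uint32_t eax;  /* denominator of the TSC / crystal clock ratio */
    uint32_t ebx;  /* numerator of the TSC / crystal clock ratio */
    uint32_t ecx;  /* crystal clock frequency in Hz, 0 if not enumerated */
};

/* Return the TSC frequency in Hz, or 0 meaning "not told, go calibrate". */
static uint64_t tsc_hz_from_cpuid15(const struct cpuid15 *r)
{
    assert(r != NULL);
    if (r->eax == 0 || r->ebx == 0 || r->ecx == 0)
        return 0;
    return (uint64_t)r->ecx * r->ebx / r->eax;
}
```

For example, a 24 MHz crystal with a ratio of 176/2 yields 2112000000 Hz, while a leaf that leaves ECX at 0 yields 0 and pushes the caller onto the calibration path.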
* JerOfPanic is 63. day 0 smokes on smoking cessation program quiting - on British American Tobacco's Zonnic nicotine replacement product
<JerOfPanic> ;-P
<heat> i'm fine with measuring the APIC timer tbh
<JerOfPanic> two months, never did this before since I began smoking in China on 2009
<heat> I'd just really like a way to get the TSC frequency directly, everywhere
<geist> yah there was a bit of discussion on chat as to whether or not it's worth even bothering returning a hard coded apic value
<geist> you can argue that unless cpuid 15h tells you precisely what it is, just calibrate it
<geist> i may just do that in a later CL
<mrvn> with kvm who knows how much cpu time the guest runs anyway
<JerOfPanic> I can code again :-O
<geist> cpuid 15h of course tells you what the TSC is ticking at, if present
<heat> with a precise enough TSC you will get precise enough measurements of the APIC frequency
<geist> *or* the KVM pv_clock stuff
<heat> yea
<geist> really the interesting thing is if you're using the apic TSC deadline mode, which more modern stuff uses, you dont even need to know the apic tick rate
<heat> IIRC there's a good bit of errata on TSC deadline
<vin> How important is it for system developers to learn rust? I see a lot of new projects being started in it.
<geist> for fuchsia we dont even bother reading/calibrating the rate if using apic non deadline
<vin> My only concern is rust won't be able to take years of optimization C/C++ went through, so for performance code will it really be faster
Matt|home has quit [Quit: Leaving]
<geist> my experience is that most folks i know that use rust or write large systems code, performance isn't the primary concern
<geist> it's a concern, but you choose rust because you want safe/etc code first and foremost
<geist> and the fact that it's also pretty quick is a plus
<heat> "When the local APIC (Advanced Programmable Interrupt Controller) timer is
<heat> configured for TSC-Deadline mode, a timer interrupt may be generated much earlier
<heat> than expected or much later than expected"
<heat> when messing around with IA32_TSC_ADJUST
<geist> yah much earlier on occasion is fine, much later is a problem
<vin> geist: I am trying to figure out if it is for me. I mostly do independent research and write prototypes rather than production ready code.
<geist> vin: yeah honestly i can't tell ya. i'm gonna have to learn it more soon too, i'm sure. i have dabbled in rust enough that i can kinda read it
<geist> but dont really know how to speak it, per se
<vin> I guess I will take the plunge and try it for a project.
<Iris_Persephone> I read the wiki, saw all the warnings about how complex this was, but... it is just now starting to hit me
<heat> erm
<heat> we're talking about complex, late game shit
<heat> as a beginner nothing of what we're talking about is relevant
<heat> there's layers to osdev
<jimbzy> True story, heat.
<heat> geist, oh yes something I forgot to ask: why was the isb needed after the write to tpidr_el1?
<geist> possible it isn't
<bslsk05> ​stackoverflow.com: kernel - What is the purpose of Thread ID registers like TPIDR_EL0/TPIDR_EL1 in ARM? - Stack Overflow
<heat> but now it's not
<heat> i'm struggling to understand where I need barriers
<geist> i think it was just an abundance of caution. ie, isb after writing to MSRs unless it can be proven you dont need to
<geist> oh there's a bunch of verbiage in the ARM manual about precisely this topic
<geist> i think there's even a whole section about which ones you do and dont need to
<Iris_Persephone> At this moment I am just trying to learn everything I can - I didn't expect to be at the point where I could actually code anything for _years_
<heat> I've noticed you need to isb and dsb immediately after writing to page tables because the ARM cpu can speculate like crazy
<geist> correct
<Iris_Persephone> So you guys hopefully won't mind if I stay for a little, just try to pick things up?
<heat> Iris_Persephone, just start
<bslsk05> ​wiki.osdev.org: Bare Bones - OSDev Wiki
<geist> yah you'll learn a lot more by starting and stumbling a bit
<geist> otherwise you'll just get overwhelmed
<jimbzy> I love those bare bones projects.
<jimbzy> I learned a lot by working through them and breaking them.
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
smeso has quit [Quit: smeso]
gxt has quit [Ping timeout: 258 seconds]
gxt has joined #osdev
smeso has joined #osdev
<geist> also that's what we're here for: to provide help for those that are breaking things
<Iris_Persephone> I was half-expecting you guys to tear me to pieces for being uninformed
<jimbzy> Hah.
<jimbzy> How long have you been trying to help me fix the things I break, geist? 20 years? XD
<geist> Iris_Persephone: oh gosh no, everyone starts somewhere
<geist> jimbzy: heh, feels like it
<geist> though unless you've been on as another nick you've probably only been here maybe 5 or 6 years? i forget
<jimbzy> I honestly don't remember my old nickname.
<geist> or (probably) much longer and my concept of time is off
<geist> especially the last few years
<jimbzy> You were working at Danger if that narrows it down.
<geist> okay, yeah that was 20 years
<geist> you musta been on another nick back then
<jimbzy> Yeah.
<Iris_Persephone> The wiki says that beginner-friendly Linux distros like Mint are not recommended; any specific reason why this is?
<jimbzy> You or air told me to get the dinosaur book and order hard-copies of the Intel manuals.
<geist> also fun fact: the first Danger Hiptop was released 20 years ago on october 1st, so 20th anniversary will be next weekend
<jimbzy> That's crazy.
<heat> Iris_Persephone, that's bs
<geist> Iris_Persephone: dunno. I use mint linux myself
<geist> and yeah i'd generally regard that as bs. indeed
<heat> where does that say?
<Iris_Persephone> Getting Started, subheading "Choosing your development environment"
<jimbzy> Who has the time to roll a custom gentoo distro or build slackware?
<geist> a statement like that with no real backed up reason shouldn't be on there, and i cant think of a good rationale for it
<geist> but all that aside, to be clear this channel and the wiki aren't strictly speaking connected. just defacto
<Iris_Persephone> Ah, the page I got this link from just said "partnered with" so I wasn't sure of the actual connection
<heat> Best distros for kernel development are (but keep in mind this is also a matter of personal taste, so these distros are not required rather suggested, and they usually require some experience): Arch, Gentoo, Solus, Slackware, void etc. even Puppy.
<heat> lmao
<geist> heh yeah. and i take a bit of umbrage to the notion that mint linux is not a general purpose distro
<geist> shows that whoever wrote it really hadn't used it directly
<geist> i get the idea that dont use a linux distro that's trying to hide the linux side of things (ie no development tools, no command line), but i think those are more of an exception than the norm
<jimbzy> heat, It makes OS dev a lot easier because you're spending all your time manually configuring your OS and build environment. :P
<jimbzy> You can't freak out over your build script if you can't get the buildtools to build. Think about it.
<geist> anyway this whole getting started page reads a lot like a bunch of blabbing about whatever someone's experience was
<geist> instead of a bulleted list of things to set up or whatnot
<heat> fixed that linux distros thing
<geist> boom.
<heat> that whole paragraph was bullshit inc.
<jimbzy> The system works.
<heat> you're not a real kernel hacker unless you use gentoo btw
<heat> or linux from scratch
<Iris_Persephone> I was actually just in the middle of doing LFS
<jimbzy> You call that "operating"? I use a bank of 128 toggle switches.
<jimbzy> Unlabeled.
<Iris_Persephone> I think I borked my Mint install, though, after my power got cut
<kazinsal> is it really a computer if you don't have to toggle in your bootloader at poweron?
<jimbzy> I know, right?
<geist> it's just an appliance if it boots itself
<heat> just an appliance? I'll let you know I flashed coreboot on my fridge
divine has quit [Ping timeout: 265 seconds]
<Iris_Persephone> I think I'll start Bare Bones once I get my LFS in a semblance of normality, thanks guys!
<heat> use linux mint
<heat> "If you are unsure, try Ubuntu, Fedora or Linux Mint. " <-- just to spite whoever wrote that before
<Iris_Persephone> I want to finish one project before I start another :p
<jimbzy> Ubuntu works well enough for me.
<jimbzy> Heck, just using Debian would be an improvement over the ones listed before.
<heat> oh wait
<heat> it was bzt
<heat> lmao
<heat> dude will track me down and beat me up
<heat> jimbzy, using debian is never an improvement
<Iris_Persephone> Oh!
<Iris_Persephone> My Mint _isn't_ borked
[itchyjunk] has quit [Read error: Connection reset by peer]
divine has joined #osdev
divine has quit [Read error: Connection reset by peer]
divine has joined #osdev
divine has quit [Client Quit]
divine has joined #osdev
heat has quit [Ping timeout: 250 seconds]
Iris_Persephone has quit [Ping timeout: 264 seconds]
Iris_Persephone has joined #osdev
vdamewood has joined #osdev
vinleod has joined #osdev
vdamewood is now known as Guest8017
Guest8017 has quit [Killed (calcium.libera.chat (Nickname regained by services))]
vinleod is now known as vdamewood
SGautam has joined #osdev
Iris_Persephone has quit [Read error: Connection reset by peer]
Iris_Persephone has joined #osdev
GeDaMo has joined #osdev
Vercas6 has quit [Quit: Ping timeout (120 seconds)]
Iris_Persephone has quit [Ping timeout: 252 seconds]
Iris_Persephone has joined #osdev
Persephone has joined #osdev
Iris_Persephone has quit [Ping timeout: 264 seconds]
Vercas6 has joined #osdev
Persephone has quit [Remote host closed the connection]
Persephone has joined #osdev
scoobydoo has quit [Ping timeout: 265 seconds]
scoobydoo has joined #osdev
Persephone has quit [Ping timeout: 244 seconds]
Persephone has joined #osdev
Persephone has quit [Remote host closed the connection]
Persephone has joined #osdev
opal has quit [Ping timeout: 258 seconds]
opal has joined #osdev
Persephone has quit [Remote host closed the connection]
Persephone has joined #osdev
Persephone has quit [Remote host closed the connection]
Persephone has joined #osdev
vdamewood has quit [Quit: Life beckons]
Persephone has quit [Remote host closed the connection]
Persephone has joined #osdev
opal has quit [Remote host closed the connection]
opal has joined #osdev
Persephone has quit [Remote host closed the connection]
Persephone has joined #osdev
bgs has joined #osdev
Persephone has quit [Ping timeout: 248 seconds]
Persephone has joined #osdev
Persephone has quit [Ping timeout: 264 seconds]
Persephone has joined #osdev
bgs has quit [Remote host closed the connection]
Persephone has quit [Ping timeout: 268 seconds]
Persephone has joined #osdev
Persephone has quit [Ping timeout: 264 seconds]
Persephone has joined #osdev
<mjg> real life unix question
<mrvn> real life unix answer
<mjg> is there a fcntl F_GETSIZE or whatever other name in any unix system?
<mrvn> see "CONFORMING TO"
<mjg> i know postgres is using funny games with lseek to find the size and hopefully avoid full blown stat in the process
<mjg> what?
<mrvn> SVr4, 4.3BSD, POSIX.1-2001. Only the operations F_DUPFD, F_GETFD,
<mrvn> F_SETFD, F_GETFL, F_SETFL, F_GETLK, F_SETLK, and F_SETLKW are specified
<mrvn> in POSIX.1-2001.
<mjg> i just came up with the flag
<mjg> and the name
<mjg> the q is if there is a functionality like that somewhere already
<mjg> if so i would reuse their name
<mrvn> I don't see such a flag mentioned at all. seek + tell is the way to get the size without a stat I think.
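The lseek approach mrvn mentions can be sketched as follows, next to the fstat version it is being compared with. Both helper names are hypothetical. The lseek variant saves and restores the file offset, so it is not safe if another thread or process sharing the open file description moves the offset concurrently.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Size of the file behind fd using only lseek, or -1 on error.
 * The current offset is saved first and restored afterwards. */
static off_t file_size_via_lseek(int fd)
{
    assert(fd >= 0);
    off_t saved = lseek(fd, 0, SEEK_CUR);
    if (saved == (off_t)-1)
        return -1;
    off_t size = lseek(fd, 0, SEEK_END);
    if (size == (off_t)-1)
        return -1;
    if (lseek(fd, saved, SEEK_SET) == (off_t)-1)
        return -1;
    return size;
}

/* Same answer via fstat, which fills in a whole struct stat to get it. */
static off_t file_size_via_fstat(int fd)
{
    assert(fd >= 0);
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    return st.st_size;
}
```

Code that only ever uses pread/pwrite does not care about the offset and can drop the save and restore steps, leaving a single lseek(fd, 0, SEEK_END).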
<mjg> wow darwin really went at it with adding F_ flags
<mjg> interestingly they have F_SETSIZE (!)
<mjg> bsd/sys/fcntl.h:#define F_SETSIZE 43 /* Truncate a file. Equivalent to calling truncate(2) */
<mjg> but no F_GETSIZE
<mjg> pretty weird on that front
<LittleFox> wat
<mrvn> They probably just implemented a bunch of syscalls that affects files in a single handler.
xenos1984 has quit [Read error: Connection reset by peer]
Persephone has quit [Ping timeout: 260 seconds]
Persephone has joined #osdev
xenos1984 has joined #osdev
isaacwoods has joined #osdev
Persephone has quit [Ping timeout: 244 seconds]
Persephone has joined #osdev
SGautam has quit [Quit: Connection closed for inactivity]
Persephone has quit [Ping timeout: 264 seconds]
Persephone has joined #osdev
Persephone has quit [Ping timeout: 244 seconds]
Persephone has joined #osdev
lkurusa has joined #osdev
CYKS has quit [Quit: Ping timeout (120 seconds)]
CYKS has joined #osdev
thatcher has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
<geist> huh odd. yeah
<geist> i always thought fcntl was only for really different things but not something that would cross functionality
<geist> maybe truncate() came later?
<mjg> fcntl duplicates a lot of syscalls, i don't know what came first
<mjg> i mean respective F_ vs a syscall
<mjg> there is dup, advisory locking
Persephone has quit [Ping timeout: 264 seconds]
netbsduser has joined #osdev
<bslsk05> ​lpc.events <no title>
Persephone has joined #osdev
Persephone has quit [Ping timeout: 244 seconds]
ElementW has joined #osdev
Persephone has joined #osdev
<geist> hmm, i dont see the F_SETSIZE on darwin here
<geist> or at least macosx
<geist> and there it is
<geist> ah the OSX man page gives some context:
<geist> "F_SETSIZE Deprecated. In previous releases, this would allow a process with root privileges to truncate a file without zeroing space. For security reasons, this operation is no longer supported and will instead truncate the file in the same manner as truncate(2)."
<geist> so it was for file extend without zeroing. presumably at the time someone thought that was an optimization if the process was just about to overwrite it completely
<geist> so actually it's silly but reasonable. there's usually reasons like this exist
Persephone has quit [Ping timeout: 268 seconds]
<mjg> that's part of the problem though
<mjg> it sure as hell can't be inferred from the name, can it
<mjg> they should have added flags to truncate, which would also make it extendable
Persephone has joined #osdev
Matt|home has joined #osdev
heat has joined #osdev
<heat> re: avoiding a full stat
<heat> really? how much slower is it?
<heat> at least on my system all the stat stats come from the in-memory inode itself, it's just as fast as a random F_GETSIZE
<heat> (you can also just use the traditional two-lseeks method to get it, but that's slow)
<zid> yea I'd figure stat was 'free'
<zid> i.e it amortized into everything else as the inode came off the disk for you to read the file in the first place to seek it
wgrant has quit [Ping timeout: 244 seconds]
<geist> well that is why F_GETSIZE doesn't exist
<geist> that was part of the discussion: if F_SETSIZE exists why doesn't GETSIZE? well, the answer is F_SETSIZE was a weird special case that used to not do what truncate does
<geist> and thus it's only there for backwards compatibility
<geist> there was no reason for it to generically exist
<heat> most of fcntl is old garbo anyway
Persephone has quit [Ping timeout: 265 seconds]
<zid> aka osdev
<geist> i dont know. most of them seem to be for a specific purpose
<heat> F_DUPFD has dup, dup2, dup3
<geist> sure, but the darwin one at least has special dup that *also* copies cloexec flags
<geist> not saying dup and fcntl shouldn't overlap but most of the other ones seem to not have an overlap
<heat> linux also has that
<geist> most of this is a reason why if you have a generic syscall like that, add a options or flags field to it
<geist> we generally do that in fuchsia and it has helped immensely
<heat> yea
<geist> ie, if starting over write it as dup(int infd, uint flags, out *outfd)
<heat> it's part of sane linux syscall API design right now
<geist> now you can implement all the dups and some ones you haven't thought about
<heat> int syscall(args..., uint32_t flags)
<geist> yah
<geist> there are a fair number of zircon syscalls now where 0 is the only valid flags field, but that's for future expansion
<heat> they're also exclusively doing explicitly sized integers for every in-memory struct that goes to the kernel to avoid 32-bit compatibility crap
<heat> oh yeah that's a funny one
<heat> technically you can't return EINVAL for bad flags in open(2)
<geist> hmm, is that for a reason or just bad design?
<heat> bad design baby!
<geist> ooooh because varargs
<geist> also protip: please dont varargs your syscalls
<heat> oh no varargs has nothing to do with that
<geist> ah
<heat> open("/bin/bash", O_RDONLY | O_RANDOM_FLAG_THAT_DOESNT_EXIST) doesn't fail
Persephone has joined #osdev
<geist> probably some historical reason, like source level compatibility between unices where some flags aren't there
<heat> yea
<geist> but that's a bad idea, probably born out of necessity than anything else
dude12312414 has joined #osdev
<heat> also a funny one: there's a random 10 year old linux kernel release where O_CLOEXEC wasn't respected
<heat> so musl, out of compatibility's sake, always does open() + fcntl(F_SETFD, FD_CLOEXEC) for O_CLOEXEC opens
<geist> yay multithreaded
<heat> not only can you accidentally leak fds but you need to confirm your O_CLOEXEC choice with an extra syscall, yay!
<heat> and this is born out of stubbornness because the musl people could just, erm, look at the kernel version to figure this stuff out
<mjg> stat is far from free
<geist> so here's a question for ya: on arm64 and x86-64 when you're running a 32bit user space on top of a 48 bit address space (ie, a 64bit kernel) what happens when the process writes to address 0xffff.ffff+?
<mjg> that's copying over 128 bytes of data to grab 8
<geist> ie, wraps around
<heat> I would say "sanely, wraps around"
<mjg> in fact several years back linux was making some moves to make reading inode size dirt cheap specifically because it was needed a lot
<mjg> and a full blown stat was fucking terrible
<heat> but insanely page faults is also something I could assume
<zid> isn't that different between amd and intel
<geist> never thought about it much, but on x86 machine you do end up blowing at least 1 extra page table to run a 32bit process on a 64bit kernel
<geist> since you still have to set up a 64bit aspace but then only 0-4GB is ever accessed
<zid> or was that the 1M limit that's different
<mjg> also did i mention it is easy to implement size thing without taking the inode lock?
<heat> yeah sure
<heat> just read it
<heat> ez
<geist> well, depends on the memory barriers, but as long as the updater properly barriers, then it should be generally safe indeed
wgrant has joined #osdev
<geist> there's a few places in zircon where we read outside of a lock a thing that's updated in a lock, and there's usually some hand wrangling and extra comments to make sure that's cool in the gang
<mjg> to get some perspective, a fstat syscall (as in no path lookup, just straight to stat) is about 9 mln/s. getpid is over 20 mln
<heat> that's pretty fast
<geist> hmm, what is mln?
<heat> million
<mjg> million
<geist> million what?
<mjg> ops/s?
<geist> i dunno.
<mjg> so a fcntl(fd, F_GETSIZE) would be at half the price of fstat(fd, &statbuf);
<geist> oh oh i see, you mean 'mln' == million
<mjg> and i'm not even talking about all the cases where the kernel grabs the size internally
<heat> so
<geist> got it, just never seen that particular shortening. i was thinking 'million lines' or something
<mjg> so ye, i do think F_GETSIZE is definitely worth it
<heat> why are you assuming fcntl(fd, F_GETSIZE) is as expensive as getpid()?
<mjg> it should be in the same ballpark
<mjg> provided there is no inode locking
<mjg> what kind of perf do you expect
<geist> yah i'd assumed there'd be no real difference. if it can return the value on the left hand side it'd avoid a user copy in the syscall itself
<heat> sys_getpid() is just a task access
<geist> and that might be more substantial
<mjg> heat: well ye it is some more memory references
<mjg> so it will be slower than getpid, but definitely still way faster than fstat
<mjg> fstat has strictly more work to do
<geist> ie if you had some sort of `off_t get_file_size(int fd);` idealized syscall it would avoid any memory references, vs fstat
<mjg> and i mean a lot
<heat> it does have more work to do but it's trivial and cpus are fast
<mjg> which is extra crappy on contemporary boxes with SMAP
<mjg> ... since turning it off to do copyout is turbo expensive
<mjg> which happens to be avoided in this case
<heat> is it?
<geist> but not in a hypothetical fcntl call right?
<mjg> it is
<geist> because that can't return on the left side
<mjg> heat: for one it is serializing
<heat> oh yeah also, a problem: fcntl returns int
<mjg> got ya covered
<mjg> freebsd can return *two* ints in a syscall
xenos1984 has quit [Read error: Connection reset by peer]
<mjg> :)
<heat> bonkers
<geist> yes but then it's not fcntl
<geist> honestly a more interesting syscall that is missing is 'where is my file position'
<mjg> general point being, gathering full info for a stat call is way more work and often requires locking
<geist> since the canonical solution is to seek to the end and then ftell
<heat> geist, that's covered by lseek
<mjg> ye
<geist> does it?
<heat> lseek(fd, 0, SEEK_CUR) returns the current seek and adjusts it by 0
<geist> oh huh. yeah okay
<geist> i think i knew that but forgot. got it
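The lseek(fd, 0, SEEK_CUR) idiom can be sketched as follows. `file_position` and `offset_after_write` are hypothetical names; the second function exists only to exercise the first on a scratch file under /tmp, which is an assumption about the environment.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Current file offset of fd, or (off_t)-1 on error (for example a pipe). */
static off_t file_position(int fd)
{
    return lseek(fd, 0, SEEK_CUR); /* moves by 0 bytes, returns the offset */
}

/* Writes len bytes to a fresh temp file and returns the resulting offset. */
static off_t offset_after_write(const char *buf, size_t len)
{
    char path[] = "/tmp/posdemoXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return (off_t)-1;
    unlink(path); /* the file disappears once fd is closed */

    off_t pos = (off_t)-1;
    if (write(fd, buf, len) == (ssize_t)len)
        pos = file_position(fd);
    close(fd);
    return pos;
}
```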
Persephone has quit [Ping timeout: 244 seconds]
<mjg> if anything i'm surprised by opposition to F_GETSIZE as a concept
<mjg> wanting *just* size is pretty standard before you mmap
<mjg> and paying for the entire stat buf to get there is definitely wasteful
<geist> i mean sure, but i dont think adding a new call just to optimize is necessarily worth it
<mjg> but now that i wrote it, a "just map the whole fucking file" flag to mmap would be great
<geist> based on that same premise all of the rest of all of the fields in fstat should get a syscall, etc
<geist> and then isn't it intrinsically implementation defined as to whether or not it helps?
<mjg> if a field is very frequently asked for, while the rest is ignored, then yes
<mjg> so far that's only size
<geist> what about time stamp? that might be extremely helpful (say for git, etc) to read in a single syscall
<geist> without a full stat
<mjg> i don't know if that is of practical use
<mjg> if one was to survey real world uses
Persephone has joined #osdev
<geist> i dunno, millions of time stamps for build systems/s is pretty important
<mjg> it may turn out a quarter of actual stat buf is used in practice for 99% of consumers
<mjg> and it can be populated without locks
<mjg> that i would call a win worth pursuing for sure
<mjg> in the meantime, i know for a fact doing fstat just to get the size is super common
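The common pattern mjg is referring to, a full fstat() whose only purpose is to learn the length to hand to mmap(), might be written like this. `map_whole_file` is a hypothetical helper name; this is a minimal sketch, not code from any of the kernels or libraries discussed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps the whole file read-only and stores its length in *len_out.
 * Returns NULL on failure or for an empty file. */
static void *map_whole_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st; /* the whole struct is filled in just to read st_size */
    void *p = NULL;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
            p = NULL;
        else
            *len_out = (size_t)st.st_size;
    }
    close(fd); /* the mapping stays valid after the fd is closed */
    return p;
}
```

The caller has to keep the length around, because munmap() needs it later.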
<geist> yah. mostly just playing devil's advocate
rpnx has joined #osdev
<geist> you're right, but i'd say a holistic view of the whole world is speed isn't always the most important thing
<geist> duplicated apis have cost, especially when they're around forever, etc etc
<mjg> see, so happens this functionality *cleans it up* in the kernel
<mjg> namely there are several cases in there which do vop_getattr just to get the size
<geist> not really, because there's now two completely different paths
<mjg> i'm deduping this shit in preparation to make it sensible
<geist> since you can't get rid of stat
<mjg> i'm saying the current code will end up calling vn_getsize and be shorter for it
* geist nods
<mjg> and there are several places
<mjg> but ye, at the bottom there will be a new routine
<mjg> which i don't consider to be a problem
<geist> (which OS are you talking about here anyway?)
<heat> freebsd
<geist> i forget you're primarily freebsd right?
<mjg> freebsd, as usual :p
<geist> right, so there's also that. if one of the big kernels adds it you're effectively forcing the other ones to as well
<mjg> i have not checked on linux on that front in quite a while
<geist> so maybe it's a win for freebsd but not linux or the other way around
<mjg> ye part of why i asked previously if *any os* already has it, i know linux does not
<mjg> instead shitty games get played with lseek
<mjg> by people who try to avoid fstat
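One form of the lseek game mentioned here is to seek to the end, note the offset, and seek back. The sketch below uses hypothetical names (`size_via_lseek`, `sizes_agree`) and compares the result against fstat() for a regular file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* File size via two seeks; restores the original offset afterwards.
 * Returns (off_t)-1 on error. */
static off_t size_via_lseek(int fd)
{
    off_t saved = lseek(fd, 0, SEEK_CUR);
    if (saved == (off_t)-1)
        return (off_t)-1;
    off_t end = lseek(fd, 0, SEEK_END);
    if (end == (off_t)-1)
        return (off_t)-1;
    if (lseek(fd, saved, SEEK_SET) == (off_t)-1)
        return (off_t)-1;
    return end;
}

/* Returns 1 if the lseek-derived size matches fstat() and the offset was
 * left untouched, 0 if not, -1 on error. */
static int sizes_agree(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    int result = -1;
    if (fstat(fd, &st) == 0) {
        off_t sz = size_via_lseek(fd);
        result = (sz == st.st_size && lseek(fd, 0, SEEK_CUR) == 0) ? 1 : 0;
    }
    close(fd);
    return result;
}
```

This takes three syscalls and briefly moves the offset, so any other user of the same open file description can observe the intermediate position.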
<heat> oh yeah i_size is funny
<heat> in linux, it's just a 64-bit read
<heat> no barriers no nothing
<heat> no locks either
<mjg> just atomic load
<heat> no
<mjg> quite a win if you ask me
<heat> just a load
<mjg> you mean READ_ONCE?
<bslsk05> ​elixir.bootlin.com: fs.h - include/linux/fs.h - Linux source code (v5.19.10) - Bootlin
<geist> depends on how it's written
<heat> return inode->i_size;
<heat> 32-bit does crazy shit with seqcounts to make sure things are updated atomically
<mjg> wut
<mjg> that's stupid
<mjg> i suspect they just did not clean it up yet
<geist> hang on, it also says below (in i_size_write()) that it needs locking
<geist> so that lines up with the general rules i'd expect (on 64bit at least)
<heat> it requires locking because of 32-bit
<geist> ie, the write has to either be atomic (with a builtin barrier) or under a lock, which will barrier when the lock is released
<mjg> they say the 32 bit needs locking to not fuck up seqlock
<geist> so that means the read on a weakly ordered cpu simultaneously can read something stale up to the point at which the lock is released
<geist> which works
<mjg> i mean seqcount
<geist> as long as there's some sort of guaranteed barrier on the writing side
<mjg> for 64 bit barriers are of no significance, you only need to make sure to store the value in one write
<mjg> which they are not doing
<geist> oh they are hella significant on weakly ordered machines
<geist> very very much a thing, but any reasonable lock has an implied barrier on acquire and release
<mjg> man
Iris_Persephone has joined #osdev
<mjg> 32-bit -- agreed
<mjg> 64-bit -- no
<geist> yes
<geist> 100% yes on 64bit
<mjg> what exactly would that barrier synchronize against?
<mjg> it's literally a load
<geist> a barrier on the *write*
<mjg> but if the reader does not have any locks whatsoever
<geist> ie, you write it without a barrier, it doesn't 'appear' to the reader until you do
<mjg> and only reads the size
<mjg> it is already 100% unsynchronized against writers
<mjg> no matter how many locks you plop in there
<mjg> weak cpu will *eventually* flush it, worst case when it unlocks the inode
<geist> yes, but i'm talking about whether or not you get a stale value *after* the write has occurred on another cpu
<geist> *thats what i'm saying*
<geist> we're violently agreeing
<mjg> i agree you can get a stale value
<mjg> but i'm saying that's not a problem here
* geist head desks
<mjg> you may as well have done the read just prior to the write
<geist> i'm saying the exactly same thing
<mjg> so there is no use for barriers
<mjg> for this purpose
<geist> *because the lock has a barrier*
Persephone has quit [Ping timeout: 252 seconds]
<mjg> well let me restate. i_size_write routine in 64-bit variant does not need to post any fences
<geist> okay. i'm tired of violently agreeing
<mjg> which i understood as the point of contention
<geist> yes.
<geist> no there's no contention, there never was
<mjg> ok, scratch it
<mjg> they defo need atomic store/loads there though
<geist> i'm saying 'yeah that's why this works' and then you're like 'no, it works because <same thing stated differently>'
<geist> anyway. fun times.
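A userspace C11 analogue of the point both sides ended up agreeing on: the reader takes no lock and may see a stale size, the writer is assumed to hold a lock whose release orders the store, and atomic accesses (as mjg asks for) keep the 64-bit value from tearing. `toy_inode` and the two functions are hypothetical names; this is not the Linux `i_size_read()`/`i_size_write()` code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy stand-in for an inode that only tracks a size. */
struct toy_inode {
    _Atomic int64_t size;
};

/* Writer: callers are assumed to hold the inode lock. The lock release
 * orders this store against later readers, so relaxed ordering is enough. */
static void toy_size_write(struct toy_inode *ino, int64_t size)
{
    atomic_store_explicit(&ino->size, size, memory_order_relaxed);
}

/* Lockless reader: may observe a stale value, but never a torn one. */
static int64_t toy_size_read(struct toy_inode *ino)
{
    return atomic_load_explicit(&ino->size, memory_order_relaxed);
}
```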
<mjg> so what about that mmap
<mjg> a flag to just map the whole file without providing explicit size
<mjg> i guess it runs into a problem if you want to munmap by hand later
xenos1984 has joined #osdev
<geist> right, i think that's the main issue
<mjg> would defo eliminate my usecase for F_GETSIZE
<geist> you dont get a notification of how big the mapping is
Iris_Persephone has quit [Ping timeout: 264 seconds]
<mjg> if mmap was returning a token of sort instead just an address this would be a non-issue
<mjg> damn you unix
<mjg> well token, handle, whatever
<geist> yah agreed, this is also an inconsistent api we have in zircon because of the need to try to be posixy
<geist> would have liked to have returned a handle to a new mapping, but since that's so incompatible with posixy apis it's too difficult to do
<mjg> perhaps a "whack the mapping starting at X" would be ok enough
<mjg> if userspace was fucking around with remapping that's their problem
<mjg> it only is guaranteed to work if they did not
<geist> the real barrier to handle based mappings is the posixy ability to protect() or unmap() in the middle of an existing mapping
<mjg> hence the above comment
<geist> that at worst case takes one mapping and turns it into 3
<mjg> 17:16 < mjg> it only is guaranteed to work if they did not
<geist> yeah i know.
<mjg> if they fuck with it, the syscall has undefined behavior
<mjg> personally i would unlink the binary
<mjg> :S
<geist> anyway we had to in zircon follow the posix model of making mappings be mostly a hidden object in the kernel, and let operations be range based
<geist> for better or worse
<geist> at best it's inconsistent with the rest of the handle based model that zircon has
<mjg> ye the more i look at optimizing the more i just don't like unix
Iris_Persephone has joined #osdev
<mjg> you would think all the slow hw they had would make for slim to the point interfaces with great room for optimization
<mjg> but it's the opposite
<mjg> they slam syscalls like they are free
<geist> indeed. a holistic view of these sort of things is to try to get userspace to operate smartly rather than just optimize what they do
<geist> though both are the case, sometimes stepping back and trying to point them in a better direction is the right solution
<geist> OTOH, in the posixy world i think that particular ship sailed, for at least most of the meaningful direction changes you could make
<mjg> well see the mmap flag idea
<geist> or at least things like ioring or whatnot are probably the only real meaningful style major changes you could do
<mjg> instant bummer
<geist> yah, well it's all bummers if you look at it that way :)
<mjg> i'm polish, so there is not much choice
<geist> for every design decision there are always downsides. that's a downside of the 'mmap doesn't return info about the mapping' choice that someone made like 30 years ago
<mjg> there is serious insanity in practice concerning credential management in the kernel
<geist> or if there was a sane api for 'give me info about the thing mapped <here>'
<mjg> there are workloads which keep spawning processes and setgid, setuid and so on
<mjg> all systems i know of handle this by allocating new creds from scratch
<geist> vs walking through /proc/self/smaps or whatnot like linux folks probably would say to do
<mjg> you end up with a fuckton of allocs for common idioms
<mjg> hmmmm
<mjg> how about an extra flag: unmap when the fd is closed
<clever> ooop, smaps, thats a new one!
<clever> ooo*
<mjg> oh heh i did not know about /proc/self/smaps_rollup
<clever> yet its been there since 3.8 at least, how have i not seen it before?
<heat> aw fuck
<clever> ah, that one is absent on 3.8
<geist> also /proc/self/maps, which is a more terse version
<heat> i was late with /proc/self/maps
<heat> it was going to be so funny
<clever> ive known of maps for ages
<mjg> oh environ
<mjg> this reminds me of this linux "philosophy"
<mjg> you know the trick to "setproctitle" where you move your args and env out of the way
<heat> unix philosophy everything is a file baby
<geist> yah slamming everything in /proc is at once really frustrating, and actually very nice for general user space hackery
<mjg> and then plop your custom stuff in there
<mjg> one of my fundamental problems with linux
<geist> i do love how you can load up linux on a machine and then with the shell basically get all of the info you'd ever want out of it
<mjg> someone comes up with a stupid hack, people copy paste it and that's the way to go now
<clever> heat: bash takes that one step further, you can redirect from /dev/tcp/ip/port
<geist> i dont know if it's a *good* idea, but it's pretty neat nonetheless
<clever> i was confused when i first saw that, because ls claims it doesnt exist
<heat> i know
<heat> it's bash bs
<geist> what sucks is accessing stuff programmatically. i like the BSD sysctl style api for certain things
<geist> where you just want a number
<mjg> i find the free form text files coming out of the kernel to be a problem
<mjg> ultimately you want this in json or some other serialisation non sensitive to whitespace
<geist> exactly. it's nice for shell users, bad for programs
<heat> and thats why bsd isn't true unix
<mjg> and preferably not want the kernel to do it
<clever> there is also a trick i found for environ, xargs -0 -n1 echo < /proc/15605/environ
<clever> xargs takes \0 separated strings, and passes them to echo, 1 at a time
<geist> yah xargs -0 is super powerful
<mjg> xargs -0 rox
<clever> best paired with `find -print0`
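For programs rather than shells, the same NUL-separated format is easy to walk by hand. The sketch below uses hypothetical helper names (`print_nul_separated`, `dump_environ_file`) and a fixed 64 KiB buffer, so longer environments would be truncated.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Prints each NUL-terminated entry in buf[0..len) on its own line and
 * returns how many entries were seen. */
static size_t print_nul_separated(const char *buf, size_t len)
{
    size_t count = 0, pos = 0;
    while (pos < len) {
        const char *end = memchr(buf + pos, '\0', len - pos);
        size_t n = end ? (size_t)(end - (buf + pos)) : len - pos;
        printf("%.*s\n", (int)n, buf + pos);
        count++;
        pos += n + 1; /* skip the entry and its terminator */
    }
    return count;
}

/* Reads a file such as /proc/<pid>/environ and prints its entries.
 * Returns the entry count, or -1 if the file cannot be opened. */
static long dump_environ_file(const char *path)
{
    static char buf[65536]; /* assumption: anything beyond 64 KiB is dropped */
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    size_t len = fread(buf, 1, sizeof buf, f);
    fclose(f);
    return (long)print_nul_separated(buf, len);
}
```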
<mjg> but also note there is no guarantee this is representative of environment used by the proc at hand
<heat> yea
<mjg> if glibc moved it around at best you see what it started with
<heat> hrm
<heat> does it look at the user stack?
<heat> I assumed it just saved it at exec time
<mjg> it is saved at exec time
<mjg> and then it just blindly derefs the area
<geist> i assume it's saved at exec time. this is what was initially passed in via whatever mechanism linux does
<mjg> if env moved, tough
<geist> possibly the user process literally reads its environment via this file?
<mjg> which is part of my issue with all this shit
<geist> (probably not for speed purposes)
<clever> another weird bit, is how a process can modify its own argv, to censor out passwords in `ps aux`
<mjg> i'm unaware of any real world code doing something like that
<clever> that implies /proc is peeking into the stack of another proc
<mjg> clever: except that's too late
<clever> yeah
<mjg> old unix security ideas
<clever> ive also seen that used less for security
<clever> worker threads reporting their status and client
<mjg> postgres is doing it
<mjg> this is a legit usecase
<clever> yep
<mjg> it was made sensible on freebsd: there was a func named setproctitle, initially it did not perform for shit
<mjg> always making a syscall
<clever> postgres 1346525 0.0 0.1 294808 18192 ? Ss Sep21 0:02 postgres: hydra hydra [local] idle
<mjg> then it was patched to "i'm going to store my title at this location"
<geist> huh /proc/self/stack doesn't have what i thought it would
<geist> unclear precisely what it holds
<mjg> it's the kernel stack
<mjg> check something blocked there
<geist> hmm, except i just did a hexdump -C on it and it was just a bunch of strings
<mjg> oh that's what you mean
<mjg> it's the backtrace
<mjg> from the kernel
<clever> yeah, the kernel walked its own stack, then did symbol lookups on it
<geist> oh oh
<geist> oh heh yeah. ksys_read... etc
<clever> [<0>] proc_pid_stack+0xa7/0x120
Iris_Persephone has quit [Ping timeout: 265 seconds]
<clever> but the offset within the function is present
<clever> it seems to also be censoring the actual addresses out, for kaslr reasons?
<geist> i figured it was like /proc/self/mem and held some sort of map of the user stack for the thread
<clever> so you can then convert that to a line#, if you had debug info
<clever> another one ive used is /proc/self/pagemap
<mjg> geist: right, they really should prefix everything u or k
<clever> on a 64bit system, its a uint64_t[] array
<clever> one slot for every page of virtual memory
<mjg> but another ship which has sailed
<clever> essentially giving you read-only access to the leaf nodes in the paging table
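Reading one slot of that array for an address in the calling process might look like the sketch below. The bit positions (bit 63 present, bit 62 swapped, low 55 bits PFN) follow the kernel's pagemap documentation as I understand it, and recent kernels report the PFN as zero to unprivileged readers. `pagemap_entry` and the `PM_*` macros are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_PRESENT  (UINT64_C(1) << 63)       /* page is in RAM */
#define PM_SWAPPED  (UINT64_C(1) << 62)       /* page is in swap */
#define PM_PFN_MASK ((UINT64_C(1) << 55) - 1) /* PFN, if present and visible */

/* Reads the pagemap entry covering vaddr in the calling process.
 * Returns 0 on success, -1 if pagemap is unavailable or the read fails. */
static int pagemap_entry(const void *vaddr, uint64_t *entry)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;
    /* One 8-byte slot per virtual page, indexed by virtual page number. */
    off_t off = (off_t)(((uintptr_t)vaddr / (uintptr_t)page) * sizeof(uint64_t));
    ssize_t n = pread(fd, entry, sizeof *entry, off);
    close(fd);
    return n == (ssize_t)sizeof *entry ? 0 : -1;
}
```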
<clever> but much to my annoyance, if you mmap MMIO, it shows up as 0 in that table
<clever> i spent months trying to figure out why mmio didnt work in one situation, only to discover it was the hardware
<heat> that's probably because mmio mappings don't have vm state in linux
<heat> they're just straight up mapped in the page tables
<mjg> clever: :)
<clever> in my case, the AXI bus has a "user" vs "kernel" bit on every memory transfer
<clever> and the hardware was configured to just deny MMIO from anything userland
<clever> so no amount of software debugging could fix it, lol
<clever> geist is the one that ultimately revealed that axi is even capable of doing that
<mjg> you could have run into a bug which flipped it tho
<clever> mjg: it was a bloody typo, in the official headers, for the enum that configures the hw
<mjg> i knew a guy who was writing bios for some board
<mjg> they kept running into weird crashes and they knew it's their fault
<mjg> but they blamed it on ram manufacturer
<mjg> and got away with it
<heat> damn right
<bslsk05> ​github.com: rpi-open-firmware/arm_control.h at master · librerpi/rpi-open-firmware · GitHub
<clever> mjg: code A here, blocks access from userland
Iris_Persephone has joined #osdev
<mjg> wait, so that *remains* broken?
<clever> i didnt fix the .h file at the time
<mjg> you are linking top of the tree
<heat> mjg, it doesn't seem to blindly deref the stack
<bslsk05> ​github.com: rpi-open-firmware/BCM2708ArmControl.cc at master · librerpi/rpi-open-firmware · GitHub
<clever> mjg: i instead fixed it by |'ing in the mislabeled flag
<clever> and then never got around to confirming what every bit does
<mjg> heat: environ support? last time i was patching that code i'm confident it was just accessing the pre-saved area
<clever> the code now works, but the header is still wrong
<mjg> heat: i mean it makes sure to not crash and whatnot
<heat> yeah
<mjg> heat: but it does not account the env moving around, should it happen
<heat> right
<heat> but it Just Works(tm)
<heat> sometimes
<mjg> works for the default state of not changing env vars
<mjg> but good luck debugging when someone did
<mjg> and it does not show up there
<mjg> ... and you are confident it would
<heat> also someone on the interwebs suggested this: strings /proc/self/environ
<mjg> [no it did not happen to me :>]
<heat> instead of your shitty xargs -0 ... workarounds
<mjg> it's not mine bro
<heat> well, someone's
<heat> oh shit
<clever> heat: but strings has some size limit, and may not detect A=1
<mjg> you just don't like xargs -0
<heat> I just re-found out about /proc/self/wchan
<heat> clever, strings -n 2
<mjg> you reminded me of a long standing problem, wonder if they fixed it
<mjg> you mmap something backed by nfs
<mjg> nfs server dies
<mjg> you take a fault on that mapping
<mjg> tools like ps and top will start hanging in an uninterruptible state trying to grab your mmap semaphore
<heat> lol
<mjg> as in they don't show squat and you can't kill them
<clever> mjg: ive killed my server before, just by coming home, lol
<clever> cacti was running `df -h` in a cronjob, to graph my free space
<clever> the laptop was an nfs server, mounted into that box
<mjg> so a typical sysadmin is 100% fucked here
<clever> when i take the laptop on a trip, `df -h` just hangs, and 1000's of them pile up, and slip into a swap coma
<clever> when the laptop returns home, every bloody `df -h` wakes up at once
<clever> and they all fight over the ram
<clever> system grinds to a total halt :P
<mjg> 17:49 < clever> the laptop was an nfs server, mounted into that box
<mjg> what in the actual fuck man
<mjg> are you a webdeveloper in your dayjob
<clever> mjg: the solution to all of those problems, use the soft flag when you mount nfs
<clever> if you use the hard mount flag, then nfs errors block forever, until the server returns
<clever> but if you use soft, network errors result in the syscall giving an error
<clever> and things become killable
<heat> that's not true
<clever> has the man page lied again?
<heat> nfs waits are always killable afaik
<mjg> which waits
<heat> locks, and waits in wait_queues
<mjg> it was not all of them when i had to deal with this shit 5 years ago
<mjg> in fact it was quite routine to get a crashdump with a dead nfs server in dmesg
<heat> well, your example involves a lock acquired in a non-killable way
<mjg> and a bunch of unkillable processes fucking around
<mjg> ye i'm saying there were cases where you would take a lock, go off cpu waiting for the nfs server
<mjg> and possibly other code in nfs would trip on it
<mjg> and then you are fucked
<mjg> it is plausible this is fixed now
<heat> well sure, that's possible
<heat> and a harder problem
<heat> mmap_sem aren't mutex_lock_killable'd
<mjg> it was not just mmap sem
<heat> (well, the rwsem)
<mjg> albeit that was the main culprit -- monitoring would stop working :->
Iris_Persephone has quit [Ping timeout: 264 seconds]
Iris_Persephone has joined #osdev
<clever> getting a bit more back on topic, ive been working on an ext4 driver lately, and i think the only major feature i'm currently missing, is the ability to recursively follow the extent tree
<clever> but part of the problem in testing, is that i need to create a complex extent tree
frkzoid has joined #osdev
<clever> and due to it being extent based, its not based on filesize, but fragmentation
<clever> my first thought, is to just set a torrent client loose on it
<clever> reducing the number of blocks per group may also help, since the block group metadata is sitting between each group
<clever> so having really tiny groups will force a max size onto fragments
Iris_Persephone has quit [Ping timeout: 265 seconds]
Iris_Persephone has joined #osdev
xenos1984 has quit [Ping timeout: 260 seconds]
xenos1984 has joined #osdev
Iris_Persephone has quit [Ping timeout: 250 seconds]
kof123 has quit [Ping timeout: 268 seconds]
Iris_Persephone has joined #osdev
Iris_Persephone has quit [Ping timeout: 264 seconds]
Iris_Persephone has joined #osdev
<geist> somewhere i wrote an app years ago that generates fragmented files
<geist> basically creates a crap ton of files and then keeps resizing them such that their blocks end up probably overlapping a lot
<geist> though you need to get pretty close to full disk utilization for it to work
<clever> i can see how an FS would space the new fragments out nicely, so they dont collide while growing
<bslsk05> ​pastebin.com: #include <errno.h>#include <fcntl.h>#include <limits.h>#include <stdio.h> - Pastebin.com
<geist> i have no warranty on it, and i wrote it like 20 years ago, on BeOS
<geist> so might need some fiddling
pretty_dumm_guy has joined #osdev
<geist> and probably assumes files aren't sparse
<clever> yeah, sparse would only be fought off with writing actual data
<geist> or using whatever fallocate() style call linux has. i dunno precisely what it bottoms out in
<geist> could probably implement this as a shell script too
<clever> it does at least compile on linux
<geist> anyway i remember it working pretty well
<geist> though like i said i think it only really works well if you size it such that the number of working files is pretty close to the full size of the disk
<geist> otherwise if all of them are just playing with 5% of the space, the fs impl may nicely space them out
<clever> i can just make a 512mb or 128mb disk image
<geist> anyway, see how that works. `filefrag -v` tells you what it ended up with
<geist> can fiddle with it and see what tunables work for you
<clever> yep
Iris_Persephone has quit [Ping timeout: 265 seconds]
bauen1 has quit [Ping timeout: 252 seconds]
bauen1 has joined #osdev
Iris_Persephone has joined #osdev
Iris_Persephone has quit [Ping timeout: 265 seconds]
Iris_Persephone has joined #osdev
Vercas6 has quit [Quit: Ping timeout (120 seconds)]
Vercas6 has joined #osdev
cross has joined #osdev
<mjg> you know, for the biggest system in the world and so much money behind it
<mjg> linux still manages to surprise me by how bad things can get
<mjg> in this episode i tried: perf record --all-kernel --call-graph dwarf
<mjg> perf report is losing its shit though
<mjg> ubuntu 20
pretty_dumm_guy has quit [Quit: WeeChat 3.5]
<heat> geist, the "full disk utilization" thing is sadge :/
<heat> I would really love a tool to generate some stupidly fragmented files that doesn't take that
<heat> also other stupidly stupid conditions like a directory with 100k files
<heat> also downright broken filesystems but that's more on the realm of fuzzers so :|
<geist> well you could fallocate a big file that chews up say 90% of the disk, then run this to generate fragmented files in the space outside of it
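The "chew up most of the disk first" step could be done with posix_fallocate(), which reserves real blocks rather than leaving a sparse hole. `reserve_space` is a hypothetical helper name, and the ballast path and size are up to the caller; this is only a sketch of that one step, not of the fragmentation tool itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates (or extends) path and reserves nbytes of real storage for it.
 * Returns 0 on success, an errno value otherwise. */
static int reserve_space(const char *path, off_t nbytes)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return errno;
    /* posix_fallocate returns the error number directly; it does not set errno. */
    int err = posix_fallocate(fd, 0, nbytes);
    close(fd);
    return err;
}
```

After the ballast file exists, the fragmentation generator only has the remaining free space to work with, and `filefrag -v` can be used to check what the test files ended up looking like.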
<geist> it's totally intended to be for small test disk images though
<geist> and/or in an era when a 2GB disk was big
<heat> I'm fairly sure I've seen syzkaller craft a broken filesystem and mount it as a loop device
<heat> also in my filesystem wishlist: xfstests
<heat> i remember i found a .c that had some sort of simple filesystem unit tests for unixish systems
<heat> wonder what that was
frkzoid has quit [Ping timeout: 244 seconds]
<geist> yah
<heat> ah yes, fsx.c
freakazoid332 has joined #osdev
<bslsk05> ​github.com: fstools/fsx.c at master · apple/fstools · GitHub
<heat> they also have an fstorture tool in there
<heat> I need those badly
vdamewood has joined #osdev
dude12312414 has quit [Remote host closed the connection]
dude12312414 has joined #osdev
gog has joined #osdev
<dh`> heat, you know about Impressions?
<dh`> it is a widget for generating sort-of-realistic fs images
<dh`> and while it has a bunch of issues, it is at least sort of usable
<heat> nope, never heard of it
<heat> what is "sort-of-realistic" here?
<dh`> it is a research artefact from some years back
<mjg> you reminded me of an AI which generates human faces
<bslsk05> ​thispersondoesnotexist.com: This Person Does Not Exist
<dh`> I'm not sure what the state of the code is at this point; I had some patches and I'm not sure if they ever got rolled in
<gog> hi
<heat> henlo
<dh`> sort-of-realistic means distributions of sizes and filename extensions and whatnot that are supposed to correspond to the real world
<zid> I just invented a filesystem
<zid> It's a B-tree. The end.
<zid> If your filename doesn't fit the invariants of a B-tree when you try to creat() it, it fails.
<heat> dh`, im not particularly interested in realistic workloads
<heat> I can just mkfs a filesystem from a sysroot and I get something "realistic" I think
<heat> s/workloads/images/
<dh`> well, whether or not you care about that, it is a tool for randomly populating images
<gog> stop inventing filesystems
<zid> It's the ultimate filesystem though
<gog> the null filesystem
<gog> always fails successfully
<zid> All filesystems are a superset of that one right
<gog> basically
<zid> {}, {{},{}}, etc
<heat> do page caches usually optimize for file holes?
<geist> directories are a mistake
<heat> as in mapping a zero page instead of an actual page
<geist> let the mount points be your directories
<heat> or not mapping anything at all
<geist> yes
<heat> i don't have that
<heat> :|
<geist> the zero page optimization? it's eventually worth it, though i guess it's not a deal breaker up front
<geist> you can alloc a page and fill it with zeros on demand
<heat> i have the zero page optimization, I just don't use it for the page cache (so, for inodes)
<geist> ah. probably only matters at map time. the page cache itself is probably unaware of it
<geist> it'd be simply a hole in the file though you might want some sort of sentinel, depends on how your page cache works
<geist> in the case of zircon we simply leave that as a hole in the vmo, which is the default state anyway (all holes, no pages)
<geist> and at map time if you read fault on it it just maps in the zero page instead of the nonexistent page (unless the pager source has one, etc etc)
<geist> if you write fault on it it's either a fresh new zero page or some pager behavior or a failure (RO mapping, etc)
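From user space, the visible half of this behaviour on a POSIX system is that a read fault on a file hole yields zeros. The sketch below creates a file that is one large hole, maps it, and checks that. `hole_reads_as_zero` is a hypothetical name, and the test cannot tell whether the kernel backed the reads with a shared zero page or with freshly zeroed pages.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates a temp file consisting of a single hole of len bytes (len > 0),
 * maps it read-only and scans it. Returns 1 if every byte reads as zero,
 * 0 if not, -1 on error. */
static int hole_reads_as_zero(size_t len)
{
    char path[] = "/tmp/holedemoXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)len) != 0) { /* extends the file without writing data */
        close(fd);
        return -1;
    }
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    int all_zero = 1;
    for (size_t i = 0; i < len; i++) { /* each new page triggers a read fault */
        if (p[i] != 0) {
            all_zero = 0;
            break;
        }
    }
    munmap(p, len);
    return all_zero;
}
```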
<heat> hummm
<heat> right now my vmos don't have that behavior
<heat> all read pages are sourced from whatever is backing the vmo
<heat> write is even more confusing
<heat> also, do your vmos' sizes need to be page aligned?
<mjg> zero page?
scaleww has joined #osdev
<mjg> i highly doubt it is really worth it
<heat> I'm currently requiring that because if I think of a vmo as a bag of pages, it doesn't make sense to have one with size 104, but rather 4096
<mjg> afair the linux folk were not sure either, but did not have enough info one way or the other to whack it
<heat> so linux does that?
<mjg> yea
<mjg> it leads to retarded discussions sometimes
<mjg> i think i ranted about it on this very channel few weeks back
<mjg> people who don't deal with the problem domain get very bad ideas concerning memory management, with ideas like "calloc is free bro cause zero_page"
<mjg> glibc has an optimization where first calloc returning given address is not touching it
<mjg> then they read from it and get the zero page
<heat> well, I'm talking specifically about non-anon file mappings here
<heat> I'm already using the zero page on anon mappings, for better or worse
<heat> i imagine you could save some decent memory by representing file holes with zero_pages and only giving them actual backing when written to
<mjg> fair, i don't know if that's good or bad
<mjg> does this look off?
<heat> yes
<heat> or no
<heat> one of those
<mjg> now that i said it, it does, but im gonna paste
vdamewood has quit [Read error: Connection reset by peer]
<dh`> I would think if you have zerofill pages at all that using them for nonexistent file pages wouldn't be hard
<dh`> and therefore probably worthwhile
vdamewood has joined #osdev
scaleww has quit [Quit: Leaving]
freakazoid332 has quit [Read error: Connection reset by peer]
<Iris_Persephone> So, another newbie question: Is it worth trying to implement POSIX as closely as possible, or is it something that depends on your goal for the system?
<heat> latter
GeDaMo has quit [Quit: Physics -> Chemistry -> Biology -> Intelligence -> ???]
<heat> posix will let you have a useful system much quicker
<heat> going non-posix gives you the freedom to do whatever you want
<heat> you can theoretically always implement posix as a compatibility layer but that's always iffy
SpikeHeron has quit [Quit: WeeChat 3.6]
radens has joined #osdev
<Iris_Persephone> Yeah, makes sense
<radens> does lk build with clang? If so, how do I tell it to use a clang toolchain and not gcc?
<heat> no
<Iris_Persephone> I was a little worried about standing out from all the other POSIX systems, but in hindsight that is a little silly
<heat> that's a hard question but afaik right now the answer is no
<bslsk05> ​github.com: [build] make LK buildable with LLVM/Clang by pcc · Pull Request #322 · littlekernel/lk · GitHub
<geist> yeah it's a tough problem to generically solve. that PR for example totally relies on a very specific toolchain for a specific arch
<radens> thanks heat
<geist> so i dont see a good way to take it upstream
<geist> i talked to pcc about it a bit but haven't heard much of an answer
<geist> it's a good example of 'works for that person for their use case' sort of PR i tend to get into LK
<geist> but i think there's probably a more low level way to do it that involves starting from a generic solution fundamentally in the build system, instead of just hacking it into the ARM side of the build system
<geist> anyway, bbiab
SpikeHeron has joined #osdev
<radens> It would be nice if it built with llvm. It's annoying to need another gcc toolchain for each arch, when I have a perfectly good llvm toolchain which should do it all.
<mrvn> isn't there some wrapper to make clang accept most gcc options?
<heat> yes
<heat> it's called clang
<mrvn> heat: that's not enough for lk
frkzoid has joined #osdev
<heat> it absolutely is
<heat> the lk PR I linked just makes a -Wno- option conditional on gcc (because LLVM doesn't have it I assume)
<mrvn> see
<heat> if you test if options exist a-la traditional kconfig or autoconf, you'll have no issues using clang or gcc
<mrvn> There are probably a bunch more of those for other archs.
<heat> doing CC=clang ./configure && make CC=clang Just Works(tm)
<mrvn> I rather trust geist there that it's a "works for me" solution.
<j`ey> which is fair enough for PRs I guess
<heat> it's like you choose to ignore what I say
<heat> what geist said and what I'm saying aren't mutually exclusive
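A sketch of the "test if options exist" approach heat mentions, in the spirit of autoconf-style checks: compile a trivial file with the flag and keep the flag only if the compiler accepts it. The helper names `compiler_accepts` and `pick_flags` are hypothetical and this is not LK's build system. One assumption worth flagging: GCC does not complain about an unknown `-Wno-foo` option on its own, so the sketch probes the positive `-Wfoo` form instead, the same trick Linux's `cc-disable-warning` uses.

```python
import os
import shutil
import subprocess
import tempfile


def compiler_accepts(cc, flag):
    """Return True if compiler `cc` compiles a trivial C file with `flag`."""
    if shutil.which(cc) is None:
        return False
    # GCC stays silent about unknown -Wno-foo options, so probe the positive
    # -Wfoo form instead.
    probe = "-W" + flag[len("-Wno-"):] if flag.startswith("-Wno-") else flag
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.c")
        with open(src, "w") as f:
            f.write("int probe_dummy;\n")
        # -Werror turns "unknown option" warnings into a failing exit status.
        result = subprocess.run(
            [cc, "-Werror", probe, "-c", src, "-o", os.path.join(tmp, "probe.o")],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    return result.returncode == 0


def pick_flags(cc, candidates):
    # Keep only the candidate flags this particular compiler understands.
    return [flag for flag in candidates if compiler_accepts(cc, flag)]
```

Something like `pick_flags("clang", ["-Wall", "-Wno-some-gcc-only-warning"])` would then drop the flag clang does not know, without the build system hard-coding which compiler is in use. It does nothing for the missing-header problem geist raises afterwards.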
frkzoid has quit [Ping timeout: 244 seconds]
xenos1984 has quit [Ping timeout: 250 seconds]
xenos1984 has joined #osdev
<geist> well what i really mean is that PR doesn't work on anything but a specific LLVM build he has
<geist> i looked at it, and it wont build with any of my llvms
<geist> talked to him in chat and he has some custom llvm for android i think
<geist> the difference being the presence of or the lack of particular built in headers
<geist> it gets into the 'is this a generic clang build or is this one thats intended to be for linux' sort of problem all over again
<geist> yes, clang can target any triple, but the existing headers may be specific to a particular one, etc
<heat> yup
<geist> so it's not just a drop in the bucket. the change i was thinking of was to split it into 'generic code changes made in the codebase that get it to build with clang' and 'the build system stuff to switch it', which i think requires a bit deeper cut
<geist> and then what i had seen was this PR was totally incompatible with your 'link with the compiler, not the linker' PR for Reasons i forget
<heat> did you ever look at it again?
<mrvn> geist: and there I was expecting that freestanding is a standard that you could rely on, silly me.
<geist> heat: i have not
<geist> i took it as 'huh yeah i should also look at this too' but have not done so
<heat> mrvn, how do you generate and install headers for thousands of different target triplet combos?
<mrvn> heat: that's what /usr/include/triplet/ is for.
<heat> you do realize that clang would have to generate those right? for every combination it supports
<heat> which in the default case, is a boatload of them
<mrvn> it does generate them for every combination it supports. Just not all at the same time.
<heat> it does not
<mrvn> If you want to build a compiler for everything then yes, you have to generate everything.
<mrvn> heat: whatever triplet you build clang for it will generate at least those headers.
<heat> yes
<mrvn> so it generates all of them, just not at the same time.
<heat> no, it generates a handful of them
<geist> anyway i think it's not too bad, but part of what i need is a local llvm build so i can CI this
<geist> i dont want it going in the tree if i cant build it locally
<mrvn> you do see the "not at the same time" part, right?
<geist> and i was stuck at 'can't build it locally' because of reasons
<geist> and that's where i dropped it and hadn't picked it up again
<heat> add llvm CI?
<geist> hmm?
<heat> add llvm to your ci
<geist> possibly. note this all takes time to implement
<geist> which i did not spend.
<geist> ie, the PR is not just a freebie, which is why i didn't accept it
<geist> because it's a new can of worms i'd like to solve more generally for LK
<heat> oh wow I was wrong
<heat> clang installs a single set of headers
<heat> and they're supposedly portable
<geist> yes and thats possibly the answer, but i have to sort it out because it didn't just work with the llvm i threw at it
<geist> and i know very little about it, so it's a learning curve for me. ie, i want to know what i'm checking in
<geist> and until i can at least locally test it i dont want to take it into the tree
<heat> how do I check out a PR?
<geist> it's a branch
<geist> usually i just fetch it and then either merge it locally or rebase it on top of yours
<geist> i think what i had decided is the build system actually needs 4 separate modes: gcc + ld as linker, gcc + gcc as linker, llvm + ld.bfd as linker, llvm + ld.lld as linker
<geist> and each are subtly different in at least a way the build system needs to understand
<geist> not an insurmountable problem but one that takes a solid day or two of work to map out
<mrvn> what about gold?
* geist shrugs. maybe?
<geist> question is if it's useful enough to warrant support for (or if it needs any particular support)
<geist> also note that linkers other than binutils tend to trip over things in my linker scripts
<geist> partially my fault, partially theirs, so i'd be much more inclined to do a simple 'clang the compiler, nothing else' support for phase 1
<geist> ie, clang + binutils
<geist> which is why i was asking pcc to split it into separate CLs
Iris_Persephone has quit [Ping timeout: 264 seconds]
vdamewood has quit [Read error: Connection reset by peer]
vdamewood has joined #osdev
Iris_Persephone has joined #osdev
Iris_Persephone has quit [Ping timeout: 264 seconds]
<heat> i need a unix guru rn
Iris_Persephone has joined #osdev
<heat> let's imagine a dangling symlink named 'a'
<heat> why should faccessat('a', ...) be valid if it's dangling and I didn't tell it to not follow symlinks?
<heat> and opens seem to work
<heat> does path resolution just open the symlink if it's dangling?
<heat> actually, hrm
<heat> I think I was misinterpreting the strace
<mjg> that should fail with ENOENT
<heat> yes
<heat> i was looking at it wrong
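A small check of the behaviour mjg describes: calls that follow symlinks treat a dangling link as nonexistent (ENOENT), and only no-follow calls such as lstat see the link itself. The sketch assumes a POSIX system where an unprivileged process can create symlinks; the helper name `probe_dangling_symlink` is hypothetical.

```python
import errno
import os
import tempfile


def probe_dangling_symlink():
    """Create a dangling symlink 'a' and report how follow/no-follow calls see it."""
    with tempfile.TemporaryDirectory() as tmp:
        link = os.path.join(tmp, "a")
        os.symlink(os.path.join(tmp, "missing-target"), link)

        # access() follows symlinks, so a dangling link looks nonexistent.
        follows_ok = os.access(link, os.F_OK)

        # open() also follows the link and fails with ENOENT.
        try:
            os.close(os.open(link, os.O_RDONLY))
            open_errno = 0
        except OSError as e:
            open_errno = e.errno

        # islink() uses lstat(), which does not follow, so it sees the link.
        is_link = os.path.islink(link)

    return follows_ok, open_errno, is_link
```

The three results correspond to the existence check, the open, and the no-follow lookup heat was trying to read out of the strace.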
Iris_Persephone has quit [Ping timeout: 244 seconds]
Iris_Persephone has joined #osdev
[itchyjunk] has joined #osdev
Vercas6 has quit [Quit: Ping timeout (120 seconds)]
Iris_Persephone has quit [Remote host closed the connection]
Iris_Persephone has joined #osdev
Vercas6 has joined #osdev
isaacwoods has quit [Quit: WeeChat 3.6]
netbsduser has quit [Remote host closed the connection]
Matt|home has quit [Ping timeout: 248 seconds]