ChanServ changed the topic of #armlinux to: ARM kernel talk [Upstream kernel, find your vendor forums for questions about their kernels] | https://libera.irclog.whitequark.org/armlinux
Pali has quit [Ping timeout: 276 seconds]
<HdkR> arnd: Did a small test. Takes anywhere from 10-100ms to steal the VA range (depending on the range). Which means that small utility applications like ls, echo, cat, etc., which take a fraction of that time to execve and quit, are actually doubling the time it takes to run, or worse :D
<HdkR> and that's on super fast M1 mind. Cortex will just be slower.
mraynal has quit [Remote host closed the connection]
mraynal has joined #armlinux
_whitelogger has joined #armlinux
Tokamak has quit [Read error: Connection reset by peer]
Tokamak has joined #armlinux
Tokamak_ has joined #armlinux
Tokamak has quit [Ping timeout: 256 seconds]
shailangsa has quit [Ping timeout: 252 seconds]
shailangsa_ has joined #armlinux
rexbcchen__ has joined #armlinux
rexbcchen_ has quit [Ping timeout: 256 seconds]
bps has joined #armlinux
bps has joined #armlinux
bps has quit [Changing host]
luispm has quit [Quit: Leaving]
frieder has joined #armlinux
<arnd> HdkR: let's try to come up with a syscall interface proposal for solving this properly then. How about an interface that limits the upper address for future mmap() calls done in this process, taking a 64-bit address as the argument?
sudeepholla has quit [Ping timeout: 250 seconds]
<arnd> there are a couple of details one has to decide for this: separate syscall vs prctl or something else; should the limit persist across an exec(); does this affect MAP_FIXED, or existing mappings; ...?
guillaume_g has joined #armlinux
<HdkR> arnd: Yea, it's a solution that I haven't fully checked on. Host side needs to be able to still do allocations outside of the restricted limits
<HdkR> I need to make a matrix of the four different configs and the problems that will be encountered
<HdkR> Userspace syscall dispatch might be able to resolve some of the problems, although that has other problems.
bps has quit [Ping timeout: 240 seconds]
<HdkR> Might come back to a set of range based memory allocation syscalls and also a prctl that limits the range when no range is provided.
nsaenz has joined #armlinux
djrscally has joined #armlinux
apritzel has joined #armlinux
cbeznea has joined #armlinux
alexels has joined #armlinux
bps has joined #armlinux
bps has quit [Changing host]
bps has joined #armlinux
luispm has joined #armlinux
alexels has quit [Quit: WeeChat 3.5]
sszy has joined #armlinux
monstr has joined #armlinux
alexels has joined #armlinux
<HdkR> arnd: Noodled on it a bit. Two prctls (get and set) and 3 new syscalls should cover everything. https://gist.github.com/Sonicadvance1/8c55565b2dbbbaef79bde800e40835d9
<HdkR> Unless I missed syscalls that can allocate memory anyway
Turingtoast has joined #armlinux
<arnd> HdkR: I don't see why we'd need both the prctl and a mmap_range(). In what scenario is prctl() followed by mmap() not enough?
<HdkR> ioctl
<HdkR> ioctl is always the trouble child
<HdkR> oh, mmap
<HdkR> 32-bit side of things, FEX can't use 32-bit VA space at all. So it is required to allocate outside of the restricted space.
<arnd> I thought both ioctl() and mmap() end up calling get_unmapped_area() in the kernel, so something that modifies the behavior of get_unmapped_area() to respect the range should work with any syscall that allocates a vma
<HdkR> aye, they do
<HdkR> But if we do something like prctl(SET_VA, {0, 1'0000'0000UL}); then FEX (host side allocation) does mmap(nullptr), that needs to stay out of the 4GB space.
<arnd> how common is it for FEX to do those allocations for the host side after the guest has started running?
<HdkR> Constantly
<HdkR> Need to allocate thread stacks, JIT executable memory, bunch of stuff
<arnd> could that use MAP_FIXED instead and manage its own address space if the non-fixed mappings are restricted to guest addresses?
<HdkR> It's a bit unwieldy since then I have to pessimistically allocate and spray the VA range on failure to allocate, which is basically what I'm doing at the start to grab all the VA range already
<HdkR> It's at least an improvement?
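The MAP_FIXED approach arnd suggests, with userspace managing its own address space, can be sketched as a probing loop. This is a minimal illustration, not FEX's allocator: `map_in_range()` is a hypothetical helper, and it uses `MAP_FIXED_NOREPLACE` (Linux 4.17 and later) so that a probe never clobbers an existing mapping. Addresses and the step are assumed to be page-aligned by the caller.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
/* Value from the generic Linux uapi headers (x86, arm64); other
 * architectures may differ. */
#define MAP_FIXED_NOREPLACE 0x100000
#endif

static_assert(sizeof(uintptr_t) == 8, "sketch assumes a 64-bit address space");

/*
 * Hypothetical helper: try to map `size` bytes somewhere in [lo, hi) by
 * probing candidate addresses `step` bytes apart, starting at `lo`.
 * `lo` and `step` must be page-aligned. Returns MAP_FAILED if no probe
 * succeeds.
 */
static void *map_in_range(uintptr_t lo, uintptr_t hi, size_t size, size_t step)
{
    if (size == 0 || step == 0 || hi <= lo || hi - lo < size)
        return MAP_FAILED;

    uintptr_t last = hi - size; /* highest start address that still fits */
    for (uintptr_t addr = lo;; addr += step) {
        void *p = mmap((void *)addr, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
                       -1, 0);
        if (p == (void *)addr)
            return p;
        if (p != MAP_FAILED)
            munmap(p, size); /* older kernel treated the address as a hint */
        if (last - addr < step)
            break; /* the next probe would leave the range */
    }
    return MAP_FAILED;
}
```

Each failed probe costs a syscall, which is the "pessimistically allocate and spray" overhead HdkR describes.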
<HdkR> Pain can be reduced a bit by ensuring I always allocate a VMA region at the start that can hold `/proc/self/maps` and string parse to find a new hole
<HdkR> Still pretty nasty
<HdkR> Allocating memory in my allocator to know how to allocate is already a pain, where letting the kernel choose between the prctl's VA limit range and a syscall-provided one seems the most slick to me :)
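The `/proc/self/maps` string parsing mentioned above could look roughly like the following. This is a sketch under assumptions: `find_hole()` is a hypothetical name, it works on text already read into a buffer (so it does no allocation of its own), and it relies on the kernel listing mappings in ascending address order.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan text in /proc/<pid>/maps format and return the
 * lowest address in [lo, hi) where `size` bytes fit between mappings, or 0
 * if no such gap exists. `lo` must be non-zero so that 0 can signal failure.
 */
static uint64_t find_hole(const char *maps, uint64_t lo, uint64_t hi,
                          uint64_t size)
{
    assert(size > 0 && lo > 0 && lo < hi);

    uint64_t cursor = lo; /* lowest address not yet known to be occupied */
    while (maps && *maps) {
        uint64_t start, end;
        /* Each line begins with "start-end" in hexadecimal. */
        if (sscanf(maps, "%" SCNx64 "-%" SCNx64, &start, &end) == 2) {
            if (start >= hi)
                break; /* every later mapping is above the range */
            if (start > cursor && start - cursor >= size)
                return cursor; /* gap in front of this mapping is big enough */
            if (end > cursor)
                cursor = end;
        }
        const char *nl = strchr(maps, '\n');
        maps = nl ? nl + 1 : NULL;
    }
    /* Check the gap between the last relevant mapping and `hi`. */
    return (cursor < hi && hi - cursor >= size) ? cursor : 0;
}
```

The result is only a candidate: another thread can map the hole before the caller does, so it would still need to be claimed with a non-replacing fixed mapping and retried on failure.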
headless has joined #armlinux
prabhakarlad has joined #armlinux
luispm has quit [Ping timeout: 252 seconds]
sudeepholla has joined #armlinux
<arnd> HdkR: I think a new mmap() syscall variant would likely hit resistance, in particular with the problem that adding another argument pushes it above the six-argument limit. I think your proposal leaves out the 'size' argument to stay below the limit, but I don't know how that works.
<arnd> adding an mmap flag is probably ok if we can find one that has a sensible meaning
sszy has quit [Ping timeout: 240 seconds]
<arnd> I suppose for your usecase, you would want a flag that flips between the 'below the limit' and 'above the limit' areas, not sure if that can be generalized to something that is useful for others as well
<HdkR> ah, just a typo if I missed an argument there
Turingtoast has quit [Quit: My iMac has gone to sleep. ZZZzzz…]
<HdkR> Would need to switch over to pointer with a struct then
<HdkR> As far as I'm aware, the MAP_ flag space already maxes out the 32-bit int argument, so that can't be abused anymore?
<arnd> the struct argument is another can of worms, we are still dealing with the clone3() syscall using a struct this way. for loongarch64 I proposed using only clone3() instead of clone(), but the chrome sandbox rejects that syscall because it lacks introspection
<HdkR> The prctl would at least remove the startup time problem. The range stuff could be tackled at a later date as a means to remove allocation overhead costs
luispm has joined #armlinux
<arnd> HdkR: with the mmap flags, would this be something we can encode in the upper bits of a 64-bit word, limiting the feature to 64-bit architectures, or do you see this being needed on 32-bit as well?
<arnd> I see that x86 has a MAP_32BIT flag to force allocation in the lower 4GB, so adding that on arm64 would be trivial, but I suspect that is not sufficient for what you need
<HdkR> Right, MAP_32BIT is a bit of a pain which is something that a _range variant would help with. Since it's not actually limited to the lower 32-bit, it's limiting to...3GB? of the lower 32-bit? Been a while since I've looked at that
<HdkR> Is it safe to encode something in the upper bits of a 64-bit word for flags? I thought it would be unsafe to extend the interface from `int` to `long` since garbage could be in the upper bits.
<HdkR> But I also don't see this being used for 32-bit
<arnd> good question about the syscall calling conventions, I don't actually know since this may be architecture specific. I think usually the syscall ABI requires user space to pass zero-padding register values because the ELF psABI requires the same. OTOH the kernel generally does not trust user space to do this and clears the upper halves again
<HdkR> So then if you convert the argument from 32-bit to 64-bit, it no longer zero extends and could get garbage? :)
<arnd> I think this is mainly done because of potential malicious code passing invalid (according to the psABI) arguments, not because any real application would have those bits set
<arnd> right, that is exactly the risk
<HdkR> Smells unsafe. Would likely need to introduce a new syscall anyway at that point
<arnd> I'd have to check again. The zero-padding is also applied inconsistently. I have this feeling that a number of architectures still fail to do it for 32-bit arguments, and if that is the case, extending it is probably fine
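The risk being discussed can be modelled in a few lines. This is only an illustration of the argument, not kernel code: `reg_t`, `flags_as_int()`, `flags_as_long()` and `upper_bits_valid()` are made-up names standing in for a raw 64-bit argument register and for handlers that declare the flags as `int` or `unsigned long`.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 64-bit argument register as the syscall entry code sees it. */
typedef uint64_t reg_t;

static_assert(sizeof(reg_t) == 8, "models a 64-bit register");

/* A handler declared with `int flags` only looks at the low 32 bits, so
 * whatever userspace left in the upper half is silently dropped. */
static uint32_t flags_as_int(reg_t reg)
{
    return (uint32_t)reg;
}

/* A handler declared with `unsigned long flags` sees the whole register,
 * including any stale upper bits. */
static uint64_t flags_as_long(reg_t reg)
{
    return reg;
}

/* Hypothetical check a widened interface could apply: accept the call only
 * if every set bit in the upper half is a known flag. */
static int upper_bits_valid(reg_t reg, uint64_t known_upper_flags)
{
    return (reg & UINT64_C(0xffffffff00000000) & ~known_upper_flags) == 0;
}
```

With a register value of `0xdeadbeef00000022`, the `int` view yields `0x22` while the `long` view exposes the garbage, which is why widening an existing argument is only safe if callers are known to zero-extend.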
<HdkR> Double checked, MAP_32BIT gives you 1GB of range between [0x4000'0000, 0x8000'0000)
<HdkR> Pretty painful flag
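For an emulator that has to reproduce this behaviour on another architecture, the window quoted above can be written down as a simple predicate. A sketch only: the constants come from HdkR's message, and `in_map32_window()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Window quoted in the discussion for x86-64 MAP_32BIT:
 * [0x4000'0000, 0x8000'0000). */
#define MAP32_WINDOW_START UINT64_C(0x40000000)
#define MAP32_WINDOW_END   UINT64_C(0x80000000)

static_assert(MAP32_WINDOW_END - MAP32_WINDOW_START == UINT64_C(1) << 30,
              "the window is exactly 1GB");

/* Hypothetical emulator-side check: does [addr, addr + len) sit entirely
 * inside the window? */
static bool in_map32_window(uint64_t addr, uint64_t len)
{
    return len != 0 &&
           addr >= MAP32_WINDOW_START &&
           addr < MAP32_WINDOW_END &&
           len <= MAP32_WINDOW_END - addr;
}
```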
headless has quit [Quit: Konversation terminated!]
<arnd> that is completely weird. I see it as well now, but I have no idea why it got defined like this.
<HdkR> Some historical thing
<HdkR> Things rely on it, so I need to emulate it :|
<HdkR> I believe it gives you the maximum range for RIP relative addressing. Only has an int32_t of range or something
<HdkR> Which hey, a mmap_range could give applications that wherever in VA. Just saying ;)
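The int32_t reach mentioned above can be made concrete. On x86-64, RIP-relative operands and rel32 branches encode a signed 32-bit displacement measured from the address of the next instruction; the helper names below are hypothetical, and the sketch assumes the usual two's-complement conversion that GCC and Clang provide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if `target` is within a signed 32-bit displacement of `next_insn`,
 * the address of the instruction following the one being encoded. */
static bool reachable_rel32(uint64_t next_insn, uint64_t target)
{
    /* The subtraction wraps modulo 2^64; reinterpreting the result as
     * signed gives the displacement. */
    int64_t disp = (int64_t)(target - next_insn);
    return disp >= INT32_MIN && disp <= INT32_MAX;
}

/* Displacement to emit; the caller must have checked reachability. */
static int32_t encode_rel32(uint64_t next_insn, uint64_t target)
{
    assert(reachable_rel32(next_insn, target));
    return (int32_t)(int64_t)(target - next_insn);
}
```

A JIT that keeps code and data within 2GB of each other, wherever that block sits in the address space, always passes this check, which is the point HdkR makes about a range-based mmap.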
<arnd> I see it was introduced in the early days of x86_64 to fix "the X server crashes."
<HdkR> Interesting
<HdkR> I know Source2 game engine and Mono's trampoline generator rely on it
<HdkR> Mono's JIT allocates everything in the lower 4GB of memory just for this purpose
<arnd> HdkR: so MONO just uses PER_LINUX_32BIT?
<HdkR> I don't think it uses the personality flag, just allocates all of its VM JIT space in 32-bit space.
<HdkR> Yea, just checked the source. Doesn't use personality
luispm has quit [Ping timeout: 240 seconds]
<arnd> I looked too much at the personality code again, this was clearly a mistake
luispm has joined #armlinux
<arnd> I see that the ADDR_LIMIT_32BIT flag in personality only has an effect on Alpha, where it limits the mmap() addresses to the lower 2GB (31 bit). On arm32, the same flag is used internally to switch between 26-bit and 32-bit mode, so it allows allocations beyond the lower 64MB
<HdkR> FEX-Emu doesn't even emulate it yet because it's a nightmare.
<arnd> makes sense
<HdkR> I'm sure I'll hit some program written in the 90s that requires it at some point.
<arnd> so the elf loader has a SET_PERSONALITY() call into arch code that sets the initial personality based on the binary format, but the address limit is not even part of that on x86. Instead when called for a 32-bit binary it sets the address limit to 32 bit using set_thread_flag(TIF_ADDR32), which cannot be overridden through personality (this is usually a good thing)
<arnd> but for a task that has the TIF_ADDR32 flag set (ia32 or x32), get_unmapped_area() uses either the lower 3GB or the lower 4GB based on the actual user-set personality flags
<HdkR> The address limit is also chosen there depending on the syscall entrypoint. Which on x86-64 can be `syscall;` or `int 0x80`
<HdkR> I guess also x32 by doing `syscall` with bit 30 of the syscall number set.
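The bit-30 convention for x32 can be shown with two small helpers. The mask matches `__X32_SYSCALL_BIT` from the x86 uapi headers; the helper names are hypothetical and only illustrate how an emulator might classify an incoming syscall number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 30 of the syscall number marks an x32 call made via `syscall`. */
#define X32_SYSCALL_BIT UINT32_C(0x40000000)

static_assert(X32_SYSCALL_BIT == UINT32_C(1) << 30, "x32 marker is bit 30");

/* True if the number was issued with the x32 marker set. */
static bool is_x32_syscall(uint32_t nr)
{
    return (nr & X32_SYSCALL_BIT) != 0;
}

/* Syscall number with the x32 marker stripped. */
static uint32_t x32_base_nr(uint32_t nr)
{
    return nr & ~X32_SYSCALL_BIT;
}
```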
sudeepholla has quit [Ping timeout: 240 seconds]
sudeepholla has joined #armlinux
headless has joined #armlinux
sudeepholla has quit [Quit: Ex-Chat]
sszy has joined #armlinux
<HdkR> arnd: I guess the flag range inversion idea is just easier since flags are passed all the way down to `get_unmapped_area()` region already, so extending to 64-bit flags is easy?
<HdkR> And actually, still interested in why pointers to structs are disliked. iovec seems to be thoroughly used?
<arnd> HdkR: I think the iovec case is fairly limited. For clone3(), there are a lot of specific options that a sandbox may want to forbid with seccomp/bpf, or it may want to allow only a very specific subset.
<arnd> mmap() is obviously in the same category as clone3() here, it just does too many things depending on the specific fd, flags, prot and even addr arguments
<HdkR> Userspace syscall dispatch to the rescue?
<HdkR> Clone is a bit painful to handle in that model but definitely possible
torez has joined #armlinux
headless has quit [Quit: Konversation terminated!]
sszy has quit [Ping timeout: 240 seconds]
<hanetzer> question: is *any* relatively recent arm chip capable of making use of trusted firmware or does it need to be written on the tin?
sszy has joined #armlinux
<arnd> hanetzer: apple m1 is an example of a modern core without EL3 support
<hanetzer> gotcha.
<arnd> hanetzer: The Arm cortex cores all support EL3, but a SoC implementation might prevent it from running TF-A, e.g. if they ship with another secure monitor that is started from maskrom.
<hanetzer> ah, so all the cpus are capable, but the socs they're in may disable it.
<hanetzer> like most amd/intel chips are generally capable of virt stuff but may have it fused off
<arnd> hanetzer: not exactly. In this case it's more that you can run at most one secure monitor. If you run TF-A, you can't pick a different one and vice versa.
<hanetzer> ahhh.
<hanetzer> well, I can tell you the maskrom doesn't start one on this chip :P
sszy has quit [Ping timeout: 240 seconds]
<arnd> In Apple's case, they don't have the support at all, same for the original APM X-Gene. I think everything else has support.
<hanetzer> huh. neat.
sszy has joined #armlinux
Amit_T has joined #armlinux
sszy has quit [Quit: No Ping reply in 180 seconds.]
sszy has joined #armlinux
XV8 has joined #armlinux
<arnd> HdkR: the arm64 handling of 52-bit VA space is funny: if the 'addr' argument to mmap() is smaller than 1<<48 (including NULL), then it will always return an address within the 48-bit space, otherwise it tries the higher addresses first
<arnd> according to the man page, the 'addr' is meant to be just a hint, but this sounds like we could extend the meaning to be a little more strict for other limits as well, possibly with a new flag to force such behavior
<arnd> we should clearly not reuse the MAP_32BIT name for it, but the 0x40 bit in the flags word is not used for anything else on arm64 ;-)
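The hint behaviour arnd describes for arm64 can be modelled as below. This follows the wording in the discussion rather than the kernel source, so the exact boundary handling at `1 << 48` is an assumption; `search_ceiling()` and `map_prefer_high()` are hypothetical names. On a kernel or CPU without the larger address space, the hinted call simply returns an ordinary address.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define VA_48_LIMIT (UINT64_C(1) << 48)
#define VA_52_LIMIT (UINT64_C(1) << 52)

static_assert(sizeof(void *) == 8, "sketch assumes a 64-bit address space");

/* Model only: upper bound of the area searched for a given mmap() hint.
 * Hints below 1 << 48 (including NULL) keep results in the 48-bit space. */
static uint64_t search_ceiling(uint64_t hint)
{
    return hint < VA_48_LIMIT ? VA_48_LIMIT : VA_52_LIMIT;
}

/* Ask for memory with a hint above the 48-bit boundary, opting in to the
 * larger space where it exists. The hint is not binding. */
static void *map_prefer_high(size_t len)
{
    void *hint = (void *)(uintptr_t)(VA_52_LIMIT - (UINT64_C(1) << 40));
    return mmap(hint, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

A stricter variant of this, where the hint or a new flag bounds the result instead of merely suggesting it, is what the proposal above amounts to.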
sszy has quit [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
<maz> we should have an auction system for these flag bits! :D
<mrutland> mmap_max_bid
alexels has quit [Ping timeout: 256 seconds]
alexels has joined #armlinux
<hanetzer> here's a question. when doing include/dt-bindings/blah/baz.h, is it legal to use enums instead of #define's?
<maz> hanetzer: if dtc understands them, sure. but I doubt this is the case.
<hanetzer> hmm. I wonder.
<hanetzer> maybe it should be made to. seems useful.
<HdkR> arnd: Yea, x86's 57-bit VA also does the same hinting thing to ensure applications don't accidentally fall in to using the larger VA space. Which is why I figured a bit more explicit range hinting isn't that crazy.
alexels has left #armlinux [WeeChat 3.5]
headless has joined #armlinux
djrscally has quit [Ping timeout: 256 seconds]
frieder has quit [Remote host closed the connection]
jlinton has quit [Quit: Client closed]
cbeznea has quit [Quit: Leaving.]
luispm has quit [Quit: Leaving]
Pali has joined #armlinux
Tokamak_ has quit [Ping timeout: 246 seconds]
Tokamak has joined #armlinux
Tokamak_ has joined #armlinux
Tokamak has quit [Ping timeout: 256 seconds]
apritzel has quit [Ping timeout: 260 seconds]
sudeepholla has joined #armlinux
sudeepholla has quit [Ping timeout: 256 seconds]
sudeepholla has joined #armlinux
jeeeun has quit [Quit: The Lounge - https://thelounge.chat]
jeeeun has joined #armlinux
Tokamak_ has quit [Ping timeout: 240 seconds]
Tokamak has joined #armlinux
prabhakarlad has quit [Quit: Client closed]
monstr has quit [Remote host closed the connection]
XV8 has quit [Ping timeout: 240 seconds]
XV8 has joined #armlinux
bps has quit [Ping timeout: 240 seconds]
Turingtoast has joined #armlinux
Turingtoast has quit [Client Quit]
bps has joined #armlinux
bps has quit [Changing host]
bps has joined #armlinux
rexbcchen__ has quit [Read error: Connection reset by peer]
rexbcchen__ has joined #armlinux
Amit_T has quit [Quit: Leaving]
djrscally has joined #armlinux
headless has quit [Quit: Konversation terminated!]
bps has quit [Ping timeout: 252 seconds]
torez has quit [Quit: torez]
apritzel has joined #armlinux
nsaenz has quit [Quit: Leaving]
djrscally has quit [Ping timeout: 246 seconds]
mag has quit [Remote host closed the connection]
mag has joined #armlinux
guillaume_g has quit [Ping timeout: 248 seconds]
mag has quit [Remote host closed the connection]
guillaume_g has joined #armlinux
mag has joined #armlinux
Tokamak has quit [Ping timeout: 246 seconds]
guillaume_g has quit [Ping timeout: 260 seconds]
apritzel has quit [Ping timeout: 252 seconds]
Tokamak has joined #armlinux