<conchuod>
I really do want something "server" grade, by which I mean honeycomb lx2 level. So not really server grade at all, but a vast step up from the SBC/embedded SoCs we have now...
epony has quit [Ping timeout: 268 seconds]
<rneese>
sifive unmatched is about it currently
<rneese>
if you can find one
<conchuod>
sifive unmatched is not even in the same ballpark I'm afraid.
<conchuod>
I have one ;)
<conchuod>
There's just no way that I am going to waste my time compiling anything significant natively on it!
<conchuod>
Right. Pure grunt on its own is not what is important, it's grunt *plus* all of the above
bauruine has quit [Remote host closed the connection]
<rneese>
I build on arm64 for riscv and it's a lot faster than x86
<rneese>
you might look at learning docker
<rneese>
it's a lot faster for builds
<rneese>
I currently build on a mac m1 in docker in 5 min for a cli img
<rneese>
and 20 min for a full-on desktop img
<q66>
you are not making any sense
<q66>
whether you build on arm64 or on x86 does not matter, if you have x86 that is faster than the arm it will build faster
<q66>
and docker is just a wrapper around namespaces and similar
<rneese>
but it speeds things up if done correctly
<q66>
it doesn't speed anything up
<rneese>
it removes using caches which slow things down
<q66>
????
<rneese>
it's all about the settings in a docker
<q66>
docker does not speed anything up, all it potentially does is provide a different host environment
<q66>
there are many ways you can get that
<q66>
it likewise does not help if you want to do native (non-cross) riscv builds
<rneese>
explain how I went from a 45 min build to a 30 min build for a full desktop img for riscv64 in our builder in docker, and that includes a kernel build, on a mac m1
<rneese>
I will get the guy who set up our docker to outline it
<q66>
compared to doing what
<rneese>
same build on a x86 server with 64 gig ram and 24 cores
<rneese>
docker on m1 speeds things up a lot
<q66>
you are not making any sense
<rneese>
1 min
<q66>
you are comparing build speeds between two completely different computers
<q66>
docker has nothing to do with it
<rneese>
building without docker on arm64 is only 5% faster
<q66>
then you need to describe how you build on arm64 without docker
<rneese>
not going to fight about this. I will run a full cli build in a docker with time output; it builds a kernel and the base cli img
<q66>
docker on mac is pretty much just a vm (it needs to be, because the mac host does not have the linux syscall layer), while on linux it's a bunch of tooling around kernel namespaces
<q66>
it never makes anything faster by itself
<q66>
if anything the goal is so that it's not much slower than a native run
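As an aside, a minimal sketch of the kernel primitive q66 is referring to: on Linux a container runtime is essentially tooling around unshare(2)/clone(2) namespaces, which isolate but do not accelerate anything. The program below is illustrative only (it needs root or a user namespace to succeed) and is not how Docker itself is invoked.

/* Minimal namespace demo: detach into new mount, UTS and IPC namespaces,
 * then exec a shell.  This is the isolation layer Docker wraps on Linux;
 * nothing here makes compilation faster. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Requires CAP_SYS_ADMIN (or a user namespace) to succeed. */
    if (unshare(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC) != 0) {
        perror("unshare");
        return 1;
    }
    /* The process now has its own mount table, hostname and IPC objects. */
    execlp("/bin/sh", "sh", (char *)NULL);
    perror("execlp");
    return 1;
}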
BootLayer has quit [Quit: Leaving]
<rneese>
[🐳|🌱] Runtime [ 3 min ]
<q66>
if something is faster inside, it can be due to various factors, e.g. the toolchain within the container environment being different
<q66>
or the tooling having some overhead within a mac host
<rneese>
and I have a full cli img ready to go
<conchuod>
Also
<conchuod>
>same build on a x86 server with 64 gig ram and 24 cores
<conchuod>
Is the m1 faster than said server?
<conchuod>
Just out of curiosity
<q66>
it could be depending on the cores
<rneese>
the m1 is a mac mini, 8 cores, 16 gig ram
<q66>
the build does not always use all of them, if you are building a whole system image it likely spends considerable time in i/o and single-threaded stuff like running configure scripts and installing
<q66>
mac arm hardware is supposed to have pretty good single-core performance
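To put rough numbers on that (purely illustrative, not measured): by Amdahl's law the speedup from n cores depends heavily on how much of the build is actually parallel, so a few fast cores can beat many slow ones.

    S(n) = 1 / ((1 - p) + p/n)        (Amdahl's law, p = parallel fraction)
    assuming p = 0.7 for a whole-image build:  S(24) ≈ 3.0,  S(8) ≈ 2.6
    if each of the 8 cores is ~1.5x faster single-threaded:  2.6 * 1.5 ≈ 3.9 > 3.0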
<q66>
in any case saying that docker somehow on its own magically improves performance or is even capable of doing so is completely off the mark and misleading
<conchuod>
I don't have any decent arm64 hardware, but I would be pretty sure that config for config my x86 box is going to be faster than anything shy of several $1000
<conchuod>
Certainly it'd be faster than running docker in a vm from macos
<q66>
conchuod: it really depends, like, lots of x86 server hw is xeon systems, say, 2 cpus each 12 cores at 2 ghz
<q66>
and then the performance can be pretty variable
<q66>
anything relying on single-threaded perf is not gonna be great because low clocks
<conchuod>
Yah, I dunno what x86 box rneese is using
<q66>
and then if you have a multisocket system you can have e.g. NUMA memory affinity stuff screw it up
<conchuod>
But my x86 box is pretty good, so I don't think building on arm64 would serve any purpose other than making things clunkier.
<q66>
probably not yeah
<q66>
there is no inherent advantage to using arm64, might as well use the fastest hw you have regardless of architecture
<q66>
if you are cross-compiling anyway
<q66>
if doing emulation x86 is gonna be the fastest again because qemu's tcg jit is usually fastest on x86 hosts
<q66>
it's definitely quite a bit faster than on my power9 boxes for instance
<rneese>
the other builder is in use and access blocked
<q66>
yeah so it's an old xeon with low clocks and multisocket NUMA config
<rneese>
that one is a xeon, the other is a diff unit
<rneese>
I just built a xfce riscv img in a docker
<rneese>
[🐳|🌱] Runtime [ 5 min ]
<rneese>
the whale is how we mark we are using docker
<conchuod>
tbh, without knowing *exactly* what was compiled/recompiled that 5 mins is pretty meaningless
<mps>
building a kernel on an M1 Pro with linux installed is really fast, 3 minutes
<mps>
on Ampere Altra it takes a lot more time
<conchuod>
The apple arm64 stuff is great, I don't think there's much point denying that.
<conchuod>
But whether it is or not is mostly unrelated to wanting native build infra
<mps>
conchuod: right, I agree
<mps>
I expected that building riscv with qemu-user would be significantly faster on arm64 than on x86_64, but in my test there was no big difference
<q66>
llvm build in my distro in qemu-user on ryzen 5950x still takes all night and it's not even lto
<q66>
on the same cpu for x86 it takes like 30 mins and that's with lto :)
<q66>
on my power9 boxes it's between an hour and two, depending on the cpu
<conchuod>
And natively on a 5950x that takes like 30, yeah
<q66>
on the aarch64 builder i have for arm native packages (which is ampere emag) it takes about 4
<qwestion>
is there a sunxi-like chart for mainlining effort of c910 (and others?)? how can one get paid to contribute to upstreaming foss drivers and such?
<q66>
it used to take longer, then i spent a bunch of time digging around musl and ended up patching in scudo as the default libc allocator
<q66>
and everything got faster
<conchuod>
smaeul: or icenowy might know about that sort of thing qwestion
<mps>
q66: so scudo is faster than default musl allocator iiuc correctly
<conchuod>
qwestion: do people get paid for upstreaming that stuff?
<jn>
via employment at companies like Bootlin, maybe
<qwestion>
does icenowy chat here?
<conchuod>
Pretty sure I've seen them here, although I may be mis-remembering
<rneese>
sorry, got called away; found a bug in our builder for desktops
<rneese>
bbl
pedja has joined #riscv
Starfoxxes has quit [Ping timeout: 265 seconds]
Starfoxxes has joined #riscv
BootLayer has joined #riscv
<smaeul>
conchuod: there have been a few crowdfunded or manufacturer-sponsored projects in the past, like the video decoder driver, but otherwise nobody doing the upstreaming gets paid
<smaeul>
qwestion: very few people have C910 hardware (and I have none), so I don't think any community upstreaming effort has started. you could ask Guo Ren if Alibaba has plans to do any first-party upstreaming
<q66>
mps: significantly, yeah
<q66>
especially with threads
<q66>
in musl i have a bit of a special configuration that is tuned to be even faster and somewhat leaner
<q66>
which implements the configuration and frontend
<mps>
q66: ah. chimera linux uses it by default?
<q66>
yeah i made it default because it performed universally well
<mps>
interesting. maybe worth a look to see if it could be added to alpine
<q66>
i have it default to primary32 allocator (which is normally only used on 32-bit archs in default configs, but primary64 reserves tons of virtual memory and i didn't find the performance to be much if at all better)
<q66>
and disabled secondary cache
<q66>
because secondary cache was causing some janky behavior with qemu-user
<q66>
actually using primary64 with qemu-user is kind of a nope too
ldevulder has joined #riscv
<q66>
because reserving 130GB of virtual memory under qemu causes each startup to take like 500ms
<q66>
which results in degraded performance with anything emulated
<q66>
the number can be tuned, but i couldn't get it under 8GB without it becoming unreliable
<q66>
at 8GB the cost wasn't huge but i still didn't like it
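For readers unfamiliar with what "reserving virtual memory" means here: primary64-style allocators mmap a huge PROT_NONE region up front and commit pages lazily, and under qemu-user that guest reservation has to be handled by the emulator at startup, which is where the cost q66 describes comes from. A minimal sketch (the 130 GiB figure is taken from the chat; assumes a 64-bit host):

/* Reserve a large span of address space without committing any RAM.
 * This is roughly what a primary64 size-class allocator does at init;
 * pages only become usable after a later mprotect()/commit. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t reserve = (size_t)130 << 30;          /* ~130 GiB of address space */
    void *base = mmap(NULL, reserve, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("reserved %zu GiB of virtual memory at %p\n", reserve >> 30, base);
    /* A real allocator would mprotect() sub-ranges to PROT_READ|PROT_WRITE
     * as size classes get populated, instead of unmapping right away. */
    munmap(base, reserve);
    return 0;
}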
<mps>
q66: thanks for explanation
<q66>
mps: alpine is currently using scudo with lld
<q66>
because standard allocator performance makes linking take 3x as long
<q66>
but it uses the .so that comes with llvm
<q66>
so it's unusable on arm apparently
<conchuod>
smaeul: yah, I figured none of it was first party - but I figured at least some of the driver support would be via the various linux consultancy places
<q66>
i successfully use my in-libc version on all archs
<q66>
which is currently aarch64, ppc64le, riscv64 and x86_64
<mps>
q66: yes, I see scudo-malloc pkg in alpine but never looked for what is used
<q66>
replacing allocator in musl is kind of a tricky thing btw, you can't do it with just any allocator
<q66>
most of them seem to rely on thread_local being functional
<mps>
hm, right
<q66>
which is not the case in musl, libc.so is not allowed to contain tls because the dynamic linker does not set it up till later (and the tls is itself malloc'd) and does not handle ELF TLS for itself
<q66>
scudo can be made to work because it's very configurable at build-time and consists of several components that you can mix and match
<q66>
so all it takes is implementing a custom tsd registry that does not rely on thread_local
<q66>
in my case i just mmap the memory for it and then store a pointer in each struct pthread
<mps>
aha, looks too complicated for me
<q66>
well, i allocate a 64k-sized chunk and then split it into several tsd's and give them out/recycle as needed
<q66>
64k because that's the largest standard page size you will run into
<q66>
and mmap can only deal with pages
<mps>
yes
<q66>
the actual size of the registry struct is maybe 6k or so
<q66>
might as well pack it and reduce waste
<mps>
q66: thanks again for the explanation. now I understand from a "bird's eye view"
jacklsw has joined #riscv
<q66>
(and recycling means not having to map and unmap all the time)
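A rough sketch of the recycling scheme described above, under stated assumptions: one mmap'd 64 KiB chunk carved into fixed-size TSD slots kept on a free list, handed out per thread and returned on exit instead of being unmapped. The 8 KiB slot size (the registry struct is said to be ~6 KiB), the slot payload and the function names are all illustrative; the real code lives in q66's musl/scudo glue and stores the pointer in struct pthread.

/* One 64 KiB mapping split into fixed-size TSD slots that are recycled
 * rather than mapped/unmapped per thread.  Slot size, payload and names
 * are illustrative. */
#include <pthread.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define CHUNK_SIZE (64 * 1024)   /* largest standard page size in practice */
#define SLOT_SIZE  8192          /* ~6 KiB registry struct, padded */
#define SLOT_COUNT (CHUNK_SIZE / SLOT_SIZE)

struct tsd_slot {
    struct tsd_slot *next_free;  /* valid only while on the free list */
    /* ... per-thread allocator caches would live here ... */
};

static pthread_mutex_t tsd_lock = PTHREAD_MUTEX_INITIALIZER;
static struct tsd_slot *free_list;
static char *chunk;

/* Hand out a slot; the caller stashes the pointer in its struct pthread
 * instead of relying on thread_local / ELF TLS. */
static struct tsd_slot *tsd_get(void)
{
    struct tsd_slot *s = NULL;
    pthread_mutex_lock(&tsd_lock);
    if (!chunk) {
        void *p = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != MAP_FAILED) {
            chunk = p;
            for (size_t i = 0; i < SLOT_COUNT; i++) {
                struct tsd_slot *slot =
                    (struct tsd_slot *)(chunk + i * SLOT_SIZE);
                slot->next_free = free_list;
                free_list = slot;
            }
        }
    }
    if (free_list) {
        s = free_list;
        free_list = s->next_free;
    }
    pthread_mutex_unlock(&tsd_lock);
    return s;
}

/* Recycle on thread exit: clear the slot and put it back, no munmap. */
static void tsd_put(struct tsd_slot *s)
{
    pthread_mutex_lock(&tsd_lock);
    memset(s, 0, SLOT_SIZE);
    s->next_free = free_list;
    free_list = s;
    pthread_mutex_unlock(&tsd_lock);
}

int main(void)
{
    /* Tiny usage example: take two slots, return them for reuse. */
    struct tsd_slot *a = tsd_get();
    struct tsd_slot *b = tsd_get();
    tsd_put(a);
    tsd_put(b);
    return 0;
}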
<q66>
np
jack_lsw has joined #riscv
jack_lsw has quit [Client Quit]
john1 has joined #riscv
epony has quit [Ping timeout: 268 seconds]
aburgess has quit [Ping timeout: 272 seconds]
john1 has quit [Quit: Leaving]
epony has joined #riscv
jacklsw has quit [Read error: Connection reset by peer]