klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
tacco has quit []
<geist> kazinsal: yah question is whether or not that causes the bios to re-fire
<geist> my guess is you can't rely on it not actually rebooting the machine
<kazinsal> yeah I bet on hardware with server-style firmware that has IPMI and watchdog components it'll probably kick off the whole startup phase again
<kazinsal> and likely some SMM stuff to catch what it thinks is a crash and do dumps to the LOM's flash
<kazinsal> some hypervisors might catch it as well
<kazinsal> virtualbox responds to triple faulting by pausing the machine and putting it into the GURU_MEDITATION state so you can hook up a debugger
isaacwoo1 has quit [Quit: WeeChat 3.2]
Izem has quit [Quit: Lost terminal]
sortie has quit [Remote host closed the connection]
<heat> how do you go back to 16-bit from long mode? is it by disabling paging (so you go back to protected mode?) and then you get to real mode by disabling protected mode?
<heat> this sounds correct but you never know
<heat> hmm actually you may need a 32-bit code segment
<kazinsal> I think to do it "properly" it's 64-bit long mode -> 32-bit compatibility mode with LME -> 32-bit protected mode -> 16-bit unreal mode -> 16-bit real mode
<heat> yeah exactly
<kazinsal> you can sorta skip the 32-bit steps when doing 16 -> 64 but I don't think you can on the way back
<heat> that looks correct
<heat> uhhhhhhhh what happens if you disable paging in long mode?
<heat> inside a 64-bit CS
<kazinsal> some kind of fault
<kazinsal> probably GPF, followed by DF, followed by RESET#
mahmutov has quit [Ping timeout: 268 seconds]
devcpu has joined #osdev
devcpu has quit [Client Quit]
devcpu has joined #osdev
andydude has quit [Quit: andydude]
dutch has quit [Quit: WeeChat 3.2]
dutch has joined #osdev
LambdaComplex has joined #osdev
Vercas has quit [Remote host closed the connection]
Vercas has joined #osdev
andydude has joined #osdev
dh` has joined #osdev
sts-q has joined #osdev
nyah has quit [Ping timeout: 256 seconds]
flx- has joined #osdev
AssKoala has quit [Ping timeout: 256 seconds]
flx has quit [Ping timeout: 268 seconds]
flx- is now known as flx
srjek has quit [Ping timeout: 272 seconds]
heat has quit [Ping timeout: 245 seconds]
zaquest has quit [Quit: Leaving]
zaquest has joined #osdev
pretty_dumm_guy has quit [Quit: WeeChat 3.2]
shlomif has joined #osdev
<klange> sham1: I use premultiplied RGBA colors and previously had a mix of both the integer stuff (the SSE blitter) and floating-point blending, but I just switched everything over to the integer approach yesterday as I got it to be much faster.
<klange> zid: I just... pretend the colorspace problem doesn't exist. There's some wackiness about how premultiplication in sRGB space is supposed to work, but given that I'm not writing professional graphics stuff [and Freetype got away with non-gamma-correct coverage rendering for years] I don't think it matters too much how transparent windows get blitted together - the whole thing's a gimmick anyway.
<klange> Plus if you actually want to get into that stuff, sRGB is a fantasy, not a reality, and you really need more bits for things to matter...
<klange> It matters much more if you're doing, say, subpixel antialiased text, which I don't do... grayscale, even if you color the text, looks fine with a linear alpha ramp and _whatever the hell colorspace I'm in_...
<zid> yea I'd 100% do it that way too
<zid> but it's nice to at least know what corners you're cutting when you do it eh :D
kulernil has quit [Remote host closed the connection]
mahmutov has joined #osdev
<klange> my head hurts, but these benchmarks say I've gained a bit of speed with the shenanigans I'm pulling, and everything still _looks_ right...
<klange> The bottleneck with my transformed renderer is absolutely bounds checking so it produces clean edges...
sm2n has quit [Ping timeout: 248 seconds]
scoobydoo has quit [Changing host]
scoobydoo has joined #osdev
GeDaMo has joined #osdev
nyah has joined #osdev
dennis95 has joined #osdev
dutch has quit [Quit: WeeChat 3.2]
dutch has joined #osdev
mahmutov has quit [Ping timeout: 245 seconds]
sm2n has joined #osdev
nur has quit [Killed (NickServ (GHOST command used by hussein))]
nur has joined #osdev
tacco has joined #osdev
tenshi has joined #osdev
AssKoala has joined #osdev
kulernil has joined #osdev
sortie has joined #osdev
scaleww has joined #osdev
sortie has quit [Ping timeout: 268 seconds]
devcpu has quit [Quit: leaving]
devcpu has joined #osdev
ElectronApps has joined #osdev
isaacwoods has joined #osdev
devcpu has quit [Quit: leaving]
AssKoala has quit [Ping timeout: 248 seconds]
pretty_dumm_guy has joined #osdev
sortie has joined #osdev
AssKoala has joined #osdev
heat has joined #osdev
mahmutov has joined #osdev
kwilczynski has joined #osdev
AssKoala has quit [Ping timeout: 268 seconds]
nismbu has quit [Ping timeout: 268 seconds]
ElectronApps has quit [Remote host closed the connection]
nismbu has joined #osdev
dennis95 has quit [Remote host closed the connection]
dennis95 has joined #osdev
V has quit [Ping timeout: 258 seconds]
dutch has quit [Quit: WeeChat 3.2]
dutch has joined #osdev
Santurysim has joined #osdev
AssKoala has joined #osdev
kulernil has quit [Remote host closed the connection]
kulernil has joined #osdev
AssKoala has quit [Ping timeout: 256 seconds]
wootehfoot has joined #osdev
kulernil has quit [Ping timeout: 244 seconds]
<Santurysim> Greetings, OS wizards! Do I understand correctly that my graphic output options in x86 protected mode on a BIOS-based machine are: VGA text mode via 0xB8000, VESA VBE, and the video card in its native mode?
devcpu has joined #osdev
<heat> you can also program vga registers manually for some (really shitty) graphics modes
<heat> but yeah that's basically it afaik
<heat> you can also grab an x86 emulator and emulate the video bios
<heat> which works but is super hacky and slow
shlomif has quit [Ping timeout: 248 seconds]
thaumavorio has quit [Ping timeout: 248 seconds]
kuler has joined #osdev
thaumavorio has joined #osdev
<Santurysim> heat: thank you very much!
<heat> :)
<gog> :)
shantaram is now known as shan
<Santurysim> Another question: how does 0xB8000 (and memory-mapped I/O in general) work? Is it the memory controller that detects those special addresses and redirects them to the right place?
<zid> address decoders
scaleww has quit [Quit: Leaving]
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
<heat> it's the northbridge I think
<gog> memory controller
<gog> it's shipped with the cpu these days
<heat> yeah
<heat> you could probably get a more detailed answer by looking at chipset docs
srjek has joined #osdev
<gog> also idk how intel and amd's architectures differ in this regard
<gog> amd has the Fusion Controller Hub on-board now
<gog> which i assume is comparable to whatever intel calls the ICH now
<heat> IIRC my laptop's chipset exposes a memory controller PCI device of some sorts
<heat> it gets hidden during boot
<gog> platform controller hub
<heat> the firmware writes a specific value to a random PCI register, the PCI device stops responding to reads and writes
<heat> pretty funny
<gog> is it some vendor fuckery, trying to keep end-users out of the innards of the controller to do Unauthorized Things
<heat> *shrug*
<heat> i can see why you wouldn't want to expose internals like that on the PCI bus
<gog> yes
<heat> stupid OSes are stupid and power management is hard
<gog> back in my day we had very simple power management
<gog> off is off, on is on
dennis95 has quit [Quit: Leaving]
<Santurysim> gog: is it some vendor fuckery, trying to keep end-users out of the innards of the controller to do Unauthorized Things <--- Yeah, computers are unfriendly ot OSdevers these days
rubion has joined #osdev
<Santurysim> s/ot/towards/
<gog> manufacturers are unfriendly*
immibis_ has joined #osdev
<gog> computers do not have emotions
<gog> like me
<gog> i am computer i do not feel
<heat> no they're not lol
V has joined #osdev
V has quit [Remote host closed the connection]
V has joined #osdev
<heat> "This register holds 32 writable bits with no functionality behind them. It is for the convenience of BIOS and graphics drivers."
<heat> lol
<heat> "screw it, lets put a register right there"
<GeDaMo> It's a decoy register so you don't notice the other registers creeping up behind you :|
* gog shoots at the registers
<heat> ahah! intel calls their decoder thing DMI
<heat> anything that's not RAM goes to DMI which is their link between the northbridge and southbridge
<heat> there are some registers in my host bridge that control how the BIOS ranges decode (DRAM vs DMI)
<heat> reads and writes are controlled independently
<heat> this is probably how the firmware shadows itself (enable writes-to-dram-only, copy it up (reads come from the DMI, writes go to DRAM), enable read-write-to-dram)
<Santurysim> Maybe coreboot developers could make use of that register
<heat> scratch registers sound pretty useless to me
<heat> considering they already have cache-as-ram enabled
theruran has quit [Ping timeout: 272 seconds]
seds has quit [Ping timeout: 256 seconds]
dmj` has quit [Ping timeout: 272 seconds]
kwilczynski has quit [Ping timeout: 245 seconds]
<Santurysim> Well...
jakesyl has quit [Ping timeout: 258 seconds]
YuutaW has quit [Ping timeout: 258 seconds]
SanchayanMaity has quit [Ping timeout: 258 seconds]
paulbarker has quit [Ping timeout: 258 seconds]
mhall has quit [Ping timeout: 256 seconds]
Benjojo has quit [Ping timeout: 256 seconds]
seds has joined #osdev
kwilczynski has joined #osdev
Benjojo has joined #osdev
geist has joined #osdev
__sen has joined #osdev
mhall has joined #osdev
theruran has joined #osdev
SanchayanMaity has joined #osdev
dmj` has joined #osdev
jakesyl has joined #osdev
AssKoala has joined #osdev
paulbarker has joined #osdev
tenshi has quit [Quit: WeeChat 3.2]
YuutaW has joined #osdev
<geist> also i always forget if the PCI setup for the gfx device on modern machines includes a BAR that maps stuff to 0xb8000, etc
<geist> i guess easy enough to check...
<jjuran> GeDaMo: "Clever girl…"
<geist> in this case..... no, but then it was booted UEFI so maybe it never actually set up 0xb8000?
<heat> geist, those ranges have special decodes from what I can see
jjuran has quit [Quit: Killing Colloquy first, before it kills me…]
jjuran has joined #osdev
<heat> then there's a register in the host bridge that tells it to send all that stuff to device 00:02.0 (integrated graphics)
<geist> yeah i wonder if that is implemented as a straight map to another BAR (mapped at say 0xe500.0000) or via some other mechanism
<geist> like some sort of redirect register in the chipset somewhere
<geist> i guess it depends on what the messages look like from the PCI(e) device point of view. if it gets an incoming transfer with the full physical address i guess there's no redirection needed
<geist> the vid card just sees 'transfer at address 0xb8000, size X, bytes Y'
<geist> never thought about it that way, but that's probably what happens. if you have a device with N bars it probably means the device has to do its own decoding to figure out which bar it came in on
<geist> guess digging through the spec would answer that pretty easily
<geist> also re integrated vs discrete: i assume it's basically the same mechanism, its just in the second case it's redirected to one of the PCI devices and not the integrated one
<heat> "The legacy 128 KB VGA memory range 000A_0000h – 000B_FFFFh can be mapped to Processor Graphics (Device 2), PCI Express (Device 1 Functions), and/or to the DMI interface depending on the programming of the VGA steering bits"
wereii has quit [Ping timeout: 252 seconds]
GeDaMo has quit [Quit: Leaving.]
<heat> actually now i'm really confused wrt DMI vs PCIe
jjuran has quit [Ping timeout: 268 seconds]
jjuran has joined #osdev
wereii has joined #osdev
<geist> ah guess 'PCI Express' in this case means 'route it out over pci to something'
<geist> wonder how it does additional specification of target there
<heat> I don't have device 1 in my chipset, probably because it's a laptop
<geist> yeah dunno precisely what device 1 is in this case.
<geist> are they referring to 00:1.0? or something else?
<heat> probably
<heat> since processor graphics is 00:02.0
<geist> possibly device one is always a PCI bridge? in this random machine it is, but it's a ryzen, so doesn't mean anything
<geist> if so maybe that's the generic way (at this register's level) to say 'route it out over PCIe and let that logic sort it out'
<geist> and then there's some config somewhere else to route VGA bits to the right device
<heat> ahh I see
<heat> device 1's functions are individual PCIe controllers
<geist> or, maybe dvice 00:1.0 on your machine has some control registers
<heat> 0 is x16, 1 is x8, 2 is x4
<geist> so it's a bridge for the first slot or whatnot?
<Santurysim> I have an old motherboard with intel graphics built into it. And device 1 is not integrated graphics, according to lspci
<heat> yup, bridges
<geist> right, that's what we're talking about. device 0:2 is integrated, but there's an option for selecting device 0:1
<heat> I don't have them because it's a laptop
<geist> so we're theorizing that it's basically 'route out to a PCI bridge somewhere'
<geist> and yeah if it's a laptop chipset that may be The Video PCI Bridge
<geist> since it doesn't have to be as generic and have a whole mess of root bridges because lots of slots
<geist> i guess a question is what does this particular register you're talking about look like on a non laptop SoC
<geist> where things are a bit more generic, maybe it just has more options, one for every root bridge or whatnot
<heat> I would imagine integrated graphics are a special case
<geist> oh totally
<heat> not part of a generic route to pci bridge thing
<geist> that's the meaning in my mind of the 'device 2' option, since intel can just always arrange for the integrated to be there
<geist> it's the 'how do you deal with non integrated'
<heat> laptops don't need to :)
<geist> in your case looks like device 1 is easy since that's probably a single 'slot' that's on your chipset because laptop
<heat> I don't think they even can
<geist> sure they do. you can put nvidia gpus on your laptop
<heat> yeah but they don't drive displays
<geist> people do it all the time, but there's probably not a lot of options for connections
<heat> at least not at boot time (so there's no need for VGA)
<geist> they *could*, but either way the model fits fine. you have integrated gpu, on a port behind bridge at 0:1
<geist> er discrete gpu
<graphitemaster> Who doesn't Windows or Linux have some sort of VM copy system call? Basically copy memory by rewriting PTEs instead of moving data.
<geist> you mean Why?
<graphitemaster> Yeah, Why, typo, sorry
<graphitemaster> macOS has vm_copy, I want that :P
<heat> linux has mremap
<geist> ah what are the semantics of that? copy as if the cpu had done a memcpy?
<geist> i think the details get hairy really fast.
<graphitemaster> geist, Yeah, for page-aligned addresses of course.
<graphitemaster> It just unmaps the old, wires in new PTEs, done and done.
<geist> so my guess there is that's a mach call, i've heard of someone doing something like that
<heat> graphitemaster, mremap
<geist> well, 'unmapping the old' is not a copy, per se. that's move
<graphitemaster> Makes memcpy on macOS like 15x faster than all other OSes
<geist> move and copy are different things
<heat> if you unmap it of course
<graphitemaster> Well it's important it does cow
<geist> but... here's my thought: it's a call saying 'do whatever operations you can do to make it appear as if you copied data from A to B'
<geist> and that gives the kernel full capability to decide how to do it. so the model is a page aligned memcpy
<geist> the implementation can then decide based on source and destination a bunch of options
<geist> and the best option is to use the PTEs to move things around and avoid a copy
<graphitemaster> Which is what it actually does when I look at the code
<geist> we have something kinda lke that on zircon for moving pages between vmos
<graphitemaster> It sets up copy-on-write mappings
<graphitemaster> To alias the memory avoiding a copy
<graphitemaster> Then when you touch those pages it actually well, page faults and does the copy
<geist> makes sense. so it does mean there are downsides to it. it's based on the premise that the source is probably not going to immediately touch it, because you can easily come up with a situation where if the source starts faulting in new copies somewhere it may end up slower than a memcpy
<geist> which cpus are exceptionally good at
<geist> or the target. depends on which side it COWs from
<graphitemaster> Well, aarch64 is exceptionally good at copies.
<geist> we have something in zircon that actually does a direct move of pages between vmo A and B. basically a splice() operation
<graphitemaster> x86 TSO makes copies kind of sad :|
<graphitemaster> M1 gets like ~70 GiB/s memcpy performance without any optimizations (a byte code in 2 lines of C gets that)
<graphitemaster> x86 can barely push past 10 GiB/s memcpy performance
<graphitemaster> Even with stupid tricks like microcode rep movsb
<heat> geist: linux has copy_file_range(2) now
<graphitemaster> s/byte code/byte-wise copy/
<geist> but yes, this is a whole valid avenue of design patterns one can seek to do in kernel design: give the kernel some ability to do smart things for you by telling it to do some mundane stuff for you (like copying data) and hoping the kernel has some optimized path that is unavailable to user space
<geist> i've heard mach has more than one thing exactly like this
<graphitemaster> So how would mremap be used to do this on Linux
<geist> graphitemaster: your copy numbers are absolutely not what i've observed
<geist> as in x86 is a monster moving data in general. though M1 is also, but a lot of that is because it has a very very custom memory subsystem
Arthuria has joined #osdev
<heat> graphitemaster: I was wrong, you can do it with MAP_SHARED (create a new mapping of the same pages) but not MAP_PRIVATE
<geist> that's the key though: do you tell the kernel to do precisely what you want (create SHARED, zircon vmo copy, etc) or do you tell the kernel something generic and hope it can find a behind-the-scenes way to do it fast
<geist> two completely valid design patterns. useful in different contexts
<geist> in the past i was generally down on the latter, seems complicated, etc, but i'm warming up to it
<geist> and this vm_copy(ptr, ptr, len) is a great example of it
<graphitemaster> geist, I've spent the past week trying to push a few x86s past 10 GiB/s memcpy performance (trying everything at this point): thousands of lines of custom SSE/AVX aligned/unaligned, prefetch/non-prefetch, temporal/non-temporal routines. Meanwhile I wrote `while (src != end) *dst++ = *src++` and that thing is hitting ~60 GiB/s on an M1 without a sweat; unrolling it a bit, 69.25 GiB/s is possible (that's
<graphitemaster> the theoretical maximum too, seems to be approx that when averaged out). I rounded to ~70 GiB/s since it's a nicer number.
<geist> part of me says it's not the kernel's business, that's something that can be done in user space, so don't add complexity to the kernel for what can be done via a user memcpy
<heat> I would implement vm_copy as a library call with a fast kernel backend for big enough copies
<geist> graphitemaster: huh? i've had shitty one memory channel laptop skylakes do 2.5x that without any trouble
<heat> or at least a single syscall with a DO NOT USE THIS FOR TINY COPIES
<geist> right
<geist> graphitemaster: are thse particularly low end x86s or something?
<graphitemaster> geist, that's strange, never tested any intels though yet, all I own and have access to are Zens with DDR4 (Zen1, 2, and 3)
<graphitemaster> rep movsb is absolutely garbage on Zen1 btw
<geist> that is really really strange. i think that's not right
<heat> I have a shitty kabylake R laptop
<heat> what do I need to test?
<bslsk05> ​ssvb/tinymembench - Simple benchmark for memory throughput and latency (73 forks/239 stargazers/NOASSERTION)
<graphitemaster> Takes about 10 minutes
<graphitemaster> Pastebin your results
<graphitemaster> Highest I got was 11 GiB/s
<geist> yes, zen 1 explicitly does not have the ERMSB bit
<graphitemaster> That's with SSE2 non-temporal prefetch 64b at a time, being the fastest for Zen.
<geist> zen 2 or 3 added it
<graphitemaster> Even Zen 2 rep movsb is slower than SSE2
<graphitemaster> Zen 3 they seem to be the same
<graphitemaster> In either case, none of the Zens with none of my DDR4 can exceed 11 GiB/s
<geist> huh yeah that does appear slow here too. i've personally seen higher than that
<geist> but maybe their testing is very conservative
<geist> ah, re standard memcpy: it's about 16GB/sec here
<geist> SSE2 non temporal copy basically (this is a zen1+)
<graphitemaster> Like even if you're hitting 16 GiB/s, that's not much of a massive difference compared to the insanity that is 70 GiB/s on a lower powered aarch64 system with lower memory clocks.
<graphitemaster> It's just not fair.
<geist> that's not an artifact of arm64, that's an artifact of apple's design
<geist> it's my understanding the M1 has stacked ram, so it's very very close (dunno what frequency it's at) and it has something like 8 channels
<graphitemaster> Well, it's an artifact of x86's TSO based on the paper I read: someone built a soft core of a really old x86 in an FPGA, with and without total store order, because they were genuinely curious about its effect on memory subsystem performance, and concluded that x86's choice of TSO leaves about 10x performance on the table. It really feels like that's the reason
<graphitemaster> here too, because even the non-M1 aarch64 I have is hitting those speeds with non-stacked memory.
<geist> can you point me to a piece of arm hardware that can?
<geist> i've certainly got nothing available to me that gets close, including a fairly beefy ThunderX2 server
<geist> it gets about 12GB/sec
<graphitemaster> Yeah, get an eMAG 8180
<geist> also isn't it by definition that the M1 can't do 70GB/sec but more like 35GB/sec memcpy, because read/write?
<geist> alas i can't easily build it here, the assembler is unhappy with the .S files in the project on mac
<geist> for M1
<graphitemaster> That thing has DDR4-2667 and my system has DDR4-4733 and the aarch64 reaches 70 GiB/s and I can barely push past 10 GiB/s in that bench.
<graphitemaster> My memory has twice the clock rate and performs 1/7th the speed.
<graphitemaster> Just doesn't make sense
<graphitemaster> Anyways, going to get some food, brb later.
rebedin has joined #osdev
<geist> sure
<graphitemaster> I want vm_copy regardless XD
<geist> while i wont argue your numbers (though i do argue 70GB/sec for mem*copy*)
<geist> i do think you're making too many architecture pronouncements based on specific tests
<graphitemaster> Maybe, but even without synthetic benches I'm seeing std::vector<int> resize 15x faster on the M1 compared to my x86 when timed and profiling shows memcpy
<geist> yes but the M1 is a fucking monster
<geist> this is what i've been observing myself
<geist> it's really in a class by itself
<graphitemaster> It moves memory faster than any machine then, apparently.
<geist> for a whole pile of reasons. some of which *may* be arm architectural, but it's basically a sample size of 1
<geist> as in 'apple built this amazing end to end design that happens to use arm ISA'
<graphitemaster> Then lets kill x86
<rebedin> graphitemaster: now you might be starting to understand what my research is about about clockless states, cause it would be easy for you if you ever did, i've taken time to add security for that pipeline , but do you have any real question you are after technically? Perhaps a mentally ill wanker like me can answer those :D
<graphitemaster> M1 for everything in the future please.
<geist> so while some of it may be because of the weak memory model, etc, my general feeling is that when dealing with raw memory bandwidth you almost always end up memory bound anyway
<heat> https://gist.github.com/heatd/d0dc0e72bde4cc7d920182a6bd6a2d38 Intel(R) Core(TM) i5-8250U DDR4-2400
<bslsk05> ​gist.github.com: gist:d0dc0e72bde4cc7d920182a6bd6a2d38 · GitHub
<geist> and you're really testing the ability of the machine to shove data through its cache and whatnot
<heat> aka very slow
<graphitemaster> 16 GiB/s
<graphitemaster> rep movsb is shit there
<graphitemaster> Just like mine
<graphitemaster> Is that a Zen1 ?
<geist> be careful. 16GB/sec is a fill, not a copy
<geist> it's worse, it's about 8GB/sec copy
<graphitemaster> Oh right, nice catch
<graphitemaster> That's even worse than mine lol
<graphitemaster> Looks like SSE2 nontemporal wins then for your rig
<geist> my dual channel nuc i5-6xxx gets a bit better , around 12GB/sec copy
<rebedin> graphitemaster: let's view this thing based of how hardware handles things on it's schematic, there is a concept called cross domain checking, than wire logic, and wire port coersion, which is relevant in verilog how this state is handled, zen of course would handle that way batter than risc based arms of any sort.
<rebedin> likely it just is not yet implemented there for whatever reasons
<geist> or apple pulled it off with M1
<heat> do realise my TDP up is 25W lol
<graphitemaster> That's the thing I also tested a shitty rockchip MIPS
<heat> it's trying ;_;
<graphitemaster> The MIPS cpu crushes x86 in memcpy performance too
<graphitemaster> At least my Zen1
<graphitemaster> It was able to sustain 20 GiB/s
<geist> for what, copy or fill?
<graphitemaster> copy
<geist> you have to be extremely explicit here and make sure you don't use the wrong one
<graphitemaster> I'm starting to think desktop x86 is regressing in memory performance.
<geist> side note, the M1 may also be using large pages to somewhat greater effect
<graphitemaster> I want to test some old systems now
<rebedin> graphitemaster: well let's say the ram is not any different than flip flops in it's performance when you use write-combining bus traffic tbh, it is a line of sense amplifiers and capacitors, and an asic attached to it as control circuitry
<moon-child> just tested (zen 2), I get 15gb/s
<geist> the M1 uses a) 16K pages (or larger) most of the time, and b) has a relatively huge TLB (3k entries i believe)
<graphitemaster> geist, Well the M1 mac also runs an OS with a libc which explicitly uses vm_copy when the size is larger than 16 KiB
<graphitemaster> In its memcpy implementation
<geist> sure, but this test case should bypass that
<graphitemaster> Yeah
<geist> except i can't build it without hacking on the asm file
<graphitemaster> standard memcpy on the M1 beats everything in this test btw
<graphitemaster> Anyways, for real I need to get food before the store closes
<graphitemaster> XD
<geist> kk
<geist> hmm, this nuc i3-6100U gets
<geist> standard memcpy : 11374.7 MB/s
<geist> standard memset : 21617.8 MB/s
<moon-child> hmm, if I unroll the loop twice as hard, I get 16.5gb/s
<rebedin> it seems like you ain't listening me, so might as well dig ahead dorks, it ain't gonna seem to happen with your brains involved i see.
<moon-child> I'm tempted to unroll it twice as hard again but I doubt it'll improve matters much :P
<rebedin> it almost looks like you do not even know how to calculate, and 1cououmb of charge and wire logic might be too much to you
<geist> rebedin: one more and you're out
<rebedin> geist: prk yourself into trashbin fecalist
rebedin was kicked from #osdev by geist [rebedin]
<heat> strike out!
<geist> a shame, if he just sticks to not insulting people he can listen in, participate, etc
<heat> we should all abandon ARM and x86
<heat> let's use efi byte code directly
* sortie commends geist on a job well done :)
<sortie> I have coffee
<sortie> Reviewing contributions :)
<geist> noice.
<sortie> I started fearing what would happen if too many people start contributing, in which case I would need contributors to review each other's work to be senior devs, yada yada yada, performance reviews ...
<geist> nooo!
<geist> OKR time!
<sortie> How dare you suggest OKRs
<geist> i only mean it ironically!
<sortie> That's the gateway, ironic OKRs
<sortie> Corporate jokes aside, community building is a real and valuable thing :)
<sortie> There's plenty of things one can do to build communities, like giving regular contributors a bit of extra project access like op access, so they can help deal with trouble people
<sham1> It very much is, although it tends to feel very forced
<sortie> I think it's fair to expect regular contributors, whose code I review, to also do some code reviewing in return
<geist> yeah
<geist> it's tough when they phone something in. lots of times i've responded to reviews with suggestions and then you get radio silence
<geist> so the question is do you give up on those or stick with it because 1/10 contributors will come back?
<geist> (probably the right answer)
<sortie> Yeah I did abandon a couple contribs that were unresponsive
<sortie> There's major differences between a first time contributor and recurring contributors though. The first timers you want to lower the entry bar as much as reasonable, and be okay with many of them not managing it anyway
<geist> yah
<sortie> The people that do stick around should be given opportunity to grow in the community
<Santurysim> Technical question: what extension is used on wiki.osdev.org to highlight source code?
<sortie> But it's a cool thing that I'm actually in a state right now where I actually let people contribute, compared to the pretty controlling wanting-to-do-everything-myself or too-busy-to-deal-with-it I've been in for most of the project
<sham1> I would think the syntax highlighter is just some mediawiki thing
<sham1> Santurysim: https://www.mediawiki.org/wiki/Extension:SyntaxHighlight might be it, actually
<bslsk05> ​www.mediawiki.org: Extension:SyntaxHighlight - MediaWiki
<geist> yah i'm starting to get more motivated to do some deeper work with LK, as i'm discovering more places where it's used
<geist> i can probably make a good case to essentially work on it full time at work, since it's used in <redacted> stuff just around work
<geist> and they're basically following mainline
<j`ey> I've been playing with the GPIO controller on the m1
<j`ey> nice, full time LK!
<j`ey> just dont let it become BK :P
<geist> exactly, thats the challenge
<geist> but i can definitely do a lot of administrative and testing/etc stuff
<geist> which is under par compared to lots of modern stuff
<geist> fine for 10-20 years ago, but nowadays folks expect more unit tests, etc
<j`ey> I'm trying to understand the pinmuxing code in linux, but I had about 5-6h today with no luck getting it working
<heat> yeah
<geist> my experience with pin muxing and gpio is the moment you try to come up with a single model you have failed
<heat> I'm trying to unit test a whole filesystem driver
<heat> not fun D:
<j`ey> geist: lol
<geist> especially given there's no one model for pins vs gpios
<geist> on some socs they're the same thing, on others a gpio is a fundamentally different concept from a pin
<j`ey> there's a working patchset for the m1, but im trying to clean it up / modify the device tree it uses to match upstream
<geist> word
<geist> i wonder if anyone has tried to figure out what the page compression instructions do
<geist> ie, what format it does, how fast it is, etc
<graphitemaster> The M1 is officially the most impressive technological jump I've seen in CPUs in probably a decade. I'm still amazed by it.
<Santurysim> sham1: thank you. I've installed it on my mediawiki instance, and it looks slightly different from one on osdev. Also, it handles at&t assembly properly
<graphitemaster> Hello, I'm back :3
<j`ey> hm, all I know about is https://github.com/AsahiLinux/docs/wiki/HW%3AApple-Instructions geist, but its sparse on info
<bslsk05> ​github.com: HW:Apple Instructions · AsahiLinux/docs Wiki · GitHub
<geist> but all aside i wouldn't fixate too much on the memory bandwidth of M1. it's impressive, but that's also exactly what you get when you tightly couple the design
<geist> what'll be interesting is to see if they have to regress that a bit if/when they make a socketed memory version
<geist> ie, not just fixed ram directly attached to the cpu
<graphitemaster> geist, Okay but other CPUs not the M1 seem to be doing better than my over priced x86 at memory which is a problem in my eyes.
<geist> fine.
<j`ey> I'm just happy to have a project to work on, been in a bit of a rut
<geist> actually there's some fun stuff in the middle there as to bypassing their cache hierarchy and whatnot. and in general the apparently super low latency of the L2
<geist> maybe part of their leap forward is also the exceptionally wide L1 and L2 with no L3. which is a different path from the one x86 vendors have been going down
<geist> dunno if thats scalable. can you realistically clock at 4-5Ghz with that? M1 runs at like 2.2 iirc
<graphitemaster> It's actually a serious problem I think because the software paradigm has changed to be cache-aware, so a lot of old approaches to avoiding copying memory (like linked lists) are no longer a good idea, and data transformations are being programmed assuming really flat, decoupled, and tightly packed arrays which often requires copying first to get into that format...
<geist> so maybe that's a very conscious design decision: build an extremely wide cpu that is not intended to clock high, and optimize the shit out of that
<graphitemaster> And copy performance of x86 is getting slower I think.
<geist> if anything that may be one of the biggest major forks in the M1 design from modern x86
<geist> graphitemaster: oh yeah totally. this whole 'get everything into a linear run because chasing pointers is slow' is a real thing nowadays
<geist> more and more of modern stuff has to contend with that (if it wants to be fast)
<graphitemaster> Yeah and getting things into linear runs requires a lot of copying and moving of data :P
<Santurysim> Still, pentium n5000 @ 2.7GHz is too hot for my transformer pad
<graphitemaster> So you want fast copies
<geist> so it's possible that thats a large reason M1 is extremely fast: low latency caches, etc
<geist> and less levels of cache (thus contributing to lower latency)
<graphitemaster> It helps that the memory on the M1 has quicker access times than desktop too. Like actual RAM latency is impressive for DDR4
<graphitemaster> No doubt solder close to CPU helps there
<graphitemaster> May as well just put the ram right inside the CPU core at this point XD
<geist> right so what would be interesting is to see if they can pull that off if you made a generic design that has to support DIMMs and whatnot
<geist> whats fascinating to me is they took what was usually a negative (soldered ram, etc) in smartphone/embedded space and turned it into a positive by going all in
<geist> generally random arm socs are severely held back by single channel, 32bit wide ddr
<graphitemaster> Even if TSO is not contributing to issues here. After reading that entire paper on TSO I'm totally convinced strong memory models are fundamentally incompatible with high performance and more importantly, scaling to a ton of cores.
<geist> word.
<graphitemaster> dword
<j`ey> graphitemaster: can test with linux on the m1 eventually
<j`ey> by enabling tso
<Santurysim> Does making program cache-aware help on m1?
Arthuria has quit [Ping timeout: 248 seconds]
sts-q has quit [Ping timeout: 248 seconds]
<heat> obviously
<graphitemaster> cache aware helps on everything lol
<heat> *except cacheless cpus :P
<Santurysim> Oh. That was a silly question
heat has quit [Remote host closed the connection]
kulernil has joined #osdev
kuler has quit [Ping timeout: 244 seconds]
<graphitemaster> cacheless cpus are for cachless people
<graphitemaster> s/cachless/cashless
<zid> cacheless is a plot by big RISC to make their cpus competitive
anon16 has quit [Ping timeout: 258 seconds]
anon16 has joined #osdev
srjek has quit [Ping timeout: 258 seconds]
kulernil has quit [Remote host closed the connection]
kulernil has joined #osdev
sortie has quit [Quit: Leaving]
rubion has quit [Ping timeout: 268 seconds]
kulernil has quit [Ping timeout: 244 seconds]
anon16 has quit [Quit: WeeChat 3.1]
anon16 has joined #osdev
dude12312414 has joined #osdev
Burgundy has quit [Ping timeout: 258 seconds]