klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
tacco has quit []
<geist> kazinsal: yah question is whether or not that causes the bios to re-fire
<geist> my guess is you can't rely on it not actually rebooting the machine
<kazinsal> yeah I bet on hardware with server-style firmware that has IPMI and watchdog components it'll probably kick off the whole startup phase again
<kazinsal> and likely some SMM stuff to catch what it thinks is a crash and do dumps to the LOM's flash
<kazinsal> some hypervisors might catch it as well
<kazinsal> virtualbox responds to triple faulting by pausing the machine and putting it into the GURU_MEDITATION state so you can hook up a debugger
isaacwoo1 has quit [Quit: WeeChat 3.2]
Izem has quit [Quit: Lost terminal]
sortie has quit [Remote host closed the connection]
<heat> how do you go back to 16-bit from long mode? is it by disabling paging (so you go back to protected mode?) and then you get to real mode by disabling protected mode?
<heat> this sounds correct but you never know
<heat> hmm actually you may need a 32-bit code segment
<kazinsal> I think to do it "properly" it's 64-bit long mode -> 32-bit compatibility mode with LME -> 32-bit protected mode -> 16-bit unreal mode -> 16-bit real mode
<heat> yeah exactly
<kazinsal> you can sorta skip the 32-bit steps when doing 16 -> 64 but I don't think you can on the way back
<heat> that looks correct
<heat> uhhhhhhhh what happens if you disable paging in long mode?
<heat> inside a 64-bit CS
<kazinsal> some kind of fault
<kazinsal> probably GPF, followed by DF, followed by RESET#
mahmutov has quit [Ping timeout: 268 seconds]
devcpu has joined #osdev
devcpu has quit [Client Quit]
devcpu has joined #osdev
andydude has quit [Quit: andydude]
dutch has quit [Quit: WeeChat 3.2]
dutch has joined #osdev
LambdaComplex has joined #osdev
Vercas has quit [Remote host closed the connection]
Vercas has joined #osdev
andydude has joined #osdev
dh` has joined #osdev
sts-q has joined #osdev
nyah has quit [Ping timeout: 256 seconds]
flx- has joined #osdev
AssKoala has quit [Ping timeout: 256 seconds]
flx has quit [Ping timeout: 268 seconds]
flx- is now known as flx
srjek has quit [Ping timeout: 272 seconds]
heat has quit [Ping timeout: 245 seconds]
zaquest has quit [Quit: Leaving]
zaquest has joined #osdev
pretty_dumm_guy has quit [Quit: WeeChat 3.2]
shlomif has joined #osdev
<klange> sham1: I use premultiplied RGBA colors and previously had a mix of both the integer stuff (the SSE blitter) and floating-point blending, but I just switched everything over to the integer approach yesterday as I got it to be much faster.
<klange> zid: I just... pretend the colorspace problem doesn't exist. There's some wackiness about how premultiplication in sRGB space is supposed to work, but given that I'm not writing professional graphics stuff [and Freetype got away with non-gamma-correct coverage rendering for years] I don't think it matters too much how transparent windows get blitted together - the whole thing's a gimmick anyway.
<klange> Plus if you actually want to get into that stuff, sRGB is a fantasy, not a reality, and you really need more bits for things to matter...
<klange> It matters much more if you're doing, say, subpixel antialiased text, which I don't do... grayscale, even if you color the text, looks fine with a linear alpha ramp and _whatever the hell colorspace I'm in_...
<zid> yea I'd 100% do it that way too
<zid> but it's nice to at least know what corners you're cutting when you do it eh :D
kulernil has quit [Remote host closed the connection]
mahmutov has joined #osdev
<klange> my head hurts, but these benchmarks say I've gained a bit of speed with the shenanigans I'm pulling, and everything still _looks_ right...
<klange> The bottleneck with my transformed renderer is absolutely bounds checking so it produces clean edges...
sm2n has quit [Ping timeout: 248 seconds]
scoobydoo has quit [Changing host]
scoobydoo has joined #osdev
GeDaMo has joined #osdev
nyah has joined #osdev
dennis95 has joined #osdev
dutch has quit [Quit: WeeChat 3.2]
dutch has joined #osdev
mahmutov has quit [Ping timeout: 245 seconds]
sm2n has joined #osdev
nur has quit [Killed (NickServ (GHOST command used by hussein))]
nur has joined #osdev
tacco has joined #osdev
tenshi has joined #osdev
AssKoala has joined #osdev
kulernil has joined #osdev
sortie has joined #osdev
scaleww has joined #osdev
sortie has quit [Ping timeout: 268 seconds]
devcpu has quit [Quit: leaving]
devcpu has joined #osdev
ElectronApps has joined #osdev
isaacwoods has joined #osdev
devcpu has quit [Quit: leaving]
AssKoala has quit [Ping timeout: 248 seconds]
pretty_dumm_guy has joined #osdev
sortie has joined #osdev
AssKoala has joined #osdev
heat has joined #osdev
mahmutov has joined #osdev
kwilczynski has joined #osdev
AssKoala has quit [Ping timeout: 268 seconds]
nismbu has quit [Ping timeout: 268 seconds]
ElectronApps has quit [Remote host closed the connection]
nismbu has joined #osdev
dennis95 has quit [Remote host closed the connection]
dennis95 has joined #osdev
V has quit [Ping timeout: 258 seconds]
dutch has quit [Quit: WeeChat 3.2]
dutch has joined #osdev
Santurysim has joined #osdev
AssKoala has joined #osdev
kulernil has quit [Remote host closed the connection]
kulernil has joined #osdev
AssKoala has quit [Ping timeout: 256 seconds]
wootehfoot has joined #osdev
kulernil has quit [Ping timeout: 244 seconds]
<Santurysim> Greetings, OS wizards! Do I understand correctly that my graphic output options in x86 protected mode on a BIOS-based machine are: VGA text mode via 0xB8000, VESA VBE, and the video card in its native mode?
devcpu has joined #osdev
<heat> you can also program vga registers manually for some (really shitty) graphics modes
<heat> but yeah that's basically it afaik
<heat> you can also grab an x86 emulator and emulate the video bios
<heat> which works but is super hacky and slow
shlomif has quit [Ping timeout: 248 seconds]
thaumavorio has quit [Ping timeout: 248 seconds]
kuler has joined #osdev
thaumavorio has joined #osdev
<Santurysim> heat: thank you very much!
<heat> :)
<gog> :)
shantaram is now known as shan
<Santurysim> Another question: how does 0xB8000 (and memory-mapped I/O in general) work? Is it the memory controller that detects those special addresses and redirects them to the right place?
<zid> address decoders
scaleww has quit [Quit: Leaving]
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
<heat> it's the northbridge I think
<gog> memory controller
<gog> it's shipped with the cpu these days
<heat> yeah
<heat> you could probably get a more detailed answer by looking at chipset docs
srjek has joined #osdev
<gog> also idk how intel and amd's architectures differ in this regard
<gog> amd has the Fusion Controller Hub on-board now
<gog> which i assume is comparable to whatever intel calls the ICH now
<heat> IIRC my laptop's chipset exposes a memory controller PCI device of some sorts
<heat> it gets hidden during boot
<gog> platform controller hub
<heat> the firmware writes a specific value to a random PCI register, the PCI device stops responding to reads and writes
<heat> pretty funny
<gog> is it some vendor fuckery, trying to keep end-users out of the innards of the controller to do Unauthorized Things
<heat> *shrug*
<heat> i can see why you wouldn't want to expose internals like that on the PCI bus
<gog> yes
<heat> stupid OSes are stupid and power management is hard
<gog> back in my day we had very simple power management
<gog> off is off, on is on
dennis95 has quit [Quit: Leaving]
<Santurysim> gog: is it some vendor fuckery, trying to keep end-users out of the innards of the controller to do Unauthorized Things <--- Yeah, computers are unfriendly ot OSdevers these days
rubion has joined #osdev
<Santurysim> s/ot/towards/
<gog> manufacturers are unfriendly*
immibis_ has joined #osdev
<gog> computers do not have emotions
<gog> like me
<gog> i am computer i do not feel
<heat> no they're not lol
V has joined #osdev
V has quit [Remote host closed the connection]
V has joined #osdev
<heat> "This register holds 32 writable bits with no functionality behind them. It is for the convenience of BIOS and graphics drivers."
<heat> lol
<heat> "screw it, lets put a register right there"
<GeDaMo> It's a decoy register so you don't notice the other registers creeping up behind you :|
* gog shoots at the registers
<heat> ahah! intel calls their decoder thing DMI
<heat> anything that's not RAM goes to DMI which is their link between the northbridge and southbridge
<heat> there are some registers in my host bridge that control how the BIOS ranges decode (DRAM vs DMI)
<heat> reads and writes are controlled independently
<heat> this is probably how the firmware shadows itself (enable writes-to-dram-only, copy it up (reads come from the DMI, writes go to DRAM), enable read-write-to-dram)
<Santurysim> Maybe coreboot developers could make use of that register
<heat> scratch registers sound pretty useless to me
<heat> considering they already have cache-as-ram enabled
theruran has quit [Ping timeout: 272 seconds]
seds has quit [Ping timeout: 256 seconds]
dmj` has quit [Ping timeout: 272 seconds]
kwilczynski has quit [Ping timeout: 245 seconds]
<Santurysim> Well...
jakesyl has quit [Ping timeout: 258 seconds]
YuutaW has quit [Ping timeout: 258 seconds]
SanchayanMaity has quit [Ping timeout: 258 seconds]
paulbarker has quit [Ping timeout: 258 seconds]
mhall has quit [Ping timeout: 256 seconds]
Benjojo has quit [Ping timeout: 256 seconds]
seds has joined #osdev
kwilczynski has joined #osdev
Benjojo has joined #osdev
geist has joined #osdev
__sen has joined #osdev
mhall has joined #osdev
theruran has joined #osdev
SanchayanMaity has joined #osdev
dmj` has joined #osdev
jakesyl has joined #osdev
AssKoala has joined #osdev
paulbarker has joined #osdev
tenshi has quit [Quit: WeeChat 3.2]
YuutaW has joined #osdev
<geist> also i always forget if the PCI setup for the gfx device on modern machines includes a BAR that maps stuff to 0xb8000, etc
<geist> i guess easy enough to check...
<jjuran> GeDaMo: "Clever girl…"
<geist> in this case..... no, but then it was booted UEFI so maybe it never actually set up 0xb8000?
<heat> geist, those ranges have special decodes from what I can see
jjuran has quit [Quit: Killing Colloquy first, before it kills me…]
jjuran has joined #osdev
<heat> then there's a register in the host bridge that tells it to send all that stuff to device 00:02.0 (integrated graphics)
<geist> yeah i wonder if that is implemented as a straight map to another BAR (mapped at say 0xe500.0000) or via some other mechanism
<geist> like some sort of redirect register in the chipset somewhere
<geist> i guess it depends on what the messages look like from the PCI(e) device point of view. if it gets an incoming transfer with the full physical address i guess there's no redirection needed
<geist> the vid card just sees 'transfer at address 0xb8000, size X, bytes Y'
<geist> never thought about it that way, but that's probably what happens. if you have a device with N bars it probably means the device has to do its own decoding to figure out which bar it came in on
<geist> guess digging through the spec would answer that pretty easily
<geist> also re integrated vs discrete: i assume it's basically the same mechanism, its just in the second case it's redirected to one of the PCI devices and not the integrated one
<heat> "The legacy 128 KB VGA memory range 000A_0000h – 000B_FFFFh can be mapped to Processor Graphics (Device 2), PCI Express (Device 1 Functions), and/or to the DMI interface depending on the programming of the VGA steering bits"
wereii has quit [Ping timeout: 252 seconds]
GeDaMo has quit [Quit: Leaving.]
<heat> actually now i'm really confused wrt DMI vs PCIe
jjuran has quit [Ping timeout: 268 seconds]
jjuran has joined #osdev
wereii has joined #osdev
<geist> ah guess 'PCI Express' in this case means 'route it out over pci to something'
<geist> wonder how it does additional specification of target there
<heat> I don't have device 1 in my chipset, probably because it's a laptop
<geist> yeah dunno precisely what device 1 is in this case.
<geist> are they referring to 00:1.0? or something else?
<heat> probably
<heat> since processor graphics is 00:02.0
<geist> possibly device one is always a PCI bridge? in this random machine it is, but it's a ryzen, so doesn't mean anything
<geist> if so maybe that's the generic way (at this register's level) to say 'route it out over PCIe and let that logic sort it out'
<geist> and then there's some config somewhere else to route VGA bits to the right device
<heat> ahh I see
<heat> device 1's functions are individual PCIe controllers
<geist> or, maybe dvice 00:1.0 on your machine has some control registers
<heat> 0 is x16, 1 is x8, 2 is x4
<geist> so it's a bridge for the first slot or whatnot?
<Santurysim> I have an old motherboard with intel graphics built into it. And device 1 is not integrated graphics, according to lspci
<heat> yup, bridges
<geist> right, that's what we're talking about. device 0:2 is integrated, but there's an option for selecting device 0:1
<heat> I don't have them because it's a laptop
<geist> so we're theorizing that it's basically 'route out to a PCI bridge somewhere'
<geist> and yeah if it's a laptop chipset that may be The Video PCI Bridge
<geist> since it doesn't have to be as generic and have a whole mess of root bridges because lots of slots
<geist> i guess a question is what does this particular register you're talking about look like on a non laptop SoC
<geist> where things are a bit more generic, maybe it just has more options, one for every root bridge or whatnot
<heat> I would imagine integrated graphics are a special case
<geist> oh totally
<heat> not part of a generic route to pci bridge thing
<geist> that's the meaning in my mind of the 'device 2' option, since intel can just always arrange for the integrated to be there
<geist> it's the 'how do you deal with non integrated'
<heat> laptops don't need to :)
<geist> in your case looks like device 1 is easy since that's probably a single 'slot' that's on your chipset because laptop
<heat> I don't think they even can
<geist> sure they do. you can put nvidia gpus on your laptop
<heat> yeah but they don't drive displays
<geist> people do it all the time, but there's probably not a lot of options for connections
<heat> at least not at boot time (so there's no need for VGA)
<geist> they *could*, but either way the model fits fine. you have integrated gpu, on a port behind bridge at 0:1
<geist> er discrete gpu
<graphitemaster> Who doesn't Windows or Linux have some sort of VM copy system call? Basically copy memory by rewriting PTEs instead of moving data.
<geist> you mean Why?
<graphitemaster> Yeah, Why, typo, sorry
<graphitemaster> macOS has vm_copy, I want that :P
<heat> linux has mremap
<geist> ah what are the semantics of that? copy as if the cpu had done a memcpy?
<geist> i think the details get hairy really fast.
<graphitemaster> geist, Yeah, for page-aligned addresses of course.
<graphitemaster> It just unmaps the old, wires in new PTEs, done and done.
<geist> so my guess there is that's a mach call, i've heard of someone doing something like that
<heat> graphitemaster, mremap
<geist> well, 'unmapping the old' is not a copy, per se. that's move
<graphitemaster> Makes memcpy on macOS like 15x faster than all other OSes
<geist> move and copy are different things
<heat> if you unmap it of course
<graphitemaster> Well it's important it does cow
<geist> but... here's my thought: it's a call saying 'do whatever operations you can do to make it appear as if you copied data from A to B'
<geist> and that gives the kernel full capability to decide how to do it. so the model is a page aligned memcpy
<geist> the implementation can then decide based on source and destination a bunch of options
<geist> and the best option is to use the PTEs to move things around and avoid a copy
<graphitemaster> Which is what it actually does when I look at the code
<geist> we have something kinda lke that on zircon for moving pages between vmos
<graphitemaster> It sets up copy-on-write mappings
<graphitemaster> To alias the memory avoiding a copy
<graphitemaster> Then when you touch those pages it actually well, page faults and does the copy
<geist> makes sense. so it does mean there are downsides to it. it's based on the premise that the source is probably not going to immediately touch it, because you can easily come up with a situation where if the source starts faulting in new copies somewhere it may end up slower than a memcpy
<geist> which cpus are exceptionally good at
<geist> or the target. depends on which side it COWs from
<graphitemaster> Well, aarch64 is exceptionally good at copies.
<geist> we have something in zircon that actually does a direct move of pages between vmo A and B. basically a splice() operation
<graphitemaster> x86 TSO makes copies kind of sad :|
<graphitemaster> M1 gets like ~70 GiB/s memcpy performance without any optimizations (a byte code in 2 lines of C gets that)
<graphitemaster> x86 can barely push past 10 GiB/s memcpy performance
<graphitemaster> Even with stupid tricks like microcode rep movsb
<heat> geist: linux has copy_file_range(2) now
<graphitemaster> s/byte code/byte-wise copy/
<geist> but yes, this is a whole valid avenue of design patterns one can seek to do in kernel design: give the kernel some ability to do smart things for you by telling it to do some mundane stuff for you (like copying data) and hoping the kernel has some optimized path that is unavailable to user space
<geist> i've heard mach has more than one thing exactly like this
<graphitemaster> So how would mremap be used to do this on Linux
<geist> graphitemaster: your copy numbers are absolutely not what i've observed
<geist> as in x86 is a monster moving data in general. though M1 is also, but a lot of that is because it has a very very custom memory subsystem
Arthuria has joined #osdev
<heat> graphitemaster: I was wrong, you can do it with MAP_SHARED (create a new mapping of the same pages) but not MAP_PRIVATE
<geist> that's the key though: do you tell the kernel to do precisely what you want (create SHARED, zircon vmo copy, etc) or do you tell the kernel something generic and hope it can find a behind-the-scenes way to do it fast
<geist> two completely valid design patterns. useful in different contexts
<geist> in the past i was generally down on the latter, seems complicated, etc, but i'm warming up to it
<geist> and this vm_copy(ptr, ptr, len) is a great example of it
<graphitemaster> geist, I've spent the past week trying to push a few x86s past 10 GiB/s memcpy performance (trying everything at this point): thousands of lines of custom SSE/AVX aligned/unaligned, prefetch/non-prefetch, temporal/non-temporal routines. Meanwhile I wrote `while (src != end) *dst++ = *src++` and that thing is hitting ~60 GiB/s on an M1 without a sweat; unrolling it a bit, 69.25 GiB/s is possible (that's
<graphitemaster> the theoretical maximum too, seems to be approx that when averaged out). I rounded to ~70 GiB/s since it's a nicer number.
<geist> part of me says it's not the kernel's business, that's something that can be done in user space, so don't add complexity to the kernel for what can be done via a user memcpy
<heat> I would implement vm_copy as a library call with a fast kernel backend for big enough copies
<geist> graphitemaster: huh? i've had shitty one memory channel laptop skylakes do 2.5x that without any trouble
<heat> or at least a single syscall with a DO NOT USE THIS FOR TINY COPIES
<geist> right
<geist> graphitemaster: are thse particularly low end x86s or something?
<graphitemaster> geist, that's strange, never tested any intels though yet, all I own and have access to are Zens with DDR4 (Zen1, 2, and 3)
<graphitemaster> rep movsb is absolutely garbage on Zen1 btw
<geist> that is really really strange. i think that's not right
<heat> I have a shitty kabylake R laptop
<heat> what do I need to test?
<bslsk05> ​ssvb/tinymembench - Simple benchmark for memory throughput and latency (73 forks/239 stargazers/NOASSERTION)
<graphitemaster> Takes about 10 minutes
<graphitemaster> Pastebin your results
<graphitemaster> Highest I got was 11 GiB/s
<geist> yes, zen 1 explicitly does not have the ERMSB bit
<graphitemaster> That's with SSE2 non-temporal prefetch 64b at a time, being the fastest for Zen.
<geist> zen 2 or 3 added it
<graphitemaster> Even Zen 2 rep movsb is slower than SSE2
<graphitemaster> Zen 3 they seem to be the same
<graphitemaster> In either case, none of the Zens with none of my DDR4 can exceed 11 GiB/s
<geist> huh yeah that does appear slow here too. i've personally seen higher than that
<geist> but maybe their testing is very conservative
<geist> ah, re standard memcpy: it's about 16GB/sec here
<geist> SSE2 non temporal copy basically (this is a zen1+)
<graphitemaster> Like even if you're hitting 16 GiB/s, that's not much of a massive difference compared to the insanity that is 70 GiB/s on a lower powered aarch64 system with lower memory clocks.
<graphitemaster> It's just not fair.
<geist> that's not an artifact of arm64, that's an artifact of apple's design
<geist> it's my understanding the M1 has stacked ram, so it's very very close (dunno what frequency it's at) and it has something like 8 channels
<graphitemaster> Well, it's an artifact of x86's TSO based on the paper I read: someone built a soft core of a really old x86 in an FPGA, with and without total store order, because they were genuinely curious about its effect on memory subsystem performance, and concluded that x86's choice of TSO leaves about 10x performance on the table. It really feels like that's the reason
<graphitemaster> here too, because even the non-M1 aarch64 I have is hitting those speeds with non-stacked memory.
<geist> can you point me to a piece of arm hardware that can?
<geist> i've certainly got nothing available to me that gets close, including a fairly beefy ThunderX2 server
<geist> it gets about 12GB/sec
<graphitemaster> Yeah, get an eMAG 8180
<geist> also isn't it by definition that the M1 can't do 70GB/sec but more like 35GB/sec memcpy, because read/write?
<geist> alas i can't easily build it here, the assembler is unhappy with the .S files in the project on mac
<geist> for M1
<graphitemaster> That thing has DDR4-2667 and my system has DDR4-4733 and the aarch64 reaches 70 GiB/s and I can barely push past 10 GiB/s in that bench.
<graphitemaster> My memory has twice the clock rate and performs 1/7th the speed.
<graphitemaster> Just doesn't make sense
<graphitemaster> Anyways, going to get some food, brb later.
rebedin has joined #osdev
<geist> sure
<graphitemaster> I want vm_copy regardless XD
<geist> while i wont argue your numbers (though i do argue 70GB/sec for mem*copy*)
<geist> i do think you're making too many architecture pronouncements based on specific tests
<graphitemaster> Maybe, but even without synthetic benches I'm seeing std::vector<int> resize 15x faster on the M1 compared to my x86 when timed and profiling shows memcpy
<geist> yes but the M1 is a fucking monster
<geist> this is what i've been observing myself
<geist> it's really in a class by itself
<graphitemaster> It moves memory faster than any machine then, apparently.
<geist> for a whole pile of reasons. some of which *may* be arm architectural, but it's basically a sample size of 1
<geist> as in 'apple built this amazing end to end design that happens to use arm ISA'
<graphitemaster> Then lets kill x86
<rebedin> graphitemaster: now you might be starting to understand what my research is about about clockless states, cause it would be easy for you if you ever did, i've taken time to add security for that pipeline , but do you have any real question you are after technically? Perhaps a mentally ill wanker like me can answer those :D
<graphitemaster> M1 for everything in the future please.
<geist> so while some of it may be because of the weak memory model, etc, my general feeling is that when dealing with raw memory bandwidth you almost always end up memory bound anyway
<heat> https://gist.github.com/heatd/d0dc0e72bde4cc7d920182a6bd6a2d38 Intel(R) Core(TM) i5-8250U DDR4-2400
<bslsk05> ​gist.github.com: gist:d0dc0e72bde4cc7d920182a6bd6a2d38 · GitHub
<geist> and you're really testing the ability of the machine to shove data through its cache and whatnot
<heat> aka very slow
<graphitemaster> 16 GiB/s
<graphitemaster> rep movsb is shit there
<graphitemaster> Just like mine
<graphitemaster> Is that a Zen1 ?
<geist> be careful. 16GB/sec is a fill, not a copy
<geist> it's worse, it's about 8GB/sec copy
<graphitemaster> Oh right, nice catch
<graphitemaster> That's even worse than mine lol
<graphitemaster> Looks like SSE2 nontemporal wins then for your rig
<geist> my dual channel nuc i5-6xxx gets a bit better , around 12GB/sec copy
<rebedin> graphitemaster: let's view this thing based of how hardware handles things on it's schematic, there is a concept called cross domain checking, than wire logic, and wire port coersion, which is relevant in verilog how this state is handled, zen of course would handle that way batter than risc based arms of any sort.
<rebedin> likely it just is not yet implemented there for whatever reasons
<geist> or apple pulled it off with M1
<heat> do realise my TDP up is 25W lol
<graphitemaster> That's the thing I also tested a shitty rockchip MIPS
<heat> it's trying ;_;
<graphitemaster> The MIPS cpu crushes x86 in memcpy performance too
<graphitemaster> At least my Zen1
<graphitemaster> It was able to sustain 20 GiB/s
<geist> for what, copy or fill?
<graphitemaster> copy
<geist> you have to be extremely explicit here and make sure you don't use the wrong one
<graphitemaster> I'm starting to think desktop x86 is regressing in memory performance.
<geist> side note, the M1 may also be using large pages to somewhat greater effect
<graphitemaster> I want to test some old systems now
<rebedin> graphitemaster: well let's say the ram is not any different than flip flops in it's performance when you use write-combining bus traffic tbh, it is a line of sense amplifiers and capacitors, and an asic attached to it as control circuitry
<moon-child> just tested (zen 2), I get 15gb/s
<geist> the M1 uses a) 16K pages (or larger) most of the time, and b) has a relatively huge TLB (3k entries i believe)
<graphitemaster> geist, Well the M1 mac also runs an OS with a libc which explicitly uses vm_copy when the size is larger than 16 KiB
<graphitemaster> In its memcpy implementation
<geist> sure, but this test case should bypass that
<graphitemaster> Yeah
<geist> except i can't build it without hacking on the asm file
<graphitemaster> standard memcpy on the M1 beats everything in this test btw
<graphitemaster> Anyways, for real I need to get food before the store closes
<graphitemaster> XD
<geist> kk
<geist> hmm, this nuc i3-6100U gets
<geist> standard memcpy : 11374.7 MB/s
<geist> standard memset : 21617.8 MB/s
<moon-child> hmm, if I unroll the loop twice as hard, I get 16.5gb/s
<rebedin> it seems like you ain't listening me, so might as well dig ahead dorks, it ain't gonna seem to happen with your brains involved i see.
<moon-child> I'm tempted to unroll it twice as hard again but I doubt it'll improve matters much :P
<rebedin> it almost looks like you do not even know how to calculate, and 1cououmb of charge and wire logic might be too much to you
<geist> rebedin: one more and you're out
<rebedin> geist: prk yourself into trashbin fecalist
rebedin was kicked from #osdev by geist [rebedin]
<heat> strike out!
<geist> a shame, if he just sticks to not insulting people he can listen in, participate, etc
<heat> we should all abandon ARM and x86
<heat> let's use efi byte code directly
* sortie commends geist on a job well done :)
<sortie> I have coffee
<sortie> Reviewing contributions :)
<geist> noice.
<sortie> I started fearing what would happen if too many people start contributing, in which case I would need contributors to review each other's work to be senior devs, yada yada yada, performance reviews ...
<geist> nooo!
<geist> OKR time!
<sortie> How dare you suggest OKRs
<geist> i only mean it ironically!
<sortie> That's the gateway, ironic OKRs
<sortie> Corporate jokes aside, community building is a real and valuable thing :)
<sortie> There's plenty of things one can do to build communities, like giving regular contributors a bit of extra project access like op access, so they can help deal with trouble people
<sham1> It very much is, although it tends to feel very forced
<sortie> I think it's fair to expect regular contributors, whose code I review, to also do some code reviewing in return
<geist> yeah
<geist> it's tough when they phone something in. lots of times i've responded to reviews with suggestions and then you get radio silence
<geist> so the question is do you give up on those or stick with it because 1/10 contributors will come back?
<geist> (probably the right answer)
<sortie> Yeah I did abandon a couple contribs that were unresponsive
<sortie> There's major differences between a first time contributor and recurring contributors though. The first timers you want to lower the entry bar as much as reasonable, and be okay with many of them not managing it anyway
<geist> yah
<sortie> The people that do stick around should be given opportunity to grow in the community
<Santurysim> Technical question: what extension is used on wiki.osdev.org to highlight source code?
<sortie> But it's a cool thing that I'm actually in a state right now where I actually let people contribute, compared to the pretty controlling wanting-to-do-everything-myself or too-busy-to-deal-with-it I've been in for most of the project
<sham1> I would think the syntax highlighter is just some mediawiki thing
<sham1> Santurysim: https://www.mediawiki.org/wiki/Extension:SyntaxHighlight might be it, actually
<bslsk05> ​www.mediawiki.org: Extension:SyntaxHighlight - MediaWiki
<geist> yah i'm starting to get more motivated to do some deeper work with LK, as i'm discovering more places where it's used
<geist> i can probably make a good case to essentially work on it full time at work, since it's used in <redacted> stuff just around work
<geist> and they're basically following mainline
<j`ey> I've been playing with the GPIO controller on the m1
<j`ey> nice, full time LK!
<j`ey> just dont let it become BK :P
<geist> exactly, thats the challenge
<geist> but i can definitely do a lot of administrative and testing/etc stuff
<geist> which is under par compared to lots of modern stuff
<geist> fine for 10-20 years ago, but nowadays folks expect more unit tests, etc
<j`ey> I'm trying to understand the pinmuxing code in linux, but I had about 5-6h today with no luck getting it working
<heat> yeah
<geist> my experience with pin muxing and gpio is the moment you try to come up with a single model you have failed
<heat> I'm trying to unit test a whole filesystem driver
<heat> not fun D:
<j`ey> geist: lol
<geist> especially given there's no one model for pins vs gpios
<geist> on some socs they're the same thing, on others a gpio is a fundamentally different concept from a pin
<j`ey> there's a working patchset for the m1, but im trying to clean it up / modify the device tree it uses to match upstream
<geist> word
<geist> i wonder if anyone has tried to figure out what the page compression instructions do
<geist> ie, what format it does, how fast it is, etc
<graphitemaster> The M1 is officially the most impressive technological jump I've seen in CPUs in probably a decade. I'm still amazed by it.
<Santurysim> sham1: thank you. I've installed it on my mediawiki instance, and it looks slightly different from one on osdev. Also, it handles at&t assembly properly
<graphitemaster> Hello, I'm back :3
<j`ey> hm, all I know about is https://github.com/AsahiLinux/docs/wiki/HW%3AApple-Instructions geist, but its sparse on info
<bslsk05> ​github.com: HW:Apple Instructions · AsahiLinux/docs Wiki · GitHub
<geist> but all aside i wouldn't fixate too much on the memory bandwidth of M1. it's impressive, but that's also exactly what you get when you tightly couple the design
<geist> what'll be interesting is to see if they have to regress that a bit if/when they make a socketed memory version
<geist> ie, not just fixed ram directly attached to the cpu
<graphitemaster> geist, Okay but other CPUs not the M1 seem to be doing better than my over priced x86 at memory which is a problem in my eyes.
<geist> fine.
<j`ey> I'm just happy to have a project to work on, been in a bit of a rut
<geist> actually there's some fun stuff in the middle there as to bypassing their cache hierarchy and whatnot. and in general the apparently super low latency of the L2
<geist> maybe part of their leap forward is also the exceptionally wide L1 and L2 with no L3. which is a different path from the one x86 vendors have been going down
<geist> dunno if thats scalable. can you realistically clock at 4-5Ghz with that? M1 runs at like 2.2 iirc
<graphitemaster> It's actually a serious problem I think because the software paradigm has changed to be cache-aware, so a lot of old approaches to avoiding copying memory (like linked lists) are no longer a good idea, and data transformations are being programmed assuming really flat, decoupled, and tightly packed arrays which often requires copying first to get into that format...
<geist> so maybe that's a very conscious design decision: build an extremely wide cpu that is not intended to clock high, and optimize the shit out of that
<graphitemaster> And copy performance of x86 is getting slower I think.
<geist> if anything that may be one of the biggest major forks in the M1 design from modern x86
<geist> graphitemaster: oh yeah totally. this whole 'get everything into a linear run because chasing pointers is slow' is a real thing nowadays
<geist> more and more of modern stuff has to contend with that (if it wants to be fast)
<graphitemaster> Yeah and getting things into linear runs requires a lot of copying and moving of data :P
<Santurysim> Still, pentium n5000 @ 2.7GHz is too hot for my transformer pad
<graphitemaster> So you want fast copies
<geist> so it's possible that thats a large reason M1 is extremely fast: low latency caches, etc
<geist> and less levels of cache (thus contributing to lower latency)
<graphitemaster> It helps that the memory on the M1 has quicker access times than desktop too. Like actual RAM latency is impressive for DDR4
<graphitemaster> No doubt solder close to CPU helps there
<graphitemaster> May as well just put the ram right inside the CPU core at this point XD
<geist> right so what would be interesting is to see if they can pull that off if you made a generic design that has to support DIMMs and whatnot
<geist> whats fascinating to me is they took what was usually a negative (soldered ram, etc) in smartphone/embedded space and turned it into a positive by going all in
<geist> generally random arm socs are severely held back by single channel, 32bit wide ddr
<graphitemaster> Even if TSO is not contributing to issues here. After reading that entire paper on TSO I'm totally convinced strong memory models are fundamentally incompatible with high performance and more importantly, scaling to a ton of cores.
<geist> word.
<graphitemaster> dword
<j`ey> graphitemaster: can test with linux on the m1 eventually
<j`ey> by enabling tso
<Santurysim> Does making program cache-aware help on m1?
Arthuria has quit [Ping timeout: 248 seconds]
sts-q has quit [Ping timeout: 248 seconds]
<heat> obviously
<graphitemaster> cache aware helps on everything lol
<heat> *except cacheless cpus :P
<Santurysim> Oh. That was a silly question
heat has quit [Remote host closed the connection]
kulernil has joined #osdev
kuler has quit [Ping timeout: 244 seconds]
<graphitemaster> cacheless cpus are for cachless people
<graphitemaster> s/cachless/cashless
<zid> cacheless is a plot by big RISC to make their cpus competitive
anon16 has quit [Ping timeout: 258 seconds]
anon16 has joined #osdev
srjek has quit [Ping timeout: 258 seconds]
kulernil has quit [Remote host closed the connection]
kulernil has joined #osdev
sortie has quit [Quit: Leaving]
rubion has quit [Ping timeout: 268 seconds]
kulernil has quit [Ping timeout: 244 seconds]
anon16 has quit [Quit: WeeChat 3.1]
anon16 has joined #osdev
dude12312414 has joined #osdev
Burgundy has quit [Ping timeout: 258 seconds]