klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
amanita has joined #osdev
dhugo has left #osdev [Leaving.]
Arthuria has quit [Ping timeout: 264 seconds]
<doug16k> riverdc, it's the same thing on x86 vs x86_64. x86_64 is actually easier
<doug16k> doublefault and stack fault are clunky as hell on x86, easy on x86_64
<doug16k> x86 nmi is also even more hell than the x86_64 one
amanita has quit [Quit: Lost terminal]
transistor has quit [*.net *.split]
xenos1984 has quit [*.net *.split]
transistor has joined #osdev
xenos1984 has joined #osdev
iorem has joined #osdev
<riverdc> ok, good to know. i'm pretty green in terms of assembly, didn't know that x86_64 was considered easier
transistor has quit [Ping timeout: 245 seconds]
<kazinsal> I wonder if it would be easier to write a new kernel core for x86_64 and port the modules into it than to try to clone the i686 core and change bits as needed until it works
xenos1984 has quit [*.net *.split]
xenos1984 has joined #osdev
<Mutabah> x86_64 has more warts... but they're easier to use
<GreaseMonkey> [11:54:05] <heat> also arm64 has division, arm32 doesn't I think <-- ARMv7 i think has division which is 32-bit
<GreaseMonkey> i know ARMv7-M has division and that's the cut-down thumb-only microcontroller version
<heat> GreaseMonkey, probably, i'm not the biggest ARM fan here :)
<GreaseMonkey> ARM can be a lot of fun but i want to see more RISC-V stuff
<heat> kazinsal: if you wrote your paging well there are probably not too many changes you need to make
<heat> just use uint64_t and increase your paging levels
Lucretia has quit [Quit: Konversation terminated!]
<merry> ARMv7-A has division yes
<heat> then there's the IDT and GDT which are slightly different but that's also a piece of cake
<heat> the TSS is also slightly different
<heat> for syscalls you probably want to use syscall but int works fine
<heat> and yeah that's about it
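A minimal sketch of the paging change heat describes, assuming nothing beyond the log: entries widen to 64 bits and the walk gains levels. The helper names are invented for illustration.

    #include <stdint.h>

    typedef uint64_t pte_t;  /* x86_64 page-table entries are 64-bit; non-PAE i686 ones were 32-bit */

    /* 4-level walk: 9 bits of index per level, 4 KiB pages */
    static inline unsigned pml4_index(uint64_t va) { return (va >> 39) & 0x1ff; }
    static inline unsigned pdpt_index(uint64_t va) { return (va >> 30) & 0x1ff; }
    static inline unsigned pd_index(uint64_t va)   { return (va >> 21) & 0x1ff; }
    static inline unsigned pt_index(uint64_t va)   { return (va >> 12) & 0x1ff; }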
<Mutabah> GreaseMonkey: riscv is FUN
<Mutabah> it's just so simple
<GreaseMonkey> Mutabah: it's basically a better MIPS, another CPU which takes less than a day to make a basic user-mode-level emulator for
dgb has joined #osdev
isaacwoods has quit [Quit: WeeChat 3.1]
Affliction has joined #osdev
<heat> is it just me or is virtio-blk crap?
<Mutabah> In what way?
<heat> you need to create a separate virtio_blk_request header for each sector you want to read, and each sector is very much its own request
gog has joined #osdev
<Mutabah> Well, each request - you can read sequential sectors in one request
* gog meows
<Mutabah> And you need something to tell the host what address to read
<heat> you can?
<Mutabah> Yep
<heat> ah, I take it back then
<heat> virtio-blk is great
<doug16k> you might fragment it because of scatter gather, but it might be one big range
<doug16k> you can chain them together into one big command
<doug16k> my read and write are largely wrappers around mphysranges
<doug16k> the number of items there determines how many items go in the virtio ring
vdamewood has joined #osdev
<doug16k> if you look at the virtio config struct, it has size_max. use that to cap your scatter/gather fragment size
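A rough sketch of the request shape Mutabah and doug16k describe; the header field layout follows the virtio spec, while the chain layout comment is only an illustration.

    #include <stdint.h>

    struct virtio_blk_req_hdr {
        uint32_t type;      /* VIRTIO_BLK_T_IN (read) or VIRTIO_BLK_T_OUT (write) */
        uint32_t reserved;
        uint64_t sector;    /* first sector of the run; length is implied by the data buffers */
    };

    /* One request = one descriptor chain:
     *   [header, device-readable] -> [data fragment 0] -> ... -> [data fragment N] -> [status byte, device-writable]
     * Sequential sectors go in a single request; each data fragment comes from the
     * scatter/gather list and is capped by blk_config.size_max. */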
<doug16k> I have virtio-blk driver
transistor has joined #osdev
<doug16k> it reminds me of usb block storage. almost impossible to be simpler
<doug16k> virtio-blk blows away usb block storage for one reason: block storage is in order, virtio isn't
<doug16k> each thing completes independently
heat has quit [Ping timeout: 264 seconds]
vdamewood has quit [Ping timeout: 268 seconds]
<doug16k> if you want a guarantee that something runs after something else, then don't issue it until the other thing completes
gog has quit [Ping timeout: 268 seconds]
<doug16k> you already need to be able to stall to do a useful flush
<doug16k> you need to be able to block issue from all threads, drain out the commands until none are pending, issue the flush, and prevent issue until the flush completes
<doug16k> when you say flush you are saying flush what was issued so far. if it hasn't issued everything up to the flush, the flush doesn't flush everything before it
<doug16k> the only way to be certain it was issued is to wait for it to complete
<doug16k> both ahci and nvme need you to do that too
<doug16k> if you just fling a flush in there you don't know for sure what it flushed
<doug16k> not allowed to issue that command NCQ anyway
<doug16k> already need to be able to stall until ncq commands drain, and transition to not-ncq
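A sketch of the quiesce-then-flush sequence doug16k is describing; every name here is hypothetical.

    struct blk_queue;
    /* primitives the driver is assumed to already have */
    void block_new_issue(struct blk_queue *q);
    void wait_until_none_pending(struct blk_queue *q);
    void issue_flush_and_wait(struct blk_queue *q);
    void unblock_issue(struct blk_queue *q);

    void blk_flush_barrier(struct blk_queue *q)
    {
        block_new_issue(q);           /* stall submissions from all threads */
        wait_until_none_pending(q);   /* drain everything already issued */
        issue_flush_and_wait(q);      /* now the flush covers everything before it */
        unblock_issue(q);
    }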
<doug16k> it's like it's a tradition
<doug16k> I wanted to add port multiplier support
<doug16k> I love the idea of port multipliers
<doug16k> one sata can handle 4 spinning drives no problem
<doug16k> too late now, too old
<Affliction> Something I've not thought of - do any hypervisors or emulators emulate them?
<Affliction> doesn't look like qemu or vmware do
<doug16k> I hope there is, I don't think so though
sniff has joined #osdev
<doug16k> I bet it's not even that much work to add it to qemu's ahci
<Affliction> hardware's not that expensive, either
<Affliction> but, my ONLY use for it would be testing my AHCI driver against it
<doug16k> a friend almost went that route and went sas-to-many-sata instead
sniff has left #osdev [#osdev]
<doug16k> he has an array of 7200 rpm drives that keep up with my 960 pro nvme in linear
vdamewood has joined #osdev
<doug16k> so much mechanical it keeps up with solid state :)
<Affliction> way back in the day I wrote a FAT fragmenter, compare their performance after that treatment :)
<doug16k> it's fun to have so much disk throughput, that you need to put the disk host controller in the video card slot so you can get x8 :P
jaevanko has joined #osdev
jaevanko has quit [Client Quit]
<Affliction> speaking of ludicrous disk throughput, does anyone know about directstorage? I've not been able to find a straight answer on if it's only useful for the GPU, or if it's useful on the CPU too.
<Affliction> even now, all I can find is marketing.
<doug16k> I go for storage so direct, I wrote the block driver and the filesystem and the syscalls and the libraries :P
<doug16k> then everything is my fault
<doug16k> fuzzing seems like an excellent idea on syscalls
<Affliction> Well, the nvidia documentation seems to be entirely GPU side.
<Affliction> Dunno, maybe it provided a way to allow applications to map part of a BAR so they can access storage without going through the OS
<Affliction> But that would require the disk to know about the filesystem
<doug16k> newest pcie can map any size
<doug16k> variable bar thing
<Affliction> Which is something I've thought about - move the filesystem to the SSD controller, which can deal with allocating a bunch of numbered logical streams. The OS' filesystem can use some streams for metadata, others for content.
<doug16k> goes all the way up to 64 bit
<Affliction> Program wants to open a file, OS does permission checks, gives it a mapping to a page in one of the BARs that allows it to do read/write commands, without syscalls
<Affliction> mmap gives it a BAR on the disk which can be read and written
<doug16k> if hardware has that bank switching window thing, yeah
<doug16k> oh you mean slide the bar up and down depending on access?
<doug16k> that's probably not right, you aren't intended to be frequently modifying config space
<doug16k> weird if they do that
<Affliction> nah, the BAR could be a fixed sized multiple of the size of the disk.
<doug16k> yeah just bank switch the one bar and not move it
<doug16k> that way makes more sense
<Affliction> the device just creates 'views' of logical streams
<doug16k> it's EMS memory from DOS 6.22 days
<Affliction> effectively, the device deals with fragmentation
<doug16k> with nice big 16MB window
<Affliction> something I kinda want to try with an FPGA, but I've never done anything with an FPGA
<Affliction> seems a bit more complex than blinking a LED
<doug16k> if you want to do that they have amazing stuff
<bslsk05> ​numato.com: Aller Artix-7 FPGA Board with M.2 Interface | Numato Lab
<Affliction> But, being able to read/write files with 0 syscall overhead
<doug16k> M.2 fpga - your code can use pcie bus and do dma and stuff
<Affliction> nice, hardware side sorted
<doug16k> big ass artix
<Affliction> 1 RGB LED for custom use
<Affliction> so I can blink the LED too :D
<doug16k> that is amazing if you want to explore making pcie devices
<doug16k> they give you an led I think lol
<doug16k> just dma the blinking led into the framebuffer through peer-to-peer bus master stores
<doug16k> no you're right, that one is for more advanced user. you can get friendlier fpga stuff
<Affliction> At any rate, effectively building a filesystem in hardware is far above my skill level
<doug16k> it has an led though
<doug16k> you aren't even *allowed* to sell an fpga prototyping tool with no led are you?
<doug16k> the led police make those people disappear
<Affliction> Oh I'm sure I can blink the hell out of that LED :)
<Affliction> maybe I can DMA "Hello World!" into the framebuffer!
<doug16k> you could probably hijack control of the kernel from dma
<doug16k> point it into an mmio window you made, which has code in it
<doug16k> make the cpu jmp to the mmio window
<doug16k> from there you take over easy
<Affliction> Well, if you're booting from my disk, it can just load my code anyway!
<Affliction> Unless you're signing your EFI loader
<doug16k> if there is no iommu, there's nothing stopping you
<doug16k> you can modify ram whenever you want
<doug16k> anywhere
<Affliction> Didn't the IOMMU hve a bit devices can set to ignore it anyway?
<Affliction> Or at least, some version of the spec
<Affliction> Because, that makes sense.
<doug16k> it could be behind something that doesn't support the remapping
<doug16k> everything behind that can see the real thing when they do peer to peer
<Affliction> I was sure there was a bit you could set in your DMA packets to say "nah, don't translate this address, it's completely harmless!"
<doug16k> oh not sure what you can or can't do maliciously
<Affliction> maybe it was in a draft
iorem has quit [Quit: Connection closed]
<doug16k> that device I linked would allow the user to explore dma attacks
<doug16k> I had my iommu on full paranoia. sync evict immediately after each I/O
<doug16k> it was fine, but impacted game engines a bit
<doug16k> iommu.strict
<doug16k> I just played tf2, first time playing it in a while, and it was so weird to just sit at fps_max forever 299fps
<Affliction> found it
<Affliction> ^F for ATS
<Affliction> anyway, gotta go, back in an hour or so
<doug16k> nice
<doug16k> I don't know why I find it so amusing to run older engines at breakneck speed
bradd has quit [Remote host closed the connection]
bradd has joined #osdev
jmpeax has joined #osdev
Arsen has quit [Changing host]
Arsen has joined #osdev
MrBonkers has quit [Changing host]
MrBonkers has joined #osdev
Geertiebear has joined #osdev
Geertiebear has quit [Changing host]
<Affliction> now that I'm back, let's see what this is about...
sprock has quit [Quit: ...]
tenshi has joined #osdev
Terlisimo has quit [Quit: Connection reset by beer]
<Arsen> does the forum have an inline code bbcode? like ``code`` in rST?
sprock has joined #osdev
iorem has joined #osdev
divine has quit [Ping timeout: 264 seconds]
<moon-child> y'all remember colinux?
Terlisimo has joined #osdev
<moon-child> it seems like a cool thing to use, but also wildly impractical to have to build or work with
Terlisimo has quit [Client Quit]
Terlisimo has joined #osdev
MarchHare has quit [Ping timeout: 264 seconds]
<Affliction> I know of colinux, though I've never run it, or looked into how it worked
<Affliction> so it runs the kernel as a usermode process?
<Affliction> alongside drivers to use resources from windows
riverdc has quit [Remote host closed the connection]
SwitchToFreenode has quit [Remote host closed the connection]
SwitchToFreenode has joined #osdev
GeDaMo has joined #osdev
cultpony has quit [Changing host]
cultpony has joined #osdev
jmpeax has quit [Quit: leaving]
zagto has joined #osdev
pretty_dumm_guy has joined #osdev
pretty_dumm_guy has quit [Client Quit]
pretty_dumm_guy has joined #osdev
srjek has quit [Ping timeout: 268 seconds]
mahmutov has joined #osdev
qookie has joined #osdev
Lucretia has joined #osdev
mctpyt has quit [Ping timeout: 268 seconds]
mctpyt has joined #osdev
simpl_e has quit [Remote host closed the connection]
simpl_e has joined #osdev
alexander has joined #osdev
sortie has joined #osdev
lleo has joined #osdev
mctpyt has quit [Ping timeout: 268 seconds]
Lucretia has quit [Read error: Connection reset by peer]
Lucretia has joined #osdev
mctpyt has joined #osdev
dgb has quit [Ping timeout: 268 seconds]
qookie has quit [Ping timeout: 268 seconds]
iorem has quit [Ping timeout: 268 seconds]
qookie_ has joined #osdev
lleo has quit [Ping timeout: 268 seconds]
Lucretia-backup has joined #osdev
Lucretia has quit [Killed (NickServ (GHOST command used by Lucretia-backup))]
Lucretia-backup is now known as Lucretia
simpl_e has quit [Ping timeout: 268 seconds]
transistor has quit [Ping timeout: 268 seconds]
mctpyt has quit [Ping timeout: 268 seconds]
isaacwoods has joined #osdev
maksy has joined #osdev
GeDaMo has quit [Ping timeout: 268 seconds]
zagto has quit [Ping timeout: 268 seconds]
zagto has joined #osdev
GeDaMo has joined #osdev
dennis95 has joined #osdev
gog has joined #osdev
Matt|home has quit [Read error: Connection reset by peer]
mahmutov has quit [Ping timeout: 268 seconds]
junon has quit [Ping timeout: 272 seconds]
junon has joined #osdev
tricklynch has quit [Ping timeout: 268 seconds]
tricklynch has joined #osdev
wgrant has joined #osdev
wgrant has quit [Changing host]
tricklynch has quit [Ping timeout: 268 seconds]
tricklynch has joined #osdev
mahmutov has joined #osdev
chartreuse has quit [Ping timeout: 264 seconds]
heat has joined #osdev
pretty_dumm_guy has quit [Quit: WeeChat 3.2-dev]
pretty_dumm_guy has joined #osdev
iorem has joined #osdev
mahmutov has quit [Remote host closed the connection]
Mids_IRC has joined #osdev
lleo has joined #osdev
bleb has joined #osdev
mahmutov has joined #osdev
ksroot has quit [Ping timeout: 244 seconds]
redeem has quit [Ping timeout: 250 seconds]
mahmutov has quit [Ping timeout: 272 seconds]
transistor has joined #osdev
iorem has quit [Quit: Connection closed]
alexander has quit [Ping timeout: 265 seconds]
redeem has joined #osdev
pretty_dumm_guy has quit [Quit: WeeChat 3.2-dev]
amanita has joined #osdev
gareppa has joined #osdev
gareppa has quit [Remote host closed the connection]
MarchHare has joined #osdev
mahmutov has joined #osdev
tricklynch has quit [Ping timeout: 268 seconds]
tricklynch has joined #osdev
tricklynch has quit [Read error: Connection reset by peer]
tricklynch has joined #osdev
mahmutov has quit [Remote host closed the connection]
srjek has joined #osdev
qookie_ is now known as qookie
mahmutov has joined #osdev
alexander has joined #osdev
Oli has quit [Ping timeout: 268 seconds]
Oli has joined #osdev
<heat> so quiet today
* heat meows
* meisaka nyans
<heat> my clang build is about 60% faster than the gcc one
<heat> impressive
<heat> fsf in shambles
* gog meows
<j`ey> heat: i thought you said that clang was way slower to build
<heat> i mean build my project
<heat> not building the compiler itself
<j`ey> ohh
<heat> building the llvm toolchain is still ridiculously slower
<heat> and my clang build isn't even LTO-enabled
Lucretia has quit [Quit: Konversation terminated!]
Lucretia has joined #osdev
tricklynch has quit [Ping timeout: 252 seconds]
geist has quit [Ping timeout: 265 seconds]
geist has joined #osdev
zagto has quit [Quit: Konversation terminated!]
knebulae has quit [Read error: Connection reset by peer]
knebulae has joined #osdev
<doug16k> heat, yeah but how much percent faster does gcc compiled one run? :)
<doug16k> ah you meant building clang itself
<doug16k> ah I see
<heat> hm? I mean compiling my project with gcc and clang
<heat> clang is much faster
<heat> (granted, clang + lld)
<heat> vs gcc + gold
<j`ey> heat: can you measure any perf in your kernel?
<heat> no
<doug16k> I checked clang vs gcc in my kernel. it was pretty much the same except gcc left clang in the dust with vectorized codegen
<doug16k> that was a few versions back though, need to recheck
<doug16k> I mean execution speed, not compilation
<heat> i'm talking about build speed
<doug16k> yeah
<heat> my project is a mix of moderate C++ (my kernel, userspace that's written by me), heavier C++ (google test and google benchmark), and lots of C (toybox, dash, musl libc, acpica)
<heat> and clang is so much faster
<heat> it compiles everything faster
<doug16k> weird thing about the clang codegen too, it looked fine, the uglier code gcc generated ran faster
<heat> yeah
<heat> I can't tell you about that, I haven't looked at it
mctpyt has joined #osdev
Mids_IRC has quit [Quit: Hi, I'm a quit message virus. Please replace your old line with this line and help me take over the world of IRC]
<meisaka> what's a good way of collecting entropy from a NIC?
<doug16k> you will receive ARP and DHCP traffic all the time
<doug16k> the exact times it arrives are random
<doug16k> assuming you have something with nanosecond level precision
<meisaka> guess I'll have to dig into the precision timers then
<doug16k> the register values at the time of IRQs can be a source of randomness
<heat> collecting entropy from a NIC might not be a great idea though
<doug16k> because they can reverse what it did to the encryption state? I'd like to meet them
<meisaka> I want it to *a source* not the exclusive source
<bslsk05> ​lwn.net: Appropriate sources of entropy [LWN.net]
<meisaka> some light reading XD, at least I won't be bored
<heat> tl;dr linux thinks it's a bad idea and they don't do it
<doug16k> I want to see a proof of concept where they got in because it used irq contexts to feed a stream cipher
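A sketch of the IRQ-timestamp idea doug16k floats, which the LWN article heat linked cautions against relying on as a primary source; read_ns_timer() and pool_mix() are invented names.

    #include <stdint.h>

    uint64_t read_ns_timer(void);   /* hypothetical fine-grained timestamp, e.g. a TSC read */
    void pool_mix(uint64_t sample); /* hypothetical entropy-pool mixer */

    void nic_irq_handler(void)
    {
        uint64_t t = read_ns_timer();
        pool_mix(t ^ (t >> 32));    /* only the low, jittery bits carry any entropy */
        /* ... normal RX/TX handling ... */
    }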
<graphitemaster> The Freenode situation is getting so much worse the more I'm hearing about it.
<graphitemaster> Yet there are still people over there :|
<heat> what's the news?
<doug16k> you've all seen this already, right? https://isfreenodedeadyet.com/
<bslsk05> ​isfreenodedeadyet.com: isfreenodedeadyet
<graphitemaster> New staff are a bunch of horrible people basically. Signing the other RMS document to keep him in power. The other claims I have no proof for but one is apparently a massive tranphobe and has already banned trans people on the network. Another impersonates people and is harassing those who signed the RMS "stand down" document. And also there's something about making FN a "incel inclusive" server because incels are being banned on other
<graphitemaster> networks.
<kazinsal> Lmao, they're masks off about being terrible "well, you see, in *thailand* the age of consent..." types
<heat> oof
<heat> I should delete my account
<doug16k> wow
<kazinsal> if that crystalmath guy shows up here he's probably going to start bragging about freenode taking a stand about cancel culture
<kazinsal> I can feel the weirdo cryptolibre already
<graphitemaster> Yeah I dunno how true the claims are. I'm just picking up comments on Twitter and server mods on other networks.
<bslsk05> ​twitter: <mjg59> Fucking incredible one of the new Freenode staff members is Chris Punches, who stole someone's identity in order to harass people who signed the petition asking for RMS to stand down: <github.com/rms-support-le… https://t.co/YUbDNIuan1>
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
<kazinsal> I'm generally inclined to believe the libera/oss-twitter people more than the freenode/bitcoin/darknets people
<gog> can't ban the trans people that already fled :p
<gog> also shame about crystalmath
<graphitemaster> It would turn out to be pretty funny if the whole corporate takeover of freenode just turned out to be a bunch of the cancel cancel culture types / RMS did nothing wrong / "actually the age of consent is" types.
<kazinsal> the new owner is a bitcoin billionaire who has run IRCs for offsites started by people who were so atrocious they got banned en masse from reddit
<kazinsal> he was in charge of the voat irc for most of its existence (including reusing its TLS configs for new-freenode), well known as the place that hundreds of thousands of qanon cultists migrated to from reddit
<kazinsal> the man is basically an alt-right-tech financier
<kazinsal> anyone who defends freenode nowadays is as good as fertilizer to me
<gog> lmao voat
<gog> reddit is bad enough, i couldn't imagine having gone to voat
<heat> neckbeard^2
johnjay has joined #osdev
<johnjay> libdl is referring to libc stuff or gcc internals? the former right?
<heat> yeah libc
<heat> you get the dl* family of functions
tricklynch has joined #osdev
<heat> in glibc, that is. some libc's don't have a libdl or even libm, libpthread
<heat> musl for example keeps everything in libc.{a,so}, even the dynamic linker is just a symlink to libc.so
<gog> neat
<heat> fun fact: glibc's libc.so.6 is executable
<doug16k> why?
<heat> because it is lol
<doug16k> lol
<renopt> why not :P
<kazinsal> sometimes you just gotta do weird shit for the oss cred
<heat> you get some info on the glibc's version and features and stuff
<doug16k> yeah I keep thinking I should figure out symbol versioning sooner rather than later
<doug16k> never used it before
<johnjay> GNU C Library (Debian GLIBC 2.31-12) stable release version 2.31.
<johnjay> Copyright (C) 2020 Free Software Foundation, Inc.
<johnjay> haha it is
<doug16k> haha, neat. TCG says it supports "self snoop" because you don't need to flush the cache when changing memory types
<doug16k> I don't know of any that support self snoop in real life
<doug16k> I put in an if (!cpuid_has_self_snoop()) cpu_flush_cache(); after setting PAT and it was surprisingly skipping the flush
lleo has quit [Ping timeout: 272 seconds]
mctpyt has quit [Ping timeout: 272 seconds]
tricklynch has quit [Ping timeout: 265 seconds]
GeDaMo has quit [Excess Flood]
tricklynch has joined #osdev
GeDaMo has joined #osdev
tricklynch has quit [Ping timeout: 264 seconds]
Skyz has joined #osdev
tenshi has quit [Quit: WeeChat 3.1]
<Skyz> Is there anyway to make money with free software?
<j`ey> sell support
<doug16k> printf "int main(){}" > money.cc && make money
<GeDaMo> Do you want to limit yourself to legal means? :|
<geist2> use free software to make a thing to sell
<klysm> if it's really good you could make movies with it
<Skyz> Actually wanted to make a movie once on my own
<gog> open-source plans for a 3d printer and pistol
<Skyz> Realized that would be a colossal effort
<gog> then rob a liquor store
<Skyz> Does it actually have to be free software?
<Skyz> Who heard of free cars or free houses?
<doug16k> free as in freedom
gmacd has joined #osdev
<klysm> if you don't like it go sign up for adobe cloud and make a movie that way
<doug16k> you are free to read it, modify it, and give modified version to your friends
<Skyz> I see, free as in freedom is a good ideal
<Skyz> a good motto
<doug16k> so yeah, I am free to get a car, modify it, and give it to my friends
tricklynch has joined #osdev
<sortie> Making money with printf is considered counterfeit
<Skyz> I don't trust that at all
<Skyz> lol
<Skyz> Would be nice idea though
<Skyz> They had car kits
<geist> making money with sortix though is the highest form of money making
<geist> pinnacle of human development
<sortie> Sortix for US Treasury 2021
<GeDaMo> Well, there's that guy who gave up their job to work on their OS full time
<sortie> printf("$100\n");
<Skyz> The high art of money making
<bslsk05> ​awesomekling.github.io: I quit my job to focus on SerenityOS full time – Andreas Kling – I like computers!
<j`ey> he should work on sortix instead
<GeDaMo> Living the dream slash nightmare :P
<sortie> Think about it. No performance reviews. No management. No corpspeak.
<froggey> hmmmm
<froggey> "corpspeak" sounds like corpspeak
<doug16k> this is why we need klaxons
<heat> m a t e r i a l d e s i g n
<geist> a s t h e t i c s
<Skyz> E L 0 S E C U R I T Y
<klysm> class money* bucks = new money( "$100", 100.0 ); printf( "%s", money.value() );
<doug16k> I just realized my code that sets PAT MSR needs to do way more stuff
<doug16k> you have to freak out and disable cache fill, writeback/invalidate cache, flush tlb, then disable MTRR, then set it, then undo it back to normal
<klysm> s/\./->/
<doug16k> and the other cpu better not be running anything beyond a spinloop, also with its cache not filling and its MTRRs off
<doug16k> that is easy to arrange though, I'll make the cpu that sends IPI go into no-cache before sending SIPI. it won't get notified of SMP ready until after the other cpu set its PAT
<doug16k> AP ready I mean
<doug16k> it's funny how utterly disabled the cache is. it isn't even trying to fill, and the MTRR being off makes all memory UC anyway
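A sketch of the sequence doug16k just walked through (roughly the SDM recipe for changing memory types); the cpu_*() helpers are hypothetical wrappers, and every CPU has to run it while the others sit in a no-cache spinloop.

    #include <stdint.h>

    void cpu_irq_disable(void);  void cpu_irq_enable(void);
    void cpu_cr0_set_cd(void);   void cpu_cr0_clear_cd(void);
    void cpu_wbinvd(void);       void cpu_flush_tlb(void);
    void cpu_mtrr_disable(void); void cpu_mtrr_enable(void);
    void cpu_wrmsr(uint32_t msr, uint64_t value);

    void set_pat(uint64_t new_pat)
    {
        cpu_irq_disable();
        cpu_cr0_set_cd();          /* stop cache fills */
        cpu_wbinvd();              /* writeback/invalidate caches */
        cpu_flush_tlb();
        cpu_mtrr_disable();        /* all memory is effectively UC while this is off */
        cpu_wrmsr(0x277, new_pat); /* IA32_PAT */
        cpu_mtrr_enable();
        cpu_flush_tlb();
        cpu_cr0_clear_cd();
        cpu_irq_enable();
    }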
<Skyz> What store would sell operating systems?
<Skyz> Off-question/
chartreuse has joined #osdev
<Skyz> SMP is something every os needs now
<klysm> skyz, a reseller for chips bought off digi-key
<doug16k> the amount an OS needs to do grows naturally as more hardware needs to be shared and abstracted
<junon> unless it has a driver system, unless you count drivers as part of the OS
<doug16k> picking which thread to run and which page gets to be in ram is the main job of OS
<doug16k> it boils down to fetching instructions that read and write data
mahmutov has quit [Quit: WeeChat 3.1]
<doug16k> we expect lots of programs to be able to share the extremely powerful hardware we have now
mahmutov has joined #osdev
mahmutov has quit [Client Quit]
<Skyz> But what types of programs
<Skyz> Toy applications?
<klysm> apps, databases, networks, systems, configurators, viewers, compilers
<Skyz> I would be really interested to see a hobby os in a car
<geist> yah it is kinda amazing how much things have changed in consumer world over the last 20-30 years re: multitasking
<geist> i remember back when it was novel and neat to just be able to forward/background something in windows 3.1 or even a dos TSR
<geist> though of couse someone will point out their amiga/unix machine/etc already did that
<bslsk05> ​en.wikipedia.org: List of car manufacturers of the United Kingdom - Wikipedia
<Skyz> I find it interesting how many car manufacturers they have
<Skyz> Only a few are known
<Skyz> most of them are small companies
<Skyz> and many defunct
mahmutov has joined #osdev
GeDaMo has quit [Quit: Leaving.]
<clever> doug16k: ive hit a bit of a roadblock with that vectorized softfloat, it's getting too complex for me, combined with the fact that it's not actually going to be any faster than non-vectorized hard floats
<clever> doug16k: so i'm shifting gears, creating some portable c functions, that replicate the effects of the vector opcodes, limitations and all, to act like an SDK, and i could then use inlined functions to do the same task on a VPU
<moon-child> clever: have you seen simde?
<clever> moon-child: nope
<clever> moon-child: for reference, this is what i have: https://github.com/librerpi/lk-overlay/blob/master/app/float-tests/float.c#L34-L59
<bslsk05> ​github.com: lk-overlay/float.c at master · librerpi/lk-overlay · GitHub
<doug16k> clever, I like your plan though - making some intrinsics for yourself
<clever> the biggest roadblock, is that my mult opcode, only accepts 16bit inputs
<moon-child> https://github.com/simd-everywhere/simde it has pure-c implementations of all the simd intrinsics
<bslsk05> ​simd-everywhere/simde - Implementations of SIMD instruction sets for systems which don't natively support them. (115 forks/1087 stargazers/MIT)
<moon-child> might be useful
<clever> doug16k: yeah, the #gcc channel also said intrinsics would be better than proper vector support, because gcc doesn't really understand a matrix
<clever> moon-child: ahh, that sounds almost exactly what i want to do
<clever> implementing the intrinsics twice, as both pure-c (portable) and inline asm (faster)
<clever> the pure-c version, would then let people test out the algos, without needing a vpu toolchain and test target
gog has quit [Quit: bye]
<clever> one question i have on planning, how should the intrinsics accept inputs?
<geist> yah i think the general way of writing simd stuff is to use a huge pile of intrisics. it's kinda like assembly except the compiler is doing the busywork of register allocation and load/stores for you
<doug16k> chunks of that could be reused in telling gcc how to use it natively
<clever> should i give it a coordinate into the matrix, and a pointer to a matrix
<clever> or should i just give it a vector of 16 elements?
<geist> oh reminds me of SH-4: at the time (1999) it was the only cpu that had a straight matrix multiply instruction
<clever> coordinates, would carry over more of the real limitations/power, but require the user to handle register allocations by hand
<geist> though i *think* it was just a 1x4 * 4x4 -> 4x1 right? basically 4 of those in a row was a standard 3d transform
<clever> i think the VPU can do a 16x16 matrix mult in ~4 opcodes, if i understand the formula right
<geist> though actually i think it may have had a full 4 4x4
<clever> smaller, would require loading some constants to mess with the per-lane condition flags
<clever> maybe the matrix pointer, could just be TLS
<doug16k> having the one matrix across a vector like that isn't true vectorization
<clever> simpler api
<doug16k> true vectorization would be N threads each doing scalar things, where N is vector width
<geist> ah yeah FTRV instruction was a 4x4 * 4x1 it looks like
<clever> doug16k: internally, the matrix is basically a uint8_t[64][64], and you give it a coordinate, to select either a 1x16 or a 16x1 slice
<doug16k> yes but proper vector code doesn't slice anything
<doug16k> everything is entire register, hardly ever swizzle
<clever> so it would either span m[r][c] to m[r][c+15] or m[r][c] to m[r+15][c]
<clever> you can basically treat it as a collection of 256 different uint8_t[16]'s
<clever> with some built in rotation
<doug16k> you want "whole vector, whole vector, ..." not "did that little piece of a vector, did that little piece of a vector, ..."
<clever> the immediate encoding, doesnt allow you to pick a non-aligned slice
<clever> so its almost a hard mapping on where those 256 slices fit
<clever> when you bump up to 16bit, then there are only 128 uint16_t[16]'s that you can address
<geist> iirc the SH-4 sttrategy is not to have vector registers but to treat the two banks of 16 single precision fpu registers as either 4 vectors horizontally or vertically, or a single 4x4 matrix
<clever> doug16k: if you ignore its ability to rotate, then that is basically just 128 vector registers, each holding 16 x 16bit ints
<geist> was kinda nice to work with
<clever> doug16k: but knowing how the matrix works, lets you mix&match bit widths, do free high/low halfword slicing, and matrix rotation
<clever> so you could treat each slice, as a self-contained vector reg, but then you're losing features
<doug16k> what is the real width
<clever> bit width or lane width?
<doug16k> if you pipelined back to back mul add that are sufficiently independent, how wide before it all takes more cycles
<doug16k> ya is it actually 4-wide or what
<clever> my original speed test, basically did `a = a * 2;` in a loop, and measured it to be 2 cycles per set of 16 mults
<clever> so thats 16 lanes wide, 2 clock cycles, with each mult directly consuming the previous result
<doug16k> yes but there is a loop dependency
<doug16k> you stall yourself until mul latency elapses each iteration
<clever> ah, but this wasnt repeating right away
<doug16k> it might have been able to fit two or three muls in between
<clever> this was a REP64 opcode
<doug16k> ah
<clever> so it was more like doing an int[16*64] based mult, 16 at a time
alexander has quit [Quit: Goodnight o/]
<clever> and it measured 128 cycles
<doug16k> that's why I ask the real width. you would not generate crazy wide vectors if you were doing fully optimized AoSoA code
<clever> all signs point towards it being 16 wide, because it can only ever operate on vectors of 16 elements
<doug16k> 256 bit seems good and reasonable
<clever> it is capable of 32bit x 16, so each ALU input is 512 bits i think
<clever> but only for adds and basic boolean
<clever> writing up a pure-c implementation should serve as a much simpler way to document it all
<doug16k> yeah, you will be an expert in the ISA by the time you fully debug C code that emulates it :D
<doug16k> and get it to match real thing
<clever> one tricky problem though, is generating the inline asm
<clever> lets say i write a function/macro call like this: foo(1,2,3)
<clever> how can it turn into asm volatile ("foo HX(1,2), 3"); ??
<doug16k> easy
<doug16k> static inline. I have 100 or so in my project
<clever> got a link?
<doug16k> you can do compile time constants
<clever> for the asm version, it needs to be a constant, that becomes a string literal
<doug16k> did you mean registers when you said 1 2 3
<clever> a literal 1,2 in the inline asm
<doug16k> ok, yeah you can do that
<clever> asm volatile ("v32or HY(2, 0), HY(2, 0), HY(4,0) IFNZ");
<clever> i need to generate inline asm like this, from function args
thinkpol has quit [Remote host closed the connection]
<clever> but also accept those args as proper ints, so the pure-c version can index into the matrix
thinkpol has joined #osdev
<bslsk05> ​github.com: dgos/control_regs.h at master · doug65536/dgos · GitHub
<doug16k> it injects the dr number
<doug16k> should work for parameter if caller passes constant
<clever> ahhh, the "i" part!
<clever> for immediates in inline asm!
<doug16k> yeah
<clever> that should work out perfectly, and you solved my long-standing question of named args
<doug16k> I am a named inline asm argument enthusiast
<clever> does dr have to be a template arg?
<doug16k> don't think so
<clever> i'll try things both ways
<doug16k> if you pass a constant and it is inline as hell like that, should work
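A sketch of the pattern doug16k's control_regs.h uses, shown with an x86 debug register since that is what the linked file touches; clever's VPU opcode wrappers would take the same shape. The %c modifier pastes the named "i" operand into the mnemonic, so it has to fold to a compile-time constant after inlining.

    static inline void dr_write(int regno, unsigned long value)
    {
        /* regno must be a compile-time constant at every (inlined) call site */
        asm volatile("mov %0, %%dr%c[num]"
                     :
                     : "r"(value), [num] "i"(regno));
    }
    /* dr_write(3, addr);  // emits something like "mov %rax, %dr3" */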
<clever> templating kinda makes things simpler, since i can use a std::pair
<clever> one thing i didnt mention, is that matrix coords, have both an immediate, and a register component
<clever> HX(0,0)+r0
<heat> inline asm enthusiast is the scariest thing I've ever heard
<clever> the immediate, must be aligned correctly to the bit-width, but r0 then contains a pair of 6bit offsets, for the row/col, allowing non-aligned access
<clever> that also allows programatic movement of the algo
<clever> so i could have a function that operates on a 16x16 chunk of data, and then point it to a specific chunk at runtime
<clever> doug16k: do other cpu's allow you to change what regs a vector op acts on, at runtime?
<doug16k> yes
<doug16k> there have been ones where you can just set an arbitrary number for vector width
<clever> ah, so its not that special
<clever> yeah, ive heard that the new arm specs, allow that
<doug16k> it's very forward compatible
<geist> yep, arm SVE and the new riscv vector stuff uses that sceheme
<doug16k> someday when it is 1024-bit wide, it will run DOOM at 2400fps instead of 1400
<geist> hardware supports vectors up to N, user space can dynamically set the width U <= N and then do a series of ops
<geist> actually one more level of abstraction: hardware has H bits, kernel enables K bits of it (whatever it's willing to context switch) and user space can set U width on the fly
<geist> and U <= K <= H
<clever> yeah, that makes sense
<heat> (x/y/z)mm go brrr
<geist> i forget if ARM has a scheme to remember the highest water mark user space has used since the last time it was cleared, for essentially xsave-like optimizations
<j`ey> https://www.twitch.tv/asahigpu alyssa working on m1 gpu driver
<bslsk05> ​'asahigpu (live) 2021-05-30 21:32' by asahigpu (live)
<geist> riscv i think might? it actually for FPU stuff has a nice 2 bit scheme to remember not only if user space touched the fpu but if they wrote to it
<geist> 4 states I think? disabled, clean, accessed, dirty or something like that
<heat> j`ey, 0/10 not in a hot tub
<doug16k> that matches up with how my kernel sees fpu exactly :)
<geist> doug16k: yah it's nice. though *really* the read vs write states are kinda extraneous, but i guess they're basically 'free' to hardware so may as well
<j`ey> heat: lol
<geist> like, how often does user space code just read from the fpu and not write it
<geist> possible on riscv it may be more than i think, like some access of some fpu condition register in a read only mode in some software patterns
tricklynch has quit [Read error: Connection reset by peer]
tricklynch has joined #osdev
<geist> well, i guess it's less of a hardware point of view and more that read vs dirty states are useful if you don't want to lazy fpu save, but you'd at least like to know if user code never wrote to it
<geist> so you can context restore it, set the 'read' state and then on the next context switch not bother saving it if they never wrote it
<heat> i thought the standard was to not do any lazy fpu?
<geist> but you don't want to bother doing the classic scheme of leaving it disabled, trapping, and restoring it there
<geist> heat: depends on the arch. classically speaking most arches have been lazy fpu saving forever
<doug16k> it works correctly on amd
<geist> intel kinda ruined it recently by having a spectre thing
<geist> but... x86 also has the very rich and powerful xsave stuff which also hyper optimizes it so much that it's kinda 'free'
<heat> I remember linux ditched lazy fpu for x86 in 2015, maybe 2016?
dennis95 has quit [Remote host closed the connection]
<heat> not sure, but around that time
<geist> possibly only if has xsave
<doug16k> yeah they made xsave/xrstor lazy and able to know whole chunks of context are zeros
<geist> or xsaveopt or whatever which one the current one is
<geist> but then spectre says we cant have nice things so i think that nailed that particular coffin
<doug16k> xsaves is best, with xsaveopt close behind
dennis95 has joined #osdev
<geist> but, on risc or pseudo risc arches like arm and riscv you have to do it all manually
<geist> so you probably want at least one level of 'did user space even touch it?' fpu save
<heat> defaulted to no-lazy for every x86 cpu in early 2016
<geist> not necessarily full 'trap and lazy restore' but more like 'leave disabled, trap so i know its dirty'
<heat> before that it was xsave-only
<geist> heat: well okay then!
<clever> doug16k: hmmm, another templating problem, i need 3 versions of a function, 8bit, 16bit, and 32bit, then need to pick the right uint8_t for internal usage, but also insert an 8/16/32 literal in the asm...
<clever> doug16k: maybe sizeof(t)*8 as a const expr?
<doug16k> you could use overload resolution
<doug16k> do you have integral_constant
<clever> no idea
tricklynch has quit [Ping timeout: 268 seconds]
<geist> so on fuchsia for example we do a partial fpu lazy saveon arm. we dont leave old state from previous threads on it, so we always fpu save if dirty
<heat> "It seems that, on any remotely recent hardware, eagerfpu is a win: glibc uses SSE2, so laziness is probably overoptimistic, and, in any case, manipulating TS is far slower that saving and restoring the full state. (Stores to CR0.TS are serializing and are poorly optimized.)"
<geist> but i think we delay the loading until a trap
tricklynch has joined #osdev
<geist> and there's a TODO to see if that's even worth it
<geist> right, it also has a lot to do with how user space uses the fpu. so on x86 yah SSE is used like crazy so there's kinda no point
<geist> arm64 i think the vector bits are used a bit less aggressively...
<geist> OTOH last time i timed the full vector load/store on a recent ARM core it was pretty fast
<geist> like 20 cycles or so? so really it's no big deal
arch is now known as archenoth
<geist> really blatting out a large chunk of sequential registers is what modern cpus crave so
<doug16k> clever, https://github.com/doug65536/dgos/blob/master/kernel/lib/cc/type_traits.h#L28 then add a parameter integral_constant<sizeof(uint32_t)>::type parameter to each variation, and call it with integral_constant<sizeof(whatever)>::type() in that place
<bslsk05> ​github.com: dgos/type_traits.h at master · doug65536/dgos · GitHub
<doug16k> is that what you mean?
<geist> heat: i think the key would be something like avx512 vs SSE. *however* by the time you get cpus with avx512 you have xsave which can optimize for not saving/restoring more than was dirtied
<doug16k> you could make it nice to read by typedefing the different sizes to use for that parameter that is just there for overload selection
<geist> so it's all good
<clever> doug16k: maybe, let me get an example ready...
<geist> on something like ARM where SVE can end up with even more state, if there's not hardware tracking support for how much of the upper registers were touched, then it's back to some sort of trap-n-track solution again
<geist> i can see a scheme where you report to user space that there are 512 byte vectors, but then disabling it to 128, and trapping when user space actually tries to use upper bits
<geist> then, bumping some water mark on the thread, maybe allocating more state, and enabling that much register
<doug16k> sorry, integral_constant<size_t,sizeof(whatever)>
<heat> geist: i'm not entirely sure what linux does for avx512 but I would assume it's the Intel Sanctioned(tm) way to do fpu save/restore with it
<heat> considering they probably had that in mind
<doug16k> this does all kinds of song and dance to boil it down to being so many bits, then does the right l or ll nonsense: https://github.com/doug65536/dgos/blob/master/kernel/lib/bitsearch.h
<bslsk05> ​github.com: dgos/bitsearch.h at master · doug65536/dgos · GitHub
<heat> maybe xsave is still crazy fast with 512? who knows
tricklynch has quit [Ping timeout: 268 seconds]
tricklynch has joined #osdev
<doug16k> like line 103
<geist> heat: i think xsave just does what you want
<geist> it tracks which parts of the registers are dirty, etc
<geist> and code is encouraged to use xzeroupper/etc which xsave can pick up on
<heat> linux doesn't do that though
<doug16k> clever, line 176 magically calls the right one
<heat> I think we had reached that conclusion
<geist> really its all about having to allocate that much save state for each thread. one of the reasons we haven't added support for avx512 yet in zircon
<heat> at least for vzeroupper
<geist> it's a TODO task but
<geist> heat: hmm, in what case?
<heat> geist, on the syscall path
<geist> i'm talking about a generic context switch. like you preempted user space and it was doing something
<clever> doug16k: ah, one template, calling another template, but using sizeof to fill in the gaps
<doug16k> yeah
<geist> ah yeah but context switch works the same if it came out of syscall or a preemption
<doug16k> and parameter type with unused value selecting which overload
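A compressed sketch of the tag-dispatch doug16k is suggesting; the function names and the 8/16/32-bit bodies are placeholders.

    #include <cstddef>
    #include <cstdint>
    #include <type_traits>

    // an unused parameter whose type encodes sizeof(T) picks the overload
    static inline void matrix_write_impl(int, int, uint32_t,
                                         std::integral_constant<std::size_t, 1>) { /* 8-bit path */ }
    static inline void matrix_write_impl(int, int, uint32_t,
                                         std::integral_constant<std::size_t, 2>) { /* 16-bit path */ }
    static inline void matrix_write_impl(int, int, uint32_t,
                                         std::integral_constant<std::size_t, 4>) { /* 32-bit path */ }

    template <typename T>
    static inline void matrix_write(int x, int y, T value)
    {
        matrix_write_impl(x, y, value, std::integral_constant<std::size_t, sizeof(T)>());
    }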
<geist> but you're right, you'd think linux would vzeroupper and they dont which is lame.
<clever> doug16k: let me start the code, and try some things, then maybe ask for help...
<geist> on syscall
<heat> yeah only mentions vzeroupper on crypto code
<geist> right, because that's the only real part where the kernel actively uses the vector bits
<geist> otherwise it just 'passes them through' from user space
<geist> and the context switch routine saves it as basically user state
Matt|home has joined #osdev
<doug16k> right but there is a long stall penalty for transitioning between 128 and 256 bit operation, if it is 256 maybe you should leave it
<doug16k> sometimes that would help too though
<doug16k> I think it has to wait for all the 256 bit aware vex stuff to retire before it can begin the "assume upper is zero" code
<doug16k> maybe zeroing upper wouldn't affect that actually
<doug16k> would just cause better init optimization
<doug16k> my stuff guarantees zero fpu on every syscall return
<clever> i just remembered a weird situation i discovered many months ago
<doug16k> if your syscall got preempted I don't save it
<clever> a process was consuming 100% cpu, and strace said it was doing nothing at all
<clever> and it remained like this for over 10 minutes
<clever> after poking around with gdb and getting a backtrace, i found the cause
<clever> compression
<clever> it was ram->ram compression, with pre-allocated buffers, so it never had to do a single syscall
<Skyz> Was a basic kernel a good idea
<Skyz> or is c the only good choice?
<doug16k> best thing is, when pthreads gives up and futex blocks, it doesn't save it. and when futex wait wakes up, it doesn't restore it
<Skyz> subjective
<doug16k> it zeros it on way back to user
<Skyz> Do you use qemu?
<doug16k> of course it does preserve fcw and mxcsr
<doug16k> I do yes
<doug16k> language doesn't matter
<doug16k> if it mattered then we wouldn't be using the same one so widely for so long
<doug16k> the one that stops you least wins
<doug16k> it's like thinking we can make buildings never collapse if only we make the perfect way for architects to write the design down
<Skyz> Well, I would like a job working on an OS
<Skyz> but I'm kinda C-illiterate
<Skyz> I can't write C code well
<Skyz> Taking some classes to get it down
<Skyz> As for what a user wants
<Skyz> they want a GUI most of the time
<Skyz> I'm thinking of writing some tutorials
dennis95 has quit [Quit: Leaving]
<travisg> please dont.
<Skyz> There's other things besides tutorials I can do
<Skyz> I'm kinda emulating Fravia+
<Skyz> Wanting to see if there is an ultimate destination for applications
<Skyz> I'll hold off
<geist> what the heck is fravia?
<heat> i was super confused
<heat> turns out I /ignore'd skyz
<j`ey> lol
<graphitemaster> Making money with free software is like making money as a musician. It's possible, but most of the time it's not about the content so much as it's the celebrity.
<Skyz> Well
<Skyz> Open SOurce software doesn't have to be free software
<Skyz> free as in free
<moon-child> no, but you're even less likely to make money on non-free oss than on free oss
<Skyz> Somehow I missed the point somewhere
<Skyz> Fravia is a reverse engineer
<Skyz> He is grey hat
<Skyz> Been working on trying to do something that is for the protection of software
<geist> heat: yah i had too
<geist> but my other clients were seeing it and i was like oooh
tricklynch has quit [Read error: Connection reset by peer]
tricklynch has joined #osdev
mahmutov has quit [Ping timeout: 268 seconds]
<klange> Skyz: Your continued endeavour of hopping from platform to platform, community to community, making zero sense and demonstrating zero knowledge of anything you are asking about has reached a new level of annoyance that my local authorities will doubtless qualify as harassment.
<Skyz> No harassment intended
<heat> when are .eh_frame and .eh_frame_hdr relevant?
<geist> skyz is like the libyians in back to the future
<doug16k> heat, stack traces and exception unwind
<geist> you think you lose them and then he shows up in a vw van with a rpg
<heat> doug16k: but in-process unwinding or debugger?
<doug16k> _hdr provides a lookup table that speeds up lookup of relevant cfi records for a given pc
<doug16k> debugger and in process if runtime unwinding like a fancy longjmp that calls landing pads, or full c++ landing pads
<doug16k> you can make it so C can call C++ that has landing pads that calls C, and if that C longjmps right over C++ it will clean up
Skyz has quit [Quit: Client closed]
<doug16k> _Unwind_ForceUnwind
<doug16k> so yeah even C uses it
<heat> I'm seeing zircon does -fno-unwind-tables and linux does -fno-asynchronous-unwind-tables
<doug16k> but if you don't force unwind at runtime ever you could discard it
<heat> i'm struggling to understand why
<doug16k> yeah it is turning off support for what I described
<doug16k> it means don't allow foreign exceptions to propagate through the code
<doug16k> so generate potentially a lot less CFI
<heat> what foreign exceptions?
<heat> C++ exceptions?
<doug16k> yes or any language
<doug16k> the way the abi works, all languages can do their own thing and everyone can invoke it
<doug16k> no-asynchronous-unwind means "please don't support full exception unwind as if I were C++"
<heat> and -fno-unwind-tables?
<doug16k> never heard of it
<heat> all I want is to have so debug info for the debugger to look at, I don't want to use any at runtime
<doug16k> then you want -fno-asynchronous-unwind-tables and -fno-exceptions
<geist> maybe no-unwind-tables is stronger? i don't think it was inherited from LK
<doug16k> maybe arm exception abi thing?
<geist> possible
<geist> heat: is it in one of the two compiler paths and/or arch specific section?
<doug16k> oh I found it
<doug16k> unwind-tables means just generate the data but don't affect codegen with unwind
<heat> geist, no
<bslsk05> ​cs.opensource.google <no title>
<geist> oh well there's a whole comment about it
<heat> says it keeps asynchronous unwind tables but discard eh_frame
qookie has quit [Ping timeout: 265 seconds]
<geist> does actually remind me. after years of completely eschewing C++ exceptions
<geist> how bad is it really? (codegen and usability)
<doug16k> looks like making the unwind data not runtime data makes it go into .debug_frame
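A hedged summary of the flags discussed above (this matched heat's results with GCC; clang apparently kept emitting .eh_frame regardless):

    -fno-exceptions                   no C++ exception handling / landing pads at all
    -fno-asynchronous-unwind-tables   don't emit CFI precise enough to unwind at any instruction
    -fno-unwind-tables                drop the runtime unwind tables; with -g the CFI the
                                      debugger needs can land in .debug_frame instead of .eh_frame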
<clever> doug16k: https://gist.github.com/79143cb23a50d572b9d527c9ea479492 my first pass, its only tested to compile, but the code looks like it should do what i intend
<bslsk05> ​gist.github.com: simple-test.c · GitHub
<heat> doug16k, I tried both options and I still get huge eh_frames
<doug16k> heat, look at the cfi records to see where they come from
<doug16k> objdump --dwarf
<doug16k> pc=fffff.....
<doug16k> look up what
<doug16k> ...in the disassembly
<doug16k> sorry for speaking vertically
<heat> having fno-unwind-tables or not makes no difference in the section's size
<heat> no
<heat> problem
<heat> :)
<doug16k> mine shows the stuff in .eh_frame first
<doug16k> what you want is everything in .debug_frame
<heat> those options seem to make absolutely no difference
* heat tries with gcc
<clever> doug16k: now i need to use templates like your example, to dedup this... https://gist.github.com/cleverca22/79143cb23a50d572b9d527c9ea479492#file-vpu-support-purec-h-L27-L57
<bslsk05> ​gist.github.com: simple-test.c · GitHub
<doug16k> you can force instantiate each variation, so everyone can assume they can call the one instance, so it's as good as preprocessor hacking
<heat> oh it works with gcc, not with clang
<heat> is this gcc specific?
<doug16k> you have -fno-exceptions ?
<heat> clang never complains but the eh_frames are still huge
<heat> yes
<doug16k> that alone should go far to shut up with the cfi
<heat> with gcc I have eh frames of size ~0x30
<doug16k> my man clang only even mentions unwind in fexceptions
<heat> compared to several hundred KB
<doug16k> it isn't expected to even get touched if it is good program that uses exceptions correctly
<doug16k> demand paging would be a good excuse to say screw it and generate it
dutch has quit [Quit: Leaving]
<geist> right it's all about disk usage in a demand paged system
sortie has quit [Quit: Leaving]
dutch has joined #osdev
<clever> doug16k: first thing i notice, `movdqa 0xfb4(%rip),%xmm0`, gcc is vectorizing things for me!
<doug16k> clever, yes, you can make it try hard with -ftree-vectorize
<clever> doug16k: from the code in the gist i linked above, how would i select between MATRIX8_WRITE and MATRIX16_WRITE, based on the type of T?
<clever> just throw in some dumb if statements maybe? if (sizeof(T) == 2), and let const-expr eliminate the negative cases?
gog has joined #osdev
<doug16k> could do that
<clever> and that could itself be another static inline
<clever> with template
<doug16k> in newer C++ you can force it to be compile time with constexpr if
<gog> public static inline private constexpr
<doug16k> v16ld and v8ld could even not be inline
<clever> doug16k: what would i gain from not being inline?
<doug16k> if it is C code, being inline wouldn't help nearly as much as if it is emitting a real vector instruction
<doug16k> you might cause more cache misses than the call overhead saving
<doug16k> the real vector instructions will be compact
<clever> gist updated
<doug16k> they should be inline
<clever> i think its compiling down to just 2 vector opcodes right now, plus the normal prologue/epilog
<bslsk05> ​www.jaist.ac.jp: MOVDQA--Move Aligned Double Quadword
<clever> movdqa 0xfb4(%rip),%xmm0
<clever> doug16k: does the intel vs at&t plague extend even to the mmx opcodes???
<doug16k> no that is the same as normal
<doug16k> intel has movdqa
<clever> according to the docs, the 2nd argument is the source
<doug16k> usually you see movaps - single precision one
<clever> but looking at the asm, the first thing it does, is save an mmx reg (it never write to), to ram!
<doug16k> dqa is integer one
<geist> clever: of course. itd be even weirder if they flipped styles for new instructions
<geist> gotta at least be consistent
<doug16k> the docs are intel syntax. the second argument is normally the source
<clever> geist: so is the above a reg->ram or ram->reg operation? when looking at `objdump -d` with all defaults
<doug16k> last arg is destination even for avx
<geist> left to right in at&t, right to left intel
<geist> though i think it's a bit more subtle
<geist> more like A, B, C at&t
<geist> C, A, B intel
<clever> ah, so that plague does continue, and objdump doesnt agree with the intel docs i linked
<geist> since most opcodes on x86 are 2 address, it's less obvious
<doug16k> tell objdump to use intel syntax if that's a problem
<clever> doug16k: yeah, checking the --help now
<geist> otherwise just deal with it
<clever> -M intel-mnemonic now says `movdqa xmm0,XMMWORD PTR [rip+0xfb4]`
<doug16k> of course
<clever> now it agrees with the docs, and is a bit more verbose
<doug16k> says exact same thing as the at&t you said earlier
<doug16k> just way longer
<clever> yep
<clever> movups XMMWORD PTR [rax],xmm0
<doug16k> the assembler doesn't need your help knowing it is an xmmword ptr
<clever> my first guess, looking at this asm, is that its just a 16 byte ram->reg->ram copy?
<doug16k> where's on earth k thanks bye
<moon-child> ndisasm uses 'oword' rather than 'XMMWORD PTR'
<moon-child> which is a bit nicer
<doug16k> movaps loads 128 bits yes
<doug16k> so does movdqa
<doug16k> loads or stores
<clever> ah, so its more of a dumb uint128_t based mov, and how you interpret those bits, depends on what opcode you use later
<doug16k> must be aligned. there are "u" variants that work unaligned
<doug16k> right
<doug16k> it's a bag of bits in movaps movdqa world
<geist> i do wonder if on modern x86s the aligned/unaligned versions make any difference
<doug16k> no difference
<geist> is there an implied weak memory model on the unaligned stuff maybe?
<doug16k> there's a cpuid bit to see if u is worse
<geist> ah
<doug16k> newer will handle aligned and unaligned the same. older are slower on unaligned
<moon-child> doug16k: iirc movaps and movdqa function as a hint of some sort
<doug16k> tells it which domain it is
<moon-child> like if you're going to actually do floating ops you should use the float instruction, or int ops you should use the int instruction
<moon-child> but if you're just shuffling memory doesn't matter
<doug16k> there is a 1 cycle penalty when transition between integer and float domain
<moon-child> right
<moon-child> oh so in that case it's probably better to prefer the *ps instructions to the others?
<moon-child> because somebody else was most likely already using the simd regs for fp math, so you don't want to transition?
<doug16k> what matters is what domain the upcoming instruction that uses the value is
SlyFawkes has joined #osdev
<doug16k> if you movaps then srli then it's not good
<moon-child> uses the value, but doesn't matter if it writes to it?
<doug16k> if you movdqa then addps it is not good
<moon-child> like if I movdqa xmm3, whatever; addpd xmm3, xmm2, xmm1 does that pay the penalty?
<clever> now to implement a vst function, and test dumping the matrix contents...
<doug16k> yes
<moon-child> so it is better to use the fp instructions for shuffling memory, assuming the fp instructions are generally more common
<doug16k> oh it breaks the dependency though
<doug16k> what matters is the latency between two things in a dependency chain
<doug16k> wrong domain = 1 extra cycle of latency
<doug16k> you can interleave the two domains no problem in instruction scheduling
<doug16k> what matters is what domain that register is
<geist> honestly still surprised with ERMS that there's still some ability to move data faster with AVX in some situations, or so i have heard
<geist> seems like a proper erms internal implementation is just directly fed into the load/store unit
<moon-child> there's a startup cost I think
<moon-child> doug16k: hmm, can simd registers be renamed?
<geist> though i guess it still has to mess with the corresponding integer registers and whatnot because it can still be interrupted
<doug16k> yes
<doug16k> massively renamed
<moon-child> then if you're just writing to the register in the wrong domain, couldn't you rename to avoid the penalty?
<clever> template <typename T> static inline void matrix_write(int x, int y, T value)
<clever> how would i help gcc infer the return type? error: there are no arguments to 'matrix_read' that depend on a template parameter, so a declaration of 'matrix_read' must be available
<clever> oops, for template <typename T> static inline T matrix_write(int x, int y) {
<geist> yah that can't be implicitly deduced because return type
<geist> though some of the newer bits with auto maybe can?
<geist> i never know precisely how you can use auto in function declarations, so i usually just try and see and sometimes it surprises you
<clever> how would i specify it?, since i have the same T param one function call up
<doug16k> you can force it like matrix_write<decltype(some_expression)>(...
<doug16k> 2+2 would make int
<doug16k> *bad_things would be bad_type_t
<doug16k> reference?
<doug16k> or const reference if bad_things is a pointer to const
<moon-child> clever: I don't see how you could infer the return type. But you can make your call be matrix_write<T>(whatever, whatever>
<doug16k> that's why you see std::remove_reference<T>::type
<clever> dst[(stride*r) + i] = matrix_read<decltype(dst[0])>(x,y+i); was accepted by the compiler
<doug16k> you got it
<clever> ah, matrix_read<T> is also accepted
<clever> it wasnt before, due to typos
<doug16k> of course if you have it already, use it :)
<clever> i prefer <T> over dst[0], i want to give it a type, not a random element from an array of that type
<doug16k> if it were auto and your code really didn't know, you could use my decltype trick to escape it
<clever> vpu-support> include/vpu-support-purec.h:30:27: error: cannot bind non-const lvalue reference of type 'unsigned char&' to an rvalue of type 'unsigned char'
<clever> vpu-support> 30 | return matrix[x][y] | (matrix[x][y+16] << 8);
<doug16k> if you find yourself not having a clue what type, but you have an expression that is that type, you can use decltype
<geist> yah decltype is pretty much always 'the type of whatever this is' iirc
<doug16k> that's why I mentioned the remove_reference thing
<clever> yeah, i think i see why that worked now, its just returning the T type back out of T*dst
<clever> oh, i think i kinda see what the above problem is now
iorem has joined #osdev
<clever> the 16bit read, is being compiled, when T is 8bit
<clever> and const-expr hasnt eliminated that branch yet
<clever> the above, is under a case 2, of switch (sizeof(T)) {
<clever> i need a constexpr flag, to make it entirely abort the other case sections?
<klysm> movdqa xmm0,oword [rel 0xfbc]
<doug16k> use constexpr if you can yeah
<doug16k> it authorizes open season taking all assumptions about it
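A sketch of the if-constexpr version doug16k is pointing at, modeled on the matrix_read in clever's gist (the matrix array and widths are assumptions taken from the log). The branch whose condition is false is discarded rather than instantiated, so whatever it returns never has to be converted to a T it wasn't written for.

    #include <cstdint>

    extern uint8_t matrix[64][64];   // backing store, as in clever's gist

    template <typename T>
    static inline T matrix_read(int x, int y)
    {
        if constexpr (sizeof(T) == 1) {
            return matrix[x][y];
        } else if constexpr (sizeof(T) == 2) {
            // discarded (never instantiated) when sizeof(T) != 2
            return matrix[x][y] | (matrix[x][y + 16] << 8);
        } else {
            static_assert(sizeof(T) <= 2, "width not handled in this sketch");
        }
    }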
<heat> hmmm
<heat> why would -O2 generate bad debug info?
<doug16k> each newer version of C++ supports doing more impressive things in constexpr
<heat> -O0 works fine
<clever> `constexpr int s = sizeof(T); switch (s) {` didnt help
<clever> same error as i pasted above
<doug16k> heat, what does this say: your-cross-objdump --dwarf your-thing 2>&1 >/dev/null | wc -l
<heat> doug16k, O0 or with opt?
<doug16k> when bad debug info
<doug16k> that asks for all complaints about dwarf data to be sent to wc
<heat> 1
<heat> "x86_64-onyx-objdump: Warning: Location lists in .debug_loc section start at 0x180"
<doug16k> mine is 0
<heat> this problem only arises with clang
<doug16k> on system objdump, loads of warnings
<heat> gcc works okay
<doug16k> ah
<heat> I don't get a "oh this was optimised out and whatever", I just get garbage
<heat> the stack trace is accurate, the other debug info isn't
<doug16k> what if you use -g3 instead of -g
<heat> when I switched on O0, I get good values
<doug16k> or -ggdb
<heat> hold on
<doug16k> I have coerced screwy debug to work by playing with -g
<heat> nope
<heat> -g3 gives me garbage still
<doug16k> what if you turn off the fancy value tracking stuff
<doug16k> so it doesn't try so hard to always see register variables right
<heat> how do I do that?
<doug16k> -fno-var-tracking -fno-var-tracking-assignments
<clever> yeah, i'm just totally stuck
<doug16k> turns off heroic attempts to track register variables
<clever> no matter what i do, gcc refuses to let me do a <<8 with an uint8_t return type
<doug16k> clever, anything << 8 is zero if uint8_t
<doug16k> shift in 8 zeros from the right
<clever> doug16k: it was the decltype!
<clever> dst[(stride*r) + i] = matrix_read<T>(x,y+i); compiled
<heat> nope
<clever> dst[(stride*r) + i] = matrix_read<decltype(dst[0])>(x,y+i); failed
<clever> doug16k: and that <<8, was in a `if (sizeof(T) == 2)` block, so it would never run for uint8_t
<clever> but with decltype, it was being fussy
<doug16k> that's probably why I do it with overloads
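For reference, the decltype trap clever hit, sketched: dst[0] is an lvalue, so decltype(dst[0]) is T& rather than T, and matrix_read ended up trying to return a reference. remove_reference, which doug16k brought up earlier, strips that back off. The surrounding function is invented and it assumes the matrix_read template from the earlier sketch; only the assignment line mirrors the gist.

    #include <type_traits>

    template <typename T>
    static inline void copy_row_out(T *dst, int stride, int x, int y, int r, int i)
    {
        // decltype(dst[0]) is T&; remove_reference gets back to plain T
        using elem_t = typename std::remove_reference<decltype(dst[0])>::type;
        dst[(stride * r) + i] = matrix_read<elem_t>(x, y + i);
    }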