klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
<geist> long live unencrypted irc!
<geist> still can access over pretty old, mostly secure comptars
<clever> geist: had a chance to checkout ext4 yet? i'm curious how testing in qemu-arm would be, havent tried that route yet
<clever> does LK have any kind of fs tests, where you create an image with a set of known files, then hash them from LK?
<geist> no but would be fairly easy
<geist> can map a file system object with virtio-blk and mount it
<clever> yeah, that sounds like a far better way of testing, than booting it on a real system and writing images to sd cards
<clever> geist: i was just thinking, with how ext2 worked, i dont think it supports sparse files?
<Mutabah> ext4 adds extents support
<geist> i think it just puts a block 0 in the slot
<geist> which is also iirc why the block numbering is off by one (or 0 is unallocatable, or both)
<Mutabah> huh, neat
<clever> geist: ahh, right, forgot about that part, i did see LK code to support that
<clever> Mutabah: and ext4 extents also have a slightly wonky way of signaling if an extent is initialized or not
<clever> if the size is between 0 and 1<<N inclusive (i forget N), then the blocks behind that extent contain data
<clever> but if the size is over 1<<N, clear bit N, that range of blocks is allocated, but does not contain data for the file
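The extent-length encoding clever describes can be sketched as below. This is a minimal illustration, not LK's or the kernel's code: it assumes N = 15 (which I believe matches `EXT_INIT_MAX_LEN` in the Linux ext4 headers), and the function name `decode_extent_len` is hypothetical.

```python
# Assumption: N = 15, i.e. the threshold is 1 << 15 = 32768 blocks.
EXT_INIT_MAX_LEN = 1 << 15


def decode_extent_len(ee_len: int) -> tuple[int, bool]:
    """Return (block_count, is_initialized) for a raw 16-bit extent length."""
    if not 0 <= ee_len <= 0xFFFF:
        raise ValueError("ee_len is a 16-bit field")
    if ee_len <= EXT_INIT_MAX_LEN:
        # Lengths up to and including 1 << N: the blocks hold file data.
        return ee_len, True
    # Lengths above 1 << N: allocated but unwritten; subtracting 1 << N
    # (equivalent to clearing bit N here) gives the real block count.
    return ee_len - EXT_INIT_MAX_LEN, False
```

For example, a raw length of 32769 would decode to a single allocated block that holds no file data yet.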
ElectronApps has joined #osdev
theruran has joined #osdev
<gateway2000> nice
<Mutabah> I'm sorry you had to see that
<klange> wish he'd ignore our presence by not being here
<klange> gonna use a big hammer, one sec
<Mutabah> I work on the theory that he has an endless supply of VPNs and usernames, so might as well make it easier on ourselves
<kazinsal> I've had a /ignore *!~lamos@* on for the better part of the day and it's been kind of entertaining seeing someone op out of nowhere and kick him then deop
<klange> Yeah, we don't have the luxury since we need to see it before he starts going off pinging random users with obscenities.
<kazinsal> The joys of community maintenance
elastic_dog has joined #osdev
Izem has joined #osdev
Vercas has joined #osdev
GeDaMo has joined #osdev
pretty_dumm_guy has joined #osdev
sortie has joined #osdev
pretty_dumm_guy has quit [Ping timeout: 250 seconds]
pretty_dumm_guy has joined #osdev
gog has joined #osdev
diamondbond has joined #osdev
regreg has quit [Read error: Connection reset by peer]
<vin> zid Bitweasil haha I am not mining chia. Just some basic research on building a cluster that can run HPC workloads
<vin> I haven't built a cluster of servers before so I am being extra careful about the parts I am considering
<vin> I see processors have max memory size (usually 1TiB), why is this a limitation when the address space is 48 bits? If it's iMC why can it only support N number of dimms per channel?
<zid> fuck knows, it makes no sense to me either, my cpu has like 275GB max which isn't even a power of 2
<vin> Almost seems like a pricing trick so intel can sell same chips unlocked with larger capacity. Most xeon processors come with a M variant (expanded memory) that supports 4 TiB
Vercas has quit [Remote host closed the connection]
<Bitweasil> zid, probably triple channel or something.
<Bitweasil> vin, without knowing workloads, it's kind of pointless to build a box.
<Bitweasil> And careful on the M variants, some of them have really weird memory requirements.
<vin> Bitweasil: It's a simple workload as of now, processing 100TB of data as quickly as possible.
<vin> I am coming up with designs that'll need the presidents card and system that is reasonably economical
<Bitweasil> I hate to be a dick here, but what do you mean by "processing"? Is the goal simply to stream 100TB of data off disk and into RAM, or to do compute on it, or feed it to GPUs, or over the network, or... what are you *doing* with 100TB of data?
<Bitweasil> Are you going to be disk read rate limited, CPU limited, RAM bandwidth limited, GPU limited, ?
<vin> No GPUs at this point yet just sort that 100TB
<Bitweasil> Ok, so pure NVMe box? Or spinning rust storage, NVMe working, ?
<vin> I am thinking expecting to be disk read read rate limited
<vin> Yes I am considering pure pmem, pmem+nvme, pmem+nvme+disk
<vin> since there will be no point in loading 100TB on a single machine, it will be a cluster. So I am also expecting to use Mellanox infiniband NICs not sure if I should go with 40GbE or 100~GbE yet
<vin> One way to save money would be avoid the infiniband switch byt directly attaching the machines with qsfp with optical fabric (if I can affort)
<Bitweasil> Your budget exceeds my useful experiences by far enough that I have no useful advice.
<vin> wow I switched to moonlander last week and my letters are all over, sorry about that
<vin> Bitweasil: but if limited by your budget how would you build a machine for this workload?
<Bitweasil> Just for sorting? Sounds like my rainbow table merge box.
<Bitweasil> I'd put about 200TB of disk in a box with a bunch of fast RAM, rely on memory mapped files and hinting, and sort it.
<Bitweasil> That was purely spinning rust, SSDs weren't a thing yet.
<Bitweasil> Well, not an affordable thing.
<Bitweasil> Depends on what you're merging.
<Bitweasil> And sorting.
<Bitweasil> I'd likely break fragments into sorted inputs, then do a big merge.
<vin> Yea much like external merge sort
<Bitweasil> Which is roughly what I did with the RTs, I had sorted fragments coming in and sorted them, IIRC it was a dual Westmere box or something.
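The "sorted fragments, then one big merge" approach discussed above is the shape of an external merge sort. A toy sketch follows, assuming integer keys and one temp file per run; the name `external_sort` and the `run_size` parameter are made up for illustration, and a real 100TB job would need binary records, large sequential I/O, and probably multiple merge passes.

```python
import heapq
import tempfile


def external_sort(values, run_size):
    """Sort an iterable of ints using sorted on-disk runs and a k-way merge."""
    runs = []
    buf = []

    def flush():
        # Sort the in-memory buffer and write it out as one sorted run.
        buf.sort()
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{v}\n" for v in buf)
        f.seek(0)
        runs.append(f)
        buf.clear()

    for v in values:
        buf.append(v)
        if len(buf) >= run_size:
            flush()
    if buf:
        flush()

    try:
        # heapq.merge streams the runs, keeping one pending value per run,
        # so every run file is read sequentially from start to end.
        yield from heapq.merge(*((int(line) for line in f) for f in runs))
    finally:
        for f in runs:
            f.close()
```

Calling `list(external_sort(data, run_size=1_000_000))` would sort `data` while holding at most `run_size` values in memory during the run-building phase.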
<Bitweasil> I don't know how much fast storage you can cram in a modern box, though.
<vin> but weren't you limited by the small number of pcie lanes and end up wasting a lot of cores
<dzwdz> 200tb of ssd storage is affordable now?
<Bitweasil> dzwdz, if you're talking about Infiniband interconnects, probably?
<Bitweasil> *shrug* I was mostly CPU limited on my merge sorts.
<zid> depends who you're talking about affording it
<Bitweasil> I wasn't in a hurry.
<zid> small business yes, me no
<Bitweasil> And it was just spinning disk, so random IO was horrid, I did everything I could to turn my workload into streaming IO.
<vin> dzwdz: I don't think so. 2 TB costs 380 so 200 TB will be 3800
<zid> and depends if you mean garbage kingston ssds, or optanes
<Bitweasil> ...
<zid> nice zero counting
<Bitweasil> You missed a zero.
<Bitweasil> I mean, if you need that kind of box, $40k in disk isn't that bad.
<dzwdz> OH i mixed up the price per GB with the price per TB ^^'
<vin> lol yes
<dzwdz> i was off by a few orders of magnitude there
<dzwdz> yeah 200tb is definitely affordable now
<zid> not in optanes it isn't
<dzwdz> it is in garbage kingstons
<zid> ye
<Bitweasil> Broadberry has a server I can configure up.
<vin> 380 for Samsung 980 Pro which are fast nvme ssds
<Bitweasil> Supports up to 24x drives.
<zid> 'fast'
<Bitweasil> And there's a 15.3TB Intel, so 24 of those is 360TB.
<vin> 7/5.1 GB/s seq read/write
<zid> they're fast compared to rust
<zid> compared to optanes they're not
<Bitweasil> Their Cyberstore 248n.
<Bitweasil> They've got some Optanes too.
<vin> zid: optane ssds have the same speed
<zid> are you tunnel visioning on sequential read?
<zid> optanes can do 1M iops
<Bitweasil> Anyway, my point is that 200+TB in a single system isn't infeasible.
<Bitweasil> Not cheap, that box was $80k without adding appropriate RAM and CPU, but still.
<vin> optane ssds have random read/write 1.5M iops and samsung 980 pro has 1M
<zid> the probably will be that you'll need a server board to have that much io
<zid> the problem* wow typing
<zid> which means dealing with a single board order as a random civilian
<zid> or getting a shipment that's been broken into single boards, with no warranty from supermicro
<vin> Yea if I buy big processors then I will for sure be disk bound wasting cycles. Which is why I am inclined towards multiple machines with smaller disks
<zid> raaaaid
<zid> if your data is over 10 optanes you get 10x the read perf
<zid> even without losing any capacity
<vin> Yes I will be striping data across disks but pcie gen 4 has 32 GB/s and usually 2 slots on the motherboard so I can't do more than 64 GB/s anyway. Unless I buy an expansion card
<zid> you will need a server board with more io
<zid> and erm.. most desktop boards will do a lot more than that anyway
<zid> pci-e 4.0 is 16GT/s per lane
<zid> per direction
<vin> https://www.supermicro.com/en/products/motherboard/X11DPX-T this has pcie 3 and 2 x16 slots so that will be a max of 32 GB/s right?
<bslsk05> ​www.supermicro.com: X11DPX-T | Motherboards | Products | Super Micro Computer, Inc.
<zid> pci-e is 8 per lane afaik
<zid> pci-e 3.0*
<mxshift> slots aren't always fully used electrically
<zid> that's what it being a x16 slot means
<vin> what is 8 per lane? Yes those are single direction speeds but I don't understand the bidirectional speeds though. The signal can either be from one end or the other
<zid> it's a lane not a wire
<zid> two wires. tx and rx
<zid> it's full duplex
<vin> ahh
<clever> zid: 4 wires, they are each a differential pair!
<clever> max packet size is also a factor
<clever> one sec
<zid> that's why the high iops is hard on those ssds, lot o packets
<bslsk05> ​www.raspberrypi.org: What happens if you add pci=pcie_bus_safe to cmdline.txt on a Pi 4? - Raspberry Pi Forums
<vin> So I can read and write at the same time to multiple disks effectively using 64 GB/s on a pcie gen 3 x16?
<clever> each pci-e packet, has 24 bytes of headers, plus a max of 2^N bytes of payload
<clever> [root@amd-nixos:~]# lspci -vvv | grep -i max
<clever> MaxPayload 128 bytes, MaxReadReq 128 bytes
<clever> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
<clever> the max payload size, is based on what both the pci-e controller, and pci-e device are capable of
<clever> the bios should negotiate the largest packet size possible
<zid> where do you get this 64 number from
<vin> 32*2?
<clever> in the first example i pasted, for every 128 bytes of useful data, you lose 24 bytes to headers
<zid> pci-e is 16GT/s *per lane*, you have 16 lanes. I think pci-e 3.0 is 8 not 16
<zid> so 8*16
<zid> *2 if you want full duplex
<clever> so take your byterate, and /(128+24)*128, to get the byterate after headers
<zid> so like 200GB/s+ per slot if you take it full duplex
<clever> the thread i linked, is about the rpi firmware not raising the MaxPayload correctly, so its got higher overheads then it should
<zid> than
<zid> comparative
<clever> and its late here, i should get some sleep!
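clever's header-overhead rule of thumb (byterate / (128+24) * 128) can be written out as a small sketch. The 24-byte per-packet overhead is the figure quoted in the conversation, not something checked against the spec here, and the function names are hypothetical.

```python
def payload_efficiency(max_payload: int, header_bytes: int = 24) -> float:
    """Fraction of link bytes that carry payload for full-size packets."""
    return max_payload / (max_payload + header_bytes)


def effective_rate(raw_rate: float, max_payload: int, header_bytes: int = 24) -> float:
    """Scale a raw byte rate by the payload fraction (same unit in and out)."""
    return raw_rate * payload_efficiency(max_payload, header_bytes)


if __name__ == "__main__":
    for size in (128, 256, 512):
        print(size, f"{payload_efficiency(size):.1%}")
```

With MaxPayload at 128 bytes roughly 84% of the bytes on the link are payload; at 256 bytes it is roughly 91%, which is why an unraised MaxPayload costs throughput.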
<vin> zid sorry I don't think I understood this correctly https://www.trentonsystems.com/blog/pcie-gen4-vs-gen3-slots-speeds this table says a pcie3 x16 slot supports 16GB/s
<bslsk05> ​www.trentonsystems.com: PCIe Gen 4 vs. Gen 3 Slots, Speeds
<vin> So if that's single direction then with full duplex it must be 32?
<zid> okay it's counting T as b apparently, that explains it
<vin> thanks clever that's very interesting.
<zid> so /8
<vin> wait it's B not b
<bslsk05> ​en.wikipedia.org: PCI Express - Wikipedia
<mxshift> PCIe gen3 has 8GT/s per lane
<vin> thanks mxshift the article lists the same numbers
<mxshift> accounting for encoding overhead, that works out to 0.985 GB/s per lane
<zid> yes T=b, so T->B is /8
<mxshift> zid: except you also have encoding overhead
<zid> not really an 'except', just 'and'
<zid> not the full story
<mxshift> 130T == 128b
<mxshift> then protocol overhead
<zid> B not b
<zid> err yes b ,f uck
<zid> Now I'm going insane
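The GT/s to GB/s conversion that mxshift and zid converge on (one transfer is one bit on the wire, 130 transferred bits carry 128 data bits, 8 bits per byte) works out as below. This is only the encoding step; protocol overhead, as noted above, comes off after it. The function name is made up for the example.

```python
def lane_gbytes_per_s(gt_per_s: float, data_bits: int = 128, coded_bits: int = 130) -> float:
    """Per-lane, per-direction throughput in GB/s after line encoding."""
    return gt_per_s * data_bits / coded_bits / 8


if __name__ == "__main__":
    gen3_lane = lane_gbytes_per_s(8.0)
    print(f"gen3 x1:  {gen3_lane:.3f} GB/s")
    print(f"gen3 x16: {16 * gen3_lane:.2f} GB/s")
```

For gen3 at 8 GT/s this gives about 0.985 GB/s per lane, so an x16 slot is a bit under 16 GB/s in each direction, in line with the table vin linked.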
<mxshift> as for X11DPX-T, it has 2x full x16 slots and 2 x16 slots that can be electrically x16 or x8
<mxshift> depending on if you want 4x16 + 2x8 + 1x4 or 2x16 + 6x8 + 1x4
<vin> why are these throughputs listed for single direction when they are not the actual capacity the slot can support? Example x16 gen 3 should be listed as 32 GB/s
<zid> I only care how quick I can fill my vram, not how quick I can read from the framebuffer :p
<mxshift> vin: PCIe traffic is rarely symmetric. Single-direction throughput is much more useful to know
<vin> mxsh
<vin> mxshift: cool
<zid> also my ethernet is 1gigE not 2gigE for the same reason
<vin> Yea makes sense
<mxshift> zid: 1000BASE-T made it too easy to assume full duplex all the time. 1000BASE-T1 is half-duplex
<vin> I wish someone would write blog posts on different hardware protocols. As a systems engineer I neglect them until I really get into some trouble..
<vin> from a systems building perspective
<mxshift> blog posts are out there but PC interconnects are written by different people from networking from storage from ......
<vin> hmm that's right
<mxshift> my dayjob is designing server hardware so I frequently deal with quite a few of those different groups
<vin> Nice, like assembling from different OEMs and making them functional?
jamestmartin has joined #osdev
<mxshift> no, designing entirely new PCBs, enclosures, firmware, etc
<j`ey> oxide? :p
<mxshift> yup
<j`ey> mxshift: did you change your name, or are there 2 of you in here?
<mxshift> changed username
<vin> ah yes I remember talking to you earlier
<geist> re: physical address space support, that varies depending on the chip and the chipset but at the core the limitation is most likely because adding more bits of physical is not free
<geist> at the minimum you have to reserve more bits in the TLB and cache tags for the max physical address
<geist> so consumer level cores may support 44 or 48 or maybe 50 bits of physical, but there's no point wasting bits in the tag ram for anything more, especially when you can't physically wire up sticks of ram more than that
<geist> server socs have more memory channels and more support, so the corresponding core may have more tag ram bits implemented
<zid> Max Memory Size (dependent on memory type) 375 GB
<zid> It's like they chose random bits to set to produce that number ;p
<zid> 0b010111011100....
heat has joined #osdev
<__sen> zid: 375*1024 is 384000, so probably a GiB vs GB issue that makes it look weirder than it is :)
<geist> that may be a function of the number of memory controllers you have and how much each of them can individually address
<geist> that seems pretty close to 3 memory controllers of 128GB addressing a piece
dzwdz is now known as thats_a_pretty_g
thats_a_pretty_g is now known as dzwdz
<vin> geist: Why can't one physically wire up more sticks to the controller? They can but per stick throughput will reduce and controller would need more die area?
<vin> You're right, I have seen servers with 12 channels supporting 2 TiB while 6 channels only support 1
<Bitweasil> Pins, drive strength, etc.
<Bitweasil> Each channel can only drive so much.
<Bitweasil> Registered gets you more, I think there are some load reduced DIMM specs too.
<Bitweasil> But each memory interface is a /lot/ of pins in the footprint.
<vin> geist: Bitweasil So intel 8280 supports 1 TiB max and intel 8280M supports 2 TiB, both have 2 iMC and 6 channels so why is there a difference from above reasoning? https://en.wikichip.org/wiki/intel/xeon_platinum/8280m
<bslsk05> ​en.wikichip.org: Xeon Platinum 8280M - Intel - WikiChip
<Bitweasil> Because Intel decided to segment it, probably.
<Bitweasil> Don't confuse "Intel did it" with "There exists a technical reason."
<vin> Yea that was my initial thought as well
<kazinsal> I'm imagining the grey market server equivalent of spending $2000 on an RTX 3090 cardboard box because you didn't read the description
<vin> what does that mean? That processor is not worth the money?
<mxshift> vin: most likely they support either an additional DIMM type (RDIMM, LRDIMM, or 3DS) or more ranks
<vin> prolly, I don't see them on specsheet
