#osdev on 2022-07-07 — irc logs at libera.irclog.whitequark.org

2021-05-23 01:57 klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books

00:04 heat has quit [Remote host closed the connection]

00:05 heat has joined #osdev

00:12 gog has quit [Read error: Connection reset by peer]

00:14 opal has quit [Remote host closed the connection]

00:14 gxt__ has quit [Remote host closed the connection]

00:14 wand has quit [Remote host closed the connection]

00:14 foudfou has quit [Remote host closed the connection]

00:15 opal has joined #osdev

00:15 gxt__ has joined #osdev

00:15 foudfou has joined #osdev

00:16 zaquest has quit [Remote host closed the connection]

00:17 zaquest has joined #osdev

00:17 gog has joined #osdev

00:20 wand has joined #osdev

00:23 foudfou has quit [Remote host closed the connection]

00:24 foudfou has joined #osdev

01:00 pretty_dumm_guy has quit [Quit: WeeChat 3.5]

01:29 Lumia has joined #osdev

01:29 heat has quit [Ping timeout: 240 seconds]

01:35 doug16k has quit [Remote host closed the connection]

01:36 Celelibi has joined #osdev

01:46 [itchyjunk] has quit [Ping timeout: 244 seconds]

01:50 doug16k has joined #osdev

01:51 [itchyjunk] has joined #osdev

01:58 <doug16k> a minute or two to parse dragon.c? wow

01:58 <doug16k> the parser should rip through it

02:00 smeso has quit [Quit: smeso]

02:03 <doug16k> can't say I have generated huge sources and tried to compile them much though

02:03 <doug16k> sounds like a good idea

02:05 <doug16k> reminds me of when you had a bunch of DATA statements in basic and you had a for loop that uses READ and POKE to put the asm in RAM :D

02:06 gog has quit [Ping timeout: 272 seconds]

02:07 <doug16k> not as bad though if your payload is numbers and the source is the numbers

02:07 <geist> What is interesting is I’ve observed in one case that aggressive C inlining can result in crazy compile times (vs C++)

02:07 <geist> Specifically zstd.c in the zstd compression thing

02:08 <geist> It is a .c file that uses the forced inline gcc intrinsic to compute some table, the way you would with a constexpr thing in C++

02:08 <geist> but it takes minutes to run, whereas the equivalent in C++ is almost instant

02:08 <geist> Someone on compiler team mentioned that’s because it’s a fundamentally different path in the compiler, and there’s lots of infrastructure to deal with constexpr things in C++

02:09 <geist> Whereas C is doing it much more brute force where it literally expands the whole thing and then has to use the backend optimizers to flatten it down

02:11 <moon-child> wow. Even constexpr is not that fast

02:11 <moon-child> d is moving to jit compilation for its constexpr-alike, and iirc one of the touted benefits of circle is better compile times

02:11 smeso has joined #osdev

02:13 <doug16k> hey I just realized, it might be drastically faster for zid to generate asm for those arrays

02:14 <doug16k> it would be hardly any changes

02:19 <doug16k> it would be interesting to put 1GB of .single in a file and see how long it takes

02:22 <moon-child> eh, if you care about performance, I would just do the linker trick

02:23 <doug16k> yeah but that is such a gross hack though. it would be ok-ish if it didn't create that nonsensical _size symbol

02:24 <doug16k> it can be weird if you are doing stuff close to address zero and there is this weird _size in the middle of your stuff

02:25 <doug16k> why oh why isn't it a half open range with a start and end symbol

02:25 <doug16k> cuckoo

02:26 <moon-child> at least #embed is coming

02:26 <moon-child> soon(tm)

02:32 <doug16k> I think I used it though, even though I dislike that _size thing it does

02:32 <doug16k> for vga font

02:33 <doug16k> kernel doesn't care about near address zero

02:36 <doug16k> I do some build step to bundle up the font data and the lookup table that says all the unicode codepoints it has and include that

02:38 <doug16k> this one https://www.inp.nsk.su./~bolkhov/files/fonts/univga/

02:38 <bslsk05> www.inp.nsk.su.: Unicode VGA font homepage

02:40 <doug16k> a trivial little python script to blast a binary into asm source and assemble that would be all you need to remove that ld trick dependency

02:41 <doug16k> isn't there .incbin in gnu as?

02:42 <doug16k> wow I could have sworn gnu assembler had that

02:44 <doug16k> oh it does

02:45 <doug16k> even has start and length optional parameters to extract a portion

02:46 <doug16k> and guarantees it won't do any funny alignment business, and leaves it up to you to restore alignment before or after

02:48 <doug16k> but yeah, doing it as generated asm preserves the endian independence
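
A minimal sketch of the "blast a binary into asm source" script mentioned above, assuming GNU as syntax. The function name `blob_to_asm` and the `_start`/`_end` symbol naming are illustrative, not taken from anyone's build. It emits a half-open start/end symbol pair instead of a `_size` symbol.

```python
def blob_to_asm(data: bytes, symbol: str, per_line: int = 16) -> str:
    """Render bytes as GNU as source bracketed by start/end symbols."""
    lines = [
        ".section .rodata",
        f".global {symbol}_start",
        f".global {symbol}_end",
        f"{symbol}_start:",
    ]
    # One .byte directive per row of per_line bytes
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        lines.append("    .byte " + ", ".join(f"0x{b:02x}" for b in chunk))
    lines.append(f"{symbol}_end:")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(blob_to_asm(bytes(range(20)), "font"), end="")
```

If the payload has multi-byte fields, the generator could emit those as `.long` values instead of raw bytes so the assembler picks the target byte order.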

02:49 <doug16k> makes me wonder why it is not more common to generate code. protobuf does. parser generators. maybe some GUI form stuff.

02:50 <doug16k> hardly anything though

02:55 <clever> doug16k: i do like .incbin over generating giant c arrays, it also feels like it would compile faster

02:55 <clever> no time wasted turning your binary into hex, then back into binary

02:55 <clever> https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/arm/payload.S

02:55 <bslsk05> github.com: lk-overlay/payload.S at master · librerpi/lk-overlay · GitHub

02:56 <clever> an example where i generated an array of addr+length pairs using .incbin

02:56 <clever> there is a struct in the neighboring file, that turns it into a simpler api for c

02:56 <doug16k> with nice sane half open start and end symbols

02:57 <clever> https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/arm/arm.c#L192

02:57 <bslsk05> github.com: lk-overlay/arm.c at master · librerpi/lk-overlay · GitHub

02:57 <clever> on the C side, i can just do chosenPayload = &arm_payload_array[processor]; and chosenPayload->payload_addr, chosenPayload->payload_size

02:57 <doug16k> neat

02:57 <clever> in your case, you could do the exact same thing, with an array of multiple font blobs

03:00 <clever> the key to that trick, is to create a struct with a known memory layout, in this case a pointer and an int (pointer always being 32bit in this case), and then making the asm side being pairs of 32bit ints

03:00 <clever> extern arm_payload arm_payload_array[3];

03:00 Lumia has quit [Ping timeout: 240 seconds]

03:00 <clever> i even have the length specified, so gcc will try to enforce it, though only for constant indexes i think

03:01 <clever> and extern tells the linker to find it in another unit
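
The fixed-layout trick can be modelled outside C as well. This sketch packs and unpacks a table of (payload_addr, payload_size) pairs as two 32-bit words each. Little-endian order and the sample addresses are assumptions for illustration; the real table comes from the assembler.

```python
import struct

# Two unsigned 32-bit words per entry: payload_addr, payload_size.
# "<" (little-endian, no padding) is an assumption about the target.
ENTRY = struct.Struct("<II")


def pack_table(entries):
    """Build the byte image the asm side would lay down."""
    return b"".join(ENTRY.pack(addr, size) for addr, size in entries)


def unpack_table(blob):
    """Read the same bytes back the way the C struct array would."""
    return [ENTRY.unpack_from(blob, off) for off in range(0, len(blob), ENTRY.size)]


if __name__ == "__main__":
    table = pack_table([(0x8000, 16), (0x9000, 32), (0xA000, 64)])
    for addr, size in unpack_table(table):
        print(hex(addr), size)
```

Both sides only agree because the entry size and field order are pinned down; a 64-bit pointer on the C side would break the pairing.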

03:01 <doug16k> yeah, my thing starts with known structs, by the time you get to the array part, you know where everything is

03:01 <doug16k> doesn't need end

03:04 <doug16k> it's the obvious header struct says count, then array of count items after struct, then another thing that starts at end of array

03:05 <doug16k> array is the lookup table for codepoint, corresponding bitmap is at an offset into 3rd region, based on what index matched the codepoint lookup

03:06 <doug16k> and precompute a lookup table of the first 127 ascii or something so they're O(1)

03:07 <doug16k> at runtime

03:08 <doug16k> it's semi-SOA

03:09 <doug16k> should make it fully SOA though

03:09 <doug16k> when doing the search, it pulls in unnecessary fields
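
A rough sketch of the layout described above: a header holding a count, then a sorted array of codepoints, then the bitmaps, with a precomputed table for ASCII. The 16-byte glyph size, the 32-bit little-endian fields, and all names here are assumptions, not the actual font bundle format.

```python
import bisect
import struct

GLYPH_BYTES = 16  # assumed 8x16 glyph, one byte per row


def build_font(glyphs):
    """glyphs: dict of codepoint -> 16-byte bitmap. Returns header + codepoints + bitmaps."""
    cps = sorted(glyphs)
    blob = struct.pack("<I", len(cps))
    blob += struct.pack(f"<{len(cps)}I", *cps)
    blob += b"".join(glyphs[cp] for cp in cps)
    return blob


class Font:
    def __init__(self, blob):
        (count,) = struct.unpack_from("<I", blob, 0)
        self.codepoints = list(struct.unpack_from(f"<{count}I", blob, 4))
        self.bitmaps = blob[4 + 4 * count:]
        # Precomputed so ASCII lookups are a plain index, no search
        self.ascii = [self._search(cp) for cp in range(128)]

    def _search(self, cp):
        """Binary search over the codepoint array only; bitmaps are untouched."""
        i = bisect.bisect_left(self.codepoints, cp)
        if i < len(self.codepoints) and self.codepoints[i] == cp:
            return i
        return -1

    def glyph(self, cp):
        i = self.ascii[cp] if cp < 128 else self._search(cp)
        if i < 0:
            return None
        return self.bitmaps[i * GLYPH_BYTES:(i + 1) * GLYPH_BYTES]
```

Keeping the codepoints in their own array is the SOA part: the search walks a dense run of integers and only the final hit touches bitmap data.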

03:11 <clever> oh, i do also have a font drawing example...

03:11 [itchyjunk] has quit [Remote host closed the connection]

03:11 <clever> https://github.com/cleverca22/wowmapviewer/blob/master/src/font.cpp

03:11 <bslsk05> github.com: wowmapviewer/font.cpp at master · cleverca22/wowmapviewer · GitHub

03:12 <clever> https://github.com/cleverca22/wowmapviewer/blob/master/bin/arial.info

03:12 <bslsk05> github.com: wowmapviewer/arial.info at master · cleverca22/wowmapviewer · GitHub

03:12 <clever> and the arial.tga in the bin

03:12 Lumia has joined #osdev

03:13 <doug16k> should use triangle strip and do whole string in one Begin

03:13 <clever> i was using this code as a testcase, when writing my own gl driver from scratch

03:13 <clever> and it only supported triangles, because i didnt understand strip at the time

03:14 <doug16k> ah. strip is easy, you just do top, bottom, top, bottom, top, bottom, across the row

03:14 <clever> i have since looked it up, but its not clear how to end a strip on v3d

03:15 <doug16k> ah, I see. I meant easy at API level :)

03:18 <clever> doug16k: but, what if i want to have 2 polygons, that dont share any vertex?

03:18 <doug16k> after the 1st two vertices, each one vertex is another whole triangle

03:18 <clever> how do i end the strip?

03:18 <doug16k> then you make a degenerate triangle

03:18 <clever> which is?

03:18 <doug16k> or there is a "primitive restart index" you can set for indexed

03:19 <doug16k> if you repeat a point then it will be a zero area triangle over to the new place

03:19 <clever> ah

03:19 <doug16k> imagine the "base" of the triangle has 0 width

03:19 <clever> yeah

03:19 <clever> then its just a line

03:19 <clever> and the rasterizer wont find any pixels within that "area"

03:19 <doug16k> right but because being on right edge overrules being on left edge, no pixels are inside

03:20 <doug16k> s/but//
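
To make the degenerate-triangle idea concrete, here is a small sketch that stitches separate quads into one strip by repeating vertices, then checks which triangles have nonzero area. The helper names are made up for illustration and nothing here is specific to v3d.

```python
def quad_strip(x0, y0, x1, y1):
    """Four vertices for one quad in strip order: top, bottom, top, bottom."""
    return [(x0, y0), (x0, y1), (x1, y0), (x1, y1)]


def stitch(strips):
    """Join strips into one by repeating the last and first vertices."""
    out = []
    for strip in strips:
        if out:
            out.append(out[-1])   # repeat last vertex of the previous strip
            out.append(strip[0])  # repeat first vertex of the next strip
        out.extend(strip)
    return out


def area2(a, b, c):
    """Twice the signed area of triangle abc; zero means degenerate."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])


def visible_triangles(strip):
    """Every vertex after the first two forms one triangle; drop the zero-area ones."""
    tris = [strip[i:i + 3] for i in range(len(strip) - 2)]
    return [t for t in tris if area2(*t) != 0]
```

Two quads become a 10-vertex strip with eight triangles, four of which are zero-area connectors that produce no pixels.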

03:20 <clever> but i also see a limitation there, 16bit index is the biggest it supports

03:20 <clever> which means the vertex array, can only have ~65535 elements

03:20 <doug16k> you don't need indexed

03:21 <doug16k> it would be silly to make text indexed

03:21 <doug16k> the index array says 0, 1, 2, 3, 4, 5, 6, ... every time

03:21 <doug16k> because you used a strip

03:21 <clever> so i would instead do vertex array primitives?

03:21 <doug16k> for that yeah

03:21 <doug16k> because the index traversal is trivial 0-N

03:21 <clever> though, that wants the index of the first vertex, as a 32bit int

03:22 <doug16k> 0 then

03:22 <doug16k> or whatever if you are packing multiple things into a single array being smart

03:22 <clever> and given the memory constraints, 32bit index and 32bit count, i'll run out of ram before i exhaust those

03:23 <clever> vc4-v3d is 1gig of ram, so even with 1 byte vertices, 30bits for the length/index would exhaust all ram

03:23 <clever> bigger vertex data, means i use even fewer bits of index/length

03:24 <clever> vc6-v3d has an mmu and can address 4gig of data, but same rules, i would need a 1 byte vertex data to even come close to exhausting that reach

03:25 <doug16k> you can do even better

03:25 <doug16k> you can make it one vertex per character and make it create the vertices in the shaders

03:26 <clever> i havent tested vertex shaders yet

03:26 <doug16k> I did one where it was an x,y,c triple for each glyph

03:26 <clever> and i dont know how to make 4 vertexes from 1 entry

03:26 <doug16k> using instancing, nvidia instantly draws it

03:26 <clever> i'm not sure v3d can do instancing

03:26 <doug16k> don't need it - it helps with the make-vertices-in-shader trick

03:26 <clever> the vertex shader seems to turn 1 unshaded vertex (attributes) into one shaded vertex (xyz+vary[])

03:28 gxt__ has quit [Remote host closed the connection]

03:28 gxt__ has joined #osdev

03:28 <doug16k> you must mean xyzw

03:29 <doug16k> NDC right?

03:29 <clever> w/1

03:29 <clever> wait no, thats pointless, lol

03:29 <clever> 1/w

03:30 <clever> page 60/61 of the pdf, is the vertex formats

03:30 <clever> the coordinate shader has to produce the ones on 61 i believe

03:30 <clever> x/y/z/w/x/y/z/ 1/w

03:30 <doug16k> yeah 1/w

03:31 <clever> xy is twice, because there is both 3d and screen 2d

03:31 <doug16k> then you get screenx= x*1/w, screeny = y*1/w, screenz = z*1/w, where screenx and y are -1 to 1 and z is 0 - 1 IIRC

03:32 <clever> screen x/y are also in a 12.4 fixed-point format

03:32 <doug16k> makes sense, 1/16th subpixel
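
A sketch of the divide by w followed by conversion to 12.4 fixed point. The viewport mapping (NDC -1..1 onto 0..width, with y flipped) is an assumption added for illustration; only the 1/w multiply and the 4 fractional bits come from the discussion.

```python
def to_screen_fixed(x, y, w, width, height):
    """Clip-space x, y, w -> screen x, y as 12.4 fixed-point integers."""
    inv_w = 1.0 / w
    ndc_x, ndc_y = x * inv_w, y * inv_w      # -1..1 after the divide
    px = (ndc_x + 1.0) * 0.5 * width         # assumed viewport mapping
    py = (1.0 - ndc_y) * 0.5 * height        # assumed: +1 is the top row
    # 4 fractional bits means one unit is 1/16 of a pixel
    return round(px * 16), round(py * 16)


if __name__ == "__main__":
    print(to_screen_fixed(0.0, 0.0, 1.0, 640, 480))
```

With 12 integer bits the largest representable coordinate is just under 4096 pixels, which bounds the addressable screen size in this format.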

03:32 <clever> and the hardware has some anti-aliasing built in

03:32 <clever> where you can render at double res, and it will down-scale as it saves to the framebuffer

03:33 <clever> beyond that, you need to down-scale with some other hw block

03:34 <clever> ah, there is what i was looking for, page 78, shader state record formats

03:34 <clever> you need one of those, to describe the shader

03:34 <doug16k> super sampling is smart. hits the cache almost every time

03:34 <clever> the "GL shader" takes 3 shaders, coordinate, vertex, and fragment

03:34 <clever> then the "NV Shader" (no vertex?) takes just a fragment shader, and pre-shaded vertex data

03:35 <clever> and ive got no clue what the "VG Shader" is for, it takes counts for things, but has no pointer to the shader!?

03:35 <clever> *doh*, there it is, fragment shader code address

03:36 <clever> not sure what it does

03:36 <doug16k> what fragment shader does?

03:36 <clever> the fragment shader decides the final color of a pixel

03:36 <doug16k> yeah

03:36 <clever> the vertex shader generates the xyz/vary[] from attributes

03:37 <clever> the coordinate shader only generates xyz for tile binning

03:37 <doug16k> the rasterization step interpolates stuff from the vertices and executes the shader for each intermediate value

03:37 <clever> yeah

03:38 <clever> taking a closer look at table 45 on page 79

03:38 <clever> a GL shader, takes a fragment shader uniform count (unused), fragment shader varying count, fragment shader addr, and fragment shader uniform addr

03:39 <clever> given that uniform count is unused, that implies its just going to fall off the end of the array and hit undefined data, and you should just not read too many uniforms

03:39 <doug16k> absolutely

03:39 <doug16k> gpus couldn't care less about correctness

03:40 <clever> then the vertex shader has the same unused uniform count, the total attribute size, code addr, and uniform addr

03:40 <clever> so fragment and vertex can use different uniforms

03:40 <clever> oops, and missed one, vertex shader also has an 8bit attribute selection mask

03:40 <clever> that implies you can have up to 8 attributes on a vertex, and select any combination of those 8 for the vertex shader

03:41 <clever> does that fit with what opengl standards say?

03:41 <doug16k> 8 is plenty

03:41 <clever> then you have the same thing again for coordinate shader, uniform count(unused), attribute mask, attribute size, code addr, uniform addr

03:42 <doug16k> there is a getInteger api to query limits like that

03:42 <clever> then things get a bit more messy

03:43 <clever> you have an array of up to 8 base addresses, for each attribute

03:43 <doug16k> it's 8 vectors right? of xyzw? float4 right?

03:43 <doug16k> that is tons

03:43 <clever> then you have an array of sizes in bytes, the 8 strides, the 8 vertex offsets, and the 8 coordinate offsets

03:44 <clever> i think the shader is free to interpret the attributes however it likes

03:44 <clever> one sec

03:44 <doug16k> yes

03:44 <clever> page 60

03:45 <clever> > the vertex attribute data for each of the 16 vertices, is loaded into a single column for each vertex

03:45 <clever> > such that the qpu can read a horizontal vector of vertices for each individual attribute.

03:45 <clever> > the attributes for each vertex are packed into the column, according to the setup data supplied to the VCD from the appropriate shader state record

03:46 <clever> then pages 54 and 55

03:46 <doug16k> what version of opengl is it supposed to be?

03:46 <doug16k> sounds at least 3.3

03:47 <doug16k> 3.3 is the 1st awesome version imho

03:47 <clever> page 12 says all of that

03:47 <clever> opengl-es 1.1/2.0 and openvg 1.1

03:47 <doug16k> 2.0 is good

03:48 <clever> back to page 54, if i was to do a horizontal 32bit load, for y=0, then i would get attribute 0 in the 16 lanes of the QPU

03:48 <doug16k> 2.0 is "oops, sorry we made you do so many function calls" and 3.3 is "sorry, everything is buffers and shaders now!"

03:48 <clever> as-in, attribute 0 of 16 vertices

03:49 <clever> but, if i was to do 16bit laned, y=0, h=0, i would get the lower 16bits of each attribute 0, i think

03:49 <doug16k> 1.x made you do too many calls

03:49 <clever> so i could pack multiple attributes

03:49 Lumia has quit [Ping timeout: 240 seconds]

03:49 <clever> thats also the feeling i got from opengl ES

03:49 <clever> it felt like ES was for embedded systems, and forced you to do buffer based stuff, rather than function calls up the wazoo

03:50 <clever> and the wowmapviewer font code, is making a call per vertex, which wouldnt have worked on the rpi's old opengl ES

03:51 <doug16k> if you are in kernel mode already, your glVertex could poke it straight into the GPU somehow, and it would be pure genius

03:51 <doug16k> that is why it was like that - SGI had so much acceleration, the glVertex call just did MMIO or something

03:51 <clever> the QPU doesnt really have any of its own ram

03:52 <clever> it just dma's into main system ram

03:52 <clever> the old v3d2 driver i was writing for linux, would mmap buffers into userland, that were physically contiguous

03:52 <clever> so i could then just pass that buffer directly to the qpu

03:52 <clever> and i only had to do cache management

03:53 <doug16k> or not even in kernel mode - could map it in wherever

03:53 <clever> my rough understanding of this region, is that the VPM is an uint32_t[64][16], one of the hardware blocks will pre-load attribute data into it for me

03:53 <clever> the coordinate/vertex shader must then compute things, and write the result back into the VPM

03:54 <clever> the FEP (front end pipe) will then read the shaded vertices from the VPM, rasterize them, and feed varyings to the VRI (interpolator)

03:55 <clever> and schedule fragment shader jobs on the QPU

03:56 <clever> the tile buffer is 64x64 pixels, when using 32bit pixels, and no multi-sampling

03:56 <clever> and because it renders a whole tile at once, it only ever needs to track depth/coverage data for a 64x64 pixel region

03:56 <clever> and while storing that tile to ram, it is a series of 64 x 256byte AXI bursts, so it should be very good at maxing out memory bandwidth

03:56 Lumia has joined #osdev

03:57 <clever> in "4x multisample mode", its instead 32x32 pixels when written out to ram

03:58 <clever> i assume its just taking each 2x2 chunk, averaging them together, and writing it out as 1 pixel
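
If the resolve really is a plain 2x2 average (which is only a guess in the discussion above), it would look roughly like this box filter. Pixels are (r, g, b) integer tuples and the grid is assumed to have even width and height.

```python
def downsample_2x2(pixels):
    """Average each 2x2 block of a grid of (r, g, b) tuples into one pixel."""
    out = []
    for y in range(0, len(pixels), 2):
        row = []
        for x in range(0, len(pixels[y]), 2):
            block = [pixels[y][x], pixels[y][x + 1],
                     pixels[y + 1][x], pixels[y + 1][x + 1]]
            # Integer average per channel, truncating like a shift by 2 would
            row.append(tuple(sum(p[c] for p in block) // 4 for c in range(3)))
        out.append(row)
    return out
```

Applied to a 64x64 sample grid this yields the 32x32 pixel tile, which matches the sizes quoted for 4x multisample mode.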

03:58 <doug16k> yeah, that's one area I wish it were easier to study. tiling is handwaved

03:58 <clever> it does also have a 64bpp mode, which results in either 32x32 or 16x16 tiles, depending on multi-sample

03:58 <doug16k> the exact algorithm I mean

03:59 <doug16k> it's the strip mining optimization isn't it?

03:59 <clever> that part, is under driver control

03:59 <clever> https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/v3d/v3d.c#L309-L330

03:59 <bslsk05> github.com: lk-overlay/v3d.c at master · librerpi/lk-overlay · GitHub

03:59 <doug16k> aka loop blocking, aka loop tiling

04:00 <clever> lines 313-315, specifies the tile coordinates

04:00 <clever> then line 318 calls a function that was generated earlier, with 319 computing the addr of that function, based on the tile coord

04:00 gxt__ has quit [Remote host closed the connection]

04:00 <clever> 327 will store the tile and continue, while 324 will store the tile and report the frame as finished

04:01 gxt__ has joined #osdev

04:01 <clever> 310/311 then just manually unrolls a simple loop, to step thru every tile in the frame

04:01 <doug16k> where do the draw calls come into it?

04:01 <clever> but you could do tiles in any order you want

04:01 <clever> thats the generated function, that 318 calls

04:01 <doug16k> so it replays all the draw calls on each tile?

04:01 <clever> the binner will run your coordinate shader, figure out what polygons are in each tile, and generate a control list, that draws a subset of the polys

04:01 <clever> yeah

04:02 <doug16k> right binner figures out which it covers

04:02 <doug16k> nightmare

04:02 <clever> thats why it has both a coordinate and a vertex shader

04:02 <clever> the coordinate shader is just a lobotomized vertex shader, with the varying computation deleted

04:03 <clever> so you can get the screen xy, and then see what tiles it covers

04:03 <doug16k> it's conceptually the same idea as using a solid fill shader when doing shadow maps

04:03 <clever> mesa hides this from you, and auto-generates the coordinate shader from the vertex shader

04:03 <clever> but its also not having to run at the full res

04:04 <clever> you could basically treat it as an image with 1/64th the resolution

04:04 <clever> the whole tile is 1 pixel

04:04 <clever> if the polygon touches it, in the bin it goes
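
The "whole tile is 1 pixel" view can be sketched with a conservative bounding-box binner. This is a software illustration only: the hardware binner presumably tests coverage more precisely, and the function name and dictionary layout are assumptions.

```python
TILE = 64  # tile edge in pixels for the 32bpp, non-multisample case


def bin_triangles(triangles, width, height):
    """Map each tile (tx, ty) to indices of triangles whose bounding box touches it."""
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    bins = {}
    for index, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Clamp the tile range to the screen
        tx0 = max(int(min(xs)) // TILE, 0)
        tx1 = min(int(max(xs)) // TILE, tiles_x - 1)
        ty0 = max(int(min(ys)) // TILE, 0)
        ty1 = min(int(max(ys)) // TILE, tiles_y - 1)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins
```

The render pass then only replays each tile's own list, so a triangle that touches one tile costs nothing in the others.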

04:04 <clever> https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/v3d/v3d.c#L389-L395

04:04 <bslsk05> github.com: lk-overlay/v3d.c at master · librerpi/lk-overlay · GitHub

04:05 <clever> this comment explains some of it

04:06 dude12312414 has joined #osdev

04:06 <doug16k> in the bin what goes?

04:06 dude12312414 has quit [Remote host closed the connection]

04:07 <clever> the polygon

04:07 <clever> one bin per tile

04:07 <doug16k> up to how many

04:07 <clever> treat each tile as 1 pixel, and rasterize it over that screen

04:07 <clever> thats what the tile allocation size is for

04:07 <clever> static int getTileAllocationSize(int n) { return 1 << (5 + n); }

04:08 <clever> page 71, code 112, the tile allocation block can be 32, 64, 128, or 256 bytes

04:08 <clever> it then gets filled with compressed primitive lists or inline primitive lists

04:09 Lumia has quit [Quit: ,-]

04:10 <clever> i forget where, but there is also an overflow space you can configure

04:10 <clever> and it will generate jump opcodes to pop into there temporarily

04:10 <clever> and it can fire an irq to request more

04:10 <doug16k> sounds fun to make a driver for that

04:10 <doug16k> it's almost unbelievable how high level it is

04:11 <doug16k> it's almost the opengl api directly

04:11 <clever> less work for the 250mhz VPU to do

04:11 <clever> back when this soc was arm-less, lol

04:12 <doug16k> my verilog is not even close to implementing that

04:12 <doug16k> I could get stuff to work but not cleanly I bet

04:12 <clever> the most complex thing that remains, is compiling a shader into QPU asm

04:13 <clever> ah, there it is, "address of overspill binning"

04:13 <clever> ctrl+f for that

04:13 <clever> > the address of additional memory that the PTB can use for binning once the initial pool runs out

04:14 <clever> > this may be set up prior to the PTB actually running out

04:14 <clever> that 2nd part, implies that you can react to an OOM IRQ, and then resume binning

04:14 <doug16k> so you can just start a new batch

04:14 <clever> and the next register then says

04:15 <clever> > if this count (the size) is zero when the PTB runs out of binning memory, the PTB will halt, waiting for a non-zero value to be written to this register

04:15 <clever> > once the PTB has taken this overspill memory, this register is set to zero

04:15 <doug16k> PTB?

04:15 <clever> primitive tile binner

04:15 <clever> from the big graph on page 13

04:16 <clever> > in the tile binning phase, only the vertex coordinate transform part of shading is performed

04:16 <clever> > the primitive tile binner fetches the transformed vertex coordinates from the VPM, and works out which tiles, if any, each primitive overlaps

04:17 <clever> > as it goes along, the PTB builds a list in memory for each tile, which contains all the primitives impacting that tile, plus references to any state changes that apply

04:18 <clever> page 62

04:19 <clever> > during the binning pass, the PTB automatically writes out a new control list for rendering each tile during the rendering pass

04:19 <clever> > the binning list must finish with a flush command, to cause the PTB to finalize all these tile lists

04:20 <clever> > all that the host processor then needs to do for the rendering pass list, is to setup the tile rendering mode configuration, and link together all the tile lists created by the PTB as sub-lists to the main list

04:21 <clever> > the only control items that the host processor needs to add to per tile lists are a "tile coords" item before and a "store tile" item after each tile list

04:21 <clever> as i showed in the unrolled loop earlier

04:21 MiningMarsh has quit [Quit: ZNC 1.8.2 - https://znc.in]

04:21 MiningMa- has joined #osdev

04:21 jack_rabbit has joined #osdev

04:22 knusbaum has quit [Ping timeout: 244 seconds]

04:22 MiningMa- is now known as MiningMarsh

04:22 <clever> doug16k: the CLE (control list executor) also has 2 threads, one for binning, and one for rendering, so the rendering thread can be computing frame 2, while the binning thread is computing frame 3

04:23 <clever> and while ive not used them yet, there are also semaphores, so the rendering thread can stall until binning has completed

04:23 <clever> instead, i'm waiting for an irq, and manually starting rendering

04:24 <clever> oh, interesting, the CLE starts executing at a defined start addr, and PAUSES when it hits the end addr

04:24 <clever> if you extend a control list, you can change the end addr, and it will RESUME! executing

04:25 <doug16k> yeah that sounds exactly like what hardware would do

04:25 <clever> so basically, instead of waiting for frame3 to finish binning, and resetting everything

04:25 <doug16k> it will watch that register 24/7 without complaining once

04:25 <clever> you can append frame4's binning list to the existing one, and extend the end-pointer

04:25 <clever> and the hw will automatically start binning frame4 when frame3 finishes

04:26 <clever> thats very nice, it lets you queue up several frames of work, and you dont have to be quick on the irq handling

04:26 <doug16k> yeah, covering latency is the whole job

04:26 <doug16k> you never want them taking turns waiting for the other

04:27 <clever> but with the call opcode, it can be difficult to know where within that master list and sub-functions, you currently are

04:27 <clever> so they also have a magic marker opcode, that just increments the marker count in a status reg

04:27 <clever> inject that wherever you feel, and keep track of how many markers you're expecting per frame
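
A toy model of the pause-and-resume behaviour and the marker counter described above. The opcode strings and the class are invented for illustration; real control lists are binary records and the real thread runs asynchronously.

```python
class ControlListExecutor:
    """Toy model of one control list thread; opcode names are made up."""

    def __init__(self, memory):
        self.memory = memory   # list of opcodes shared with the "driver"
        self.pc = 0            # next opcode to execute
        self.end = 0           # execution pauses when pc reaches this
        self.markers = 0       # stands in for the marker count status register
        self.executed = []

    def set_end(self, end):
        """Moving the end address forward resumes a paused list."""
        self.end = end
        while self.pc < self.end:
            op = self.memory[self.pc]
            if op == "marker":
                self.markers += 1
            else:
                self.executed.append(op)
            self.pc += 1


if __name__ == "__main__":
    memory = ["bin frame3", "flush", "marker"]
    cle = ControlListExecutor(memory)
    cle.set_end(len(memory))
    memory.extend(["bin frame4", "flush", "marker"])  # append while paused
    cle.set_end(len(memory))                          # hardware picks up frame4
    print(cle.executed, cle.markers)
```

With one marker per frame, the driver can read the counter to learn how many frames have finished without tracking the program counter through nested calls.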

04:28 <clever> page 63

04:29 <clever> > in gl mode, the pipeline up to the PTB/PSE consists of the following steps:

04:29 <clever> > 1 determine a batch of vertices to shade in the VCM (vertex cache manager)

04:29 <clever> > 2 find space in the VPM (vertex pipe memory) to store the batch of vertex input attributes and shaded vertices

04:29 <clever> > 3 fetch vertex attributes to the VPM using the VCD

04:29 <clever> > 4 shade the vertices using a vertex/coordinate shader

04:30 <clever> > 5 PTB/PSE reads shaded vertex data from the VPM

04:30 <clever> PSE==primitive setup engine, PTB==primitive tile binner

04:30 darkstardevx has joined #osdev

04:30 <doug16k> how far are you into the graphics driver?

04:30 <clever> so the binner and renderer threads, are both running those 5 steps, to feed either the PTB or PSE

04:31 <clever> for my old pi1 linux work, i can render 2d textured polygons

04:31 <doug16k> that's awesome

04:31 <clever> let me grab a screenshot...

04:31 <doug16k> hardware compositor right there

04:32 <clever> doug16k: https://i.imgur.com/bnAGIQ5.png

04:32 <clever> each character is made up of 4 triangles

04:33 <clever> 2 to draw it in black, then 2 to draw it again in white, with an offset

04:33 <clever> creating the shadow effect

04:33 <doug16k> and if you rotated the vertices the textures would interpolate diagonally (properly)?

04:33 <clever> i assume so

04:33 <doug16k> oh man so you have accelerated texturing already

04:33 <doug16k> just do the transformations on the cpu and put NDC triangles 2d

04:33 <clever> for that textured case, i'm using 2 varyings for the texture UV, and 3 varyings for an rgb color

04:34 <clever> so i can multiply the RGBA from the texture, by the varying, to color the text

04:34 <clever> https://www.youtube.com/watch?v=GHDh9RYg6WI

04:34 <bslsk05> '2d and 3d demo' by michael bishop (00:00:21)

04:34 <clever> this is the non-textured demo

04:35 <clever> its a single triangle, with the cpu using sin/cos to rotate the pre-shaded vertex XY's

04:35 <clever> and then RGB in the varyings

04:35 <clever> so the interpolator will give a smooth scale from 100% red to 0% red
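
The CPU-side rotation amounts to a 2D rotation of each pre-shaded vertex around a centre point. A minimal sketch, with the function name and the centre parameters chosen for illustration rather than copied from the linked v3d.c:

```python
import math


def rotate(vertices, angle, cx, cy):
    """Rotate (x, y) vertices by angle radians around the point (cx, cy)."""
    s, c = math.sin(angle), math.cos(angle)
    return [
        (cx + (x - cx) * c - (y - cy) * s,
         cy + (x - cx) * s + (y - cy) * c)
        for x, y in vertices
    ]


if __name__ == "__main__":
    triangle = [(0.0, -100.0), (-100.0, 100.0), (100.0, 100.0)]
    for frame in range(3):
        print(rotate(triangle, frame * math.pi / 60, 0.0, 0.0))
```

The per-vertex colours ride along unchanged as varyings, so only the XY pairs need recomputing each frame.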

04:36 <doug16k> that is doing the grunt part of rendering. even if you did all the transform and projection on the cpu, it would be nothing for the cpu

04:36 <clever> https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/v3d/v3d.c#L349-L356

04:36 <bslsk05> github.com: lk-overlay/v3d.c at master · librerpi/lk-overlay · GitHub

04:36 <clever> this is where that sin/cos happens

04:36 <doug16k> cpu has simd right?

04:37 <clever> yep

04:37 <clever> for (int i=0; i<16; i++) { int temp = a[i] * b[i]; if (store) c[i] = temp; if (accumulate) accumulator[i] += temp; }

04:37 <clever> the VPU can run this entire line of code in just 2 clock cycles

04:37 <clever> but its integer only, no float in that mode

04:38 <clever> mult only allows 16bit inputs, most other opcodes can work on a full 32bit input

04:38 <doug16k> that's not really simd though, but yeah cool, vectorization support

04:39 <clever> https://www.youtube.com/watch?v=l7lIewA9fm4

04:39 <bslsk05> 'vpu accelerated mandelbrot, final version' by michael bishop (00:00:18)

04:39 <clever> this is an example of what you can do with the VPU

04:39 <doug16k> I guess it is simd

04:39 <clever> https://www.youtube.com/watch?v=B9SuK3eR8uw

04:39 <bslsk05> 'non-accelerated mandelbrot' by michael bishop (00:00:13)

04:39 <clever> and this is doing the same function (but with floats) in scalar mode

04:39 <clever> what kind of fps does the non-accelerated one have?

04:40 <doug16k> yeah but mandelbrot can be anything from trivial to brutal

04:40 <doug16k> what zoom is that at?

04:41 <clever> the non-accelerated was the default from LK, which i think is just showing the whole thing

04:41 <doug16k> not fancy multiword precision? just ieee floats?

04:41 <clever> the accelerated one, is dynamically changing the zoom over time, because its fast enough to run at ~20fps

04:41 <clever> non-accelerated is just standard 32bit float

04:42 <clever> https://github.com/littlekernel/lk/blob/master/lib/gfx/gfx.c#L730-L768

04:42 <bslsk05> github.com: lk/gfx.c at master · littlekernel/lk · GitHub

04:42 <clever> the non-accelerated source

04:42 <doug16k> ok, see what I mean right? it might be a super fast float one or crazy gigantic precision

04:42 <clever> its the hardware float opcodes, which take 21 clocks for some operations

04:42 <clever> and its doing 1 pixel at a time

04:43 <clever> https://github.com/librerpi/lk-overlay/blob/master/app/vpu-mandelbrot/core.S#L62

04:43 <bslsk05> github.com: lk-overlay/core.S at master · librerpi/lk-overlay · GitHub

04:43 <clever> vs this code, which is doing most operations in 1-2 clocks, and its doing 16 pixels at once

04:46 <doug16k> is it way faster?

04:47 <clever> yes

04:47 <doug16k> most interesting is the latency of add and mul

04:47 <clever> 10 seconds for float, ~90ms for vectorized int

04:48 <clever> several orders of magnitude

04:48 <doug16k> like python/mysql stuff

04:48 <doug16k> you go in and it's 14 seconds and when you are done it is 48ms

04:49 <doug16k> someone had two nested for loops with a cursor.execute inside

04:49 <doug16k> never heard of a join
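
A simplified illustration of the query-in-a-loop pattern next to the join that replaces it, using the standard library `sqlite3` module. The `users` and `orders` tables are hypothetical and stand in for whatever the original code queried.

```python
import sqlite3


def setup():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total INTEGER);
        INSERT INTO users VALUES (1, 'a'), (2, 'b');
        INSERT INTO orders VALUES (1, 1, 10), (2, 1, 20), (3, 2, 5);
    """)
    return db


def totals_nested(db):
    """One extra query per user: cost grows with the number of rows looped over."""
    result = {}
    for user_id, name in db.execute("SELECT id, name FROM users").fetchall():
        rows = db.execute(
            "SELECT total FROM orders WHERE user_id = ?", (user_id,)
        ).fetchall()
        result[name] = sum(total for (total,) in rows)
    return result


def totals_joined(db):
    """Same answer from a single statement; the database does the matching."""
    rows = db.execute("""
        SELECT u.name, SUM(o.total)
        FROM users u JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """).fetchall()
    return dict(rows)
```

Against a networked server each loop iteration also pays a round trip, which is where most of the seconds tend to go.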

04:50 <clever> heh

04:51 knusbaum has joined #osdev

04:52 jack_rabbit has quit [Ping timeout: 244 seconds]

04:53 <doug16k> it's fun though. the improvements are thousands of percent when you fix bad sql stuff

04:53 <clever> yeah

04:53 <clever> ive identified sql problems before, without even seeing the sql

04:54 <clever> i could tell what the developer did wrong, just from how the api's responded

04:54 <clever> it was an ingame mail box

04:54 <clever> and the more mail you had, the longer it took to load a single page of mail

04:54 <clever> there was an index on account, but no index for the limit clause to act on

04:55 <clever> and it got bad enough, that the default request timeout in the client would just give up

04:55 <doug16k> yeah, he made it so it has to "think about" all mail, then extract a subset

04:56 <doug16k> it's so tempting to do pagination with LIMIT

04:56 <doug16k> instead of continuation based

04:57 <doug16k> select ...everything... LIMIT 1024,16 sucks compared to select ...everything after id N... LIMIT 16

04:57 <clever> but what about when your ID's have holes, due to deleted emails, or ID's getting assigned to other users?

04:58 <doug16k> you don't care about the holes. they tell you the last id in the last response they got, and you select > that
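
A minimal sketch of the continuation-based paging doug16k describes, written in C rather than SQL. It assumes a sorted in-memory array of ids standing in for an indexed column; `page_after` and its parameters are hypothetical names, not anything from the chat. The binary search plays the role of the index seek, so the cost of a page does not grow with how far into the list it is, and holes left by deleted ids do not matter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Continuation-based paging ("everything after id N, LIMIT 16"),
 * modeled on a sorted array that stands in for an indexed id column.
 * ids must be sorted ascending; gaps from deleted rows are fine.
 * Returns the number of ids written to out (at most limit).
 */
size_t page_after(const uint32_t *ids, size_t n, uint32_t last_id,
                  size_t limit, uint32_t *out)
{
    assert(ids != NULL && out != NULL);

    /* Binary search for the first id strictly greater than last_id.
     * This is the "seek" that an index gives the database. */
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ids[mid] <= last_id)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* Copy out at most limit ids starting at that position. */
    size_t count = 0;
    while (lo < n && count < limit)
        out[count++] = ids[lo++];
    return count;
}
```

The caller passes the last id of the previous page back in as `last_id`. That is exactly the trade-off raised in the chat: pages must be fetched in sequence, but each fetch stays cheap.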

04:58 <clever> ah, but thats fine, as long as you know the id from the previous page

04:58 <clever> yeah

04:58 <clever> but that forces the client to know what the previous page ended on

04:58 <clever> what if i want to skip to the 10th page?

04:58 <doug16k> right

04:58 <doug16k> so you can't make it parallel, but it can go on forever without timeout problem

04:59 <doug16k> the LIMIT 1024,16 one can spray a ton of different ones in parallel

04:59 <doug16k> in your fantasies, the server handles it really well

04:59 <clever> now it makes sense, why github's API is the way it is

04:59 <clever> https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/v3d/v3d.c#L143-L153

04:59 <bslsk05> github.com: lk-overlay/v3d.c at master · librerpi/lk-overlay · GitHub

04:59 <clever> this is the shader for the spinning rgb triangle

05:00 <clever> its a VLIW, where the left column is add only operations, and the middle column is mult only operations

05:00 <clever> 4 inputs, 2 outputs, computed independently, in parallel

05:00 <clever> first, we pop one varying, and we set the 4th(d) byte of r3 to 1.0 (auto-converts to 255)

05:01 <clever> then we add r5 to the popped varying (interpolation is half done, this finishes it), and pop the 2nd varying

05:01 <clever> then we add r5 to the popped varying (interpolation is half done, this finishes it), and pop the 3rd varying

05:01 LostCarcosa has joined #osdev

05:01 <clever> then we add r5 to the popped varying (interpolation is half done, this finishes it), copy the 1st varying to the 0th byte (a)

05:02 <clever> then copy the 2nd and 3rd varying to bytes 1&2(b&c)

05:02 <clever> and finally, move that to the output color register, and signal the end of the thread

05:02 <clever> dead-simple, and makes use of how you can pipeline an operation by mixing the 2 ALU's

05:06 <clever> i think every QPU opcode takes 4 clock cycles

05:06 <clever> and that program is 9 opcodes long, so 36 clocks

05:06 <clever> and at 250mhz, a single QPU can do 6.9 million pixels/second

05:07 <clever> thats entirely ignoring the fact that the QPU is a 16 lane vector core

05:08 <clever> assuming every pixel is touched once, it could render a 640x480 frame at about 22 fps, ignoring the fact that its vectorized, and there are multiple QPU
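
The back-of-envelope arithmetic here can be checked with a short sketch. It takes clever's figures as given (9 opcodes, an assumed 4 clocks per opcode, 250 MHz, one pixel per program run) and ignores the 16 lanes and the multiple QPUs, as the chat does. The function names are hypothetical. With integer division it works out to roughly 6.9 million pixels/second and a little over 22 frames per second at 640x480.

```c
#include <assert.h>
#include <stdint.h>

/* Pixels per second if one run of the shader produces one pixel. */
uint64_t pixels_per_second(uint64_t clock_hz, uint32_t opcodes,
                           uint32_t clocks_per_opcode)
{
    assert(opcodes > 0 && clocks_per_opcode > 0);
    return clock_hz / ((uint64_t)opcodes * clocks_per_opcode);
}

/* Whole frames per second if every pixel is touched exactly once. */
uint64_t frames_per_second(uint64_t px_per_s, uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    return px_per_s / ((uint64_t)width * height);
}
```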

05:09 <doug16k> wow look how dumb gcc is https://godbolt.org/z/jn1E5Mrq8

05:09 <bslsk05> godbolt.org: Compiler Explorer

05:10 <doug16k> what the hell is it doing?

05:10 <clever> i think its doing stores, of 1 byte each

05:10 <clever> but that is a rather verbose way of doing it

05:10 <doug16k> is that not just one store then add 4 to ptr and write back?

05:11 <clever> i'm not familiar with x86 abi, so thats a bit of a mess

05:11 <clever> i think maybe, its not caching the list pointer?

05:11 <doug16k> oh it's awful. I can write that in a few insns

05:11 <clever> and its re-reading it?

05:11 <doug16k> ah you need restrict

05:11 <doug16k> yep that's it

05:12 <clever> insert it where?

05:12 <doug16k> https://godbolt.org/z/GjffKboT5

05:12 <bslsk05> godbolt.org: Compiler Explorer

05:12 <clever> wow

05:12 <LostCarcosa> clang code does not improve with the restrict, interesting

05:12 <LostCarcosa> I mean it improves a bit, but not that much

05:12 <doug16k> it isn't sure that *list is itself

05:13 <doug16k> without restrict

05:13 <doug16k> osm

05:13 <doug16k> oops, isn't itself

05:14 <doug16k> what if *list = &list ?

05:14 <doug16k> right?

05:14 <doug16k> it's scared to death with char *

05:14 <clever> lol

05:14 <clever> this is just a primitive for appending 8/16/32bit values onto a byte array

05:14 <clever> i could instead memcpy struct's into the array

05:15 <clever> mesa has a fancy thing for doing just that

05:15 <doug16k> wait, it's not that

05:15 <doug16k> entirely. if it did the trick with big load it isn't sure that it will be affecting its input with those writes

05:16 <doug16k> isn't sure it will not be affected by its stores I mean

05:16 <doug16k> it's the same thing with vectorization

05:16 <doug16k> it looks like it could vectorize, but you didn't swear that the input and output don't overlap, so it is afraid

05:17 <clever> what exactly does restrict do?

05:17 <doug16k> restrict means, I guarantee that this isn't aliasing a variable
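
A hedged sketch of the aliasing problem being discussed. The exact function behind the godbolt links is not shown in the log, so `add_word_plain`, `add_word_restrict` and `add_word_local` are hypothetical reconstructions of "append a 32bit value to a byte list". In the plain version every store goes through a byte pointer, which may legally alias `*list` itself, so the compiler has to reload and write back `*list` around each store. Adding `restrict` promises that nothing else modifies `*list`; copying the pointer into a local gets a similar effect without the qualifier.

```c
#include <assert.h>
#include <stdint.h>

/* Plain version: each byte store might alias *list, so the pointer
 * is typically reloaded and stored back four times. */
void add_word_plain(uint8_t **list, uint32_t word)
{
    assert(list && *list);
    *(*list)++ = (uint8_t)(word);
    *(*list)++ = (uint8_t)(word >> 8);
    *(*list)++ = (uint8_t)(word >> 16);
    *(*list)++ = (uint8_t)(word >> 24);
}

/* restrict on list: the caller guarantees *list is only reached
 * through list, so the byte stores cannot change it. */
void add_word_restrict(uint8_t **restrict list, uint32_t word)
{
    assert(list && *list);
    *(*list)++ = (uint8_t)(word);
    *(*list)++ = (uint8_t)(word >> 8);
    *(*list)++ = (uint8_t)(word >> 16);
    *(*list)++ = (uint8_t)(word >> 24);
}

/* Alternative without restrict: read the pointer once into a local,
 * write through the local, publish the advanced pointer at the end. */
void add_word_local(uint8_t **list, uint32_t word)
{
    assert(list && *list);
    uint8_t *p = *list;
    p[0] = (uint8_t)(word);
    p[1] = (uint8_t)(word >> 8);
    p[2] = (uint8_t)(word >> 16);
    p[3] = (uint8_t)(word >> 24);
    *list = p + 4;
}
```

All three produce the same bytes; only the generated code differs. Calling the `restrict` version with a `list` that points into the buffer it appends to would be undefined behavior, which is the "*list = &list" case joked about above.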

05:17 <clever> ah

05:17 <clever> platform/bcm28xx/v3d/v3d.c:170:3: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]

05:17 <clever> uint32_t d = *((uint32_t *)&f);

05:17 <clever> ive also got this new error, it cant cast a float to an uint32

05:18 <doug16k> use memcpy

05:18 <doug16k> that is dumb and archaic

05:18 <doug16k> builtin memcpy will see what you mean

05:18 <doug16k> memcpy(&d, &f, sizeof(d));

05:18 xenos1984 has quit [Read error: Connection reset by peer]

05:18 <doug16k> compiler is guaranteed to understand

05:19 <clever> i think i can memcpy that addword too

05:19 <doug16k> no UB
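
A small sketch of the memcpy approach doug16k suggests for reading a float's bit pattern without the strict-aliasing error. `float_bits` is a hypothetical helper name, and the sketch assumes `float` is a 32-bit IEEE-754 value, which the static assertion only partly checks (it verifies the size, not the format).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of a float without type-punning
 * through a pointer cast. A compiler with builtin memcpy usually
 * turns this into a plain register move. */
uint32_t float_bits(float f)
{
    uint32_t d;
    static_assert(sizeof(d) == sizeof(f), "float must be 32 bits");
    memcpy(&d, &f, sizeof(d));
    return d;
}
```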

05:19 <moon-child> 'what if *list = &list' then you suck

05:19 <moon-child> :P

05:19 <moon-child> I mean, strict aliasing also sucks

05:19 <moon-child> but

05:19 <clever> as long as this code is on an LE system

05:19 <clever> but also, doing so, defeats the whole reason i was just about to compile this :P

05:20 <doug16k> I love strict aliasing because when I go look at the assembly, I don't think the compiler is screwed up

05:21 <doug16k> it does something like what I would do

05:21 <clever> ok, just one deleted func to fix...

05:21 <doug16k> not rereading things all paranoid delusional

05:22 <clever> now it compiles

05:23 <doug16k> strict aliasing gives it a chance of not thinking a store through a pointer invalidated every register variable

05:23 <clever> doug16k: https://gist.github.com/cleverca22/eedba1589a616d995d8582c868ec7552 is what the VPU compiler gave, without restrict

05:23 <bslsk05> gist.github.com: example.S · GitHub

05:24 <clever> load list into r2, add 1 byte, store back to list, store a byte from int to the pre-incremented value

05:24 <clever> load list again, increment again, shift the arg by increasing amounts, store a byte

05:24 <doug16k> is *list aligned?

05:25 <clever> the 32bit word can land at un-aligned addresses

05:25 <clever> because its in a control list, with a mix of 8, 16 and 32bit things

05:25 <doug16k> can you nop into alignment?

05:25 <clever> unlikely

05:26 <doug16k> it would be so good to not do all that for one 32 bit thing

05:26 <clever> opcodes 16/17 for example, are 8bit + 32bit

05:26 <clever> so you would need to nop it into mis-alignment, so the 8bit opcode puts it back in

05:26 <clever> but then opcode 32, is 8bit + 8bit + 32bit + 32bit + 32bit

05:26 <clever> so you need a different number of nop's to pad each opcode

05:27 <clever> at least i dont see any perverse 32bit + 8bit + 32bit cases, lol

05:29 <doug16k> can't you test the low 2 bits and conditional branch to 32 bit store? it would hit it 25% of the time

05:30 <clever> i could

05:30 <doug16k> depends on how many cycles a mispredict is

05:30 <clever> which reminds me of what ive seen in the official firmware's memcpy

05:30 <clever> it will test to see if the lower 2bits of the src&dest match

05:30 <clever> if they are similarly mis-aligned, you can do byte-wise for the start to get into alignment, then 32bit for the bulk

05:32 <clever> next, i believe the v3d expects everything in LE, and the VPU is LE always

05:32 <doug16k> if low two bits are 0, branch to whole 32 bit, else if low bit is 0, branch to 16 bit halves version, else sucky version

05:32 <clever> and arm is LE by default

05:32 <clever> so yeah, i can just memcpy the int32 over, just like the float

05:33 <doug16k> then it's 25% awesome, 50% pretty good, 25% sucks

05:33 <doug16k> I think

05:35 <doug16k> no wait, 00 = 32 bit, 01 = 8 bit, 10 = 16 bit, 11 = 8 bit

05:36 <doug16k> the low 2 bits of address
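
The dispatch on the low two address bits could be sketched as below. This is only a model: `store_le32` and the `store_path` enum are hypothetical names, and the fixed-size `memcpy` calls stand in for the single aligned 32-bit or 16-bit store instructions a hand-written version would use. The value is laid out little-endian first, matching the assumption in the chat that the control list is LE.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum store_path { STORE_8 = 8, STORE_16 = 16, STORE_32 = 32 };

/* Store v little-endian at dst, picking the widest store the
 * alignment allows: 00 = 32 bit, 10 = 16 bit halves, 01/11 = bytes.
 * Returns which path was taken so callers can observe the dispatch. */
enum store_path store_le32(uint8_t *dst, uint32_t v)
{
    assert(dst);
    const uint8_t le[4] = {
        (uint8_t)v, (uint8_t)(v >> 8), (uint8_t)(v >> 16), (uint8_t)(v >> 24)
    };
    uintptr_t low = (uintptr_t)dst & 3u;

    if (low == 0) {
        memcpy(dst, le, 4);          /* models one aligned 32-bit store */
        return STORE_32;
    }
    if (low == 2) {
        memcpy(dst, le, 2);          /* models two aligned 16-bit stores */
        memcpy(dst + 2, le + 2, 2);
        return STORE_16;
    }
    dst[0] = le[0];                  /* odd address: four byte stores */
    dst[1] = le[1];
    dst[2] = le[2];
    dst[3] = le[3];
    return STORE_8;
}
```

With uniformly distributed addresses this gives the corrected split from the chat: 25% on the 32-bit path, 25% on the 16-bit path, 50% on the byte path.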

05:36 DonRichie has joined #osdev

05:36 <clever> refresh the gist i linked earlier

05:36 <clever> memcpy makes the C look simpler, but the asm is far more complex

05:36 <doug16k> you must have -fno-builtin

05:36 <doug16k> add -fbuiltin-memcpy

05:37 <doug16k> or you have -ffreestanding

05:37 <doug16k> which implies -fno-builtin

05:37 <doug16k> or wait, if this cpu can't do misaligned, it has to do memcpy

05:38 <doug16k> but still

05:38 <doug16k> you should -fbuiltin-memcpy

05:38 xenos1984 has joined #osdev

05:38 <clever> cc1: error: unrecognized command line option ‘-fbuiltin-memcpy’

05:38 <doug16k> really? wow, what version?

05:38 <clever> vc4-elf-gcc (GCC) 6.2.1 20161217

05:39 <doug16k> maybe not implemented

05:39 <doug16k> surprising, builtin-memcpy helps C code massively

05:40 <clever> i also see room for improvement

05:40 <clever> lib/libc/string/arch/arm/arm/memcpy.S:FUNCTION(memcpy)

05:40 <clever> LK comes with asm copies of things like memcpy

05:40 <clever> i dont think i ever wrote one

05:40 <clever> lib/libc/string/memcpy.c:void *memcpy(void *dest, const void *src, size_t count) {

05:40 <doug16k> yeah it should for arch that don't like misaligned

05:40 <clever> so it falls back to whatever gcc did with this fallback

05:41 <doug16k> I have a pretty good memcpy

05:41 <doug16k> it does the textbook thing, get destination misaligned, copy biggest chunks possible

05:41 <doug16k> er, get destination aligned

05:41 <clever> that sounds similar to what ive seen in the official firmware

05:42 <clever> https://github.com/littlekernel/lk/blob/master/lib/libc/string/arch/arm/arm/memcpy.S vs https://github.com/littlekernel/lk/blob/master/lib/libc/string/memcpy.c

05:42 <bslsk05> github.com: lk/memcpy.S at master · littlekernel/lk · GitHub

05:42 <bslsk05> github.com: lk/memcpy.c at master · littlekernel/lk · GitHub

05:42 <doug16k> it works its way up to biggest chunks, does main loop in biggest chunks, then work your way down to smaller until byte

05:42 <clever> aha

05:42 <clever> the c code is doing exactly what i said earlier

05:42 <clever> byte-wise copy until both are 32bit aligned, then word-wise copy
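
A hedged sketch of the algorithm clever describes, not a copy of LK's `memcpy.c`. `sketch_memcpy` is a hypothetical name. It assumes GCC or Clang, because the word-sized accesses use the `may_alias` type attribute (a compiler extension) to stay clear of the strict-aliasing rules discussed earlier. The word loop only runs when source and destination share the same misalignment; otherwise everything falls through to the byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: a 32-bit type that may alias any object. */
typedef uint32_t __attribute__((__may_alias__)) word_t;

void *sketch_memcpy(void *dest, const void *src, size_t count)
{
    assert(count == 0 || (dest != NULL && src != NULL));
    uint8_t *d = dest;
    const uint8_t *s = src;

    /* Same low two bits: both pointers can reach 32-bit alignment. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & 3u) == 0) {
        /* Byte-wise head, at most 3 bytes, until aligned. */
        while (count > 0 && ((uintptr_t)d & 3u) != 0) {
            *d++ = *s++;
            count--;
        }
        /* Word-wise bulk copy. */
        while (count >= 4) {
            *(word_t *)d = *(const word_t *)s;
            d += 4;
            s += 4;
            count -= 4;
        }
    }

    /* Byte-wise tail, or the whole copy if alignments differ. */
    while (count > 0) {
        *d++ = *s++;
        count--;
    }
    return dest;
}
```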

05:43 <clever> its just a matter of gcc not compiling that in an optimal manner

05:43 <clever> and i kinda dont want to look at the official firmware, that feels too much like copying then :P

05:44 <clever> i forget exactly how they did it

05:45 <geist> ah yes that memcpy

05:45 <geist> i was quite proud of it

05:45 <clever> it looks to be the exact same algo as what ive seen in decompiles of the official firmware

05:45 <geist> i also wrote the one in darwin for arm32. i think it's still there

05:46 <clever> so i feel less bad about copying the algo in general

05:46 <clever> but the actual asm, i dont really want to copy

05:46 <geist> https://github.com/littlekernel/lk/blob/master/lib/libc/string/arch/arm/arm/memcpy.S#L124 is the real money shot. a fun trick you can do with arm32

05:46 <bslsk05> github.com: lk/memcpy.S at master · littlekernel/lk · GitHub

05:46 <clever> *looks*

05:47 <clever> why are you touching CPSR?

05:47 <clever> that feels very weird for memcpy to do

05:48 <clever> unless, are you abusing it, and conditional execution?

05:48 <clever> so you can skip some of the stores?

05:48 <doug16k> this is my memcpy that tries a bit https://github.com/doug65536/dgos/blob/master/kernel/arch/x86_64/cpu/isr.S#L2032

05:48 <bslsk05> github.com: dgos/isr.S at master · doug65536/dgos · GitHub

05:49 <geist> clever: that's the trick!

05:49 <clever> neat

05:49 <doug16k> too easy on x86 though

05:49 <clever> geist: the VPU could potentially pull off the same trick

05:49 sympt has quit [Ping timeout: 240 seconds]

05:50 <clever> since i can manipulate sr like that, and it has conditional execution

05:50 <geist> yeah., note this is all about trying to get things aligned properly

05:50 <geist> so that you can then do fast wordwise copies (via ldm/stm)

05:50 <clever> yeah, when you need to copy 1-3 bytes, to get both into alignment

05:50 <geist> since on arm32 that sort of thing mattered

05:50 <geist> this code actually copies up to 15 bytes, to align it on a 16 byte boundary

05:50 <geist> using that trick

05:51 <clever> i dont think vpu will benefit from anything more than 32bit alignment

05:51 <geist> https://github.com/darwin-on-arm/xnu/blob/master/osfmk/arm/bcopy.s is also some code i wrote

05:51 <bslsk05> github.com: xnu/bcopy.s at master · darwin-on-arm/xnu · GitHub

05:51 <doug16k> I'd get it cache line aligned if it is going to do big blocks

05:51 <geist> still there, hah

05:51 <clever> heh

05:51 <clever> i still need to get around to trying to build xnu

05:52 <clever> and a userland

05:52 <clever> but now that you mention 16 byte copies....

05:52 <clever> i can do a 4096 byte copy, in just 2 opcodes....

05:52 <clever> at the cost of trashing the vector regs

05:53 <clever> and i dont think the vector stuff has any real alignment requirements

05:53 __xor has joined #osdev

05:53 <geist> in the arm case it only really needed 4 bytes, but the 16 bytes you get somewhat for free, and the inner copy (.L_bigcopy) moves 32 bytes at a time

05:53 <geist> using ldm/stm

05:53 __xor has quit [Client Quit]

05:53 <clever> yeah, your mention of ldm/stm is what reminded me about using vector copies

05:54 <geist> arm64 defacto memcpy algorithm uses a whole different strategy, and doesn't concern itself much with alignment

05:54 <geist> since arm64 is intrinsically able to do unaligned access, and is generally not penalized any more than say x86 is. ie, it's okay to do unaligned and in most cases is probably just as fast

05:55 <clever> just the issue about the load/store not being atomic?

05:55 <geist> that is definitely the case, and where load/stores stride a cache line it may take an extra cycle or so, etc

05:55 <clever> i think youve mentioned before, that say a 32bit write, that is 32bit aligned, will be seen by other cores, as either having happened or not happened

05:55 <clever> yeah

05:55 <geist> that's right

05:56 <clever> and the exact numbers vary by core

05:56 <geist> both x86 and arm have a wordwise atomicity guarantee that's actually written down. but most of that only applies to native units that are aligned

05:56 <clever> for example, if i write 64 bits with an stm, on a 32bit core (so its storing 2 regs), what can a 2nd core observe?

05:56 <geist> no the architecture is more strict than that, it's not per core

05:57 <clever> so, could that 64bit write get shorn in half?

05:57 <doug16k> the only torn access I have heard of on x86 is some MMX

05:57 <geist> you are only guaranteed up to probably the native register size. stm/ldms are generally considered to be functionally equivalent to a series of load/stores in order

05:57 <geist> doubleplus so since the cpu can literally be interrupted in the middle of it

05:57 <clever> yeah, so that stm could get torn, and effectively is just 2 separately atomic 32bit writes

05:58 <geist> well, be careful throwing around words like 'atomic' here, since that has different meaning

05:58 <clever> in terms of what another core can see if it reads that addr during the write

05:58 <geist> also weak memory model, etc. but what you are guaranteed not to see in particular cases are torn writes

05:58 <geist> based on alignment, word size, etc

05:59 <clever> but what if i was to do that, on an aarch64 core, in aarch32 mode?

05:59 <geist> but due to weak memory model it doesn't guarantee that you'll see it in order, or at all

05:59 <geist> but you wont see half of it

05:59 <clever> would it still be treated as 2 32bit writes? because its from say r2 and r3

05:59 <geist> what case is this specifically?

05:59 mzxtuelkl has joined #osdev

05:59 <clever> an aarch64 core, but in aarch32 mode, doing an stm, to save r2+r3 to ram

06:00 knusbaum has quit [Ping timeout: 244 seconds]

06:00 MiningMarsh has quit [Quit: ZNC 1.8.2 - https://znc.in]

06:00 <geist> you are guaranteed according to the arm32 memory access rules

06:00 knusbaum has joined #osdev

06:00 <geist> will it do more? possibly. but that's not specced

06:00 <clever> makes sense

06:00 <clever> it only has to meet aarch32 rules, because its claiming to be aarch32

06:01 <geist> the obvious case is say ldm/stm that starts at offset 0xc

06:01 <geist> and it crosses into some new cache line

06:01 <geist> or more specifically, offset 0xffc

06:01 <geist> so you store a word at 0xffc and another at 0x1000

06:01 <clever> i'm assuming its also 32bit aligned

06:01 <geist> in arm32 rules that means two writes, and it would still be the case in arm64

06:01 <geist> but in arm64 if you stored a single 64bit word at 0xffc it wouldn't be guaranteed, because it was unaligned 64bit
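
The two address checks geist is reasoning about can be written down as a sketch. `is_naturally_aligned` and `crosses_boundary` are hypothetical helper names. They only test the preconditions mentioned in the chat (natural alignment, and whether an access straddles a cache line or page); the actual atomicity guarantees are whatever the architecture manuals specify.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if an access of `size` bytes (a power of two) at `addr`
 * is aligned to its own size. */
bool is_naturally_aligned(uint64_t addr, uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (uint64_t)(size - 1)) == 0;
}

/* True if [addr, addr + size) straddles a multiple of `boundary`
 * (a power of two, for example 64 for a cache line or 0x1000 for a page). */
bool crosses_boundary(uint64_t addr, uint32_t size, uint64_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (addr & (boundary - 1)) + size > boundary;
}
```

For the example in the chat, a 32-bit store at 0xffc is naturally aligned and stays inside the page, while a single 64-bit store at 0xffc is neither aligned nor contained.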

06:02 <clever> yeah

06:02 <geist> note x86 has pretty much the same rules. it's just strongly ordered

06:02 <geist> but you can get torn writes based on misalignment

06:02 <clever> and by ordered, you mean that an arm core could see either write as having happened first?

06:02 <geist> (and exceptions with various SSE things that have weaker rules)

06:03 <geist> for arm32 two writes? absolutely

06:03 <clever> while x86 always observes the writes happening in program order

06:03 <geist> right

06:03 <clever> any kind of rules, on how arm can reorder things?

06:03 <geist> so on x86 two 32bit writes back to back could show up as -- A- AB

06:03 <geist> yes. it can do what the fuck it wants, and barriers and things that generate barriers force it

06:03 <clever> but on arm, they could show up as -- A- AB -B ?

06:03 <geist> (though the rules are hella complicated)

06:04 MiningMarsh has joined #osdev

06:04 <geist> yes

06:04 <clever> where might i find the rules on how it re-orders things?

06:04 <geist> with arm the general base rule is 'outside of a barrier it can emit writes in whatever order the cpu wants to, but it must be internally consistent with itself'

06:05 <geist> but then there are a bunch of sub rules about how to get it to not do that in specific cases, but it's programmers problem

06:05 <clever> i would assume that a pair of back2back 32bit writes, would prefer to occur in order of address?

06:05 <geist> it doesn't. it doesn't say how it reorders things, it basically says 'the cpu is free to reorder things how it wants to (with some exceptions to that)'

06:05 <clever> so they can turn into an axi burst, for example

06:05 <doug16k> if it has a cache line already it can just stick the store in the line, even if there is a previous store that missed and must wait

06:06 <geist> exactly. it allows it to be much more lazy about how cache lines are filled, written back, etc. it's an implementation detail

06:06 <clever> doug16k: and that can make the missed store become visible after the others

06:06 <geist> removes a lot of complexity on the back end of the cpu

06:06 <doug16k> yep

06:06 <geist> *however* the rules also state that the cpu is ordered relative to itself

06:06 <clever> but if i'm doing a pair of 32bit writes, to the same cache line, they are either both going to go thru, or both miss

06:06 <geist> so it can't move reads before writes, etc

06:07 <clever> assuming another core doesnt steal the cache line mid-write

06:07 <geist> well, reads of the same line, before writes. etc. it has to act from a single core's point of view that it's in order

06:07 <geist> even if the cpu is highly OOO

06:08 <clever> if you're doing reads, against a pending write, does it peek into the write queue, and act like the write had already completed?

06:08 <geist> clever: also dont assume the cpu is running the instructions in order. it could have completely rearranged the sequence it ran the two stores in

06:08 <clever> so it can see its own writes in program order

06:08 <geist> based on what registers the stores depend on, etc

06:08 <clever> yeah, that starts to complicate it even further

06:09 <geist> if you had two stores for address A and B, and the A store writes a reg that depends on some complex logic it may 'get to' the B store immediately, issue it, and then the A store happens later in the pipeline

06:09 <geist> since it has detected that the two stores dont depend on each other

06:10 <geist> but this is where the memory barriers in ARM come in: DMB and DSB instructions. you're inserting barriers that have various semantics about drawing lines in the sand, saying this must happen before that, etc

06:10 <geist> thats a DMB in general. a DSB is more aggressively dumping all of the load/stores, in general

06:11 <clever> lets take xhci as an example, each message in the command rings has an "is valid" flag at some point

06:11 <clever> for that, you would write the message minus the valid, then flush the cache and issue a barrier?

06:11 <clever> then write the valid, flush the cache, and go on?

06:12 <geist> yes. also flushing the cache implicitly has barriers in it as part of the algorithm

06:12 <doug16k> barrier before writing the last 32 bits, then write last 32 bits

06:12 <clever> doug16k: but if its cached memory, you need to flush, and then there is the question of what order the flushing writes to ram....

06:12 <doug16k> yeah

06:12 <geist> https://github.com/littlekernel/lk/blob/master/arch/arm64/cache-ops.S#L25 that's why this dsb is there, basically

06:12 <bslsk05> github.com: lk/cache-ops.S at master · littlekernel/lk · GitHub

06:13 <clever> but, if this command ring is write-only, you want write-combined instead?

06:13 <geist> it forces everything out so that after that point things are cool

06:13 <clever> like a framebuffer

06:13 <doug16k> point is, you are guaranteeing that the previous 3 stores actually are globally visible before you even set the valid bit
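
A minimal sketch of "barrier before writing the last 32 bits" using C11 atomics. The `ring_entry` layout, the `ENTRY_VALID` bit and `publish_entry` are hypothetical and only loosely modeled on a command ring entry. The release fence orders the payload stores before the store that sets the valid bit, as seen by another CPU that reads the flag with acquire semantics. It does not cover the non-coherent device case raised in the chat, which would additionally need cache maintenance and the stronger barrier geist mentions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define ENTRY_VALID 1u  /* hypothetical: bit 0 of the control word */

struct ring_entry {
    uint32_t param_lo;
    uint32_t param_hi;
    uint32_t status;
    _Atomic uint32_t control;  /* written last, carries the valid bit */
};

void publish_entry(struct ring_entry *e, uint64_t param,
                   uint32_t status, uint32_t control)
{
    /* The caller must not set the valid bit itself. */
    assert((control & ENTRY_VALID) == 0);

    /* 1. Write everything except the valid bit. */
    e->param_lo = (uint32_t)param;
    e->param_hi = (uint32_t)(param >> 32);
    e->status = status;

    /* 2. Barrier: earlier stores become visible before later ones. */
    atomic_thread_fence(memory_order_release);

    /* 3. Write the final word with the valid bit set. */
    atomic_store_explicit(&e->control, control | ENTRY_VALID,
                          memory_order_relaxed);
}
```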

06:13 <doug16k> the cache flush kinda wrecks the example :D

06:14 <geist> yah ignoring the cache part, if you want to make sure what you wrote on cpu A is observable by other entities on the bus that participate in cache coherency, you issue a DSB to flush it out

06:14 <geist> that makes sure all stores that are pending actually make it out of the cpus write buffer, which you can kinda think of as an L0 cache

06:15 <clever> oh right, i'm assuming the xhci isnt coherent with the arm caches

06:15 <geist> note this is all when using 'normal memory' which is fully cached, etc. when you're reading/writing to pages you have mapped as 'device' or 'strongly ordered' theres a pile of additional rules that happen, and they're much more strict

06:15 <geist> but a lot of complexity WRT the ordering of outstanding 'normal' memory transactions and new uncached (device/strongly ordered) memory

06:15 <geist> that's where a ton of the subtle rules come in

06:16 <clever> lets assume that the xhci isnt coherent with the arm caches at all, and i choose to use write-combined, because this block of memory is write-only

06:16 <geist> if everything is just plain cached memory and you're not worrying about entities that dont participate in cache coherency (ie you're only thinking about other cpus and mmu TLB fetchers) then you're playing with normal memory barriers and weak memory model

06:16 <clever> write-combined, means the arm only stores 1 cacheline worth of data? and tracks what bytes are dirty, so it can flush it out safely, without knowing the old contents?

06:17 <geist> clever: then that's a completely different kettle of fish that i honestly dont remember the rules for

06:17 <clever> ah

06:17 <geist> no. write combined is a form of uncached

06:17 <geist> ie not 'normal memory'

06:17 <doug16k> clever, what you said is x86 WC memory

06:17 <doug16k> with the byte enables and no write allocate

06:17 <clever> doug16k: ahh, i was mostly guessing how it worked

06:17 <geist> it's one of the nGnRnE variants i think

06:18 <geist> basically, uh, i think nGRE? i forget

06:18 <clever> geist: i did recently read a doc you linked, that explained that tangle of letters, let me dig it up again...

06:18 <geist> though actually i think i'm wrong. there's a variant of 'normal memory' that i think covers what you want. lemme find it in fuchsia code

06:19 <geist> https://fuchsia.googlesource.com/fuchsia/+/refs/heads/main/zircon/kernel/arch/arm64/include/arch/arm64/mmu.h#269 this is it

06:19 <bslsk05> fuchsia.googlesource.com: zircon/kernel/arch/arm64/include/arch/arm64/mmu.h - fuchsia - Git at Google

06:19 <geist> it's a variant of 'normal memory' but treated as uncached + write combined

06:19 <clever> cross-referencing to my armv8 docs....

06:20 <geist> (for those that know x86, the MAIR register is basically PAT on x86. each page table has a 3 bit index into the MAIR which has a list of 8 different types of memory)

06:20 <geist> anyway like i said there's a bunch of complex rules with regards to how outstanding memory transactions are sorted with different cache properties

06:20 <geist> and those i always have to look up

06:21 <geist> and the general safe rule is to assume they're not sorted and insert barriers as appropriate

06:21 <clever> ok, i see an MAIR0 register...

06:21 <clever> AttrIndex[2] says if its reading MAIR0 or MAIR1

06:22 <clever> ah, but this is just a 32bit compat thing

06:22 <geist> MAIR has 8 fields of 8 bits, each describes a particular memory type you can define

06:22 <clever> on the 64bit side, its just MAIR_EL1

06:22 <clever> yeah, i see that in the 64bit reg

06:22 <geist> yah that's just because there are 3 bits in the page table entry that point to one of the 8 fields in the MAIR_EL1

06:23 <clever> and those 8 fields cant fit into a 32bit reg

06:23 <clever> so aarch32 cut it into 2 regs

06:23 the_lanetly_052 has joined #osdev

06:23 <geist> right. and though those 8 bits per field let you describe a ton of combinations of cache/uncached/device/etc bits, in practice only about 4 combinations are useful

06:23 <geist> and it's the 4 fuchsia has in the link

06:24 <geist> i think those are basically identical to linux's and freebsds

06:24 curi0 has joined #osdev

06:24 <clever> this reminds me of a remap thing (might be the same thing) that i saw in the paging tables

06:24 <clever> where originally, those 3 bits were the mode itself

06:24 <geist> you might want to define a read-allocated + write through variant, which you can

06:24 <clever> but now it has enough modes, that it needs 8bits to describe the mode

06:24 <geist> right that happened somewhere in the armv6/armv7 days

06:24 <geist> i think it was something like 'TEX remap' or whatnot

06:24 <clever> so its instead using the 3bit as an index, to one of 8 modes

06:24 <clever> kind of like a palette in an image

06:25 <geist> right

06:25 <geist> ie, the MAIR

06:25 <geist> 'memory attribute something register'

06:25 <clever> where you might only have 2 bits per pixel, but you can then assign 4 unique 32bit colors

06:25 <clever> MAIR_EL1, Memory Attribute Indirection Register (EL1)

06:26 <clever> so in the fuchsia code you linked, you're assigning slot 3 to 0x44, and then creating a constant that says to just shove 3 into the paging tables
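
The palette analogy can be made concrete with a sketch of how a MAIR_EL1 value is assembled and indexed. Only "slot 3 holds 0x44" comes from the chat; the other slot assignments here are assumptions for illustration and may not match fuchsia or linux. The attribute encodings themselves (0x00, 0x04, 0x44, 0xff) are standard ARMv8 values, and the 3-bit index sits in bits [4:2] of a stage 1 page table entry. All helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Standard ARMv8 attribute encodings (8 bits each). */
#define ATTR_DEVICE_nGnRnE   0x00u
#define ATTR_DEVICE_nGnRE    0x04u
#define ATTR_NORMAL_UNCACHED 0x44u  /* inner and outer non-cacheable */
#define ATTR_NORMAL_WB       0xffu  /* write-back, read/write allocate */

/* Place an 8-bit attribute into one of the 8 MAIR slots. */
uint64_t mair_attr(uint8_t attr, unsigned idx)
{
    assert(idx < 8);
    return (uint64_t)attr << (idx * 8);
}

/* Hypothetical slot layout; only slot 3 = 0x44 is from the chat. */
uint64_t mair_value(void)
{
    return mair_attr(ATTR_DEVICE_nGnRnE, 0) |
           mair_attr(ATTR_DEVICE_nGnRE, 1) |
           mair_attr(ATTR_NORMAL_WB, 2) |
           mair_attr(ATTR_NORMAL_UNCACHED, 3);
}

/* Look up which attribute a given index selects, like a palette. */
uint8_t mair_field(uint64_t mair, unsigned idx)
{
    assert(idx < 8);
    return (uint8_t)(mair >> (idx * 8));
}

/* AttrIndx bits to OR into a stage 1 descriptor (bits [4:2]). */
uint64_t pte_attr_index(unsigned idx)
{
    assert(idx < 8);
    return (uint64_t)idx << 2;
}
```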

06:26 <geist> looks like linux uses 5 variants: https://github.com/torvalds/linux/blob/eaa54b1458ca84092e513d554dd6d234245e6bef/arch/arm64/mm/proc.S#L62

06:26 <bslsk05> github.com: linux/proc.S at eaa54b1458ca84092e513d554dd6d234245e6bef · torvalds/linux · GitHub

06:26 <geist> right

06:26 <curi0> whats the typical process for moving a BAR of a PCI device like ?

06:27 <curi0> i'm trying to understand how linux does it

06:27 <clever> ah, and you're clearly defining all 8 codes, on separate defines, while linux mashed them all into 1

06:27 <geist> curi0: in general you just write to the BAR, but the hard part is knowing what to put there, and allocate it

06:27 <geist> the hardware itself you simply write to it and it takes effect immediately

06:27 <doug16k> curi0, it describes it in the PCI spec. there is a procedure to autodetect how big it needs to be and that tells you its alignment too

06:28 <geist> correct. annoyingly since they were trying to be compact they didn't just define another config field that says 'its this big' which would have been really damn amazing

06:28 <geist> since the only way to determine its size is to temporarily write all 1s to it and read back which bits are unimplemented, really annoying

06:29 <clever> ive done the same thing to figure out how some VPU stuff needs to be aligned

06:29 <doug16k> yeah. I guess they figured it was clever to just let the unimplemented bits tell you

06:29 <clever> handy trick, when you can read the value back, and they dont implement bits that you shouldnt be setting
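
The sizing procedure can be sketched against a simulated register. `fake_bar`, `bar_read`, `bar_write` and `bar_probe_size` are hypothetical; a real implementation would go through PCI config space accessors and would also handle I/O BARs and 64-bit BAR pairs. The model covers a 32-bit memory BAR, where the low 4 bits are read-only type flags and the address bits below the BAR's size are hardwired to zero.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated 32-bit memory BAR. */
struct fake_bar {
    uint32_t reg;    /* current register contents */
    uint32_t size;   /* power of two, at least 16 bytes */
    uint32_t flags;  /* read-only low 4 bits (type, prefetchable) */
};

uint32_t bar_read(const struct fake_bar *b)
{
    return b->reg;
}

/* Writes only stick in the implemented address bits. */
void bar_write(struct fake_bar *b, uint32_t v)
{
    assert(b->size >= 16 && (b->size & (b->size - 1)) == 0);
    b->reg = (v & ~(b->size - 1)) | (b->flags & 0xFu);
}

/* Save, write all 1s, read back, restore, derive the size. */
uint32_t bar_probe_size(struct fake_bar *b)
{
    uint32_t saved = bar_read(b);
    bar_write(b, 0xFFFFFFFFu);
    uint32_t mask = bar_read(b) & ~0xFu;  /* drop the flag bits */
    bar_write(b, saved);                  /* BAR is valid again */
    return (uint32_t)(~mask + 1u);
}
```

Between the all-1s write and the restore the BAR does not point at its assigned address, which is the window geist describes in the fuchsia bug a few lines later.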

06:29 <curi0> what about moving it away from what the BIOS assigns ? i can see this in my kernel log for BAR 0 "releasing [mem 0x600000000-0x60fffffff 64bit pref]" and "assigned [mem 0x400000000-0x5ffffffff 64bit pref]"

06:29 <geist> we actually have an outstanding bug in fuchsia where the user space pci driver is probing the PCI bus and for a fraction of an instant must overwrite every BAR

06:29 <curi0> its also resizing it but thats not relevant for me rn

06:29 <geist> which if the kernel happens to use that device implicitly in that window. boom

06:30 <clever> geist: mutex time?

06:30 <geist> hard, because user space

06:30 <clever> yeah

06:30 <geist> have to basically freeze all cpus

06:30 <clever> thats the kind of thing that pci in kernelspace gets for free

06:30 <geist> clever: and when pci in kernelspace scans things when its still single cpu

06:31 <clever> that makes it even simpler!

06:31 <doug16k> why not make a shadow copy of all the config spaces in ram and use that?

06:31 <geist> you could bump into it if you say were running more than one cpu and one cpu is scanning the pci bus and the other one is writing to the framebuffer

06:31 <geist> or a pci serial card (the case we hit in fuchsia)

06:32 <doug16k> with the 1111 masks in the BARS I mean

06:32 <curi0> my default my BIOS assigns a 256MB sized bar at 0x600000000. however there is not enough room there for it to resize to 8GB. so the amdgpu driver tells kernel to find 8GB free and it does at 0x300000000

06:32 <clever> if you were scanning a device after smp is up, you would need some kind of mutex over the whole pci card

06:32 <geist> doug16k: how would that help? you have to write to it to see it, and in that instant the BAR is unconfigured

06:32 <clever> and the serial/framebuffer code would have to grab it every time it touches the device, yuck

06:32 <curi0> for now i just want to figure out how to move it and resizing i will figure out later

06:32 <curi0> *by default my BIOS

06:32 <clever> curi0: i assume you would just look for a hole after ram, and shove it there?

06:32 <geist> curi0: well, like i said you can simply do it but you need to allocate the space. the allocation is the hard part

06:33 <doug16k> geist, oh I figured you could intercept their config accesses and know what to present them

06:33 <geist> especially if the device is on the other side of a bridge, because you also have to adjust the bridge to cover the new zone

06:33 <clever> i would assume the BAR can go anywhere in the 64 (or 48?) bit addr space

06:33 <curi0> clever, yup

06:33 <curi0> yes it adjusts the bridge too

06:34 <geist> clever: there are some limitations there. specifically bridges only can bridge 64bit 'prefetchable' memory (there's a separate config field for that)

06:34 <curi0> do i need to copy memory from the old bar location to the new one or not ?

06:34 <doug16k> curi0, that isn't what BARS mean

06:34 <geist> so anything that is 64bit must be either on the root bus (and thus behind no bridge) or intrinsically prefetchable, because if it's on the other side of a bridge it has to also be prefetchable

06:35 <doug16k> BARS don't say where to get it from RAM

06:35 <clever> geist: ah, here is the thing you linked earlier, that explains the letter soup in the linux source: https://developer.arm.com/documentation/100941/0101/Memory-types?lang=en

06:35 <bslsk05> developer.arm.com: Documentation – Arm Developer

06:35 <doug16k> BARS say where to the device address range into the address space

06:35 <doug16k> where to insert*

06:35 <curi0> thanks for explanation

06:36 <geist> yah in this case think of the video card as having 256MB (or 8GB) of internal memory that it's dumping onto the cpu's memory bus

06:36 <doug16k> it configures what range of addresses to make the device go "oh oh that's me!" on the PCI bus

06:36 <geist> and the BAR says how big the window is and where to put it

06:36 <clever> curi0: my understanding, is that the BAR maps a chunk of memory on the pci device, to some physical address, so changing the BAR, just moves the ram to a new addr, and whatever data was in that ram, is moved along with it

06:36 jack_rabbit has joined #osdev

06:36 <geist> normally things like say an ethernet card may have fairly small bars, like 4K or 16K because all they're doing is presenting some memory mapped registers

06:36 <Andrew> Turns out that cross compilers are inevitable

06:37 <clever> i did also recently see mention, about how a GPU may have 8gig of ram, but the BAR is only 256mb in size, and can never grow bigger

06:37 knusbaum has quit [Ping timeout: 244 seconds]

06:37 <clever> and you use MMIO to change where in gpu ram, that 256mb window points
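
A rough sketch of the windowing arithmetic clever mentions, assuming a fixed-size aperture that can be pointed at any aperture-aligned offset in GPU RAM. The struct and function names are made up for illustration, and no real GPU register is modeled.

```c
#include <stdint.h>

/* Where a GPU RAM address lands once the aperture is repositioned. */
struct window_access {
    uint64_t window_base;  /* value a hypothetical "aperture base" register would get */
    uint64_t offset;       /* offset to use inside the BAR */
};

/* window_size must be a power of two (for example 256 MiB). */
static struct window_access locate_in_window(uint64_t vram_addr, uint64_t window_size)
{
    struct window_access a;
    a.window_base = vram_addr & ~(window_size - 1);
    a.offset      = vram_addr & (window_size - 1);
    return a;
}
```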

06:37 <doug16k> clever, you use DMA for everything

06:37 <geist> right, but theres some new resizable bar feature that some newer cards like that i think expands that notion

06:37 <clever> however, if bus mastering is enabled, the gpu can read any host ram

06:37 <clever> doug16k: yep, exactly

06:37 <curi0> mine does become 8GB

06:37 <curi0> resizable BAR capability

06:37 <geist> so presumably there is some newer feature that actually likes to have large windows into gpu ram

06:37 <curi0> but im not interested in doing that for now

06:37 <clever> so instead of moving the 256mb BAR to point to different gpu ram

06:37 <clever> you tell the gpu to dma stuff from host ram to gpu ram

06:38 <clever> and it fills itself

06:38 <doug16k> the CPU can just memcpy into the GPU instead of scheduling DMA

06:38 <Andrew> gcc -m32 -nostdlib -nostdinc -fno-builtin -fno-stack-protector -no-pie -fno-pic -c kernel.c -o kernel.o && ld -m elf_i386 -Tlink.ld -o kernel.elf kernel.o # ld gives some weird 'i386 input of kernel.o incompatible with i386:x86-64 output'

06:38 <doug16k> if they lock a buffer, you just access it where it really is, if you want

06:38 <Andrew> I finally understood why people say that not using cross compilers -> huge pain

06:39 ddevault has quit [Write error: Connection reset by peer]

06:39 jleightcap has quit [Read error: Connection reset by peer]

06:39 patwid has quit [Read error: Connection reset by peer]

06:39 gjnoonan has quit [Read error: Connection reset by peer]

06:39 alethkit has quit [Read error: Connection reset by peer]

06:39 exec64 has quit [Read error: Connection reset by peer]

06:39 tom5760 has quit [Read error: Connection reset by peer]

06:39 sm2n has quit [Read error: Connection reset by peer]

06:39 milesrout has quit [Read error: Connection reset by peer]

06:39 <geist> Andrew: yeah it's much simpler to just have a cross compiler that directly does what you want. doesn't mean you can't force your native one to generate code how you want, but it's an extra thing you have to futz with

06:39 ddevault has joined #osdev

06:39 gjnoonan has joined #osdev

06:39 <doug16k> if they did the lock where it is write only and discard so it can just do the stores right into GPU memory with write combined stores of all line bursts

06:40 <geist> and it's harder to get people to help you because your setup is intrinsically a special snowflake in many cases

06:40 tom5760 has joined #osdev

06:40 milesrout has joined #osdev

06:40 patwid has joined #osdev

06:40 jleightcap has joined #osdev

06:40 exec64 has joined #osdev

06:40 alethkit has joined #osdev

06:40 sm2n has joined #osdev

06:40 <geist> it's a bunch of extra variables that enter the picture that may be specific to your setup

06:42 <clever> geist: aha, found all of the right sections, so your 0x44, says its device, non gathering, non-orderable, with early write ack, nGnRE, but in the case of xhci command rings, i think gathering wouldnt be an issue?

06:43 <geist> with appropriate barriers yes

06:43 <clever> G vs nG seems like a thing for true mmio

06:43 <geist> i mean probably. honestly my brain is turning off

06:43 <clever> where the write itself, has side-effects

06:43 <geist> it requires full attention to grok arm cache shit

06:43 <clever> yeah

06:44 <geist> https://weblog.jamisbuck.org/2010/12/29/maze-generation-eller-s-algorithm is what i was looking at actually

06:44 <bslsk05> weblog.jamisbuck.org: Buckblog: Maze Generation: Eller's Algorithm

06:44 <geist> was thinking of bashing together some C code for that. i have the old MAZE.BAS here on my altair

06:44 <geist> i used to love this thing, it's a cool little algorithm

06:44 <curi0> so to change the BAR address do I just have to write the new address to the BAR register in PCI configuration space (after configuring the bridge) ?

06:44 <clever> so if i'm understanding things right, i could change the MAIR field to 0x4c, and it may slightly improve performance, but i havent checked what the other 4 does

06:45 <doug16k> curi0, not really, but if you read it back, and it is still the same value, then sure

06:45 <geist> clever: well, you had darn well better know what you're doing

06:45 <doug16k> if not then you are violating alignment or something

06:45 <geist> there are no free lunches

06:45 <curi0> whats the process for changing the address then ?

06:45 <curi0> i tried looking through the linux kernel and couldnt find

06:45 <clever> geist: yeah, i would also want barriers, and to test things heavily

06:46 <clever> hmmm, but then the early-ack, does that lie to a barrier opcode?

06:46 <geist> again the process is easy: you just write the new value. the hard part is knowing the value, allocating it and making sure it doesn't overlap, and making sure that particular hardware can handle it, and making sure any bridges in front of it can handle it

06:46 <doug16k> curi0, you will luckily give an address that is aligned properly with the right bits set in low 3 bits, or you will not do it right, and it will not read back what you wrote
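
A small sketch of the check doug16k is hinting at, assuming a memory BAR: the base must be a multiple of the BAR's size, and for a 64-bit BAR the address is split across two consecutive config dwords. As far as the PCI spec goes, the low four bits of a memory BAR are read-only type flags, so this sketch leaves them zero. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a legal base for a memory BAR of the given size.
 * Memory BARs are power-of-two sized and at least 16 bytes. */
static bool bar_addr_ok(uint64_t addr, uint64_t size)
{
    if (size < 16 || (size & (size - 1)) != 0)
        return false;
    return (addr & (size - 1)) == 0;  /* naturally aligned */
}

/* Split a 64-bit base into the low and high dwords of a 64-bit BAR pair. */
static void bar64_split(uint64_t addr, uint32_t *lo, uint32_t *hi)
{
    *lo = (uint32_t)(addr & 0xFFFFFFF0u);  /* flag bits left clear */
    *hi = (uint32_t)(addr >> 32);
}
```

If the value read back after the write differs from what was written in the address bits, the base was not aligned to the BAR size, which matches the read-back test suggested in the chat.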

06:47 <curi0> thanks i'll try these in a vm

06:47 <clever> geist: oh, and now i realize, i decoded your 0x44 wrong already, starting over!

06:47 <geist> we're not being dismissive. it's just complicated

06:47 <geist> clever: yeah 0x44 is a variant of the normal memory stuff

06:47 <clever> yep

06:48 <geist> curi0: is this in the context of your kernel or something?

06:48 <curi0> im just trying to understand how linux moves BARs

06:48 <curi0> nothing else really

06:48 <clever> geist: which is just normal memory, outer non-cacheable, inner non-cacheable, so the letter soup i decoded was meaningless!
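
For reference, a tiny decoder for one MAIR attribute byte, covering only the encodings that come up here. It is a sketch based on the Armv8-A MAIR_ELx layout (upper nibble outer, lower nibble inner, upper nibble 0 meaning Device memory); the function name is made up, and everything other than Device and fully non-cacheable Normal memory is lumped together.

```c
#include <stdint.h>

/* Classify a single 8-bit MAIR_ELx attribute field. */
static const char *mair_attr_kind(uint8_t attr)
{
    uint8_t outer = attr >> 4;
    uint8_t inner = attr & 0xF;

    if (outer == 0) {              /* 0b0000dd00: Device memory */
        switch (inner) {
        case 0x0: return "Device-nGnRnE";
        case 0x4: return "Device-nGnRE";
        case 0x8: return "Device-nGRE";
        case 0xC: return "Device-GRE";
        default:  return "unpredictable";
        }
    }
    if (outer == 0x4 && inner == 0x4)
        return "Normal, non-cacheable";
    return "Normal, other cacheability";
}
```

So 0x04 is the Device-nGnRE encoding clever first read out, while 0x44 decodes as Normal memory that is non-cacheable in both inner and outer domains.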

06:48 <geist> sure but how are you going to test it?

06:48 <curi0> efi program

06:48 <curi0> already reading stuff with it

06:48 <geist> and in general linux doesn't move bars unless it has to

06:48 <geist> usually it leaves it alone, (on an x86 machine at least)

06:49 <curi0> for me it does because thats the only way it can find space for an 8GB bar on my system

06:49 <clever> 2022-07-07 03:29:34 < curi0> what about moving it away from what the BIOS assigns ? i can see this in my kernel log for BAR 0 "releasing [mem 0x600000000-0x60fffffff 64bit pref]" and "assigned [mem

06:49 <clever> no such messages on my system

06:50 <geist> what hardware is this for?

06:50 <geist> probably a fairly new vid card with the resizable bar stuff

06:50 <clever> and my largest bar is 256mb

06:50 <curi0> amd rx 580 8gb and i5 3470

06:50 <geist> ah yeah it's fairly new i think

06:51 <curi0> 5 years

06:51 <geist> i'll have to check mine (geforce 1080ti) as far as i can tell that gen nvidia didn't like to do that

06:51 <geist> but possible there's some new driver bits that do it

06:51 <curi0> i think it works all the way back to R9s from 2013

06:52 <curi0> amd has had support for a while on linux

06:52 <geist> anyway, like we said moving the bar is easy, allocating it and knowing it's okay to is harder. linux has a whole pci bus driver that has a holistic view of the world, so it knows where to allocate new space

06:53 <curi0> yeah im hoping uefi has anything that will make that easier (i havent found anything so far)

06:54 <geist> yah also keep in mind it's likely that the video immediately explodes if you move it

06:54 <geist> because now the framebuffer will be in a new spot

06:55 <geist> unless it's in a separate BAR

06:55 <doug16k> is it integrated?

06:56 <geist> not if it's an rx 580

06:56 <geist> and a bridge was mentioned. integrated stuff tends to be on the root bus

06:56 <doug16k> yeah if code is expecting the framebuffer and stuff at one place and you move it, then it will not work of course

06:57 <geist> i was wondering how mundane devices get 64bit mapped recently when writing a full pci driver for LK but found out that basically they were all on the root bus

06:57 <geist> ie, some derpy AHCI device or e1000 can sit in 64bit range non-prefetchable because they dont have to sit behind a bridge

06:57 <geist> which would force the 64bit range to be prefetch

06:58 <geist> honestly surprised PCI hasn't specced some sort of new bridge type that fixes that bug

06:58 <doug16k> what device is it where you care if it is prefetchable?

06:59 <geist> only true prefetchable bars i've seen are for vid card BARs

06:59 <doug16k> oh wait, you mean it makes the accesses prefetchable when they weren't?

06:59 <geist> but presumably something like e1000's mmio bars should *not* be prefetchable

06:59 <geist> but if they were behind a bridge they'd have to be forced to if using a >4GB address

06:59 <geist> due to the limitation of the legacy bridge spec, that only gets 64bit addresses for prefetch regions

07:00 <geist> i was surprised to discover this

07:01 <geist> so what i've seen is e1000s and whatnot that declare they have a 64bit non prefetch BAR, which means functionally they're forced to be 32bit BARs unless they're on the root bus (ie, built into the soc)
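
The rule geist describes can be written down as a small predicate. This is a sketch under the assumption of a classic PCI-to-PCI bridge, whose non-prefetchable memory window is 32-bit only and whose prefetchable window can carry 64-bit addresses; the struct and field names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* What an allocator would need to know about one BAR. */
struct bar_req {
    bool is_64bit;       /* BAR advertises the 64-bit memory type */
    bool prefetchable;   /* BAR advertises prefetchable */
    bool behind_bridge;  /* device sits below a PCI-to-PCI bridge */
};

/* Can this BAR be given an address at or above 4 GiB? */
static bool can_place_above_4g(const struct bar_req *r)
{
    if (!r->is_64bit)
        return false;        /* a 32-bit BAR cannot hold the address */
    if (!r->behind_bridge)
        return true;         /* root bus: no bridge window to fit into */
    return r->prefetchable;  /* only the prefetchable window is 64-bit */
}
```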

07:04 <doug16k> BARs that are huge tend to be prefetchable (framebuffers)

07:04 <doug16k> non-prefetchable tend to be tiny

07:04 <doug16k> so I guess nobody is worried that non prefetchable have to be < 4G

07:04 <clever> and the smaller it is, the more easily you can pack it into the <4gig

07:05 <clever> although, wont those BAR's be covering up ram?

07:05 <doug16k> not necessarily

07:05 <doug16k> the hole can be remapped to the end of ram

07:06 <clever> how?

07:06 <geist> in the very early days of x86 machines getting close to 4GB ram, yes. but pretty quickly x86 machines started remapping top of ram that would be covered up to >4GB. so you get a discontiguity in ram

07:06 <geist> it's intel or AMD specific, but there are various SOC control registers that let you set where the top of 'low' memory is

07:06 <geist> (TOLUD, etc)

07:06 <clever> ah

07:06 <doug16k> yeah, people hardly care anymore about the lost ram

07:07 <doug16k> at first everyone was freaking out because microsoft used PAE to coerce you into getting server version

07:07 <geist> which functionally is telling the intel/amd socs where to stop decoding DRAM for the <4GB hole, and then another one that says where to stop decoding DRAM > 4GB

07:07 <clever> so you could set the "low" (<4gig?) memory to end at say 1gig?

07:07 <clever> and then you have a 3gig hole, and the rest of the ram is >4gig?

07:07 <doug16k> so there was little point in remap

07:07 <geist> so on a machine with say 0-3GB, then 4GB - 11GB there may be two control registers somewhere: one set to 3GB and another one at 11GB

07:07 <clever> yeah, makes sense

07:08 <geist> then that's intrinsically priming the built in address decoders where to redirect transactions to the dram controller vs everything else
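
A sketch of the decode rule in geist's example, with two made-up parameters standing in for the "top of low memory" and "top of high memory" registers. Real chipsets have more ranges (legacy holes, SMM and so on), so treat this as an illustration of the idea only.

```c
#include <stdint.h>

enum target { TARGET_DRAM, TARGET_MMIO };

/* Route a physical address to DRAM or to everything else.
 * top_low:  end of DRAM below 4 GiB (3 GiB in the example above)
 * top_high: end of DRAM remapped above 4 GiB (11 GiB in the example) */
static enum target decode_phys(uint64_t addr, uint64_t top_low, uint64_t top_high)
{
    const uint64_t four_gib = 1ULL << 32;

    if (addr < top_low)
        return TARGET_DRAM;
    if (addr >= four_gib && addr < top_high)
        return TARGET_DRAM;
    return TARGET_MMIO;  /* the hole below 4 GiB, and anything past top_high */
}
```

With 3 GiB and 11 GiB as the two limits, the machine exposes 10 GiB of RAM and leaves 3 GiB to 4 GiB free for PCI BARs.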

07:08 <clever> so you can artificially create your own hole in ram

07:08 <clever> and then shove all of the pci stuff in that hole

07:08 <geist> yah this is part of what the bios has to set up

07:08 <geist> you can actually find these registers, they're somewhat documented. i think the AMD ones are called TOLUD and TOLUD2. iirc. they're MSRs

07:08 <geist> intel has another similar set. i think in pci device 0:0.0

07:08 <geist> or something like that

07:09 <clever> in the latest linus tech tips, they showed off some more of what intel developers are doing

07:09 <clever> and they modified the cpuid registers via a debug interface, WHILE WINDOWS WAS RUNNING

07:09 <geist> or maybe the other way around and the intel one is TOLUD and the AMD one is something else

07:09 <clever> then re-opened cpu-z, and the cpuid said LTT

07:09 <doug16k> TOPMEM?

07:09 <geist> yah, TOPMEM

07:09 <geist> that's the AMD ones. right?

07:09 <doug16k> think so yeah

07:10 <geist> clever: re: the cpuid stuff, AMD at least documents a lot of that. there are some control MSRs that let you set various 'hard coded' stuff that shows up in cpuid

07:10 <geist> basically things that the bios can do to override features so the os doesn't think it's there, etc

07:10 <clever> geist: except this was going via a backdoor debug channel, and the os wasnt having to co-operate

07:10 <geist> that too

07:11 <geist> lots of those bits are not necessarily hard coded. filled in with microcode, or MSRs or whatnot. probably just sram cells sitting there

07:11 <clever> yeah

07:11 <clever> https://www.youtube.com/watch?v=pyVZ05SO0Ic

07:11 <bslsk05> 'I saved the best for last' by Linus Tech Tips (00:18:33)

07:11 <geist> i mean i dont know what they did but based on what i've seen and heard, i'm totally not surprised

07:12 <clever> they did also mess with the overclock over that port

07:12 <clever> which ive done too

07:12 <geist> reading the BKDG for AMD 15h is one of the most informative things about how the sausage is made i've read in a while. it really goes into pretty intricate detail

07:12 <geist> and one can assume that intel has similar amounts of control knobs

07:13 <clever> my desktop motherboard is for "gaming", and has a special usb-host port, that can identify itself as an HID device if you push a button

07:13 <geist> anyway off to bed i go

07:13 <clever> and then using a closed-source app, you can mess with the clock stuff

07:13 <geist> i meant to go like an hour and a half ago

07:13 <geist> damn you clever!

07:13 <clever> same

07:13 <clever> damn you too!

07:13 <clever> 4am

07:15 <doug16k> I can't stop thinking about meltdown during this video

07:24 the_lanetly_052 has quit [Ping timeout: 244 seconds]

07:29 <doug16k> I'd ask them how many performance counter errata are going to be in 12th gen

07:34 <Andrew> Building binutils at the moment

07:34 <Andrew> Wish me luck :D

07:35 toluene has quit [Read error: Connection reset by peer]

07:37 <doug16k> hmmm, I wonder if any virus scanners detect an infinite loop that increments a variable as a precision timer for sidechannels

07:38 toluene has joined #osdev

07:38 <moon-child> if you're native code, presumably you can just rdtsc

07:38 <moon-child> and don't need a spincrement loop

07:39 <doug16k> there are facilities to put a divisor on it right?

07:39 <moon-child> maybe?

07:39 <moon-child> I haven't heard of any, but that doesn't mean they don't exist

07:39 <doug16k> I think so. anyway, you can disable the whole thing too

07:42 kaichiuchi has quit [Ping timeout: 244 seconds]

07:42 kazinsal has quit [Read error: Connection reset by peer]

07:42 danlarkin has quit [Read error: Connection reset by peer]

07:42 kazinsal has joined #osdev

07:42 kaichiuchi has joined #osdev

07:42 danlarkin has joined #osdev

07:44 arch-angel has quit [Read error: Connection reset by peer]

07:44 <doug16k> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/timestamp-counter-scaling-virtualization-white-paper.pdf

07:47 <doug16k> makes the cpu do a tsc * 16.48 fixedpoint scale factor and give that to guest

07:47 xenos1984 has quit [Read error: Connection reset by peer]

07:47 <doug16k> host decides how much tinfoil to use

07:49 <doug16k> can use so much tinfoil, it increments every 22 hours with 3.5GHz base

07:50 <doug16k> or make it seem like your base frequency is 229 terahertz

07:52 <doug16k> not sure it allows 1 in top 16 bits, it would be funny, I'm kidding

07:53 <doug16k> makes me wonder if that is for hosts to better falsify CPUID :D

07:54 <doug16k> pretend it is some better cpu, make the base frequency look right

07:55 <doug16k> that would be what the upper 16 bits are for. lower 16 would be for scaling down precision to prevent usermode spectre

07:55 <doug16k> spamming increment in thread would be the workaround

07:56 <doug16k> lower 48 I mean
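
The 16.48 fixed-point scaling can be sketched in a few lines. This assumes the guest value is the 128-bit product of host TSC and multiplier shifted right by 48, ignores the separate TSC offset, and relies on `unsigned __int128`, which is a GCC/Clang extension rather than standard C. The function name is made up.

```c
#include <stdint.h>

#define TSC_FRAC_BITS 48  /* low 48 bits of the multiplier are the fraction */

/* Scale a host TSC reading by a 16.48 fixed-point multiplier. */
static uint64_t scale_tsc(uint64_t host_tsc, uint64_t multiplier)
{
    unsigned __int128 product = (unsigned __int128)host_tsc * multiplier;
    return (uint64_t)(product >> TSC_FRAC_BITS);
}
```

The two extremes doug16k quotes fall out of the format: with the smallest multiplier (1, meaning 2^-48) the guest counter ticks once per 2^48 host cycles, about 80,000 seconds or roughly 22 hours at 3.5 GHz, and with the largest integer part (65535) a 3.5 GHz host looks like roughly 229 THz.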

07:58 arch-angel has joined #osdev

07:58 curi0 has quit [Remote host closed the connection]

07:59 <doug16k> funny it is for VM guests and not in general

08:05 xenos1984 has joined #osdev

08:11 liz has joined #osdev

08:19 LostCarcosa has quit [Quit: Leaving]

08:56 GeDaMo has joined #osdev

09:19 <Andrew> Ahhhhh

09:19 <Andrew> Using cross compilation makes things so much better

09:26 <zid> yes, yes it does

09:26 <zid> makes what better? :P

09:30 arch-angel has quit [Quit: Leaving]

09:31 arch-angel has joined #osdev

09:36 arch-angel has quit [Client Quit]

09:36 <mjg_> real os developers write bytecode by hand until they have a self-hosting kernel

09:38 <Mutabah> speaking of that... did you see the GDQ "Triforce%" run?

09:38 <Mutabah> (bootloaders/bytecode... and by "hand")

09:39 <mjg_> wut?

09:39 <Mutabah> Summer Games Done Quick featured a showcase of Ocarina of Time

09:40 <zid> If you want me to explain any of it I probably can mutabah

09:40 <mjg_> i only find "live reaction" videos

09:40 <Mutabah> Arbitrary code execution, used to show off beta content left on the cartridge... and then some

09:40 <zid> I know a fair amount of OoT

09:40 <Mutabah> zid: Oh, I saw it live, and a nice breakdown of how it was performed

09:40 <mjg_> heh, solid. i'll take a look later, fortunately i never played any of the zelda games

09:40 <zid> performed sure

09:41 <zid> but I know what it *does* :p

09:41 <mjg_> what cpu was that running on?

09:41 <zid> mips

09:41 <Mutabah> The breakdown video mentioned that they limited the interface to what could be done with a physical controller

09:42 SGautam has joined #osdev

09:42 <Mutabah> i.e. didn't use two bits that didn't have buttons, but controllable over the controller interface

09:42 <zid> so you're not going to ask how SRM works? :( *sadge*

09:43 <Mutabah> :) I've watched enough speedruns of Zelda64 to know :D

09:43 <zid> boo

09:43 <mjg_> https://www.youtube.com/watch?v=PNbkv_DJ0f0 is this the video?

09:43 <bslsk05> 'Ocarina of Time TAS by dwangoAC, TASBot, Savestate, Sauraen in 53:05 - Summer Games Done Quick 2022' by Games Done Quick (01:12:43)

09:43 <zid> go on then, you explain it

09:43 <zid> I'm waiting :D

09:43 <Mutabah> and triggered enough use-after-frees :)

09:43 <zid> nod, they have a bunch of silly internal jargon, but it's technically a use-after-free that they're exploiting

09:44 <Mutabah> You pick up an object while it's not being rendered (through camera manipulation), which ties its position/rotation to link's position (by updating it every frame)

09:44 <zid> (superslide teleporting through a load volume with a strength upgrade usually)

09:44 <zid> but there's other methods that achieve the same result like remote camera

09:44 <Mutabah> You then leave the map, causing it to be unloaded while still held (... probably because it wasn't being properly tracked due to being off-screen originally)

09:45 <Mutabah> Next time you transition maps, a new object is loaded over it... but link is still "holding" the object, so this new random object (or even random blob of code) gets clobbered with the position/rotation updates

09:45 <zid> It's just a disconnect between link being able to or knowing he should, drop things, and things freeing

09:45 <zid> the superslide teleport locks link into the grab

09:46 gildasio has joined #osdev

09:46 <zid> remote camera there's just no reason for link to stop holding the thing anyway

09:46 <Mutabah> mjg_: That was the run, https://www.youtube.com/watch?v=qBK1sq1BQ2Q is the breakdown

09:46 <bslsk05> 'Finally Obtaining the Triforce in Ocarina of Time: Triforce Percent Explained' by Retro Game Mechanics Explained (00:34:24)

09:46 <zid> offset in the object (actor) they tend to overwrite is the draw function pointer
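
The shape of the bug can be mimicked in well-defined C by modelling the heap as a reusable array of words. Everything here is hypothetical (the slot layout, the `DRAW_FN_WORD` offset, the function names); it only illustrates how a stale "held" reference lets per-frame position and rotation writes land on whatever now occupies the slot.

```c
#include <stdint.h>

#define SLOT_WORDS   8
#define DRAW_FN_WORD 2  /* hypothetical: word where the new actor keeps its draw handler */

static uint32_t heap_slot[SLOT_WORDS];  /* one allocation that gets reused */
static int slot_in_use;
static uint32_t *held;                  /* object the player is carrying */

/* Pick up an object: the player now tracks its storage. */
static void grab_object(void)
{
    slot_in_use = 1;
    held = heap_slot;
}

/* Leaving the map frees the slot, but nothing clears `held`. */
static void unload_map(void)
{
    slot_in_use = 0;
}

/* A new actor is loaded into the freed slot. */
static void spawn_actor(uint32_t draw_fn)
{
    slot_in_use = 1;
    heap_slot[DRAW_FN_WORD] = draw_fn;
}

/* Every frame the held object's position and angle are rewritten. */
static void per_frame_update(uint32_t x, uint32_t y, uint32_t angle)
{
    if (held) {
        held[0] = x;
        held[1] = y;
        held[2] = angle;  /* lands on the new actor's draw handler */
    }
}
```

After `grab_object`, `unload_map`, `spawn_actor` and one `per_frame_update`, the draw handler word holds whatever angle the player chose, which is the controllable overwrite the run relies on.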

09:47 <mjg_> Mutabah: danke

10:37 heat has joined #osdev

10:41 <mjg_> Mutabah: so is the triforce thing real or something injected by tas?

10:41 <mjg_> Mutabah: started watching the explanation and it makes a suspicious statement in the second minute

10:42 <mjg_> Mutabah: well i watched another minute and now i know :-P

11:27 dennis95 has joined #osdev

11:41 gildasio has quit [Remote host closed the connection]

11:47 gildasio has joined #osdev

11:47 <zid> I've invented an amazing chocolate delivery system

11:47 <zid> You forget it's in your pocket, then open one end and suck

11:49 <Andrew> zid: I tried with weird -m32 flags, etc and ld complains about expecting i386:x86_64, rejecting gcc's i386 output

11:49 <zid> yup, that's one of the annoying things about trying to make combined bootstrap + kernel images

11:49 <zid> getting 32bit code inside a 64bit elf

11:50 <zid> It's *much* easier in assembly in that respect, because you can just [bits 32] -felf64

11:51 SGautam has quit [Quit: Connection closed for inactivity]

11:58 <mjg_> i stopped eating chocolate few weeks ago

11:58 <mjg_> prompted by getting myself from ~70 kg to 80 :[

11:59 <mjg_> at this point i look pregnant

12:00 <zid> I weigh like.. 50kg? *does math*

12:00 <zid> oh 60

12:00 <mjg_> posture-wise i look like this guy https://www.youtube.com/watch?v=42FLAr86hbI

12:00 <bslsk05> 'Squat Cobbler HSC - Beaches 'n' Peaches | Better Call Saul Extras' by Movies Breaker (00:03:29)

12:00 <mjg_> zid: what's your height

12:00 <zid> 6

12:01 <mjg_> foot?

12:01 <zid> mete- yes

12:01 <mjg_> you sound underweight then

12:01 <zid> I don't know what I weigh, but it isn't 50 or 70 :p

12:01 <mjg_> https://www.nhlbi.nih.gov/health/educational/lose_wt/BMI/bmi-m.htm

12:02 <bslsk05> www.nhlbi.nih.gov: Calculate Your BMI - Metric BMI Calculator

12:02 <zid> I need like 12 more sets of 5kg scales

12:03 <mjg_> funny trap with "fixing your diet": it is very easy to end up with something atrocious

12:03 <zid> Idk how to be fat

12:03 <mjg_> and funny trap with exercise: common advice is beyond garbage and will leave you injured

12:03 <mjg_> how old are you man

12:03 <zid> mid 30s

12:04 <mjg_> in my mid 20s i was the "can eat anything" person, or so i thought

12:04 <mjg_> so am i

12:04 <zid> BMR doesn't really change much

12:04 <mjg_> interestingly i started getting overweight *post* pandemic

12:05 gog has joined #osdev

12:05 <GeDaMo> I was around 25 when weight just started accumulating :/

12:05 <zid> It goes from like 25 to 21 from 20 to 80

12:05 <mjg_> you might have unknowingly changed your eating habits

12:05 <mjg_> for example people have no idea how much they snack on crap

12:06 <mjg_> and that's one of the major weight gainers

12:06 <zid> cal_in - cal_out, the rest is just strategy for if you have issues with it

12:06 <zid> like eating low calorie foods so you can still overeat

12:06 <zid> or becoming an olympic swimmer

12:07 <mjg_> from what i hear cico is oversimplified

12:07 <zid> it's just physics

12:07 <mjg_> namely your body's ability to extract calories depends on the particular food

12:08 <mjg_> and can be significantly less than the supposed calories indicated on packaging

12:08 <zid> that's good not bad :P

12:08 <zid> I mean, bad for me, good for you

12:08 <mjg_> well it is in your favor so to speak

12:09 <mjg_> but you may find yourself in nutrient deficit

12:09 <zid> that's nearly impossible

12:09 <mjg_> i don't remember the recommended "safe" deficit

12:09 <mjg_> for long term weight loss

12:09 <zid> the only thing people are commonly lacking is vit D, and that's because we can't synthesize it or eat it (mostly)

12:10 <zid> you have to photosynthesize like a damn plant

12:10 <zid> and women can be lacking iron, for obvious reasons

12:10 <zid> but nobody's getting scurvy unless they have eating disorders where they can only eat burnt chips or whatever

12:11 <mjg_> i'm pretty sure you would get yourself into a solid deficit with sufficiently shitty diet, which is "obtainable"

12:11 <zid> There was a scottish guy, lost like 100kg just drinking water and eating vitamin pills, they had him on regular blood tests and he was fine

12:11 <mjg_> huh?

12:11 <GeDaMo> https://en.wikipedia.org/wiki/Angus_Barbieri%27s_fast

12:11 <bslsk05> en.wikipedia.org: Angus Barbieri's fast - Wikipedia

12:12 <mjg_> Died 7 September 1990 (aged 50–51)

12:12 <mjg_> i'm sure this had nothing to do with it

12:12 <zid> he's scottish, that's way above average

12:12 <zid> actually, studies show that calories kill

12:12 xenos1984 has quit [Read error: Connection reset by peer]

12:12 <zid> if you wanna live a long rat life, stop eating

12:13 <zid> I assume it's just "more machinery churns, so more machinery wears out" in a very abstract sense

12:13 <GeDaMo> https://en.wikipedia.org/wiki/Calorie_restriction#Life_extension

12:13 <bslsk05> en.wikipedia.org: Calorie restriction - Wikipedia

12:14 <klange> mjg_: 35 years after the fasting

12:14 <klange> if anything, it was probably complications from his weight _before_ the event that eventually did him in...

12:14 <zid> scottish life expectancy is only 61, in 2020

12:15 <zid> and that was 80 years ago

12:15 <zid> it's the lowest in w. europe

12:16 <GeDaMo> https://www.badspacecomics.com/post/the-suit

12:16 <bslsk05> www.badspacecomics.com: The Suit - Bad Space Comics

12:16 <mjg_> > In Scotland between 2018-2020: Male healthy life expectancy was 60.9 years. Female healthy life expectancy was 61.8 years.

12:17 <mjg_> i thought you were joking

12:17 <zid> scotland is a silly place

12:17 <mjg_> i have to take back my comment then

12:17 <zid> ohh it's THAT one, seed that GeDaMo it's fun

12:17 <GeDaMo> :P

12:18 <zid> (took me a while to figure out how to make it load)

12:18 <mjg_> GeDaMo: ouch

12:19 <mjg_> GeDaMo: have you read "i have no mouth and i must scream"?

12:19 <mjg_> about 15 minutes read afair and right up your alley i think

12:20 <GeDaMo> I assume I have, I know the name but I can't remember exactly what it's about

12:20 <mjg_> ellison

12:20 <mjg_> people vs a computer

12:20 SpikeHeron has quit [Quit: WeeChat 3.5]

12:20 <mjg_> an almighty one

12:20 <zid> I prefer I have no feet and I must sock

12:20 <GeDaMo> Oh yeah, I remember it now

12:21 <mjg_> cmon man, how can you forget such a classic

12:21 <klange> fuck me that's disturbing

12:21 <GeDaMo> https://rowrrbazzle.blogspot.com/2016/06/answer-by-fredric-brown-full-short.html

12:21 <bslsk05> rowrrbazzle.blogspot.com: Perkin Worbeck's Magic Newt: "Answer" by Fredric Brown (1954) (complete short-short story)

12:21 <mjg_> klange: *do not* read the story i recommended :-P

12:21 <zid> I'm currently working on a basilisk

12:22 <mjg_> GeDaMo: have you read "the last question"?

12:23 <GeDaMo> Yes

12:23 SpikeHeron has joined #osdev

12:23 <mjg_> https://www.martincwiner.com/wp-content/uploads/2011/06/The-Last-Question-Isaac-Asimov.pdf for the uninitiated

12:23 <GeDaMo> My exit message is currently "There is as yet insufficient data for a meaningful answer." :P

12:24 <zid> link "they're made of meat" next

12:24 <zid> if we're doing "famous short stories nerds know"

12:24 <klange> How about a fun one?

12:24 <klange> "The Road not Taken"

12:24 <GeDaMo> Ah, I like that one

12:26 <GeDaMo> https://xkcd.com/1782/

12:26 <bslsk05> xkcd - Team Chat

12:28 <mjg_> you reminded me i need to read the thing from outer space

12:28 <mjg_> i never got around to it

12:28 <zid> After that, read ascendance of a bookworm

12:28 <zid> that's a nice little short story

12:29 <GeDaMo> https://www.badspacecomics.com/post/grounded

12:29 <bslsk05> www.badspacecomics.com: Grounded

12:29 <zid> (19 novels so far)

12:29 <mjg_> noted

12:29 <mjg_> have you read sandkings?

12:29 <GeDaMo> "thing from outer space"?

12:29 <mjg_> not sci-fi though

12:29 <mrvn> mjg_: calories indicated on packaging is like MTBF on harddisks.

12:30 <mjg_> GeDaMo: or whatever the title was, the original material for the "the thing" movie

12:30 xenos1984 has joined #osdev

12:30 <mjg_> GeDaMo: i'm pretty sure it has 'thing' in the title :>

12:30 <GeDaMo> Ah, the thing from another world

12:30 <mjg_> oh maybe that

12:30 <GeDaMo> https://en.wikipedia.org/wiki/Who_Goes_There%3F

12:30 <bslsk05> en.wikipedia.org: Who Goes There? - Wikipedia

12:31 <mjg_> man, 0/2

12:31 <mjg_> ooh Its extended novel version, found in an early manuscript titled Frozen Hell, was finally published in 2019.

12:31 <mjg_> i did not know that's a thing

12:31 <GeDaMo> The first film was called "The Thing from Another World"

12:32 <GeDaMo> I also did not know that :|

12:32 <mjg_> makes me happy i did not read the original :-P

12:34 <GeDaMo> And apparently there's going to be a new film based on the full novel

12:34 <mjg_> :O

12:34 <mjg_> nice

12:34 <GeDaMo> https://bloody-disgusting.com/movie/3602436/universal-blumhouse-developing-new-version-thing-will-adapt-long-lost-original-novel/

12:34 <bslsk05> bloody-disgusting.com: Universal and Blumhouse Developing New Version of 'The Thing' That Will Adapt Long Lost Original Novel! - Bloody Disgusting

12:35 <mjg_> although i have to note "the thing" by carpenter does seem like the perfect movie

12:35 gog has quit [Ping timeout: 260 seconds]

12:35 * mrvn wonders about the uniform rules on ST Strange new Worlds: Gold for command, Blue for science/medical, red for everyone else making red-shirts hard to spot and white for nurse Chapel?

12:35 <mjg_> as in i don't know what you can do to improve on it

12:35 <GeDaMo> Yeah, same

12:37 <mrvn> if I knew how to make it better I would be rich and famous.

12:37 <mjg_> mrvn: let me restate, i don't see any weak points

12:37 <mrvn> it's a pretty bad comedy :)

12:37 <mjg_> well there is one bit, i did not like how the main character (what was the fucker's name?) poured whisky (or whatever) into a computer

12:38 <mjg_> the chess game they showed afair did not add up

12:38 <GeDaMo> MacReady

12:38 <mjg_> as in blatantly different positions between shots

12:38 <mrvn> OMG, what has the computer ever done to him? Or: What a waste of good whiskey?

12:38 <mjg_> why not both

12:39 <mrvn> mjg_: hehe, continuity errors are fun. They probably had to reshoot the scene and some intern had to set up the chess board.

12:41 gog has joined #osdev

12:50 gog has quit [Ping timeout: 272 seconds]

13:02 mzxtuelkl has quit [Read error: Connection reset by peer]

13:03 mzxtuelkl has joined #osdev

13:28 <sbalmos> mrvn: Pike?

13:32 <mrvn> sbalmos: yes

13:35 <sbalmos> mrvn: I haven't seen the ep for today yet, so I'm guessing that's the whisky reference.

13:39 <mrvn> sbalmos: no, that was from The Thing

13:39 [itchyjunk] has joined #osdev

13:39 <sbalmos> mrvn: ah, crap, sorry then. I lose again.

13:40 <mrvn> sbalmos: I'm just wondering why everyone is wearing a uniform except nurse Chapel

13:40 <sbalmos> mrvn: She's a civilian

13:41 <sbalmos> mrvn: See Ep 1. She's a civilian from I believe Carnegie-Mellon on a Starfleet medical trial program or such. I'd have to go look back up the exact wording they used.

13:41 <mrvn> that explains it.

14:08 <zid> oh, westworld is back

14:09 <zid> it lost the plot a bit but it's still pretty

14:17 ornxka has joined #osdev

14:18 <ornxka> why my memory gets corrupted/??

14:18 <zid> you corrupted it.

14:18 <ornxka> ah yes, it figures that the source of all of my other problems would be at fault here too

14:19 <zid> not a lot anybody can do to help you with "why wrong!?!?//11one"

14:21 blockhead has quit []

14:23 <ornxka> its one of those "looking for moral support and commiseration rather than concrete suggestions" sort of things

14:24 <sbalmos> life sucks, computers are unforgiving. But the good thing is, they do exactly what you tell them to do, and nothing more.

14:24 <sbalmos> The bad thing is, they do exactly what you tell them to do, and nothing more.

14:24 <zid> computers were a mistake, try becoming a hermit

14:25 <sbalmos> zid: you know, that mountain cabin is looking more tempting all the time, and not just because of computers

14:25 <ornxka> normally i hate that sentiment because i didnt write 99.99% of the code that runs on my computer and thus they do not do "exactly what i tell them to do" and its more a consensus between me and several thousand other people

14:25 <ornxka> but in this instance it is indeed doing exactly what i told it to do

14:26 <ornxka> and there is no one to blame but myself..

14:27 <ornxka> being a hermit sounds nice, imagine all of the time you would have to spend on hobbyist osdev

14:27 <zid> very little, too busy smelting copper

14:27 <zid> and mixing clay

14:30 <mrvn> ornxka: residual radiation from satellite debris entering the atmosphere

14:31 <mrvn> That mountain cabin needs netflix

14:33 <ornxka> tired: accidental use after free bug, wired: my gangstalkers are beaming gamma rays at my dev machine and flipping my bits to impede my development progress

14:33 lg has quit [Ping timeout: 244 seconds]

14:36 <heat> get kasan

14:36 <heat> never worry about use after frees again

14:37 terminalpusher has joined #osdev

14:38 gog has joined #osdev

14:39 <zid> gog omg you can't do that wtf

14:39 <zid> oh sorry I mistook you for heat

14:39 <gog> what

14:39 <gog> I'll do it anyway idc

14:40 * gog does it

14:40 <zid> Okay you'll need my paypal info then

14:40 <heat> just do it gog

14:40 <heat> like nike

14:40 <heat> and shia labeouf

14:40 <gog> yes

14:41 <zid> I do wonder *why* heat keeps doing it though

14:43 <zid> I'd imagine the chafing would be incredible

14:45 lg has joined #osdev

14:50 ethrl has joined #osdev

14:57 hysv has joined #osdev

15:05 terminalpusher has quit [Remote host closed the connection]

15:09 <clever> mrvn: looking at the opengl specs for glDrawArrays, it almost exactly maps on top of "Vertex Array Primitives" from v3d, but v3d wants everything in one big array, while glDrawArrays wants multiple arrays

15:09 <clever> so mesa would still have to do a somewhat complex fetch from n arrays, write interleaved to one array

15:11 X-Scale` has joined #osdev

15:11 <zid> I'm not sure I've actually used glDrawArrays before, thinking about it

15:12 <zid> glDrawElements for life I guess

15:12 <clever> *reads*

15:12 X-Scale has quit [Ping timeout: 276 seconds]

15:12 <clever> zid: ahh, that maps closely to the "Indexed Primitive List" that ive already been using

15:13 X-Scale` is now known as X-Scale

15:13 <zid> elements is just the same thing but with *another* array :p

15:13 <zid> I basically never end up with flat geometry unless I'm hand-generating it

15:14 <clever> zid: but, is that array pointing into fully assembled vertices, or separate arrays?

15:14 <zid> It's identical, but with an extra array of indices

15:14 <zid> glDrawArraysIndirect :P

15:14 <clever> ah, so same problem remains, and that seems even worse for the host cpu

15:14 <clever> v3d expects an array of structs, each one describing a single vertex fully

15:14 <zid> In practice it might not be too bad? most people tend to just use a single array anyway

15:15 <zid> but with glVertexAttribPointer to chop up that single array

15:15 <clever> but the way the docs describe it: you can prespecify separate arrays of vertices, normals, and colors and use them to construct a sequence

15:15 <clever> it sounds more like structs of arrays

15:15 <zid> yea I'm not sure anybody ever really does that

15:16 <zid> It's generally just a huge GL_BUFFER full of 18 element verts

15:16 <zid> rather than 6 arrays of triples

15:16 <clever> ah

15:16 <clever> and then how do you tell gl what the layout is within those buffers?

15:16 <zid> >but with glVertexAttribPointer to chop up that single array

15:17 <zid> entity.c: glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(GLfloat[8]), 0);

15:17 <zid> entity.c: glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(GLfloat[8]), (GLvoid *)(sizeof(GLfloat[3])));

15:17 <zid> entity.c: glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, sizeof(GLfloat[8]), (GLvoid *)(sizeof(GLfloat[6])));

15:17 <zid> That's an 8 element vert, split into 3/3/2

15:17 <zid> (xyz, rgb, uv I think in that case)
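
The stride and offset arguments in the three glVertexAttribPointer calls above can be derived from a struct instead of being written by hand. A minimal sketch, assuming a hypothetical `struct vertex` that mirrors the 3/3/2 split (xyz, rgb, uv); the struct and helper names are illustrative and not taken from entity.c:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vertex record mirroring the 3/3/2 split: xyz, rgb, uv. */
struct vertex {
    float pos[3];
    float rgb[3];
    float uv[2];
};

/* The layout parameters one glVertexAttribPointer call would receive. */
struct attrib_layout {
    int    components; /* number of floats in this attribute */
    size_t stride;     /* bytes from one vertex to the next */
    size_t offset;     /* byte offset of the attribute inside a vertex */
};

/* Return the layout for attribute location 0 (pos), 1 (rgb) or 2 (uv). */
static struct attrib_layout vertex_attrib(int location)
{
    assert(location >= 0 && location < 3);
    static const struct attrib_layout table[3] = {
        { 3, sizeof(struct vertex), offsetof(struct vertex, pos) },
        { 3, sizeof(struct vertex), offsetof(struct vertex, rgb) },
        { 2, sizeof(struct vertex), offsetof(struct vertex, uv)  },
    };
    return table[location];
}
```

With 4-byte floats and no padding, the stride equals sizeof(GLfloat[8]) and the offsets equal 0, sizeof(GLfloat[3]) and sizeof(GLfloat[6]), matching the calls quoted above.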

15:18 lg has quit [Ping timeout: 244 seconds]

15:18 <clever> was going to say, that v3d wants its pre-shaded vertex data in a very specific layout

15:18 <clever> then i remembered, if you have vertex shaders, that goes out the window

15:18 <zid> The shader ends up with vec3, vec3, vec2 `in` data

15:18 <clever> and you can lay them out in any order, and then just select the right attributes

15:19 <clever> behind the scenes, your 8 element vert, becomes an 8x16 matrix in the VPM, all 8 elements for 16 vertices

15:19 <zid> layout(location=0) in vec4 pos; layout(location=1) in vec3 norm; layout(location=2) in vec2 uv;

15:19 <clever> and the shader compiler can just map coord.x to the right row in the VPM

15:19 <zid> That's what it looks like in glsl ^

15:19 <clever> so the ordering within the buffer doesnt matter

15:20 <clever> so attributes solve the problem i was expecting

15:20 <clever> you just have to adjust the compiled shader, to agree with the glVertexAttribPointer settings

15:21 <zid> and then there's "..Instanced" I think which adds another field that's a sequential int and how many times to render the same geom over and over so you can index it into a texture to do different things per copy

15:21 <zid> useful for like, minecraft

15:21 <clever> ive not seen any signs of v3d supporting instancing

15:21 <clever> so the gl drivers would have to duplicate the data for you

15:21 <zid> where your data is always going to be a single quad with the same UVs, and all you care about is varying an offset

15:22 <zid> single cube*

15:22 <clever> yeah, mrvn mentioned using instancing for drawing text lastnight

15:22 <zid> It saves pci-e trips to let the card know directly that you're rendering the same geom 32768 times

15:22 <zid> rather than 32768 calls

15:22 <zid> and having to blow up your memory usage 32768 times

15:23 <clever> in the case of v3d, its not over a pci bus, so its just a question of how big the vertex attribute arrays become

15:23 <clever> oh wait, i just remembered something

15:23 <zid> otherwise you'd have to duplicate all 18 verts instead of adding a single input buffer containing the array you're indexing

15:23 <clever> yeah, that actually fits, i forgot about that

15:23 <clever> the shader state, wants 8 pointers

15:23 <clever> to the start of 8 attribute arrays

15:24 <clever> so it could work with the other struct of arrays i saw earlier

15:24 <clever> and it wants 8 strides, so you can interleave it however you want

15:25 <zid> happy that you're happy, I don't know your thing so I'm just throwing out info about how gl programs "can work" in case anything matches your hw's caps cleanly

15:25 <clever> yeah

15:25 <clever> the only other problem i can foresee

15:26 <clever> is that each shader (coordinate, vertex, and i think fragment), has an 8bit mask

15:26 <clever> to select which of the 8 attributes its fetching from

15:26 <clever> what happens if you need more then 8?

15:26 <zid> than

15:26 <zid> 8 is a lot

15:26 <mrvn> "We have no lawyers here, that is why this is an utopia."

15:26 <zid> even for PBR

15:27 <clever> zid: so your 3/3/2 is an extreme edge case?

15:27 <zid> xyz, normal uv, texture uv, displacement map uv, pre-baked lighting uv, etc

15:27 <zid> that's 3

15:27 <zid> attribute 0, attribute 1 and attribute 2 are in use

15:27 <clever> oooooo

15:27 <zid> I still have 5 free if your limit is 8

15:27 <clever> that might be what i was mis-understanding

15:28 <clever> and explains why the attributes have a size on them

15:29 <clever> so, i can put the entire `vec3 pos` into attribute 0, set the size to 12 (3 floats), address to the first pos in the interleaved array, and stride to the distance between 2 pos's
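
The base/size/stride triple described here can be sketched in plain C. This is only an illustration of the addressing arithmetic, assuming a hypothetical `struct attrib_array`; it is not the actual V3D shader state record layout:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one attribute array. */
struct attrib_array {
    const unsigned char *base; /* address of the first element */
    size_t size;               /* bytes per element, e.g. 12 for a vec3 of floats */
    size_t stride;             /* bytes between consecutive elements */
};

/* Copy the attribute belonging to vertex number `vertex` into `out`. */
static void fetch_attrib(const struct attrib_array *a, size_t vertex, void *out)
{
    assert(a != NULL && out != NULL);
    memcpy(out, a->base + vertex * a->stride, a->size);
}
```

An interleaved buffer uses the same stride for every attribute and a different base for each; fully separate arrays use stride equal to size. Both layouts go through the same fetch.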

15:30 <clever> that makes far more sense

15:30 <zid> 16 is the max on pc class hw apparently

15:30 * mrvn just wants flat shading, maybe a texture for advanced graphics

15:30 <zid> and the stride value has a max which is no less than 2048

15:30 <zid> so try not to have verts bigger than 2kB :P

15:30 <clever> so that just leaves 2 mysteries

15:31 <clever> > Attribute Array [n] Vertex Shader VPM Offset (from Base Address)

15:31 <clever> > Attribute Array [n] Coordinate Shader VPM Offset (from Base Address)

15:31 <zid> what's a coordinate shader, what's a VPM

15:31 <clever> my guess, is that this is a byte offset, from the starting address, so you can mis-align the attributes

15:31 <clever> a coordinate shader is just a vertex shader, with the vary[] part deleted

15:31 <clever> its job is to only compute screen xy coords

15:32 <clever> the VPM is a chunk of memory that is used to send attributes to shaders, and temporarily store shaded vertex data

15:32 <zid> so some internal boofer?

15:32 <clever> until the polygon has been fully drawn

15:32 <clever> yeah

15:32 <clever> my first guess, is that you could use it as a byte offset into the vertex attributes

15:33 <clever> so the 8 attribute mask, selects a different 8 attributes

15:33 <clever> then you could have an attribute array of 1234 5678 9,10,11,12, and then one shader uses 1-8, while another shader uses 5-12

15:34 <clever> so each shader is limited to a max of 8 consecutive attributes, but that is a sliding window over the entire attribute selection?

15:34 <zid> sounds weird

15:35 <clever> i'm just guessing, i could be wrong

15:35 <clever> mesa also hides coordinate shaders from you

15:35 <clever> the compiler will just delete all of the vary[] outputs from a vertex shader, and then delete any computation with unused outputs

15:35 <clever> and boom, there is your coordinate shader

15:38 <clever> that just leaves the extended attribute array

15:40 <clever> i think its just defining the stride for fetching extra attributes beyond that

15:40 <clever> but its not clear how exactly, i should cross-reference to mesa

15:45 lg has joined #osdev

15:55 gxt__ has quit [Remote host closed the connection]

15:55 foudfou has quit [Remote host closed the connection]

15:55 gildasio has quit [Write error: Connection reset by peer]

15:55 foudfou has joined #osdev

15:55 gxt__ has joined #osdev

15:56 gildasio has joined #osdev

16:01 Geertiebear has joined #osdev

16:21 dennis95 has quit [Quit: Leaving]

16:35 <heat>

16:46 gildasio has quit [Quit: WeeChat 3.5]

16:49 nyah has joined #osdev

16:53 dennis95 has joined #osdev

16:53 frkzoid has joined #osdev

16:55 <frkzoid> looks like M$ has reached the extinguish phase: https://www.phoronix.com/scan.php?page=news_item&px=Systemd-Creator-Microsoft

16:55 <bslsk05> www.phoronix.com: Systemd Creator Lands At Microsoft - Phoronix

16:56 <GeDaMo> Maybe Windows is going to adopt systemd :|

17:01 <vdamewood> GeDaMo; That would be funny.

17:02 <GeDaMo> Yes, I'm sure we'd all laugh at that :|

17:05 * vdamewood installs systemd on GeDaMo

17:05 <mrvn> "I just hope that things will work out and eat a steady flow of pizza until they do." I can get behind that philosophy.

17:05 <mrvn> GeDaMo: the big question then is: will it get better or worse?

17:06 <GeDaMo> systemd or Windows? :P

17:07 <vdamewood> Yes!

17:08 <gog> hot take: i like systemd

17:09 <gog> i find it to be less fragile than various init systems I used over the years

17:09 <zid> I've never really used an init system

17:09 <zid> beyond /etc/init.d/net

17:09 foudfou has quit [Remote host closed the connection]

17:10 foudfou has joined #osdev

17:10 * vdamewood gives gog a fishyd

17:11 <vdamewood> Personally, I like systemd, too.

17:11 * gog fishyctl eat

17:11 <geist> well, if systemd is fragile enough that losing a key member of their team at this point is fatal, then it's not well run

17:11 <gog> is that what happen?

17:11 <geist> but probably they'll just keep working on it

17:11 <geist> MSFT is a different company nowadays

17:12 <zid> maybe they want windows to be bootable into wsl2

17:12 <zid> via systemd

17:12 <geist> yah totes

17:12 [itchyjunk] has quit [Ping timeout: 244 seconds]

17:13 <zid> kernel does bringup then runs init which is systemd's init, instead of running explorer or whatever

17:13 <gog> maybe it's time to replace systemd with a compatible but less hulking alternative

17:13 <zid> (I have no idea what windows' init process is)

17:14 <geist> i used to know, and it was a complicated set of this spawns that with the services and whatnot

17:14 <gog> like they're doing with pulse audio/pipewire

17:14 <geist> but no idea if any of that is the same now

17:14 <geist> reminds me. rant: work is forcing me to switch my work computer from cinnamon to GNOME and i hate every part of it

17:14 <gog> boooooo

17:15 <geist> plain GNOME is such a stupid backstep in functionality in the interest of looking nice

17:15 <gog> I've not used new gnome before

17:15 <geist> i have to install like 8 extensions to get it kinda halfway back to what i want

17:15 <vdamewood> geist: Can you install gnome-shell plugins/extensions Whatever they're called?

17:15 <geist> yes. which is what i've been doing. one of them is panel to bar or something, which is the biggie

17:15 <geist> problem is the extensions are kinda fragile, seems the more you put in there the more possibility of something colliding with something else, etc

17:16 <geist> and things like you have problems with > 4 workspaces because you can't set global keyboard things for 5 and 6, etc

17:16 <geist> it's really lame

17:16 <vdamewood> I remember RHEL 6(?) included a bunch of extensions by default to make GNOME 3 more like GNOME 2.

17:16 <geist> the biggest one: there's no goddamn desktop icons

17:16 <gog> can you use plasma

17:16 <geist> apparently that's a choice by the designers, they thought it was messy

17:16 [itchyjunk] has joined #osdev

17:17 <geist> there are a few extensions for desktop icons but they seem to involve essentially running an instance of chromium to drive it

17:17 <geist> i'm a huge messy desktop icon user. i keep everything i'm doing right there in little clusters

17:17 <gog> yeah the gnome designers have a real obsession with "clean" design but at the expense of configurability

17:17 <vdamewood> geist: That... sounds terrible. @ chromium

17:18 <geist> even the file menus dont have a shortcut for ~/Desktop because it's not a real folder that gnome cares about

17:18 <geist> grar. it's so lame. i used to not mind it but i had forked off their train into MATE and cinnamon years ago because i didn't like where it was going

17:19 <vdamewood> Can we smash GNOME and replace it with something better?

17:19 <geist> gog: what is plasma in this context?

17:20 <gog> kde plasma desktop

17:20 <gog> probably not eh

17:20 <gog> if they're forcing you to use gnome

17:20 <geist> ah no. it's some work thing where they only want us to use gnome

17:21 <geist> now that i spent half a day working with gnome i know that i can make it 75% back to what i want, so i'll just stick with cinnamon until they really really force me

17:22 <gog> yes

17:22 <vdamewood> I can work with GNOME as soon as I get the terminal added to the dock.

17:22 <gog> is there any good business reason to force you to use a particular desktop environment? seems like it'd damage productivity

17:23 <GeDaMo> Getting the new one to work 75% as well as the old one seems to be the norm now :/

17:23 <vdamewood> gog: Makes the systems easier to maintain, if it's a work computer.

17:23 <gog> true

17:23 <gog> and security review

17:23 <vdamewood> gog: Makes it easier to share systems.

17:23 <gog> yeh

17:24 <geist> yah it's easier to maintain

17:26 <psykose> i remember having to do stuff with some restrictions similar to that and safe to say i have no interest in ever doing it again

17:29 <geist> yah thats my mini rant, but so it goes. like all things like that, the change of shortcut keys or ui differences you usually just get used to, it's loss of functionality that's a real drag

17:30 mzxtuelkl has quit [Quit: Leaving]

17:30 <gog> can you get access to classic mode?

17:31 <gog> or convince them to install the "desktopfolder" package?

17:31 <geist> well i have root access, so i can install pretty much any package i need

17:31 <gog> oh ok then

17:31 <geist> so it's also possible i can just keep using cinnamon forever, i'm just off the support chain

17:31 <gog> yeh

17:32 <sbalmos> is the support chain even worthwhile?

17:32 <geist> but they were insistent enough that there's actually a login popup that says 'you need to switch to GNOME by date X'

17:32 <geist> well, to be honest yes, they're quite good. or at least if the machine is having trouble you go through em

17:32 <gog> cause if you dont support me now/you'll never support me again/i can still hear you saying/you broke the support chain

17:34 <geist> you know the rules (and so do i)

17:34 <geist> a full commitments what i'm thinking of

17:35 blockhead has joined #osdev

17:36 <gog> wrong song :p

17:40 gildasio has joined #osdev

17:48 <GeDaMo> https://www.youtube.com/watch?v=B_0m3L1_1Dg

17:48 <bslsk05> 'Life Is a Long Song (2001 Remaster)' by Jethro Tull - Topic (00:03:17)

18:11 foudfou has quit [Remote host closed the connection]

18:11 foudfou has joined #osdev

18:19 <geist> ah jethro tull. one of my favorite vinyls i have is Aqualung

18:19 <geist> really a great vinyl in general since it has a fair amount of dynamic range, etc

18:19 <geist> a modern mix sounds much more compressed as usual

18:24 <geist> huh looks like qemu is getting support for LoongArch

18:24 <geist> though last i had looked it wasn't that interesting. some sort of mips like derivative

18:25 <GeDaMo> It's MIPS-based

18:26 <GeDaMo> I think there are some instructions to assist in emulating x86

18:26 <geist> but i guess has diverged enough to be its own arch

18:26 <GeDaMo> «In August 2021, Linux maintainers complained that submitted LoongArch code is "...a blind copy of the MIPS code...", however "only with a different name".» https://en.wikipedia.org/wiki/Loongson

18:26 <bslsk05> en.wikipedia.org: Loongson - Wikipedia

18:27 vinc has joined #osdev

18:28 <GeDaMo> Possibly due to lack of documentation

18:33 heat has quit [Ping timeout: 264 seconds]

18:44 doug16k has quit [Remote host closed the connection]

19:05 CaCode has joined #osdev

19:09 vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]

19:15 mrvn has quit [Ping timeout: 256 seconds]

19:25 vinc has quit [Read error: Connection reset by peer]

19:31 terminalpusher has joined #osdev

19:32 <ddevault> let's say I have a scenario where I have two userspace processes exchanging data by having the kernel copy it between buffers in their respective address spaces

19:32 <ddevault> would it be reasonable to statically allocate an extra set of page tables for the kernel to map any physical address into its address space temporarily?

19:32 <ddevault> am I going to take a big performance hit from doing that?

19:33 X-Scale` has joined #osdev

19:33 vdamewood has joined #osdev

19:33 X-Scale has quit [Ping timeout: 255 seconds]

19:33 <ddevault> hm, I know linux does the reverse, unmapping the kernel while in userspace

19:35 X-Scale` is now known as X-Scale

19:35 <ddevault> it would be improved if I kept track of the last such mapping and avoided invalidating the TLB if they're already okay, so if two processes are exchanging a lot of messages it's not thrashing it

19:38 <geist> depends. in general fiddling with the kernel address space on an SMP system is expensive since you have a global TLB shootdown to do (on x86 at least)

19:38 <geist> but, if you can use a temporary, per-cpu region of the kernel (say every cpu gets a 4MB region, or a 1GB region or whatnot)

19:39 <geist> *and* you can ensure that the code that does the map/copy/unmap runs on one cpu you can design a per cpu mapping scheme that avoids tlb shootdowns

19:39 vdamewood has quit [Ping timeout: 255 seconds]

19:40 <geist> but you may be trading some amount of global responsiveness and preemptability, depending on how you pin the thread/task to a single cpu

19:40 <ddevault> note that I have a non-preemptable microkernel, though SMP of course is still an issue

19:40 dude12312414 has joined #osdev

19:40 <geist> if it's non preemptable, how do you handle SMP? do you only allow one cpu in the kernel at a time?

19:40 <ddevault> maybe I can have a pool of page tables to work with to avoid modifying them too often and let old ones stick throughout several operations

19:41 dude12312414 has quit [Remote host closed the connection]

19:41 <ddevault> we don't have SMP yet, but no, it's only (going to be) non-preemptable on a per-CPU basis

19:41 <ddevault> there will still be multiple threads in the kernel at a time

19:41 <geist> so non preemptable as in it'll be reentrant but non preemptable

19:41 <ddevault> but once a CPU enters the kernel it won't leave until the syscall or interrupt is done

19:41 <geist> reentrant in the sense that multiple cpus may be active at the same time

19:41 <ddevault> yeah, more or less

19:41 <geist> okay, so the problem remains then: how do you fiddle with kernel address space without excessive TLB shootdowns

19:42 <geist> other than 'get a better cpu'

19:42 <ddevault> hah

19:42 <geist> BTW I'm assuming you're on x86, correct?

19:42 <ddevault> aye, though we'll have riscv64 soon enough

19:42 <ddevault> I think a pool of temporary page tables might be the ticket

19:42 <geist> yah and sadly riscv has the same problem

19:42 <geist> the alternative is to do a per page lookup and then do memcpy against the raw page

19:43 <ddevault> so that you need to have >N processes doing IPC or such at the same time before the pool runs out and it starts having to deal with TLB misses
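
The pool idea can be sketched as a small cache of temporary mapping slots with least-recently-used replacement. This is a user-space model only: the names are hypothetical, the "invalidation" is just a counter standing in for a page table rewrite plus a TLB flush of that slot, and it assumes a slot that was never mapped needs no flush:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SLOTS 4

/* One temporary kernel mapping window (hypothetical). */
struct temp_slot {
    uint64_t frame;     /* physical frame currently mapped in this slot */
    uint64_t last_used; /* logical timestamp of the last use, 0 = never */
    int      valid;
};

struct temp_pool {
    struct temp_slot slot[POOL_SLOTS];
    uint64_t clock;         /* incremented on every temp_map call */
    unsigned invalidations; /* times a live mapping had to be replaced */
};

/* Return the slot index that maps `frame`, reusing an existing mapping
 * when possible and evicting the least recently used slot otherwise. */
static int temp_map(struct temp_pool *p, uint64_t frame)
{
    assert(p != NULL);
    int victim = 0;
    p->clock++;
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (p->slot[i].valid && p->slot[i].frame == frame) {
            p->slot[i].last_used = p->clock;
            return i; /* hit: mapping already in place, nothing to flush */
        }
        if (p->slot[i].last_used < p->slot[victim].last_used)
            victim = i; /* unused slots have last_used == 0 and win first */
    }
    if (p->slot[victim].valid)
        p->invalidations++; /* a real kernel would rewrite the PTE and flush here */
    p->slot[victim] = (struct temp_slot){ frame, p->clock, 1 };
    return victim;
}
```

Two processes bouncing messages between the same pair of frames keep hitting the same two slots, so the invalidation counter only moves once more than POOL_SLOTS distinct frames are in play.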

19:43 <geist> which you map into the kernel via a more global mechanism, including just linearly mapping all of memory (if you're on 64bit you can mostly guarantee that)

19:43 <ddevault> well, we do identity map 64 GiB

19:43 <ddevault> and to be fair I could just say "more than 64G of RAM is a problem for future me"

19:43 <geist> so yeah that's another strategy: find the mapping in user space and then copy directly into the physical mapping
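
The physmap strategy can be modelled in user space: look up each destination page in the blocked process's address space, then write through the linear mapping of physical memory. A minimal sketch, assuming a byte array standing in for RAM and a toy one-level page table; every name here is hypothetical and not Helios code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE  4096u
#define NUM_FRAMES 8u
#define NUM_VPAGES 8u

/* Stand-in for RAM. In a kernel, physmap_base would be the virtual address
 * at which all of physical memory is linearly mapped. */
static uint8_t fake_ram[NUM_FRAMES * PAGE_SIZE];
static uint8_t *const physmap_base = fake_ram;

static void *phys_to_virt(uint64_t paddr)
{
    assert(paddr < sizeof fake_ram);
    return physmap_base + paddr;
}

/* Toy address space: virtual page number -> physical frame number. */
struct toy_aspace {
    uint32_t frame[NUM_VPAGES];
};

/* Copy `len` bytes from a kernel-visible source into the address space of
 * the blocked process, one page at a time, without mapping anything new. */
static void copy_to_blocked(const struct toy_aspace *dst, uint64_t dst_vaddr,
                            const void *src, size_t len)
{
    const uint8_t *s = src;
    while (len > 0) {
        uint64_t vpn = dst_vaddr / PAGE_SIZE;
        size_t off = (size_t)(dst_vaddr % PAGE_SIZE);
        size_t chunk = PAGE_SIZE - off; /* stay inside the current page */
        if (chunk > len)
            chunk = len;
        assert(vpn < NUM_VPAGES);
        uint64_t paddr = (uint64_t)dst->frame[vpn] * PAGE_SIZE + off;
        memcpy(phys_to_virt(paddr), s, chunk);
        s += chunk;
        dst_vaddr += chunk;
        len -= chunk;
    }
}
```

The copy has to be split per page because virtually contiguous pages of the destination need not be physically contiguous.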

19:44 <ddevault> that's another thing, we could also just short circuit the page mappings for any physical address which is identity mapped

19:44 <geist> if the copy source or destination is always active at the time (IPC from or to the thread that's active) then only one end point of the copy has to be against the physmap

19:44 <ddevault> then only high memory has to worry about TLB

19:44 gog has quit [Quit: byee]

19:44 <ddevault> it's a rendezvous model

19:44 <ddevault> so one thread is blocked and the other is in a syscall

19:44 gog has joined #osdev

19:45 <geist> right then it turns into a different problem: instead of temporarily mapping buffers into the kernel, you're temporarily mapping physical pages into the kernel

19:45 <geist> the latter is a more generalized problem and can be nicely solved

19:45 <geist> for example if you need page 1 and 27 and 32 you can use large pages to map 0-255 and happen to get it in one shot, etc

19:46 <ddevault> I think for now I'll just have the memory enumeration code peace out if the physical address is >64G

19:46 <ddevault> with a comment saying // So you want Helios to support more than 64GiB? Great! You can deal with the problems

19:46 terminalpusher has quit [Remote host closed the connection]

19:47 mrvn has joined #osdev

19:47 <ddevault> though, bleh, who says that a system with <64G of RAM won't map it at addresses >64G

19:47 terminalpusher has joined #osdev

19:47 <geist> yah note that on 64bit machines you have a lot of headroom there. realistically you can chew up a sizable chunk of the kernel to get you a TB or so before it starts getting tight

19:47 <geist> depending on how much space you want to reserve for it and what arch you're on

19:48 <geist> since 64bit systems usually have something like 47 or 48 bits of kernel address space

19:48 <geist> and can usually use 1GB pages to map stuff like that

19:48 <ddevault> yeah that's a point

19:48 <ddevault> are huge pages treated differently by the TLB?

19:48 <geist> i'm generally a fan of the physmap strategy, despite it being somewhat of a security issue in general

19:48 <geist> generally they're more efficient yes

19:48 <geist> use less TLB entries

19:48 <ddevault> nice

19:49 terminalpusher has quit [Remote host closed the connection]

19:49 <ddevault> how do those actually work, by the way

19:49 terminalpusher has joined #osdev

19:50 <ddevault> since a virtual address defines a series of indices into page tables that ultimately siphons out 4K portions of address space

19:50 <geist> what part specifically?

19:50 <ddevault> so address + 4K is the next entry in the page table

19:50 <ddevault> do you have to allocate them sparsely or something if the page size is >4K?

19:50 <clever> ddevault: the page tables are a tree

19:50 <geist> what are these pronouns referring to precisely?

19:50 <ddevault> err, I see it now

19:50 <ddevault> the PD has the page size bit, not the PT

19:51 <clever> ddevault: so if you go one level up in the tree, each slot refers to a larger chunk of ram

19:51 <ddevault> yeah, thanks clever

19:51 <geist> right. what clever is saying. it's implied by the depth of the tree you're on

19:51 <geist> x86 has kinda silly terminology, i like more of the ARM strategy where they say it's a L0-L3 page table, and a page is a 'terminal page table entry'

19:52 <geist> so the level you're at where you hit a terminal entry is how large the page is

19:52 <geist> first level? 512GB. next level? 1GB. next level? 2MB. next level? 4K

19:53 <geist> which if you do the log 2 math is 12 bits = 4k. +9 = 21 bits = 2MB. +9 = 30 bits = 1GB. +9 = 39 bits = 512GB
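
The log 2 arithmetic can be written as a one-line function. A small sketch, assuming 4K base pages and 9 index bits per level as in the discussion; the function name is made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

enum { PAGE_SHIFT = 12, INDEX_BITS = 9 };

/* Bytes covered by a terminal entry that sits `levels_above_leaf` levels
 * above the last-level page table:
 *   0 -> 4 KiB, 1 -> 2 MiB, 2 -> 1 GiB, 3 -> 512 GiB. */
static uint64_t terminal_entry_size(unsigned levels_above_leaf)
{
    assert(levels_above_leaf <= 4); /* 12 + 9 * 4 = 48 bits at most */
    return UINT64_C(1) << (PAGE_SHIFT + INDEX_BITS * levels_above_leaf);
}
```

Whether a given level may actually hold a terminal entry depends on the architecture and the CPU, which this sketch does not model.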

19:53 <ddevault> the way I like to think about it is just dicing up each series of bits in a virtual address as an index into a page table

19:53 <ddevault> thinking of the address space more discretely than continuously

19:54 <geist> in the case of a 5th level it'd be 39 + 9 = 48 bits = 256 TB

19:54 <geist> yep. precisely. and the shift of 9 bits is because each page table (12 bits long in this case, because 4K) has 8 byte entries, which is 3 bits. so it's 12 - 9

19:54 <geist> ie, each section of the split of the address is 9 bits wide

19:54 <geist> 512 entries

19:55 <geist> s/12 - 9/12-3=9/
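
Dicing a virtual address into 9-bit table indices plus a 12-bit page offset can be sketched like this, again assuming 4K pages and 8-byte entries (512 entries per table); the helper names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Index into the table that sits `levels_above_leaf` levels above the
 * last-level page table (0 = the table whose entries map 4 KiB pages). */
static unsigned table_index(uint64_t vaddr, unsigned levels_above_leaf)
{
    assert(levels_above_leaf <= 4);
    return (unsigned)((vaddr >> (12 + 9 * levels_above_leaf)) & 0x1ff);
}

/* Byte offset inside the final 4 KiB page. */
static unsigned page_offset(uint64_t vaddr)
{
    return (unsigned)(vaddr & 0xfff);
}
```

Each index is 9 bits because a 4096-byte table divided by 8-byte entries gives 512 slots.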

19:55 <ddevault> it did take me a while to grok page tables, though

19:55 <ddevault> for some reason they didn't click

19:55 <geist> yah i've seen it a lot. takes a while for a lot of folks to finally have it click

19:56 <geist> but usually they have an ah-ha moment

19:56 <clever> i opted to go the simple route, a single layer of paging tables

19:56 <clever> https://github.com/librerpi/rpi-open-firmware/blob/master/arm_chainloader/mmu.c#L12-L37

19:56 <bslsk05> github.com: rpi-open-firmware/mmu.c at master · librerpi/rpi-open-firmware · GitHub

19:56 <geist> helps for arches that support it, but lots of them dont support terminal entries at the top level

19:57 <clever> from memory, each slot in this layer is 1mb, 4096 slots total for 4gig of virt space

19:57 <clever> that way, i dont have to deal with allocating a bunch of tables for the next level

19:57 <clever> and the linker can allocate the 1st layer
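
The single-level scheme reduces address translation to one shift. A rough sketch of the arithmetic only, assuming 1 MiB slots over a 32-bit address space; the table name is hypothetical, and descriptor bit encodings and alignment requirements are deliberately left out:

```c
#include <assert.h>
#include <stdint.h>

enum { SECTION_SHIFT = 20, SECTION_COUNT = 4096 };

/* 4096 slots * 1 MiB = 4 GiB, i.e. the whole 32-bit virtual space. */
_Static_assert(((uint64_t)SECTION_COUNT << SECTION_SHIFT) == (UINT64_C(1) << 32),
               "one level of 1 MiB slots must cover 4 GiB");

/* Statically sized table, so the linker places it and nothing has to be
 * allocated at runtime (hypothetical name, entries left zeroed here). */
static uint32_t first_level_table[SECTION_COUNT];

/* Which slot of the table a virtual address falls into. */
static unsigned section_index(uint32_t vaddr)
{
    unsigned idx = vaddr >> SECTION_SHIFT;
    assert(idx < sizeof first_level_table / sizeof first_level_table[0]);
    return idx;
}

/* Byte offset of the address inside its 1 MiB slot. */
static uint32_t section_offset(uint32_t vaddr)
{
    return vaddr & ((UINT32_C(1) << SECTION_SHIFT) - 1);
}
```

The price of skipping the second level is granularity: every mapping decision applies to a whole 1 MiB slot.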

19:57 <geist> i forget if riscv defines that 512GB pages work in SV48

19:57 gog has quit [Quit: byee]

19:57 foudfou has quit [Quit: Bye]

19:57 <clever> i'm cheating by using a 32bit system

19:58 <clever> so i dont have to worry about 512gig pages, lol

19:58 foudfou has joined #osdev

19:58 <clever> and on the subject of what we were discussing yesterday

19:58 <geist> note 4MB pages didn't come along for a while in x86

19:58 <geist> pentium era, iirc

19:58 <clever> https://fuchsia.googlesource.com/fuchsia/+/refs/heads/main/zircon/kernel/arch/arm64/include/arch/arm64/mmu.h#269

19:58 <bslsk05> fuchsia.googlesource.com: zircon/kernel/arch/arm64/include/arch/arm64/mmu.h - fuchsia - Git at Google

19:58 <clever> i can see how this decodes as normal memory, uncached in both inner&outer

19:58 <clever> but i cant see where it says to be write combined

20:02 <geist> oh i dont remember, i think it's implied by it being normal memory vs device memory where it switches to a new model of the nGnRnE stuff

20:02 <PapaFrog> Pentium Pro, IIRC.

20:03 <geist> B2.7 if you have the latest ARM ARM, talks about "Memory Types and attributes"

20:03 <geist> which lays down the ground rules for the fundamental difference between normal memory and device memory

20:03 <geist> and then within those classifications what the different sub bits mean

20:04 <clever> *looks*

20:05 <geist> it says at some point that device-GRE is pretty much the same thing as normal uncached memory, *except* the cpu is not allowed to speculatively fetch it

20:06 <clever> The Normal memory type attribute applies to most memory in a system. It indicates that the hardware is permitted

20:06 <clever> by the architecture to perform Speculative data read accesses to these locations, regardless of the access permissions

20:06 <clever> for these locations.

20:06 <clever> for my version of the doc, it starts by stating that normal memory can be prefetched

20:06 <geist> right

20:06 <clever> and that doesnt seem to care if it can be cached or not

20:07 <geist> thats sort of the fundamental difference. the lowest tier of normal memory (uncached) is sort of like the least restricted version of device memory (device-GRE) except the latter cannot be prefetch/speculatively accessed

20:07 <geist> so they almost overlap

20:07 <clever> yeah

20:09 <clever> ive also noticed, MAIR has 4 different aliases

20:09 <clever> PRRR, MAIR0+MAIR1 (that one is known), and NMRR

20:10 <geist> that's all 32bit nonsense

20:10 <geist> never heard of the prrr and nmrr but not surprised

20:11 <mrvn> Except now comes ARM and has those contiguous pages. E.g. 16k pages with 4k granularity that take up 4 entries in the page table.

20:11 <geist> yep

20:11 <zid> PRRRRRR is a good register

20:11 <mrvn> That is kind of like the initial idea about having to space out entries.

20:12 <geist> it's kinda a freebie, except it adds some amount of software complexity

20:12 <geist> so it's sort of an opt in

20:12 * clever reads PRRR, Primary Region Remap Register

20:12 <geist> it's some arm32 shit

20:13 <geist> basically MAIR_EL1 in 64bit mode cleans all that up

20:13 <clever> yeah

20:13 <geist> arm32 had a good 30 year run to build up some legacy as they added new features and had to cram bits in new registers, etc

20:13 <clever> it seems to be tied to whatever TTBCR.EAE is

20:13 <geist> and now arm64 has had a 12 year run to start picking up new stuff

20:13 <clever> TTBCR, Translation Table Base Control Register

20:13 <clever> Extended Address Enable. The meanings of the possible values of this bit are:

20:15 <j`ey> geist: soon arm64 will be a teenager

20:15 <clever> j`ey: how long until it can drink? lol

20:16 <j`ey> 6 years, since its in the UK :P

20:17 <sbalmos> j`ey: Does that mean, even in Supervisor Mode, you'll start randomly getting spasms and a new "you can't make me!" bit set?

20:17 <j`ey> hah

20:17 <sbalmos> or is that where the compiler says "I hate you! You're so stupid!" instead of compiler errors?

20:18 <clever> sbalmos: the ultimate "it wont let me"!

20:18 <j`ey> gcc and llvm are way past that!

20:18 <clever> i see that problem from a lot of noobs, who describe any error as "it wont let me" and dont bother saying what the error is

20:19 <sbalmos> clever: like "why's my memory corrupt"?

20:19 <clever> sbalmos: no, even dumber, they mkdir /mnt/data/foo, then ask why it wont let them mkdir /data/foo/bar

20:20 GeDaMo has quit [Quit: There is as yet insufficient data for a meaningful answer.]

20:20 <clever> but they omit enough details, that it takes an hour to realize that

20:25 <clever> bbl

20:30 <mrvn> clever: how do I create a directory without mkdir?

20:48 hysv has quit [Remote host closed the connection]

20:48 heat has joined #osdev

20:53 <clever> mrvn: i can see how you might do it with a text editor, gcc, the syscall function, and the right numbers, lol

20:53 <PapaFrog> Solution.. mkdir -p

20:53 <clever> run the mkdir syscall, without ever typing mkdir!

20:53 <PapaFrog> Maybe add sudo?

20:54 <heat> sup doofuses

20:54 <mrvn> clever: you missed the point. That's what noobs always ask.

20:55 <heat> that's an easy question

20:55 <heat> mknod

20:55 <heat> NEXT

20:55 <\Test_User> hexeditor on /dev/sda

20:55 <mrvn> How do I do X without that thing that was specifically made to do X because nothing else would do it?

20:56 <heat> well, they don't know if there's another way

20:56 <heat> the question is valid

20:56 <clever> \Test_User: better umount the disk first! lol

20:56 <\Test_User> my answer should work for basically any question about how to do x without x

20:56 <\Test_User> lol yeah

20:56 <\Test_User> "basically" meaning it doesn't do hexeditor without a hexeditor :P

20:58 <mrvn> heat: then they explain why they ask: I know 'mkdir' was specifically designed to create directories in the best way possible. But isn't there something better out there?

20:58 <heat> and there is

20:58 <heat> mkdirat

20:58 <heat> this is why these questions aren't quite stupid

20:58 <mrvn> for the purpose of this example that is the same

20:58 <heat> you need to give me a concrete example

20:59 <heat> also, not everyone knows quite as much as you

21:00 gildasio has quit [Remote host closed the connection]

21:00 foudfou has quit [Remote host closed the connection]

21:00 foudfou has joined #osdev

21:00 gildasio has joined #osdev

21:03 <mrvn> heat: I will let you know next time I see it on stackoverflow

21:04 <mrvn> slightly similar: https://stackoverflow.com/questions/72891650/unordered-multimap-element-output-order-is-weird/72891685#72891685

21:04 <bslsk05> stackoverflow.com: c++ - unordered_multimap element output order is weird - Stack Overflow

21:05 <mrvn> Obviously it's called unordered because it has the items in-order of input.

21:05 <heat> ok that's easy

21:05 <heat> they thought it's called unordered because the map doesn't sort itself

21:05 <heat> this is not a stupid question, just a lack of knowledge

21:08 <mrvn> except they experimentally proved it's not in input order
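
A small sketch of the point being argued, assuming nothing beyond the standard library: `std::unordered_multimap` promises which elements it holds, not the order in which iteration visits them. The helper names `iteration_order` and `same_contents` are made up for this illustration.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Entry = std::pair<int, std::string>;

// Copy the entries out in whatever order the container iterates them.
// That order depends on the hash function and bucket layout and is not
// specified by the standard.
std::vector<Entry> iteration_order(
        const std::unordered_multimap<int, std::string>& m) {
    return std::vector<Entry>(m.begin(), m.end());
}

// Compare two entry lists while ignoring order, which is the only
// comparison an unordered container lets you rely on.
bool same_contents(std::vector<Entry> a, std::vector<Entry> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    return a == b;
}
```

If insertion order matters, one option is to keep a `std::vector<Entry>` next to the map, or to use the vector alone when lookups are rare.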

21:10 foudfou has quit [Remote host closed the connection]

21:10 foudfou has joined #osdev

21:20 dennis95 has quit [Quit: Leaving]

21:20 gog has joined #osdev

21:24 ethrl has quit [Quit: WeeChat 3.4.1]

21:34 <kingoffrance> ive seen "set" used as implying order and uniqueness, but that might be a math thing

21:35 <kingoffrance> *unique elements

21:36 <gog> yes

21:49 <mrvn> since when are math sets ordered?

21:51 <mrvn> "What do you want your password to be?" "How about me-ma's birthday? January 2nd 34." "So just 1234?"

21:55 * kingoffrance thats even better holds up road sign "good luck" arrows pointing in all directions

21:55 <kingoffrance> theres some movie a guy on a ship is directing the plane and waving it all over...cue plane crash lol

21:56 terminalpusher has quit [Remote host closed the connection]

22:07 <mrvn> kingoffrance: https://www.youtube.com/watch?v=NfDUkR3DOFw

22:07 <bslsk05> 'Roger Roger - Airplane! (8/10) Movie CLIP (1980) HD' by Movieclips (00:01:37)

22:11 <geist> heh what a silly movie

22:11 <geist> though i still think Top Secret! is their best

22:14 <PapaFrog> I picked the wrong day to quit sniffing glue.

22:14 <sortie> surely airplane! is best

22:14 <PapaFrog> Yes it is, and don't call me Shirley!

22:14 <sortie> :D

22:15 <mrvn> PapaFrog: shirley you are joking.

22:24 <mats1> yes daddy

22:27 <gorgonical> Tantalizingly close to a working printk

22:27 <gorgonical> I should probably change these printks to early printk tho because cpuid isn't available yet and it's a nullptr deref

22:27 <gorgonical> Seems like spinlocks are wrong atm lol

22:28 <gorgonical> Locking to obtain console access and never passing the acquire

22:40 <mrvn> Your console shouldn't have any methods. The lock guard object you get from acquire should have methods so you can't call them without lock held at all.

22:41 <heat> this seems like linux

22:41 <gorgonical> It is stolen from linux, yes

22:42 <gorgonical> It isn't console.lock or anything though

23:06 nyah has quit [Quit: leaving]

23:37 zaquest has quit [Remote host closed the connection]

23:38 gildasio has quit [Quit: WeeChat 3.5]

23:38 zaquest has joined #osdev

23:39 <mrvn> Why isn't that a thing in the STL. Like `std::mutexed<T>` with `acquire()` that returns a proxy object holding the mutex and letting you call `o->method(bla);`
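
A minimal sketch of the idea mrvn describes, assuming C++17. `Mutexed`, `Guard` and `Console` are hypothetical names; no such type exists in the standard library. The wrapped object is private, so the only way to reach its methods is through the guard returned by `acquire()`, which holds the lock for as long as it lives.

```cpp
#include <mutex>
#include <string>
#include <utility>

template <typename T>
class Mutexed {
public:
    explicit Mutexed(T value = T{}) : value_(std::move(value)) {}

    // Proxy that owns the lock and forwards member access to the value.
    class Guard {
    public:
        T* operator->() { return value_; }
        T& operator*() { return *value_; }

    private:
        friend class Mutexed;
        Guard(std::mutex& m, T& v) : lock_(m), value_(&v) {}

        std::unique_lock<std::mutex> lock_;  // released when Guard dies
        T* value_;
    };

    // Blocks until the mutex is available, then hands out the proxy.
    Guard acquire() { return Guard(mutex_, value_); }

private:
    std::mutex mutex_;
    T value_;
};

// Hypothetical console type, used only to show the call syntax.
struct Console {
    std::string buffer;
    void write(const std::string& s) { buffer += s; }
};
```

With this shape, `console.acquire()->write("x");` locks for the duration of that one statement, while `auto g = console.acquire();` keeps the lock until `g` goes out of scope. Acquiring twice on the same thread while a guard is still alive would deadlock, since `std::mutex` is not recursive.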

23:40 <heat> cuz the stl is crap

23:40 <heat> there it is

23:40 <heat> I said it

23:41 <heat> why do STL types take deleters and allocators as template arguments instead of an argument you pass to the constructor

23:41 <heat> does your unique_ptr have a different deleter? not the same type, sorry!
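
A short sketch of the complaint, assuming C++17. `CountingDeleter` and `make_counted` are hypothetical names. `std::unique_ptr` carries its deleter as a template parameter, so two pointers with different deleters have different types; `std::shared_ptr` takes the deleter as a constructor argument and type-erases it, so the static type stays the same.

```cpp
#include <memory>
#include <type_traits>

// Deleter that records how many times it ran.
struct CountingDeleter {
    int* counter;
    void operator()(int* p) const {
        ++*counter;
        delete p;
    }
};

// The deleter is part of the unique_ptr type, so these two differ.
static_assert(!std::is_same_v<std::unique_ptr<int>,
                              std::unique_ptr<int, CountingDeleter>>);

// shared_ptr accepts the deleter at construction time instead, and the
// result is a plain std::shared_ptr<int> whatever the deleter is.
std::shared_ptr<int> make_counted(int value, int* counter) {
    return std::shared_ptr<int>(new int(value), CountingDeleter{counter});
}
```

The usual argument for the `unique_ptr` design is that a stateless deleter in the type adds no storage or indirection, whereas `shared_ptr` already pays for a control block where an erased deleter can live.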

23:55 <heat> have you guys compiled a modern linux kernel on 400MHz?

23:55 <heat> it's a great experience