klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
heat has quit [Remote host closed the connection]
heat has joined #osdev
gog has quit [Read error: Connection reset by peer]
opal has quit [Remote host closed the connection]
gxt__ has quit [Remote host closed the connection]
wand has quit [Remote host closed the connection]
foudfou has quit [Remote host closed the connection]
opal has joined #osdev
gxt__ has joined #osdev
foudfou has joined #osdev
zaquest has quit [Remote host closed the connection]
zaquest has joined #osdev
gog has joined #osdev
wand has joined #osdev
foudfou has quit [Remote host closed the connection]
foudfou has joined #osdev
pretty_dumm_guy has quit [Quit: WeeChat 3.5]
Lumia has joined #osdev
heat has quit [Ping timeout: 240 seconds]
doug16k has quit [Remote host closed the connection]
Celelibi has joined #osdev
[itchyjunk] has quit [Ping timeout: 244 seconds]
doug16k has joined #osdev
[itchyjunk] has joined #osdev
<doug16k> a minute or two to parse dragon.c? wow
<doug16k> the parser should rip through it
smeso has quit [Quit: smeso]
<doug16k> can't say I have generated huge sources and tried to compile them much though
<doug16k> sounds like a good idea
<doug16k> reminds me of when you had a bunch of DATA statements in basic and you had a for loop that uses READ and POKE to put the asm in RAM :D
gog has quit [Ping timeout: 272 seconds]
<doug16k> not as bad though if your payload is numbers and the source is the numbers
<geist> What is interesting is I’ve observed in one case that aggressive C inlining can result in crazy compile times (vs C++)
<geist> Specifically zstd.c in the zstd compression thing
<geist> It is a .c file that uses the forced inline gcc intrinsic to compute some table, the way you would with a constexpr thing in C++
<geist> but it takes minutes to run, whereas the equivalent in C++ is almost instant
<geist> Someone on compiler team mentioned that’s because it’s a fundamentally different path in the compiler, and there’s lots of infrastructure to deal with constexpr things in C++
<geist> Whereas C is doing it much more brute force where it literally expands the whole thing and then has to use the backend optimizers to flatten it down
<moon-child> wow. Even constexpr is not that fast
<moon-child> d is moving to jit compilation for its constexpr-alike, and iirc one of the touted benefits of circle is better compile times
smeso has joined #osdev
<doug16k> hey I just realized, it might be drastically faster for zid to generate asm for those arrays
<doug16k> it would be hardly any changes
<doug16k> it would be interesting to put 1GB of .single in a file and see how long it takes
<moon-child> eh, if you care about performance, I would just do the linker trick
<doug16k> yeah but that is such a gross hack though. it would be ok-ish if it didn't create that nonsensical _size symbol
<doug16k> it can be weird if you are doing stuff close to address zero and there is this weird _size in the middle of your stuff
<doug16k> why oh why isn't it a half open range with a start and end symbol
<doug16k> cuckoo
<moon-child> at least #embed is coming
<moon-child> soon(tm)
<doug16k> I think I used it though, even though I dislike that _size thing it does
<doug16k> for vga font
<doug16k> kernel doesn't care about near address zero
<doug16k> I do some build step to bundle up the font data and the lookup table that says all the unicode codepoints it has and include that
<doug16k> this one https://www.inp.nsk.su./~bolkhov/files/fonts/univga/
<bslsk05> ​www.inp.nsk.su.: Unicode VGA font homepage
<doug16k> a trivial little python script to blast a binary into asm source and assemble that would be all you need to remove that ld trick dependency
<doug16k> isn't there .incbin in gnu as?
<doug16k> wow I could have sworn gnu assembler had that
<doug16k> oh it does
<doug16k> even has start and length optional parameters to extract a portion
<doug16k> and guarantees it won't do any funny alignment business, and leaves it up to you to restore alignment before or after
<doug16k> but yeah, doing it as generated asm preserves the endian independence
<doug16k> makes me wonder why it is not more common to generate code. protobuf does. parser generators. maybe some GUI form stuff.
<doug16k> hardly anything though
<clever> doug16k: i do like .incbin over generating giant c arrays, it also feels like it would compile faster
<clever> no time wasted turning your binary into hex, then back into binary
<bslsk05> ​github.com: lk-overlay/payload.S at master · librerpi/lk-overlay · GitHub
<clever> an example where i generated an array of addr+length pairs using .incbin
<clever> there is a struct in the neighboring file, that turns it into a simpler api for c
<doug16k> with nice sane half open start and end symbols
<bslsk05> ​github.com: lk-overlay/arm.c at master · librerpi/lk-overlay · GitHub
<clever> on the C side, i can just do chosenPayload = &arm_payload_array[processor]; and chosenPayload->payload_addr, chosenPayload->payload_size
<doug16k> neat
<clever> in your case, you could do the exact same thing, with an array of multiple font blobs
<clever> the key to that trick, is to create a struct with a known memory layout, in this case a pointer and an int (pointer always being 32bit in this case), and then making the asm side being pairs of 32bit ints
<clever> extern arm_payload arm_payload_array[3];
Lumia has quit [Ping timeout: 240 seconds]
<clever> i even have the length specified, so gcc will try to enforce it, though only for constant indexes i think
<clever> and extern tells the linker to find it in another unit
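A rough C-side sketch of the struct-array trick described above. The names arm_payload, arm_payload_array, payload_addr and payload_size come from the chat; everything else is an assumption made for illustration. In the real setup the table would be emitted by assembly (for example with .incbin between labels) and declared extern in C. Here the table is defined in C over stand-in byte arrays so the sketch builds on its own, and choose_payload is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout shared with the assembly side: one address and one length per blob.
 * In the chat's setup pointers are 32 bits wide, so the assembly can emit
 * pairs of 32-bit words. A native pointer is used here so the sketch builds
 * on any host. */
typedef struct {
    const void *payload_addr;
    uint32_t    payload_size;
} arm_payload;

/* Stand-in blobs (hypothetical). In the real build these bytes would come
 * from files pulled in by the assembler. */
static const uint8_t blob0[] = { 0x01, 0x02, 0x03 };
static const uint8_t blob1[] = { 0xAA, 0xBB };
static const uint8_t blob2[] = { 0x10, 0x20, 0x30, 0x40 };

/* In the real build this would only be declared:
 *     extern arm_payload arm_payload_array[3];
 * and the definition would live in the assembly file. */
const arm_payload arm_payload_array[3] = {
    { blob0, sizeof blob0 },
    { blob1, sizeof blob1 },
    { blob2, sizeof blob2 },
};

/* Hypothetical helper: pick the blob for a given processor index. */
const arm_payload *choose_payload(unsigned processor)
{
    assert(processor < 3);
    return &arm_payload_array[processor];
}
```

A caller would then read choose_payload(i)->payload_addr and choose_payload(i)->payload_size, which matches the chosenPayload usage quoted above.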
<doug16k> yeah, my thing starts with known structs, by the time you get to the array part, you know where everything is
<doug16k> doesn't need end
<doug16k> it's the obvious header struct says count, then array of count items after struct, then another thing that starts at end of array
<doug16k> array is the lookup table for codepoint, corresponding bitmap is at an offset into 3rd region, based on what index matched the codepoint lookup
<doug16k> and precompute a lookup table of the first 127 ascii or something so they're O(1)
<doug16k> at runtime
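The font layout described above (a count, a sorted codepoint table, a bitmap region, and a precomputed ASCII table) could be sketched like this. This is not doug16k's actual format: font_t, GLYPH_BYTES, the 8x16 glyph size and all function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GLYPH_BYTES 16u   /* assumed 8x16 glyph, one byte per row */

/* Hypothetical in-memory view of the font blob. */
typedef struct {
    uint32_t        count;       /* number of glyphs */
    const uint32_t *codepoints;  /* sorted ascending, count entries */
    const uint8_t  *bitmaps;     /* count * GLYPH_BYTES bytes */
    int32_t         ascii[128];  /* precomputed glyph index, or -1 */
} font_t;

/* Binary search over the codepoint table; returns the glyph index or -1. */
int32_t font_search(const font_t *f, uint32_t cp)
{
    uint32_t lo = 0, hi;
    assert(f != NULL);
    hi = f->count;                       /* half-open range [lo, hi) */
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (f->codepoints[mid] < cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < f->count && f->codepoints[lo] == cp) ? (int32_t)lo : -1;
}

/* Fill the ASCII fast path once, so later ASCII lookups are O(1). */
void font_init_ascii(font_t *f)
{
    for (uint32_t c = 0; c < 128; c++)
        f->ascii[c] = font_search(f, c);
}

/* O(1) for ASCII, O(log n) otherwise; NULL when the glyph is missing. */
const uint8_t *font_glyph(const font_t *f, uint32_t cp)
{
    int32_t i = cp < 128 ? f->ascii[cp] : font_search(f, cp);
    return i < 0 ? NULL : f->bitmaps + (size_t)i * GLYPH_BYTES;
}
```

Keeping the codepoints in their own array, separate from the bitmaps, is the "fully SOA" direction: the search only touches the keys it compares.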
<doug16k> it's semi-SOA
<doug16k> should make it fully SOA though
<doug16k> when doing the search, it pulls in unnecessary fields
<clever> oh, i do also have a font drawing example...
[itchyjunk] has quit [Remote host closed the connection]
<bslsk05> ​github.com: wowmapviewer/font.cpp at master · cleverca22/wowmapviewer · GitHub
<bslsk05> ​github.com: wowmapviewer/arial.info at master · cleverca22/wowmapviewer · GitHub
<clever> and the arial.tga in the bin
Lumia has joined #osdev
<doug16k> should use triangle strip and do whole string in one Begin
<clever> i was using this code as a testcase, when writing my own gl driver from scratch
<clever> and it only supported triangles, because i didnt understand strip at the time
<doug16k> ah. strip is easy, you just do top, bottom, top, bottom, top, bottom, across the row
<clever> i have since looked it up, but its not clear how to end a strip on v3d
<doug16k> ah, I see. I meant easy at API level :)
<clever> doug16k: but, what if i want to have 2 polygons, that dont share any vertex?
<doug16k> after the 1st two vertices, each one vertex is another whole triangle
<clever> how do i end the strip?
<doug16k> then you make a degenerate triangle
<clever> which is?
<doug16k> or there is a "primitive restart index" you can set for indexed
<doug16k> if you repeat a point then it will be a zero area triangle over to the new place
<clever> ah
<doug16k> imagine the "base" of the triangle has 0 width
<clever> yeah
<clever> then its just a line
<clever> and the rasterizer wont find any pixels within that "area"
<doug16k> right but because being on right edge overrules being on left edge, no pixels are inside
<doug16k> s/but//
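The degenerate-triangle trick can be sketched as a strip builder for quads that share no vertices. This is only an illustration with hypothetical names (vec2, build_quad_strip), and it assumes screen coordinates with y growing downward. Between two quads it repeats the last vertex of one and the first vertex of the next, which yields zero-area triangles and keeps the winding parity of the following quad unchanged.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y; } vec2;

/* Strip vertices needed for n separate quads: 4 per quad plus 2 stitching
 * vertices between each consecutive pair. */
size_t strip_vertex_count(size_t n)
{
    return n == 0 ? 0 : 4 * n + 2 * (n - 1);
}

/* Emit one w-by-h quad per origin (its top-left corner) as a single triangle
 * strip. Returns the number of vertices written to out. */
size_t build_quad_strip(const vec2 *origins, size_t n, float w, float h,
                        vec2 *out, size_t cap)
{
    size_t k = 0;
    assert(cap >= strip_vertex_count(n));
    for (size_t i = 0; i < n; i++) {
        vec2 tl = { origins[i].x,     origins[i].y     };
        vec2 bl = { origins[i].x,     origins[i].y + h };
        vec2 tr = { origins[i].x + w, origins[i].y     };
        vec2 br = { origins[i].x + w, origins[i].y + h };

        if (i > 0) {
            out[k] = out[k - 1];  /* repeat the previous quad's last vertex */
            k++;
            out[k++] = tl;        /* repeat this quad's first vertex */
        }
        out[k++] = tl;            /* top, bottom, top, bottom */
        out[k++] = bl;
        out[k++] = tr;
        out[k++] = br;
    }
    return k;
}
```

Every triangle that includes a repeated vertex has zero area, so the rasterizer finds no pixels inside it and the strip effectively jumps to the next quad.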
<clever> but i also see a limitation there, 16bit index is the biggest it supports
<clever> which means the vertex array, can only have ~65535 elements
<doug16k> you don't need indexed
<doug16k> it would be silly to make text indexed
<doug16k> the index array says 0, 1, 2, 3, 4, 5, 6, ... every time
<doug16k> because you used a strip
<clever> so i would instead do vertex array primitives?
<doug16k> for that yeah
<doug16k> because the index traversal is trivial 0-N
<clever> though, that wants the index of the first vertex, as a 32bit int
<doug16k> 0 then
<doug16k> or whatever if you are packing multiple things into a single array being smart
<clever> and given the memory constraints, 32bit index and 32bit count, i'll run out of ram before i exhaust those
<clever> vc4-v3d is 1gig of ram, so even with 1 byte vertices, 30bits for the length/index would exhaust all ram
<clever> bigger vertex data, means i use even fewer bits of index/length
<clever> vc6-v3d has an mmu and can address 4gig of data, but same rules, i would need a 1 byte vertex data to even come close to exhausting that reach
<doug16k> you can do even better
<doug16k> you can make it one vertex per character and make it create the vertices in the shaders
<clever> i havent tested vertex shaders yet
<doug16k> I did one where it was an x,y,c triple for each glyph
<clever> and i dont know how to make 4 vertexes from 1 entry
<doug16k> using instancing, nvidia instantly draws it
<clever> i'm not sure v3d can do instancing
<doug16k> don't need it - it helps with the make-vertices-in-shader trick
<clever> the vertex shader seems to turn 1 unshaded vertex (attributes) into one shaded vertex (xyz+vary[])
gxt__ has quit [Remote host closed the connection]
gxt__ has joined #osdev
<doug16k> you must mean xyzw
<doug16k> NDC right?
<clever> w/1
<clever> wait no, thats pointless, lol
<clever> 1/w
<clever> page 60/61 of the pdf, is the vertex formats
<clever> the coordinate shader has to produce the ones on 61 i believe
<clever> x/y/z/w/x/y/z/ 1/w
<doug16k> yeah 1/w
<clever> xy is twice, because there is both 3d and screen 2d
<doug16k> then you get screenx= x*1/w, screeny = y*1/w, screenz = z*1/w, where screenx and y are -1 to 1 and z is 0 - 1 IIRC
<clever> screen x/y are also in a 12.4 fixed-point format
<doug16k> makes sense, 1/16th subpixel
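A small sketch of the perspective divide and the 12.4 fixed-point conversion discussed above. The viewport mapping (NDC -1..1 mapped onto 0..width with a top-left origin) is an assumption for illustration, and the hardware's actual origin convention may differ. shaded_vertex, to_fixed_12_4 and project are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int16_t xs, ys;   /* screen position, 12.4 fixed point (1/16 pixel) */
    float   zs;       /* depth after the divide */
    float   inv_w;    /* 1/w, kept for perspective-correct interpolation */
} shaded_vertex;

/* Round a pixel coordinate to the nearest 1/16th of a pixel. */
int16_t to_fixed_12_4(float pixels)
{
    float scaled = pixels * 16.0f;
    return (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* Clip-space (x, y, z, w) to a screen-space vertex. */
shaded_vertex project(float x, float y, float z, float w,
                      int width, int height)
{
    shaded_vertex v;
    assert(w != 0.0f);
    v.inv_w = 1.0f / w;
    /* NDC is x/w and y/w in -1..1; map that onto pixel coordinates. */
    v.xs = to_fixed_12_4((x * v.inv_w * 0.5f + 0.5f) * (float)width);
    v.ys = to_fixed_12_4((y * v.inv_w * 0.5f + 0.5f) * (float)height);
    v.zs = z * v.inv_w;
    return v;
}
```

With 4 fractional bits a signed 16-bit value covers roughly -2048 to +2047 pixels, which is why the format suits screen-sized render targets.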
<clever> and the hardware has some anti-aliasing built in
<clever> where you can render at double res, and it will down-scale as it saves to the framebuffer
<clever> beyond that, you need to down-scale with some other hw block
<clever> ah, there is what i was looking for, page 78, shader state record formats
<clever> you need one of those, to describe the shader
<doug16k> super sampling is smart. hits the cache almost every time
<clever> the "GL shader" takes 3 shaders, coordinate, vertex, and fragment
<clever> then the "NV Shader" (no vertex?) takes just a fragment shader, and pre-shaded vertex data
<clever> and ive got no clue what the "VG Shader" is for, it takes counts for things, but has no pointer to the shader!?
<clever> *doh*, there it is, fragment shader code address
<clever> not sure what it does
<doug16k> what fragment shader does?
<clever> the fragment shader decides the final color of a pixel
<doug16k> yeah
<clever> the vertex shader generates the xyz/vary[] from attributes
<clever> the coordinate shader only generates xyz for tile binning
<doug16k> the rasterization step interpolates stuff from the vertices and executes the shader for each intermediate value
<clever> yeah
<clever> taking a closer look at table 45 on page 79
<clever> a GL shader, takes a fragment shader uniform count (unused), fragment shader varying count, fragment shader addr, and fragment shader uniform addr
<clever> given that uniform count is unused, that implies its just going to fall off the end of the array and hit undefined data, and you should just not read too many uniforms
<doug16k> absolutely
<doug16k> gpus couldn't care less about correctness
<clever> then the vertex shader has the same unused uniform count, the total attribute size, code addr, and uniform addr
<clever> so fragment and vertex can use different uniforms
<clever> oops, and missed one, vertex shader also has an 8bit attribute selection mask
<clever> that implies you can have up to 8 attributes on a vertex, and select any combination of those 8 for the vertex shader
<clever> does that fit with what opengl standards say?
<doug16k> 8 is plenty
<clever> then you have the same thing again for coordinate shader, uniform count(unused), attribute mask, attribute size, code addr, uniform addr
<doug16k> there is a getInteger api to query limits like that
<clever> then things get a bit more messy
<clever> you have an array of up to 8 base addresses, for each attribute
<doug16k> it's 8 vectors right? of xyzw? float4 right?
<doug16k> that is tons
<clever> then you have an array of sizes in bytes, the 8 strides, the 8 vertex offsets, and the 8 coordinate offsets
<clever> i think the shader is free to interpret the attributes however it likes
<clever> one sec
<doug16k> yes
<clever> page 60
<clever> > the vertex attribute data for each of the 16 vertices, is loaded into a single column for each vertex
<clever> > such that the qpu can read a horizontal vector of vertices for each individual attribute.
<clever> > the attributes for each vertex are packed into the column, according to the setup data supplied to the VCD from the appropriate shader state record
<clever> then pages 54 and 55
<doug16k> what version of opengl is it supposed to be?
<doug16k> sounds at least 3.3
<doug16k> 3.3 is the 1st awesome version imho
<clever> page 12 says all of that
<clever> opengl-es 1.1/2.0 and openvg 1.1
<doug16k> 2.0 is good
<clever> back to page 54, if i was to do a horizontal 32bit load, for y=0, then i would get attribute 0 in the 16 lanes of the QPU
<doug16k> 2.0 is "oops, sorry we made you do so many function calls" and 3.3 is "sorry, everything is buffers and shaders now!"
<clever> as-in, attribute 0 of 16 vertices
<clever> but, if i was to do 16bit laned, y=0, h=0, i would get the lower 16bits of each attribute 0, i think
<doug16k> 1.x made you do too many calls
<clever> so i could pack multiple attributes
Lumia has quit [Ping timeout: 240 seconds]
<clever> thats also the feeling i got from opengl ES
<clever> it felt like ES was for embedded systems, and forced you to do buffer based stuff, rather than function calls up the wazoo
<clever> and the wowmapviewer font code, is making a call per vertex, which wouldnt have worked on the rpi's old opengl ES
<doug16k> if you are in kernel mode already, your glVertex could poke it straight into the GPU somehow, and it would be pure genius
<doug16k> that is why it was like that - SGI had so much acceleration, the glVertex call just did MMIO or something
<clever> the QPU doesnt really have any of its own ram
<clever> it just dma's into main system ram
<clever> the old v3d2 driver i was writing for linux, would mmap buffers into userland, that were physically contiguous
<clever> so i could then just pass that buffer directly to the qpu
<clever> and i only had to do cache management
<doug16k> or not even in kernel mode - could map it in wherever
<clever> my rough understanding of this region, is that the VPM is an uint32_t[64][16], one of the hardware blocks will pre-load attribute data into it for me
<clever> the coordinate/vertex shader must then compute things, and write the result back into the VPM
<clever> the FEP (front end pipe) will then read the shaded vertices from the VPM, rasterize them, and feed varyings to the VRI (interpolator)
<clever> and schedule fragment shader jobs on the QPU
<clever> the tile buffer is 64x64 pixels, when using 32bit pixels, and no multi-sampling
<clever> and because it renders a whole tile at once, it only ever needs to track depth/coverage data for a 64x64 pixel region
<clever> and while storing that tile to ram, it is a series of 64 x 256byte AXI bursts, so it should be very good at maxing out memory bandwidth
Lumia has joined #osdev
<clever> in "4x multisample mode", its instead 32x32 pixels when written out to ram
<clever> i assume its just taking each 2x2 chunk, averaging them together, and writing it out as 1 pixel
<doug16k> yeah, that's one area I wish it were easier to study. tiling is handwaved
<clever> it does also have a 64bpp mode, which results in either 32x32 or 16x16 tiles, depending on multi-sample
<doug16k> the exact algorithm I mean
<doug16k> it's the strip mining optimization isn't it?
<clever> that part, is under driver control
<bslsk05> ​github.com: lk-overlay/v3d.c at master · librerpi/lk-overlay · GitHub
<doug16k> aka loop blocking, aka loop tiling
<clever> lines 313-315, specifies the tile coordinates
<clever> then line 318 calls a function that was generated earlier, with 319 computing the addr of that function, based on the tile coord
gxt__ has quit [Remote host closed the connection]
<clever> 327 will store the tile and continue, while 324 will store the tile and report the frame as finished
gxt__ has joined #osdev
<clever> 310/311 then just manually unrolls a simple loop, to step thru every tile in the frame
<doug16k> where do the draw calls come into it?
<clever> but you could do tiles in any order you want
<clever> thats the generated function, that 318 calls
<doug16k> so it replays all the draw calls on each tile?
<clever> the binner will run your coordinate shader, figure out what polygons are in each tile, and generate a control list, that draws a subset of the polys
<clever> yeah
<doug16k> right binner figures out which it covers
<doug16k> nightmare
<clever> thats why it has both a coordinate and a vertex shader
<clever> the coordinate shader is just a lobotomized vertex shader, with the varying computation deleted
<clever> so you can get the screen xy, and then see what tiles it covers
<doug16k> it's conceptually the same idea as using a solid fill shader when doing shadow maps
<clever> mesa hides this from you, and auto-generates the coordinate shader from the vertex shader
<clever> but its also not having to run at the full res
<clever> you could basically treat it as an image with 1/64th the resolution
<clever> the whole tile is 1 pixel
<clever> if the polygon touches it, in the bin it goes
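The "whole tile is 1 pixel" idea can be sketched as a conservative binner. This is not the PTB's actual algorithm (the chat notes that part is not well documented). It simply marks every tile overlapped by the triangle's bounding box, which can include tiles the triangle itself never touches. TILE, pt and bin_triangle are hypothetical names, and the bins array holds one byte per tile in row-major order.

```c
#include <assert.h>
#include <stdint.h>

#define TILE 64   /* tile edge in pixels (32bpp, no multisampling, per the chat) */

typedef struct { int x, y; } pt;

static int min3(int a, int b, int c) { int m = a < b ? a : b; return m < c ? m : c; }
static int max3(int a, int b, int c) { int m = a > b ? a : b; return m > c ? m : c; }

/* Mark every tile overlapped by the triangle's bounding box.
 * Returns how many tiles this triangle was binned into. */
int bin_triangle(pt a, pt b, pt c, int tiles_x, int tiles_y, uint8_t *bins)
{
    int x0 = min3(a.x, b.x, c.x), x1 = max3(a.x, b.x, c.x);
    int y0 = min3(a.y, b.y, c.y), y1 = max3(a.y, b.y, c.y);
    int w = tiles_x * TILE, h = tiles_y * TILE, marked = 0;

    assert(tiles_x > 0 && tiles_y > 0 && bins);
    if (x1 < 0 || y1 < 0 || x0 >= w || y0 >= h)
        return 0;                       /* entirely off screen */

    /* Clamp to the screen, then shrink pixel coordinates to tile coordinates. */
    if (x0 < 0) x0 = 0;
    if (y0 < 0) y0 = 0;
    if (x1 >= w) x1 = w - 1;
    if (y1 >= h) y1 = h - 1;

    for (int ty = y0 / TILE; ty <= y1 / TILE; ty++)
        for (int tx = x0 / TILE; tx <= x1 / TILE; tx++) {
            bins[ty * tiles_x + tx] = 1;   /* "in the bin it goes" */
            marked++;
        }
    return marked;
}
```

A real binner would append the primitive to a per-tile list rather than set a flag, but the tile-selection step has the same shape.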
<bslsk05> ​github.com: lk-overlay/v3d.c at master · librerpi/lk-overlay · GitHub
<clever> this comment explains some of it
dude12312414 has joined #osdev
<doug16k> in the bin what goes?
dude12312414 has quit [Remote host closed the connection]
<clever> the polygon
<clever> one bin per tile
<doug16k> up to how many
<clever> treat each tile as 1 pixel, and rasterize it over that screen
<clever> thats what the tile allocation size is for
<clever> static int getTileAllocationSize(int n) { return 1 << (5 + n); }
<clever> page 71, code 112, the tile allocation block can be 32, 64, 128, or 256 bytes
<clever> it then gets filled with compressed primitive lists or inline primitive lists
Lumia has quit [Quit: ,-]
<clever> i forget where, but there is also an overflow space you can configure
<clever> and it will generate jump opcodes to pop into there temporarily
<clever> and it can fire an irq to request more
<doug16k> sounds fun to make a driver for that
<doug16k> it's almost unbelievable how high level it is
<doug16k> it's almost the opengl api directly
<clever> less work for the 250mhz VPU to do
<clever> back when this soc was arm-less, lol
<doug16k> my verilog is not even close to implementing that
<doug16k> I could get stuff to work but not cleanly I bet
<clever> the most complex thing that remains, is compiling a shader into QPU asm
<clever> ah, there it is, "address of overspill binning"
<clever> ctrl+f for that
<clever> > the address of additional memory that the PTB can use for binning once the initial pool runs out
<clever> > this may be set up prior to the PTB actually running out
<clever> that 2nd part, implies that you can react to an OOM IRQ, and then resume binning
<doug16k> so you can just start a new batch
<clever> and the next register then says
<clever> >if this count (the size) is zero when the PTB runs out of binning memory, the PTB will halt, waiting for a non-zero value to be written to this register
<clever> > once the PTB has taken this overspill memory, this register is set to zero
<doug16k> PTB?
<clever> primitive tile binner
<clever> from the big graph on page 13
<clever> > in the tile binning phase, only the vertex coordinate transform part of shading is performed
<clever> > the primitive tile binner fetches the transformed vertex coordinates from the VPM, and works out which tiles, if any, each primitive overlaps
<clever> > as it goes along, the PTB builds a list in memory for each tile, which contains all the primitives impacting that tile, plus references to any state changes that apply
<clever> page 62
<clever> > during the binning pass, the PTB automatically writes out a new control list for rendering each tile during the rendering pass
<clever> > the binning list must finish with a flush command, to cause the PTB to finalize all these tile lists
<clever> > all that the host processor then needs to do for the rendering pass list, is to setup the tile rendering mode configuration, and link together all the tile lists created by the PTB as sub-lists to the main list
<clever> > the only control items that the host processor needs to add to the per-tile lists are a "tile coords" item before and a "store tile" item after each tile list
<clever> as i showed in the unrolled loop earlier
MiningMarsh has quit [Quit: ZNC 1.8.2 - https://znc.in]
MiningMa- has joined #osdev
jack_rabbit has joined #osdev
knusbaum has quit [Ping timeout: 244 seconds]
MiningMa- is now known as MiningMarsh
<clever> doug16k: the CLE (control list executor) also has 2 threads, one for binning, and one for rendering, so the rendering thread can be computing frame 2, while the binning thread is computing frame 3
<clever> and while ive not used them yet, there are also semaphores, so the rendering thread can stall until binning has completed
<clever> instead, i'm waiting for an irq, and manually starting rendering
<clever> oh, interesting, the CLE starts executing at a defined start addr, and PAUSES when it hits the end addr
<clever> if you extend a control list, you can change the end addr, and it will RESUME! executing
<doug16k> yeah that sounds exactly like what hardware would do
<clever> so basically, instead of waiting for frame3 to finish binning, and reseting everything
<doug16k> it will watch that register 24/7 without complaining once
<clever> you can append frame4's binning list to the existing one, and extend the end-pointer
<clever> and the hw will automatically start binning frame4 when frame3 finishes
<clever> thats very nice, it lets you queue up several frames of work, and you dont have to be quick on the irq handling
<doug16k> yeah, covering latency is the whole job
<doug16k> you never want them taking turns waiting for the other
<clever> but with the call opcode, it can be difficult to know where within that master list and sub-functions, you currently are
<clever> so they also have a magic marker opcode, that just increments the marker count in a status reg
<clever> inject that wherever you feel, and keep track of how many markers you're expecting per frame
<clever> page 63
<clever> > in gl mode, the pipeline up to the PTB/PSE consists of the following steps:
<clever> > 1 determine a batch of vertices to shade in the VCM (vertex cache manager)
<clever> > 2 find space in the VPM (vertex pipe memory) to store the batch of vertex input attributes and shaded vertices
<clever> > 3 fetch vertex attributes to the VPM using the VCD
<clever> > 4 shade the vertices using a vertex/coordinate shader
<clever> > 5 PTB/PSE reads shaded vertex data from the VPM
<clever> PSE==primitive setup engine, PTB==primitive tile binner
darkstardevx has joined #osdev
<doug16k> how far are you into the graphics driver?
<clever> so the binner and renderer threads, are both running those 5 steps, to feed either the PTB or PSE
<clever> for my old pi1 linux work, i can render 2d textured polygons
<doug16k> that's awesome
<clever> let me grab a screenshot...
<doug16k> hardware compositor right there
<clever> each character is made up of 4 triangles
<clever> 2 to draw it in black, then 2 to draw it again in white, with an offset
<clever> creating the shadow effect
<doug16k> and if you rotated the vertices the textures would interpolate diagonally (properly)?
<clever> i assume so
<doug16k> oh man so you have accelerated texturing already
<doug16k> just do the transformations on the cpu and put NDC triangles 2d
<clever> for that textured case, i'm using 2 varyings for the texture UV, and 3 varyings for an rgb color
<clever> so i can multiply the RGBA from the texture, by the varying, to color the text
<bslsk05> ​'2d and 3d demo' by michael bishop (00:00:21)
<clever> this is the non-textured demo
<clever> its a single triangle, with the cpu using sin/cos to rotate the pre-shaded vertex XY's
<clever> and then RGB in the varyings
<clever> so the interpolator will give a smooth scale from 100% red to 0% red
<doug16k> that is doing the grunt part of rendering. even if you did all the transform and projection on the cpu, it would be nothing for the cpu
<bslsk05> ​github.com: lk-overlay/v3d.c at master · librerpi/lk-overlay · GitHub
<clever> this is where that sin/cos happens
<doug16k> cpu has simd right?
<clever> yep
<clever> for (int i=0; i<16; i++) { int temp = a[i] * b[i]; if (store) c[i] = temp; if (accumulate) accumulator[i] += temp; }
<clever> the VPU can run this entire line of code in just 2 clock cycles
<clever> but its integer only, no float in that mode
<clever> mult only allows 16bit inputs, most other opcodes can work on a full 32bit input
<doug16k> that's not really simd though, but yeah cool, vectorization support
<bslsk05> ​'vpu accelerated mandelbrot, final version' by michael bishop (00:00:18)
<clever> this is an example of what you can do with the VPU
<doug16k> I guess it is simd
<bslsk05> ​'non-accelerated mandelbrot' by michael bishop (00:00:13)
<clever> and this is doing the same function (but with floats) in scalar mode
<clever> what kind of fps does the non-accelerated one have?
<doug16k> yeah but mandelbrot can be anything from trivial to brutal
<doug16k> what zoom is that at?
<clever> the non-accelerated was the default from LK, which i think is just showing the whole thing
<doug16k> not fancy multiword precision? just ieee floats?
<clever> the accelerated one, is dynamically changing the zoom over time, because its fast enough to run at ~20fps
<clever> non-accelerated is just standard 32bit float
<bslsk05> ​github.com: lk/gfx.c at master · littlekernel/lk · GitHub
<clever> the non-accelerated source
<doug16k> ok, see what I mean right? it might be a super fast float one or crazy gigantic precision
<clever> its the hardware float opcodes, which take 21 clocks for some operations
<clever> and its doing 1 pixel at a time
<bslsk05> ​github.com: lk-overlay/core.S at master · librerpi/lk-overlay · GitHub
<clever> vs this code, which is doing most operations in 1-2 clocks, and its doing 16 pixels at once
<doug16k> is it way faster?
<clever> yes
<doug16k> most interesting is the latency of add and mul
<clever> 10 seconds for float, ~90ms for vectorized int
<clever> about two orders of magnitude
<doug16k> like python/mysql stuff
<doug16k> you go in and it's 14 seconds and when you are done it is 48ms
<doug16k> someone had two nested for loops with a cursor.execute inside
<doug16k> never heard of a join
<clever> heh
knusbaum has joined #osdev
jack_rabbit has quit [Ping timeout: 244 seconds]
<doug16k> it's fun though. the improvements are thousands of percent when you fix bad sql stuff
<clever> yeah
<clever> ive identified sql problems before, without even seeing the sql
<clever> i could tell what the developer did wrong, just from how the api's responded
<clever> it was an ingame mail box
<clever> and the more mail you had, the longer it took to load a single page of mail
<clever> there was an index on account, but no index for the limit clause to act on
<clever> and it got bad enough, that the default request timeout in the client would just give up
<doug16k> yeah, he made it so it has to "think about" all mail, then extract a subset
<doug16k> it's so tempting to do pagination with LIMIT
<doug16k> instead of continuation based
<doug16k> select ...everything... LIMIT 1024,16 sucks compared to select ...everything after id N... LIMIT 16
<clever> but what about when your ID's have holes, due to deleted emails, or ID's getting assigned to other users?
<doug16k> you don't care about the holes. they tell you the last id in the last response they got, and you select > that
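The two pagination styles can be modelled in memory to show why holes in the ids do not matter for the continuation approach. This is only a model, not SQL: a sorted array of ids stands in for the table, a binary search stands in for an index seek, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset paging, like LIMIT offset,limit: rows before the offset are still
 * visited and thrown away. *touched reports how many rows were walked. */
size_t page_by_offset(const uint32_t *ids, size_t n, size_t offset,
                      size_t limit, uint32_t *out, size_t *touched)
{
    size_t got = 0, i = 0;
    assert(ids && out && touched);
    for (; i < n && got < limit; i++)
        if (i >= offset)
            out[got++] = ids[i];
    *touched = i;
    return got;
}

/* Continuation paging, like WHERE id > last_id ... LIMIT limit: seek to the
 * first id greater than last_id, then read only `limit` rows. last_id does
 * not have to exist in the table, so deleted rows are harmless. */
size_t page_after_id(const uint32_t *ids, size_t n, uint32_t last_id,
                     size_t limit, uint32_t *out)
{
    size_t lo = 0, hi = n, got = 0;
    assert(ids && out);
    while (lo < hi) {                 /* first index with ids[i] > last_id */
        size_t mid = lo + (hi - lo) / 2;
        if (ids[mid] <= last_id)
            lo = mid + 1;
        else
            hi = mid;
    }
    while (lo < n && got < limit)
        out[got++] = ids[lo++];
    return got;
}
```

The cost of the offset version grows with the page number, while the continuation version stays near constant per page but needs the last id of the previous page, which is the trade-off the chat goes on to discuss.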
<clever> ah, but thats fine, as long as you know the id from the previous page
<clever> yeah
<clever> but that forces the client to know what the previous page ended on
<clever> what if i want to skip to the 10th page?
<doug16k> right
<doug16k> so you can't make it parallel, but it can go on forever without timeout problem
<doug16k> the LIMIT 1024,16 one can spray a ton of different ones in parallel
<doug16k> in your fantasies, the server handles it really well
<clever> now it makes sense, why github's API is the way it is
<bslsk05> ​github.com: lk-overlay/v3d.c at master · librerpi/lk-overlay · GitHub
<clever> this is the shader for the spinning rgb triangle
<clever> its a VLIW, where the left column is add only operations, and the middle column is mult only operations
<clever> 4 inputs, 2 outputs, computed independently, in parallel
<clever> first, we pop one varying, and we set the 4th(d) byte of r3 to 1.0 (auto-converts to 255)
<clever> then we add r5 to the popped varying (interpolation is half done, this finishes it), and pop the 2nd varying
<clever> then we add r5 to the popped varying (interpolation is half done, this finishes it), and pop the 3rd varying
LostCarcosa has joined #osdev
<clever> then we add r5 to the popped varying (interpolation is half done, this finishes it), copy the 1st varying to the 0th byte (a)
<clever> then copy the 2nd and 3rd varying to bytes 1&2(b&c)
<clever> and finally, move that to the output color register, and signal the end of the thread
<clever> dead-simple, and makes use of how you can pipeline an operation by mixing the 2 ALU's
<clever> i think every QPU opcode takes 4 clock cycles
<clever> and that program is 9 opcodes long, so 36 clocks
<clever> and at 250mhz, a single QPU can do 6.9 million pixels/second
<clever> thats entirely ignoring the fact that the QPU is a 16 lane vector core
<clever> assuming every pixel is touched once, it could render a 640x480 frame at ~22 fps, ignoring the fact that its vectorized, and there are multiple QPU
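The back-of-the-envelope estimate above can be written out as a tiny calculation. The inputs are the figures from the chat (250 MHz clock, 9 opcodes, 4 clocks per opcode, one scalar lane, every pixel touched once). The function names are hypothetical, and the model ignores vector lanes, multiple QPUs and memory stalls.

```c
#include <assert.h>

/* Pixels shaded per second by one lane running a fixed-length shader. */
double pixels_per_second(double clock_hz, unsigned opcodes,
                         unsigned clocks_per_opcode)
{
    assert(opcodes > 0 && clocks_per_opcode > 0);
    return clock_hz / ((double)opcodes * (double)clocks_per_opcode);
}

/* Frames per second if every pixel of the frame is shaded exactly once. */
double frames_per_second(double pixel_rate, unsigned width, unsigned height)
{
    assert(width > 0 && height > 0);
    return pixel_rate / ((double)width * (double)height);
}
```

pixels_per_second(250e6, 9, 4) gives about 6.94 million pixels per second, and frames_per_second of that rate at 640x480 comes out at roughly 22.6. Multiplying by 16 lanes and by the number of QPUs gives the optimistic upper bound.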
<doug16k> wow look how dumb gcc is https://godbolt.org/z/jn1E5Mrq8
<bslsk05> ​godbolt.org: Compiler Explorer
<doug16k> what the hell is it doing?
<clever> i think its doing stores, of 1 byte each
<clever> but that is a rather verbose way of doing it
<doug16k> is that not just one store then add 4 to ptr and write back?
<clever> i'm not familiar with x86 abi, so thats a bit of a mess
<clever> i think maybe, its not caching the list pointer?
<doug16k> oh it's awful. I can write that in a few insns
<clever> and its re-reading it?
<doug16k> ah you need restrict
<doug16k> yep that's it
<clever> insert it where?
<bslsk05> ​godbolt.org: Compiler Explorer
<clever> wow
<LostCarcosa> clang code does not improve with the restrict, interesting
<LostCarcosa> I mean it improves a bit, but not that much
<doug16k> it isn't sure that *list is itself
<doug16k> without restrict
<doug16k> osm
<doug16k> oops, isn't itself
<doug16k> what if *list = &list ?
<doug16k> right?
<doug16k> it's scared to death with char *
<clever> lol
<clever> this is just a primitive for appending 8/16/32bit values onto a byte array
<clever> i could instead memcpy struct's into the array
<clever> mesa has a fancy thing for doing just that
<doug16k> wait, it's not that
<doug16k> entirely. if it did the trick with big load it isn't sure that it will be affecting its input with those writes
<doug16k> isn't sure it will not be affected by its stores I mean
<doug16k> it's the same thing with vectorization
<doug16k> it looks like it could vectorize, but you didn't swear that the input and output don't overlap, so it is afraid
<clever> what exactly does restrict do?
<doug16k> restrict means, I guarantee that this isn't aliasing a variable
<clever> ah
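A rough sketch of the aliasing problem being discussed. The source behind the godbolt links is not in the log, so `add_word` and its variants are hypothetical stand-ins for the append primitive, and the placement of `restrict` shown here is only one plausible choice.

```c
#include <stdint.h>

/* Hypothetical append helper: *list is a cursor into a byte array. */
void add_word(uint8_t **list, uint32_t word)
{
    /* uint8_t may alias any object, including the cursor *list itself,
     * so the compiler has to reload and rewrite *list around each store. */
    *(*list)++ = (uint8_t)(word);
    *(*list)++ = (uint8_t)(word >> 8);
    *(*list)++ = (uint8_t)(word >> 16);
    *(*list)++ = (uint8_t)(word >> 24);
}

/* Same behavior for non-overlapping inputs. restrict promises that the
 * cursor object is only reached through list, so the byte stores cannot
 * change it and the cursor can stay in a register. */
void add_word_restrict(uint8_t **restrict list, uint32_t word)
{
    *(*list)++ = (uint8_t)(word);
    *(*list)++ = (uint8_t)(word >> 8);
    *(*list)++ = (uint8_t)(word >> 16);
    *(*list)++ = (uint8_t)(word >> 24);
}

/* Alternative without restrict: copy the cursor to a local first. */
void add_word_local(uint8_t **list, uint32_t word)
{
    uint8_t *p = *list;
    p[0] = (uint8_t)(word);
    p[1] = (uint8_t)(word >> 8);
    p[2] = (uint8_t)(word >> 16);
    p[3] = (uint8_t)(word >> 24);
    *list = p + 4;
}
```

Whether a given compiler actually merges the four byte stores into one word store still depends on the target and optimization level.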
<clever> platform/bcm28xx/v3d/v3d.c:170:3: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
<clever> uint32_t d = *((uint32_t *)&f);
<clever> ive also got this new error, it cant cast a float to an uint32
<doug16k> use memcpy
<doug16k> that is dumb and archaic
<doug16k> builtin memcpy will see what you mean
<doug16k> memcpy(&d, &f, sizeof(d))
xenos1984 has quit [Read error: Connection reset by peer]
<doug16k> compiler is guaranteed to understand
<clever> i think i can memcpy that addword too
<doug16k> no UB
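The memcpy suggestion, written out as a small sketch. `float_bits` is a made-up name, and the code assumes `float` is a 32-bit type (the static assert checks that).

```c
#include <stdint.h>
#include <string.h>

/* Return the object representation of a float without violating
 * strict aliasing. Compilers with builtin memcpy typically turn this
 * into a plain register move. */
uint32_t float_bits(float f)
{
    uint32_t d;
    _Static_assert(sizeof d == sizeof f, "float must be 32 bits wide");
    memcpy(&d, &f, sizeof d);
    return d;
}
```

On an IEEE 754 platform, `float_bits(1.0f)` yields `0x3f800000`.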
<moon-child> 'what if *list = &list' then you suck
<moon-child> :P
<moon-child> I mean, strict aliasing also sucks
<moon-child> but
<clever> as long as this code is on an LE system
<clever> but also, doing so, defeats the whole reason i was just about to compile this :P
<doug16k> I love strict aliasing because when I go look at the assembly, I don't think the compiler is screwed up
<doug16k> it does something like what I would do
<clever> ok, just one deleted func to fix...
<doug16k> not rereading things all paranoid delusional
<clever> now it compiles
<doug16k> strict aliasing gives it a chance of not thinking a store through a pointer invalidated every register variable
<clever> doug16k: https://gist.github.com/cleverca22/eedba1589a616d995d8582c868ec7552 is what the VPU compiler gave, without restrict
<bslsk05> ​gist.github.com: example.S · GitHub
<clever> load list into r2, add 1 byte, store back to list, store a byte from int to the pre-incremented value
<clever> load list again, increment again, shift the arg by increasing amounts, store a byte
<doug16k> is *list aligned?
<clever> the 32bit word can land at un-aligned addresses
<clever> because its in a control list, with a mix of 8, 16 and 32bit things
<doug16k> can you nop into alignment?
<clever> unlikely
<doug16k> it would be so good to not do all that for one 32 bit thing
<clever> opcodes 16/17 for example, are 8bit + 32bit
<clever> so you would need to nop it into mis-alignment, so the 8bit opcode puts it back in
<clever> but then opcode 32, is 8bit + 8bit + 32bit + 32bit + 32bit
<clever> so you need a different number of nop's to pad each opcode
<clever> at least i dont see any perverse 32bit + 8bit + 32bit cases, lol
<doug16k> can't you test the low 2 bits and conditional branch to 32 bit store? it would hit it 25% of the time
<clever> i could
<doug16k> depends on how many cycles a mispredict is
<clever> which reminds me of what ive seen in the official firmware's memcpy
<clever> it will test to see if the lower 2bits of the src&dest match
<clever> if they are similarly mis-aligned, you can do byte-wise for the start to get into alignment, then 32bit for the bulk
<clever> next, i believe the v3d expects everything in LE, and the VPU is LE always
<doug16k> if low two bits are 0, branch to whole 32 bit, else if low bit is 0, branch to 16 bit halves version, else sucky version
<clever> and arm is LE by default
<clever> so yeah, i can just memcpy the int32 over, just like the float
<doug16k> then it's 25% awesome, 50% pretty good, 25% sucks
<doug16k> I think
<doug16k> no wait, 00 = 32 bit, 01 = 8 bit, 10 = 16 bit, 11 - 8 bit
<doug16k> the low 2 bits of address
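The dispatch on the low 2 address bits could look roughly like this in C. `store32` is a hypothetical name; the value is written in host byte order, and the fixed-size memcpy calls stand in for the word and halfword stores a compiler or hand-written asm would use.

```c
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value at a possibly unaligned address, choosing the
 * widest store the address allows. */
void store32(uint8_t *dst, uint32_t v)
{
    const uint8_t *src = (const uint8_t *)&v;   /* host byte order */

    switch ((uintptr_t)dst & 3u) {
    case 0:                     /* 00: one 32-bit store */
        memcpy(dst, src, 4);
        break;
    case 2:                     /* 10: two 16-bit stores */
        memcpy(dst, src, 2);
        memcpy(dst + 2, src + 2, 2);
        break;
    default:                    /* 01 or 11: four byte stores */
        for (int i = 0; i < 4; i++)
            dst[i] = src[i];
        break;
    }
}
```

Whether the branch pays off depends on the mispredict cost, as noted in the chat.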
DonRichie has joined #osdev
<clever> refresh the gist i linked earlier
<clever> memcpy makes the C look simpler, but the asm is far more complex
<doug16k> you must have -fno-builtin
<doug16k> add -fbuiltin-memcpy
<doug16k> or you have -ffreestanding
<doug16k> which implies fno-builtin
<doug16k> or wait, if this cpu can't do misaligned, it has to do memcpy
<doug16k> but still
<doug16k> you should -fbuiltin-memcpy
xenos1984 has joined #osdev
<clever> cc1: error: unrecognized command line option ‘-fbuiltin-memcpy’
<doug16k> really? wow, what version?
<clever> vc4-elf-gcc (GCC) 6.2.1 20161217
<doug16k> maybe not implemented
<doug16k> surprising, builtin-memcpy helps C code massively
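For context, GCC's documentation states that `-fno-builtin-function` has no positive `-fbuiltin-function` counterpart; under `-fno-builtin` or `-ffreestanding` the documented route is to call the `__builtin_` form directly, usually through a macro. A minimal sketch, with `copy_bytes` and `load32` as made-up names:

```c
#include <stddef.h>
#include <stdint.h>

/* Always refers to the compiler builtin, even when plain memcpy is not
 * treated as one. A kernel might define memcpy itself this way. */
#define copy_bytes(dst, src, n) __builtin_memcpy((dst), (src), (n))

/* Unaligned-safe 32-bit load; with a constant size the builtin is
 * normally expanded inline instead of calling a library routine. */
uint32_t load32(const void *p)
{
    uint32_t v;
    copy_bytes(&v, p, sizeof v);
    return v;
}
```

The builtin may still fall back to a real `memcpy` call for large or variable sizes, so the library routine has to exist either way.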
<clever> i also see room for improvement
<clever> lib/libc/string/arch/arm/arm/memcpy.S:FUNCTION(memcpy)
<clever> LK comes with asm copies of things like memcpy
<clever> i dont think i ever wrote one
<clever> lib/libc/string/memcpy.c:void *memcpy(void *dest, const void *src, size_t count) {
<doug16k> yeah it should for arch that don't like misaligned
<clever> so it falls back to whatever gcc did with this fallback
<doug16k> I have a pretty good memcpy
<doug16k> it does the textbook thing, get destination misaligned, copy biggest chunks possible
<doug16k> er, get destination aligned
<clever> that sounds similar to what ive seen in the official firmware
<bslsk05> ​github.com: lk/memcpy.S at master · littlekernel/lk · GitHub
<bslsk05> ​github.com: lk/memcpy.c at master · littlekernel/lk · GitHub
<doug16k> it works its way up to biggest chunks, does main loop in biggest chunks, then work your way down to smaller until byte
<clever> aha
<clever> the c code is doing exactly what i said earlier
<clever> byte-wise copy until both are 32bit aligned, then word-wise copy
<clever> its just a matter of gcc not compiling that in an optimal manner
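A minimal sketch of the algorithm as described in the chat: byte copies until both pointers are 32-bit aligned, then word copies, then a byte tail. This is not LK's code; `copy_aligned` is a made-up name, and the fixed-size memcpy calls stand in for the word load/store a real libc would do through suitably typed pointers or asm.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *copy_aligned(void *dest, const void *src, size_t count)
{
    uint8_t *d = dest;
    const uint8_t *s = src;

    /* Word copies only work out when both pointers are misaligned
     * by the same amount. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & 3u) == 0) {
        /* Head: bytes until both pointers are 32-bit aligned. */
        while (count > 0 && ((uintptr_t)d & 3u) != 0) {
            *d++ = *s++;
            count--;
        }
        /* Bulk: one 32-bit word at a time. */
        while (count >= 4) {
            uint32_t w;
            memcpy(&w, s, 4);
            memcpy(d, &w, 4);
            d += 4;
            s += 4;
            count -= 4;
        }
    }
    /* Tail, or the whole copy when the alignments differ. */
    while (count > 0) {
        *d++ = *s++;
        count--;
    }
    return dest;
}
```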
<clever> and i kinda dont want to look at the official firmware, that feels too much like copying then :P
<clever> i forget exactly how they did it
<geist> ah yes that memcpy
<geist> i was quite proud of it
<clever> it looks to be the exact same algo as what ive seen in decompiles of the official firmware
<geist> i also wrote the one in darwin for arm32. i think it's still there
<clever> so i feel less bad about copying the algo in general
<clever> but the actual asm, i dont really want to copy
<geist> https://github.com/littlekernel/lk/blob/master/lib/libc/string/arch/arm/arm/memcpy.S#L124 is the real money shot. a fun trick you can do with arm32
<bslsk05> ​github.com: lk/memcpy.S at master · littlekernel/lk · GitHub
<clever> *looks*
<clever> why are you touching CPSR?
<clever> that feels very weird for memcpy to do
<clever> unless, are you abusing it, and conditional execution?
<clever> so you can skip some of the stores?
<bslsk05> ​github.com: dgos/isr.S at master · doug65536/dgos · GitHub
<geist> clever: that's the trick!
<clever> neat
<doug16k> too easy on x86 though
<clever> geist: the VPU could potentially pull off the same trick
sympt has quit [Ping timeout: 240 seconds]
<clever> since i can manipulate sr like that, and it has conditional execution
<geist> yeah., note this is all about trying to get things aligned properly
<geist> so that you can then do fast wordwise copies (via ldm/stm)
<clever> yeah, when you need to copy 1-3 bytes, to get both into alignment
<geist> since on arm32 that sort of thing mattered
<geist> this code actually copies up to 15 bytes, to align it on a 16 byte boundary
<geist> using that trick
<clever> i dont think vpu will benefit from anything more than 32bit alignment
<bslsk05> ​github.com: xnu/bcopy.s at master · darwin-on-arm/xnu · GitHub
<doug16k> I'd get it cache line aligned if it is going to do big blocks
<geist> still there, hah
<clever> heh
<clever> i still need to get around to trying to build xnu
<clever> and a userland
<clever> but now that you mention 16 byte copies....
<clever> i can do a 4096 byte copy, in just 2 opcodes....
<clever> at the cost of trashing the vector regs
<clever> and i dont think the vector stuff has any real alignment requirements
__xor has joined #osdev
<geist> in the arm case it only really needed 4 bytes, but the 16 bytes you get somewhat for free, and the inner copy (.L_bigcopy) moves 32 bytes at a time
<geist> using ldm/stm
__xor has quit [Client Quit]
<clever> yeah, your mention of ldm/stm is what reminded me about using vector copies
<geist> arm64 defacto memcpy algorithm uses a whole different strategy, and doesn't concern itself much with alignment
<geist> since arm64 is intrinsically able to do unaligned access, and is generally not penalized any more than say x86 is. ie, it's okay to do unaligned and in most cases is probably just as fast
<clever> just the issue about the load/store not being atomic?
<geist> that is definitely the case, and where load/stores straddle a cache line it may take an extra cycle or so, etc
<clever> i think youve mentioned before, that say a 32bit write, that is 32bit aligned, will be seen by other cores, as either having happened or not happened
<clever> yeah
<geist> that's right
<clever> and the exact numbers vary by core
<geist> both x86 and arm have a wordwise atomicity guarantee that's actually written down. but most of that only applies to native units that are aligned
<clever> for example, if i write 64 bits with an stm, on a 32bit core (so its storing 2 regs), what can a 2nd core observe?
<geist> no the architecture is more strict than that, it's not per core
<clever> so, could that 64bit write get shorn in half?
<doug16k> the only torn access I have heard of on x86 is some MMX
<geist> you are only guaranteed up to probably the native register size. stm/ldms are generally considered to be functionally equivalent to a series of load/stores in order
<geist> doubleplus so since the cpu can literally be interrupted in the middle of it
<clever> yeah, so that stm could get torn, and effectively is just 2 separately atomic 32bit writes
<geist> well, be careful throwing around words like 'atomic' here, since that has different meaning
<clever> in terms of what another core can see if it reads that addr during the write
<geist> also weak memory model, etc. but what you are guaranteed not to see in particular cases are torn writes
<geist> based on alignment, word size, etc
<clever> but what if i was to do that, on an aarch64 core, in aarch32 mode?
<geist> but due to weak memory model it doesn't guarantee that you'll see it in order, or at all
<geist> but you wont see half of it
<clever> would it still be treated as 2 32bit writes? because its from say r2 and r3
<geist> what case is this specifically?
mzxtuelkl has joined #osdev
<clever> an aarch64 core, but in aarch32 mode, doing an stm, to save r2+r3 to ram
knusbaum has quit [Ping timeout: 244 seconds]
MiningMarsh has quit [Quit: ZNC 1.8.2 - https://znc.in]
<geist> you are guaranteed according to the arm32 memory access rules
knusbaum has joined #osdev
<geist> will it do more? possibly. but that's not specced
<clever> makes sense
<clever> it only has to meet aarch32 rules, because its claiming to be aarch32
<geist> the obvious case is say ldm/stm that starts at offset 0xc
<geist> and it crosses into some new cache line
<geist> or more specifically, offset 0xffc
<geist> so you store a word at 0xffc and another at 0x1000
<clever> i'm assuming its also 32bit aligned
<geist> in arm32 rules that means two writes, and it would still be the case in arm64
<geist> but in arm64 if you stored a single 64bit word at 0xffc it wouldn't be guaranteed, because it was unaligned 64bit
<clever> yeah
<geist> note x86 has pretty much the same rules. it's just strongly ordered
<geist> but you can get torn writes based on misalignment
<clever> and by ordered, you mean that an arm core could see either write as having happened first?
<geist> (and exceptions with various SSE things that have weaker rules)
<geist> for arm32 two writes? absolutely
<clever> while x86 always observes the writes happening in program order
<geist> right
<clever> any kind of rules, on how arm can reorder things?
<geist> so on x86 two 32bit writes back to back could show up as -- A- AB
<geist> yes. it can do what the fuck it wants, and barriers and things that generate barriers force it
<clever> but on arm, they could show up as -- A- AB -B ?
<geist> (though the rules are hella complicated)
MiningMarsh has joined #osdev
<geist> yes
<clever> where might i find the rules on how it re-orders things?
<geist> with arm the general base rule is 'outside of a barrier it can emit writes in whatever order the cpu wants to, but it must be internally consistent with itself'
<geist> but then there are a bunch of sub rules about how to get it to not do that in specific cases, but it's programmers problem
<clever> i would assume that a pair of back2back 32bit writes, would prefer to occur in order of address?
<geist> it doesn't. it doesn't say how it reorders things, it basically says 'the cpu is free to reorder things how it wants to (with some exceptions to that)'
<clever> so they can turn into an axi burst, for example
<doug16k> if it has a cache line already it can just stick the store in the line, even if there is a previous store that missed and must wait
<geist> exactly. it allows it to be much more lazy about how cache lines are filled, written back, etc. it's an implementation detail
<clever> doug16k: and that can make the missed store become visible after the others
<geist> removes a lot of complexity on the back end of the cpu
<doug16k> yep
<geist> *however* the rules also state that the cpu is ordered relative to itself
<clever> but if i'm doing a pair of 32bit writes, to the same cache line, they are either both going to go thru, or both miss
<geist> so it cannot move reads before writes, etc
<clever> assuming another core doesnt steal the cache line mid-write
<geist> well, reads of the same line, before writes. etc. it has to act from a single cores point of view that it's in order
<geist> even if the cpu is highly OOO
<clever> if your doing reads, against a pending write, does it peek into the write queue, and act like the write had already completed?
<geist> clever: also dont assume the cpu is running the instructions in order. it could have completely rearranged the sequence it ran the two stores in
<clever> so it can see its own writes in program order
<geist> based on what registers the stores depend on, etc
<clever> yeah, that starts to complicate it even further
<geist> if you had two stores for address A and B, and the A store writes a reg that depends on some complex logic it may 'get to' the B store immediately, issue it, and then the A store happens later in the pipeline
<geist> since it has detected that the two stores dont depend on each other
<geist> but this is where the memory barriers in ARM come in: DMB and DSB instructions. you're inserting barriers that have various semantics about drawing lines in the sand, saying this must happen before that, etc
<geist> thats a DMB in general. a DSB is more aggressively dumping all of the load/stores, in general
<clever> lets take xhci as an example, each message in the command rings has an "is valid" flag at some point
<clever> for that, you would write the message minus the valid, then flush the cache and issue a barrier?
<clever> then write the valid, flush the cache, and go on?
<geist> yes. also flushing the cache implicitly has barriers in it as part of the algorithm
<doug16k> barrier before writing the last 32 bits, then write last 32 bits
<clever> doug16k: but if its cached memory, you need to flush, and then there is the question of what order the flushing writes to ram....
<doug16k> yeah
<bslsk05> ​github.com: lk/cache-ops.S at master · littlekernel/lk · GitHub
<clever> but, if this command ring is write-only, you want write-combined instead?
<geist> it forces everything out so that after that point things are cool
<clever> like a framebuffer
<doug16k> point is, you are guaranteeing that the previous 3 stores actually are globally visible before you even set the valid bit
<doug16k> the cache flush kinda wrecks the example :D
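The "write the payload, barrier, then write the valid bit" pattern can be sketched with C11 atomics. The `ring_entry` layout and `publish_entry` are hypothetical, not the real xHCI structures, and the release fence only orders the stores for cache-coherent observers such as other CPUs; a non-coherent device would additionally need cache maintenance and the stronger barriers (DSB) mentioned in the chat.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 16-byte ring entry: three payload words plus a control
 * word whose bit 0 marks the entry as valid. */
struct ring_entry {
    uint32_t payload[3];
    _Atomic uint32_t control;
};

void publish_entry(struct ring_entry *e, const uint32_t payload[3],
                   uint32_t control)
{
    /* 1. Fill in everything except the valid bit. */
    e->payload[0] = payload[0];
    e->payload[1] = payload[1];
    e->payload[2] = payload[2];

    /* 2. Barrier: the payload stores must be visible before the
     *    control store (on ARM this becomes a DMB). */
    atomic_thread_fence(memory_order_release);

    /* 3. Last 32 bits, with the valid bit set. */
    atomic_store_explicit(&e->control, control | 1u, memory_order_relaxed);
}
```

A consumer on another core would pair this with an acquire load of `control` before reading the payload.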
<geist> yah ignoring the cache part, if you want to make sure what you wrote on cpu A is observable by other entities on the bus that participate in cache coherency, you issue a DSB to flush it out
<geist> that makes sure all stores that are pending actually make it out of the cpus write buffer, which you can kinda think of as a L0 cache
<clever> oh right, i'm assuming the xhci isnt coherent with the arm caches
<geist> note this is all when using 'normal memory' which is fully cached, etc. when you're reading/writing to pages you have mapped as 'device' or 'strongly ordered' theres a pile of additional rules that happen, and they're much more strict
<geist> but a lot of complexity WRT the ordering of outstanding 'normal' memory transactions and new uncached (device/strongly ordered) memory
<geist> that's where a ton of the subtle rules come in
<clever> lets assume that the xhci isnt coherent with the arm caches at all, and i choose to use write-combined, because this block of memory is write-only
<geist> if everything is just plain cached memory and you're not worrying about entities that dont participate in cache coherency (ie you're only thinking about other cpus and mmu TLB fetchers) then you're playing with normal memory barriers and weak memory model
<clever> write-combined, means the arm only stores 1 cacheline worth of data? and tracks what bytes are dirty, so it can flush it out safely, without knowing the old contents?
<geist> clever: then that's a completely different kettle of fish that i honestly dont remember the rules for
<clever> ah
<geist> no. write combined is a form of uncached
<geist> ie not 'normal memory'
<doug16k> clever, what you said is x86 WC memory
<doug16k> with the byte enables and no write allocate
<clever> doug16k: ahh, i was mostly guessing how it worked
<geist> it's one of the nGnRnE variants i think
<geist> basically, uh, i think nGRE? i forget
<clever> geist: i did recently read a doc you linked, that explained that tangle of letters, let me dig it up again...
<geist> though actually i think i'm wrong. there's a variant of 'normal memory' that i think covers what you want. lemme find it in fuchsia code
<bslsk05> ​fuchsia.googlesource.com: zircon/kernel/arch/arm64/include/arch/arm64/mmu.h - fuchsia - Git at Google
<geist> it's a variant of 'normal memory' but treated as uncached + write combined
<clever> cross-referencing to my armv8 docs....
<geist> (for those that know x86, the MAIR register is basically PAT on x86. each page table entry has a 3 bit index into the MAIR which has a list of 8 different types of memory)
<geist> anyway like i said there's a bunch of complex rules with regards to how outstanding memory transactions are sorted with different cache properties
<geist> and those i always have to look up
<geist> and the general safe rule is to assume they're not sorted and insert barriers as appropriate
<clever> ok, i see an MAIR0 register...
<clever> AttrIndex[2] says if its reading MAIR0 or MAIR1
<clever> ah, but this is just a 32bit compat thing
<geist> MAIR has 8 fields of 8 bits, each describes a particular memory type you can define
<clever> on the 64bit side, its just MAIR_EL1
<clever> yeah, i see that in the 64bit reg
<geist> yah that's just because there are 3 bits in the page table entry that point to one of the 8 fields in the MAIR_EL1
<clever> and those 8 fields cant fit into a 32bit reg
<clever> so aarch32 cut it into 2 regs
the_lanetly_052 has joined #osdev
<geist> right. and though those 8 bits per field let you describe a ton of combinations of cache/uncached/device/etc bits, in practice only about 4 combinations are useful
<geist> and it's the 4 fuchsia has in the link
<geist> i think those are basically identical to linux's and freebsds
curi0 has joined #osdev
<clever> this reminds me of a remap thing (might be the same thing) that i saw in the paging tables
<clever> where originally, those 3 bits were the mode itself
<geist> you might want to define a read-allocated + write through variant, which you can
<clever> but now it has enough modes, that it needs 8bits to describe the mode
<geist> right that happened somewhere in the armv6/armv7 days
<geist> i think it was something like 'TEX remap' or whatnot
<clever> so its instead using the 3bit as an index, to one of 8 modes
<clever> kind of like a palette in an image
<geist> right
<geist> ie, the MAIR
<geist> 'memory attribute something register'
<clever> where you might only have 2 bits per pixel, but you can then assign 4 unique 32bit colors
<clever> MAIR_EL1, Memory Attribute Indirection Register (EL1)
<clever> so in the fuchsia code you linked, your assigning slot 3 to 0x44, and then creating a constant that says to just shove 3 into the paging tables
<bslsk05> ​github.com: linux/proc.S at eaa54b1458ca84092e513d554dd6d234245e6bef · torvalds/linux · GitHub
<geist> right
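The palette idea can be written down in a few lines. Only "slot 3 holds 0x44" comes from the chat; the helper names are made up, and the bit position used for the index (AttrIndx in bits [4:2] of a stage 1 descriptor) is the AArch64 long-descriptor layout.

```c
#include <stdint.h>

/* Place an 8-bit attribute encoding into slot idx (0..7) of a
 * MAIR_EL1 value. */
uint64_t mair_attr(uint8_t attr, unsigned idx)
{
    return (uint64_t)attr << ((idx & 7u) * 8);
}

/* Read back the attribute stored in slot idx. */
uint8_t mair_lookup(uint64_t mair, unsigned idx)
{
    return (uint8_t)(mair >> ((idx & 7u) * 8));
}

/* The page table entry only carries the 3-bit slot number. */
uint64_t pte_attr_index(unsigned idx)
{
    return (uint64_t)(idx & 7u) << 2;
}
```

So a page that wants the 0x44 type (normal memory, inner and outer non-cacheable) just ORs `pte_attr_index(3)` into its descriptor, much like a pixel storing a palette index instead of a full color.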
<curi0> whats the typical process for moving a BAR of a PCI device like ?
<curi0> i'm trying to understand how linux does it
<clever> ah, and your clearly defining all 8 codes, on separate defines, while linux mashed them all into 1
<geist> curi0: in general you just write to the BAR, but the hard part is knowing what to put there, and allocate it
<geist> the hardware itself you simply write to it and it takes effect immediately
<doug16k> curi0, it describes it in the PCI spec. there is a procedure to autodetect how big it needs to be and that tells you its alignment too
<geist> correct. annoyingly since they were trying to be compact they didn't just define another config field that says 'its this big' which would have been really damn amazing
<geist> since the only way to determine its size is to temporarily write all 0xfffs to it and read back which bits are unimplemented, really annoying
<clever> ive done the same thing to figure out how some VPU stuff needs to be aligned
<doug16k> yeah. I guess they figured it was clever to just let the unimplemented bits tell you
<clever> handy trick, when you can read the value back, and they dont implement bits that you shouldnt be setting
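The sizing trick, sketched for a 32-bit memory BAR. The accessor struct and the toy `fake_bar` device are hypothetical, there only so the probe can be tried without hardware; 64-bit and I/O BARs need extra handling that is left out here.

```c
#include <stdint.h>

/* Hypothetical config-space accessors for one BAR register. */
struct bar_ops {
    uint32_t (*read)(void *ctx);
    void (*write)(void *ctx, uint32_t value);
    void *ctx;
};

/* Probe the size of a 32-bit memory BAR: write all ones, read back
 * which address bits stuck, then restore the original value. */
uint32_t bar32_size(const struct bar_ops *bar)
{
    uint32_t saved = bar->read(bar->ctx);
    bar->write(bar->ctx, 0xFFFFFFFFu);
    uint32_t probe = bar->read(bar->ctx);
    bar->write(bar->ctx, saved);        /* the BAR is bogus until here */

    probe &= ~0xFu;                     /* drop the type/prefetch flags */
    return probe ? ~probe + 1u : 0;     /* lowest writable bit = size */
}

/* Toy device: address bits below the size are not implemented, and
 * the low 4 flag bits are read-only. */
struct fake_bar {
    uint32_t value;
    uint32_t size;                      /* power of two, at least 16 */
};

uint32_t fake_read(void *ctx)
{
    return ((struct fake_bar *)ctx)->value;
}

void fake_write(void *ctx, uint32_t v)
{
    struct fake_bar *b = ctx;
    b->value = (v & ~(b->size - 1u)) | (b->value & 0xFu);
}
```

The size also gives the required alignment, and the window between the two writes is exactly where the race described next comes from.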
<curi0> what about moving it away from what the BIOS assigns ? i can see this in my kernel log for BAR 0 "releasing [mem 0x600000000-0x60fffffff 64bit pref]" and "assigned [mem 0x400000000-0x5ffffffff 64bit pref]"
<geist> we actually have an outstanding bug in fuchsia where the user space pci driver is probing the PCI bus and for a fraction of an instant must overwrite every BAR
<curi0> its also resizing it but thats not relevant for me rn
<geist> which if the kernel happens to use that device implicitly in that window. boom
<clever> geist: mutex time?
<geist> hard, because user space
<clever> yeah
<geist> have to basically freeze all cpus
<clever> thats the kind of thing that pci in kernelspace gets for free
<geist> clever: and when pci in kernelspace scans things when its still single cpu
<clever> that makes it even simpler!
<doug16k> why not make a shadow copy of all the config spaces in ram and use that?
<geist> you could bump into it if you say were running more than one cpu and one cpu is scanning the pci bus and the other one is writing to the framebuffer
<geist> or a pci serial card (the case we hit in fuchsia)
<doug16k> with the 1111 masks in the BARS I mean
<curi0> my default my BIOS assigns a 256MB sized bar at 0x600000000. however there is not enough room there for it to resize to 8GB. so the amdgpu driver tells kernel to find 8GB free and it does at 0x300000000
<clever> if you were scanning a device after smp is up, you would need some kind of mutex over the whole pci card
<geist> doug16k: how would that help? you have to write to it to see it, and in that instant the BAR is unconfigured
<clever> and the serial/framebuffer code would have to grab it every time it touches the device, yuck
<curi0> for now i just want to figure out how to move it and resizing i will figure out later
<curi0> *by default my BIOS
<clever> curi0: i assume you would just look for a hole after ram, and shove it there?
<geist> curi0: well, like i said you can simply do it but you need to allocate the space. the allocation is the hard part
<doug16k> geist, oh I figured you could intercept their config accesses and know what to present them
<geist> especially if the device is on the other side of a bridge, because you also have to adjust the bridge to cover the new zone
<clever> i would assume the BAR can go anywhere in the 64 (or 48?) bit addr space
<curi0> clever, yup
<curi0> yes it adjusts the bridge too
<geist> clever: there are some limitations there. specifically bridges only can bridge 64bit 'prefetchable' memory (there's a separate config field for that)
<curi0> do i need to copy memory from the old bar location to the new one or not ?
<doug16k> curi0, that isn't what BARS mean
<geist> so anything that is 64bit must be either on the root bus (and thus behind no bridge) or intrinsically prefetchable, because if it's on the other side of a bridge it has to also be prefetchable
<doug16k> BARS don't say where to get it from RAM
<clever> geist: ah, here is the thing you linked earlier, that explains the letter soup in the linux source: https://developer.arm.com/documentation/100941/0101/Memory-types?lang=en
<bslsk05> ​developer.arm.com: Documentation – Arm Developer
<doug16k> BARS say where to the device address range into the address space
<doug16k> where to insert*
<curi0> thanks for explanation
<geist> yah in this case think of the video card as having 256MB (or 8GB) of internal memory that it's dumping onto the cpu's memory bus
<doug16k> it configures what range of addresses to make the device go "oh oh that's me!" on the PCI bus
<geist> and the BAR says how big the window is and where to put it
<clever> curi0: my understanding, is that the BAR maps a chunk of memory on the pci device, to some physical address, so changing the BAR, just moves the ram to a new addr, and whatever data was in that ram, is moved along with it
jack_rabbit has joined #osdev
<geist> normally things like say an ethernet card may have fairly small bars, like 4K or 16K because all they're doing is presenting some memory mapped registers
<Andrew> Turns out that cross compilers are inevitable
<clever> i did also recently see mention, about how a GPU may have 8gig of ram, but the BAR is only 256mb in size, and can never grow bigger
knusbaum has quit [Ping timeout: 244 seconds]
<clever> and you use MMIO to change where in gpu ram, that 256mb window points
<doug16k> clever, you use DMA for everything
<geist> right, but theres some new resizable bar feature that some newer cards like that i think expands that notion
<clever> however, if bus mastering is enabled, the gpu can read any host ram
<clever> doug16k: yep, exactly
<curi0> mine does become 8GB
<curi0> resizable BAR capability
<geist> so presumably there is some newer feature that actually likes to have large windows into gpu ram
<curi0> but im not interested in doing that for now
<clever> so instead of moving the 256mb BAR to point to different gpu ram
<clever> you tell the gpu to dma stuff from host ram to gpu ram
<clever> and it fills itself
<doug16k> the CPU can just memcpy into the GPU instead of scheduling DMA
<Andrew> gcc -m32 -nostdlib -nostdinc -fno-builtin -fno-stack-protector -no-pie -fno-pic -c kernel.c -o kernel.o && ld -m elf_i386 -Tlink.ld -o kernel.elf kernel.o # ld gives some weird 'i386 input of kernel.o incompatible with i386:x86-64 output'
<doug16k> if they lock a buffer, you just access it where it really is, if you want
<Andrew> I finally understood why people say that not using cross compilers -> huge pain
ddevault has quit [Write error: Connection reset by peer]
jleightcap has quit [Read error: Connection reset by peer]
patwid has quit [Read error: Connection reset by peer]
gjnoonan has quit [Read error: Connection reset by peer]
alethkit has quit [Read error: Connection reset by peer]
exec64 has quit [Read error: Connection reset by peer]
tom5760 has quit [Read error: Connection reset by peer]
sm2n has quit [Read error: Connection reset by peer]
milesrout has quit [Read error: Connection reset by peer]
<geist> Andrew: yeah it's much simpler to just have a cross compiler that directly does what you want. doesn't mean you can't force your native one to generate code how you want, but it's an extra thing you have to futz with
ddevault has joined #osdev
gjnoonan has joined #osdev
<doug16k> if they did the lock where it is write only and discard so it can just do the stores right into GPU memory with write combined stores of all line bursts
<geist> and it's harder to get people to help you because your setup is intrinsically a special snowflake in many cases
tom5760 has joined #osdev
milesrout has joined #osdev
patwid has joined #osdev
jleightcap has joined #osdev
exec64 has joined #osdev
alethkit has joined #osdev
sm2n has joined #osdev
<geist> it's a bunch of extra variables that enter the picture that may be specific to your setup
<clever> geist: aha, found all of the right sections, so your 0x44, says its device, non gathering, non-orderable, with early write ack, nGnRE, but in the case of xhci command rings, i think gathering wouldnt be an issue?
<geist> with appropriate barriers yes
<clever> G vs nG seems like a thing for true mmio
<geist> i mean probably. honestly my brain is turning off
<clever> where the write itself, has side-effects
<geist> it requires full attention to grok arm cache shit
<clever> yeah
<bslsk05> ​weblog.jamisbuck.org: Buckblog: Maze Generation: Eller's Algorithm
<geist> was thinking of bashing together some C code for that. i have the old MAZE.BAS here on my altair
<geist> i used to love this thing, it's a cool little algorithm
<curi0> so to change the BAR address do I just have to write the new address to the BAR register in PCI configuration space (after configuring the bridge) ?
<clever> so if i'm understanding things right, i could change the MAIR field to 0x4c, and it may slightly improve performance, but i havent checked what the other 4 does
<doug16k> curi0, not really, but if you read it back, and it is still the same value, then sure
<geist> clever: well, you had darn well better know what you're doing
<doug16k> if not then you are violating alignment or something
<geist> there are no free lunches
<curi0> whats the process for changing the address then ?
<curi0> i tried looking through the linux kernel and couldnt find
<clever> geist: yeah, i would also want barriers, and to test things heavily
<clever> hmmm, but then the early-ack, does that lie to a barrier opcode?
<geist> again the process is easy: you just write the new value. the hard part is knowing the value, allocating it and making sure it doesn't overlap, and making sure that particular hardware can handle it, and making sure any bridges in front of it can handle it
<doug16k> curi0, you will luckily give an address that is aligned properly with the right bits set in low 3 bits, or you will not do it right, and it will not read back what you wrote
<curi0> thanks i'll try these in a vm
<clever> geist: oh, and now i realize, i decoded your 0x44 wrong already, starting over!
<geist> we're not being dismissive. it's just complicated
<geist> clever: yeah 0x44 is a variant of the normal memory stuff
<clever> yep
<geist> curi0: is this in the context of your kernel or something?
<curi0> im just trying to understand how linux moves BARs
<curi0> nothing else really
<clever> geist: which is just normal memory, outer non-cacheable, inner non-cacheable, so the letter soup i decoded was meaningless!
<geist> sure but how are you going to test it?
<curi0> efi program
<curi0> already reading stuff with it
<geist> and in general linux doesn't move bars unless it has to
<geist> usually it leaves it alone, (on an x86 machine at least)
<curi0> for me it does because thats the only way it can find space for an 8GB bar on my system
<clever> 2022-07-07 03:29:34 < curi0> what about moving it away from what the BIOS assigns ? i can see this in my kernel log for BAR 0 "releasing [mem 0x600000000-0x60fffffff 64bit pref]" and "assigned [mem
<clever> no such messages on my system
<geist> what hardware is this for?
<geist> probably a fairly new vid card with the resizable bar stuff
<clever> and my largest bar is 256mb
<curi0> amd rx 580 8gb and i5 3470
<geist> ah yeah it's fairly new i think
<curi0> 5 years
<geist> i'll have to check mine (geforce 1080ti) as far as i can tell that gen nvidia didn't like to do that
<geist> but possible there's some new driver bits that do it
<curi0> i think it works all the way back to R9s from 2013
<curi0> amd has had support for a while on linux
<geist> anyway, like we said moving the bar is easy, allocating it and knowing it's okay to is harder. linux has a whole pci bus driver that has a holistic view of the world, so it knows where to allocate new space
<curi0> yeah im hoping uefi has anything that will make that easier (i havent found anything so far)
<geist> yah also keep in mind it's likely that the video immediately explodes if you move it
<geist> because now the framebuffer will be in a new spot
<geist> unless it's in a separate BAR
<doug16k> is it integrated?
<geist> not if it's an rx 580
<geist> and a bridge was mentioned. integrated stuff tends to be on the root bus
<doug16k> yeah if code is expecting the framebuffer and stuff at one place and you move it, then it will not work of course
<geist> i was wondering how mundane devices get 64bit mapped recently when writing a full pci driver for LK but found out that basically they were all on the root bus
<geist> ie, some derpy AHCI device or e1000 can sit in 64bit range non-prefetchable because they dont have to sit behind a bridge
<geist> which would force the 64bit range to be prefetch
<geist> honestly surprised PCI hasn't specced some sort of new bridge type that fixes that bug
<doug16k> what device is it where you care if it is prefetchable?
<geist> only true prefetchable bars i've seen are for vid card BARs
<doug16k> oh wait, you mean it makes the accesses prefetchable when they weren't?
<geist> but presumably something like e1000's mmio bars should *not* be prefetchable
<geist> but if they were behind a bridge they'd have to be forced to if using a >4GB address
<geist> due to the limitation of the legacy bridge spec, that only gets 64bit addresses for prefetch regions
<geist> i was surprised to discover this
<geist> so what i've seen is e1000s and whatnot that declare they have a 64bit non prefetch BAR, which means functionally they're forced to be 32bit BARs unless they're on the root bus (ie, built into the soc)
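The rule geist describes can be written down as a small predicate. This is a sketch under the assumption he states: a legacy PCI-to-PCI bridge forwards 64-bit addresses only through its prefetchable memory window, while its non-prefetchable window is 32-bit. `can_place_above_4g` is a hypothetical name, not a function from LK or Linux.

```c
#include <stdbool.h>

/*
 * Can a memory BAR be assigned an address above 4 GiB?
 *  - the BAR itself must be a 64-bit BAR;
 *  - on the root bus there is no bridge window in the way;
 *  - behind a bridge, only the prefetchable window can carry
 *    64-bit addresses, so the BAR must be prefetchable too.
 */
static bool can_place_above_4g(bool bar_is_64bit, bool bar_prefetchable,
                               bool behind_bridge)
{
    if (!bar_is_64bit)
        return false;
    if (!behind_bridge)
        return true;
    return bar_prefetchable;
}
```

So an e1000 with a 64-bit non-prefetchable BAR gets `true` on the root bus and `false` behind a bridge, which is the case where it is functionally a 32-bit BAR.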
<doug16k> BARs that are huge tend to be prefetchable (framebuffers)
<doug16k> non-prefetchable tend to be tiny
<doug16k> so I guess nobody is worried that non prefetchable have to be < 4G
<clever> and the smaller it is, the more easily you can pack it into the <4gig
<clever> although, wont those BAR's be covering up ram?
<doug16k> not necessarily
<doug16k> the hole can be remapped to the end of ram
<clever> how?
<geist> in the very early days of x86 machines getting close to 4GB ram, yes. but pretty quickly x86 machines started remapping top of ram that would be covered up to >4GB. so you get a discontiguity in ram
<geist> it's intel or AMD specific, but there are various SOC control registers that let you set where the top of 'low' memory is
<geist> (TOLUD, etc)
<clever> ah
<doug16k> yeah, people hardly care anymore about the lost ram
<doug16k> at first everyone was freaking out because microsoft used PAE to coerce you into getting server version
<geist> which functionally is telling the intel/amd socs where to stop decoding DRAM for the <4GB hole, and then another one that says where to stop decoding DRAM > 4GB
<clever> so you could set the "low" (<4gig?) memory to end at say 1gig?
<clever> and then you have a 3gig hole, and the rest of the ram is >4gig?
<doug16k> so there was little point in remap
<geist> so on a machine with say 0-3GB, then 4GB - 11GB there may be two control registers somewhere: one set to 3GB and another one at 11GB
<clever> yeah, makes sense
<geist> then that's intrinsically priming the built in address decoders where to redirect transactions to the dram controller vs everything else
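A toy model of the decode geist describes, using his example of RAM at 0-3GB and 4GB-11GB with two "top of memory" values. The function name and the way the DRAM offset is computed are assumptions for illustration; real chipsets use vendor-specific registers with their own granularity and remap rules, and legacy holes below 1 MiB are ignored here.

```c
#include <stdint.h>

#define GIB (1024ULL * 1024 * 1024)

enum target { TARGET_DRAM, TARGET_MMIO };

/*
 * top_low:  where DRAM decoding stops below 4 GiB (start of the PCI hole)
 * top_high: where DRAM decoding stops above 4 GiB
 * On a DRAM hit, *dram_offset is the offset into physical DRAM, assuming
 * the RAM hidden by the hole is remapped to start at 4 GiB.
 */
static enum target decode_phys(uint64_t addr, uint64_t top_low,
                               uint64_t top_high, uint64_t *dram_offset)
{
    if (addr < top_low) {
        *dram_offset = addr;
        return TARGET_DRAM;
    }
    if (addr >= 4 * GIB && addr < top_high) {
        /* skip over the hole between top_low and 4 GiB */
        *dram_offset = addr - (4 * GIB - top_low);
        return TARGET_DRAM;
    }
    return TARGET_MMIO; /* the hole, or anything past the end of RAM */
}
```

With `top_low = 3 * GIB` and `top_high = 11 * GIB` this describes a machine with 10 GiB of DRAM: the address 4 GiB lands on DRAM offset 3 GiB, and everything from 3 GiB up to 4 GiB goes to PCI and friends.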
<clever> so you can artificially create your own hole in ram
<clever> and then shove all of the pci stuff in that hole
<geist> yah this is part of what the bios has to set up
<geist> you can actually find these registers, they're somewhat documented. i think the AMD ones are called TOLUD and TOLUD2. iirc. they're MSRs
<geist> intel has another similar set. i think in pci device 0:0.0
<geist> or something like that
<clever> in the latest linus tech tips, they showed off some more of what intel developers are doing
<clever> and they modified the cpuid registers via a debug interface, WHILE WINDOWS WAS RUNNING
<geist> or maybe the other way around and the intel one is TOLUD and the AMD one is something else
<clever> then re-opened cpu-z, and the cpuid said LTT
<doug16k> TOPMEM?
<geist> yah, TOPMEM
<geist> that's the AMD ones. right?
<doug16k> think so yeah
<geist> clever: re: the cpuid stuff, AMD at least documents a lot of that. there are some control MSRs that let you set various 'hard coded' stuff that shows up in cpuid
<geist> basically things that the bios can do to override features so the os doesn't think it's there, etc
<clever> geist: except this was going via a backdoor debug channel, and the os wasnt having to co-operate
<geist> that too
<geist> lots of those bits are not necessarily hard coded. filled in with microcode, or MSRs or whatnot. probably just sram cells sitting there
<clever> yeah
<bslsk05> ​'I saved the best for last' by Linus Tech Tips (00:18:33)
<geist> i mean i dont know what they did but based on what i've seen and heard, i'm totally not surprised
<clever> they did also mess with the overclock over that port
<clever> which ive done too
<geist> reading the BKDG for AMD 15h is one of the most informative things about how the sausage is made i've read in a while. it really goes into pretty intricate detail
<geist> and one can assume that intel has similar amounts of control knobs
<clever> my desktop motherboard is for "gaming", and has a special usb-host port, that can identify itself as an HID device if you push a button
<geist> anyway off to bed i go
<clever> and then using a closed-source app, you can mess with the clock stuff
<geist> i meant to go like an hour and a half ago
<geist> damn you clever!
<clever> same
<clever> damn you too!
<clever> 4am
<doug16k> I can't stop thinking about meltdown during this video
the_lanetly_052 has quit [Ping timeout: 244 seconds]
<doug16k> I'd ask them how many performance counter errata are going to be in 12th gen
<Andrew> Building binutils at the moment
<Andrew> Wish me luck :D
toluene has quit [Read error: Connection reset by peer]
<doug16k> hmmm, I wonder if any virus scanners detect an infinite loop that increments a variable as a precision timer for sidechannels
toluene has joined #osdev
<moon-child> if you're native code, presumably you can just rdtsc
<moon-child> and don't need a spincrement loop
<doug16k> there are facilities to put a divisor on it right?
<moon-child> maybe?
<moon-child> I haven't heard of any, but that doesn't mean they don't exist
<doug16k> I think so. anyway, you can disable the whole thing too
kaichiuchi has quit [Ping timeout: 244 seconds]
kazinsal has quit [Read error: Connection reset by peer]
danlarkin has quit [Read error: Connection reset by peer]
kazinsal has joined #osdev
kaichiuchi has joined #osdev
danlarkin has joined #osdev
arch-angel has quit [Read error: Connection reset by peer]
<doug16k> makes the cpu do a tsc * 16.48 fixedpoint scale factor and give that to guest
xenos1984 has quit [Read error: Connection reset by peer]
<doug16k> host decides how much tinfoil to use
<doug16k> can use so much tinfoil, it increments every 22 hours with 3.5GHz base
<doug16k> or make it seem like your base frequency is 229 terahertz
<doug16k> not sure it allows 1 in top 16 bits, it would be funny, I'm kidding
<doug16k> makes me wonder if that is for hosts to better falsify CPUID :D
<doug16k> pretend it is some better cpu, make the base frequency look right
<doug16k> that would be what the upper 16 bits are for. lower 16 would be for scaling down precision to prevent usermode spectre
<doug16k> spamming increment in thread would be the workaround
<doug16k> lower 48 I mean
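The arithmetic behind the 16.48 scale factor doug16k describes can be sketched as follows. The function name is hypothetical, and `unsigned __int128` is a GCC/Clang extension on 64-bit targets, used here only to keep the 128-bit intermediate product; how a given CPU exposes the multiplier to a hypervisor is not covered.

```c
#include <stdint.h>

#define TSC_FRAC_BITS 48 /* 16 integer bits, 48 fractional bits */

/*
 * Guest-visible TSC = host TSC * multiplier, where the multiplier is a
 * 16.48 fixed-point number. A multiplier of 1 << 48 means a ratio of 1.0.
 */
static uint64_t scale_tsc(uint64_t host_tsc, uint64_t multiplier)
{
    unsigned __int128 product = (unsigned __int128)host_tsc * multiplier;
    return (uint64_t)(product >> TSC_FRAC_BITS);
}
```

The two extremes from the chat fall out directly. With the smallest nonzero multiplier (1, i.e. 2^-48) the guest counter advances once per 2^48 host ticks, which at 3.5 GHz is about 80421 seconds, a bit over 22 hours. With the integer part at its maximum of 65535, a 3.5 GHz base looks like roughly 229 THz.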
arch-angel has joined #osdev
curi0 has quit [Remote host closed the connection]
<doug16k> funny it is for VM guests and not in general
xenos1984 has joined #osdev
liz has joined #osdev
LostCarcosa has quit [Quit: Leaving]
GeDaMo has joined #osdev
<Andrew> Ahhhhh
<Andrew> Using cross compilation makes things so much better
<zid> yes, yes it does
<zid> makes what better? :P
arch-angel has quit [Quit: Leaving]
arch-angel has joined #osdev
arch-angel has quit [Client Quit]
<mjg_> real os developers write bytecode by hand until they have a self-hosting kernel
<Mutabah> speaking of that... did you see the GDQ "Triforce%" run?
<Mutabah> (bootloaders/bytecode... and by "hand")
<mjg_> wut?
<Mutabah> Summer Games Done Quick featured a showcase of Ocarina of Time
<zid> If you want me to explain any of it I probably can mutabah
<mjg_> i only find "live reaction" videos
<Mutabah> Arbitrary code execution, used to show off beta content left on the cartridge... and then some
<zid> I know a fair amount of OoT
<Mutabah> zid: Oh, I saw it live, and a nice breakdown of how it was performed
<mjg_> heh, solid. i'll take a look later, fortunately i never played any of the zelda games
<zid> performed sure
<zid> but I know what it *does* :p
<mjg_> what cpu was that running on?
<zid> mips
<Mutabah> The breakdown video mentioned that they limited the interface to what could be done with a physical controller
SGautam has joined #osdev
<Mutabah> i.e. didn't use two bits that didn't have buttons, but controllable over the controller interface
<zid> so you're not going to ask how SRM works? :( *sadge*
<Mutabah> :) I've watched enough speedruns of Zelda64 to know :D
<zid> boo
<bslsk05> ​'Ocarina of Time TAS by dwangoAC, TASBot, Savestate, Sauraen in 53:05 - Summer Games Done Quick 2022' by Games Done Quick (01:12:43)
<zid> go on then, you explain it
<zid> I'm waiting :D
<Mutabah> and triggered enough use-after-frees :)
<zid> nod, they have a bunch of silly internal jargon, but it's technically a use-after-free that they're exploiting
<Mutabah> You pick up an object while it's not being rendered (through camera manipulation), which ties its position/rotation to link's position (by updating it every frame)
<zid> (superslide teleporting through a load volume with a strength upgrade usually)
<zid> but there's other methods that achieve the same result like remote camera
<Mutabah> You then leave the map, causing it to be unloaded while still held (... probably because it wasn't being properly tracked due to being off-screen originally)
<Mutabah> Next time you transition maps, a new object is loaded over it... but link is still "holding" the object, so this new random object (or even random blob of code) gets clobbered with the position/rotation updates
<zid> It's just a disconnect between link being able to or knowing he should, drop things, and things freeing
<zid> the superslide teleport locks link into the grab
gildasio has joined #osdev
<zid> remote camera there's just no reason for link to stop holding the thing anyway
<Mutabah> mjg_: That was the run, https://www.youtube.com/watch?v=qBK1sq1BQ2Q is the breakdown
<bslsk05> ​'Finally Obtaining the Triforce in Ocarina of Time: Triforce Percent Explained' by Retro Game Mechanics Explained (00:34:24)
<zid> offset in the object (actor) they tend to overwrite is the draw function pointer
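A toy model of the stale-reference bug described above, not the game's actual data structures. All struct layouts, field names and addresses are made up for illustration; the only property carried over from the chat is that the per-frame update of the held object lands on the same offset where the newly loaded actor keeps its draw function pointer (modelled here as a plain 32-bit value rather than a real function pointer).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts: both actors keep something at the same offset. */
struct carried_actor { uint32_t id; uint32_t facing;  };
struct other_actor   { uint32_t id; uint32_t draw_fn; };

/* One slot of a toy actor heap that gets reused across map loads. */
union actor_slot {
    struct carried_actor carried;
    struct other_actor   other;
};

static union actor_slot  heap_slot;
static union actor_slot *held; /* what the player is "holding" */

static void spawn_carried(void)
{
    heap_slot.carried = (struct carried_actor){ .id = 1, .facing = 0 };
}

static void grab(void) { held = &heap_slot; }

/* The bug: the slot is released for reuse, but `held` is never cleared. */
static void unload_map(void) { }

static void spawn_other(uint32_t draw_fn)
{
    heap_slot.other = (struct other_actor){ .id = 2, .draw_fn = draw_fn };
}

/* Runs every frame: copies the player's facing into the held object. */
static void update_held(uint32_t player_facing)
{
    if (held != NULL)
        held->carried.facing = player_facing;
}
```

Running `spawn_carried(); grab(); unload_map(); spawn_other(0x80012340u); update_held(0x80400000u);` leaves `heap_slot.other.draw_fn` equal to `0x80400000`: a value chosen by the player's inputs now sits where the new actor's draw pointer should be.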
<mjg_> Mutabah: danke
heat has joined #osdev
<mjg_> Mutabah: so is the triforce thing real or something injected by tas?
<mjg_> Mutabah: started watching the explanation and it makes a suspicious statement in the second minute
<mjg_> Mutabah: well i watched another minute and now i know :-P
dennis95 has joined #osdev
gildasio has quit [Remote host closed the connection]
gildasio has joined #osdev
<zid> I've invented an amazing chocolate delivery system
<zid> You forget it's in your pocket, then open one end and suck
<Andrew> zid: I tried with weird -m32 flags, etc and ld complains about expecting i386:x86_64, rejecting gcc's i386 output
<zid> yup, that's one of the annoying things about trying to make combined bootstrap + kernel images
<zid> getting 32bit code inside a 64bit elf
<zid> It's *much* easier in assembly in that respect, because you can just [bits 32] -felf64
SGautam has quit [Quit: Connection closed for inactivity]
<mjg_> i stopped eating chocolate few weeks ago
<mjg_> prompted by getting myself from ~70 kg to 80 :[
<mjg_> at this point i look pregnant
<zid> I weigh like.. 50kg? *does math*
<zid> oh 60
<mjg_> posture-wise i look like this guy https://www.youtube.com/watch?v=42FLAr86hbI
<bslsk05> ​'Squat Cobbler HSC - Beaches 'n' Peaches | Better Call Saul Extras' by Movies Breaker (00:03:29)
<mjg_> zid: what's your height
<zid> 6
<mjg_> foot?
<zid> mete- yes
<mjg_> you sound underweight then
<zid> I don't know what I weigh, but it isn't 50 or 70 :p
<bslsk05> ​www.nhlbi.nih.gov: Calculate Your BMI - Metric BMI Calculator
<zid> I need like 12 more sets of 5kg scales
<mjg_> funny trap with "fixing your diet": it is very easy to end up with something atrocious
<zid> Idk how to be fat
<mjg_> and funny trap with exercise: common advice is beyond garbage and will leave you injured
<mjg_> how old are you man
<zid> mid 30s
<mjg_> in my mid 20s i was the "can eat anything" person, or so i thought
<mjg_> so am i
<zid> BMR doesn't really change much
<mjg_> interestingly i started getting overweight *post* pandemic
gog has joined #osdev
<GeDaMo> I was around 25 when weight just started accumulating :/
<zid> It goes from like 25 to 21 from 20 to 80
<mjg_> you might have unknowingly changed your eating habits
<mjg_> for example people have no idea how much they snack on crap
<mjg_> and that's one of the major weight gainers
<zid> cal_in - cal_out, the rest is just strategy for if you have issues with it
<zid> like eating low calorie foods so you can still overeat
<zid> or becoming an olympic swimmer
<mjg_> from what i hear cico is oversimplified
<zid> it's just physics
<mjg_> namely your body's ability to extract calories depends on the particular food
<mjg_> and can be significantly less than the supposed calories indicated on packaging
<zid> that's good not bad :P
<zid> I mean, bad for me, good for you
<mjg_> well it is in your favor so to speak
<mjg_> but you may find yourself in nutrient deficit
<zid> that's nearly impossible
<mjg_> i don't remember the recommended "safe" deficit
<mjg_> for long term weight loss
<zid> the only thing people are commonly lacking is vit D, and that's because we can't synthesize it or eat it (mostly)
<zid> you have to photosynthesize like a damn plant
<zid> and women can be lacking iron, for obvious reasons
<zid> but nobody's getting scurvy unless they have eating disorders where they can only eat burnt chips or whatever
<mjg_> i'm pretty sure you would get yourself into a solid deficit with sufficiently shitty diet, which is "obtainable"
<zid> There was a scottish guy, lost like 100kg just drinking water and eating vitamin pills, they had him on regular blood tests and he was fine
<mjg_> huh?
<bslsk05> ​en.wikipedia.org: Angus Barbieri's fast - Wikipedia
<mjg_> Died 7 September 1990 (aged 50–51)
<mjg_> i'm sure this had nothing to do with it
<zid> he's scottish, that's way above average
<zid> actually, studies show that calories kill
xenos1984 has quit [Read error: Connection reset by peer]
<zid> if you wanna live a long rat life, stop eating
<zid> I assume it's just "more machinery churns, so more machinery wears out" in a very abstract sense
<bslsk05> ​en.wikipedia.org: Calorie restriction - Wikipedia
<klange> mjg_: 35 years after the fasting
<klange> if anything, it was probably complications from his weight _before_ the event that eventually did him in...
<zid> scottish life expectancy is only 61, in 2020
<zid> and that was 80 years ago
<zid> it's the lowest in w. europe
<bslsk05> ​www.badspacecomics.com: The Suit - Bad Space Comics
<mjg_> > In Scotland between 2018-2020: Male healthy life expectancy was 60.9 years. Female healthy life expectancy was 61.8 years.
<mjg_> i thought you were joking
<zid> scotland is a silly place
<mjg_> i have to take back my comment then
<zid> ohh it's THAT one, seed that GeDaMo it's fun
<GeDaMo> :P
<zid> (took me a while to figure out how to make it load)
<mjg_> GeDaMo: ouch
<mjg_> GeDaMo: have you read "i have no mouth and i must scream"?
<mjg_> about 15 minutes read afair and right up your alley i think
<GeDaMo> I assume I have, I know the name but I can't remember exactly what it's about
<mjg_> ellison
<mjg_> people vs a computer
SpikeHeron has quit [Quit: WeeChat 3.5]
<mjg_> an almighty one
<zid> I prefer I have no feet and I must sock
<GeDaMo> Oh yeah, I remember it now
<mjg_> cmon man, how can you forget such a classic
<klange> fuck me that's disturbing
<bslsk05> ​rowrrbazzle.blogspot.com: Perkin Worbeck's Magic Newt: "Answer" by Fredric Brown (1954) (complete short-short story)
<mjg_> klange: *do not* read the story i recommended :-P
<zid> I'm currently working on a basilisk
<mjg_> GeDaMo: have you read "the last question"?
<GeDaMo> Yes
SpikeHeron has joined #osdev
<GeDaMo> My exit message is currently "There is as yet insufficient data for a meaningful answer." :P
<zid> link "they're made of meat" next
<zid> if we're doing "famous short stories nerds know"
<klange> How about a fun one?
<klange> "The Road not Taken"
<GeDaMo> Ah, I like that one
<bslsk05> ​xkcd - Team Chat
<mjg_> you reminded me i need to read the thing from outer space
<mjg_> i never got around to it
<zid> After that, read ascendance of a bookworm
<zid> that's a nice little short story
<bslsk05> ​www.badspacecomics.com: Grounded
<zid> (19 novels so far)
<mjg_> noted
<mjg_> have you read sandkings?
<GeDaMo> "thing from outer space"?
<mjg_> not sci-fi though
<mrvn> mjg_: calories indicated on packaging is like MTBF on harddisks.
<mjg_> GeDaMo: or whatever the title was, the original material for the "the thing" movie
xenos1984 has joined #osdev
<mjg_> GeDaMo: i'm pretty sure it has 'thing' in the title :>
<GeDaMo> Ah, the thing from another world
<mjg_> oh maybe that
<bslsk05> ​en.wikipedia.org: Who Goes There? - Wikipedia
<mjg_> man, 0/2
<mjg_> ooh Its extended novel version, found in an early manuscript titled Frozen Hell, was finally published in 2019.
<mjg_> i did not know that's a thing
<GeDaMo> The first film was called "The Thing from Another World"
<GeDaMo> I also did not know that :|
<mjg_> makes me happy i did not read the original :-P
<GeDaMo> And apparently there's going to be a new film based on the full novel
<mjg_> :O
<mjg_> nice
<bslsk05> ​bloody-disgusting.com: Universal and Blumhouse Developing New Version of 'The Thing' That Will Adapt Long Lost Original Novel! - Bloody Disgusting
<mjg_> although i have to note "the thing" by carpenter does seem like the perfect movie
gog has quit [Ping timeout: 260 seconds]
* mrvn wonders about the uniform rules on ST Strange new Worlds: Gold for command, Blue for science/medical, red for everyone else making red-shirts hard to spot and white for nurse Chapel?
<mjg_> as in i don't know what you can do to improve on it
<GeDaMo> Yeah, same
<mrvn> if I knew how to make it better I would be rich and famous.
<mjg_> mrvn: let me restate, i don't see any weak points
<mrvn> it's a pretty bad comedy :)
<mjg_> well there is one bit, i did not like how the main character (what was the fucker's name?) poured whisky (or whatever) into a computer
<mjg_> the chess game they showed afair did not add up
<GeDaMo> MacReady
<mjg_> as in blatantly different positions between shots
<mrvn> OMG, what has the computer ever done to him? Or: What a waste of good whiskey?
<mjg_> why not both
<mrvn> mjg_: hehe, continuity errors are fun. They probably had to reshoot the scene and some intern had to set up the chess board.
gog has joined #osdev
gog has quit [Ping timeout: 272 seconds]
mzxtuelkl has quit [Read error: Connection reset by peer]
mzxtuelkl has joined #osdev
<sbalmos> mrvn: Pike?
<mrvn> sbalmos: yes
<sbalmos> mrvn: I haven't seen the ep for today yet, so I'm guessing that's the whisky reference.
<mrvn> sbalmos: no, that was from The Thing
[itchyjunk] has joined #osdev
<sbalmos> mrvn: ah, crap, sorry then. I lose again.
<mrvn> sbalmos: I'm just wondering why everyone is wearing a uniform except nurse Chapel
<sbalmos> mrvn: She's a civilian
<sbalmos> mrvn: See Ep 1. She's a civilian from I believe Carnegie-Mellon on a Starfleet medical trial program or such. I'd have to go look back up the exact wording they used.
<mrvn> that explains it.
<zid> oh, westworld is back
<zid> it lost the plot a bit but it's still pretty
ornxka has joined #osdev
<ornxka> why my memory gets corrupted/??
<zid> you corrupted it.
<ornxka> ah yes, it figures that the source of all of my other problems would be at fault here too
<zid> not a lot anybody can do to help you with "why wrong!?!?//11one"
blockhead has quit []
<ornxka> its one of those "looking for moral support and commiseration rather than concrete suggestions" sort of things
<sbalmos> life sucks, computers are unforgiving. But the good thing is, they do exactly what you tell them to do, and nothing more.
<sbalmos> The bad thing is, they do exactly what you tell them to do, and nothing more.
<zid> computers were a mistake, try becoming a hermit
<sbalmos> zid: you know, that mountain cabin is looking more tempting all the time, and not just because of computers
<ornxka> normally i hate that sentiment because i didnt write 99.99% of the code that runs on my computer and thus they do not do "exactly what i tell them to do" and its more a consensus between me and several thousand other people
<ornxka> but in this instance it is indeed doing exactly what i told it to do
<ornxka> and there is no one to blame but myself..
<ornxka> being a hermit sounds nice, imagine all of the time you would have to spend on hobbyist osdev
<zid> very little, too busy smelting copper
<zid> and mixing clay
<mrvn> ornxka: residual radiation from satellite debris entering the atmosphere
<mrvn> That mountain cabin needs netflix
<ornxka> tired: accidental use after free bug, wired: my gangstalkers are beaming gamma rays at my dev machine and flipping my bits to impede my development progress
lg has quit [Ping timeout: 244 seconds]
<heat> get kasan
<heat> never worry about use after frees again
terminalpusher has joined #osdev
gog has joined #osdev
<zid> gog omg you can't do that wtf
<zid> oh sorry I mistook you for heat
<gog> what
<gog> I'll do it anyway idc
* gog does it
<zid> Okay you'll need my paypal info then
<heat> just do it gog
<heat> like nike
<heat> and shia labeouf
<gog> yes
<zid> I do wonder *why* heat keeps doing it though
<zid> I'd imagine the chafing would be incredible
lg has joined #osdev
ethrl has joined #osdev
hysv has joined #osdev
terminalpusher has quit [Remote host closed the connection]
<clever> mrvn: looking at the opengl specs for glDrawArrays, it almost exactly maps on top of "Vertex Array Primitives" from v3d, but v3d wants everything in one big array, while glDrawArrays wants multiple arrays
<clever> so mesa would still have to do a somewhat complex fetch from n arrays, write interleaved to one array
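The copy clever has in mind here (fetch from n separate arrays, write one interleaved array) could look roughly like this. It is a generic struct-of-arrays to array-of-structs sketch, not code from mesa; `attrib_src` and `interleave` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* One tightly packed source array, e.g. all positions or all colours. */
struct attrib_src {
    const void *data;      /* n_verts elements, back to back */
    size_t      elem_size; /* bytes per vertex for this attribute */
};

/*
 * Copy n_attribs separate arrays into dst so that every vertex's
 * attributes sit next to each other. dst must hold n_verts * stride
 * bytes. Returns the stride (bytes per interleaved vertex).
 */
static size_t interleave(const struct attrib_src *src, size_t n_attribs,
                         size_t n_verts, unsigned char *dst)
{
    size_t stride = 0;
    for (size_t a = 0; a < n_attribs; a++)
        stride += src[a].elem_size;

    for (size_t v = 0; v < n_verts; v++) {
        size_t offset = 0;
        for (size_t a = 0; a < n_attribs; a++) {
            const unsigned char *from =
                (const unsigned char *)src[a].data + v * src[a].elem_size;
            memcpy(dst + v * stride + offset, from, src[a].elem_size);
            offset += src[a].elem_size;
        }
    }
    return stride;
}
```

The cost is one pass over every vertex on the host CPU per upload, which is why it matters whether the hardware can fetch from separate arrays directly.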
X-Scale` has joined #osdev
<zid> I'm not sure I've actually used glDrawArrays before, thinking about it
<zid> glDrawElements for life I guess
<clever> *reads*
X-Scale has quit [Ping timeout: 276 seconds]
<clever> zid: ahh, that maps closely to the "Indexed Primitive List" that ive already been using
X-Scale` is now known as X-Scale
<zid> elements is just the same thing but with *another* array :p
<zid> I basically never end up with flat geometry unless I'm hand-generating it
<clever> zid: but, is that array pointing into fully assembled vertices, or separate arrays?
<zid> It's identical, but with an extra array of indices
<zid> glDrawArraysIndirect :P
<clever> ah, so same problem remains, and that seems even worse for the host cpu
<clever> v3d expects an array of structs, each one describing a single vertex fully
<zid> In practice it might not be too bad? most people tend to just use a single array anyway
<zid> but with glVertexAttribPointer to chop up that single array
<clever> but the way the docs describe it: you can prespecify separate arrays of vertices, normals, and colors and use them to construct a sequence
<clever> it sounds more like structs of arrays
<zid> yea I'm not sure anybody ever really does that
<zid> It's generally just a huge GL_BUFFER full of 18 element verts
<zid> rather than 6 arrays of triples
<clever> ah
<clever> and then how do you tell gl what the layout is within those buffers?
<zid> >but with glVertexAttribPointer to chop up that single array
<zid> entity.c: glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(GLfloat[8]), 0);
<zid> entity.c: glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(GLfloat[8]), (GLvoid *)(sizeof(GLfloat[3])));
<zid> entity.c: glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, sizeof(GLfloat[8]), (GLvoid *)(sizeof(GLfloat[6])));
<zid> That's an 8 element vert, split into 3/3/2
<zid> (xyz, rgb, uv I think in that case)
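The three calls above describe one interleaved 8-float vertex. A sketch of the same layout as a C struct, assuming the xyz/rgb/uv split zid guesses at; `struct vertex` and `struct attrib_layout` are illustrative names, and the table just mirrors the index, component count, stride and byte offset arguments of `glVertexAttribPointer` without calling into GL.

```c
#include <stddef.h>

/* One interleaved vertex: 3 + 3 + 2 floats, 32 bytes with 4-byte floats. */
struct vertex {
    float pos[3]; /* attribute 0 */
    float rgb[3]; /* attribute 1 */
    float uv[2];  /* attribute 2 */
};

/* Mirrors the index, size, stride and pointer arguments of the GL call. */
struct attrib_layout {
    int    index;
    int    components;
    size_t stride; /* distance between the same attribute in two vertices */
    size_t offset; /* byte offset of the attribute inside one vertex */
};

static const struct attrib_layout layout[3] = {
    { 0, 3, sizeof(struct vertex), offsetof(struct vertex, pos) },
    { 1, 3, sizeof(struct vertex), offsetof(struct vertex, rgb) },
    { 2, 2, sizeof(struct vertex), offsetof(struct vertex, uv)  },
};
```

Using `offsetof` on a struct gives the same numbers as the hand-written `sizeof(GLfloat[3])` and `sizeof(GLfloat[6])`, and keeps them correct if a field is added or reordered.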
lg has quit [Ping timeout: 244 seconds]
<clever> was going to say, that v3d wants its pre-shaded vertex data in a very specific layout
<clever> then i remembered, if you have vertex shaders, that goes out the window
<zid> The shader ends up with vec3, vec3, vec2 `in` data
<clever> and you can lay them out in any order, and then just select the right attributes
<clever> behind the scenes, your 8 element vert, becomes an 8x16 matrix in the VPM, all 8 elements for 16 vertices
<zid> layout(location=0) in vec4 pos; layout(location=1) in vec3 norm; layout(location=2) in vec2 uv;
<clever> and the shader compiler can just map coord.x to the right row in the VPM
<zid> That's what it looks like in glsl ^
<clever> so the ordering within the buffer doesnt matter
<clever> so attributes solve the problem i was expecting
<clever> you just have to adjust the compiled shader, to agree with the glVertexAttribPointer settings
<zid> and then there's "..Instanced" I think which adds another field that's a sequential int and how many times to render the same geom over and over so you can index it into a texture to do different things per copy
<zid> useful for like, minecraft
<clever> ive not seen any signs of v3d supporting instancing
<clever> so the gl drivers would have to duplicate the data for you
<zid> where your data is always going to be a single quad with the same UVs, and all you care about is varying an offset
<zid> single cube*
<clever> yeah, mrvn mentioned using instancing for drawing text lastnight
<zid> It saves pci-e trips to let the card know directly that you're rendering the same geom 32768 times
<zid> rather than 32768 calls
<zid> and having to blow up your memory usage 32768 times
<clever> in the case of v3d, its not over a pci bus, so its just a question of how big the vertex attribute arrays become
<clever> oh wait, i just remembered something
<zid> otherwise you'd have to duplicate all 18 verts instead of adding a single input buffer containing the array you're indexing
<clever> yeah, that actually fits, i forgot about that
<clever> the shader state, wants 8 pointers
<clever> to the start of 8 attribute arrays
<clever> so it could work with the other struct of arrays i saw earlier
<clever> and it wants 8 strides, so you can interleave it however you want
<zid> happy that you're happy, I don't know your thing so I'm just throwing out info about how gl programs "can work" in case anything matches your hw's caps cleanly
<clever> yeah
<clever> the only other problem i can foresee
<clever> is that each shader (coordinate, vertex, and i think fragment), has an 8bit mask
<clever> to select which of the 8 attributes its fetching from
<clever> what happens if you need more then 8?
<zid> than
<zid> 8 is a lot
<mrvn> "We have no lawyers here, that is why this is an utopia."
<zid> even for PBR
<clever> zid: so your 3/3/2 is an extreme edge case?
<zid> xyz, normal uv, texture uv, displacement map uv, pre-baked lighting uv, etc
<zid> that's 3
<zid> attribute 0, attribute 1 and attribute 2 are in use
<clever> oooooo
<zid> I still have 5 free if your limit is 8
<clever> that might be what i was mis-understanding
<clever> and explains why the attributes have a size on them
<clever> so, i can put the entire `vec3 pos` into attribute 0, set the size to 12 (3 floats), address to the first pos in the interleaved array, and stride to the distance between 2 pos's
<clever> that makes far more sense
<zid> 16 is the max on pc class hw apparently
* mrvn just wants flat shading, maybe a texture for advanced graphics
<zid> and the stride value has a max which is no less than 2048
<zid> so try not to have verts bigger than 2kB :P
<clever> so that just leaves 2 mysteries
<clever> > Attribute Array [n] Vertex Shader VPM Offset (from Base Address)
<clever> > Attribute Array [n] Coordinate Shader VPM Offset (from Base Address)
<zid> what's a coordinate shader, what's a VPM
<clever> my guess, is that this is a byte offset, from the starting address, so you can mis-align the attributes
<clever> a coordinate shader is just a vertex shader, with the vary[] part deleted
<clever> its job is to only compute screen xy coords
<clever> the VPM is a chunk of memory that is used to send attributes to shaders, and temporarily store shaded vertex data
<zid> so some internal boofer?
<clever> until the polygon has been fully drawn
<clever> yeah
<clever> my first guess, is that you could use it as a byte offset into the vertex attributes
<clever> so the 8 attribute mask, selects a different 8 attributes
<clever> then you could have an attribute array of 1234 5678 9,10,11,12, and then one shader uses 1-8, while another shader uses 5-12
<clever> so each shader is limited to a max of 8 consecutive attributes, but that is a sliding window over the entire attribute selection?
<zid> sounds weird
<clever> i'm just guessing, i could be wrong
<clever> mesa also hides coordinate shaders from you
<clever> the compiler will just delete all of the vary[] outputs from a vertex shader, and then delete any computation with unused outputs
<clever> and boom, there is your coordinate shader
<clever> that just leaves the extended attribute array
<clever> i think its just defining the stride for fetching extra attributes beyond that
<clever> but its not clear how exactly, i should cross-reference to mesa
lg has joined #osdev
gxt__ has quit [Remote host closed the connection]
foudfou has quit [Remote host closed the connection]
gildasio has quit [Write error: Connection reset by peer]
foudfou has joined #osdev
gxt__ has joined #osdev
gildasio has joined #osdev
Geertiebear has joined #osdev
dennis95 has quit [Quit: Leaving]
gildasio has quit [Quit: WeeChat 3.5]
nyah has joined #osdev
dennis95 has joined #osdev
frkzoid has joined #osdev
<frkzoid> looks like M$ has reached the extinguish phase: https://www.phoronix.com/scan.php?page=news_item&px=Systemd-Creator-Microsoft
<bslsk05> ​www.phoronix.com: Systemd Creator Lands At Microsoft - Phoronix
<GeDaMo> Maybe Windows is going to adopt systemd :|
<vdamewood> GeDaMo; That would be funny.
<GeDaMo> Yes, I'm sure we'd all laugh at that :|
* vdamewood installs systemd on GeDaMo
<mrvn> "I just hope that things will work out and eat a steady flow of pizza until they do." I can get behind that philosophy.
<mrvn> GeDaMo: the big question then is: will it get better or worse?
<GeDaMo> systemd or Windows? :P
<vdamewood> Yes!
<gog> hot take: i like systemd
<gog> i find it to be less fragile than various init systems I used over the years
<zid> I've never really used an init system
<zid> beyond /etc/init.d/net
foudfou has quit [Remote host closed the connection]
foudfou has joined #osdev
* vdamewood gives gog a fishyd
<vdamewood> Personally, I like systemd, too.
* gog fishyctl eat
<geist> well, if systemd is fragile enough that losing a key member of their team at this point is fatal, then it's not well run
<gog> is that what happened?
<geist> but probably they'll just keep working on it
<geist> MSFT is a different company nowadays
<zid> maybe they want windows to be bootable into wsl2
<zid> via systemd
<geist> yah totes
[itchyjunk] has quit [Ping timeout: 244 seconds]
<zid> kernel does bringup then runs init which is systemd's init, instead of running explorer or whatever
<gog> maybe it's time to replace systemd with a compatible but less hulking alternative
<zid> (I have no idea what windows' init process is)
<geist> i used to know, and it was a complicated set of this spawns that with the services and whatnot
<gog> like they're doing with pulse audio/pipewire
<geist> but no idea if any of that is the same now
<geist> reminds me. rant: work is forcing me to switch my work computer from cinnamon to GNOME and i hate every part of it
<gog> boooooo
<geist> plain GNOME is such a stupid backstep in functionality in the interest of looking nice
<gog> I've not used new gnome before
<geist> i have to install like 8 extensions to get it kinda halfway back to what i want
<vdamewood> geist: Can you install gnome-shell plugins/extensions, whatever they're called?
<geist> yes. which is what i've been doing. one of them is panel to bar or something, which is the biggie
<geist> problem is the extensions are kinda fragile, seems the more you put in there the more possibility of something colliding with something else, etc
<geist> and things like you have problems with > 4 workspaces because you can't set global keyboard things for 5 and 6, etc
<geist> it's really lame
<vdamewood> I remember RHEL 6(?) included a bunch of extensions by default to make GNOME 3 more like GNOME 2.
<geist> the biggest one: there's no goddamn desktop icons
<gog> can you use plasma
<geist> apparently that's a choice by the designers, they thought it was messy
[itchyjunk] has joined #osdev
<geist> there are a few extensions for desktop icons but they seem to involve essentially running an instance of chromium to drive it
<geist> i'm a huge messy desktop icon user. i keep everything i'm doing right there in little clusters
<gog> yeah the gnome designers have a real obsession with "clean" design but at the expense of configurability
<vdamewood> geist: That... sounds terrible. @ chromium
<geist> even the file menus dont have a shortcut for ~/Desktop because it's not a real folder that gnome cares about
<geist> grar. it's so lame. i used to not mind it but i had forked off their train into MATE and cinnamon years ago because i didn't like where it was going
<vdamewood> Can we smash GNOME and replace it with something better?
<geist> gog: what is plasma in this context?
<gog> kde plasma desktop
<gog> probably not eh
<gog> if they're forcing you to use gnome
<geist> ah no. it's some work thing where they only want us to use gnome
<geist> now that i spent half a day working with gnome i know that i can make it 75% back to what i want, so i'll just stick with cinnamon until they really really force me
<gog> yes
<vdamewood> I can work with GNOME as soon as I get the terminal added to the dock.
<gog> is there any good business reason to force you to use a particular desktop environment? seems like it'd damage productivity
<GeDaMo> Getting the new one to work 75% as well as the old one seems to be the norm now :/
<vdamewood> gog: Makes the systems easier to maintain, if it's a work computer.
<gog> true
<gog> and security review
<vdamewood> gog: Makes it easier to share systems.
<gog> yeh
<geist> yah it's easier to maintain
<psykose> i remember having to do stuff with some restrictions similar to that and safe to say i have no interest in ever doing it again
<geist> yah thats my mini rant, but so it goes. like all things like that, the change of shortcut keys or ui differences you usually just get used to, it's loss of functionality that's a real drag
mzxtuelkl has quit [Quit: Leaving]
<gog> can you get access to classic mode?
<gog> or convince them to install the "desktopfolder" package?
<geist> well i have root access, so i can install pretty much any package i need
<gog> oh ok then
<geist> so it's also possible i can just keep using cinnamon forever, i'm just off the support chain
<gog> yeh
<sbalmos> is the support chain even worthwhile?
<geist> but they were insistent enough that there's actually a login popup that says 'you need to switch to GNOME by date X'
<geist> well, to be honest yes, they're quite good. or at least if the machine is having trouble you go through em
<gog> cause if you dont support me now/you'll never support me again/i can still hear you saying/you broke the support chain
<geist> you know the rules (and so do i)
<geist> a full commitment's what i'm thinking of
blockhead has joined #osdev
<gog> wrong song :p
gildasio has joined #osdev
<bslsk05> ​'Life Is a Long Song (2001 Remaster)' by Jethro Tull - Topic (00:03:17)
foudfou has quit [Remote host closed the connection]
foudfou has joined #osdev
<geist> ah jethro tull. one of my favorite vinyls i have is Aqualung
<geist> really a great vinyl in general since it has a fair amount of dynamic range, etc
<geist> a modern mix sounds much more compressed as usual
<geist> huh looks like qemu is getting support for LoongArch
<geist> though last i had looked it wasn't that interesting. some sort of mips like derivative
<GeDaMo> It's MIPS-based
<GeDaMo> I think there are some instructions to assist in emulating x86
<geist> but i guess has diverged enough to be its own arch
<GeDaMo> «In August 2021, Linux maintainers complained that submitted LoongArch code is "...a blind copy of the MIPS code...", however "only with a different name".» https://en.wikipedia.org/wiki/Loongson
<bslsk05> ​en.wikipedia.org: Loongson - Wikipedia
vinc has joined #osdev
<GeDaMo> Possibly due to lack of documentation
heat has quit [Ping timeout: 264 seconds]
doug16k has quit [Remote host closed the connection]
CaCode has joined #osdev
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
mrvn has quit [Ping timeout: 256 seconds]
vinc has quit [Read error: Connection reset by peer]
terminalpusher has joined #osdev
<ddevault> let's say I have a scenario where I have two userspace processes exchanging data by having the kernel copy it between buffers in their respective address spaces
<ddevault> would it be reasonable to statically allocate an extra set of page tables for the kernel to map any physical address into its address space temporarily?
<ddevault> am I going to take a big performance hit from doing that?
X-Scale` has joined #osdev
vdamewood has joined #osdev
X-Scale has quit [Ping timeout: 255 seconds]
<ddevault> hm, I know linux does the reverse, unmapping the kernel while in userspace
X-Scale` is now known as X-Scale
<ddevault> it would be improved if I kept track of the last such mapping and avoiding invalidating the TLB if they're already okay, so if two processes are exchanging a lot of messages it's not thrashing it
<geist> depends. in general fiddling with the kernel address space on an SMP system is expensive since you have a global TLB shootdown to do (on x86 at least)
<geist> but, if you can use a temporary, per-cpu region of the kernel (say every cpu gets a 4MB region, or a 1GB region or whatnot)
<geist> *and* you can ensure that the code that does the map/copy/unmap runs on one cpu you can design a per cpu mapping scheme that avoids tlb shootdowns
vdamewood has quit [Ping timeout: 255 seconds]
<geist> but you may be trading some amount of global responsiveness and preemptability, depending on how you pin the thread/task to a single cpu
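The per-CPU window geist describes can be modelled in user space. This is only a sketch under assumptions: the window base, the CPU count, and the names `TempMapper` and `PerCpuWindow` are hypothetical, and a plain array stands in for the real page-table entries. The point it illustrates is that remapping a slot in a CPU's own window needs at most a local invalidation on that CPU, never a cross-CPU shootdown.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// All constants are illustrative assumptions, not taken from any real kernel.
constexpr int kMaxCpus = 4;
constexpr std::uint64_t kPageSize = 4096;
constexpr std::uint64_t kWindowSize = 4ull << 20;             // 4 MiB per CPU
constexpr std::uint64_t kWindowBase = 0xffffff8000000000ull;  // hypothetical kernel VA
constexpr std::size_t kSlotsPerCpu = kWindowSize / kPageSize; // 1024 slots

struct PerCpuWindow {
    std::array<std::uint64_t, kSlotsPerCpu> phys{}; // stand-in for PTEs, 0 = unmapped
    unsigned local_invalidations = 0;               // flushes done on this CPU only
};

struct TempMapper {
    std::array<PerCpuWindow, kMaxCpus> windows{};

    // Map phys_page into `slot` of the calling CPU's window and return the
    // virtual address. The caller must stay pinned to `cpu` while using it.
    std::uint64_t map(int cpu, std::size_t slot, std::uint64_t phys_page) {
        assert(cpu >= 0 && cpu < kMaxCpus);
        assert(slot < kSlotsPerCpu);
        assert(phys_page != 0 && phys_page % kPageSize == 0);
        PerCpuWindow& w = windows[cpu];
        if (w.phys[slot] != phys_page) {
            // Only a previously valid translation can be stale in the TLB,
            // and only this CPU ever used it.
            if (w.phys[slot] != 0) ++w.local_invalidations;
            w.phys[slot] = phys_page;
        }
        return kWindowBase + static_cast<std::uint64_t>(cpu) * kWindowSize
             + slot * kPageSize;
    }
};
```

Two CPUs mapping the same physical page get different virtual addresses, which is why the thread must not migrate between map and unmap.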
<ddevault> note that I have a non-preemptable microkernel, though SMP of course is still an issue
dude12312414 has joined #osdev
<geist> if it's non preemptable, how do you handle SMP? do you only allow one cpu in the kernel at a time?
<ddevault> maybe I can have a pool of page tables to work with to avoid modifying them too often and let old ones stick throughout several operations
dude12312414 has quit [Remote host closed the connection]
<ddevault> we don't have SMP yet, but no, it's only (going to be) non-preemptable on a per-CPU basis
<ddevault> there will still be multiple threads in the kernel at a time
<geist> so non preemptable as in it'll be reentrant but non preemptable
<ddevault> but once a CPU enters the kernel it won't leave until the syscall or interrupt is done
<geist> reentrant in the sense that multiple cpus may be active at the same time
<ddevault> yeah, more or less
<geist> okay, so the problem remains then: how do you fiddle with kernel address space without excessive TLB shootdowns
<geist> other than 'get a better cpu'
<ddevault> hah
<geist> BTW I'm assuming you're on x86, correct?
<ddevault> aye, though we'll have riscv64 soon enough
<ddevault> I think a pool of temporary page tables might be the ticket
<geist> yah and sadly riscv has the same problem
<geist> the alternative is to do a per page lookup and then do memcpy against the raw page
<ddevault> so that you need to have >N processes doing IPC or such at the same time before the pool runs out and it starts having to deal with TLB misses
<geist> which you map into the kernel via a more global mechanism, including just linearly mapping all of memory (if you're on 64bit you can mostly guarantee that)
<ddevault> well, we do identity map 64 GiB
<ddevault> and to be fair I could just say "more than 64G of RAM is a problem for future me"
<geist> so yeah that's another strategy: find the mapping in user space and then copy directly into the physical mapping
<ddevault> that's another thing, we could also just short circuit the page mappings for any physical address which is identity mapped
<geist> if the copy source or destination is always active at the time (IPC from or to the thread that's active) then only one end point of the copy has to be against the physmap
<ddevault> then only high memory has to worry about TLB
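ddevault's pool of temporary mappings could look roughly like the sketch below. It is a user-space model under assumptions: the pool size, the least-recently-used replacement policy, and the name `TempMapPool` are all hypothetical, and counters stand in for the actual page-table write and TLB invalidation. A hit reuses an existing mapping for free; only evicting a live slot costs an invalidation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPoolSlots = 4; // hypothetical "N" from the discussion

class TempMapPool {
public:
    // Returns the slot that maps phys_page, reusing a live mapping if any.
    std::size_t acquire(std::uint64_t phys_page) {
        assert(phys_page % 4096 == 0);
        ++clock_;
        std::size_t victim = 0;
        for (std::size_t i = 0; i < kPoolSlots; ++i) {
            if (slots_[i].valid && slots_[i].phys == phys_page) {
                slots_[i].last_use = clock_; // hit: no page-table write, no flush
                ++hits_;
                return i;
            }
            // Unused slots have last_use == 0, so they are picked first.
            if (slots_[i].last_use < slots_[victim].last_use) victim = i;
        }
        // Miss: replacing a live mapping means its old translation must be
        // invalidated before the slot is reused.
        if (slots_[victim].valid) ++invalidations_;
        slots_[victim] = Slot{phys_page, clock_, true};
        return victim;
    }

    unsigned hits() const { return hits_; }
    unsigned invalidations() const { return invalidations_; }

private:
    struct Slot {
        std::uint64_t phys = 0;
        std::uint64_t last_use = 0;
        bool valid = false;
    };
    std::array<Slot, kPoolSlots> slots_{};
    std::uint64_t clock_ = 0;
    unsigned hits_ = 0;
    unsigned invalidations_ = 0;
};
```

With four slots, two processes ping-ponging over the same pair of buffers never evict anything; invalidations only start once more than four distinct pages are in flight.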
gog has quit [Quit: byee]
<ddevault> it's a rendezvous model
<ddevault> so one thread is blocked and the other is in a syscall
gog has joined #osdev
<geist> right then it turns into a different problem: instead of temporarily mapping buffers into the kernel, you're temporarily mapping physical pages into the kernel
<geist> the latter is a more generalized problem and cant be nicely solved
<geist> for example if you need page 1 and 27 and 32 you can use large pages to map 0-255 and happen to get it in one shot, etc
<ddevault> I think for now I'll just have the memory enumeration code peace out if the physical address is >64G
<ddevault> with a comment saying // So you want Helios to support more than 64GiB? Great! You can deal with the problems
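The physmap shortcut and the 64 GiB bail-out can be sketched as follows. The names `phys_to_virt` and `region_supported` and the base address are assumptions for illustration; with a true identity map the base would simply be 0. Note that the check is on physical addresses, not on the amount of RAM installed.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

constexpr std::uint64_t kPhysmapSize = 64ull << 30;           // 64 GiB are mapped
constexpr std::uint64_t kPhysmapBase = 0xffff800000000000ull; // hypothetical base

// Translate a physical address through the linear mapping, or report that
// it lies outside the mapped range and needs a temporary mapping instead.
std::optional<std::uint64_t> phys_to_virt(std::uint64_t phys) {
    if (phys >= kPhysmapSize) return std::nullopt;
    return kPhysmapBase + phys;
}

// Memory enumeration check: accept a region only if every byte of it falls
// inside the physmap. Written to avoid overflow in base + length.
bool region_supported(std::uint64_t base, std::uint64_t length) {
    assert(length > 0);
    return base < kPhysmapSize && length <= kPhysmapSize - base;
}
```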
terminalpusher has quit [Remote host closed the connection]
mrvn has joined #osdev
<ddevault> though, bleh, who says that a system with <64G of RAM won't map it at addresses >64G
terminalpusher has joined #osdev
<geist> yah note that on 64bit machines you have a lot of headroom there. realistically you can chew up a sizable chunk of the kernel to get you a TB or so before it starts getting tight
<geist> depending on how much space you want to reserve for it and what arch you're on
<geist> since 64bit systems usually have something like 47 or 48 bits of kernel address space
<geist> and can usually use 1GB pages to map stuff like that
<ddevault> yeah that's a point
<ddevault> are huge pages treated differently by the TLB?
<geist> i'm generally a fan of the physmap strategy, despite it being somewhat of a security issue in general
<geist> generally they're more efficient yes
<geist> use less TLB entries
<ddevault> nice
terminalpusher has quit [Remote host closed the connection]
<ddevault> how do those actually work, by the way
terminalpusher has joined #osdev
<ddevault> since a virtual address defines a series of indices into page tables that ultimately siphons out 4K portions of address space
<geist> what part specifically?
<ddevault> so address + 4K is the next entry in the page table
<ddevault> do you have to allocate them sparsely or something if the page size is >4K?
<clever> ddevault: the page tables are a tree
<geist> what are these pronouns referring to precisely?
<ddevault> err, I see it now
<ddevault> the PD has the page size bit, not the PT
<clever> ddevault: so if you go one level up in the tree, each slot refers to a larger chunk of ram
<ddevault> yeah, thanks clever
<geist> right. what clever is saying. it's implied by the depth of the tree you're on
<geist> x86 has kinda silly terminology, i like more of the ARM strategy where they say it's a L0-L3 page table, and a page is a 'terminal page table entry'
<geist> so the level you're at where you hit a terminal entry is how large the page is
<geist> first level? 512GB. next level? 1GB. next level? 2MB. next level? 4K
<geist> which if you do the log 2 math is 12 bits = 4k. +9 = 21 bits = 2MB. +9 = 30 bits = 1GB. +9 = 39 bits = 512GB
<ddevault> the way I like to think about it is just dicing up each series of bits in a virtual address as an index into a page table
<ddevault> thinking of the address space more discretely than continuously
<geist> in the case of a 5th level it'd be 39 + 9 = 48 bits = 256 TB
<geist> yep. precisely. and the shift of 9 bits is because each page table (12 bits long in this case, because 4K) has 8 byte entries, which is 3 bits. so it's 12 - 9
<geist> ie, each section of the split of the address is 9 bits wide
<geist> 512 entries
<geist> s/12 - 9/12-3=9/
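The arithmetic geist walks through (12 offset bits, then 9 index bits per level) can be checked mechanically. A minimal sketch, assuming 4K pages and 8-byte entries as in the discussion; the names `entry_span`, `Walk`, and `split` are made up for the example, and levels are counted from the leaf table upward.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned kPageShift = 12; // 4K pages
constexpr unsigned kIndexBits = 9;  // 4096-byte table / 8-byte entries = 512 slots
constexpr unsigned kLevels = 4;

// Bytes covered by one entry at `level`, counted from the leaf table (0).
constexpr std::uint64_t entry_span(unsigned level) {
    return 1ull << (kPageShift + kIndexBits * level);
}
static_assert(entry_span(0) == 4096ull);      // 12 bits = 4K
static_assert(entry_span(1) == 2ull << 20);   // 21 bits = 2MB
static_assert(entry_span(2) == 1ull << 30);   // 30 bits = 1GB
static_assert(entry_span(3) == 512ull << 30); // 39 bits = 512GB
static_assert(entry_span(4) == 256ull << 40); // 48 bits = 256TB (5th level)

struct Walk {
    std::array<unsigned, kLevels> index; // index[3] selects in the top-level table
    unsigned offset;                     // byte offset within the 4K page
};

// Dice a virtual address into per-level table indices plus page offset.
Walk split(std::uint64_t va) {
    assert((va >> 48) == 0); // the demo only handles lower-half addresses
    Walk w{};
    for (unsigned level = 0; level < kLevels; ++level) {
        w.index[level] = static_cast<unsigned>(
            (va >> (kPageShift + kIndexBits * level)) & 0x1ff);
    }
    w.offset = static_cast<unsigned>(va & 0xfff);
    return w;
}
```

A terminal entry found at level 1 or 2 is a 2MB or 1GB page; the lower index fields then become part of the offset rather than selecting further tables.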
<ddevault> it did take me a while to grok page tables, though
<ddevault> for some reason they didn't click
<geist> yah i've seen it a lot. takes a while for a lot of folks to finally have it click
<geist> but usually they have an ah-ha moment
<clever> i opted to go the simple route, a single layer of paging tables
<bslsk05> ​github.com: rpi-open-firmware/mmu.c at master · librerpi/rpi-open-firmware · GitHub
<geist> helps for arches that support it, but lots of them dont support terminal entries at the top level
<clever> from memory, each slot in this layer is 1mb, 4096 slots total for 4gig of virt space
<clever> that way, i dont have to deal with allocating a bunch of tables for the next level
<clever> and the linker can allocate the 1st layer
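clever's single-level scheme can be sketched like this. It is not the code from the linked mmu.c: the entry layout and the `kValid` bit are hypothetical, chosen only to show the indexing. One statically allocated table of 4096 slots, each covering 1 MiB, spans the whole 4 GiB space with no second-level tables to allocate.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

constexpr unsigned kSectionShift = 20; // each slot covers 1 MiB
constexpr std::size_t kSlots = 4096;
static_assert((static_cast<std::uint64_t>(kSlots) << kSectionShift) == (4ull << 30),
              "4096 slots of 1 MiB cover 4 GiB");

constexpr std::uint32_t kValid = 1u; // hypothetical "present" bit, not a real format
constexpr std::uint32_t kSectionMask = ~((1u << kSectionShift) - 1);

// Static storage: the linker places the table, nothing is allocated at runtime.
std::array<std::uint32_t, kSlots> g_table{};

void map_section(std::uint32_t va, std::uint32_t pa) {
    assert((va & ~kSectionMask) == 0 && (pa & ~kSectionMask) == 0); // 1 MiB aligned
    g_table[va >> kSectionShift] = pa | kValid;
}

std::optional<std::uint32_t> translate(std::uint32_t va) {
    std::uint32_t entry = g_table[va >> kSectionShift];
    if (!(entry & kValid)) return std::nullopt;
    return (entry & kSectionMask) | (va & ~kSectionMask);
}
```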
<geist> i forget if riscv defines that 512GB pages work in SV48
gog has quit [Quit: byee]
foudfou has quit [Quit: Bye]
<clever> i'm cheating by using a 32bit system
<clever> so i dont have to worry about 512gig pages, lol
foudfou has joined #osdev
<clever> and on the subject of what we were discussing yesterday
<geist> note 4MB pages didn't come along for a while in x86
<geist> pentium era, iirc
<bslsk05> ​fuchsia.googlesource.com: zircon/kernel/arch/arm64/include/arch/arm64/mmu.h - fuchsia - Git at Google
<clever> i can see how this decodes as normal memory, uncached in both inner&outer
<clever> but i cant see where it says to be write combined
<geist> oh i dont remember, i think it's implied by it being normal memory vs device memory where it switches to a new model of the nGnRnE stuff
<PapaFrog> Pentium Pro, IIRC.
<geist> B2.7 if you have the latest ARM ARM, talks about "Memory Types and attributes"
<geist> which lays down the ground rules for the fundamental difference between normal memory and device memory
<geist> and then within those classifications what the different sub bits mean
<clever> *looks*
<geist> it says at some point that device-GRE is pretty much the same thing as normal uncached memory, *except* the cpu is not allowed to speculatively fetch it
<clever> The Normal memory type attribute applies to most memory in a system. It indicates that the hardware is permitted
<clever> by the architecture to perform Speculative data read accesses to these locations, regardless of the access permissions
<clever> for these locations.
<clever> for my version of the doc, it starts by stating that normal memory can be prefetched
<geist> right
<clever> and that doesnt seem to care if it can be cached or not
<geist> thats sort of the fundamental difference. the lowest tier of normal memory (uncached) is sort of like the least restricted version of device memory (device-GRE) except the latter cannot be prefetch/speculatively accessed
<geist> so they almost overlap
<clever> yeah
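For reference, the attribute bytes being decoded here can be composed as below. The byte encodings are the ones the Arm ARM defines for MAIR_ELx; which index gets which attribute is a hypothetical choice for this sketch and not the assignment used in the linked header. Nothing in the 0x44 byte names write combining, which matches geist's reading that the behaviour comes with the Normal memory type itself.

```cpp
#include <cassert>
#include <cstdint>

// Attribute byte encodings for MAIR_ELx.
constexpr std::uint8_t kDevice_nGnRnE      = 0x00;
constexpr std::uint8_t kDevice_nGnRE       = 0x04;
constexpr std::uint8_t kDevice_nGRE        = 0x08;
constexpr std::uint8_t kDevice_GRE         = 0x0c;
constexpr std::uint8_t kNormalNonCacheable = 0x44; // outer NC (high nibble), inner NC (low)
constexpr std::uint8_t kNormalWriteBack    = 0xff; // outer+inner write-back, R/W allocate

// Place one attribute byte into field `index` (0..7) of the 64-bit register.
constexpr std::uint64_t mair_field(unsigned index, std::uint8_t attr) {
    return static_cast<std::uint64_t>(attr) << (index * 8);
}

// Hypothetical index assignment, chosen only for this example.
constexpr std::uint64_t kMair = mair_field(0, kNormalWriteBack)
                              | mair_field(1, kNormalNonCacheable)
                              | mair_field(2, kDevice_nGnRnE)
                              | mair_field(3, kDevice_GRE);

std::uint8_t mair_lookup(std::uint64_t mair, unsigned index) {
    assert(index < 8);
    return static_cast<std::uint8_t>((mair >> (index * 8)) & 0xff);
}

// Device memory types have a zero high nibble; Normal types do not.
bool is_device(std::uint8_t attr) { return (attr & 0xf0) == 0; }
```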
<clever> ive also noticed, MAIR has 4 different aliases
<clever> PRRR, MAIR0+MAIR1 (that one is known), and NMRR
<geist> that's all 32bit nonsense
<geist> never heard of the prrr and nmrr but not surprised
<mrvn> Except now comes ARM and has those contiguous pages. E.g. 16k pages with 4k granularity that take up 4 entries in the page table.
<geist> yep
<zid> PRRRRRR is a good register
<mrvn> That is kind of the like the initial idea about having to space out entries.
<mrvn> -the
<geist> it's kinda a freebie, except it adds some amount of software complexity
<geist> so it's sort of an opt in
* clever reads PRRR, Primary Region Remap Register
<geist> it's some arm32 shit
<geist> basically MAIR_EL1 in 64bit mode cleans all that up
<clever> yeah
<geist> arm32 had a good 30 year run to build up some legacy as they added new features and had to cram bits in new registers, etc
<clever> it seems to be tied to whatever TTBCR.EAE is
<geist> and now arm64 has had a 12 year run to start picking up new stuff
<clever> TTBCR, Translation Table Base Control Register
<clever> Extended Address Enable. The meanings of the possible values of this bit are:
<j`ey> geist: soon arm64 will be a teenager
<clever> j`ey: how long until it can drink? lol
<j`ey> 6 years, since its in the UK :P
<sbalmos> j`ey: Does that mean, even in Supervisor Mode, you'll start randomly getting spasms and a new "you can't make me!" bit set?
<j`ey> hah
<sbalmos> or is that where the compiler says "I hate you! You're so stupid!" instead of compiler errors?
<clever> sbalmos: the ultimate "it wont let me"!
<j`ey> gcc and llvm are way past that!
<clever> i see that problem from a lot of noobs, who describe any error as "it wont let me" and dont bother saying what the error is
<sbalmos> clever: like "why's my memory corrupt"?
<clever> sbalmos: no, even dumber, they mkdir /mnt/data/foo, then ask why it wont let them mkdir /data/foo/bar
GeDaMo has quit [Quit: There is as yet insufficient data for a meaningful answer.]
<clever> but they omit enough details, that it takes an hour to realize that
<clever> bbl
<mrvn> clever: how do I create a directory without mkdir?
hysv has quit [Remote host closed the connection]
heat has joined #osdev
<clever> mrvn: i can see how you might do it with a text editor, gcc, the syscall function, and the right numbers, lol
<PapaFrog> Solution.. mkdir -p
<clever> run the mkdir syscall, without ever typing mkdir!
<PapaFrog> Maybe add sudo?
<heat> sup doofuses
<mrvn> clever: you missed the point. That's what noobs always ask.
<heat> that's an easy question
<heat> mknod
<heat> NEXT
<\Test_User> hexeditor on /dev/sda
<mrvn> How do I do X without that thing that was specifically made to do X because nothing else would do it?
<heat> well, they don't know if there's another way
<heat> the question is valid
<clever> \Test_User: better umount the disk first! lol
<\Test_User> my answer should work for basically any question about how to do x without x
<\Test_User> lol yeah
<\Test_User> "basically" meaning it doesn't do hexeditor without a hexeditor :P
<mrvn> heat: then they explain why they ask: I know 'mkdir' was specifically designed to create directories in the best way possible. But isn't there something better out there?
<heat> and there is
<heat> mkdirat
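heat's mkdirat answer, sketched with the POSIX calls `open`, `mkdirat`, and `close`. The wrapper name `make_dir_in` is made up for the example. It also shows the failure clever described earlier: a missing parent is reported as ENOENT, which is the detail people tend to leave out of "it wont let me".

```cpp
#include <cassert>
#include <cerrno>
#include <string>

#include <fcntl.h>    // open, O_RDONLY, O_DIRECTORY
#include <sys/stat.h> // mkdirat
#include <unistd.h>   // close

// Create directory `name` relative to the directory `parent`.
// Returns 0 on success, otherwise the errno value of the failing call.
int make_dir_in(const std::string& parent, const std::string& name) {
    assert(!name.empty());
    int dirfd = open(parent.c_str(), O_RDONLY | O_DIRECTORY);
    if (dirfd < 0) return errno; // e.g. ENOENT when the parent does not exist
    int rc = mkdirat(dirfd, name.c_str(), 0755);
    int err = (rc == 0) ? 0 : errno; // save errno before close() can clobber it
    close(dirfd);
    return err;
}
```

Compared with plain mkdir, the directory file descriptor pins the parent, so a concurrent rename of the parent path cannot redirect where the new directory lands.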
<heat> this is why these questions aren't quite stupid
<mrvn> for the purpose of this example that is the same
<heat> you need to give me a concrete example
<heat> also, not everyone knows quite as much as you
gildasio has quit [Remote host closed the connection]
foudfou has quit [Remote host closed the connection]
foudfou has joined #osdev
gildasio has joined #osdev
<mrvn> heat: I will let you know next time I see it on stackoverflow
<bslsk05> ​stackoverflow.com: c++ - unordered_multimap element output order is weird - Stack Overflow
<mrvn> Obviously it's called unordered because it has the items in-order of input.
<heat> ok that's easy
<heat> they thought it's called unordered because the map doesn't sort itself
<heat> this is not a stupid question, just a lack of knowledge
<mrvn> except they experimentally proved it's not in input order
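What the container does and does not promise can be sketched as below. The helper names are made up for the example. Iteration order of `std::unordered_multimap` is unspecified, so neither insertion order nor sorted order can be relied on; the standard does guarantee that elements with equivalent keys are adjacent, and `equal_range` returns exactly that run.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

using Map = std::unordered_multimap<std::string, int>;

// Values stored under `key`, sorted so the result does not depend on the
// container's unspecified iteration order.
std::vector<int> values_for(const Map& m, const std::string& key) {
    auto range = m.equal_range(key);
    std::vector<int> out;
    for (auto it = range.first; it != range.second; ++it) out.push_back(it->second);
    std::sort(out.begin(), out.end());
    return out;
}

// Checks the one ordering guarantee there is: equal keys sit next to each
// other, i.e. a key never reappears after a different key has been seen.
bool keys_are_grouped(const Map& m) {
    std::vector<std::string> finished;
    const std::string* current = nullptr;
    for (const auto& kv : m) {
        if (current && *current == kv.first) continue;
        if (std::find(finished.begin(), finished.end(), kv.first) != finished.end())
            return false;
        if (current) finished.push_back(*current);
        current = &kv.first;
    }
    return true;
}
```

If a stable order is needed, `std::multimap` (sorted by key) or a vector of pairs (insertion order) are the usual alternatives.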
foudfou has quit [Remote host closed the connection]
foudfou has joined #osdev
dennis95 has quit [Quit: Leaving]
gog has joined #osdev
ethrl has quit [Quit: WeeChat 3.4.1]
<kingoffrance> ive seen "set" used as implying order and uniqueness, but that might be a math thing
<kingoffrance> *unique elements
<gog> yes
<mrvn> since when are math sets ordered?
<mrvn> "What do you want your password to be?" "How about me-ma's birthday? January 2nd 34." "So just 1234?"
* kingoffrance thats even better holds up road sign "good luck" arrows pointing in all directions
<kingoffrance> theres some movie a guy on a ship is directing the plane and waving it all over...cue plane crash lol
terminalpusher has quit [Remote host closed the connection]
<bslsk05> ​'Roger Roger - Airplane! (8/10) Movie CLIP (1980) HD' by Movieclips (00:01:37)
<geist> heh what a silly movie
<geist> though i still think Top Secret! is their best
<PapaFrog> I picked the wrong day to quit sniffing glue.
<sortie> surely airplane! is best
<PapaFrog> Yes it is, and don't call me Shirley!
<sortie> :D
<mrvn> PapaFrog: shirley you are joking.
<mats1> yes daddy
<gorgonical> Tantalizingly close to a working printk
<gorgonical> I should probably change these printks to early printk tho because cpuid isn't available yet and it's a nullptr deref
<gorgonical> Seems like spinlocks are wrong atm lol
<gorgonical> Locking to obtain console access and never passing the acquire
<mrvn> Your console shouldn't have any methods. The lock guard object you get from acquire should have methods so you can't call them without lock held at all.
<heat> this seems like linux
<gorgonical> It is stolen from linux, yes
<gorgonical> It isn't console.lock or anything though
nyah has quit [Quit: leaving]
zaquest has quit [Remote host closed the connection]
gildasio has quit [Quit: WeeChat 3.5]
zaquest has joined #osdev
<mrvn> Why isn't that a thing in the STL. Like `std::mutexed<T>` with `acquire()` that returns a proxy object holding the mutex and letting you cann `o->method(bla);`
<mrvn> s/cann/call//
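A minimal sketch of the wrapper mrvn is asking for. `Mutexed` and `Guard` are hypothetical names; nothing like this is in the standard library. The wrapped object is private, so the only way to reach its methods is through the guard returned by `acquire()`, which holds the mutex for as long as it lives.

```cpp
#include <mutex>
#include <string>
#include <utility>

template <typename T>
class Mutexed {
public:
    template <typename... Args>
    explicit Mutexed(Args&&... args) : value_(std::forward<Args>(args)...) {}

    // Proxy that owns the lock; the wrapped object is reachable only here.
    class Guard {
    public:
        T* operator->() { return &owner_.value_; }
        T& operator*() { return owner_.value_; }

    private:
        friend class Mutexed;
        explicit Guard(Mutexed& owner) : owner_(owner), lock_(owner.mutex_) {}
        Mutexed& owner_;
        std::unique_lock<std::mutex> lock_; // released when the guard dies
    };

    Guard acquire() { return Guard(*this); }

private:
    std::mutex mutex_;
    T value_;
};
```

Usage would look like `auto g = console.acquire(); g->append("...");`. For a one-off call, `console.acquire()->size()` locks and unlocks around the single expression. Holding a guard while calling `acquire()` again on the same object from the same thread deadlocks, as with any non-recursive mutex.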
<heat> cuz the stl is crap
<heat> there it is
<heat> I said it
<heat> why do STL types take deleters and allocators as template arguments instead of an argument you pass to the constructor
<heat> does your unique_ptr have a different deleter? not the same type, sorry!
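heat's complaint, made concrete. `CountingDeleter` and `make_counted` are made up for the example. With `std::unique_ptr` the deleter is part of the type, so two pointers to the same `T` with different deleters are unrelated types; `std::shared_ptr` takes the deleter as a constructor argument and type-erases it, so both end up as plain `std::shared_ptr<int>`.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

struct CountingDeleter {
    int* counter; // where to record that a deletion happened
    void operator()(int* p) const {
        assert(counter != nullptr);
        ++*counter;
        delete p;
    }
};

using Plain = std::unique_ptr<int>;
using Counted = std::unique_ptr<int, CountingDeleter>;

// Same pointee type, different deleter: not the same type.
static_assert(!std::is_same_v<Plain, Counted>);

// shared_ptr stores the deleter in its control block instead, so the
// result has the same type as any other shared_ptr<int>.
std::shared_ptr<int> make_counted(int value, int* counter) {
    return std::shared_ptr<int>(new int(value), CountingDeleter{counter});
}
```

The usual argument for the `unique_ptr` design is that a stateless deleter in the type costs nothing at runtime, whereas the constructor-argument approach needs somewhere to store the deleter, as `shared_ptr` does. For allocators, the `std::pmr` containers added in C++17 take the runtime route and accept a memory resource at construction.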
<heat> have you guys compiled a modern linux kernel on 400MHz?
<heat> it's a great experience