michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.1.1 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
derpydoo has quit [Quit: derpydoo]
Mirarora has joined #ffmpeg-devel
<haasn>
ramiro: I just realized there is some overlap between OP_READ { u8, packed } and OP_READ { u32} + OP_UNPACK { 8, 8, 8, 8 }
<haasn>
but maybe it makes sense to keep packed reads around as a distinct op type if it makes it easier to compile them to a strided load
<haasn>
the real reason to have them is that rgb24 can't possibly be implemented as a packed read + unpack operation
<haasn>
without imposing additional restrictions on e.g. the block sizes
<haasn>
and in general I want to strongly avoid pixel-mixing inside the pipeline
<haasn>
it would also introduce the concept of a variable block size, which is a headache I'd rather avoid for now
<BtbN>
The RGBX formats are pretty common for that reason
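A minimal scalar sketch of the overlap being described, assuming little-endian byte order; the function names are illustrative and not part of swscale:

```c
#include <stdint.h>
#include <string.h>

/* Two equivalent ways to fetch one rgbx-style pixel group. rgb24 has no such
 * equivalence: there is no u24 type to read, so it can't be rewritten as
 * read + unpack without mixing pixels or varying the block size. */
static void read_packed_u8x4(const uint8_t *src, uint8_t out[4])
{
    for (int i = 0; i < 4; i++)      /* OP_READ { u8, packed, elems = 4 } */
        out[i] = src[i];
}

static void read_u32_unpack(const uint8_t *src, uint8_t out[4])
{
    uint32_t v;
    memcpy(&v, src, sizeof(v));      /* OP_READ { u32 } */
    out[0] = v       & 0xFF;         /* OP_UNPACK { 8, 8, 8, 8 }, */
    out[1] = v >>  8 & 0xFF;         /* assuming little-endian layout */
    out[2] = v >> 16 & 0xFF;
    out[3] = v >> 24 & 0xFF;
}
```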
minimal has quit [Quit: Leaving]
<ramiro>
haasn: OP_READ { u8, packed } is one instruction on neon, it's better to keep it
<haasn>
I mean you can compile the sequence OP_READ + OP_UNPACK to the same instruction
<haasn>
it's just more trouble
<haasn>
compile() can consume multiple ops
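One way compile() could consume both ops and still emit the single strided load ramiro wants on NEON; all type and field names here are stand-ins rather than the real ops API:

```c
/* Stand-in types for illustration only */
enum { OP_READ, OP_UNPACK };
enum { TYPE_U8, TYPE_U32 };

typedef struct DemoOp {
    int op, type;
    int unpack[4];          /* e.g. { 8, 8, 8, 8 } */
} DemoOp;

/* Returns how many ops were consumed, so the driver skips fused ops. */
static int compile_next(const DemoOp *ops, int nb_ops)
{
    if (nb_ops >= 2 &&
        ops[0].op == OP_READ   && ops[0].type == TYPE_U32 &&
        ops[1].op == OP_UNPACK &&
        ops[1].unpack[0] == 8  && ops[1].unpack[1] == 8 &&
        ops[1].unpack[2] == 8  && ops[1].unpack[3] == 8) {
        /* emit the same deinterleaving load a packed u8 read would get,
         * e.g. ld4 on NEON */
        return 2;
    }
    /* emit a plain load for ops[0] */
    return 1;
}
```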
<ramiro>
haasn: btw I just got matrix3+off3 working with asmjit. I haven't done dithering or clamping yet, so there's a tiny bit of loss in quality. but it's already 2x faster. that's pretty impressive...
<ramiro>
I still want to write the loop code (horizontal and vertical) in asmjit, so it's just one call and it doesn't have to keep looping in C. this way I can factor out some constants, which should make it even faster.
<haasn>
loop code?
<ramiro>
I'll leave dithering for tomorrow. I'm too tired right now.
<haasn>
you mean the run_op_func()?
<haasn>
I was thinking of ways we can reduce the per-dispatch overhead more; I did try moving the image pointers to registers so the read ops don't have to indirect load them after the calling code increments them from the previous iteration
<ramiro>
haasn: isn't it called run_op_pass? but yes, that one.
<haasn>
oh yeah
<haasn>
but I'm not sure if it did anything except make the asm code easier to write
<haasn>
the idea I just got was to have the kernel entrypoint be in charge of setting up the context
<haasn>
so that the calling code doesn't have to do anything except increment a register and call it
<haasn>
but I'm not sure that will accomplish anything
<haasn>
ramiro: you do technically know the image dimensions at compile time
<haasn>
so you have that advantage
<haasn>
could unroll the whole loop in theory
<ramiro>
haasn: unrolling might be a bit too much :)
<haasn>
well, for one line :)
<haasn>
oh, one advantage of having the first op do setup: we can skip incrementing pointers for unused planes :)
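Roughly the per-line driver being discussed, as a hedged C sketch with invented struct and function names; the last loop is the "skip incrementing pointers for unused planes" part:

```c
#include <stdint.h>

typedef struct DemoExecCtx {
    const uint8_t *in[4];
    uint8_t       *out[4];
    unsigned       used_planes;   /* bitmask of planes the kernel touches */
} DemoExecCtx;

typedef void (*demo_kernel_fn)(DemoExecCtx *ctx, int block_x);

static void run_rows(DemoExecCtx *ctx, demo_kernel_fn kernel,
                     int blocks_per_row, int height,
                     const int in_stride[4], const int out_stride[4])
{
    for (int y = 0; y < height; y++) {
        for (int bx = 0; bx < blocks_per_row; bx++)
            kernel(ctx, bx);              /* per-dispatch overhead lives here */

        for (int i = 0; i < 4; i++) {     /* only advance planes in use */
            if (!(ctx->used_planes & (1u << i)))
                continue;
            ctx->in[i]  += in_stride[i];
            ctx->out[i] += out_stride[i];
        }
    }
}
```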
thilo has quit [Ping timeout: 248 seconds]
thilo has joined #ffmpeg-devel
Anthony_ZO has joined #ffmpeg-devel
Anthony_ZO has quit [Ping timeout: 252 seconds]
abdu67 has joined #ffmpeg-devel
abdu16 has joined #ffmpeg-devel
abdu67 has quit [Ping timeout: 240 seconds]
<haasn>
ramiro: I checked, on my laptop, pure memcpy does not appear to be faster than the x86 SIMD
<haasn>
for plane->plane copies
<haasn>
actually, for some reason, straight memcpy is _slower_
<haasn>
don't ask me why or how
<haasn>
oh, I can explain why
<haasn>
because our code loads the source plane once and writes it multiple times
<haasn>
whereas memcpy has to load the source plane again on every copy :)
<haasn>
(this is for gray -> gbrp)
<haasn>
for yuv444p -> yuva444p, our code is ~3% slower than memcpy+memset
<haasn>
probably not worth maintaining a dedicated memcpy backend
<haasn>
then again, it is already written..
<haasn>
Of course, the real goal there would be to promote the memcpy to a refcopy
<haasn>
I think we can extend the API of SwsGraph slightly to make that a possibility
<haasn>
Just requires adding an AVBufferRef to the planes
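What the refcopy promotion could look like once the graph carries an AVBufferRef per plane; the per-plane ref is the proposed API extension, so this is a sketch rather than existing swscale code:

```c
#include <string.h>
#include <stdint.h>
#include <libavutil/buffer.h>
#include <libavutil/error.h>

static int copy_or_ref_plane(uint8_t **dst, AVBufferRef **dst_ref,
                             uint8_t *src, AVBufferRef *src_ref,
                             int stride, int height)
{
    if (src_ref) {
        /* refcopy: no data is moved, we just take another reference */
        AVBufferRef *ref = av_buffer_ref(src_ref);
        if (!ref)
            return AVERROR(ENOMEM);
        av_buffer_unref(dst_ref);
        *dst_ref = ref;
        *dst     = src;
        return 0;
    }
    /* fallback: plain per-plane memcpy */
    memcpy(*dst, src, (size_t)stride * height);
    return 0;
}
```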
cone-421 has joined #ffmpeg-devel
<cone-421>
ffmpeg Andreas Rheinhardt master:c94143350f49: avutil/libm: Only include intfloat.h when needed
<cone-421>
ffmpeg Andreas Rheinhardt master:9f0970ee35a5: tests/checkasm/videodsp: Don't use declare_func_emms
<Lynne>
wow, sws still didn't rely on refcounting
aaabbb has quit [Changing host]
aaabbb has joined #ffmpeg-devel
jamrial has quit []
pross has quit [Ping timeout: 252 seconds]
ahmedhamed has joined #ffmpeg-devel
derpydoo has joined #ffmpeg-devel
abdu16 has quit [Ping timeout: 240 seconds]
psykose has quit [Ping timeout: 244 seconds]
psykose has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
cone-421 has quit [Quit: transmission timeout]
ahmedhamed has quit [Quit: Connection closed for inactivity]
IndecisiveTurtle has quit [Ping timeout: 265 seconds]
Anthony_ZO has joined #ffmpeg-devel
derpydoo has quit [Ping timeout: 252 seconds]
j45_ has joined #ffmpeg-devel
j45 has quit [Ping timeout: 260 seconds]
j45_ is now known as j45
j45 has joined #ffmpeg-devel
j45 has quit [Changing host]
abdu16 has joined #ffmpeg-devel
mkver has quit [Ping timeout: 252 seconds]
mkver has joined #ffmpeg-devel
Mirarora has quit [Quit: Mirarora encountered a fatal error and needs to close]
Mirarora has joined #ffmpeg-devel
minimal has joined #ffmpeg-devel
lemourin has joined #ffmpeg-devel
lemourin is now known as Guest2349
Guest2349 has quit [Killed (lead.libera.chat (Nickname regained by services))]
IndecisiveTurtle has joined #ffmpeg-devel
Anthony_ZO has quit [Remote host closed the connection]
<fflogger>
[newticket] catap: Ticket #11525 ([undetermined] 60fps stream from Elgato Facecam Pro is frozen) created https://trac.ffmpeg.org/ticket/11525
twelve has joined #ffmpeg-devel
IndecisiveTurtle has quit [Quit: IndecisiveTurtle]
twelve has quit [Remote host closed the connection]
<welder>
How can I switch to digest mode instead of receiving every single email? I can't find such an option in the web UI. I did some back and forth with the mailbot, sent two lines "set authenticate <my password>" and "set digest mime", but I did not get any reply and I'm still receiving every email
<welder>
I don't wish to completely unsubscribe, though
abdu57 has joined #ffmpeg-devel
<haasn>
ramiro: btw, I read that modern CPUs do register renaming for xmm/ymm regs; seems like ivy lake+ and zen2+
<haasn>
so a SWS_OP_SWIZZLE that just moves registers around should be almost free
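Why a pure swizzle ends up nearly free, as a hedged intrinsics sketch: at the register level it is just a reassignment of which vector holds which channel, and move elimination in the renamer handles any leftover reg-reg moves:

```c
#include <emmintrin.h>

/* Illustrative only: a {2,1,0,3} swizzle (e.g. rgba -> bgra) on vectors of
 * channel data is just renaming variables; the compiler emits at most
 * register-to-register moves. */
static inline void swizzle_2103(__m128i ch[4])
{
    __m128i r = ch[0], g = ch[1], b = ch[2], a = ch[3];
    ch[0] = b;
    ch[1] = g;
    ch[2] = r;
    ch[3] = a;
}
```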
Thulinma has joined #ffmpeg-devel
Thul has quit [Read error: Connection reset by peer]
rvalue- has joined #ffmpeg-devel
rvalue has quit [Ping timeout: 252 seconds]
rvalue- is now known as rvalue
<haasn>
for x86 SIMD, what's the best strategy for loading stride-3 data (e.g. rgb24) into separate vector registers?
<haasn>
a bunch of pshufb?
<Lynne>
rrrr, gggg, bbbb, in separate regs?
<nevcairiel>
24bit data just sucks, so shuffling is likely your only option
<Lynne>
nah, you'd be wasting a ton of potential if you took the easy way
<Lynne>
you have to load 3 regs worth of data (doesn't matter if they're 128bit or 256bit), then use a bunch of unpacks and strategic pshufw, vpermilps and blends to get the data in the right order
<Lynne>
you'll have a lot more ideal tools with avx512 but that's still a meme thanks to intel
<Lynne>
pshufb is expensive as it requires wasting a register and a load, whilst vpermil, blends and pshufws take an immediate
iive has joined #ffmpeg-devel
<Lynne>
you start off with: r1g1b1r2 g2b2r3g3 b3r4g4b4 r5g5b5r6 g6b6r7g7 b7r8g8b8 r9g9b9r10 g10b10r11g11
<Lynne>
hmm, pshufb would be the best first step
<Lynne>
since the pattern for r1r2r3... would mirror in the upper half of the reg to yield you g6g7g8...
<Lynne>
then you'd do a bunch of punpck to split up the words, and finally you'd split up the rrrrrrr/gggggg regs in half via vperm2i128
<Lynne>
since the very last component is cut off across different regs, you'd need to blend in the last value
<Lynne>
there's a crazy instruction that might be perfect for this which lets you shift one half of a register and glue it onto another
twelve has joined #ffmpeg-devel
<haasn>
I'll see what I can come up with
<haasn>
for splitting ya8 into yyyy aaaa I imagine the best way is to use a 16-bit AND to turn yayayaya into y0y0y0y0 then use vpackuswb to pack that back to yyyyyy ?
<haasn>
and a right shift by 8 to turn yayayaya into a0a0a0a0
<haasn>
hmm, we should really have the ability to fuse a u8 load and a u8-to-u16 conversion into a single op that just skips the last step
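The AND + shift + packuswb split, sketched with SSE2 intrinsics for 16 ya8 pixels (not FFmpeg code; with 256-bit registers packuswb packs within each 128-bit lane, the quirk noted further down, so an extra cross-lane permute would be needed):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

static void ya8_split16(const uint8_t *src, __m128i *y, __m128i *a)
{
    const __m128i lo8 = _mm_set1_epi16(0x00FF);
    __m128i v0 = _mm_loadu_si128((const __m128i *)(src +  0)); /* y0a0..y7a7   */
    __m128i v1 = _mm_loadu_si128((const __m128i *)(src + 16)); /* y8a8..y15a15 */

    /* yayayaya -> y0y0y0y0 (mask) and a0a0a0a0 (16-bit right shift),
     * then pack each pair of word vectors back down to bytes */
    *y = _mm_packus_epi16(_mm_and_si128(v0, lo8),
                          _mm_and_si128(v1, lo8));
    *a = _mm_packus_epi16(_mm_srli_epi16(v0, 8),
                          _mm_srli_epi16(v1, 8));
}
```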
twelve has quit [Remote host closed the connection]
Marth64 has quit [Ping timeout: 260 seconds]
<Lynne>
why would you and away during that step?
Marth64 has joined #ffmpeg-devel
<Lynne>
just do a byte-level unpack, then you can do a 64-bit level unpack or pshufd to smush two different regs together
<Lynne>
with a final vperm2i128 to split up regs
<Lynne>
pshufw out, r1, r2, q0123 is extremely powerful, since it'll take q01 from r1, and q23 from r2
<haasn>
Lynne: I don't follow - how do unpacking instructions help here? aren't they accomplishing the exact opposite?
Guest11 has joined #ffmpeg-devel
Guest11 has quit [Client Quit]
Guest94 has joined #ffmpeg-devel
<Lynne>
load 2 regs of yayayaya....
<Lynne>
so you have y1a1y2... and the second one has y16a16y17a17...
<Lynne>
pack them together and you have y1y16y2y17...
<haasn>
which instruction is doing that? punpcklbw would do y1y16a1a16...
<haasn>
or are you suggesting to do a word-level unpack after that, followed by a dword-level unpack?
<haasn>
(etc)
<Lynne>
yup, that gets you out of byte-level hell and lets you easily manipulate things with word-level immediate shuffles like pshufw + more unpacks
<haasn>
ah
<haasn>
not sure that will be faster than packuswb+and+srl
Guest94 has quit [Quit: Client closed]
<Lynne>
you use more registers with more parallel operations with this strategy
Guest65 has joined #ffmpeg-devel
Guest65 has quit [Client Quit]
<Gramner>
for splitting packed rgb24 into separate planes: pshufb + punpck(l|h)dq. process 12 input bytes per lane to avoid having to deal with the uneven number of elements at the end
<haasn>
oh, packuswb produces its results in two different lanes for some reason
<haasn>
x86 is weird
<Gramner>
most operations operate on lanes for performance reasons
<Gramner>
cross-lane shuffles have like triple the latency of in-lane shuffles
<haasn>
Gramner: I am operating on a minimum of 48 bytes which is a clean multiple of both 4 and 3, so there is no need to worry about leftover bytes
<Gramner>
I mean if you're trying to keep every single byte of input from each 16-byte lane it will result in some really awkward shuffling to get everything together. just do some extra memory loads so that you only need to care about 12 of them
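One possible shape of Gramner's suggestion in SSSE3 intrinsics, processing 4 pixels (12 useful bytes) per 16-byte load so no pixel straddles a register; this is a sketch rather than FFmpeg code, and the final load reads 4 bytes past the 48 input bytes, which the caller would have to allow for:

```c
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3 */

/* Deinterleave 16 rgb24 pixels into r/g/b byte vectors. */
static void rgb24_to_planar16(const uint8_t *src,
                              __m128i *r, __m128i *g, __m128i *b)
{
    const __m128i shuf = _mm_setr_epi8(
        0, 3, 6, 9,      /* r */
        1, 4, 7, 10,     /* g */
        2, 5, 8, 11,     /* b */
        -1, -1, -1, -1); /* zeroed */

    /* each load covers 4 pixels = 12 useful bytes */
    __m128i p0 = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *)(src +  0)), shuf);
    __m128i p1 = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *)(src + 12)), shuf);
    __m128i p2 = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *)(src + 24)), shuf);
    __m128i p3 = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *)(src + 36)), shuf);

    /* p* = rrrr gggg bbbb 0000 (as dwords); regroup with dword/qword unpacks */
    __m128i rg01 = _mm_unpacklo_epi32(p0, p1); /* r0-3 r4-7 g0-3 g4-7 */
    __m128i rg23 = _mm_unpacklo_epi32(p2, p3); /* r8-11 r12-15 g8-11 g12-15 */
    __m128i b01  = _mm_unpackhi_epi32(p0, p1); /* b0-3 b4-7 0 0 */
    __m128i b23  = _mm_unpackhi_epi32(p2, p3); /* b8-11 b12-15 0 0 */

    *r = _mm_unpacklo_epi64(rg01, rg23);
    *g = _mm_unpackhi_epi64(rg01, rg23);
    *b = _mm_unpacklo_epi64(b01, b23);
}
```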
abdu69 has joined #ffmpeg-devel
abdu57 has quit [Ping timeout: 240 seconds]
tufei has quit [Remote host closed the connection]
Mirarora has quit [Quit: Mirarora encountered a fatal error and needs to close]
Mirarora has joined #ffmpeg-devel
<fflogger>
[editedticket] jb_alvarado: Ticket #11469 ([ffmpeg] ffmpeg_demux: readrate plays "catch up" if output is blocked, then later resumed) updated https://trac.ffmpeg.org/ticket/11469#comment:13
<ramiro>
haasn: that's precisely why I decided to start with neon for jit :P
<haasn>
understandable
<haasn>
I would love to do rvv instead
abdu69 has quit [Quit: Client closed]
abdu69 has joined #ffmpeg-devel
jamrial has joined #ffmpeg-devel
<ramiro>
haasn: sorry if this is a dumb question, but is it normal that the dithering matrix only adds positive numbers?
<ramiro>
( and in which formats is size_log2 == 0 so that it adds 0.5? )
twelve has joined #ffmpeg-devel
<haasn>
ramiro: because SWS_OP_CONVERT truncates
<haasn>
(int) (x + 0.5) is rounding
twelve has quit [Remote host closed the connection]
<haasn>
in the current logic we use rounding (so size=0) for 16-bit output
<haasn>
and dithering for lower-than-16 bit output
<haasn>
but I was thinking we should relax this to do hard truncation unless SWS_ACCURATE_RND is specified
<haasn>
above certain bit depth
<haasn>
that would allow us to do e.g. downconversion from yuv444p16 to yuv444p12 with just an rshift
<haasn>
or upconversion from rgb24 to gbrp12le by using an expand+shift
<haasn>
and we can probably set a lower cut-off to start using a real dither matrix, maybe 14 bit?
<haasn>
incidentally, one major advantage of the new approach is that we can consolidate and centralize these decisions now
<haasn>
because the logic is entirely self-contained in fmt_dither() in formats.c
<haasn>
and it will therefore be consistent for all formats :)
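The policy in scalar form, as a hedged sketch (the real decisions live in fmt_dither() in formats.c; the function names, float type and cut-offs here are illustrative):

```c
#include <stdint.h>

/* SWS_OP_CONVERT truncates, so rounding is expressed by adding 0.5 before
 * the convert, and ordered dithering by adding a matrix entry in [0, 1). */
static uint8_t convert_truncate(float x)             /* plain (int)x */
{
    return (uint8_t)x;
}

static uint8_t convert_round(float x)                /* size_log2 == 0 case */
{
    return (uint8_t)(x + 0.5f);                       /* (int)(x + 0.5) */
}

static uint8_t convert_dither(float x, const float *matrix,
                              int size_log2, int px, int py)
{
    int size = 1 << size_log2;
    float d  = matrix[(py & (size - 1)) * size + (px & (size - 1))];
    return (uint8_t)(x + d);                          /* d in [0, 1) */
}
```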
<haasn>
ramiro: not sure if you saw but I recently pushed swscale5 with a lot more internal ABI changes
<haasn>
for the ops
<haasn>
one notable change is that I decided reads/writes should be planar by default; this mainly changes the case for elems = 1 which is now considered a planar read
<haasn>
(and packed reads are explicitly tagged)
<haasn>
I also think I want to make a memdup of the SwsOpsList before calling into compile() so that compile() can make arbitrary mutations to the op list, not just consuming ops
<haasn>
e.g. also relaxing their types or partially implementing them
<haasn>
I also sort of want to find a way for compile() to return multiple functions at the same time
<haasn>
or maybe make the SwsOpChain into a proper type and let compile() assemble the entire op chain in one call
<haasn>
that will make checkasm easier to implement
<haasn>
I think my next step for now should be to write a simple checkasm for the ops backends so that I can test them without losing my sanity
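The memdup idea as a hedged sketch with a stand-in ops-list type; the point is only that compile() gets a private, mutable copy it can rewrite, split, or fuse at will:

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the real SwsOpsList; field names are invented. */
typedef struct DemoOpsList {
    void  *ops;        /* array of ops */
    int    nb_ops;
    size_t op_size;
} DemoOpsList;

static int compile_with_copy(const DemoOpsList *src /*, backend, out fns */)
{
    DemoOpsList tmp = *src;
    tmp.ops = malloc(src->nb_ops * src->op_size);
    if (!tmp.ops)
        return -1;
    memcpy(tmp.ops, src->ops, src->nb_ops * src->op_size);

    /* backend->compile(&tmp, ...): free to relax types, mutate or consume
     * ops, and potentially emit a whole chain of functions in one call */

    free(tmp.ops);
    return 0;
}
```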
<ramiro>
haasn: preserving sanity is the reason why my productivity has fallen well below 50% in the past few years... totally worth it.
<ramiro>
haasn: "consolidate and centralize these decisions now" <= definitely a win... current libswscale has evolved with multiple diverging paths per specialized converter and very little effort into consodilating them.