michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.1.1 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
derpydoo has quit [Quit: derpydoo]
Mirarora has joined #ffmpeg-devel
<haasn>
ramiro: I just realized there is some overlap between OP_READ { u8, packed } and OP_READ { u32} + OP_UNPACK { 8, 8, 8, 8 }
<haasn>
but maybe it makes sense to keep packed reads around as a distinct op type if it makes it easier to compile them to a strided load
<haasn>
the real reason to have them is that rgb24 can't possibly be implemented as a packed read + unpack operation
<haasn>
without imposing additional restrictions on e.g. the block sizes
<haasn>
and in general I want to strongly avoid pixel-mixing inside the pipeline
<haasn>
it would also introduce the concept of a variable block size, which is a headache I'd rather avoid for now
<BtbN>
The RGBX formats are pretty common for that reason
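A minimal scalar sketch of the overlap being described, assuming little-endian byte order; the function names are illustrative and not part of swscale:

```c
#include <stdint.h>
#include <string.h>

/* Two equivalent ways to fetch one rgbx-style pixel group. rgb24 has no such
 * equivalence: there is no u24 type to read, so it can't be rewritten as
 * read + unpack without mixing pixels or varying the block size. */
static void read_packed_u8x4(const uint8_t *src, uint8_t out[4])
{
    for (int i = 0; i < 4; i++)      /* OP_READ { u8, packed, elems = 4 } */
        out[i] = src[i];
}

static void read_u32_unpack(const uint8_t *src, uint8_t out[4])
{
    uint32_t v;
    memcpy(&v, src, sizeof(v));      /* OP_READ { u32 } */
    out[0] = v       & 0xFF;         /* OP_UNPACK { 8, 8, 8, 8 }, */
    out[1] = v >>  8 & 0xFF;         /* assuming little-endian layout */
    out[2] = v >> 16 & 0xFF;
    out[3] = v >> 24 & 0xFF;
}
```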
minimal has quit [Quit: Leaving]
<ramiro>
haasn: OP_READ { u8, packed } is one instruction on neon, it's better to keep it
<haasn>
I mean you can compile the sequence OP_READ + OP_UNPACK to the same instruction
<haasn>
it's just more trouble
<haasn>
compile() can consume multiple ops
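One way compile() could consume both ops and still emit the single strided load ramiro wants on NEON; all type and field names here are stand-ins rather than the real ops API:

```c
/* Stand-in types for illustration only */
enum { OP_READ, OP_UNPACK };
enum { TYPE_U8, TYPE_U32 };

typedef struct DemoOp {
    int op, type;
    int unpack[4];          /* e.g. { 8, 8, 8, 8 } */
} DemoOp;

/* Returns how many ops were consumed, so the driver skips fused ops. */
static int compile_next(const DemoOp *ops, int nb_ops)
{
    if (nb_ops >= 2 &&
        ops[0].op == OP_READ   && ops[0].type == TYPE_U32 &&
        ops[1].op == OP_UNPACK &&
        ops[1].unpack[0] == 8  && ops[1].unpack[1] == 8 &&
        ops[1].unpack[2] == 8  && ops[1].unpack[3] == 8) {
        /* emit the same deinterleaving load a packed u8 read would get,
         * e.g. ld4 on NEON */
        return 2;
    }
    /* emit a plain load for ops[0] */
    return 1;
}
```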
<ramiro>
haasn: btw I just got matrix3+off3 working with asmjit. I haven't done dithering or clamping yet, so there's a tiny bit of loss in quality. but it's already 2x faster. that's pretty impressive...
<ramiro>
I still want to write the loop code (horizontal and vertical) in asmjit, so it's just one call and it doesn't have to keep looping in C. this way I can factor out some constants, which should make it even faster.
<haasn>
loop code?
<ramiro>
I'll leave dithering for tomorrow. I'm too tired right now.
<haasn>
you mean the run_op_func()?
<haasn>
I was thinking of ways we can reduce the per-dispatch overhead more; I did try moving the image pointers to registers so the read ops don't have to indirect load them after the calling code increments them from the previous iteration
<ramiro>
haasn: isn't it called run_op_pass? but yes, that one.
<haasn>
oh yeah
<haasn>
but I'm not sure if it did anything except make the asm code easier to write
<haasn>
the idea I just got was to have the kernel entrypoint be in charge of setting up the context
<haasn>
so that the calling code doesn't have to do anything except increment a register and call it
<haasn>
but I'm not sure that will accomplish anything
<haasn>
ramiro: you do technically know the image dimensions at compile time
<haasn>
so you have that advantage
<haasn>
could unroll the whole loop in theory
<ramiro>
haasn: unrolling might be a bit too much :)
<haasn>
well, for one line :)
<haasn>
oh, one advantage of having the first op do setup: we can skip incrementing pointers for unused planes :)
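Roughly the per-line driver being discussed, as a hedged C sketch with invented struct and function names; the last loop is the "skip incrementing pointers for unused planes" part:

```c
#include <stdint.h>

typedef struct DemoExecCtx {
    const uint8_t *in[4];
    uint8_t       *out[4];
    unsigned       used_planes;   /* bitmask of planes the kernel touches */
} DemoExecCtx;

typedef void (*demo_kernel_fn)(DemoExecCtx *ctx, int block_x);

static void run_rows(DemoExecCtx *ctx, demo_kernel_fn kernel,
                     int blocks_per_row, int height,
                     const int in_stride[4], const int out_stride[4])
{
    for (int y = 0; y < height; y++) {
        for (int bx = 0; bx < blocks_per_row; bx++)
            kernel(ctx, bx);              /* per-dispatch overhead lives here */

        for (int i = 0; i < 4; i++) {     /* only advance planes in use */
            if (!(ctx->used_planes & (1u << i)))
                continue;
            ctx->in[i]  += in_stride[i];
            ctx->out[i] += out_stride[i];
        }
    }
}
```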
thilo has quit [Ping timeout: 248 seconds]
thilo has joined #ffmpeg-devel
Anthony_ZO has joined #ffmpeg-devel
Anthony_ZO has quit [Ping timeout: 252 seconds]
abdu67 has joined #ffmpeg-devel
abdu16 has joined #ffmpeg-devel
abdu67 has quit [Ping timeout: 240 seconds]
<haasn>
ramiro: I checked, on my laptop, pure memcpy does not appear to be faster than the x86 SIMD
<haasn>
for plane->plane copies
<haasn>
actually, for some reason, straight memcpy is _slower_
<haasn>
don't ask me why or how
<haasn>
oh, I can explain why
<haasn>
because our code loads the source plane once and writes it multiple times
<haasn>
whereas memcpy has to load the source plane again on every copy :)
<haasn>
(this is for gray -> gbrp)
<haasn>
for yuv444p -> yuva444p, our code is ~3% slower than memcpy+memset
<haasn>
probably not worth maintaining a dedicated memcpy backend
<haasn>
then again, it is already written..
<haasn>
Of course, the real goal there would be to promote the memcpy to a refcopy
<haasn>
I think we can extend the API of SwsGraph slightly to make that a possibility
<haasn>
Just requires adding an AVBufferRef to the planes
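What the refcopy promotion could look like once the graph carries an AVBufferRef per plane; the per-plane ref is the proposed API extension, so this is a sketch rather than existing swscale code:

```c
#include <string.h>
#include <stdint.h>
#include <libavutil/buffer.h>
#include <libavutil/error.h>

static int copy_or_ref_plane(uint8_t **dst, AVBufferRef **dst_ref,
                             uint8_t *src, AVBufferRef *src_ref,
                             int stride, int height)
{
    if (src_ref) {
        /* refcopy: no data is moved, we just take another reference */
        AVBufferRef *ref = av_buffer_ref(src_ref);
        if (!ref)
            return AVERROR(ENOMEM);
        av_buffer_unref(dst_ref);
        *dst_ref = ref;
        *dst     = src;
        return 0;
    }
    /* fallback: plain per-plane memcpy */
    memcpy(*dst, src, (size_t)stride * height);
    return 0;
}
```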
cone-421 has joined #ffmpeg-devel
<cone-421>
ffmpeg Andreas Rheinhardt master:c94143350f49: avutil/libm: Only include intfloat.h when needed
<cone-421>
ffmpeg Andreas Rheinhardt master:9f0970ee35a5: tests/checkasm/videodsp: Don't use declare_func_emms
<Lynne>
wow, sws still didn't rely on refcounting
aaabbb has quit [Changing host]
aaabbb has joined #ffmpeg-devel
jamrial has quit []
pross has quit [Ping timeout: 252 seconds]
ahmedhamed has joined #ffmpeg-devel
derpydoo has joined #ffmpeg-devel
abdu16 has quit [Ping timeout: 240 seconds]
psykose has quit [Ping timeout: 244 seconds]
psykose has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
cone-421 has quit [Quit: transmission timeout]
ahmedhamed has quit [Quit: Connection closed for inactivity]
IndecisiveTurtle has quit [Ping timeout: 265 seconds]
Anthony_ZO has joined #ffmpeg-devel
derpydoo has quit [Ping timeout: 252 seconds]
j45_ has joined #ffmpeg-devel
j45 has quit [Ping timeout: 260 seconds]
j45_ is now known as j45
j45 has joined #ffmpeg-devel
j45 has quit [Changing host]
abdu16 has joined #ffmpeg-devel
mkver has quit [Ping timeout: 252 seconds]
mkver has joined #ffmpeg-devel
Mirarora has quit [Quit: Mirarora encountered a fatal error and needs to close]
Mirarora has joined #ffmpeg-devel
minimal has joined #ffmpeg-devel
lemourin has joined #ffmpeg-devel
lemourin is now known as Guest2349
Guest2349 has quit [Killed (lead.libera.chat (Nickname regained by services))]
IndecisiveTurtle has joined #ffmpeg-devel
Anthony_ZO has quit [Remote host closed the connection]
<fflogger>
[newticket] catap: Ticket #11525 ([undetermined] 60fps stream from Elgato Facecam Pro is frozen) created https://trac.ffmpeg.org/ticket/11525
twelve has joined #ffmpeg-devel
IndecisiveTurtle has quit [Quit: IndecisiveTurtle]
twelve has quit [Remote host closed the connection]
<welder>
How can I switch to digest mode instead of receiving every single email? I can't find such an option in the web UI. I did some back and forth with the mailbot, sent two lines "set authenticate <my password>" and "set digest mime", but I did not get any reply and I'm still receiving every email
<welder>
I don't wish to completely unsubscribe, though
abdu57 has joined #ffmpeg-devel
<haasn>
ramiro: btw, I read that modern CPUs do register renaming for xmm/ymm regs; seems like ivy lake+ and zen2+
<haasn>
so a SWS_OP_SWIZZLE that just moves registers around should be almost free
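Why a pure swizzle ends up nearly free, as a hedged intrinsics sketch: at the register level it is just a reassignment of which vector holds which channel, and move elimination in the renamer handles any leftover reg-reg moves:

```c
#include <emmintrin.h>

/* Illustrative only: a {2,1,0,3} swizzle (e.g. rgba -> bgra) on vectors of
 * channel data is just renaming variables; the compiler emits at most
 * register-to-register moves. */
static inline void swizzle_2103(__m128i ch[4])
{
    __m128i r = ch[0], g = ch[1], b = ch[2], a = ch[3];
    ch[0] = b;
    ch[1] = g;
    ch[2] = r;
    ch[3] = a;
}
```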
Thulinma has joined #ffmpeg-devel
Thul has quit [Read error: Connection reset by peer]
rvalue- has joined #ffmpeg-devel
rvalue has quit [Ping timeout: 252 seconds]
rvalue- is now known as rvalue
<haasn>
for x86 SIMD, what's the best strategy for loading stride-3 data (e.g. rgb24) into separate vector registers?
<haasn>
a bunch of pshufb?
<Lynne>
rrrr, gggg, bbbb, in separate regs?
<nevcairiel>
24bit data just sucks, so shuffling is likely your only option
<Lynne>
nah, you'd be wasting a ton of potential if you took the easy way
<Lynne>
you have to load 3 regs worth of data (doesn't matter if they're 128bit or 256bit), then use a bunch of unpacks and strategic pshufw, vpermilps and blends to get the data in the right order
<Lynne>
you'll have a lot more ideal tools with avx512 but that's still a meme thanks to intel
<Lynne>
pshufb is expensive as it requires wasting a register and a load, whilst vpermil, blends and pshufws take an immediate
iive has joined #ffmpeg-devel
<Lynne>
you start off with: r1g1b1r2 g2b2r3g3 b3r4g4b4 r5g5b5r6 g6b6r7g7 b7r8g8b8 r9g9b9r10 g10b10r11g11
<Lynne>
hmm, pshufb would be the best first step
<Lynne>
since the pattern for r1r2r3... would mirror in the upper half of the reg to yield you g6g7g8...
<Lynne>
then you'd do a bunch of punpck to split up the words, and finally you'd split up the rrrrrrr/gggggg regs in half via vperm2i128
<Lynne>
since the very last component is cut off across different regs, you'd need to blend in the last value
<Lynne>
there's a crazy instruction that might be perfect for this which lets you shift one half of a register and glue it onto another
twelve has joined #ffmpeg-devel
<haasn>
I'll see what I can come up with
<haasn>
for splitting ya8 into yyyy aaaa I imagine the best way is to use a 16-bit AND to turn yayayaya into y0y0y0y0 then use vpackuswb to pack that back to yyyyyy ?
<haasn>
and a right shift by 8 to turn yayayaya into a0a0a0a0
<haasn>
hmm, we should really have the ability to fuse a u8 load and a u8-to-u16 conversion into a single op that just skips the last step
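The AND + shift + packuswb split, sketched with SSE2 intrinsics for 16 ya8 pixels (not FFmpeg code; with 256-bit registers packuswb packs within each 128-bit lane, the quirk noted further down, so an extra cross-lane permute would be needed):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

static void ya8_split16(const uint8_t *src, __m128i *y, __m128i *a)
{
    const __m128i lo8 = _mm_set1_epi16(0x00FF);
    __m128i v0 = _mm_loadu_si128((const __m128i *)(src +  0)); /* y0a0..y7a7   */
    __m128i v1 = _mm_loadu_si128((const __m128i *)(src + 16)); /* y8a8..y15a15 */

    /* yayayaya -> y0y0y0y0 (mask) and a0a0a0a0 (16-bit right shift),
     * then pack each pair of word vectors back down to bytes */
    *y = _mm_packus_epi16(_mm_and_si128(v0, lo8),
                          _mm_and_si128(v1, lo8));
    *a = _mm_packus_epi16(_mm_srli_epi16(v0, 8),
                          _mm_srli_epi16(v1, 8));
}
```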
twelve has quit [Remote host closed the connection]
Marth64 has quit [Ping timeout: 260 seconds]
<Lynne>
why would you and away during that step?
Marth64 has joined #ffmpeg-devel
<Lynne>
just do a byte-level unpack, then you can do a 64-bit level unpack or pshufd to smush two different regs together
<Lynne>
with a final vperm2i128 to split up regs
<Lynne>
pshufw out, r1, r2, q0123 is extremely powerful, since it'll take q01 from r1, and q23 from r2
<haasn>
Lynne: I don't follow - how do unpacking instructions help here? aren't they accomplishing the exact opposite?
Guest11 has joined #ffmpeg-devel
Guest11 has quit [Client Quit]
Guest94 has joined #ffmpeg-devel
<Lynne>
load 2 regs of yayayaya....
<Lynne>
so you have y1a1y2... and the second one has y16a16y17a17...
<Lynne>
pack them together and you have y1y16y2y17...
<haasn>
which instruction is doing that? punpcklbw would do y1y16a1a16...
<haasn>
or are you suggesting to do a word-level unpack after that, followed by a dword-level unpack?
<haasn>
(etc)
<Lynne>
yup, that gets you out of byte-level hell and lets you easily manipulate things with word-level immediate shuffles like pshufw + more unpacks
<haasn>
ah
<haasn>
not sure that will be faster than packuswb+and+srl
Guest94 has quit [Quit: Client closed]
<Lynne>
you use more registers with more parallel operations with this strategy
Guest65 has joined #ffmpeg-devel
Guest65 has quit [Client Quit]
<Gramner>
for splitting packed rgb24 into separate planes: pshufb + punpck(l|h)dq. process 12 input bytes per lane to avoid having to deal with the uneven number of elements at the end
<haasn>
oh, packuswb produces its results in two different lanes for some reason
<haasn>
x86 is weird
<Gramner>
most operations operate on lanes for performance reasons
<Gramner>
cross-lane shuffles have like triple the latency of in-lane shuffles
<haasn>
Gramner: I am operating on a minimum of 48 bytes which is a clean multiple of both 4 and 3, so there is no need to worry about leftover bytes
<Gramner>
I mean if you're trying to keep every single byte of input from each 16-byte lane it will result in some really awkward shuffling to get everything together. just do some extra memory loads so that you only need to care about 12 of them
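One possible shape of Gramner's suggestion in SSSE3 intrinsics, processing 4 pixels (12 useful bytes) per 16-byte load so no pixel straddles a register; this is a sketch rather than FFmpeg code, and the final load reads 4 bytes past the 48 input bytes, which the caller would have to allow for:

```c
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3 */

/* Deinterleave 16 rgb24 pixels into r/g/b byte vectors. */
static void rgb24_to_planar16(const uint8_t *src,
                              __m128i *r, __m128i *g, __m128i *b)
{
    const __m128i shuf = _mm_setr_epi8(
        0, 3, 6, 9,      /* r */
        1, 4, 7, 10,     /* g */
        2, 5, 8, 11,     /* b */
        -1, -1, -1, -1); /* zeroed */

    /* each load covers 4 pixels = 12 useful bytes */
    __m128i p0 = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *)(src +  0)), shuf);
    __m128i p1 = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *)(src + 12)), shuf);
    __m128i p2 = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *)(src + 24)), shuf);
    __m128i p3 = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *)(src + 36)), shuf);

    /* p* = rrrr gggg bbbb 0000 (as dwords); regroup with dword/qword unpacks */
    __m128i rg01 = _mm_unpacklo_epi32(p0, p1); /* r0-3 r4-7 g0-3 g4-7 */
    __m128i rg23 = _mm_unpacklo_epi32(p2, p3); /* r8-11 r12-15 g8-11 g12-15 */
    __m128i b01  = _mm_unpackhi_epi32(p0, p1); /* b0-3 b4-7 0 0 */
    __m128i b23  = _mm_unpackhi_epi32(p2, p3); /* b8-11 b12-15 0 0 */

    *r = _mm_unpacklo_epi64(rg01, rg23);
    *g = _mm_unpackhi_epi64(rg01, rg23);
    *b = _mm_unpacklo_epi64(b01, b23);
}
```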
abdu69 has joined #ffmpeg-devel
abdu57 has quit [Ping timeout: 240 seconds]
tufei has quit [Remote host closed the connection]
Mirarora has quit [Quit: Mirarora encountered a fatal error and needs to close]
Mirarora has joined #ffmpeg-devel
<fflogger>
[editedticket] jb_alvarado: Ticket #11469 ([ffmpeg] ffmpeg_demux: readrate plays "catch up" if output is blocked, then later resumed) updated https://trac.ffmpeg.org/ticket/11469#comment:13
<ramiro>
haasn: that's precisely why I decided to start with neon for jit :P
<haasn>
understandable
<haasn>
I would love to do rvv instead
abdu69 has quit [Quit: Client closed]
abdu69 has joined #ffmpeg-devel
jamrial has joined #ffmpeg-devel
<ramiro>
haasn: sorry if this is a dumb question, but is it normal that the dithering matrix only adds positive numbers?
<ramiro>
( and in which formats is size_log2 == 0 so that it adds 0.5? )
twelve has joined #ffmpeg-devel
<haasn>
ramiro: because SWS_OP_CONVERT truncates
<haasn>
(int) (x + 0.5) is rounding
twelve has quit [Remote host closed the connection]
<haasn>
in the current logic we use rounding (so size=0) for 16-bit output
<haasn>
and dithering for lower-than-16 bit output
<haasn>
but I was thinking we should relax this to do hard truncation unless SWS_ACCURATE_RND is specified
<haasn>
above certain bit depth
<haasn>
that would allow us to do e.g. downconversion from yuv444p16 to yuv444p12 with just an rshift
<haasn>
or upconversion from rgb24 to gbrp12le by using an expand+shift
<haasn>
and we can probably set a lower cut-off to start using a real dither matrix, maybe 14 bit?
<haasn>
incidentally, one major advantage of the new approach is that we can consolidate and centralize these decisions now
<haasn>
because the logic is entirely self-contained in fmt_dither() in formats.c
<haasn>
and it will therefore be consistent for all formats :)
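The policy in scalar form, as a hedged sketch (the real decisions live in fmt_dither() in formats.c; the function names, float type and cut-offs here are illustrative):

```c
#include <stdint.h>

/* SWS_OP_CONVERT truncates, so rounding is expressed by adding 0.5 before
 * the convert, and ordered dithering by adding a matrix entry in [0, 1). */
static uint8_t convert_truncate(float x)             /* plain (int)x */
{
    return (uint8_t)x;
}

static uint8_t convert_round(float x)                /* size_log2 == 0 case */
{
    return (uint8_t)(x + 0.5f);                       /* (int)(x + 0.5) */
}

static uint8_t convert_dither(float x, const float *matrix,
                              int size_log2, int px, int py)
{
    int size = 1 << size_log2;
    float d  = matrix[(py & (size - 1)) * size + (px & (size - 1))];
    return (uint8_t)(x + d);                          /* d in [0, 1) */
}
```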
<haasn>
ramiro: not sure if you saw but I recently pushed swscale5 with a lot more internal ABI changes
<haasn>
for the ops
<haasn>
one notable change is that I decided reads/writes should be planar by default; this mainly changes the case for elems = 1 which is now considered a planar read
<haasn>
(and packed reads are explicitly tagged)
<haasn>
I also think I want to make a memdup of the SwsOpsList before calling into compile() so that compile() can make arbitrary mutations to the op list, not just consuming ops
<haasn>
e.g. also relaxing their types or partially implementing them
<haasn>
I also sort of want to find a way for compile() to return multiple functions at the same time
<haasn>
or maybe make the SwsOpChain into a proper type and let compile() assemble the entire op chain in one call
<haasn>
that will make checkasm easier to implement
<haasn>
I think my next step for now should be to write a simple checkasm for the ops backends so that I can test them without losing my sanity
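The memdup idea as a hedged sketch with a stand-in ops-list type; the point is only that compile() gets a private, mutable copy it can rewrite, split, or fuse at will:

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the real SwsOpsList; field names are invented. */
typedef struct DemoOpsList {
    void  *ops;        /* array of ops */
    int    nb_ops;
    size_t op_size;
} DemoOpsList;

static int compile_with_copy(const DemoOpsList *src /*, backend, out fns */)
{
    DemoOpsList tmp = *src;
    tmp.ops = malloc(src->nb_ops * src->op_size);
    if (!tmp.ops)
        return -1;
    memcpy(tmp.ops, src->ops, src->nb_ops * src->op_size);

    /* backend->compile(&tmp, ...): free to relax types, mutate or consume
     * ops, and potentially emit a whole chain of functions in one call */

    free(tmp.ops);
    return 0;
}
```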
<ramiro>
haasn: preserving sanity is the reason why my productivity has fallen well below 50% in the past few years... totally worth it.
<ramiro>
haasn: "consolidate and centralize these decisions now" <= definitely a win... current libswscale has evolved with multiple diverging paths per specialized converter and very little effort into consodilating them.