michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.1.1 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
thilo has quit [Ping timeout: 248 seconds]
thilo has joined #ffmpeg-devel
thilo has quit [Changing host]
thilo has joined #ffmpeg-devel
Anthony_ZO has joined #ffmpeg-devel
<ramiro>
haasn: /* Skip unpacking components that are not used */ <-- this doesn't seem right. we're dropping crucial information to know where to find the components.
cone-896 has joined #ffmpeg-devel
<cone-896>
ffmpeg Michael Niedermayer master:2f0500f22c1d: avcodec/ffv1enc: Better heuristic for selecting mul values.
<cone-896>
ffmpeg Michael Niedermayer master:3faee894fc77: avcodec/ffv1enc: Eliminate copy_state
<cone-896>
ffmpeg Michael Niedermayer master:1d2c39100524: avcodec/ffv1enc: Add -remap_optimizer option
<cone-896>
ffmpeg Michael Niedermayer master:f76508511547: avcodec/ffv1enc: Eliminate RemapEncoderState
<haasn>
ramiro: notice that it skips only trailing components
<haasn>
PackOp is defined as being msb->lsb ordered
<haasn>
But I don’t think this optimization is actually doing anything useful in 99.9% of cases
<haasn>
So if it makes it more annoying just remove it
<haasn>
Realistically you can only skip the alpha channel which only the 1010102 format even has
<ramiro>
I hit it on xv30le -> gray
<haasn>
Ah
<ramiro>
the information available on pack/unpack is quite confusing... the input/output types, which components are used for input and output, and the offsets being msb->lsb ordered. I'll have to read more carefully. currently I have a huge hack for unpack and a medium hack for pack.
<ramiro>
packing/unpacking frequently leads to an unnecessary conversion before or after. I had to add more checks to make sure the conversion only happens once.
<ramiro>
I just finished *most* unscaled conversions. all except for monow/monob, xv36be, and the occasional asmjit failure on consecutive register allocation. it's *very* fast, and I haven't done the loop in asm yet. it should be even faster with just one function call and no redundant loading of const data.
<haasn>
I didn’t really think about the design of pack/unpack too much
<haasn>
We should define it in whatever way is most convenient for asm
<haasn>
How does a typical SIMD implementation of this work?
<Lynne>
load N registers where N is a convenient mod of the depth/elements
<Lynne>
then deinterleave until done
<Lynne>
same with packing
<haasn>
If it’s some combination of AND and shift right, then I imagine we should keep the element size the same
<haasn>
Rather than implying a conversion
<haasn>
Since we’re guaranteed unpacking to separate registers it doesn’t really cost us anything
<Lynne>
you can do it one step at a time for most pixfmts, but it would be significantly slower
<haasn>
Deinterleave?
<Lynne>
shuffle/deinterleave until everything ends up separated by component
<haasn>
What instruction sets have a bitwise shuffle or deinterleave?
<ramiro>
haasn: for packing I'm doing 1) convert to higher format 2) shift left 3) or. 1 and 2 could be merged in the same instruction with a widening shift left.
<Lynne>
oh, for 10bit semi-packed data, like x2bgr10?
<Lynne>
or v210?
<haasn>
Yeah, or rgb565
<haasn>
For byte aligned formats you don’t need SWS_OP_UN/PACK
<ramiro>
haasn: for unpacking it's 1) shift right 2) and 3) convert down. I could merge 1 and 3 with narrowing shift right.
<Lynne>
for that I'd go with shifts and masks, yeah
<haasn>
Could do a survey of how often an unpack is followed by a conversion
<haasn>
Across all possible conversions
<haasn>
Or, what if we just allow the pack type to vary
<haasn>
So { op = UNPACK, pack.type = U16, .type = U8 } + { convert u8 -> u16 } just becomes .type = U16 on the unpack
<haasn>
That strikes me as the most general solution
<haasn>
Then you can use either a narrowing right shift or a normal right shift depending on the desired type
<haasn>
Maybe with the restriction that type <= pack.type
<haasn>
Although the other part of me wants to just delete the confusing pack.type field
<ramiro>
haasn: it is confusing, yes :)
<haasn>
Define PACK/UNPACK on a single depth only
<haasn>
And let the backend worry about fusing the right shift and narrow
<ramiro>
it would be more helpful to have, for each op, input and output type. even if that means duplicating some information from next/previous op.
<haasn>
That only affects CONVERT right
<ramiro>
also maybe comps.unused_in and comps.unused_out
<ramiro>
haasn: no, for pack/unpack
<haasn>
(With the above proposal to delete pack.type)
<haasn>
I guess it would simplify swapping operations
<haasn>
then again, I should just make swap_ops automatically do that when swapping a conversion
<haasn>
one of the things I want to try and solve is restoring the ability for op table entries to match multiple ops again
System_Error has quit [Remote host closed the connection]
<haasn>
but the stupid way of doing that (turning ops into an array) is both cumbersome and creates a lot of dead binary space
<haasn>
by blowing up the size of those tables
<haasn>
and it's not convenient to make them pointers with C macros
<haasn>
maybe it's okay for now to just allow matching _two_ ops, rather than N, that keeps the size manageable
<haasn>
though I suspect we may really want some dedicated kernels for packed read-swizzle-write (e.g. shuffling rgba byte order)
<haasn>
since those can just be pshufb instead of needing to split the whole thing into separate registers
<qwertyttyert>
There is a libtheora. The terminal is limited to displaying 300 rows, so -codecs doesn't show everything. -encoders is also 300 lines, but the information in the output is different. Copy, paste, and find. I use ffmpeg on linux.
jamrial has quit []
<fflogger>
[newticket] y2krankor: Ticket #11534 ([undetermined] Copying dvd subtitles from mp4 to mp4 alters size) created https://trac.ffmpeg.org/ticket/11534
cone-896 has quit [Quit: transmission timeout]
qwertyttyert has left #ffmpeg-devel [#ffmpeg-devel]
<fflogger>
[editedticket] MasterQuestionable: Ticket #11534 ([avcodec] Copying "dvd_subtitle" from MP4 to MP4 altered display size) updated https://trac.ffmpeg.org/ticket/11534#comment:1
<cone-637>
ffmpeg Andreas Rheinhardt master:35c091f4b7fb: avformat/rtpenc: Check dimensions during init
derpydoo has joined #ffmpeg-devel
Guest44 has joined #ffmpeg-devel
Guest44 has quit [Client Quit]
jestrabikr has joined #ffmpeg-devel
jestrabikr has quit [Quit: Client closed]
Grimmauld has joined #ffmpeg-devel
nush has joined #ffmpeg-devel
nush has quit [Remote host closed the connection]
nush has joined #ffmpeg-devel
nush has quit [Remote host closed the connection]
rit has joined #ffmpeg-devel
Teukka has quit [Read error: Connection reset by peer]
Teukka has joined #ffmpeg-devel
Teukka has quit [Changing host]
Teukka has joined #ffmpeg-devel
ngaullier has joined #ffmpeg-devel
rvalue has quit [Read error: Connection reset by peer]
rvalue has joined #ffmpeg-devel
rit has quit [Remote host closed the connection]
rit has joined #ffmpeg-devel
Marth64 has quit [Remote host closed the connection]
Marth64 has joined #ffmpeg-devel
cone-637 has quit [Quit: transmission timeout]
mkver has quit [Ping timeout: 265 seconds]
mkver has joined #ffmpeg-devel
<haasn>
ramiro: I kinda want to start merging the ops code even though the backends aren't "ready" yet, as it will both give people a chance to test the new code and make it possible for others to start developing backends without the constant pain of rebasing
<haasn>
I was thinking about using SWS_EXPERIMENTAL to guard the new code for now, but I'm not very happy with that solution
<haasn>
what if we used an informal env var instead? like how drivers etc sometimes do it
witchymary has joined #ffmpeg-devel
mkver has quit [Ping timeout: 268 seconds]
mkver has joined #ffmpeg-devel
cone-276 has joined #ffmpeg-devel
<cone-276>
ffmpeg Niklas Haas master:0e2742a693d5: tests/swscale: allow choosing specific flags and dither mode
<cone-276>
ffmpeg Niklas Haas master:d467ceaa9b14: tests/swscale: use hex format for flags values
<ramiro>
haasn: in general I'm not a big fan of env vars. what's wrong with using a flag?
<haasn>
I just worry that it may be too poorly defined
<haasn>
so no users can actually end up using it
<ramiro>
it could also be selected by configure for now, disabled by default
<haasn>
but we have the flags anyway, might as well use them
<Lynne>
I think you should enable it by default after 8.0 gets released
<haasn>
I want at least x86 coverage before doing that
<haasn>
of the major paths
<haasn>
Lynne: do you have any idea why my SIMD routine is slower than the compiler generated C code despite the latter doing about 4 unnecessary memory round trips?
<haasn>
it's not 1:1 the same as the asm code because the C code is only reading 32 pixels at a time instead of 64 pixels at a time
<haasn>
and the C code is also writing to stack instead of leaving the data in mx/my/mx2/my2 but that should only be making it slower in this comparison
<Lynne>
wow, vpermq is more expensive than I thought
<Lynne>
throughput of 2, latency of 6.5
<haasn>
you reckon it's just from leaving away that?
<Lynne>
on a zen 3
<haasn>
let me test at different block sizes just to be sure
rit has quit [Remote host closed the connection]
<Lynne>
yeah, I reckon it's that; the compiler generated code relies on bitwise ops which are very cheap and pipelineable
<haasn>
since the block size is different the avx code is doing twice as much work
<Lynne>
ah, yeah, that looks better
<haasn>
that... kinda screws with the whole idea of using checkasm --bench in general
<haasn>
this is with the same 32 block size forced for both
<haasn>
really I should be reporting performance in cycles per pixel
<Lynne>
still barely 1.64x faster though, that's fine on arm64, but I'd have expected a bit more on x86
<haasn>
the sse2 version is with v1 not v2
<haasn>
give me a minute to make v2 compatible with sse2 and test it again
<haasn>
we are guaranteed SSSE3 on x86_64 right?
<wbs>
no, only SSE2
<haasn>
damn
<haasn>
read_packed2_u8_ssse3: 9.4 ( 1.77x)
<haasn>
using the pshufb variant
<ramiro>
haasn: but we did decide at VDD that ssse3 can be considered the minimum.
<haasn>
reasonable
<haasn>
I expect a lot more pshufb in our future anyway
<haasn>
could always go back to supporting separate sse2 variants if a need arises
<ramiro>
haasn: we could also go back to supporting mmx variants if the need arises, I just hope it never arises :)
<haasn>
I'm thinking if there's a way we could avoid the constant need for vpermq
<haasn>
it seems it's needed after basically every bit depth conversion (packuswb etc)
Grimmauld92 has quit [Quit: Client closed]
<haasn>
maybe we can just keep track of the number of such ops and do a single vpermq at the end only if needed :)
Grimmauld has joined #ffmpeg-devel
<haasn>
that seems like the type of witchcraft better reserved for a jit backend though, or at least postponed until we actually have 100% ops coverage
<Lynne>
vpermq is cross-lane, you should try replacing it with vperm2i128+shufpd
<haasn>
after switching from linux-perf back to the native backend
<haasn>
I wonder if this is actually the result of the poor granularity on the rdtsc effectively rounding the execution down to 0 in many cases
<haasn>
or else how can simply changing the timing method result in a dramatic speedup
<ramiro>
haasn: collect the number of 0 runs and print that as well
<kierank>
you need to run the function more times
<kierank>
the granularity is too low
jamrial has joined #ffmpeg-devel
<kierank>
10 cycles or dezicycles or whatever is way too small
<wbs>
kierank: it does run it many times, but the existing checkasm essentially threw away all the previous iterations and only reported the runtime from the last single iteration
<haasn>
increased it from 4 invocations to 16, now I get read_packed2_u8_ssse3: 16.0 ( 6.16x)
<haasn>
fun
<wbs>
kierank: due to an oversight in e0d56f097f42bcdbe6c3b2f57df62a4da63f2094
<kierank>
wbs: :(
<haasn>
wbs: the other problem I think is that we only run the function 4 times in between measuring timestamps
<haasn>
I think it would be better to switch to running the function say 16 times and reducing the run count accordingly
<haasn>
because at least on my platform, rdtsc only reports in increments of I think 36
<wbs>
haasn: maybe, but that feels like a change that depends a lot on the kind of function you're working with
<haasn>
sure, but it shouldn't matter for slower functions, and for faster functions it will improve accuracy a lot, no?
<haasn>
I mean obviously it would be better to adjust the number of runs dynamically to either meet a certain time target or stddev threshold per test
<wbs>
if we'd decrease the run count accordingly, then yeah, it could be fine
<wbs>
but the change to actually use tsum and not just t, feels like a slam dunk fix. the other one is probably ok but a bit less of an obvious change
<wbs>
haasn: but great catch!
<haasn>
read_packed2_u8_ssse3: 0.0 ( 0.00x)
<haasn>
with the default 4 iters, lol
<haasn>
wbs: sent it for now
<haasn>
wbs: can we have a shared checkasm living in a git submodule of dav1d/ffmpeg yet please?
<wbs>
haasn: that would be awesome indeed. unfortunately it is a bit tied to the host project in a bunch of ways
<haasn>
indeed, I suspect a fresh start would be needed
<haasn>
maybe a project for STF 2025 if anybody's interested in that still? :)
<wbs>
nah, I don't think a fresh start is needed, only a bit of refactoring to get it free from the host project e.g. cpu flags
<haasn>
would have to go through and cherry-pick the best parts of dav1d and ffmpeg as they have diverged in terms of features
<haasn>
well I was thinking we could also put it into a clean namespace while we're at it and make it more of a universal library
<haasn>
was it BBB who I was discussing the future of checkasm with at some point?
ngaullier has quit [Remote host closed the connection]
ngaullier has joined #ffmpeg-devel
<haasn>
may I propose making linux-perf the default timing backend as well
<haasn>
it really is substantially more accurate than rdtsc; I get better consistency from linux-perf at --runs=10 than from rdtsc at --runs=16
<haasn>
likely because of cpu scaling messing up the tsc results
<haasn>
it was so accurate that I could use it for timing even with the above bug (single iter); whereas rdtsc in the single-run case just gave no useful results
<ramiro>
haasn: +1 for making perf the default
<cone-276>
ffmpeg Niklas Haas master:256a38101fd1: tests/checkasm: fix wrong summation of bench time
<ramiro>
haasn: what would be the best struct/place to free the context created by a backend?
<ramiro>
haasn: also, in backend_x86's compile you recalculate chain->block_w every time. shouldn't this be done only once?
<haasn>
I guess the downside of this is that it locks you out from using `SwsOpPriv` for other purposes, if the amount of private data you need is less than 16 bytes
<haasn>
ramiro: the x86 backend compile() loops internally after setting the block size
<wbs>
ramiro: the gcc compile farm has one (or a couple) loongarch machines, I've tested dav1d on them before
<haasn>
backend_c does set the block size on every iter but it's not a big deal
<haasn>
to be honest I was considering dropping the ability for compile() to even return EAGAIN but kept it in there because why not
<haasn>
but maybe it would be better to make it clear that compile() should compile _all_ ops
<wbs>
haasn: perf is the default for benchmarking on arm, but I usually disable it
<ramiro>
haasn: yes, I think it's best to have compile() compile all ops
<wbs>
there's an abandoned patchset from courmisch that prepares for making it possible to switch benchmark implementation without recompiling
<ramiro>
haasn: hmm, I'm not calling ff_sws_op_compile_tables() though (which would help freeing the context)
<haasn>
you don't have to
<ramiro>
is it normal that ff_sws_op_chain_append() does nothing with the 'free' argument?
<ramiro>
haasn: thanks. so I tie my context to the chain.
<haasn>
btw, when you do switch to looping in asm, you may want to still keep the block_h reasonably small though
<haasn>
to enable threading it over larger images
Grimmauld has quit [Quit: Client closed]
<ramiro>
haasn: I think block_h would still be 1 or 2, and we need to figure out a way to let run_main/run_tail decide how many rows could be done per call.
<haasn>
I'm more worried atm about run_tail being slow with a too large block_w
<haasn>
the varying block sizes thing is turning into a bit of a pain especially for checkasm because we can't easily guarantee coverage of all possible block sizes
<haasn>
maybe a better approach would be for ops.c to compile a SwsOpChain per possible block size
<ramiro>
haasn: the tail is a small amount of data, especially on large image sizes. it would matter little from 720p and up. don't put this amount of energy into varying block sizes now. get it working with a full block size first and then measure the performance hit with different image sizes.
<haasn>
fair
<haasn>
at 4k even with block size 64 we have over 60 blocks per row
<haasn>
so even in the worst case of spending an entire extra block just for one pixel we are wasting only 1.6%
<haasn>
that is micro-optimization territory
Ishesh has joined #ffmpeg-devel
<ramiro>
haasn: do we really need SwsOpPriv? can't we just do the usual void* and let the function cast it?
<haasn>
we don't really need it though it does make the C code slightly more convenient so I'm not sure I see the harm
<haasn>
well, and there is the fact that void * is not 16 bytes large :)
<haasn>
so we do need a union somewhere; either there or in SwsOpImpl
<ramiro>
can't the backend specify the size of privdata?
<haasn>
it has to be fixed size
<haasn>
look at the way private data is stored in SwsOpImpl; it is alternating with the function pointers
manny2 has joined #ffmpeg-devel
<haasn>
this enables the non-JIT backends to just bump the pointer by 32 bytes and call the next function
<haasn>
without having to bump two different pointers or do math to determine how much to bump it by
<ramiro>
ah, that's good.
<haasn>
16 bytes of "immediate" data is plenty for almost all ops
<haasn>
so not only is this approach lower in per-kernel overhead, it also saves an indirection for all other kernels
<haasn>
and the fact that it's already in cache only helps us
<haasn>
typically the entire impl array fits into cache anyway
Anthony_ZO has quit [Ping timeout: 248 seconds]
manny2 has quit [Quit: Client closed]
manny81 has joined #ffmpeg-devel
<haasn>
Lynne: I don't think vpermq can be replaced by vperm2i128+shufpd here; I have {A0 B0 | A1 B1} and I want {A0 A1 | B0 B1}; after vperm2i128 I have tmp := {A1 B1 | A0 B0} but that doesn't help with a single shufpd to get the result I want
<haasn>
unless I use another shufpd but I somehow doubt vperm2i128 + 2 * shufpd is going to be faster than a single vpermq
<jamrial>
the output of the new tests that it adds will change
<jamrial>
if the output after your change looks wrong, then it would mean the unscaled code i added is wrong
<ramiro>
haasn: valgrind complains that the data returned from generate_bayer_matrix leaks
usagi_mimi has quit [Quit: WeeChat 4.5.2]
usagi_mimi has joined #ffmpeg-devel
usagi_mimi has quit [Changing host]
usagi_mimi has joined #ffmpeg-devel
usagi_mimi has quit [Ping timeout: 248 seconds]
Grimmauld has joined #ffmpeg-devel
<Lynne>
haasn: I see
<Lynne>
nak on making linux-perf the default
<Lynne>
you have a weird system
cone-276 has quit [Quit: transmission timeout]
<mkver>
Is there a nasm equivalent to -ffunction-sections?
ngaullier has quit [Remote host closed the connection]
Grimmauld has quit [Quit: Client closed]
<haasn>
pshufb can only shuffle within 16 bytes?
<jamrial>
yeah, within a lane
<jamrial>
for crosslane byte shuffle i think you need avx512
ccawley2011 has quit [Ping timeout: 245 seconds]
ccawley2011 has joined #ffmpeg-devel
<Gramner>
avx2 has vpermd and vpermq for 32- and 64-bit cross-lane shuffles. for smaller elements you need avx-512
manny81 has quit [Ping timeout: 240 seconds]
ccawley2011 has quit [Ping timeout: 260 seconds]
manny69 has joined #ffmpeg-devel
___nick___ has joined #ffmpeg-devel
ccawley2011 has joined #ffmpeg-devel
<Lynne>
vpermd is the white flag instruction you use to indicate you surrender
<Lynne>
it is almost never the right choice, reparations include losing the war, a register, a memory load, tons of CPU ports, and latency
<jamrial>
not necessarily. look at vector_fmul_reverse in float_dsp. vpermps (the float variant) is faster than insertf128 + shufps
<Lynne>
in more non-trivial applications such as FFTs, you can always fold the functionality of the shuffle back up the chain if you have enough patience and avoid it
Warcop has joined #ffmpeg-devel
IndecisiveTurtle has joined #ffmpeg-devel
___nick___ has quit [Ping timeout: 252 seconds]
<haasn>
for some reason I was thinking 16 bytes is less than one lane, but now I realize I am being silly and should probably go eat something instead
<haasn>
but that does mean I can generally load shuffle masks using vbroadcasti128 when I need it multiple times
IndecisiveTurtle has quit [Quit: IndecisiveTurtle]
___nick___ has joined #ffmpeg-devel
abdu has joined #ffmpeg-devel
<Lynne>
you can, though you ought to be careful
<Lynne>
it may be cheaper to just outright load 256bits at a time, even if they're duplicated, since vbroadcast uses the shuffle ports
<Lynne>
vbroadcast m, xmm has a latency of 4! 4! it's cheaper to vperm2i128 m, m+whatever happens to be in lane 2, 0x00
rit has quit [Remote host closed the connection]
rit has joined #ffmpeg-devel
<Gramner>
broadcast from memory of dwords and larger are not shuffle-µops, just regular memory loads
abdu83 has joined #ffmpeg-devel
abdu2 has joined #ffmpeg-devel
abdu has quit [Ping timeout: 240 seconds]
iive has joined #ffmpeg-devel
abdu83 has quit [Ping timeout: 240 seconds]
Ishesh has quit [Quit: Client closed]
<Lynne>
on zen3 they look like they use the same ports
cone-391 has joined #ffmpeg-devel
<cone-391>
ffmpeg Andreas Rheinhardt master:581a6a042ca8: doc/encoders: Move FFV1 encoder to video encoder section
<cone-391>
ffmpeg Andreas Rheinhardt master:4da84d5c2b2b: swscale/swscale_unscaled: Actually use X2->RGBA64 conversions
<Gramner>
not really. 4 was completely arbitrary to begin with
<Gramner>
haasn: needs to update avg_cycles_per_call() too though
abdu73 has joined #ffmpeg-devel
abdu73 has quit [Client Quit]
abdu2 has quit [Ping timeout: 240 seconds]
abdu73 has joined #ffmpeg-devel
Grimmauld has joined #ffmpeg-devel
ccawley2011 has quit [Read error: Connection reset by peer]
<haasn>
can somebody explain to me the whole decicycles confusion
<haasn>
so rdtsc is supposed to measure "ticks"
<haasn>
avg_cycles_per_call() then multiplies this by 10
<haasn>
for whatever reason
<haasn>
and print_bench calls that "decicycles"
<haasn>
but then it prints it as decicycles / 10
<haasn>
why not just never multiply anything, print the cycles directly as measured and call it as such?
<Gramner>
i think we rewrote that in dav1d because that was a bit confusing
<Lynne>
yeah, I'd be happy to see that changed
<pengvado>
the complication dates back to when we did everything with integer arithmetic. now that avg_cycles_per_call returns a float, there's no reason to *10.
<Lynne>
always an event when you respond
<iive>
recently I saw a discussion for scientifically accurate averaging functions, ones that won't lose data, because of the order floats are summed.