michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.1 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
System_Error has joined #ffmpeg-devel
<cone-887>
ffmpeg Peter Ross master:16f9cfcf4bd8: avcodec/leaddec: support format 0x1006
derpydoo has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
iive has quit [Quit: They came for me...]
Marth64 has quit [Remote host closed the connection]
Kei_N_ has quit [Read error: Connection reset by peer]
Kei_N has joined #ffmpeg-devel
jarthur has quit [Quit: jarthur]
derpydoo has quit [Ping timeout: 245 seconds]
^Neo has joined #ffmpeg-devel
^Neo has quit [Changing host]
^Neo has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
<kierank>
can ffmpeg checkasm show the byte differences
<jamrial>
kierank: each test prints whatever it wants. most tests don't print anything
<jamrial>
the integer ones use memcmp in general, and looping through the arrays manually to print mismatching bytes may be slow
<fflogger>
[newticket] 0x20z: Ticket #11460 ([avformat] SEGV FFmpeg-master/libavformat/mov.c:5209:39 in mov_read_trak) created https://trac.ffmpeg.org/ticket/11460
thilo has quit [Ping timeout: 268 seconds]
thilo has joined #ffmpeg-devel
thilo has quit [Changing host]
thilo has joined #ffmpeg-devel
<cone-887>
ffmpeg James Almer master:43be8d07281c: avformat/mov: check for tts_count before dereferencing tts_data
<aaabbb>
is there any reason framehash etc. doesn't use the SHA-1 CPU extensions for sha1 ("SHA160")? or the SHA-2 extensions for the sha2 hashes? or is it just that no one has submitted a patch yet?
<aaabbb>
if it's just that no one has submitted a patch then i could do it
<jamrial>
aaabbb: no one did, yes
<aaabbb>
jamrial: do you know off the top of your head which crc32 variant -hash crc32 uses? is it crc32c, or the one used by png/zlib?
<aaabbb>
if it's crc32c then it can use the crc32 instructions too
<jamrial>
no, it's not that one
<aaabbb>
alright only sha1 and sha2 are reasonable to accelerate then
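[editor's note: a minimal sketch of why the CPU crc32 instructions only help crc32c, as discussed above: SSE4.2 hard-wires the Castagnoli polynomial, whereas FFmpeg's -hash crc32 uses the zlib/png (IEEE) polynomial, which the instruction cannot compute. The helper name crc32c_hw is illustrative.]

    /* CRC-32C via the SSE4.2 crc32 instruction (x86-64, compile with -msse4.2) */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <nmmintrin.h>

    static uint32_t crc32c_hw(const uint8_t *buf, size_t len)
    {
        uint32_t crc = ~0u;
        while (len >= 8) {
            uint64_t v;
            memcpy(&v, buf, 8);                     /* avoid unaligned deref */
            crc = (uint32_t)_mm_crc32_u64(crc, v);  /* Castagnoli polynomial only */
            buf += 8;
            len -= 8;
        }
        while (len--)
            crc = _mm_crc32_u8(crc, *buf++);
        return ~crc;
    }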
tufei has joined #ffmpeg-devel
<Compn>
crc32 should be enough for anyone /s
^Neo has quit [Ping timeout: 260 seconds]
MyNetAz has quit [Remote host closed the connection]
MyNetAz has joined #ffmpeg-devel
jamrial has quit []
Guest75 has joined #ffmpeg-devel
Martchus has joined #ffmpeg-devel
Martchus_ has quit [Ping timeout: 252 seconds]
cone-887 has quit [Quit: transmission timeout]
MyNetAz has quit [Remote host closed the connection]
MyNetAz has joined #ffmpeg-devel
rvalue has quit [Read error: Connection reset by peer]
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
MisterMinister has quit [Ping timeout: 252 seconds]
derpydoo has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
darkapex has quit [Ping timeout: 245 seconds]
darkapex has joined #ffmpeg-devel
System_Error has joined #ffmpeg-devel
Guest31 has joined #ffmpeg-devel
Guest31 has quit [Quit: Client closed]
^Neo has joined #ffmpeg-devel
^Neo has quit [Changing host]
^Neo has joined #ffmpeg-devel
abdu has joined #ffmpeg-devel
abdu28 has joined #ffmpeg-devel
abdu has quit [Ping timeout: 240 seconds]
derpydoo has quit [Quit: derpydoo]
abdu56 has joined #ffmpeg-devel
DauntlessOne4 has quit [Quit: Ping timeout (120 seconds)]
DauntlessOne4 has joined #ffmpeg-devel
abdu28 has quit [Ping timeout: 240 seconds]
abdu50 has joined #ffmpeg-devel
abdu56 has quit [Ping timeout: 240 seconds]
abdu13 has joined #ffmpeg-devel
abdu50 has quit [Ping timeout: 240 seconds]
abdu13 has quit [Ping timeout: 240 seconds]
abdu13 has joined #ffmpeg-devel
jamrial has joined #ffmpeg-devel
<BBB>
kierank: it can, yes
<BBB>
kierank: modify the test to use the proper check macro/function instead of memcmp
<BBB>
kierank: and -v will do what you're looking for
<BBB>
dav1d uses this everywhere and it's wonderful :)
<kierank>
I will try, thanks
<haasn>
refactored my swscale-ng design further and now the generic (float) conversion path is actually 25% faster than swscale's dedicated LUT for 8 bit yuv -> yuvj conversion
<haasn>
even before any handwritten asm or applying our own LUT optimizations
<haasn>
it's a good day when your generic C code outperforms swscale SIMD
<haasn>
I'm learning more and more that the smaller and more atomic I make my operations, the better
<haasn>
instead of having a single clunky "decode(depth, range)" and "encode(depth, range)" operation I just have a single SWS_OP_CONVERT for int <-> float and a SWS_OP_FMA
<llyyr>
how does it compare to zimg? or is it premature to make that comparison?
<haasn>
FMA is great because it can embed everything; range conversions, bit depth conversion, shifting, even clearing out or defaulting unused components (e.g. for gray -> yuva)
<haasn>
and FMA composes very well
<haasn>
it composes with itself, it composes with matrix multiplications
<haasn>
it even composes with LUTs, so if you have any long sequence of such operations, they can usually be collapsed down into a single matrix or a single LUT
<haasn>
and then all we really need to write is one dedicated dispatch path for fused convert(f32) -> apply_lut/mat/fma -> convert(uint*)
<haasn>
the conversion functions are even cheaper in this approach because all they need to handle is clipping, no divisions by 255.0 or whatever (that is embedded into the fma/matrix already)
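[editor's note: a sketch of the fused path described above, under the stated design; the op and helper names (op_convert_u8_to_f32, fma_compose, etc.) are illustrative, not the actual swscale-ng API. It shows (a) conversion functions that only clip, with all scaling folded into the FMA constants, and (b) how two FMAs collapse into one.]

    #include <math.h>
    #include <stdint.h>

    #define CHUNK 16 /* pixels per block, per the 16x1 discussion below */

    typedef struct { float scale, offset; } FmaOp;

    /* "first, then second": s2*(s1*x + o1) + o2 = (s2*s1)*x + (s2*o1 + o2) */
    static FmaOp fma_compose(FmaOp first, FmaOp second)
    {
        return (FmaOp){ second.scale * first.scale,
                        second.scale * first.offset + second.offset };
    }

    static void op_convert_u8_to_f32(float dst[CHUNK], const uint8_t src[CHUNK])
    {
        for (int i = 0; i < CHUNK; i++)
            dst[i] = src[i];            /* no division by 255.0 here */
    }

    static void op_fma(float v[CHUNK], FmaOp op)
    {
        for (int i = 0; i < CHUNK; i++)
            v[i] = v[i] * op.scale + op.offset;
    }

    static void op_convert_f32_to_u8(uint8_t dst[CHUNK], const float src[CHUNK])
    {
        for (int i = 0; i < CHUNK; i++) {
            float v = rintf(src[i]);    /* only clipping lives here */
            dst[i] = v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
        }
    }

    /* 8-bit limited-range luma -> full range: (x - 16) * 255 / 219 as one fma */
    static void luma_limited_to_full(uint8_t dst[CHUNK], const uint8_t src[CHUNK])
    {
        float tmp[CHUNK];
        op_convert_u8_to_f32(tmp, src);
        op_fma(tmp, (FmaOp){ 255.0f / 219.0f, -16.0f * 255.0f / 219.0f });
        op_convert_f32_to_u8(dst, tmp);
    }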
<haasn>
and using floats here really is a natural fit because they can losslessly cover multiple orders of magnitude
<haasn>
keeping operations extremely atomic also helps the compiler produce much faster code
<haasn>
inlining everything into one mega-operation often makes it slower, not faster
<haasn>
or maybe I should call it nuscale for now :^)
<llyyr>
lol
<haasn>
for yuv444p -> yuv444p10le, my code is 575x vs zscale 781x (vs swscale 416x) but this is not terribly surprising as I don't have a fast path for this yet
<haasn>
so that's 575x going through the _float path_
<llyyr>
massive glow up for swscale
<haasn>
and for yuv444p10be my code and zscale are basically equal already
<haasn>
once I add a direct shifting fast path for yuv range conversions it will most likely explode
<haasn>
rgba -> bgra my code is 888x vs zscale 295x
<haasn>
vs swscale 521x
<haasn>
oh, that's not a fair comparison on zscale's part because it had to include another swscale fmt conversion
<haasn>
so it's going via bgrap
<haasn>
but I suppose that's still a win vs vf_zscale :)
<llyyr>
could drop the horrible zimg redirection code in mpv once all this is in a release then
<llyyr>
8.0 (or 7.2?) when
<haasn>
2x faster for rgba -> rgba64le (and be)
<haasn>
I still need to implement rgb <-> yuv but with my current design it should probably be a lot faster also
<haasn>
the main thing my code does not handle at all atm is scaling
<haasn>
and that's still a bit of an open question in terms of the best design, but I have some ideas-ish
<haasn>
so also no subsampled input
<haasn>
it will be interesting to see how swscale compares against whatever I come up with for yuv420p -> rgb24 because swscale rolls the scaling and conversion into a single kernel
<haasn>
whereas my code would most likely do something like, convert luma to float, convert chroma to float, upscale chroma, merge results back together and convert to int
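[editor's note: the yuv420p -> rgb24 pipeline haasn describes, written out as a hypothetical op sequence; the names are illustrative only, not the actual API.]

    /* hypothetical op list for yuv420p -> rgb24 in the chunk/op model above */
    enum HypotheticalOps {
        CONVERT_Y_TO_F32,    /* convert luma to float             */
        CONVERT_UV_TO_F32,   /* convert chroma to float           */
        UPSCALE_CHROMA_2X2,  /* upscale subsampled chroma         */
        MERGE_PLANES,        /* merge results back together       */
        FMA_MATRIX_YUV2RGB,  /* fused range/offset + color matrix */
        CONVERT_F32_TO_U8,   /* clip and convert back to int      */
    };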
<llyyr>
thanks for all the work \o/
<haasn>
but with everything being inlined into a tight loop it might not actually be that problematic to handle it this way, you just have slightly more function call overhead
<haasn>
but fortunately decades of bloated C++ projects have made modern processors stupidly efficient at function calls into cached functions
<haasn>
I will point out that we are basically doing on the order of 72 million function calls per second in this benchmark result
<haasn>
albeit split across 32 cores
<haasn>
or really more like 200M, since for each 16x1 block of pixels we do 1 read call, 1 fused fma conversion, and 1 write call
<haasn>
but changing that to 32x1 makes it slower, halving the number of function calls is not worth the loss in cache locality
<haasn>
or maybe the compiler is just generating worse code at larger kernel sizes, will see once we add proper hand-written SIMD
<haasn>
fortunately this constant can be trivially changed, it is absolutely not a hard design parameter of anything
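[editor's note: a sketch of the per-chunk dispatch being counted above (an assumed shape, not the actual swscale-ng code): one read, one fused fma, and one write call per 16x1 block, driven through function pointers, so 3 calls per block.]

    #include <stdint.h>

    #define CHUNK 16 /* the constant haasn says is trivially changeable */

    typedef struct SwsOpChainSketch {
        void (*read)(float dst[CHUNK], const uint8_t *src);
        void (*fma)(float v[CHUNK], float scale, float offset);
        void (*write)(uint8_t *dst, const float src[CHUNK]);
        float scale, offset;
    } SwsOpChainSketch;

    static void run_row(const SwsOpChainSketch *c, uint8_t *dst,
                        const uint8_t *src, int width)
    {
        float tmp[CHUNK];
        /* width assumed a multiple of CHUNK here; tail handling is
         * sketched a few messages below */
        for (int x = 0; x < width; x += CHUNK) {
            c->read(tmp, src + x);              /* 1 read call            */
            c->fma(tmp, c->scale, c->offset);   /* 1 fused fma conversion */
            c->write(dst + x, tmp);             /* 1 write call           */
        }
    }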
derpydoo has joined #ffmpeg-devel
<haasn>
nice, got another 5%-10% speedup just from using aligned memory reads/writes where possible
<haasn>
another thing we get for free from this code is
<haasn>
1) way better performance on unaligned input/output pointers, since all internal processing is done on properly aligned chunks
<haasn>
2) no more overread/overwrite
<haasn>
3) all kernels are easier to implement because they can assume a fixed size, aligned pointers, no need to worry about loops or leftover pixels
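[editor's note: a sketch of how points 1-3 above fall out of fixed-size chunks (an assumed approach): leftover pixels are staged through an aligned, zero-padded buffer, so the kernel always sees a full aligned CHUNK and never reads or writes past the real row.]

    #include <stdalign.h>
    #include <stdint.h>
    #include <string.h>

    #define CHUNK 16

    static void run_row_tail(void (*kernel)(uint8_t dst[CHUNK],
                                            const uint8_t src[CHUNK]),
                             uint8_t *dst, const uint8_t *src, int width)
    {
        int x = width - width % CHUNK;          /* full chunks handled elsewhere */
        if (x < width) {
            alignas(64) uint8_t in[CHUNK] = {0}, out[CHUNK];
            memcpy(in, src + x, width - x);     /* no overread of src  */
            kernel(out, in);                    /* fixed size, aligned */
            memcpy(dst + x, out, width - x);    /* no overwrite of dst */
        }
    }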
abdu13 has quit [Ping timeout: 240 seconds]
abdu13 has joined #ffmpeg-devel
<haasn>
out of curiosity, it seems like using 64-bit instead of 32-bit float precision is a 38% perf downgrade; I guess we only ever need to consider this in extreme cases
<Lynne>
it would be nice to have the option
<haasn>
(e.g. to ensure bit exact processing of 32 bpc formats)
<haasn>
Lynne: it's about 10 lines of code to add
<Lynne>
sure
<haasn>
the main cost is in executable size :)
abdu13 has quit [Ping timeout: 240 seconds]
<Lynne>
no one's complained about lavu/tx having double precision transforms yet
<Lynne>
I think it may be a bit heavier than swscale-ng
<haasn>
it may be _very very_ slightly slower for cache locality to have space for 64 bit coefficients in our ring buffers though
<haasn>
I will need to re-evaluate when scaling is implemented
derpydoo has quit [Quit: derpydoo]
<Lynne>
ring buffers?
<haasn>
Lynne: yeah, for scaling I basically run my input ops outputting directly into a ring buffer, after which the filter function will run to generate one chunk of output data at a time
<haasn>
so if we need to allocate enough space in the ring buffer to hold 64 bit values, adjacent chunk entries are further apart
<haasn>
likely a non-issue but I'll wait for numbers before deciding
DodoGTA has quit [Quit: DodoGTA]
<Lynne>
so if you're doing 10bit packed yuv to 32bit rgb, you'd do chunks of 5*3 components, which then go into a ring buffer?
<Lynne>
or chunks as in, you're doing yuv to rgb, and you need to know neighbouring chroma, so you'd do 4x4 blocks, all of which go into a ring buffer?
DodoGTA has joined #ffmpeg-devel
<haasn>
I'm not really sure yet how to handle packed yuv
<haasn>
I'm also not yet sure how to handle horizontally adjacent blocks, but I suppose in the worst case we would need a buffer: {row 0 chunk 0} {row 0 chunk 1} {row 0 chunk 2} {row 1 chunk 0} {row 1 chunk 1} {row 1 chunk 2} ... as many rows as we have filter taps
<haasn>
since you need the left and right adjacent columns for horizontal filtering
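[editor's note: a sketch of the ring-buffer layout just described, with the layout assumed from the discussion: one row of chunks per vertical filter tap, indexed modulo the tap count so rows are recycled as the filter advances down the image. Names are illustrative.]

    #include <stdint.h>

    #define CHUNK 16

    typedef struct RingBufSketch {
        float *data;            /* taps * chunks_per_row * CHUNK floats */
        int    taps;            /* vertical filter taps                 */
        int    chunks_per_row;
    } RingBufSketch;

    /* chunk c of absolute source row r: {row r%taps, chunk c} in the ring */
    static float *ring_chunk(RingBufSketch *rb, int r, int c)
    {
        return rb->data + ((r % rb->taps) * rb->chunks_per_row + c) * CHUNK;
    }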
<haasn>
for packed yuv we need to read in one chunk and expand it out to 2x2 chunks
<haasn>
or rather
<haasn>
we need to take one yuyv input chunk and convert it into two y chunks and one uv chunk
derpydoo has joined #ffmpeg-devel
<Lynne>
assuming each chunk in the ring buffer has a position and size, I guess you get slice threading for free this way?
<haasn>
then do 2x horizontal scaling on the uv chunk
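[editor's note: a sketch of the yuyv split just described, with assumed shapes: one packed YUYV block of 2*CHUNK pixels becomes two y chunks plus one uv chunk, and the uv chunk then gets the 2x horizontal upscale.]

    #include <stdint.h>

    #define CHUNK 16

    /* src holds 2*CHUNK pixels packed as Y0 U Y1 V ... (4*CHUNK bytes) */
    static void unpack_yuyv(const uint8_t src[4 * CHUNK],
                            uint8_t y0[CHUNK], uint8_t y1[CHUNK],
                            uint8_t uv[2 * CHUNK])
    {
        for (int i = 0; i < CHUNK; i++) {
            y0[i]         = src[2 * i];           /* Y of pixels 0..CHUNK-1        */
            y1[i]         = src[2 * (CHUNK + i)]; /* Y of pixels CHUNK..2*CHUNK-1  */
            uv[i]         = src[4 * i + 1];       /* U */
            uv[CHUNK + i] = src[4 * i + 3];       /* V */
        }
    }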
<Lynne>
or rather, workqueue-based threading
<haasn>
irrelevant, we already thread at a higher level
<haasn>
sws_graph_run() internally already splits the image into rows and gives one width x (height / threads) slice to the underlying kernel dispatcher
<haasn>
Although… hmm… if we use a full size buffer instead of a ring buffer then we _could_ share vertically adjacent lines at slice edges
<haasn>
Between threads
<haasn>
Would need a mutex per row or something to signal row availability
<haasn>
I will use my existing microbench framework to test if scaling in a ring buffer vs scaling in a full size buffer actually matters
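[editor's note: a sketch of the per-row signalling floated above; it is speculative in the discussion itself, so treat this purely as an illustration: one flag per shared row, guarded by a mutex/cond pair, lets a thread scaling near its slice edge wait for the neighbouring thread to publish the rows it needs.]

    #include <pthread.h>
    #include <stdbool.h>

    typedef struct RowGate { /* init fields with pthread_mutex_init/pthread_cond_init */
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        bool            ready;  /* row written to the full-size buffer */
    } RowGate;

    static void row_publish(RowGate *g)
    {
        pthread_mutex_lock(&g->lock);
        g->ready = true;
        pthread_cond_broadcast(&g->cond);
        pthread_mutex_unlock(&g->lock);
    }

    static void row_wait(RowGate *g)
    {
        pthread_mutex_lock(&g->lock);
        while (!g->ready)
            pthread_cond_wait(&g->cond, &g->lock);
        pthread_mutex_unlock(&g->lock);
    }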
<Lynne>
it would be faster if you could either reuse the input or output (dirtying it) as the intermediate full buffer
<haasn>
perhaps, but that's microoptimization stuff and sounds painful to guarantee the safety of
LaserEyess has quit [Quit: fugg]
LaserEyess has joined #ffmpeg-devel
LaserEyess has quit [Changing host]
LaserEyess has joined #ffmpeg-devel
derpydoo has quit [Ping timeout: 252 seconds]
elvis_a_presley has quit [Quit: smoke-bomb ; grapple-hook]
elvis_a_presley has joined #ffmpeg-devel
<kierank>
BBB: what's the name of the function? I'm not seeing it in FFmpeg checkasm
<BBB>
../tests/checkasm/checkasm.c:int checkasm_check_##type(const char *file, int line, \
<BBB>
this exists, e.g., for DEF_CHECKASM_CHECK_FUNC(uint8_t, "%02x")
<kierank>
thanks
<BBB>
so that would be checkasm_check_uint8_t(), I think?
<BBB>
dav1d has a slightly different system where (for 8+10bit support) we have a "pixel" type, which is either uint8_t or uint16_t
<BBB>
so we use checkasm_check_pixel()
<BBB>
but you'll get it soon enough... if this is for 10bit content, use checkasm_check_uint16_t
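[editor's note: what the switch BBB describes looks like in a checkasm test, as a sketch; the buffer names, W/H sizes, and the "dst" label are illustrative. checkasm_check() takes byte strides, registers the failure itself, and with -v prints the mismatching bytes.]

    /* before: a silent pass/fail via memcmp */
    if (memcmp(dst_ref, dst_new, W * H))
        fail();

    /* after: for 10bit content use checkasm_check(uint16_t, ...) instead */
    checkasm_check(uint8_t, dst_ref, W, dst_new, W, W, H, "dst");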