michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.1 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
System_Error has joined #ffmpeg-devel
<cone-887> ffmpeg Peter Ross master:16f9cfcf4bd8: avcodec/leaddec: support format 0x1006
derpydoo has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
iive has quit [Quit: They came for me...]
Marth64 has quit [Remote host closed the connection]
Kei_N_ has quit [Read error: Connection reset by peer]
Kei_N has joined #ffmpeg-devel
jarthur has quit [Quit: jarthur]
derpydoo has quit [Ping timeout: 245 seconds]
^Neo has joined #ffmpeg-devel
^Neo has quit [Changing host]
^Neo has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
<kierank> can ffmpeg checkasm show the byte differences
<jamrial> kierank: each test prints whatever it wants. most tests don't print anything
<jamrial> the integer ones use memcmp in general, and looping through the arrays manually to print mismatching bytes may be slow
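[editor's note] The trade-off jamrial describes (a single memcmp versus a manual per-byte loop) can be sketched as below. This is a minimal illustration, not checkasm code: `report_byte_diffs` is a hypothetical helper, and the per-byte loop runs only after memcmp has already reported a mismatch, so the passing case stays cheap.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of checkasm: returns the number of
 * mismatching bytes and prints each one to `out` (if non-NULL). */
static size_t report_byte_diffs(FILE *out, const uint8_t *ref,
                                const uint8_t *tst, size_t len)
{
    size_t mismatches = 0;

    /* Fast path: one memcmp call covers the common "everything matches" case. */
    if (!memcmp(ref, tst, len))
        return 0;

    /* Slow path: only taken on failure, so the per-byte loop is affordable. */
    for (size_t i = 0; i < len; i++) {
        if (ref[i] != tst[i]) {
            if (out)
                fprintf(out, "offset %zu: ref=%02x new=%02x\n",
                        i, ref[i], tst[i]);
            mismatches++;
        }
    }
    return mismatches;
}
```

Calling `report_byte_diffs(stderr, ref, tst, len)` from a failing test would list every differing offset; passing NULL only counts them.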
<fflogger> [newticket] 0x20z: Ticket #11460 ([avformat] SEGV FFmpeg-master/libavformat/mov.c:5209:39 in mov_read_trak) created https://trac.ffmpeg.org/ticket/11460
thilo has quit [Ping timeout: 268 seconds]
thilo has joined #ffmpeg-devel
thilo has quit [Changing host]
thilo has joined #ffmpeg-devel
<cone-887> ffmpeg James Almer master:43be8d07281c: avformat/mov: check for tts_count before deferencing tts_data
<fflogger> [editedticket] jamrial: Ticket #11460 ([avformat] SEGV FFmpeg-master/libavformat/mov.c:5209:39 in mov_read_trak) updated https://trac.ffmpeg.org/ticket/11460#comment:1
Martchus_ has joined #ffmpeg-devel
Martchus has quit [Ping timeout: 260 seconds]
<aaabbb> is there any reason that framehash etc doesn't use sha1 cpu extensions for sha1 ("SHA160")? or sha2 extensions for the sha2 hashes? or just that no one submitted a patch yet?
<aaabbb> if just no one submitting a patch then i could do it
<jamrial> aaabbb: no one did, yes
<aaabbb> jamrial: do you know off the top of your head what crc32 variant that -hash crc32 uses? is it crc32c or is it the one used by png/zlib?
<aaabbb> if crc32c then it can use the crc32 instructions too
<jamrial> no, it's not that one
<aaabbb> alright only sha1 and sha2 are reasonable to accelerate then
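[editor's note] The two CRC-32 variants aaabbb asks about differ in their generator polynomial, which is why the x86 SSE4.2 `crc32` instruction (CRC32C only) cannot compute the zlib/PNG one. The sketch below is a plain bitwise implementation for illustration; it makes no claim about which parameters FFmpeg's own CRC code uses, and the function and macro names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define POLY_CRC32_ZLIB 0xEDB88320u /* reflected polynomial used by zlib/PNG */
#define POLY_CRC32C     0x82F63B78u /* reflected Castagnoli polynomial (CRC32C) */

/* Bitwise reflected CRC-32: init 0xFFFFFFFF, final XOR 0xFFFFFFFF.
 * Only the polynomial differs between the two variants. */
static uint32_t crc32_reflected(uint32_t poly, const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (poly & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}
```

For the standard check string "123456789" the zlib/PNG variant yields 0xCBF43926 and CRC32C yields 0xE3069283, so the results are not interchangeable.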
tufei has joined #ffmpeg-devel
<Compn> crc32 should be enough for anyone /s
^Neo has quit [Ping timeout: 260 seconds]
MyNetAz has quit [Remote host closed the connection]
MyNetAz has joined #ffmpeg-devel
jamrial has quit []
Guest75 has joined #ffmpeg-devel
Martchus has joined #ffmpeg-devel
Martchus_ has quit [Ping timeout: 252 seconds]
cone-887 has quit [Quit: transmission timeout]
MyNetAz has quit [Remote host closed the connection]
MyNetAz has joined #ffmpeg-devel
rvalue has quit [Read error: Connection reset by peer]
rvalue has joined #ffmpeg-devel
Guest75 has quit [Quit: Client closed]
derpydoo has joined #ffmpeg-devel
derpydoo has quit [Client Quit]
<fflogger> [editedticket] Lynne: Ticket #11418 ([undetermined] stack-buffer-overflow on libavcodec/aacenc_tns.c) updated https://trac.ffmpeg.org/ticket/11418#comment:3
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
MisterMinister has quit [Ping timeout: 252 seconds]
derpydoo has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
darkapex has quit [Ping timeout: 245 seconds]
darkapex has joined #ffmpeg-devel
System_Error has joined #ffmpeg-devel
Guest31 has joined #ffmpeg-devel
Guest31 has quit [Quit: Client closed]
^Neo has joined #ffmpeg-devel
^Neo has quit [Changing host]
^Neo has joined #ffmpeg-devel
abdu has joined #ffmpeg-devel
abdu28 has joined #ffmpeg-devel
abdu has quit [Ping timeout: 240 seconds]
derpydoo has quit [Quit: derpydoo]
abdu56 has joined #ffmpeg-devel
DauntlessOne4 has quit [Quit: Ping timeout (120 seconds)]
DauntlessOne4 has joined #ffmpeg-devel
abdu28 has quit [Ping timeout: 240 seconds]
abdu50 has joined #ffmpeg-devel
abdu56 has quit [Ping timeout: 240 seconds]
abdu13 has joined #ffmpeg-devel
abdu50 has quit [Ping timeout: 240 seconds]
abdu13 has quit [Ping timeout: 240 seconds]
abdu13 has joined #ffmpeg-devel
jamrial has joined #ffmpeg-devel
<BBB> kierank: it can, yes
<BBB> kierank: modify the test to use the proper check macro/function instead of memcmp
<BBB> kierank: and -v will do what you're looking for
<BBB> dav1d uses this everywhere and it's wonderful :)
<kierank> I will try, thanks
<haasn> refactored my swscale-ng design further and now the generic (float) conversion path is actually 25% faster than swscale's dedicated LUT for 8 bit yuv -> yuvj conversion
<haasn> even before any handwritten asm or applying our own LUT optimizations
<haasn> it's a good day when your generic C code outperforms swscale SIMD
<haasn> I'm learning more and more that the smaller and more atomic I make my operations, the better
<haasn> instead of having a single clunky "decode(depth, range)" and "encode(depth, range)" operation I just have a single SWS_OP_CONVERT for int <-> float and a SWS_OP_FMA
<llyyr> how does it compare to zimg? or too premature to do that comparison
<haasn> FMA is great because it can embed everything; range conversions, bit depth conversion, shifting, even clearing out or defaulting unused components (e.g. for gray -> yuva)
<haasn> and FMA composes very well
<haasn> it composes with itself, it composes with matrix multiplications
<haasn> it even composes with LUTs, so if you have any long sequence of such operations they can usually be collapsed down into a single matrix or a single LUT
<haasn> and then all we really need to write is one dedicated dispatch path for fused convert(f32) -> apply_lut/mat/fma -> convert(uint*)
<haasn> the conversion functions are even cheaper in this approach because all they need to handle is clipping, no divisions by 255.0 or whatever (that is embedded into the fma/matrix already)
<haasn> and using floats here really is a natural fit because it can losslessly cover multiple orders of magnitude
<haasn> keeping operations extremely atomic also helps the compiler produce much faster code
<haasn> inlining everything into one mega-operation often makes it slower, not faster
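[editor's note] The claim that FMA operations compose with themselves can be sketched as follows. This is not haasn's code: `AffineOp`, `affine_compose` and `affine_apply` are hypothetical names, and the op is modelled as a scalar y = scale * x + offset. Two such ops applied back to back collapse into one op of the same shape.

```c
/* Hypothetical scalar model of an FMA-style op: y = scale * x + offset */
typedef struct {
    float scale;
    float offset;
} AffineOp;

/* Fuse "first, then second" into a single op:
 * second(first(x)) = s2 * (s1 * x + o1) + o2
 *                  = (s2 * s1) * x + (s2 * o1 + o2) */
static AffineOp affine_compose(AffineOp first, AffineOp second)
{
    AffineOp fused = {
        second.scale * first.scale,
        second.scale * first.offset + second.offset,
    };
    return fused;
}

static float affine_apply(AffineOp op, float x)
{
    return op.scale * x + op.offset;
}
```

As an example, normalizing 8-bit limited-range luma with `{1/219, -16/219}` and then scaling to full range with `{255, 0}` fuses into one op that maps 16 to roughly 0 and 235 to roughly 255, with no separate division step left in the conversion.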
<haasn> llyyr: on yuvtestsrc mpeg -> jpeg
<haasn> frame=100000 fps=7257 q=-0.0 Lsize=N/A time=01:06:39.96 bitrate=N/A speed= 290x swscale
<haasn> frame=100000 fps=13032 q=-0.0 Lsize=N/A time=01:06:40.00 bitrate=N/A speed= 521x zscale
<haasn> frame=100000 fps=14608 q=-0.0 Lsize=N/A time=01:06:40.00 bitrate=N/A speed= 584x swscale-ng
<haasn> \o/
<llyyr> nice
<haasn> or maybe I should call it nuscale for now :^)
<llyyr> lol
<haasn> for yuv444p -> yuv444p10le, my code is 575x vs zscale 781x (vs swscale 416x) but this is not terribly surprising as I don't have a fast path for this yet
<haasn> so that's 575x going through the _float path_
<llyyr> massive glow up for swscale
<haasn> and for yuv444p10be my code and zscale are basically equal already
<haasn> once I add a direct shifting fast path for yuv range conversions it will most likely explode
<haasn> rgba -> bgra my code is 888x vs zscale 295x
<haasn> vs swscale 521x
<haasn> oh, that's not a fair comparison on zscale's part because it had to include another swscale fmt conversion
<haasn> so it's going via bgrap
<haasn> but I suppose that's still a win vs vf_zscale :)
<llyyr> could drop the horrible zimg redirection code in mpv once all this is in a release then
<llyyr> 8.0 (or 7.2?) when
<haasn> 2x faster for rgba -> rgba64le (and be)
<haasn> I still need to implement rgb <-> yuv but with my current design it should probably be a lot faster also
<haasn> the main thing my code does not handle at all atm is scaling
<haasn> and that's still a bit of an open question in terms of the best design, but I have some ideas-ish
<haasn> so also no subsampled input
<haasn> it will be interesting to see how swscale compares against whatever I come up with for yuv420p -> rgb24 because swscale rolls the scaling and conversion into a single kernel
<haasn> whereas my code would most likely do something like, convert luma to float, convert chroma to float, upscale chroma, merge results back together and convert to int
<llyyr> thanks for all the work \o/
<haasn> but with everything being inlined into a tight loop it might not actually be that problematic to handle it this way, you just have a slight bit more function call overhead
<haasn> but fortunately decades of bloated C++ projects have made modern processors stupidly efficient at function calls into cached functions
<haasn> I will point out that we are basically doing on the order of 72 million function calls per second in this benchmark result
<haasn> albeit split across 32 cores
<haasn> or really more like 200M, since for each 16x1 block of pixels we do 1 read call, 1 fused fma conversion, and 1 write call
<haasn> but changing that to 32x1 makes it slower, halving the number of function calls is not worth the loss in cache locality
<haasn> or maybe the compiler is just generating worse code at larger kernel sizes, will see once we add proper hand-written SIMD
<haasn> fortunately this constant can be trivially changed, it is absolutely not a hard design parameter of anything
derpydoo has joined #ffmpeg-devel
<haasn> nice, got another 5%-10% speedup just from using aligned memory reads/writes where possible
<haasn> another thing we get for free from this code is
<haasn> 1) way better performance on unaligned input/output pointers, since all internal processing is done on properly aligned chunks
<haasn> 2) no more overread/overwrite
<haasn> 3) all kernels are easier to implement because they can assume a fixed size, aligned pointers, no need to worry about loops or leftover pixels
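[editor's note] The three properties listed above can be sketched with a row loop that stages every chunk through fixed-size, aligned scratch buffers. This is an assumption-laden illustration, not the swscale-ng code: `BLOCK`, `kernel_shift2` and `process_row` are hypothetical, and the kernel (8-bit to 10-bit by shifting) is only a stand-in for a real operation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16 /* assumed chunk width, mirroring the 16x1 blocks discussed */

/* Example kernel: always sees exactly BLOCK samples in aligned buffers,
 * so it needs no loop bounds or leftover handling of its own. */
static void kernel_shift2(const uint8_t *in, uint16_t *out)
{
    for (int i = 0; i < BLOCK; i++)
        out[i] = (uint16_t)(in[i] << 2);
}

static void process_row(const uint8_t *src, uint16_t *dst, size_t width)
{
    _Alignas(32) uint8_t  in[BLOCK];
    _Alignas(32) uint16_t out[BLOCK];

    for (size_t x = 0; x < width; x += BLOCK) {
        size_t n = width - x < BLOCK ? width - x : BLOCK;

        memset(in, 0, sizeof(in));  /* pad the tail with defined values */
        memcpy(in, src + x, n);     /* read: touches only valid source bytes */
        kernel_shift2(in, out);     /* fixed-size work on aligned scratch */
        memcpy(dst + x, out, n * sizeof(*out)); /* write: never past width */
    }
}
```

Because only the read and write steps touch caller memory, `src` and `dst` may be unaligned and of any width without the kernel overreading or overwriting.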
abdu13 has quit [Ping timeout: 240 seconds]
abdu13 has joined #ffmpeg-devel
<haasn> out of curiosity, it seems like using 64-bit instead of 32-bit float precision is a 38% perf downgrade; I guess we only ever need to consider this in extreme cases
<Lynne> it would be nice to have the option
<haasn> (e.g. to ensure bit exact processing of 32 bpc formats)
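[editor's note] The reason 32 bpc formats would need the 64-bit path: a 32-bit float has a 24-bit significand, so not every 32-bit integer survives a round trip through it, while a double represents all of them exactly. A small sketch with hypothetical helper names:

```c
#include <stdint.h>

/* Returns 1 if v is exactly representable as a 32-bit float.
 * The comparison is done in double, which holds every uint32_t exactly. */
static int roundtrips_f32(uint32_t v)
{
    return (double)(float)v == (double)v;
}

/* Returns 1 if v converts to double and back without change (always true
 * for uint32_t, since double has a 53-bit significand). */
static int roundtrips_f64(uint32_t v)
{
    return (uint32_t)(double)v == v;
}
```

Every 16-bit value passes `roundtrips_f32`, but 16777217 (2^24 + 1) does not, whereas `roundtrips_f64` holds for the full 32-bit range.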
<haasn> Lynne: it's about 10 lines of code to add
<Lynne> sure
<haasn> the main cost is in executable size :)
abdu13 has quit [Ping timeout: 240 seconds]
<Lynne> no one's complained about lavu/tx having double precision transforms yet
<Lynne> I think it may be a bit heavier than swscale-ng
<haasn> it may be _very very_ slightly slower for cache locality to have space for 64 bit coefficients in our ring buffers though
<haasn> I will need to re-evaluate when scaling is implemented
derpydoo has quit [Quit: derpydoo]
<Lynne> ring buffers?
<haasn> Lynne: yeah, for scaling I basically run my input ops outputting directly into a ring buffer, after which the filter function will run to generate one chunk of output data at a time
<haasn> so if we need to allocate enough space in the ring buffer to hold 64 bit values, adjacent chunk entries are further apart
<haasn> likely a non-issue but I'll wait for numbers before deciding
DodoGTA has quit [Quit: DodoGTA]
<Lynne> so if you're doing 10bit packed yuv to 32bit rgb, you'd do chunks of 5*3 components, which then go into a ring buffer?
<Lynne> or chunks as in, you're doing yuv to rgb, and you need to know neighbouring chroma, so you'd do 4x4 blocks, all of which go into a ring buffer?
DodoGTA has joined #ffmpeg-devel
<haasn> I'm not really sure yet how to handle packed yuv
<haasn> I'm also not yet sure how to handle horizontally adjacent blocks, but I suppose in the worst case we would need a buffer: {row 0 chunk 0} {row 0 chunk 1} {row 0 chunk 2} {row 1 chunk 0} {row 1 chunk 1} {row 1 chunk 2} ... as many rows as we have filter taps
<haasn> since you need the left and right adjacent columns for horizontal filtering
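[editor's note] A much-simplified sketch of the ring-buffer idea, covering only the vertical direction for a single chunk column (the horizontally adjacent chunks haasn mentions are left out). All names and sizes here (`RowRing`, `TAPS`, `CHUNK`) are assumptions for illustration, not the actual design.

```c
#include <stddef.h>

#define TAPS  4  /* assumed number of vertical filter taps */
#define CHUNK 16 /* assumed chunk width in samples */

typedef struct {
    float  rows[TAPS][CHUNK]; /* one slot per filter tap */
    size_t pushed;            /* total number of rows pushed so far */
} RowRing;

/* Returns the slot for the next input row, overwriting the oldest one. */
static float *ring_push(RowRing *r)
{
    return r->rows[r->pushed++ % TAPS];
}

/* Row `age` rows behind the newest (0 = newest).
 * Valid only if age < TAPS and age < r->pushed. */
static const float *ring_get(const RowRing *r, size_t age)
{
    return r->rows[(r->pushed - 1 - age) % TAPS];
}

/* Produces one output chunk from the TAPS most recent rows;
 * weights[0] applies to the oldest row. Requires r->pushed >= TAPS. */
static void ring_filter(const RowRing *r, const float weights[TAPS],
                        float out[CHUNK])
{
    for (int i = 0; i < CHUNK; i++) {
        float acc = 0.0f;
        for (size_t t = 0; t < TAPS; t++)
            acc += weights[t] * ring_get(r, TAPS - 1 - t)[i];
        out[i] = acc;
    }
}
```

Switching `rows` to double would double the distance between adjacent slots, which is the cache-locality concern raised above.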
<haasn> for packed yuv we need to read in one chunk and expand it out to 2x2 chunks
<haasn> or rather
<haasn> we need to take one yuyv input chunk and convert it into two y chunks and one uv chunk
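[editor's note] The unpacking step described here, one packed YUYV input chunk becoming two luma chunks and one chunk of chroma pairs, could look roughly like this. It assumes the usual Y0 U Y1 V byte order; `CHUNK` and `unpack_yuyv` are hypothetical names, and u/v are kept as two arrays that together form the single uv chunk.

```c
#include <stdint.h>

#define CHUNK 16 /* assumed chunk width in samples */

/* in:    4 * CHUNK bytes of packed YUYV (Y0 U0 Y1 V0 Y2 U1 Y3 V1 ...),
 *        i.e. 2 * CHUNK luma samples and CHUNK chroma pairs
 * y0/y1: the two resulting luma chunks
 * u/v:   one chunk of chroma pairs, at half horizontal resolution */
static void unpack_yuyv(const uint8_t in[4 * CHUNK],
                        uint8_t y0[CHUNK], uint8_t y1[CHUNK],
                        uint8_t u[CHUNK], uint8_t v[CHUNK])
{
    for (int i = 0; i < CHUNK; i++) {
        y0[i] = in[2 * i];           /* luma samples 0 .. CHUNK-1 */
        y1[i] = in[2 * (i + CHUNK)]; /* luma samples CHUNK .. 2*CHUNK-1 */
        u[i]  = in[4 * i + 1];       /* one U per pair of luma samples */
        v[i]  = in[4 * i + 3];       /* one V per pair of luma samples */
    }
}
```

The chroma chunk still covers twice the width of each luma chunk, so it would then be scaled 2x horizontally before merging.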
derpydoo has joined #ffmpeg-devel
<Lynne> assuming each chunk in the ring buffer has a position and size, I guess you get slice threading for free this way?
<haasn> then do 2x horizontal scaling on the uv chunk
<Lynne> or rather, workqueue-based threading
<haasn> irrelevant, we already thread at a higher level
<haasn> sws_graph_run() internally already splits the image into rows and gives one width x (height / threads) slice to the underlying kernel dispatcher
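[editor's note] Splitting an image into one horizontal slice per thread can be sketched as below. This is not the actual sws_graph_run() logic; `slice_bounds` is a hypothetical helper that shows one common way to hand out roughly height / threads rows per slice while covering every row exactly once.

```c
/* Computes the row range [*start, *end) of slice `idx` out of `nb_slices`.
 * Slices are contiguous, and the remainder rows are spread across slices. */
static void slice_bounds(int height, int nb_slices, int idx,
                         int *start, int *end)
{
    *start = (int)((long long)height * idx / nb_slices);
    *end   = (int)((long long)height * (idx + 1) / nb_slices);
}
```

For a 1080-row image and 32 slices, slice 0 covers rows 0 to 32 and slice 31 ends at row 1080; each slice starts where the previous one ends.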
<haasn> Although.. hmm… if we use a full size buffer instead of a ring buffer then we _could_ share vertically adjacent lines at slice edges
<haasn> Between threads
<haasn> Would need a mutex per row or something to signal row availability
<haasn> I will use my existing microbench framework to test if scaling in a ring buffer vs scaling in a full size buffer actually matters
<Lynne> it would be faster if you could either reuse the input or output (dirtying it) as the intermediate full buffer
<haasn> perhaps, but that's microoptimization stuff and sounds painful to guarantee the safety of
LaserEyess has quit [Quit: fugg]
LaserEyess has joined #ffmpeg-devel
LaserEyess has quit [Changing host]
LaserEyess has joined #ffmpeg-devel
derpydoo has quit [Ping timeout: 252 seconds]
elvis_a_presley has quit [Quit: smoke-bomb ; grapple-hook]
elvis_a_presley has joined #ffmpeg-devel
<kierank> BBB: what's the name of the function? I'm not seeing it in FFmpeg checkasm
<BBB> ../tests/checkasm/checkasm.c:int checkasm_check_##type(const char *file, int line, \
<BBB> this exists, e.g., for DEF_CHECKASM_CHECK_FUNC(uint8_t, "%02x")
<kierank> thanks
<BBB> so that would be checkasm_check_uint8_t(), I think?
<BBB> dav1d has a slightly different system where (for 8+10bit support) we have a "pixel" type, which is either uint8_t or uint16_t
<BBB> so we use checkasm_check_pixel()
<BBB> but you'll get it soon enough... if this is for 10bit content, use checkasm_check_uint16_t
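[editor's note] The pattern BBB points at, a macro that stamps out one check function per element type, can be illustrated with a simplified stand-in. This is not FFmpeg's DEF_CHECKASM_CHECK_FUNC and does not match the real checkasm_check_##type signature (which also takes file, line and strides); `DEF_CHECK_FUNC` and `check_<type>` are hypothetical. On failure it prints a per-element map with `.` for matches and `x` for mismatches.

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in: generates check_<type>(), which compares two
 * w-element rows, prints a match map to `out` on failure (if non-NULL),
 * and returns nonzero when the rows differ. */
#define DEF_CHECK_FUNC(type)                                        \
static int check_##type(FILE *out, const type *ref,                 \
                        const type *tst, int w)                     \
{                                                                   \
    int failed = 0;                                                 \
    for (int i = 0; i < w; i++)                                     \
        failed |= ref[i] != tst[i];                                 \
    if (failed && out) {                                            \
        for (int i = 0; i < w; i++)                                 \
            fputc(ref[i] == tst[i] ? '.' : 'x', out);               \
        fputc('\n', out);                                           \
    }                                                               \
    return failed;                                                  \
}

DEF_CHECK_FUNC(uint8_t)  /* defines check_uint8_t()  */
DEF_CHECK_FUNC(uint16_t) /* defines check_uint16_t() */
```

The type name is pasted into the function name, so 8-bit buffers go through `check_uint8_t()` and 10-bit content stored in 16-bit samples through `check_uint16_t()`.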
<kierank> not 100% sure the ffmpeg port actually works
<kierank> oh it's unrelated
<kierank> macro hell
<kierank> tests seem to pass
<kierank> BBB: thanks
<fflogger> [editedticket] ke4roh: Ticket #8349 ([avcodec] Dolby AC-4 Support) updated https://trac.ffmpeg.org/ticket/8349#comment:93
<BBB> yw
<kierank> hmmm maybe my patch does not work
<kierank> as the test passes all the time :)
<haasn> yuv444p 1920x1080 -> yuv444p10le 1920x1080, flags=0 dither=0, MSE={ 0 1 1 0}
<haasn> time=1513 us, ref=2411 us, speedup=1.593x faster
<haasn> there we go
<haasn> solved the slow path
<another|> Nice
<haasn> fps=20431 versus zscale fps=16321
<Compn> now do 4k
<Compn> 1080p is a long time ago now
<kierank> but that won't work when h = 1?
<kierank> ignore that
* kierank doesn't understand why test always passes
<kierank> memcmp doesn't even get called
Marth64 has joined #ffmpeg-devel
Traneptora has quit [Quit: Quit]
<fflogger> [editedticket] Marth64: Ticket #8349 ([avcodec] Dolby AC-4 Support) updated https://trac.ffmpeg.org/ticket/8349#comment:94
<Marth64> clearly LLM response IMO
Marth64 has quit [Remote host closed the connection]
Marth64 has joined #ffmpeg-devel
<haasn> compn: fps=409 at 4K
<haasn> vs fps=377 swscale
<haasn> Smaller gains at higher resolution; I guess I was mainly benchmarking the insanely lower per-call overhead of my code
MisterMinister has joined #ffmpeg-devel
<Compn> haasn, still 1080p speedup important :)
<haasn> that is just for yuv444p -> yuv444p10, which is already quite optimized in swscale
<haasn> in all other cases we still get massive speedups at 4k :)
System_Error has quit [Ping timeout: 264 seconds]
System_Error has joined #ffmpeg-devel
Marth64 has quit [Remote host closed the connection]
abdu13 has joined #ffmpeg-devel
tufei has quit [Remote host closed the connection]
iive has joined #ffmpeg-devel
tufei has joined #ffmpeg-devel
<welder> How do you format C code? Is there a clang-format or anything of the kind?
abdu13 has quit [Ping timeout: 240 seconds]
<JEEB> yea, clang-format can be utilized if you have a close enough configured config file
<JEEB> although generally editors tend to allow you to have things indented well enough
lemourin has quit [Quit: The Lounge - https://thelounge.chat]
lemourin has joined #ffmpeg-devel
<kierank> BBB: .xx..xx..x........x..x...xx..x..
<kierank> this is helpful!
HarshK23 has quit [Quit: Connection closed for inactivity]
<kierank> jdarnley: do you have a zen4 cpu?
Chagall has quit [Ping timeout: 265 seconds]
Chagalle has joined #ffmpeg-devel
<jdarnley> no
<kierank> Damn
<kierank> My SIMD works but is roughly the same speed on ice lake
<kierank> But avx2 slower