michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.1 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
System_Error has joined #ffmpeg-devel
<cone-887>
ffmpeg Peter Ross master:16f9cfcf4bd8: avcodec/leaddec: support format 0x1006
derpydoo has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
iive has quit [Quit: They came for me...]
Marth64 has quit [Remote host closed the connection]
Kei_N_ has quit [Read error: Connection reset by peer]
Kei_N has joined #ffmpeg-devel
jarthur has quit [Quit: jarthur]
derpydoo has quit [Ping timeout: 245 seconds]
^Neo has joined #ffmpeg-devel
^Neo has quit [Changing host]
^Neo has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
<kierank>
can ffmpeg checkasm show the byte differences
<jamrial>
kierank: each test prints whatever it wants. most tests don't print anything
<jamrial>
the integer ones use memcmp in general, and looping through the arrays manually to print mismatching bytes may be slow
<fflogger>
[newticket] 0x20z: Ticket #11460 ([avformat] SEGV FFmpeg-master/libavformat/mov.c:5209:39 in mov_read_trak) created https://trac.ffmpeg.org/ticket/11460
thilo has quit [Ping timeout: 268 seconds]
thilo has joined #ffmpeg-devel
thilo has quit [Changing host]
thilo has joined #ffmpeg-devel
<cone-887>
ffmpeg James Almer master:43be8d07281c: avformat/mov: check for tts_count before dereferencing tts_data
<aaabbb>
is there any reason framehash etc. doesn't use the SHA-1 CPU extensions for sha1 ("SHA160")? or the SHA-2 extensions for the sha2 hashes? or is it just that no one has submitted a patch yet?
<aaabbb>
if it's just that no one has submitted a patch then i could do it
<jamrial>
aaabbb: no one did, yes
<aaabbb>
jamrial: do you know off the top of your head which crc32 variant -hash crc32 uses? is it crc32c, or the one used by png/zlib?
<aaabbb>
if it's crc32c then it can use the crc32 instructions too
<jamrial>
no, it's not that one
<aaabbb>
alright only sha1 and sha2 are reasonable to accelerate then
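[editor's note: a minimal sketch of why the CPU crc32 instructions only help crc32c, as discussed above: SSE4.2 hard-wires the Castagnoli polynomial, whereas FFmpeg's -hash crc32 uses the zlib/png (IEEE) polynomial, which the instruction cannot compute. The helper name crc32c_hw is illustrative.]

    /* CRC-32C via the SSE4.2 crc32 instruction (x86-64, compile with -msse4.2) */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <nmmintrin.h>

    static uint32_t crc32c_hw(const uint8_t *buf, size_t len)
    {
        uint32_t crc = ~0u;
        while (len >= 8) {
            uint64_t v;
            memcpy(&v, buf, 8);                     /* avoid unaligned deref */
            crc = (uint32_t)_mm_crc32_u64(crc, v);  /* Castagnoli polynomial only */
            buf += 8;
            len -= 8;
        }
        while (len--)
            crc = _mm_crc32_u8(crc, *buf++);
        return ~crc;
    }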
tufei has joined #ffmpeg-devel
<Compn>
crc32 should be enough for anyone /s
^Neo has quit [Ping timeout: 260 seconds]
MyNetAz has quit [Remote host closed the connection]
MyNetAz has joined #ffmpeg-devel
jamrial has quit []
Guest75 has joined #ffmpeg-devel
Martchus has joined #ffmpeg-devel
Martchus_ has quit [Ping timeout: 252 seconds]
cone-887 has quit [Quit: transmission timeout]
MyNetAz has quit [Remote host closed the connection]
MyNetAz has joined #ffmpeg-devel
rvalue has quit [Read error: Connection reset by peer]
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
MisterMinister has quit [Ping timeout: 252 seconds]
derpydoo has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
darkapex has quit [Ping timeout: 245 seconds]
darkapex has joined #ffmpeg-devel
System_Error has joined #ffmpeg-devel
Guest31 has joined #ffmpeg-devel
Guest31 has quit [Quit: Client closed]
^Neo has joined #ffmpeg-devel
^Neo has quit [Changing host]
^Neo has joined #ffmpeg-devel
abdu has joined #ffmpeg-devel
abdu28 has joined #ffmpeg-devel
abdu has quit [Ping timeout: 240 seconds]
derpydoo has quit [Quit: derpydoo]
abdu56 has joined #ffmpeg-devel
DauntlessOne4 has quit [Quit: Ping timeout (120 seconds)]
DauntlessOne4 has joined #ffmpeg-devel
abdu28 has quit [Ping timeout: 240 seconds]
abdu50 has joined #ffmpeg-devel
abdu56 has quit [Ping timeout: 240 seconds]
abdu13 has joined #ffmpeg-devel
abdu50 has quit [Ping timeout: 240 seconds]
abdu13 has quit [Ping timeout: 240 seconds]
abdu13 has joined #ffmpeg-devel
jamrial has joined #ffmpeg-devel
<BBB>
kierank: it can, yes
<BBB>
kierank: modify the test to use the proper check macro/function instead of memcmp
<BBB>
kierank: and -v will do what you're looking for
<BBB>
dav1d uses this everywhere and it's wonderful :)
<kierank>
I will try, thanks
<haasn>
refactored my swscale-ng design further and now the generic (float) conversion path is actually 25% faster than swscale's dedicated LUT for 8 bit yuv -> yuvj conversion
<haasn>
even before any handwritten asm or applying our own LUT optimizations
<haasn>
it's a good day when your generic C code outperforms swscale SIMD
<haasn>
I'm learning more and more that the smaller and more atomic I make my operations, the better
<haasn>
instead of having a single clunky "decode(depth, range)" and "encode(depth, range)" operation I just have a single SWS_OP_CONVERT for int <-> float and a SWS_OP_FMA
<llyyr>
how does it compare to zimg? or is it premature to make that comparison?
<haasn>
FMA is great because it can embed everything; range conversions, bit depth conversion, shifting, even clearing out or defaulting unused components (e.g. for gray -> yuva)
<haasn>
and FMA composes very well
<haasn>
it composes with itself, it composes with matrix multiplications
<haasn>
it even composes with LUTs, so if you have any long sequence of such operations, they can usually be collapsed down into a single matrix or a single LUT
<haasn>
and then all we really need to write is one dedicated dispatch path for fused convert(f32) -> apply_lut/mat/fma -> convert(uint*)
<haasn>
the conversion functions are even cheaper in this approach because all they need to handle is clipping, no divisions by 255.0 or whatever (that is embedded into the fma/matrix already)
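[editor's note: a sketch of the fused path described above, under the stated design; the op and helper names (op_convert_u8_to_f32, fma_compose, etc.) are illustrative, not the actual swscale-ng API. It shows (a) conversion functions that only clip, with all scaling folded into the FMA constants, and (b) how two FMAs collapse into one.]

    #include <math.h>
    #include <stdint.h>

    #define CHUNK 16 /* pixels per block, per the 16x1 discussion below */

    typedef struct { float scale, offset; } FmaOp;

    /* "first, then second": s2*(s1*x + o1) + o2 = (s2*s1)*x + (s2*o1 + o2) */
    static FmaOp fma_compose(FmaOp first, FmaOp second)
    {
        return (FmaOp){ second.scale * first.scale,
                        second.scale * first.offset + second.offset };
    }

    static void op_convert_u8_to_f32(float dst[CHUNK], const uint8_t src[CHUNK])
    {
        for (int i = 0; i < CHUNK; i++)
            dst[i] = src[i];            /* no division by 255.0 here */
    }

    static void op_fma(float v[CHUNK], FmaOp op)
    {
        for (int i = 0; i < CHUNK; i++)
            v[i] = v[i] * op.scale + op.offset;
    }

    static void op_convert_f32_to_u8(uint8_t dst[CHUNK], const float src[CHUNK])
    {
        for (int i = 0; i < CHUNK; i++) {
            float v = rintf(src[i]);    /* only clipping lives here */
            dst[i] = v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
        }
    }

    /* 8-bit limited-range luma -> full range: (x - 16) * 255 / 219 as one fma */
    static void luma_limited_to_full(uint8_t dst[CHUNK], const uint8_t src[CHUNK])
    {
        float tmp[CHUNK];
        op_convert_u8_to_f32(tmp, src);
        op_fma(tmp, (FmaOp){ 255.0f / 219.0f, -16.0f * 255.0f / 219.0f });
        op_convert_f32_to_u8(dst, tmp);
    }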
<haasn>
and using floats here really is a natural fit because they can losslessly cover multiple orders of magnitude
<haasn>
keeping operations extremely atomic also helps the compiler produce much faster code
<haasn>
inlining everything into one mega-operation often makes it slower, not faster
<haasn>
or maybe I should call it nuscale for now :^)
<llyyr>
lol
<haasn>
for yuv444p -> yuv444p10le, my code is 575x vs zscale 781x (vs swscale 416x) but this is not terribly surprising as I don't have a fast path for this yet
<haasn>
so that's 575x going through the _float path_
<llyyr>
massive glow up for swscale
<haasn>
and for yuv444p10be my code and zscale are basically equal already
<haasn>
once I add a direct shifting fast path for yuv range conversions it will most likely explode
<haasn>
rgba -> bgra my code is 888x vs zscale 295x
<haasn>
vs swscale 521x
<haasn>
oh, that's not a fair comparison on zscale's part because it had to include another swscale fmt conversion
<haasn>
so it's going via bgrap
<haasn>
but I suppose that's still a win vs vf_zscale :)
<llyyr>
could drop the horrible zimg redirection code in mpv once all this is in a release then
<llyyr>
8.0 (or 7.2?) when
<haasn>
2x faster for rgba -> rgba64le (and be)
<haasn>
I still need to implement rgb <-> yuv but with my current design it should probably be a lot faster also
<haasn>
the main thing my code does not handle at all atm is scaling
<haasn>
and that's still a bit of an open question in terms of the best design, but I have some ideas-ish
<haasn>
so also no subsampled input
<haasn>
it will be interesting to see how swscale compares against whatever I come up with for yuv420p -> rgb24 because swscale rolls the scaling and conversion into a single kernel
<haasn>
whereas my code would most likely do something like, convert luma to float, convert chroma to float, upscale chroma, merge results back together and convert to int
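[editor's note: the yuv420p -> rgb24 pipeline haasn describes, written out as a hypothetical op sequence; the names are illustrative only, not the actual API.]

    /* hypothetical op list for yuv420p -> rgb24 in the chunk/op model above */
    enum HypotheticalOps {
        CONVERT_Y_TO_F32,    /* convert luma to float             */
        CONVERT_UV_TO_F32,   /* convert chroma to float           */
        UPSCALE_CHROMA_2X2,  /* upscale subsampled chroma         */
        MERGE_PLANES,        /* merge results back together       */
        FMA_MATRIX_YUV2RGB,  /* fused range/offset + color matrix */
        CONVERT_F32_TO_U8,   /* clip and convert back to int      */
    };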
<llyyr>
thanks for all the work \o/
<haasn>
but with everything being inlined into a tight loop it might not actually be that problematic to handle it this way, you just have slightly more function call overhead
<haasn>
but fortunately decades of bloated C++ projects have made modern processors stupidly efficient at function calls into cached functions
<haasn>
I will point out that we are basically doing on the order of 72 million function calls per second in this benchmark result
<haasn>
albeit split across 32 cores
<haasn>
or really more like 200M, since for each 16x1 block of pixels we do 1 read call, 1 fused fma conversion, and 1 write call
<haasn>
but changing that to 32x1 makes it slower, halving the number of function calls is not worth the loss in cache locality
<haasn>
or maybe the compiler is just generating worse code at larger kernel sizes, will see once we add proper hand-written SIMD
<haasn>
fortunately this constant can be trivially changed, it is absolutely not a hard design parameter of anything
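[editor's note: a sketch of the per-chunk dispatch being counted above (an assumed shape, not the actual swscale-ng code): one read, one fused fma, and one write call per 16x1 block, driven through function pointers, so 3 calls per block.]

    #include <stdint.h>

    #define CHUNK 16 /* the constant haasn says is trivially changeable */

    typedef struct SwsOpChainSketch {
        void (*read)(float dst[CHUNK], const uint8_t *src);
        void (*fma)(float v[CHUNK], float scale, float offset);
        void (*write)(uint8_t *dst, const float src[CHUNK]);
        float scale, offset;
    } SwsOpChainSketch;

    static void run_row(const SwsOpChainSketch *c, uint8_t *dst,
                        const uint8_t *src, int width)
    {
        float tmp[CHUNK];
        /* width assumed a multiple of CHUNK here; tail handling is
         * sketched a few messages below */
        for (int x = 0; x < width; x += CHUNK) {
            c->read(tmp, src + x);              /* 1 read call            */
            c->fma(tmp, c->scale, c->offset);   /* 1 fused fma conversion */
            c->write(dst + x, tmp);             /* 1 write call           */
        }
    }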
derpydoo has joined #ffmpeg-devel
<haasn>
nice, got another 5%-10% speedup just from using aligned memory reads/writes where possible
<haasn>
another thing we get for free from this code is
<haasn>
1) way better performance on unaligned input/output pointers, since all internal processing is done on properly aligned chunks
<haasn>
2) no more overread/overwrite
<haasn>
3) all kernels are easier to implement because they can assume a fixed size, aligned pointers, no need to worry about loops or leftover pixels
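[editor's note: a sketch of how points 1-3 above fall out of fixed-size chunks (an assumed approach): leftover pixels are staged through an aligned, zero-padded buffer, so the kernel always sees a full aligned CHUNK and never reads or writes past the real row.]

    #include <stdalign.h>
    #include <stdint.h>
    #include <string.h>

    #define CHUNK 16

    static void run_row_tail(void (*kernel)(uint8_t dst[CHUNK],
                                            const uint8_t src[CHUNK]),
                             uint8_t *dst, const uint8_t *src, int width)
    {
        int x = width - width % CHUNK;          /* full chunks handled elsewhere */
        if (x < width) {
            alignas(64) uint8_t in[CHUNK] = {0}, out[CHUNK];
            memcpy(in, src + x, width - x);     /* no overread of src  */
            kernel(out, in);                    /* fixed size, aligned */
            memcpy(dst + x, out, width - x);    /* no overwrite of dst */
        }
    }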
abdu13 has quit [Ping timeout: 240 seconds]
abdu13 has joined #ffmpeg-devel
<haasn>
out of curiosity, it seems like using 64-bit instead of 32-bit float precision is a 38% perf downgrade; I guess we only ever need to consider this in extreme cases
<Lynne>
it would be nice to have the option
<haasn>
(e.g. to ensure bit exact processing of 32 bpc formats)
<haasn>
Lynne: it's about 10 lines of code to add
<Lynne>
sure
<haasn>
the main cost is in executable size :)
abdu13 has quit [Ping timeout: 240 seconds]
<Lynne>
no one's complained about lavu/tx having double precision transforms yet
<Lynne>
I think it may be a bit heavier than swscale-ng
<haasn>
it may be _very very_ slightly slower for cache locality to have space for 64 bit coefficients in our ring buffers though
<haasn>
I will need to re-evaluate when scaling is implemented
derpydoo has quit [Quit: derpydoo]
<Lynne>
ring buffers?
<haasn>
Lynne: yeah, for scaling I basically run my input ops outputting directly into a ring buffer, after which the filter function will run to generate one chunk of output data at a time
<haasn>
so if we need to allocate enough space in the ring buffer to hold 64 bit values, adjacent chunk entries are further apart
<haasn>
likely a non-issue but I'll wait for numbers before deciding
DodoGTA has quit [Quit: DodoGTA]
<Lynne>
so if you're doing 10bit packed yuv to 32bit rgb, you'd do chunks of 5*3 components, which then go into a ring buffer?
<Lynne>
or chunks as in, you're doing yuv to rgb, and you need to know neighbouring chroma, so you'd do 4x4 blocks, all of which go into a ring buffer?
DodoGTA has joined #ffmpeg-devel
<haasn>
I'm not really sure yet how to handle packed yuv
<haasn>
I'm also not yet sure how to handle horizontally adjacent blocks, but I suppose in the worst case we would need a buffer: {row 0 chunk 0} {row 0 chunk 1} {row 0 chunk 2} {row 1 chunk 0} {row 1 chunk 1} {row 1 chunk 2} ... as many rows as we have filter taps
<haasn>
since you need the left and right adjacent columns for horizontal filtering
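[editor's note: a sketch of the ring-buffer layout just described, with the layout assumed from the discussion: one row of chunks per vertical filter tap, indexed modulo the tap count so rows are recycled as the filter advances down the image. Names are illustrative.]

    #include <stdint.h>

    #define CHUNK 16

    typedef struct RingBufSketch {
        float *data;            /* taps * chunks_per_row * CHUNK floats */
        int    taps;            /* vertical filter taps                 */
        int    chunks_per_row;
    } RingBufSketch;

    /* chunk c of absolute source row r: {row r%taps, chunk c} in the ring */
    static float *ring_chunk(RingBufSketch *rb, int r, int c)
    {
        return rb->data + ((r % rb->taps) * rb->chunks_per_row + c) * CHUNK;
    }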
<haasn>
for packed yuv we need to read in one chunk and expand it out to 2x2 chunks
<haasn>
or rather
<haasn>
we need to take one yuyv input chunk and convert it into two y chunks and one uv chunk
derpydoo has joined #ffmpeg-devel
<Lynne>
assuming each chunk in the ring buffer has a position and size, I guess you get slice threading for free this way?
<haasn>
then do 2x horizontal scaling on the uv chunk
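[editor's note: a sketch of the yuyv split just described, with assumed shapes: one packed YUYV block of 2*CHUNK pixels becomes two y chunks plus one uv chunk, and the uv chunk then gets the 2x horizontal upscale.]

    #include <stdint.h>

    #define CHUNK 16

    /* src holds 2*CHUNK pixels packed as Y0 U Y1 V ... (4*CHUNK bytes) */
    static void unpack_yuyv(const uint8_t src[4 * CHUNK],
                            uint8_t y0[CHUNK], uint8_t y1[CHUNK],
                            uint8_t uv[2 * CHUNK])
    {
        for (int i = 0; i < CHUNK; i++) {
            y0[i]         = src[2 * i];           /* Y of pixels 0..CHUNK-1        */
            y1[i]         = src[2 * (CHUNK + i)]; /* Y of pixels CHUNK..2*CHUNK-1  */
            uv[i]         = src[4 * i + 1];       /* U */
            uv[CHUNK + i] = src[4 * i + 3];       /* V */
        }
    }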
<Lynne>
or rather, workqueue-based threading
<haasn>
irrelevant, we already thread at a higher level
<haasn>
sws_graph_run() internally already splits the image into rows and gives one width x (height / threads) slice to the underlying kernel dispatcher
<haasn>
Although… hmm… if we use a full size buffer instead of a ring buffer then we _could_ share vertically adjacent lines at slice edges
<haasn>
Between threads
<haasn>
Would need a mutex per row or something to signal row availability
<haasn>
I will use my existing microbench framework to test if scaling in a ring buffer vs scaling in a full size buffer actually matters
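[editor's note: a sketch of the per-row signalling floated above; it is speculative in the discussion itself, so treat this purely as an illustration: one flag per shared row, guarded by a mutex/cond pair, lets a thread scaling near its slice edge wait for the neighbouring thread to publish the rows it needs.]

    #include <pthread.h>
    #include <stdbool.h>

    typedef struct RowGate { /* init fields with pthread_mutex_init/pthread_cond_init */
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        bool            ready;  /* row written to the full-size buffer */
    } RowGate;

    static void row_publish(RowGate *g)
    {
        pthread_mutex_lock(&g->lock);
        g->ready = true;
        pthread_cond_broadcast(&g->cond);
        pthread_mutex_unlock(&g->lock);
    }

    static void row_wait(RowGate *g)
    {
        pthread_mutex_lock(&g->lock);
        while (!g->ready)
            pthread_cond_wait(&g->cond, &g->lock);
        pthread_mutex_unlock(&g->lock);
    }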
<Lynne>
it would be faster if you could either reuse the input or output (dirtying it) as the intermediate full buffer
<haasn>
perhaps, but that's microoptimization stuff and sounds painful to guarantee the safety of
LaserEyess has quit [Quit: fugg]
LaserEyess has joined #ffmpeg-devel
LaserEyess has quit [Changing host]
LaserEyess has joined #ffmpeg-devel
derpydoo has quit [Ping timeout: 252 seconds]
elvis_a_presley has quit [Quit: smoke-bomb ; grapple-hook]
elvis_a_presley has joined #ffmpeg-devel
<kierank>
BBB: what's the name of the function? I'm not seeing it in FFmpeg checkasm
<BBB>
../tests/checkasm/checkasm.c:int checkasm_check_##type(const char *file, int line, \
<BBB>
this exists, e.g., for DEF_CHECKASM_CHECK_FUNC(uint8_t, "%02x")
<kierank>
thanks
<BBB>
so that would be checkasm_check_uint8_t(), I think?
<BBB>
dav1d has a slightly different system where (for 8+10bit support) we have a "pixel" type, which is either uint8_t or uint16_t
<BBB>
so we use checkasm_check_pixel()
<BBB>
but you'll get it soon enough... if this is for 10bit content, use checkasm_check_uint16_t
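[editor's note: what the switch BBB describes looks like in a checkasm test, as a sketch; the buffer names, W/H sizes, and the "dst" label are illustrative. checkasm_check() takes byte strides, registers the failure itself, and with -v prints the mismatching bytes.]

    /* before: a silent pass/fail via memcmp */
    if (memcmp(dst_ref, dst_new, W * H))
        fail();

    /* after: for 10bit content use checkasm_check(uint16_t, ...) instead */
    checkasm_check(uint8_t, dst_ref, W, dst_new, W, W, H, "dst");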