michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.1.1 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
LainExperiments has quit [Ping timeout: 240 seconds]
Traneptora has quit [Quit: Quit]
LainExperiments has joined #ffmpeg-devel
IndecisiveTurtle has joined #ffmpeg-devel
thilo has quit [Ping timeout: 265 seconds]
thilo has joined #ffmpeg-devel
Marth64[m] has joined #ffmpeg-devel
Marth64 has quit [Ping timeout: 246 seconds]
LainExperiments has quit [Quit: Client closed]
IndecisiveTurtle has quit [Ping timeout: 272 seconds]
realies has quit [Quit: ~]
realies has joined #ffmpeg-devel
abdu has joined #ffmpeg-devel
<cone-197>
ffmpeg Michael Niedermayer master:0e917389fe73: avcodec/exr: do not output 32bit floats when a file stores 16bit floats
LainExperiments has joined #ffmpeg-devel
Aadil has quit [Ping timeout: 240 seconds]
abdu has quit [Ping timeout: 240 seconds]
LainExperiments has quit [Ping timeout: 240 seconds]
Tanay has quit [Remote host closed the connection]
Tanay has joined #ffmpeg-devel
ahmedhamed has quit [Quit: Connection closed for inactivity]
^Neo has quit [Ping timeout: 244 seconds]
Aadil has joined #ffmpeg-devel
Martchus_ has joined #ffmpeg-devel
Martchus has quit [Ping timeout: 260 seconds]
mkver has quit [Quit: Leaving]
HarshK23 has joined #ffmpeg-devel
twelve has joined #ffmpeg-devel
twelve has quit [Remote host closed the connection]
twelve has joined #ffmpeg-devel
cone-197 has quit [Quit: transmission timeout]
Aadil has quit [Ping timeout: 240 seconds]
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
Coinflipper has quit [Quit: ]
Coinflipper has joined #ffmpeg-devel
_av500_ has quit [Remote host closed the connection]
derpydoo has joined #ffmpeg-devel
av500 has joined #ffmpeg-devel
<haasn>
are we allowed to use GCC vector extensions?
<haasn>
afaict they are not portable to MSVC etc
<compnn>
the gcc police will get you /s
<nevcairiel>
code has to compile on msvc, and if it's substantially slower due to disabled code, that would be yet another argument for using x86inc asm :P
<JEEB>
> translate vector optimization results via godbolt or local dumping into x86inc
<haasn>
well
<haasn>
the reason I ask is because there are optimizations we could do with vectors that would be plain impossible even with hand written SIMD (unless we want to abandon standard calling convention)
<haasn>
in particular, if functions accept and return vectors, gcc will directly return results in %ymm0 etc
<haasn>
this saves a memory roundtrip
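A minimal sketch of the vector-extension idea described here, assuming GCC or Clang (MSVC does not accept `vector_size`). The names `v4i32` and `scale_add` are hypothetical and only serve the illustration. A 16-byte vector is used so that it builds without AVX flags; on x86-64 System V such a value is passed and returned in %xmm0, and the 32-byte equivalent built with AVX enabled would use %ymm0.

```c
#include <assert.h>
#include <stdint.h>

/* GCC/Clang vector extension: four 32-bit lanes in one 16-byte value. */
typedef int32_t v4i32 __attribute__((vector_size(16)));
static_assert(sizeof(v4i32) == 16, "four 32-bit lanes");

/* Hypothetical per-lane operation. The vector is taken and returned by
 * value, so the compiler can keep it in a vector register across the call
 * instead of storing it to memory and reloading it. */
static inline v4i32 scale_add(v4i32 x, int32_t scale, int32_t offset)
{
    /* Scalar operands are broadcast to every lane. */
    return x * scale + offset;
}
```

Called as `v4i32 r = scale_add(v, 3, 1);`, the result never has to go through a pointer argument.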
<haasn>
let me estimate the speedup
<nevcairiel>
sounds like you are already abandoning standard calling conventions
twelve has quit [Ping timeout: 265 seconds]
<haasn>
well okay, but the point is that this would retain compatibility with the C code
<nevcairiel>
msvc actually specifies the __vectorcall calling convention which supports up to 4 vector registers as return value
<nevcairiel>
not sure how compatible that is with gcc's extension, but you are still stuck with many other compilers i'm sure
<haasn>
Clearly we need a C language extension for vectors
th3synth4x has joined #ffmpeg-devel
twelve has joined #ffmpeg-devel
twelve has quit [Remote host closed the connection]
<haasn>
Admittedly the vectorization scheme would break down in any case because you can’t then also return _multiple_ vectors from a single function
<haasn>
So I can’t even benchmark the hypothetical speedup
<haasn>
I think what we will want to do to push maximum performance in practice is devise our own internal calling convention for the pure asm impl
<haasn>
Where we just keep four vectors reserved for I/O
<wbs>
custom calling conventions are fine between two asm functions where we control everything, but I wouldn't attempt doing that between asm and C
<haasn>
Right
<haasn>
That would require calling into only asm functions, so it’s something we could only attempt once we have 100% coverage anyway
<JEEB>
dav1d basically had something like that, I think? (custom calling convention within asm)
<wbs>
yes, within the itx functions
<haasn>
For now the round trip through L1 is not the end of the world
<haasn>
The overhead of going from say 4 to 5 function calls is almost nothing compared to the overhead of the functions themselves
<JEEB>
:)
<haasn>
The fastest speedup comes from defining subsets of functions that operate only on some components
<haasn>
Maybe it would have made sense to define per-component and per-pixel operations separately
<haasn>
But without vector extensions that wouldn’t be worth it anyway
<haasn>
wbs: btw, maybe you have an idea about how to do an efficient swizzle in asm? Imagine you have a void func(pixel *a, pixel *b, pixel *c, pixel *d, swizzle_t mask); which should permute the contents of the pointers according to the swizzle mask
<haasn>
For example swapping *a and *b
<wbs>
haasn: swapping the pointers themselves, or the contents they point to?
<haasn>
The contents
<wbs>
hmm, not sure really, that sounds quite non-idiomatic
<haasn>
Assume fixed size, eg 16 elements or w/e
<haasn>
Well in reality it is a SoA representation of a block
<haasn>
So you have struct chunk { pixel x[], y[], z[], w[] };
<haasn>
And I need some routine for swizzle(chunk *x, mask)
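A plain C reference for the swizzle(chunk *x, mask) routine being asked about might look like the sketch below. Everything beyond the chunk layout is an assumption: the 16-bit `pixel` type, `CHUNK_SIZE` of 16, and in particular the mask encoding (2 bits per output component), which the discussion does not specify.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 16            /* "fixed size, eg 16 elements" */
typedef uint16_t pixel;          /* assumption: 16-bit components */

/* SoA block: one array per component. */
struct chunk {
    pixel x[CHUNK_SIZE], y[CHUNK_SIZE], z[CHUNK_SIZE], w[CHUNK_SIZE];
};
static_assert(sizeof(struct chunk) == 4 * CHUNK_SIZE * sizeof(pixel),
              "components are stored back to back");

/* Hypothetical mask encoding: 2 bits per output component, lowest bits
 * first. Output component i is read from input (mask >> 2*i) & 3. */
typedef uint8_t swizzle_t;
#define SWIZZLE(a, b, c, d) ((swizzle_t)((a) | (b) << 2 | (c) << 4 | (d) << 6))

static void swizzle(struct chunk *c, swizzle_t mask)
{
    /* Work from a copy so that swaps and repeated sources both behave. */
    const struct chunk src = *c;
    const pixel *in[4] = { src.x, src.y, src.z, src.w };
    pixel *out[4] = { c->x, c->y, c->z, c->w };

    for (int i = 0; i < 4; i++)
        memcpy(out[i], in[(mask >> (2 * i)) & 3], sizeof(src.x));
}
```

With this encoding, swapping the first two components is `swizzle(&c, SWIZZLE(1, 0, 2, 3))`. The full copy is what an asm version would want to avoid, which is why the question is non-trivial.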
<haasn>
Obviously when the final input / output is planar we could just swap the pointers directly
<haasn>
But it’s not always possible to eliminate these swizzle operations
<haasn>
For example, on decoding rgb30 vs bgr30; we have one unpack101010 operation which unpacks the 32 bit pixel into four 16 bit values in a fixed order in the chunk struct
<haasn>
And then I may need to rearrange this to repack it differently (eg as rgb565)
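To make the unpack/repack example concrete, here is a sketch in C. The bit layout (2:10:10:10 from the most significant bit down) and the function bodies are assumptions for illustration, not the actual implementation. `pack565` simply drops low bits, whereas a real pipeline would likely handle the depth change as a separate step.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE 16
typedef uint16_t pixel;
static_assert(sizeof(pixel) * 8 >= 10, "pixel must hold a 10-bit component");

struct chunk {
    pixel x[CHUNK_SIZE], y[CHUNK_SIZE], z[CHUNK_SIZE], w[CHUNK_SIZE];
};

/* Assumed layout, most significant bits first: 2 | 10 | 10 | 10.
 * The fields always land in w, x, y, z in that fixed order, regardless
 * of whether the source was rgb30 or bgr30. */
static void unpack101010(struct chunk *dst, const uint32_t *src)
{
    for (int i = 0; i < CHUNK_SIZE; i++) {
        uint32_t p = src[i];
        dst->x[i] = (p >> 20) & 0x3FF;
        dst->y[i] = (p >> 10) & 0x3FF;
        dst->z[i] =  p        & 0x3FF;
        dst->w[i] =  p >> 30;
    }
}

/* Pack x, y, z as 5:6:5, keeping the top bits of each 10-bit value. */
static void pack565(uint16_t *dst, const struct chunk *src)
{
    for (int i = 0; i < CHUNK_SIZE; i++)
        dst[i] = (uint16_t)((src->x[i] >> 5) << 11 |
                            (src->y[i] >> 4) << 5  |
                            (src->z[i] >> 5));
}
```

Because both routines assume a fixed component order, converting a format whose red and blue are the other way round needs x and z exchanged between the two calls, which is the swizzle step under discussion.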
<haasn>
Sure we could jump through hoops to try and imbue the unpack itself with an extra swizzle mask
<haasn>
But that just makes the problem difficult elsewhere
twelve has joined #ffmpeg-devel
<haasn>
what I did in my implementation now is essentially having one routine for each possible swizzle, or at least each type possible in practice (currently there are 13)
<haasn>
and then choosing the right one at init time
<haasn>
it's not too bad because they are quite small, each one is only a few opcodes
<haasn>
but I wanted to expand this to allow repetitions in the swizzle mask as well which would raise it to 256 possibilities in theory
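The "one routine per swizzle, chosen at init time" approach can be sketched in C as a lookup that returns a function pointer, with a generic fallback for masks that have no specialised routine. All names here (`pick_swizzle`, `swizzle_yxzw`, the 2-bits-per-component mask) are hypothetical. With four sources per output component, an 8-bit mask covers exactly 4^4 = 256 cases, of which 4! = 24 are pure permutations.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 16
typedef uint16_t pixel;
struct chunk {
    pixel x[CHUNK_SIZE], y[CHUNK_SIZE], z[CHUNK_SIZE], w[CHUNK_SIZE];
};

typedef uint8_t swizzle_t;   /* hypothetical: 2 bits per output component */
#define SWIZZLE(a, b, c, d) ((swizzle_t)((a) | (b) << 2 | (c) << 4 | (d) << 6))
static_assert(SWIZZLE(3, 3, 3, 3) == 255, "8 bits cover all 256 masks");

typedef void (*swizzle_fn)(struct chunk *c, swizzle_t mask);

static void swap_rows(pixel *a, pixel *b)
{
    pixel tmp[CHUNK_SIZE];
    memcpy(tmp, a, sizeof(tmp));
    memcpy(a, b, sizeof(tmp));
    memcpy(b, tmp, sizeof(tmp));
}

/* Specialised routines: each ignores the mask and does one fixed job. */
static void swizzle_noop(struct chunk *c, swizzle_t m) { (void)c; (void)m; }
static void swizzle_yxzw(struct chunk *c, swizzle_t m) { (void)m; swap_rows(c->x, c->y); }
static void swizzle_zyxw(struct chunk *c, swizzle_t m) { (void)m; swap_rows(c->x, c->z); }

/* Generic fallback: handles any mask, including repeated sources. */
static void swizzle_generic(struct chunk *c, swizzle_t mask)
{
    const struct chunk src = *c;
    const pixel *in[4] = { src.x, src.y, src.z, src.w };
    pixel *out[4] = { c->x, c->y, c->z, c->w };
    for (int i = 0; i < 4; i++)
        memcpy(out[i], in[(mask >> (2 * i)) & 3], sizeof(src.x));
}

/* Called once at init time; the per-block hot path is an indirect call. */
static swizzle_fn pick_swizzle(swizzle_t mask)
{
    switch (mask) {
    case SWIZZLE(0, 1, 2, 3): return swizzle_noop;
    case SWIZZLE(1, 0, 2, 3): return swizzle_yxzw;
    case SWIZZLE(2, 1, 0, 3): return swizzle_zyxw;
    default:                  return swizzle_generic;
    }
}
```

The fallback keeps the number of hand-written routines bounded by what occurs in practice, while still accepting every one of the 256 masks.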
th3synth4x has quit [Ping timeout: 240 seconds]
derpydoo has quit [Ping timeout: 268 seconds]
th3synth4x has joined #ffmpeg-devel
<kurosu>
(late because stuck in buffer) For #11363, I think ubitux' colleague might be of some help, as he contributed and was interested some years ago in this topic. mateo` maybe ?
ccawley2011__ has quit [Ping timeout: 260 seconds]
<haasn>
ah, it's fewer op codes actually, because of widening shifts
abdu62 has quit [Quit: Client closed]
abdu62 has joined #ffmpeg-devel
<Lynne>
binary matters more for this optimization, with a *257 you need either a gpr load with immediate (wastes l1i), or a rodata (wastes l1c), and for vectors, even worse
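For context, one place a *257 commonly shows up is expanding an 8-bit value to the full 16-bit range. Whether that is the exact operation meant here is an assumption. The sketch below shows that the multiply can be replaced by a shift and an OR, which needs no constant in a register or in read-only data. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static_assert(255 * 257 == 65535, "8-bit max maps to 16-bit max");

/* Multiply form: the constant 257 has to come from an immediate load
 * or from memory. */
static inline uint16_t expand8_mul(uint8_t v)
{
    return (uint16_t)(v * 257);
}

/* Shift form: v * 257 == v * 256 + v == (v << 8) | v, because the low
 * eight bits of v << 8 are zero. No constant is needed. */
static inline uint16_t expand8_shift(uint8_t v)
{
    return (uint16_t)(v << 8 | v);
}
```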
SohamK has joined #ffmpeg-devel
SohamK has quit [Quit: Client closed]
abdu62 has quit [Ping timeout: 240 seconds]
jamrial has quit [Read error: Connection reset by peer]
keith has quit [Remote host closed the connection]
keith has joined #ffmpeg-devel
<haasn>
Overall speedup=1.870x faster, min=0.235x max=23.406x // another 11% speedup on average \o/
<haasn>
and now all of the weird low bit depth format conversions are faster than swscale again
jamrial has joined #ffmpeg-devel
keith has quit [Remote host closed the connection]
cone-806 has quit [Quit: transmission timeout]
<compnn>
did you test 160x120 and 320x240 res conversions? /s
keith has joined #ffmpeg-devel
<compnn>
i guess there are a bunch of low bit depths too
<compnn>
15bpp :D
<compnn>
i'm not sure we have organized bit depth sample collection though
<compnn>
but if you need one, mostly video game and early codecs, it could be collected.
<compnn>
i remember 8bpp was that gif ?
ccawley2011 has joined #ffmpeg-devel
Sean_McG has quit [Quit: leaving]
ccawley2011_ has quit [Ping timeout: 252 seconds]
ccawley2011_ has joined #ffmpeg-devel
ccawley2011 has quit [Ping timeout: 252 seconds]
ccawley2011 has joined #ffmpeg-devel
ccawley2011__ has joined #ffmpeg-devel
ccawley2011_ has quit [Ping timeout: 260 seconds]
ccawley2011 has quit [Ping timeout: 265 seconds]
ccawley2011__ has quit [Ping timeout: 252 seconds]
<haasn>
Overall speedup=1.931x faster, min=0.245x max=24.129x almost at the 2x
<haasn>
Now the only slow paths left are missing LUT optimizations (e.g. for treating rgb8 as pal8), missing optimized packed read/write fast paths (the MMX code in swscale vastly outperforms our code on these), and one corner case where we do unnecessary swizzling on planar reads/writes