michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.1.1 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
LainExperiments has quit [Ping timeout: 240 seconds]
Traneptora has quit [Quit: Quit]
LainExperiments has joined #ffmpeg-devel
IndecisiveTurtle has joined #ffmpeg-devel
thilo has quit [Ping timeout: 265 seconds]
thilo has joined #ffmpeg-devel
Marth64[m] has joined #ffmpeg-devel
Marth64 has quit [Ping timeout: 246 seconds]
LainExperiments has quit [Quit: Client closed]
IndecisiveTurtle has quit [Ping timeout: 272 seconds]
realies has quit [Quit: ~]
realies has joined #ffmpeg-devel
abdu has joined #ffmpeg-devel
ffmpeg Michael Niedermayer master:0e917389fe73: avcodec/exr: do not output 32bit floats when a file stores 16bit floats
LainExperiments has joined #ffmpeg-devel
Aadil has quit [Ping timeout: 240 seconds]
abdu has quit [Ping timeout: 240 seconds]
LainExperiments has quit [Ping timeout: 240 seconds]
Tanay has quit [Remote host closed the connection]
Tanay has joined #ffmpeg-devel
ahmedhamed has quit [Quit: Connection closed for inactivity]
^Neo has quit [Ping timeout: 244 seconds]
Aadil has joined #ffmpeg-devel
Martchus_ has joined #ffmpeg-devel
Martchus has quit [Ping timeout: 260 seconds]
mkver has quit [Quit: Leaving]
HarshK23 has joined #ffmpeg-devel
twelve has joined #ffmpeg-devel
twelve has quit [Remote host closed the connection]
twelve has joined #ffmpeg-devel
cone-197 has quit [Quit: transmission timeout]
Aadil has quit [Ping timeout: 240 seconds]
System_Error has quit [Remote host closed the connection]
System_Error has joined #ffmpeg-devel
Coinflipper has quit [Quit: ]
Coinflipper has joined #ffmpeg-devel
_av500_ has quit [Remote host closed the connection]
derpydoo has joined #ffmpeg-devel
av500 has joined #ffmpeg-devel
are we allowed to use GCC vector extensions?
afaict they are not portable to MSVC etc
the gcc police will get you /s
code has to compile on msvc, and if it's substantially slower due to disabled code, that would be yet another argument for using x86inc asm :P
> translate vector optimization results via godbolt or local dumping into x86inc
the reason I ask is because there are optimizations we could do with vectors that would be plain impossible even with hand written SIMD (unless we want to abandon standard calling convention)
in particular, if functions accept and return vectors, gcc will directly return results in %ymm0 etc
this saves a memory roundtrip
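A minimal sketch of the idea, assuming GCC or Clang (the `vector_size` attribute is not accepted by MSVC). The type `u16x8` and both functions are hypothetical names for illustration; 128-bit vectors are used instead of the 256-bit %ymm case so that no `-mavx` flag is needed. Whether the values actually stay in registers depends on the target ABI and on inlining.

```c
#include <stdint.h>

/* GCC/Clang vector extension: eight 16-bit lanes in one 128-bit vector. */
typedef uint16_t u16x8 __attribute__((vector_size(16)));

/* Accepts and returns a vector by value. On x86-64 SysV the compiler can
 * pass the argument and the result in SIMD registers. Hypothetical op. */
static inline u16x8 add_bias(u16x8 v, uint16_t bias)
{
    u16x8 b = { bias, bias, bias, bias, bias, bias, bias, bias };
    return v + b;   /* element-wise add */
}

/* Chained ops: the intermediate vector is passed by value, so nothing in
 * the source forces it through a pointer to memory. */
static inline u16x8 scale_then_bias(u16x8 v, uint16_t bias)
{
    u16x8 two = { 2, 2, 2, 2, 2, 2, 2, 2 };
    return add_bias(v * two, bias);   /* element-wise multiply, then add */
}
```

Inspecting the generated assembly (for example on godbolt) is the only reliable way to confirm what a given compiler does here.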
let me estimate the speedup
sounds like you are already abandoning standard calling conventions
twelve has quit [Ping timeout: 265 seconds]
well okay, but the point is that this would retain compatibility with the C code
msvc actually specifies the __vectorcall calling convention which supports up to 4 vector registers as return value
not sure how compatible that is with gcc's extension, but you are still stuck with many other compilers i'm sure
Clearly we need a C language extension for vectors
th3synth4x has joined #ffmpeg-devel
twelve has joined #ffmpeg-devel
twelve has quit [Remote host closed the connection]
Admittedly the vectorization scheme would break down in any case because you can’t then also return _multiple_ vectors from a single function
So I can’t even benchmark the hypothetical speedup
I think what we will want to do to push maximum performance in practice is devise our own internal calling convention for the pure asm impl
Where we just keep four vectors reserved for I/O
custom calling conventions are fine between two asm functions where we control everything, but I wouldn't attempt doing that between asm and C
That would require calling into only asm functions, so it’s something we could only attempt once we have 100% coverage anyway
dav1d basically had something like that, I think? (custom calling convention within asm)
yes, within the itx functions
For now the round trip through L1 is not the end of the world
The overhead of going from say 4 to 5 function calls is almost nothing compared to the overhead of the functions themselves
The biggest speedup comes from defining subsets of functions that operate only on some components
Maybe it would have made sense to define per-component and per-pixel operations separately
But without vector extensions that wouldn’t be worth it anyway
wbs: btw, maybe you have an idea about how to do an efficient swizzle in asm? Imagine you have a void func(pixel *a, pixel *b, pixel *c, pixel *d, swizzle_t mask); which should permute the contents of the pointers according to the swizzle mask
For example swapping *a and *b
haasn: swapping the pointers themselves, or the contents they point to?
The contents
hmm, not sure really, that sounds quite non-idiomatic
Assume fixed size, eg 16 elements or w/e
Well in reality it is a SoA representation of a block
So you have struct chunk { pixel x[], y[], z[], w[] };
And I need some routine for swizzle(chunk *x, mask)
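A plain C reference for the routine being asked about, as a sketch only. The chunk width of 16, the 16-bit `pixel` type and the 2-bits-per-component encoding of `swizzle_t` (including the `SWIZZLE` macro) are assumptions made for illustration; the chat does not specify them.

```c
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 16            /* assumed fixed block width */
typedef uint16_t pixel;          /* assumed component type */

/* SoA block: one array per component. */
struct chunk { pixel x[CHUNK_SIZE], y[CHUNK_SIZE], z[CHUNK_SIZE], w[CHUNK_SIZE]; };

/* Hypothetical mask encoding: 2 bits per output component, the lowest two
 * bits select the source of x. Repeated sources are allowed. */
typedef uint8_t swizzle_t;
#define SWIZZLE(x, y, z, w) ((swizzle_t)((x) | (y) << 2 | (z) << 4 | (w) << 6))

static void swizzle(struct chunk *c, swizzle_t mask)
{
    struct chunk src = *c;   /* copy first so any permutation is safe */
    const pixel *in[4]  = { src.x, src.y, src.z, src.w };
    pixel       *out[4] = { c->x,  c->y,  c->z,  c->w  };

    for (int i = 0; i < 4; i++)
        memcpy(out[i], in[(mask >> (2 * i)) & 3], sizeof(src.x));
}
```

With this encoding, `swizzle(&c, SWIZZLE(1, 0, 2, 3))` swaps the contents of x and y. The full copy is the generic, slow way to do it; a specialised kernel for a known mask can skip the components that do not move.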
Obviously when the final input / output is planar we could just swap the pointers directly
But it’s not always possible to eliminate these swizzle operations
For example, on decoding rgb30 vs bgr30; we have one unpack101010 operation which unpacks the 32 bit pixel into four 16 bit values in a fixed order in the chunk struct
And then I may need to rearrange this to repack it differently (eg as rgb565)
Sure we could jump through hoops to try and imbue the unpack itself with an extra swizzle mask
But that just makes the problem difficult elsewhere
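To make the rgb30 / bgr30 case concrete, here is a scalar sketch of a fixed-order unpack followed by a 5:6:5 repack. The bit layout (2 unused MSBs followed by three 10-bit fields) and the helper `pack565` are assumptions for illustration; only the name unpack101010 comes from the discussion.

```c
#include <stdint.h>

/* Assumed layout: bits 31..30 unused/alpha, then three 10-bit fields.
 * The output order is fixed, regardless of what the fields mean. */
static void unpack101010(uint32_t px, uint16_t out[4])
{
    out[0] = (px >> 20) & 0x3FF;
    out[1] = (px >> 10) & 0x3FF;
    out[2] =  px        & 0x3FF;
    out[3] = (uint16_t)(px >> 30);
}

/* Repack three 10-bit components as 5:6:5 by truncation
 * (no rounding or dithering). Hypothetical helper. */
static uint16_t pack565(uint16_t c0, uint16_t c1, uint16_t c2)
{
    return (uint16_t)((c0 >> 5) << 11 | (c1 >> 4) << 5 | (c2 >> 5));
}
```

If the source is the opposite channel order, slots 0 and 2 come out exchanged, so either a swizzle runs between the two steps or the unpack itself has to take a mask.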
twelve has joined #ffmpeg-devel
what I did in my implementation now is essentially having one routine for each possible swizzle, or at least each type possible in practice (currently there are 13)
and then choosing the right one at init time
it's not too bad because they are quite small, each one is only a few opcodes
but I wanted to expand this to allow repetitions in the swizzle mask as well which would raise it to 256 possibilities in theory
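The one-routine-per-swizzle scheme could look roughly like the sketch below: small specialised kernels plus a table that is searched once at init time. All names here (`swizzle_fn`, `swizzle_table`, `select_swizzle`) and the two kernels shown are hypothetical, not the actual implementation. The 256 figure matches 4 possible sources for each of 4 outputs, i.e. 4^4.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 16            /* assumed fixed block width */
typedef uint16_t pixel;
struct chunk { pixel x[CHUNK_SIZE], y[CHUNK_SIZE], z[CHUNK_SIZE], w[CHUNK_SIZE]; };

typedef void (*swizzle_fn)(struct chunk *c);

static void swap_arrays(pixel *a, pixel *b)
{
    for (int i = 0; i < CHUNK_SIZE; i++) {
        pixel t = a[i]; a[i] = b[i]; b[i] = t;
    }
}

/* Specialised kernels, named after the source of each output component. */
static void swizzle_yxzw(struct chunk *c) { swap_arrays(c->x, c->y); }
static void swizzle_zyxw(struct chunk *c) { swap_arrays(c->x, c->z); }

static const struct { uint8_t order[4]; swizzle_fn fn; } swizzle_table[] = {
    { { 1, 0, 2, 3 }, swizzle_yxzw },
    { { 2, 1, 0, 3 }, swizzle_zyxw },
};

/* Called once at init; returns NULL when no specialised kernel exists,
 * in which case the caller needs a generic fallback. */
static swizzle_fn select_swizzle(const uint8_t order[4])
{
    for (size_t i = 0; i < sizeof(swizzle_table) / sizeof(swizzle_table[0]); i++)
        if (!memcmp(swizzle_table[i].order, order, 4))
            return swizzle_table[i].fn;
    return NULL;
}
```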
th3synth4x has quit [Ping timeout: 240 seconds]
derpydoo has quit [Ping timeout: 268 seconds]
th3synth4x has joined #ffmpeg-devel
(late because stuck in buffer) For #11363, I think ubitux' colleague might be of some help, as he contributed and was interested some years ago in this topic. mateo` maybe ?
ccawley2011__ has quit [Ping timeout: 260 seconds]
ah, it's fewer op codes actually, because of widening shifts
abdu62 has quit [Quit: Client closed]
abdu62 has joined #ffmpeg-devel
binary size matters more for this optimization: with a *257 you need either a gpr load with an immediate (wastes l1i), or a rodata constant (wastes l1d), and for vectors it's even worse
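One common place a *257 shows up is expanding an 8-bit value to the full 16-bit range; assuming that is the case meant here, the multiply can be replaced by a shift and an OR, which needs no constant at all. Both function names are hypothetical.

```c
#include <stdint.h>

/* 8-bit to 16-bit range expansion via multiply: 255 maps to 65535. */
static inline uint16_t expand8_mul(uint8_t x)
{
    return (uint16_t)(x * 257);
}

/* Same result without a constant: x * 257 == x * 256 + x == (x << 8) | x,
 * because x fits in the low 8 bits and the two terms never overlap. */
static inline uint16_t expand8_shift(uint8_t x)
{
    return (uint16_t)((x << 8) | x);
}
```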
SohamK has joined #ffmpeg-devel
SohamK has quit [Quit: Client closed]
abdu62 has quit [Ping timeout: 240 seconds]
jamrial has quit [Read error: Connection reset by peer]
keith has quit [Remote host closed the connection]
keith has joined #ffmpeg-devel
Overall speedup=1.870x faster, min=0.235x max=23.406x // another 11% speedup on average \o/
and now all of the weird low bit depth format conversions are faster than swscale again
jamrial has joined #ffmpeg-devel
keith has quit [Remote host closed the connection]
cone-806 has quit [Quit: transmission timeout]
did you test 160x120 and 320x240 res conversions? /s
keith has joined #ffmpeg-devel
i guess there are a bunch of low bit depths too
15bpp :D
i'm not sure we have organized bit depth sample collection though
but if you need one, mostly video game and early codecs, it could be collected.
i remember 8bpp was that gif ?
ccawley2011 has joined #ffmpeg-devel
Sean_McG has quit [Quit: leaving]
ccawley2011_ has quit [Ping timeout: 252 seconds]
ccawley2011_ has joined #ffmpeg-devel
ccawley2011 has quit [Ping timeout: 252 seconds]
ccawley2011 has joined #ffmpeg-devel
ccawley2011__ has joined #ffmpeg-devel
ccawley2011_ has quit [Ping timeout: 260 seconds]
ccawley2011 has quit [Ping timeout: 265 seconds]
ccawley2011__ has quit [Ping timeout: 252 seconds]
Overall speedup=1.931x faster, min=0.245x max=24.129x almost at the 2x
Now the only slow paths left are missing LUT optimizations (e.g. for treating rgb8 as pal8), missing optimized packed read/write fast paths (the MMX code in swscale vastly outperforms our code on these), and one corner case where we do unnecessary swizzling on planar read/writes
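A sketch of what treating rgb8 as pal8 means in practice: since an 8-bit pixel has only 256 possible values, the whole conversion can be precomputed into a palette-style table. The RRRGGGBB bit layout and both function names are assumptions for illustration, and the expansion truncates rather than rounds.

```c
#include <stdint.h>

/* Build a 256-entry table mapping each 3:3:2 pixel to 8-bit R, G, B.
 * Assumed layout: RRRGGGBB, red in the most significant bits. */
static void build_rgb332_lut(uint8_t lut[256 * 3])
{
    for (int i = 0; i < 256; i++) {
        int r = i >> 5, g = (i >> 2) & 7, b = i & 3;
        lut[3 * i + 0] = (uint8_t)(r * 255 / 7);   /* 3 bits -> 8 bits */
        lut[3 * i + 1] = (uint8_t)(g * 255 / 7);
        lut[3 * i + 2] = (uint8_t)(b * 255 / 3);   /* 2 bits -> 8 bits */
    }
}

/* Per-pixel work is then a single table lookup, exactly like pal8. */
static void convert_rgb332_to_rgb24(const uint8_t *src, uint8_t *dst, int n,
                                    const uint8_t lut[256 * 3])
{
    for (int i = 0; i < n; i++) {
        const uint8_t *e = &lut[3 * src[i]];
        dst[3 * i + 0] = e[0];
        dst[3 * i + 1] = e[1];
        dst[3 * i + 2] = e[2];
    }
}
```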