michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.0 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
<Lynne> can't send a patch without it being bikeshed out of existence or actively ignored
cone-549 has quit [Quit: transmission timeout]
<jamrial> Lynne: ok, found what's happening, but it makes no sense
<jamrial> it's the fmaddpd instruction
<jamrial> if i remove it (and call mulpd + addpd like in sse2), i see the same boost you got
thilo has quit [Ping timeout: 256 seconds]
thilo has joined #ffmpeg-devel
iive has quit [Quit: They came for me...]
<Lynne> really weird, I did do a couple of tests without that earlier
<Lynne> which CPU did you have, a Zen 5?
<Lynne> universally, fused add-multiplies are very sensitive to input latency, at least that's what I noticed
<Lynne> the opusdsp deemphasis code is precariously optimized and really a wonder that it's so stable, with 3 or so chained fmadds
<jamrial> alder lake
lexano has quit [Ping timeout: 252 seconds]
cone-225 has joined #ffmpeg-devel
<cone-225> ffmpeg Andreas Rheinhardt n7.0.1:HEAD: tests/checkasm/vvc_alf: Don't use declare_func_emms
<Lynne> jamrial: thanks, sorry about that earlier
<Lynne> I got a 30% speedup on zen 3 too without fmaddps
<Lynne> *pd
<Lynne> yup, fmadd is getting choked by the loads earlier
michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.0.1 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
<Lynne> v2 sent
arch1t3cht1 has joined #ffmpeg-devel
arch1t3cht has quit [Ping timeout: 260 seconds]
arch1t3cht1 is now known as arch1t3cht
mkver has quit [Ping timeout: 268 seconds]
lemourin has quit [Quit: The Lounge - https://thelounge.chat]
lemourin has joined #ffmpeg-devel
IndecisiveTurtle has quit [Ping timeout: 255 seconds]
jamrial has quit []
HarshK23 has quit [Quit: Connection closed for inactivity]
Martchus_ has joined #ffmpeg-devel
Martchus has quit [Ping timeout: 268 seconds]
cone-225 has quit [Quit: transmission timeout]
MisterMinister has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
MisterMinister has quit [Ping timeout: 268 seconds]
kepstin has quit [Remote host closed the connection]
kepstin has joined #ffmpeg-devel
<courmisch> how is x86 able to make do w/o VP7 DSP. unthinkable, such major codec!
IndecisiveTurtle has joined #ffmpeg-devel
Livio has joined #ffmpeg-devel
Livio has quit [Ping timeout: 268 seconds]
ccawley2011 has joined #ffmpeg-devel
___nick___ has joined #ffmpeg-devel
___nick___ has quit [Client Quit]
microchip__ has joined #ffmpeg-devel
___nick___ has joined #ffmpeg-devel
microchip_ has quit [Ping timeout: 260 seconds]
ramiro has quit [Ping timeout: 264 seconds]
ramiro has joined #ffmpeg-devel
cone-865 has joined #ffmpeg-devel
<cone-865> ffmpeg Rémi Denis-Courmont master:0b2316e37fca: lavc/sbrdsp: fix inverted boundary check
<cone-865> ffmpeg Rémi Denis-Courmont master:e6b38c944f0e: lavc/sbrdsp: fix potential overflow in noise table
<cone-865> ffmpeg Rémi Denis-Courmont release/7.0:58ac1f9ea800: lavc/sbrdsp: fix potential overflow in noise table
Krowl has joined #ffmpeg-devel
System_Error has joined #ffmpeg-devel
System_Error has quit [Remote host closed the connection]
<cone-865> ffmpeg sunyuechi master:63697d3350b1: lavc/vp8dsp: R-V V put_epel hv
mkver has joined #ffmpeg-devel
<Gramner> Lynne: using fma instructions has a dependency on the accumulator, so if you're doing it in a tight loop you'll be bottlenecked by the latency whereas separate mul+add can execute more stuff in parallel with out-of-order execution.
<BtbN> What is even the status with the whole tree/auto vectorization? I thought gcc had that long fixed? Or does clang implement something there that performs much better?
<courmisch> f**k, amended the commit message but not the diff
<Gramner> a way to work around this is to unroll the loop a bit and use fma with multiple accumulators inside the loop, and add those accumulators together after the loop. this tends to be better as long as the loop iteration count is reasonably large
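[editor's note] Gramner's suggestion above can be sketched in plain C. This is a scalar illustration only, not the assembly from the patch being discussed; the function names `dot_single` and `dot_unrolled4` and the unroll factor of 4 are assumptions made for the example. On an FMA-capable target the compiler may contract each multiply-add into a single fused instruction.

```c
#include <stddef.h>

/* One accumulator: each multiply-add depends on the result of the
 * previous one, so the loop runs at the latency of that operation. */
static double dot_single(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Four independent accumulators: the four dependency chains can
 * overlap in the pipeline. They are summed once after the loop. */
static double dot_unrolled4(const double *a, const double *b, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    /* Handle the leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        acc0 += a[i] * b[i];

    return (acc0 + acc1) + (acc2 + acc3);
}
```

Splitting the sum changes the order of the floating-point additions, so the two functions can differ in the last bits for general inputs. As noted in the chat, the unrolled form tends to pay off only when the iteration count is reasonably large.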
<courmisch> BtbN: as far as I know, it's mostly useless because it doesn't integrate with runtime CPU detection, except for NEON and SSE2 which are typically enabled at compilation time
<courmisch> unless you go out of your way to compile the code multiple times, but then that's not really *automatic* vectorisation
<JEEB> BtbN: I think clang just seemed to not be completely broken, while each time it was enabled with GCC some version was found to bork in some manner. I really don't recall which version was the thing last time
<JEEB> since I think for a short period of time it was enabled, and then just disabled again
<BtbN> It's been disabled since forever now, hasn't it?
<BtbN> And apparently there are some massive performance gains left on the table
<JEEB> not sure about massive, although I bet it depends on how much existing SIMD there is in various modules
<BtbN> Some people in the issues on my builds recently reported 20-30% performance gains for building with clang, and I can't think of much else causing this than the lack of vectorization.
rvalue has quit [Read error: Connection reset by peer]
<BtbN> Cause I doubt clang can just magically optimize something THAT much better
rvalue has joined #ffmpeg-devel
<psykose> i think gcc also enabled it by default in 13
<psykose> ah
<psykose> it was always enabled for -O3
<psykose> and for -O2 since 12
<BtbN> ffmpeg force-disables it though. For gcc only.
<psykose> sure
<JEEB> yea so 2016 was when it was briefly enabled
<JEEB> cb8646af24bd8e9627cc5e1c62b049a00fe0b07b
<JEEB> it was disabled for GCC in 973859f5230e77beea7bb59dc081870689d6d191 , 2009
<JEEB> then fd6dbc53855fbfc9a782095d0ffe11dd3a98905f was where it was reverted to be disabled again
<JEEB> I think someone made a patch within the last 3-4 years attempting to re-enable it again, but that went nowhere
<BtbN> There's entire distros built with it these days
<BtbN> I'm tempted to just override it in my builds, and see
<BtbN> Just need to be sure how I can override that from the commandline.
<courmisch> I think it is safe to leave it to the default at least
<courmisch> except for those benchmarks which won't look so good on x86-64 and AArch64 any longer
<courmisch> it would also be maybe a good idea not to reinvent standard functions for which libc has optimised versions
System_Error has joined #ffmpeg-devel
<nevcairiel> last time someone suggested to re-enable it, it still broke some of the inline asm we have, but noone felt compelled to actually fix it
<courmisch> re-enable it, or not disable it? not the same
<courmisch> especially on x86-32
<courmisch> I suppose
<JEEB> gcc currently leads to `-fno-tree-vectorize`
<JEEB> so removing that one could call "enabling" ? although yes, one is indeed no longer disabling it any more
<courmisch> there is a difference between trusting the compiler defaults and enabling
<courmisch> especially for old known broken compiler versions
<JEEB> yes, if we were explicitly setting -ftree-vectorize always
<JEEB> versus just removing the no-tree-vectorize override
<JEEB> we just assume that it has been enabled in gcc for ages by default, so it gets called re-enablement, even though it just leaves the value to the default :)
<nevcairiel> because thats effectively what it does, and if removing the flag breaks stuff, thats no good :shrug:
<Lynne> Gramner: so it was the accumulator, not loads
<Lynne> we're doing 2 loads so they should be finished in a cycle on most CPUs
<Lynne> plus data is guaranteed to be in L1, and there's pipelining to hide it further, and the loop is barely 7 instructions long
<Gramner> loads wont finish in a cycle, but they can be scheduled out-of-order so it's mostly a non-issue as long as they're in cache and the address calculation doesn't consist of a long dependency chain
mkver has quit [Remote host closed the connection]
mkver has joined #ffmpeg-devel
Krowl has quit [Read error: Connection reset by peer]
kurosu has joined #ffmpeg-devel
HarshK23 has joined #ffmpeg-devel
\\Mr_C\\ has joined #ffmpeg-devel
lexano has joined #ffmpeg-devel
<BtbN> But why does it break inline asm on gcc, but not clang?
<nevcairiel> because entirely different software products have very different behavior? :D
<BtbN> fate passes at least with the flag removed
<nevcairiel> it didnt affect all platforms
<JEEB> what the revert of the allowance of 4.9+ noted was > See the "[PATCH 2/2] configure: Enable GCC vectorization on ≥4.9" thread on the ffmpeg-devel ML
<JEEB> also do note that since we now require C11 I think the list of compilers supported is now quite a bit shorter
<BtbN> hm, it might just need more involved checks then. Cause it really does seem like quite a bit of performance left on the table there.
<JEEB> but yea, hopefully that thread has some info on the failure cases
<BtbN> A link to it would have been nice
<BtbN> been a few years
<mkver> JEEB, BtbN: GCC 11.3, configure --cpu=haswell, mingw32
<BtbN> specifically haswell? That's super weird.
<nevcairiel> i doubt its specifically haswell only, just something high enough
<JEEB> I'd guess x86_64 v3 (AVX2 config) would thus also be affected
<BtbN> -march=znver4 builds just fine at least
<BtbN> 13.2.1_20240210 with --cpu=haswell also at least builds without error
c1480 has quit [Read error: Connection reset by peer]
cone-865 has quit [Quit: transmission timeout]
<BtbN> I can't tell if it's actually turned on. Even adding -ftree-vectorizer-verbose=6, there is no further output.
<BtbN> No matter what I try, at least compilation with gcc-13 never fails.
<mkver> BtbN: Can't you just check whether the generated output is different from our default with -fno-tree-vectorize?
<BtbN> It's different cause of the different gcc invocation alone. I _think_ it's working, but I'm not certain.
<nevcairiel> do you measure the fabled double-digit percentage?
<BtbN> The binaries _with_ vectorization are quite a bit larger
<BtbN> 29M for stripped ffmpeg.exe with, and 23M for the same exe without vectorization
<BtbN> I'm gonna test prores now
<psykose> that includes local pgo too
MisterMinister has joined #ffmpeg-devel
<BtbN> yeah, hm. It's still an incredibly huge difference
<BtbN> ffmpeg -f lavfi -i testsrc=duration=120:size=1920x1080:rate=60 -c:v prores_ks -profile:v 3 -f null -
<BtbN> 150 FPS without vectorization, 160 FPS with.
<BtbN> So there is a non-trivial difference, ~6% faster.
kurosu has quit [Quit: Connection closed for inactivity]
<BtbN> That performance difference is consistent also with mingw vs. native
cone-260 has joined #ffmpeg-devel
<cone-260> ffmpeg Rémi Denis-Courmont master:728a1dd3b6fa: lavc/rv34dsp: remove stray load immediate
<cone-260> ffmpeg Rémi Denis-Courmont master:25a33665a0ce: lavc/vp8dsp: remove unused macro parameter
<courmisch> BtbN: which checkasm bench becomes useless?
<BtbN> I can't reproduce the failure, even with mingw on gcc 11.4
<BtbN> Did we fix that issue, or did gcc fix stuff?
Livio has joined #ffmpeg-devel
<nevcairiel> i doubt it was fixed in 11.4, possible we changed something
<nevcairiel> https://git.videolan.org/?p=ffmpeg.git;a=commitdiff;h=182663a58a7a099e02e76da3b0f96d63e5c26a6d this probably, as it disabled the inlining in that config
<BtbN> Is ARCH_X86_32 enabled on x86_64?
<nevcairiel> no
<BtbN> Oh, it's only broken on 32 bit?
<BtbN> I didn't test that, since I don't really care about that
<nevcairiel> it was 32-bit all this time yes :P
<BtbN> Well, disabling it for gcc on x86_32 it is then!
<BtbN> Though --arch=x86 --target-os=mingw64 --cross-prefix=i686-w64-mingw32- --cpu=haswell also works
<BtbN> That's "i686-w64-mingw32-gcc (GCC) 10-win32 20220113" even
<nevcairiel> like I said the commit above likely "fixed" it, as it disables inlining for the code that routinely failed
<courmisch> IIRC, VLC doesn't even compile with -O0 on 386 due to register pressure with inline assembler
<courmisch> there is no guarantee of any non-zero number of available register from the compiler, so...
<BtbN> yeah, reverting it brings the failure back
<nevcairiel> we have configure checks on available registers for some very picky inline asm, but its certainly thrown all sorts of errors over the years
<courmisch> I don't see how you can check for it in configure. It's entirely case-dependent
<courmisch> if that CABAC code is worth optimising for, it's probably worth making an ABI call of it on 386
<courmisch> that's pretty much the only safe way to write assembler on a register-deprived ISA
<BtbN> That really seems to have been the only problematic bit of code though
<courmisch> there isn't that much non-trivial inline assembler in FFmpeg, so I'm not sure if that's saying anything TBH
<courmisch> at least on x86
<courmisch> (obviously not true on PPC or MIPS)
<BtbN> With 182663a in, the tree vectorization seems to work
<courmisch> I would expect vectorisation to cause problem with insufficient clobbers rather than missing regs, but I guess that's possible too
<nevcairiel> cabac decoding is still a performance bottleneck, as even vvc keeps using it, but the overhead of making it an actual function call in 32-bit x86 at least fixes all the optimizer nonsense
<BtbN> If you still use x86_32 in 2024, you opted out of performance anywhere
<courmisch> because of register shortage?
<courmisch> otherwise, x32 (i.e. x86-64ilp32) is supposed to be faster than x86-64
<BtbN> 5% performance isn't the world, but also far from nothing
<courmisch> though IMO the only 32-bit ISA of *any* meaningful relevance at this point is ARMv7-A, and even that's on the way down
<courmisch> I don't know how much FFmpeg would be simplified by dropping 686 support though
<courmisch> and I'm sure an army of trolls would come if it's even suggested seriously
<jkqxz> Do AMD/Intel care about x86 32-bit performance at all any more?
<courmisch> no
<BtbN> doubt it, it's in pure compat mode
<courmisch> x86s is basically dropping 32-bit performance completely
<courmisch> as I understand it, at least. I'm obviously no CISC expert
<nevcairiel> as I understood it, x86s just simplifies the boot-up with no impact to user-space, including 32-bit mode, unless you relied on like 16-bit protected mode
<courmisch> AFAIU, the point is pretty much to drop 16- and 32-bit hardware support as much as possible. The *visible* impact is that you can't boot in real mode
<jkqxz> 16-bit can be emulated running orders of magnitude faster than any hardware which existed when programs for it were written, unlike 32-bit.
<courmisch> Arm is also pushing to drop hardware 32-bit compat to save on die area for instruction decoding and repurpose it to something more useful
<nevcairiel> 32-bit OSes seem to be getting dropped, but i heard nothing about 32-bit userspace getting slower, there is still tons of shit using that
<jkqxz> x86 has less pressure on instruction decoding, though, because the instructions are essentially the same and all translated to something different internally anyway.
<jkqxz> The area overhead of 32-bit is probably very small now, and it gets a lot of speed for free.
<BtbN> x32 exists for that
<BtbN> and people kinda stopped caring about that
<courmisch> uh, x32 is for anything but compatibility
<nevcairiel> x32 is not a hardware thing, just some people thought it was a good idea
<BtbN> x32 is trying to combine the speed benefits of the two
<courmisch> x32 is literally using 64-bit instruction set and extra registers, while keeping the memory footprint of 32-bit
<jkqxz> x32 is a different ABI, not a different instruction set.
<courmisch> yes, but x32 is not backward compatible with 686
<courmisch> jkqxz: x32 is x86-64 ISA, so it is a different ISA than 686 / x86-32 / whatever you call it
___nick___ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
<nevcairiel> sure, but its not a "new" ISA, its just standard 64-bit
<nevcairiel> is the point, i think
<courmisch> in fact the 64ilp32 concept has been attempted on mips, arm, riscv too
<jkqxz> Yes. It is the same ISA as x86-64 with a different ABI.
___nick___ has joined #ffmpeg-devel
___nick___ has quit [Client Quit]
___nick___ has joined #ffmpeg-devel
MisterMinister has quit [Remote host closed the connection]
<BtbN> So, would another attempt at sending a patch that enables tree-vectorization have a chance now? :D
<Lynne> sure, send it, push it, and see if fate breaks
<BtbN> It's really unfortunate that the only way to utilize our army of fate boxes is to push to master
<Gramner> x32 was absurdly niche and I don't think anyone ever used it for something meaningful. even the official abi site has gone missing
<Gramner> if you want your arrays of pointers to use less memory then just use some pointer compression scheme like everyone else instead of changing the entire system ABI
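[editor's note] One common form of the pointer compression Gramner mentions is to store 32-bit offsets from a known base address instead of full 64-bit pointers. The sketch below is only an illustration under that assumption; the type `cptr32` and the helpers `cptr32_compress` / `cptr32_decompress` are hypothetical names, and it assumes every object lives within 4 GiB of the same base.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compressed pointer: a 32-bit byte offset from a base. */
typedef uint32_t cptr32;

/* Turn a pointer inside the arena into an offset from its base.
 * The caller must guarantee that p points into the arena. */
static cptr32 cptr32_compress(const char *base, const void *p)
{
    return (cptr32)((const char *)p - base);
}

/* Rebuild the full pointer from the base and the stored offset. */
static void *cptr32_decompress(char *base, cptr32 c)
{
    return base + c;
}
```

An array of `cptr32` takes half the memory of an array of 64-bit pointers, at the cost of one addition per dereference, and it leaves the system ABI untouched.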
dionisis has quit [Ping timeout: 260 seconds]
<Lynne> x32 proponents were LOUD
<Lynne> they weren't however reasonable
<BBB> that's generally the problem of loud people
<BBB> it's difficult to figure out if what they're loud about is reasonable
<BBB> because they are so ... well ... loud :)
Livio has quit [Read error: Connection reset by peer]
microchip__ is now known as microchip_
Livio has joined #ffmpeg-devel
<courmisch> nah, they were probably just engineers who didn't know what stupid idea to make a project of to justify funding
* Sean_McG peeks in
<courmisch> I didn't see any PPC patch
<Sean_McG> I did (well, I think I did...) implement flacdsp wasted32 -- how do I get the numbers to see if it sped it up or not?
<courmisch> make tests/checkasm/checkasm && tests/checkasm/checkasm --test=flacdsp --bench
<Sean_McG> cool, thanks.
<JEEB> hmm, I wonder which spec 23091-3 is referring to with "according to [1]", or did they just make a mistake matching it to BS 2051 / IEC 62574 and wrote SiL as just SL (same for the right one)
<JEEB> wrt to table 2, "Definition of loudspeaker index, OutputChannelPosition"
<Sean_McG> hm, it failed -- maybe I don't have it set up right
<JEEB> I have a draft amendment that contains the full tables, but not the thing it's supposedly referring to
cone-260 has quit [Quit: transmission timeout]
iive has joined #ffmpeg-devel
<JEEB> I love how the TrueHD in MP4 document from 2019 tells the user to map their L|R to BS.2051 FLc and FRc
<Lynne> even if we get it wrong, its not a problem, who has a 22.1 sound system?
<Lynne> just imagine all the wires
<JEEB> at least some "D atmos home entertainment studio technical guidelines" document from 2021 then goes back and maps left/right to L|FL
<courmisch> reminds me I should buy a ticket for Mad Max in ISense
<JEEB> Lynne: I mean it's mostly to figure out what the mapping should be :D or even coming up with some supposed azimuth (range) that we mean with them
<JEEB> at least I'm mostly considering the apple surround direct mappings as bullshit by now, that broke me for a good while
<JEEB> esp. after reading the definition of the direct channels and that they just got removed from the later SMPTE doc compared to the one in which they were defined
<JEEB> although I guess more than the D docs, I lul at 23091-3 where depending on the layout, the "basic" left and right are mentioned as either L,R or Lc|FLc,Rc|FRc - with the same 30 degree azimuth
<JEEB> I think BS.2051 does the same...
<JEEB> actually no, at least BS.2051 doesn't seem to do this. It just changes from L,R to FL,FR
<Sean_McG> I still need to see the Mad Max films... should I see the originals or just start with Fury Road?
<JEEB> the azimuths do start changing, though. since BS.2051 system H uses the FLc,FRc channels and thus makes "normal" left and right be wider
<JEEB> Sean_McG: I unfortunately didn't get to the originals so started with Fury Road and it was a nice experience
<JEEB> I think the second one - road warrior, was the big thing
<iive> Sean_McG, the Fury Road is completely standalone.
<JEEB> yea
<JEEB> I think quite a few of the mad max movies were? the overall theme and some character of "mad max" being the main things stitching them together?
AbleBacon has joined #ffmpeg-devel
___nick___ has quit [Ping timeout: 264 seconds]
<JEEB> right, BS.2051 is azimuth based and then whatever the name exactly is defined by each layout is secondary
<JEEB> that's why it's completely OK that M+/-30 is called L,R in some layouts and FLc,FRc in another
Krowl has joined #ffmpeg-devel
___nick___ has joined #ffmpeg-devel
Livio has quit [Ping timeout: 264 seconds]
Krowl has quit [Read error: Connection reset by peer]
ccawley2011 has quit [Read error: Connection reset by peer]
ccawley2011 has joined #ffmpeg-devel
ccawley2011 has quit [Ping timeout: 252 seconds]
___nick___ has quit [Ping timeout: 260 seconds]
jamrial has joined #ffmpeg-devel
IndecisiveTurtle has quit [Ping timeout: 255 seconds]
<llyyr> nevcairiel: reminder for the vp9 patch :p
cone-377 has joined #ffmpeg-devel
<cone-377> ffmpeg Brad Smith master:43b1a956789b: configure: enable ffnvcodec, nvenc, nvdec for FreeBSD
<cone-377> ffmpeg Brad Smith release/4.4:fe7a4ea04970: configure: enable ffnvcodec, nvenc, nvdec for FreeBSD
<cone-377> ffmpeg Brad Smith release/5.0:80a676e8ae8d: configure: enable ffnvcodec, nvenc, nvdec for FreeBSD
<cone-377> ffmpeg Brad Smith release/5.1:389861c02151: configure: enable ffnvcodec, nvenc, nvdec for FreeBSD
<cone-377> ffmpeg Brad Smith release/6.0:0819bdc6212f: configure: enable ffnvcodec, nvenc, nvdec for FreeBSD
<cone-377> ffmpeg Brad Smith release/6.1:2ffc47b37de0: configure: enable ffnvcodec, nvenc, nvdec for FreeBSD
<cone-377> ffmpeg Brad Smith release/7.0:f42c35b7c9c9: configure: enable ffnvcodec, nvenc, nvdec for FreeBSD