michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.1.1 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
thilo has quit [Ping timeout: 248 seconds]
thilo has joined #ffmpeg-devel
thilo has quit [Changing host]
thilo has joined #ffmpeg-devel
Anthony_ZO has joined #ffmpeg-devel
<ramiro>
haasn: /* Skip unpacking components that are not used */ <-- this doesn't seem right. we're dropping crucial information to know where to find the components.
cone-896 has joined #ffmpeg-devel
<cone-896>
ffmpeg Michael Niedermayer master:2f0500f22c1d: avcodec/ffv1enc: Better heuristic for selecting mul values.
<cone-896>
ffmpeg Michael Niedermayer master:3faee894fc77: avcodec/ffv1enc: Eliminate copy_state
<cone-896>
ffmpeg Michael Niedermayer master:1d2c39100524: avcodec/ffv1enc: Add -remap_optimizer option
<cone-896>
ffmpeg Michael Niedermayer master:f76508511547: avcodec/ffv1enc: Eliminate RemapEncoderState
<haasn>
ramiro: notice that it skips only trailing components
<haasn>
PackOp is defined as being msb->lsb ordered
<haasn>
But I don’t think this optimization is actually doing anything useful in 99.9% of cases
<haasn>
So if it makes it more annoying just remove it
<haasn>
Realistically you can only skip the alpha channel which only the 1010102 format even has
<ramiro>
I hit it on xv30le -> gray
<haasn>
Ah
<ramiro>
the information available on pack/unpack is quite confusing... the input/output types, which components are used for input and output, and the offsets being msb->lsb ordered. I'll have to read more carefully. currently I have a huge hack for unpack and a medium hack for pack.
<ramiro>
packing/unpacking frequently leads to an unnecessary conversion before or after. I had to add more checks to make sure the conversion only happens once.
<ramiro>
I just finished *most* unscaled conversions. all except for monow/monob, xv36be, and the occasional asmjit failure on consecutive register allocation. it's *very* fast, and I haven't done the loop in asm yet. it should be even faster with just one function call and no redundant loading of const data.
<haasn>
I didn’t really think about the design of pack/unpack too much
<haasn>
We should define it in whatever way is most convenient for asm
<haasn>
How does a typical SIMD implementation of this work?
<Lynne>
load N registers where N is a convenient mod of the depth/elements
<Lynne>
then deinterleave until done
<Lynne>
same with packing
<haasn>
If it’s some combination of AND and shift right, then I imagine we should keep the element size the same
<haasn>
Rather than implying a conversion
<haasn>
Since we’re guaranteed unpacking to separate registers it doesn’t really cost us anything
<Lynne>
you can do it one step at a time for most pixfmts, but it would be significantly slower
<haasn>
Deinterleave?
<Lynne>
shuffle/deinterleave until everything ends up separated by component
<haasn>
What instruction sets have a bitwise shuffle or deinterleave?
<ramiro>
haasn: for packing I'm doing 1) convert to higher format 2) shift left 3) or. 1 and 2 could be merged in the same instruction with a widening shift left.
<Lynne>
oh, for 10bit semi-packed data, like x2bgr10?
<Lynne>
or v210?
<haasn>
Yeah, or rgb565
<haasn>
For byte aligned formats you don’t need SWS_OP_UN/PACK
<ramiro>
haasn: for unpacking it's 1) shift right 2) and 3) convert down. I could merge 1 and 3 with narrowing shift right.
<Lynne>
for that I'd go with shifts and masks, yeah
<haasn>
Could do a survey of how often an unpack is followed by a conversion
<haasn>
Across all possible conversions
<haasn>
Or, what if we just allow the pack type to vary
<haasn>
So { op = UNPACK, pack.type = U16, .type = U8 } + { convert u8 -> u16 } just becomes .type = U16 on the unpack
<haasn>
That strikes me as the most general solution
<haasn>
Then you can use either a narrowing right shift or a normal right shift depending on the desired type
<haasn>
Maybe with the restriction that type <= pack.type
<haasn>
Although the other part of me wants to just delete the confusing pack.type field
<ramiro>
haasn: it is confusing, yes :)
<haasn>
Define PACK/UNPACK on a single depth only
<haasn>
And let the backend worry about fusing the right shift and narrow
<ramiro>
it would be more helpful to have, for each op, input and output type. even if that means duplicating some information from next/previous op.
<haasn>
That only affects CONVERT right
<ramiro>
also maybe comps.unused_in and comps.unused_out
<ramiro>
haasn: no, for pack/unpack
<haasn>
(With the above proposal to delete pack.type)
<haasn>
I guess it would simplify swapping operations
<haasn>
then again, I should just make swap_ops automatically do that when swapping a conversion
<haasn>
one of the things I want to try and solve is restoring the ability for op table entries to match multiple ops again
System_Error has quit [Remote host closed the connection]
<haasn>
but the stupid way of doing that (turning ops into an array) is both cumbersome and creates a lot of dead binary space
<haasn>
by blowing up the size of those tables
<haasn>
and it's not convenient to make them pointers with C macros
<haasn>
maybe it's okay for now to just allow matching _two_ ops, rather than N, that keeps the size manageable
<haasn>
though I suspect we may really want some dedicated kernels for packed read-swizzle-write (e.g. shuffling rgba byte order)
<haasn>
since those can just be pshufb instead of needing to split the whole thing into separate registers
<qwertyttyert>
There is a libtheora. The terminal is limited to displaying 300 rows, so -codecs doesn't show everything. -encoders is also 300 lines, but the information in the output is different. Copy, paste, and find. I use ffmpeg on linux.
jamrial has quit []
<fflogger>
[newticket] y2krankor: Ticket #11534 ([undetermined] Copying dvd subtitles from mp4 to mp4 alters size) created https://trac.ffmpeg.org/ticket/11534
cone-896 has quit [Quit: transmission timeout]
qwertyttyert has left #ffmpeg-devel [#ffmpeg-devel]
<fflogger>
[editedticket] MasterQuestionable: Ticket #11534 ([avcodec] Copying "dvd_subtitle" from MP4 to MP4 altered display size) updated https://trac.ffmpeg.org/ticket/11534#comment:1
<cone-637>
ffmpeg Andreas Rheinhardt master:35c091f4b7fb: avformat/rtpenc: Check dimensions during init
derpydoo has joined #ffmpeg-devel
Guest44 has joined #ffmpeg-devel
Guest44 has quit [Client Quit]
jestrabikr has joined #ffmpeg-devel
jestrabikr has quit [Quit: Client closed]
Grimmauld has joined #ffmpeg-devel
nush has joined #ffmpeg-devel
nush has quit [Remote host closed the connection]
nush has joined #ffmpeg-devel
nush has quit [Remote host closed the connection]
rit has joined #ffmpeg-devel
Teukka has quit [Read error: Connection reset by peer]
Teukka has joined #ffmpeg-devel
Teukka has quit [Changing host]
Teukka has joined #ffmpeg-devel
ngaullier has joined #ffmpeg-devel
rvalue has quit [Read error: Connection reset by peer]
rvalue has joined #ffmpeg-devel
rit has quit [Remote host closed the connection]
rit has joined #ffmpeg-devel
Marth64 has quit [Remote host closed the connection]
Marth64 has joined #ffmpeg-devel
cone-637 has quit [Quit: transmission timeout]
mkver has quit [Ping timeout: 265 seconds]
mkver has joined #ffmpeg-devel
<haasn>
ramiro: I kinda want to start merging the ops code even though the backends aren't "ready" yet, as it will both give people a chance to test the new code and make it possible for others to start developing backends without the constant pain of rebasing
<haasn>
I was thinking about using SWS_EXPERIMENTAL to guard the new code for now, but I'm not very happy with that solution
<haasn>
what if we used an informal env var instead? like how drivers etc sometimes do it
witchymary has joined #ffmpeg-devel
mkver has quit [Ping timeout: 268 seconds]
mkver has joined #ffmpeg-devel
cone-276 has joined #ffmpeg-devel
<cone-276>
ffmpeg Niklas Haas master:0e2742a693d5: tests/swscale: allow choosing specific flags and dither mode
<cone-276>
ffmpeg Niklas Haas master:d467ceaa9b14: tests/swscale: use hex format for flags values
<ramiro>
haasn: in general I'm not a big fan of env vars. what's wrong with using a flag?
<haasn>
I just worry that it may be too poorly defined
<haasn>
so no users can actually end up using it
<ramiro>
it could also be selected by configure for now, disabled by default
<haasn>
but we have the flags anyway, might as well use them
<Lynne>
I think you should enable it by default after 8.0 gets released
<haasn>
I want at least x86 coverage before doing that
<haasn>
of the major paths
<haasn>
Lynne: do you have any idea why my SIMD routine is slower than the compiler generated C code despite the latter doing about 4 unnecessary memory round trips?
<haasn>
it's not 1:1 the same as the asm code because the C code is only reading 32 pixels at a time instead of 64 pixels at a time
<haasn>
and the C code is also writing to stack instead of leaving the data in mx/my/mx2/my2 but that should only be making it slower in this comparison
<Lynne>
wow, vpermq is more expensive than I thought
<Lynne>
throughput of 2, latency of 6.5
<haasn>
you reckon it's just from leaving away that?
<Lynne>
on a zen 3
<haasn>
let me test at different block sizes just to be sure
rit has quit [Remote host closed the connection]
<Lynne>
yeah, I reckon it's that; the compiler generated code relies on bitwise ops which are very cheap and pipelineable
<haasn>
since the block size is different the avx code is doing twice as much work
<Lynne>
ah, yeah, that looks better
<haasn>
that... kinda screws with the whole idea of using checkasm --bench in general
<haasn>
this is with the same 32 block size forced for both
<haasn>
really I should be reporting performance in cycles per pixel
<Lynne>
still barely 1.64x faster though, that's fine on arm64, but I'd have expected a bit more on x86
<haasn>
the sse2 version is with v1 not v2
<haasn>
give me a minute to make v2 compatible with sse2 and test it again
<haasn>
we are guaranteed SSSE3 on x86_64 right?
<wbs>
no, only SSE2
<haasn>
damn
<haasn>
read_packed2_u8_ssse3: 9.4 ( 1.77x)
<haasn>
using the pshufb variant
<ramiro>
haasn: but we did decide at VDD that ssse3 can be considered the minimum.
<haasn>
reasonable
<haasn>
I expect a lot more pshufb in our future anyway
<haasn>
could always go back to supporting separate sse2 variants if a need arises
<ramiro>
haasn: we could also go back to supporting mmx variants if the need arises, I just hope it never arises :)
<haasn>
I'm thinking if there's a way we could avoid the constant need for vpermq
<haasn>
it seems it's needed after basically every bit depth conversion (packuswb etc)
Grimmauld92 has quit [Quit: Client closed]
<haasn>
maybe we can just keep track of the number of such ops and do a single vpermq at the end only if needed :)
Grimmauld has joined #ffmpeg-devel
<haasn>
that seems like the type of witchcraft better reserved for a jit backend though, or at least postponed until we actually have 100% ops coverage
<Lynne>
vpermq is cross-lane, you should try replacing it with vperm2i128+shufpd
<haasn>
after switching from linux-perf back to the native backend
<haasn>
I wonder if this is actually the result of the poor granularity on the rdtsc effectively rounding the execution down to 0 in many cases
<haasn>
or else how can simply changing the timing method result in a dramatic speedup
<ramiro>
haasn: collect the number of 0 runs and print that as well
<kierank>
you need to run the function more times
<kierank>
the granularity is too low
jamrial has joined #ffmpeg-devel
<kierank>
10 cycles or dezicycles or whatever is way too small
<wbs>
kierank: it does run it many times, but the existing checkasm essentially threw away all the previous iterations and only reported the runtime from the last single iteration
<haasn>
increased it from 4 invocations to 16, now I get read_packed2_u8_ssse3: 16.0 ( 6.16x)
<haasn>
fun
<wbs>
kierank: due to an oversight in e0d56f097f42bcdbe6c3b2f57df62a4da63f2094
<kierank>
wbs: :(
<haasn>
wbs: the other problem I think is that we only run the function 4 times in between measuring timestamps
<haasn>
I think it would be better to switch to running the function say 16 times and reducing the run count accordingly
<haasn>
because at least on my platform, rdtsc only reports in increments of I think 36
<wbs>
haasn: maybe, but that feels like a change that depends a lot on the kind of function you're working with
<haasn>
sure, but it shouldn't matter for slower functions, and for faster functions it will improve accuracy a lot, no?
<haasn>
I mean obviously it would be better to adjust the number of runs dynamically to either meet a certain time target or stddev threshold per test
<wbs>
if we'd decrease the run count accordingly, then yeah, it could be fine
<wbs>
but the change to actually use tsum and not just t, feels like a slam dunk fix. the other one is probably ok but a bit less of an obvious change
<wbs>
haasn: but great catch!
<haasn>
read_packed2_u8_ssse3: 0.0 ( 0.00x)
<haasn>
with the default 4 iters, lol
<haasn>
wbs: sent it for now
<haasn>
wbs: can we have a shared checkasm living in a git submodule of dav1d/ffmpeg yet please?
<wbs>
haasn: that would be awesome indeed. unfortunately it is a bit tied to the host project in a bunch of ways
<haasn>
indeed, I suspect a fresh start would be needed
<haasn>
maybe a project for STF 2025 if anybody's interested in that still? :)
<wbs>
nah, I don't think a fresh start is needed, only a bit of refactoring to get it free from the host project e.g. cpu flags
<haasn>
would have to go through and cherry-pick the best parts of dav1d and ffmpeg as they have diverged in terms of features
<haasn>
well I was thinking we could also put it into a clean namespace while we're at it and make it more of a universal library
<haasn>
was it BBB who I was discussing the future of checkasm with at some point?
ngaullier has quit [Remote host closed the connection]
ngaullier has joined #ffmpeg-devel
<haasn>
may I propose making linux-perf the default timing backend as well
<haasn>
it really is substantially more accurate than rdtsc; I get better consistency from linux-perf at --runs=10 than from rdtsc at --runs=16
<haasn>
likely because of cpu scaling messing up the tsc results
<haasn>
it was so accurate that I could use it for timing even with the above bug (single iter); whereas rdtsc in the single-run case just gave no useful results
<ramiro>
haasn: +1 for making perf the default
<cone-276>
ffmpeg Niklas Haas master:256a38101fd1: tests/checkasm: fix wrong summation of bench time
<ramiro>
haasn: what would be the best struct/place to free the context created by a backend?
<ramiro>
haasn: also, in backend_x86's compile you recalculate chain->block_w every time. shouldn't this be done only once?
<haasn>
I guess the downside of this is that it locks you out from using `SwsOpPriv` for other purposes, if the amount of private data you need is less than 16 bytes
<haasn>
ramiro: the x86 backend compile() loops internally after setting the block size
<wbs>
ramiro: the gcc compile farm has one (or a couple) loongarch machines, I've tested dav1d on them before
<haasn>
backend_c does set the block size on every iter but it's not a big deal
<haasn>
to be honest I was considering dropping the ability for compile() to even return EAGAIN but kept it in there because why not
<haasn>
but maybe it would be better to make it clear that compile() should compile _all_ ops
<wbs>
haasn: perf is the default for benchmarking on arm, but I usually disable it
<ramiro>
haasn: yes, I think it's best to have compile() compile all ops
<wbs>
there's an abandoned patchset from courmisch that prepares for making it possible to switch benchmark implementation without recompiling
<ramiro>
haasn: hmm, I'm not calling ff_sws_op_compile_tables() though (which would help freeing the context)
<haasn>
you don't have to
<ramiro>
is it normal that ff_sws_op_chain_append() does nothing with the 'free' argument?
<ramiro>
haasn: thanks. so I tie my context to the chain.
<haasn>
btw, when you do switch to looping in asm, you may want to still keep the block_h reasonably small though
<haasn>
to enable threading it over larger images
Grimmauld has quit [Quit: Client closed]
<ramiro>
haasn: I think block_h would still be 1 or 2, and we need to figure out a way to let run_main/run_tail decide how many rows could be done per call.
<haasn>
I'm more worried atm about run_tail being slow with a too large block_w
<haasn>
the varying block sizes thing is turning into a bit of a pain especially for checkasm because we can't easily guarantee coverage of all possible block sizes
<haasn>
maybe a better approach would be for ops.c to compile a SwsOpChain per possible block size
<ramiro>
haasn: the tail is a small amount of data, especially on large image sizes. it would matter little from 720p and up. don't put this amount of energy into varying block sizes now. get it working with a full block size first and then measure the performance hit with different image sizes.
<haasn>
fair
<haasn>
at 4k even with block size 64 we have over 60 blocks per row
<haasn>
so even in the worst case of spending an entire extra block just for one pixel we are wasting only 1.6%
<haasn>
that is micro-optimization territory
Ishesh has joined #ffmpeg-devel
<ramiro>
haasn: do we really need SwsOpPriv? can't we just do the usual void* and let the function cast it?
<haasn>
we don't really need it though it does make the C code slightly more convenient so I'm not sure I see the harm
<haasn>
well, and there is the fact that void * is not 16 bytes large :)
<haasn>
so we do need a union somewhere; either there or in SwsOpImpl
<ramiro>
can't the backend specify the size of privdata?
<haasn>
it has to be fixed size
<haasn>
look at the way private data is stored in SwsOpImpl; it is alternating with the function pointers
manny2 has joined #ffmpeg-devel
<haasn>
this enables the non-JIT backends to just bump the pointer by 32 bytes and call the next function
<haasn>
without having to bump two different pointers or do math to determine how much to bump it by
<ramiro>
ah, that's good.
<haasn>
16 bytes of "immediate" data is plenty for almost all ops
<haasn>
so not only is this approach lower in per-kernel overhead, it also saves an indirection for all other kernels
<haasn>
and the fact that it's already in cache only helps us
<haasn>
typically the entire impl array fits into cache anyway
Anthony_ZO has quit [Ping timeout: 248 seconds]
manny2 has quit [Quit: Client closed]
manny81 has joined #ffmpeg-devel
<haasn>
Lynne: I don't think vpermq can be replaced by vperm2i128+shufpd here; I have {A0 B0 | A1 B1} and I want {A0 A1 | B0 B1}; after vperm2i128 I have tmp := {A1 B1 | A0 B0} but that doesn't help with a single shufpd to get the result I want
<haasn>
unless I use another shufpd but I somehow doubt vperm2i128 + 2 * shufpd is going to be faster than a single vpermq
<jamrial>
the output of the new tests that it adds will change
<jamrial>
if the output after your change looks wrong, then it would mean the unscaled code i added is wrong
<ramiro>
haasn: valgrind complains that the data returned from generate_bayer_matrix leaks
usagi_mimi has quit [Quit: WeeChat 4.5.2]
usagi_mimi has joined #ffmpeg-devel
usagi_mimi has quit [Changing host]
usagi_mimi has joined #ffmpeg-devel
usagi_mimi has quit [Ping timeout: 248 seconds]
Grimmauld has joined #ffmpeg-devel
<Lynne>
haasn: I see
<Lynne>
nak on making linux-perf the default
<Lynne>
you have a weird system
cone-276 has quit [Quit: transmission timeout]
<mkver>
Is there a nasm equivalent to -ffunction-sections?
ngaullier has quit [Remote host closed the connection]
Grimmauld has quit [Quit: Client closed]
<haasn>
pshufb can only shuffle within 16 bytes?
<jamrial>
yeah, within a lane
<jamrial>
for crosslane byte shuffle i think you need avx512
ccawley2011 has quit [Ping timeout: 245 seconds]
ccawley2011 has joined #ffmpeg-devel
<Gramner>
avx2 has vpermd and vpermq for 32- and 64-bit cross-lane shuffles. for smaller elements you need avx-512
manny81 has quit [Ping timeout: 240 seconds]
ccawley2011 has quit [Ping timeout: 260 seconds]
manny69 has joined #ffmpeg-devel
___nick___ has joined #ffmpeg-devel
ccawley2011 has joined #ffmpeg-devel
<Lynne>
vpermd is the white flag instruction you use to indicate you surrender
<Lynne>
it is almost never the right choice, reparations include losing the war, a register, a memory load, tons of CPU ports, and latency
<jamrial>
not necessarily. look at vector_fmul_reverse in float_dsp. vpermps (the float variant) is faster than insertf128 + shufps
<Lynne>
in more non-trivial applications such as FFTs, you can always fold the functionality of the shuffle back up the chain if you have enough patience and avoid it
Warcop has joined #ffmpeg-devel
IndecisiveTurtle has joined #ffmpeg-devel
___nick___ has quit [Ping timeout: 252 seconds]
<haasn>
for some reason I was thinking 16 bytes is less than one lane, but now I realize I am being silly and should probably go eat something instead
<haasn>
but that does mean I can generally load shuffle masks using vbroadcasti128 when I need it multiple times
IndecisiveTurtle has quit [Quit: IndecisiveTurtle]
___nick___ has joined #ffmpeg-devel
abdu has joined #ffmpeg-devel
<Lynne>
you can, though you ought to be careful
<Lynne>
it may be cheaper to just outright load 256bits at a time, even if they're duplicated, since vbroadcast uses the shuffle ports
<Lynne>
vbroadcast m, xmm has a latency of 4! 4! it's cheaper to vperm2i128 m, m+whatever happens to be in lane 2, 0x00
rit has quit [Remote host closed the connection]
rit has joined #ffmpeg-devel
<Gramner>
broadcast from memory of dwords and larger are not shuffle-µops, just regular memory loads
abdu83 has joined #ffmpeg-devel
abdu2 has joined #ffmpeg-devel
abdu has quit [Ping timeout: 240 seconds]
iive has joined #ffmpeg-devel
abdu83 has quit [Ping timeout: 240 seconds]
Ishesh has quit [Quit: Client closed]
<Lynne>
on zen3 they look like they use the same ports
cone-391 has joined #ffmpeg-devel
<cone-391>
ffmpeg Andreas Rheinhardt master:581a6a042ca8: doc/encoders: Move FFV1 encoder to video encoder section
<cone-391>
ffmpeg Andreas Rheinhardt master:4da84d5c2b2b: swscale/swscale_unscaled: Actually use X2->RGBA64 conversions
<Gramner>
not really. 4 was completely arbitrary to begin with
<Gramner>
haasn: needs to update avg_cycles_per_call() too though
abdu73 has joined #ffmpeg-devel
abdu73 has quit [Client Quit]
abdu2 has quit [Ping timeout: 240 seconds]
abdu73 has joined #ffmpeg-devel
Grimmauld has joined #ffmpeg-devel
ccawley2011 has quit [Read error: Connection reset by peer]
<haasn>
can somebody explain to me the whole decicycles confusion
<haasn>
so rdtsc is supposed to measure "ticks"
<haasn>
avg_cycles_per_call() then multiplies this by 10
<haasn>
for whatever reason
<haasn>
and print_bench calls that "decicycles"
<haasn>
but then it prints it as decicycles / 10
<haasn>
why not just never multiply anything, print the cycles directly as measured and call it as such?
<Gramner>
i think we rewrote that in dav1d because that was a bit confusing
<Lynne>
yeah, I'd be happy to see that changed
<pengvado>
the complication dates back to when we did everything with integer arithmetic. now that avg_cycles_per_call returns a float, there's no reason to *10.
<Lynne>
always an event when you respond
<iive>
recently I saw a discussion for scientifically accurate averaging functions, ones that won't lose data, because of the order floats are summed.