michaelni changed the topic of #ffmpeg-devel to: Welcome to the FFmpeg development channel | Questions about using FFmpeg or developing with libav* libs should be asked in #ffmpeg | This channel is publicly logged | FFmpeg 7.1.1 has been released! | Please read ffmpeg.org/developer.html#Code-of-conduct
abdu has joined #ffmpeg-devel
thilo has quit [Ping timeout: 260 seconds]
thilo has joined #ffmpeg-devel
derpydoo has joined #ffmpeg-devel
philipl has joined #ffmpeg-devel
blut has joined #ffmpeg-devel
\\Mr_C\\ has quit [Remote host closed the connection]
blut has left #ffmpeg-devel [#ffmpeg-devel]
Anthony_ZO has joined #ffmpeg-devel
^Neo has quit [Ping timeout: 252 seconds]
abdu has quit [Quit: Client closed]
jamrial has quit []
Traneptora has joined #ffmpeg-devel
kasper93 has quit [Ping timeout: 276 seconds]
kasper93 has joined #ffmpeg-devel
kasper93 has quit [Ping timeout: 252 seconds]
kasper93 has joined #ffmpeg-devel
mkver has joined #ffmpeg-devel
cone-291 has joined #ffmpeg-devel
<cone-291>
ffmpeg Andreas Rheinhardt master:18309fba3c82: avcodec/hq_hqadata: Avoid relocations
<cone-291>
ffmpeg Andreas Rheinhardt master:e38616c4acc5: avcodec/hq{xvlc,_hqadata}: Deduplicate and hardcode cbp table
<cone-291>
ffmpeg Andreas Rheinhardt master:12c9ffa569a0: avcodec/hq: Include alpha in cbp VLC table
<cone-291>
ffmpeg Andreas Rheinhardt master:ce0074f97bdc: avcodec/hq_hqa: Use RL-VLC table
<cone-291>
ffmpeg Andreas Rheinhardt master:9c0d6145c9e0: avcodec/hq_hqa: Include implicit +1 run in RL VLC table
<cone-291>
ffmpeg Andreas Rheinhardt master:c12108cdaa70: avcodec/hq_hqa: Don't zero in small chunks, don't zero twice
<cone-291>
ffmpeg Andreas Rheinhardt master:c1f124f3f03c: avcodec/hq_hqa: Use ff_vlc_init_from_lengths()
<cone-291>
ffmpeg Andreas Rheinhardt master:c39e23cc919b: avcodec/hq_hqa: Check available date before allocating frame
<cone-291>
ffmpeg Andreas Rheinhardt master:16943876f877: avcodec/hq_hqa: Remove implicit always-false checks
<cone-291>
ffmpeg Andreas Rheinhardt master:bf327ac6762d: avcodec/hq_hqa: Check size before initializing GetByteContext
System_Error has quit [Ping timeout: 264 seconds]
System_Error has joined #ffmpeg-devel
Anthony_ZO has quit [Remote host closed the connection]
derpydoo has quit [Quit: derpydoo]
cone-291 has quit [Quit: transmission timeout]
abdu has joined #ffmpeg-devel
pross has quit [Read error: Connection reset by peer]
mkver has quit [Ping timeout: 244 seconds]
mkver has joined #ffmpeg-devel
^Neo has joined #ffmpeg-devel
^Neo has quit [Changing host]
^Neo has joined #ffmpeg-devel
jamrial has joined #ffmpeg-devel
<haasn>
ramiro: sad, avx FMA instructions are not bit exact :( I guess this is where we start hitting real problems
<jamrial>
yeah, fma has high precision intermediates
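A minimal sketch of why a fused multiply-add can differ from a separate multiply and add: the fused form rounds once, after computing `a*b + c` exactly, while the unfused form rounds the product first. The helper names below are hypothetical, and exact rational arithmetic stands in for the hardware instruction; this is a model of the rounding behaviour, not of any FFmpeg code.

```python
from fractions import Fraction


def mul_add_unfused(a: float, b: float, c: float) -> float:
    """Two roundings: the product is rounded to double, then the sum is."""
    return a * b + c


def mul_add_fused(a: float, b: float, c: float) -> float:
    """Model of an FMA: compute a*b + c exactly, then round once to double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so that the exact product, 1 - 2**-60, is not representable
# as a double and rounds to 1.0 when the multiply is rounded on its own.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = mul_add_unfused(a, b, c)  # the low bits of the product are lost
fused = mul_add_fused(a, b, c)      # the low bits survive the single rounding
```

With these inputs the two results differ, which is the kind of mismatch that breaks bit-exact comparisons between an FMA code path and a non-FMA reference.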
<toots5446>
michaelni: thanks for the review. Do you have any suggestion on the logic that should be applied to that decoder patch? I would really like to move forward with it and I'm not sure what is the most advisable path. I'm all about a solution that is satisfactory for now and can be refined once we actually have code using it.
<haasn>
Is there an explanation anywhere of what the SBUTTERFLY and TRANSPOSE macros in x86util are doing, or how to use them?
<haasn>
cc BBB
<toots5446>
Maybe we should clear all pending metadata that have a PTS lower than the latest decoded frame? Or simply keep the latest pending metadata and clear it as soon as another one is submitted or a frame with a higher PTS is decoded.
<fflogger>
[editedticket] MasterQuestionable: Ticket #11430 ([avformat] [Regression] Data stream in output may glitch "-stats" display since 7.0) updated https://trac.ffmpeg.org/ticket/11430#comment:14
<kurosu>
haasn: it's DCT-like stuff
<haasn>
jamrial: I'm guessing that's for 8x8. What do the variants like 4x4B do to the extra elements beyond the first 4 bytes in each register? Just ignore them / put random data?
Anthony_ZO has quit [Ping timeout: 252 seconds]
<haasn>
How does TRANSPOSE8x8W work when mmsize < 32? Or does it just not work in that case?
<jamrial>
8 words is 16 bytes, so it works fine with mmsize 16
<jamrial>
it uses an extra reg on x86_64, or stack on x86_32, for temporary storage
<jamrial>
and yes, afaik, the ones smaller than the destination reg just end up with garbage in the upper bits, which can be ignored
<haasn>
oh, right - and there isn't an 8x8D variant. hrm
<haasn>
I have a scenario where I effectively need to transpose 32x8B. I can pshufb this to turn it into a 16x8D transpose instead
<haasn>
but none of those variants exist, it seems
<BBB>
haasn: I'm wondering if you're thinking about this the wrong way
<haasn>
(context: reading RGBA into separate registers for R, G, B, A)
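A rough model of the idea haasn describes: a byte shuffle first groups each register's bytes by channel, so that the remaining work is a transpose of 4-byte (dword) units rather than of single bytes. The sketch uses four 16-byte registers for brevity; the mask, the register count, and the function names are assumptions made for illustration, and the zeroing behaviour of the real `pshufb` is not modelled.

```python
# Hypothetical shuffle mask: gathers the four R bytes, then G, then B, then A.
MASK = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]


def pshufb(reg, mask):
    """Byte shuffle: out[i] = reg[mask[i]]."""
    return [reg[i] for i in mask]


def split_rgba(regs):
    """Split four 16-byte registers of packed RGBA into [R, G, B, A] planes."""
    # Step 1: within each register, RGBARGBA... becomes RRRRGGGGBBBBAAAA.
    grouped = [pshufb(r, MASK) for r in regs]
    # Step 2: view each register as four dwords, one per channel.
    dwords = [[g[i:i + 4] for i in range(0, 16, 4)] for g in grouped]
    # Step 3: a 4x4 dword transpose collects each channel into one register.
    planes = []
    for column in zip(*dwords):
        planes.append([byte for dword in column for byte in dword])
    return planes


# 16 pixels with labelled bytes, so the result is easy to inspect.
packed = [f"{ch}{i}" for i in range(16) for ch in "RGBA"]
regs = [packed[i:i + 16] for i in range(0, 64, 16)]
r_plane, g_plane, b_plane, a_plane = split_rgba(regs)
```

After the shuffle, each channel occupies one dword per register, which is why the problem reduces to a dword transpose.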
<jamrial>
16x16w uses 16 regs of 16 bytes each, so no mmsize 32 version
<BBB>
don't take this the wrong way
<BBB>
but these macros are not meant to be useful API
<BBB>
they are just repeated sets of instructions
<BBB>
that we turn into a macro so we don't have to explicitly write it out every time
<BBB>
transpose8x8w, for example, likely works for any register size, it's just not necessarily a full cross-lane transpose, just an in-lane one
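To illustrate the in-lane versus cross-lane distinction: on wide registers the AVX2 unpack instructions operate on each 128-bit lane independently. The sketch below models a register as a list and a lane as a fixed-size slice of it; the function name and the tiny lane size are assumptions chosen for readability.

```python
def unpack_lo_inlane(a, b, lane):
    """Interleave the low half of each lane of a with the same half of b."""
    out = []
    for start in range(0, len(a), lane):
        lane_a = a[start:start + lane]
        lane_b = b[start:start + lane]
        for x, y in zip(lane_a[:lane // 2], lane_b[:lane // 2]):
            out += [x, y]
    return out


a = [f"a{i}" for i in range(8)]
b = [f"b{i}" for i in range(8)]

# One lane spanning the whole register: a plain interleave of the low halves.
single_lane = unpack_lo_inlane(a, b, lane=8)
# Two lanes of four elements: each lane is interleaved on its own.
two_lanes = unpack_lo_inlane(a, b, lane=4)
```

In the two-lane case the output skips from `a1, b1` straight to `a4, b4`, which is why a transpose built from such instructions is only an in-lane transpose unless a separate cross-lane permute fixes the order.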
<BBB>
is that useful? depends on what you're trying to do...
<jamrial>
yeah, if there isn't a specific macro, it's because no decoder needed it
<jamrial>
you can just go and mix SBUTTERFLYs and SWAPs to get 8x8d
<haasn>
what is SBUTTERFLY doing on a conceptual level?
<haasn>
{abcd} {xyzw} -> {axby} {czdw}?
<jamrial>
there's a small explanation above its definition
<haasn>
right, I think I get it now
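A conceptual model of the interleave haasn guesses at, assuming SBUTTERFLY amounts to an unpack-low / unpack-high pair applied to two registers. Registers are modelled as Python lists; the function names and the `unit` parameter (how many elements move together) are hypothetical. Two rounds of such butterflies, the second at twice the granularity, give a 4x4 transpose.

```python
def sbutterfly(a, b, unit=1):
    """Return (interleaved low halves, interleaved high halves) of a and b.

    `unit` is the number of adjacent elements that are moved as one block.
    """
    ua = [a[i:i + unit] for i in range(0, len(a), unit)]
    ub = [b[i:i + unit] for i in range(0, len(b), unit)]
    half = len(ua) // 2
    lo = [e for x, y in zip(ua[:half], ub[:half]) for e in x + y]
    hi = [e for x, y in zip(ua[half:], ub[half:]) for e in x + y]
    return lo, hi


def transpose4x4(r0, r1, r2, r3):
    """Transpose four 4-element rows using two rounds of butterflies."""
    # Round 1: interleave single elements of neighbouring rows.
    t0, t1 = sbutterfly(r0, r1)
    t2, t3 = sbutterfly(r2, r3)
    # Round 2: interleave pairs, which completes each output column.
    c0, c1 = sbutterfly(t0, t2, unit=2)
    c2, c3 = sbutterfly(t1, t3, unit=2)
    return c0, c1, c2, c3


lo, hi = sbutterfly(list("abcd"), list("xyzw"))
rows = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
cols = transpose4x4(*rows)
```

The first call reproduces the `{abcd} {xyzw} -> {axby} {czdw}` pattern from the question above.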
<haasn>
I think the problem I'm facing is that I have contiguous elements in separate _lanes_ of the same register
<haasn>
whereas all of these interleaving instructions will end up placing elements from different registers adjacent to each other
<haasn>
maybe what I should be doing is loading xmm sized registers at a time then
<jamrial>
SBUTTERFLY dqqq is crosslane
<jamrial>
and like BBB said, they are just macros for specific combinations of instructions, to make code shorter
<jamrial>
just write the shuffles with punpck*, vperm and such as needed
Marth64 has joined #ffmpeg-devel
Marth64[m] has quit [Ping timeout: 276 seconds]
<haasn>
and then I can unpack the lanes individually and it will all be in the correct order at the end
<haasn>
yeah I think that's the approach I'll go with
<haasn>
I don't imagine movu xm0 + vinserti128 ym0 is significantly slower than movu ym0, the bottleneck is going to be reading data either way
DauntlessOne4 has quit [Ping timeout: 252 seconds]
<jamrial>
haasn: maybe a gather
<haasn>
will try that also
abdu59 has quit [Ping timeout: 240 seconds]
<haasn>
out of interest I did benchmark vinserti128 vs just fixing the lane order afterwards using vpermq for read_packed2 and the former was 8.1 cycles vs the latter 7.9 cycles
abdu has joined #ffmpeg-devel
<kurosu>
Is looking at anything below 0.5 cycles meaningful? It doesn't exist?
<kurosu>
And I stopped looking at vpgather for anything that is stride loading. Even if dav1d using them would indicate they're beneficial
<fflogger>
[newticket] kasper93: Ticket #11545 ([avcodec] [dec:libzvbi_teletextdec] Error while opening decoder: Internal bug, should not have happened) created https://trac.ffmpeg.org/ticket/11545
<michaelni>
toots5446, really, whatever is simple and clean. Because whatever is chosen might turn out to fail in some corner case (maybe timestamp resets, maybe a corrupted frame), it's easier to adjust when it's in git and we have testcases than by just thinking about theory
<mkver>
kasper93: Why don't you just ping these patches?
zsoltiv has joined #ffmpeg-devel
zsoltiv_ has joined #ffmpeg-devel
<kasper93>
I did. I'm trying a new strategy to track patches that no one wants to merge
<fflogger>
[editedticket] MasterQuestionable: Ticket #11542 ([ffmpeg] gdigrab sometimes fails to capture specific windows on some machines) updated https://trac.ffmpeg.org/ticket/11542#comment:15
<kasper93>
would it be possible to add v7 to trac? The latest version you can tag is 6.1.1
<jkqxz>
Is there an explanation anywhere of how the frame/slice threading for an lavc decoder works?
<jkqxz>
I seem to have slice threading working trivially by just calling execute2 at the right moment, but I'm not seeing how to connect the pieces together to get frame threading by similar magic.
<jkqxz>
(And I'm pretty sure for frame threading I need to do more to manage context structures.)
<mkver>
jkqxz: If it is an intra-only codec (with no dependencies between frames), then just use s/ff_get_buffer/ff_thread_get_buffer/ and add the AV_CODEC_CAP_FRAME_THREADS.
<jkqxz>
How is that serialising use of private context structures?
<mkver>
If not, you will need an update_thread_context callback. The paradigm here is as follows: Thread A parses the header and sets ctx fields appropriately, then calls ff_thread_get_buffer() and finally ff_thread_finish_setup(). After this, thread A decodes its frame and signals decoding progress, typically via ff_progress_frame_report(). Thread A must not modify any field read by update_thread_context after that.
<mkver>
Every worker thread has its own private context.
<mkver>
FFCodec.init is called on every worker thread.
<jkqxz>
decoder->init got called N times?
<mkver>
Yes.
<jkqxz>
Huh. Ok.
<mkver>
(Decoders can check whether they are the first frame worker thread in order to parse extradata only once.)
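A sequential toy model of the paradigm mkver describes: every worker owns a private context, init runs once per worker, and before a worker starts a frame the state that later frames depend on is copied over from the previous worker's context. Everything here is hypothetical (the class, the fields, the round-robin dispatch, and the convention that a header of `None` means "no new parameters"); it only sketches the data flow and uses no real lavc API.

```python
from dataclasses import dataclass


@dataclass
class WorkerContext:
    initialised: bool = False
    width: int = 0            # header-derived state that later frames rely on
    height: int = 0
    frames_decoded: int = 0   # private bookkeeping, never copied


def init(ctx):
    """Per-worker init: runs once for every worker context."""
    ctx.initialised = True


def update_thread_context(dst, src):
    """Copy only the state the next frame needs from the previous worker."""
    dst.width, dst.height = src.width, src.height


def decode_all(headers, n_workers):
    workers = [WorkerContext() for _ in range(n_workers)]
    for ctx in workers:
        init(ctx)
    decoded = []
    prev = None
    for i, header in enumerate(headers):
        ctx = workers[i % n_workers]          # assumed round-robin dispatch
        if prev is not None and prev is not ctx:
            update_thread_context(ctx, prev)
        if header is not None:                # "parse the header"
            ctx.width, ctx.height = header
        # From here on the fields read by update_thread_context stay fixed.
        ctx.frames_decoded += 1
        decoded.append((i % n_workers, ctx.width, ctx.height))
        prev = ctx
    return workers, decoded


workers, decoded = decode_all(
    [(640, 480), None, None, (1280, 720), None], n_workers=3
)
```

For an intra-only codec the copy step carries nothing a frame depends on, which is why independent per-worker contexts are enough in that case.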
<mkver>
What decoder are we talking about?
<jkqxz>
Making that change does seem to run. Disabling slice threads and decoding a three-frame file gets close to a 3x speedup.
<jkqxz>
I will run in tsan to make sure I haven't got some other bad interaction.
<mkver>
You can't really have races with intra-only codecs without update_thread_context.
<mkver>
What decoder are we talking about?
<mkver>
It's not mjpeg, is it?
<jkqxz>
If the approach is essentially to make N instances of the decoder and use them completely independently then indeed it should be good.
<jkqxz>
APV.
<jkqxz>
How do frame and slice threads interact here?
<jkqxz>
On an input with strong slice threading opportunities enabling both makes it much slower.
<mkver>
Only one threading type is active for any AVCodecContext.
<mkver>
Frame threading is generally preferred over slice threading, see lavc/pthread.c
<mkver>
(Hypothetically users can select to use frame threads and override the execute/execute2 callbacks to use both.)
<jkqxz>
Is there any underlying reason for that choice or is it just it hasn't been implemented?
<mkver>
It would not work with the way progress is signaled for frame-threaded decoders (progress is a simple number, typically meaning "the number of macroblock rows that have been decoded and can be referenced").
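A small sketch of progress as a single number, as described above: the decoding thread reports how many rows are finished, and a thread that needs row N of the reference frame waits until the counter reaches N. The class and method names are hypothetical; this is a generic condition-variable pattern, not the lavc implementation.

```python
import threading


class FrameProgress:
    """A monotonically increasing count of rows that are safe to reference."""

    def __init__(self):
        self._rows_done = 0
        self._cond = threading.Condition()

    @property
    def rows_done(self):
        with self._cond:
            return self._rows_done

    def report(self, rows_done):
        """Called by the decoding thread after finishing more rows."""
        with self._cond:
            if rows_done > self._rows_done:
                self._rows_done = rows_done
                self._cond.notify_all()

    def wait_for(self, rows_needed):
        """Block until at least `rows_needed` rows have been reported."""
        with self._cond:
            self._cond.wait_for(lambda: self._rows_done >= rows_needed)


progress = FrameProgress()
seen = []


def dependent_frame():
    # The next frame needs the first five rows of its reference.
    progress.wait_for(5)
    seen.append(progress.rows_done)


waiter = threading.Thread(target=dependent_frame)
waiter.start()
for row in range(1, 9):       # the reference frame decodes eight rows in order
    progress.report(row)
waiter.join()
```

A single counter only makes sense when rows complete in order, which is one way to see why it does not combine directly with slices of the same frame finishing in arbitrary order.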
^Neo has quit [Ping timeout: 248 seconds]
<jkqxz>
Right, for other decoders which do have interdependency. But if I do have a decoder where every slice is independent then it could be implemented.
<jkqxz>
Yep. And it would need the new cap to distinguish existing codecs which support slice xor frame from codecs which can support slice or frame.
<mkver>
I am not sure it would need a new cap. Users can already choose both, but lavc will just use one (there is a field "active_thread_type" (set by lavc) and thread_type (or so)).
<mkver>
s/new cap/new public cap/
<jkqxz>
True, it could be internal-only.
jamrial has quit []
jamrial has joined #ffmpeg-devel
TheAIDev has joined #ffmpeg-devel
<TheAIDev>
any plans for subtitle rework/patches soon?