klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
X-Scale28 has quit [Quit: Client closed]
xx0a_q has joined #osdev
xx0a_q has quit [Client Quit]
xx0a_q has joined #osdev
xx0a_q has quit [Client Quit]
xx0a_q has joined #osdev
jeaye is now known as j
j is now known as Guest3803
Guest3803 is now known as jeaye
jeaye is now known as jeaye_
gog has quit [Quit: byee]
Marsh has joined #osdev
MiningMarsh has quit [Ping timeout: 276 seconds]
Marsh is now known as MiningMarsh
ignucio has joined #osdev
obrien has quit [Remote host closed the connection]
heat_ has joined #osdev
heat has quit [Read error: Connection reset by peer]
karenthedorf has quit [Ping timeout: 244 seconds]
heat_ is now known as heat
ignucio has quit [Quit: Leaving]
ignucio has joined #osdev
Matt|home has joined #osdev
blockhead has joined #osdev
X-Scale has joined #osdev
andydude is now known as andydude2
Matt|home has quit [Ping timeout: 256 seconds]
heat has quit [Read error: Connection reset by peer]
heat has joined #osdev
heat has quit [Ping timeout: 252 seconds]
<azonenberg> So I'm chasing a really annoying heisenbug in some ARM firmware
<azonenberg> tl;dr the observed symptom is iperf throughput on my platform dropping by 50% randomly whenever I make apparently unrelated changes to code
<azonenberg> e.g. adding a print statement called once during boot
<azonenberg> I strongly suspect something is sensitive to alignment and runs a lot slower in some memory layouts than others
<azonenberg> anyone have ideas on how to zero in on *what* is the problem?
moire has quit [Remote host closed the connection]
moire has joined #osdev
moire has quit [Remote host closed the connection]
moire has joined #osdev
<clever> azonenberg: ive heard about similar issues with profiling some big codebase, it turns out that the env vars are at the top of the stack, so the alignment of the stack shifts, based on the size of the env, which then had major implications on the benchmarks
<clever> essentially invalidating the last decade of optimizations based on benchmarks
<azonenberg> lol fuuuun
<azonenberg> i dont think its stack alignment because changing a few function calls in an init function (which returns the stack to the same state on completion) can sometimes have huge impacts on code called much later in the application
<azonenberg> I think its flash alignment
<clever> ive also seen comments in the rpi headers, that the whacky layout of h264 frames in memory, is designed around the dram controller and how bytes map to banks
<clever> so you can avoid conflicts in the open dram rows
<azonenberg> yeah i've done equally silly stuff in asic designs using weird memory layouts to match to sram banks
<bslsk05> ​github.com: linux/include/uapi/drm/drm_fourcc.h at rpi-6.6.y · raspberrypi/linux · GitHub
<clever> read this comment
<clever> about all i can offer, is to try printing the addresses of things you think are critical, and try making them match, maybe both in physical and virtual?
goliath has joined #osdev
<clever> azonenberg: oh, also, ive run into alignment issues with shaders and irq vector tables, the hw essentially ignores the lower N bits of the addr, rounding it down to the correct alignment
<clever> azonenberg: and if you didnt align the data in ram correctly, the hw reads from completely the wrong addr, and does undefined things
<clever> could something like that be happening in the boot firmware, and impacting the dram or ethernet?
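The "ignores the lower N bits" behaviour clever describes amounts to masking the address before it is used. A minimal sketch in Python, assuming a power-of-two alignment requirement; the function name and the example addresses are made up for illustration and do not come from any particular chip:

```python
def hw_effective_address(addr: int, align: int) -> int:
    """Address the hardware would actually use if it drops the low bits.

    `align` must be a power of two; this mimics a peripheral that simply
    does not look at the low log2(align) address bits.
    """
    if align <= 0 or align & (align - 1):
        raise ValueError("align must be a power of two")
    return addr & ~(align - 1)


# Hypothetical case: a table placed at 0x20000010, consumed by hardware that
# requires 32-byte alignment, is silently read from 0x20000000 instead.
print(hex(hw_effective_address(0x20000010, 32)))
```

Printing the real address of a suspect structure next to its masked form is a quick way to see whether the hardware and the software agree on where the data lives.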
<azonenberg> there is no dram in my case, its all sram
<azonenberg> cortex-m7 dtcm for most of the fun stuff
<clever> ah, how is iperf even running on that thing?
<azonenberg> It's not iperf per se it's an iperf-compatible server (or at least a subset of iperf3 protocol, right now only supports reverse mode in UDP)
<azonenberg> bare metal no OS
<clever> ahhh
<azonenberg> connected to an external FPGA via the parallel memory controller
<azonenberg> with a gigabit MAC and TX/RX FIFOs in the FPGA
netbsduser has joined #osdev
<azonenberg> (the MCU only has a 10/100 MAC on it, this lets me push higher speeds - and more importantly gives me a connection to the FPGA, the iperf is mostly a stress test of that link)
<clever> id get out a scope or something, configure it to blast packets non-stop in some direction, and see what kind of duty cycle you're getting on the various busses
<clever> is the firmware hanging abnormally long in between packets?
<azonenberg> With the happy firmware it's saturating the bus
<azonenberg> (the fpga-mcu bus)
<azonenberg> pushing 528 Mbps
<azonenberg> with the sad firmware it gets about 220
<azonenberg> i've been refactoring the code making small changes unrelated to ethernet and it's randomly bouncing between those two results
<clever> another thought that comes to mind
<clever> the flash XIP window on the rp2040, has 64bit cache lines, but its wonky
<clever> each 64bit line, has 2 present flags, and a tag
<azonenberg> This is a STM32H735 executing out of on-die flash
<clever> when there is a 32bit fetch and cache miss, it will only fill half the cache line
<azonenberg> AFAIK the only caching on the flash is what the cortex-m7 has internally
<azonenberg> I conjecture misalignment of some hot path to CM7 cache lines is the problem
<clever> so if your code is 32bit aligned, it might be wasting half a cache line here and there
<azonenberg> and i'm getting stalls on the I-side L1$
<clever> but if your code is 64bit aligned, it wont waste cache lines
<clever> can you try something like `objdump -t foo.elf` on all of the good and bad firmwares, and see what patterns stand out?
<clever> are certain hot functions aligned differently in the good and bad case?
<azonenberg> i've been doing that. so far on the stuff i've spot checked one particular pair of good/bad firmwares had a 16 byte shift
<azonenberg> everything was 16 byte aligned in flash just moved one 16-byte line over
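The good/bad symbol table comparison discussed here can be scripted rather than eyeballed. A rough sketch, assuming the text comes from GNU `objdump -t` on a linked ELF (address first, then flags, section, size, name) and that the interesting granularity is 32 bytes; the symbol names and addresses in the sample are invented:

```python
def parse_symbols(objdump_text: str) -> dict[str, int]:
    """Map function name -> address from `objdump -t` style output.

    Only lines that carry the 'F' (function) flag and live in .text are kept.
    """
    symbols = {}
    for line in objdump_text.splitlines():
        parts = line.split()
        if len(parts) < 5 or ".text" not in parts:
            continue
        flags = parts[1:parts.index(".text")]
        if "F" not in flags:
            continue
        try:
            symbols[parts[-1]] = int(parts[0], 16)
        except ValueError:
            continue
    return symbols


def alignment_report(good: dict[str, int], bad: dict[str, int], line: int = 32):
    """Return (name, good_offset, bad_offset) for every function whose offset
    inside a `line`-byte block differs between the two builds."""
    report = []
    for name in sorted(good.keys() & bad.keys()):
        g, b = good[name] % line, bad[name] % line
        if g != b:
            report.append((name, g, b))
    return report


# Invented sample data standing in for two real symbol tables.
GOOD = """
08001000 g     F .text  00000040 eth_poll
08001040 l     F .text  00000020 fifo_push
"""
BAD = """
08001010 g     F .text  00000040 eth_poll
08001060 l     F .text  00000020 fifo_push
"""

for name, g, b in alignment_report(parse_symbols(GOOD), parse_symbols(BAD)):
    print(f"{name}: offset {g} in good build, {b} in bad build")
```

With more than two binaries, running the same report for each fast/slow pair and intersecting the results narrows the list to functions whose alignment tracks the performance state.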
<clever> ah
<clever> inter_core_header hdr __attribute__((aligned(16))) = {
<clever> dma_cb dmacontrol1 __attribute__((aligned(32)));
<clever> you could maybe try adding this to various functions, to force a coarser alignment, and see if it changes anything?
* clever heads off to bed
jeaye_ is now known as jeaye
netbsduser has quit [Ping timeout: 260 seconds]
gcoakes has quit [Ping timeout: 248 seconds]
foudfou has quit [Remote host closed the connection]
foudfou has joined #osdev
Stellacy has quit [Quit: Leaving]
xx0a_q has quit [Ping timeout: 245 seconds]
Stellacy has joined #osdev
xx0a_q has joined #osdev
stefanct__ has joined #osdev
stefanct has quit [Read error: Connection reset by peer]
stefanct__ is now known as stefanct
X-Scale has quit [Ping timeout: 256 seconds]
kfv has joined #osdev
X-Scale has joined #osdev
X-Scale66 has joined #osdev
X-Scale has quit [Ping timeout: 256 seconds]
Left_Turn has joined #osdev
GeDaMo has joined #osdev
kof673 has quit [Remote host closed the connection]
X-Scale66 has quit [Ping timeout: 256 seconds]
seds has joined #osdev
Left_Turn has quit [Ping timeout: 276 seconds]
Reinfeld has quit [Ping timeout: 248 seconds]
Left_Turn has joined #osdev
Left_Turn has quit [Ping timeout: 260 seconds]
Left_Turn has joined #osdev
Stellacy has quit [Quit: Leaving]
xx0a_q has quit [Ping timeout: 248 seconds]
xx0a_q has joined #osdev
Gooberpatrol66 has quit [Ping timeout: 248 seconds]
MiningMarsh has quit [Quit: ZNC 1.9.1 - https://znc.in]
kfv has quit [Remote host closed the connection]
MiningMarsh has joined #osdev
kfv has joined #osdev
kfv has quit [Remote host closed the connection]
kfv has joined #osdev
kfv has quit [Remote host closed the connection]
X-Scale has joined #osdev
kfv has joined #osdev
<nikolar> KERNAL
<zid> It's spelled cornice
<guideX> when you pop your kernel, are you now out of kernel mode and in popcorn mode?
<sham1> Karnel
kfv has quit [Remote host closed the connection]
<nikolar> COLONEL
kfv has joined #osdev
<zid> It is
<zid> warm
<nikolar> what is
<zid> it
<zid> The glorious eternal it
<nikolar> IT
<nikolar> oh i get it, "Information Technology"
<goliath> guideX, apply pushcorn to get back
X-Scale has quit [Ping timeout: 256 seconds]
kfv has quit [Remote host closed the connection]
kfv has joined #osdev
<sham1> There certainly is a kernel of truth there
<nikolar> :|
obrien has joined #osdev
foudfou has quit [Remote host closed the connection]
foudfou has joined #osdev
MiningMarsh has quit [Quit: ZNC 1.9.1 - https://znc.in]
Stellacy has joined #osdev
<sham1> Oh, I'm sorry. Should have said kernal of truch
<sham1> Truth, even
<nikolar> there we go
X-Scale has joined #osdev
<zid> It's spelled colonal
MiningMarsh has joined #osdev
<nikolar> :O
Starfoxxes has joined #osdev
heat has joined #osdev
heat has quit [Read error: Connection reset by peer]
heat has joined #osdev
<Ermine> it's spelled colonial
<heat> kernial
X-Scale has quit [Quit: Client closed]
Stellacy has quit [Read error: Connection reset by peer]
edr has joined #osdev
<heat> ugh glob isn't working properly
<heat> :(
<sham1> Globial
<nikolar> why not
<heat> i dont know, gnu make is failing to wildcard src/*/*.c
<heat> wildcarding src/* and then src/*/*.c works
<sham1> I would have expected it to be src/**/*.c
<sham1> Anyway, glob bad
<heat> i just want the musl build to work
andydude2 is now known as andydude
<heat> doing ls src/*/*.c is working properly at least, on both bash and dash
MiningMarsh has quit [Quit: ZNC 1.9.1 - https://znc.in]
<nikolar> huh interesting
<nikolar> what if you replace the first * with **
MiningMarsh has joined #osdev
<heat> doesnt work
<sham1> Glob-trodding
<heat> oh this is lovely gnu make has its own directory cache
memset has quit [Remote host closed the connection]
memset has joined #osdev
kfv has quit [Remote host closed the connection]
kfv has joined #osdev
heat has quit [Read error: Connection reset by peer]
heat_ has joined #osdev
kfv has quit [Remote host closed the connection]
<heat_> ah i fixed it, my readdir d_type was broken on ext2
kfv has joined #osdev
vdamewood has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
X-Scale has joined #osdev
lojik has quit [Quit: ZNC 1.8.2 - https://znc.in]
lojik has joined #osdev
kfv has quit [Remote host closed the connection]
X-Scale has quit [Ping timeout: 256 seconds]
gcoakes has joined #osdev
kfv has joined #osdev
X-Scale has joined #osdev
gcoakes has quit [Ping timeout: 264 seconds]
kfv has quit [Ping timeout: 252 seconds]
kfv has joined #osdev
kfv has quit [Remote host closed the connection]
kfv has joined #osdev
X-Scale has quit [Ping timeout: 256 seconds]
goliath has quit [Quit: SIGSEGV]
kfv has quit [Remote host closed the connection]
kfv has joined #osdev
<nikolar> lel
blockhead has quit []
xx0a_q has quit [Quit: WeeChat 4.3.5]
hwpplayer1 has joined #osdev
jistr has quit [Remote host closed the connection]
kfv has quit [Remote host closed the connection]
hwpplayer1 has quit [Read error: Connection reset by peer]
<Ermine> yay, linux desktop gods answered me
FreeFull has quit [Quit: rebooting]
jistr has joined #osdev
kfv has joined #osdev
FreeFull has joined #osdev
kfv has quit [Remote host closed the connection]
kfv has joined #osdev
<vin> I use MAP_NORESERVE on a huge map and perform writes to a small part of that map (flags = MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB | MAP_NORESERVE). I expected the writes to this map to show up in RSS, but they don't. Is that expected?
<heat_> no
<heat_> are you measuring rss properly using smaps?
<vin> No I am still using statm, I haven't gotten around to writing a program that parses pagemap yet
<heat_> smaps tells you the vma's rss
<heat_> *precisely*
<heat_> i already told you the rss counter in status/statm isn't accurate
<vin> Yes yes, but I am working on gbs of data, so I figured it should be okay for now and wondered if MAP_NORESERVE is causing the incorrect RSS. I will test with smaps sums now
kfv has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<vin> heat_: I can confirm it is not an issue with RSS but something to do with MAP_NORESERVE here is a simple test file that reproduces the issue https://0x0.st/XfID.cc
<vin> Although I write 5 GiB worth of data, the RSS remains 3448832 bytes
<heat_> dude what
<heat_> that's not how your parse smals
<heat_> smaps
<heat_> wait, you're doing it in a loop
<vin> why not? Might not be the fastest way but it does the job
<heat_> i don't know what's happening honestly. does it only happen when MAP_NORESERVE?
<vin> No looks like even when I remove MAP_NORESERVE any writes to the mapped area are not tracked by RSS!! https://0x0.st/Xflr.cc
<vin> Thats crazy
<heat_> it might be that it's only tracked by AnonHugePages or something
josuedhg has joined #osdev
<vin> The reason I am not using THP is because the madvise (MADV_HUGEPAGE/MADV_COLLAPSE) is simply a hint and does not right away collapse the base pages to 2 MiB pages :/
<bslsk05> ​lore.kernel.org: hugetlb pages not accounted for in rss
<heat_> MADV_COLLAPSE is not a hint
<vin> from the man page "MADV_COLLAPSE operates on the current state of memory of the calling process and makes no persistent changes or guarantees on how pages will be mapped, constructed, or faulted in the future."
<heat_> yes, *in the future*
<heat_> thats why they are transparent
<vin> you mean on the next page walk?
<vin> *page table
<heat_> no
<heat_> nothing stops them from breaking down the thps randomly, but that's why they are transparent
<heat_> once collapsed it doesn't mean they won't be broken down
<heat_> MADV_COLLAPSE is a "hey, you should really really really try and collapse what you can"
<vin> right. This is "try and collapse" so it can return without collapsing anything?
<heat_> yes
<heat_> it will try to, but if you dont have huge pages available it'll simply not do anything
<heat_> because, again, *transparent* huge pages
goliath has joined #osdev
xenos1984 has quit [Ping timeout: 248 seconds]
xenos1984 has joined #osdev
<vin> yes, I have two options: either switch to base pages and madvise (assuming RSS will work correctly after collapse) or continue using hugeTLB and rely on Private_Hugetlb of smaps to calculate the correct RSS
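The second option, summing the hugetlb counters from smaps alongside Rss, could look roughly like the sketch below. It assumes the usual `Name:   value kB` layout of `/proc/<pid>/smaps`; the sample text and its numbers are invented to mirror the 5 GiB test and are not output from vin's program:

```python
import re

_FIELD = re.compile(r"^(\w+):\s+(\d+) kB$")


def smaps_totals(smaps_text, fields=("Rss", "Private_Hugetlb", "Shared_Hugetlb")):
    """Sum the selected per-VMA counters (in kB) over a whole smaps dump."""
    totals = dict.fromkeys(fields, 0)
    for line in smaps_text.splitlines():
        m = _FIELD.match(line.strip())
        if m and m.group(1) in totals:
            totals[m.group(1)] += int(m.group(2))
    return totals


# Invented two-VMA sample: one hugetlb mapping, one ordinary heap mapping.
SAMPLE = """
7f0000000000-7f0140000000 rw-s 00000000 00:0f 1234   /anon_hugepage (deleted)
Rss:                   0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb: 5242880 kB
55d000000000-55d000021000 rw-p 00000000 00:00 0      [heap]
Rss:                 132 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
"""

totals = smaps_totals(SAMPLE)
resident_kb = sum(totals.values())  # Rss plus both hugetlb counters
print(totals, resident_kb)
```

On a live Linux system the same function can be fed `open("/proc/self/smaps").read()`. Reading the file once and summing is enough; there is no need to poll it in a loop unless the goal is to watch the value change.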
Matt|home has joined #osdev
xenos1984 has quit [Ping timeout: 252 seconds]
<Ermine> heat_: that sound port issue is apparently caused by linuks kernal
<heat_> POGGARS
<heat_> should've used onyx bozos
<Ermine> real shit
xenos1984 has joined #osdev
<nikolar> Onyx doesn't even have a sound subsystem
<heat_> see? sound issues fixed ✅
<heat_> BUGLESS
seds has quit [Remote host closed the connection]
<nikolar> You can't have bugs if there's no code
<nikolar> Genius
<Ermine> truly MINIMALIST line of reasoning
eck has quit [Quit: PIRCH98:WIN 95/98/WIN NT:1.0 (build 1.0.1.1190)]
<heat_> sound is BLOAT indeed
eck has joined #osdev
<Ermine> it bloats up air with motion
<Ermine> all in all, this journey showed me that whole linux sound architecture is quite enigmatic
gog has joined #osdev
<heat_> linux desktop eh
<gog> no
<heat_> sophia desktop when
<Ermine> i.e. alsa exposes one set of abstractions, and pipewire exposes another set of abstractions
<Ermine> and it's not clear to me how those map to each other
<gog> heat_: sophia is being ported to .net
<Ermine> port it to java and use audioflinger
<gog> it will now be a microservice webdevsktop
<heat_> it's joever
<gog> no it's coconutpilled and contextmaxxing
<gog> it will have all the contexts
<heat_> is it brat
<gog> HttpContext, AuthenticationContext
<gog> BratContext
<heat_> OKAY NOW WE'RE TALKING
<gog> yes
<heat_> did yall know it's completely undefined to recover from a double fault
<Ermine> VMAContext
<gog> nah it's fine
<gog> you can just reinit everything in the double fault handler
<gog> is fine
<heat_> you actually cannot
<heat_> sdm tells you that
<gog> sdm is a coward
<heat_> bozos
<heat_> Ermine, i can build musl on onyx onyx goated
<Ermine> heat_: great job!
<heat_> thank
memset has quit [Ping timeout: 260 seconds]
memset has joined #osdev
josuedhg has quit [Ping timeout: 256 seconds]
gildasio has quit [Ping timeout: 260 seconds]
FreeFull has quit [Quit: New kernel version]
gildasio has joined #osdev
Matt|home has quit [Quit: Client closed]
<Ermine> so... the only correct way to proceed in a double fault case is to cause triple fault?
Matt|home has joined #osdev
sympt has joined #osdev
<nikolar> ¯\_(ツ)_/¯
<nikolar> Probably
FreeFull has joined #osdev
<Ermine> btw OS/2 used triple fault to reboot
<heat_> no, the only correct way to proceed in a double fault is to halt the system
<Ermine> oh, indeed, that's what Volume 3A page 6-33 says
xx0a_q has joined #osdev
netbsduser has joined #osdev
memset has quit [Ping timeout: 260 seconds]
Matt|home has quit [Quit: Client closed]
xx0a_q has quit [Quit: WeeChat 4.3.5]
josuedhg has joined #osdev
Starfoxxes has quit [Remote host closed the connection]
sympt has quit [Ping timeout: 276 seconds]
Matt|home has joined #osdev
Gooberpatrol66 has joined #osdev
X-Scale has joined #osdev
* azonenberg muffled swearing in the general direction of cortex-m7 lacking decent performance counters
<azonenberg> i want to see all kinds of microarchitectural details like DTCM accesses per unit time and more importantly number of cycles with a DTCM0 / DTCM1 single access, no access, or a dual issue
<azonenberg> it looks like the M85 adds a PMU
<azonenberg> (optionally at least)
<azonenberg> while the M7 doesnt have support for it
xx0a_q has joined #osdev
gog has quit [Ping timeout: 252 seconds]
xx0a_q has quit [Quit: WeeChat 4.3.5]
Matt|home has quit [Ping timeout: 256 seconds]
gorgonical has joined #osdev
heat_ has quit [Read error: Connection reset by peer]
heat_ has joined #osdev
<gorgonical> Thanks guys
<gorgonical> For being so helpful over the years when I was getting my start doing this
<nikolar> gorgonical: you ok
<gorgonical> yes yes of course
<gorgonical> just sentimental today
<nikolar> ah ok
<gorgonical> I'm writing my acknowledgments for my dissertation and I've been waiting literally six years to thank you all
<nikolar> heh did you write "special thanks to #osdev"
<gorgonical> I did
<nikolar> neat
<nikolar> i mean i wasn't around for the most of it and when i was, i really didn't help much :P
<gorgonical> My first work was a hypervisor for x86 and my advisor was very busy at the time and this community gave me so much info and guidance
<nikolar> that's cool
<gorgonical> geist in particular has a knack for tolerating questions from beginners
GeDaMo has quit [Quit: 0wt 0f v0w3ls.]
<nikolar> well he is the wise elder here
<nikolar> i've been bugging him with pdp-11 stuff .P
<nikolar> :P
<gorgonical> zid too even if sometimes he is grumpy about it
<gorgonical> maybe he just reads as grumpy to an american and it's just the britishness
guideX has quit [Quit: Leaving]
<heat_> luve u bby
<nikolar> yeah zid can be helpful if he doesn't consider you an idiot :P
<nikolar> oi heat
<gorgonical> thanks much heat_
<gorgonical> in fact so many of you are generous with your time and knowledge
<nikolar> or just generous with annoyance :P
<gorgonical> to a determined beginner that's the same thing lol
<zid> yea I am not grumpy
<zid> Like how your saccarine message earlier made me think you were suicidal, due to your americanness :P
<zid> saccharine*
<gorgonical> im glad nikolar mentioned it because it does indeed look bad
<nikolar> zid: it also made me think he's suicidal lol
<zid> You have a fulfilling life as a walmart greeter if all else fails, gorgonical
<gorgonical> a true nightmare
<nikolar> indeed
<zid> Average british person would try to break them, if they tried that here
<zid> "Why are you so happy? Your haircut is a travesty."
<nikolar> kek
<heat_> is gnu.org down for everyone?
<gorgonical> best shopping experience i ever had was in glasgow because nobody said a word to me any of the times I went to lidl. It was great
<nikolar> heat_: seems to be for me
<gorgonical> heat_: for me also
<bslsk05> ​imgur.com: Irish Twitter at 90 today - Album on Imgur
<nikolar> gorgonical: that's a typical shopping experience here too
<heat_> sad. i wanted to grok $$
<zid> 10/10 collection of tweets, btw
<nikolar> why do you need gnu.org heat_
<heat_> <heat_> sad. i wanted to grok $$
<gorgonical> holy shit, "con kearney"
<heat_> i *think* i need $$ in a makefile
<nikolar> isn't $$ just an escaped $
<heat_> no, there are some weird semantics when it comes to the read-in phase or whatever they call it i think
<nikolar> oh weird
<zid> The taxi one is great
<nikolar> all i've ever used it for is as an escaped $
<gorgonical> getting dressed down for your outfit by the cabbie is brutal
<zid> "Go inside and change I'll turn off the meter" is horrendously savage
<bslsk05> ​<fsfstatus> Our infrastructure at Core Site is unreachable at the moment. Major outage including fsf.org, gnu.org, and most services.
<zid> nikolar: Can you timewarp me 30 mins
<heat_> oopsie
<nikolar> no unfortunately
<zid> fucker
<nikolar> haven't yet figured out how to do that
<nikolar> actually i have an idea
<zid> oh I thought you just didn't wanna spend the time crystals on me
<zid> and you were being a stingy git
<nikolar> set an alarm for 30min in the future
<nikolar> and go sleep
<zid> I'd wake up so fucking tired
<zid> and I'd never pull it off anyway
<nikolar> lel
<heat_> has anyone in the history of the world fully understood how to write gnu make makefiles?
<heat_> fuck me, this is gibberish
<nikolar> lel
<heat_> "The target should be rebuilt also when the command line has changed since the last invocation. This is not supported by Make itself, so Kbuild achieves this by a kind of meta-programming." of fucking course
<nikolar> where are you reading that from
<nikolar> because i am 100% sure that make doesn't track the command line
<zid> This is not supported by make.
<heat_> kbuild
<zid> Kbuild implements it because make doesn't
<zid> nikolar did you learn to read from an american shampoo bottle
<heat_> i may or may not be trying to finally transplant kconf into onyx and struggling mightily with the spaghetti they call "kbuild"
<nikolar> zid: maybe
<nikolar> you'll never know
<heat_> ok i'll use gnu info for once
<nikolar> do you know how to use gnu info
guideX has joined #osdev
goliath has quit [Quit: SIGSEGV]
<heat_> it's not that hard
<guideX> os debugging is hard, the breakpoints don't work
<guideX> and there's no 'exception' so the error occurs somewhere else, not within the try catch
<guideX> best case: you halt the system, and show the error
<guideX> or maybe show a window with the error, if you can rescue the situation first I guess
<guideX> I can't wait until I can send emails, then I will send myself emails as a way of debugging maybe
frkazoid333 has quit [Read error: Connection reset by peer]
X-Scale has quit [Ping timeout: 256 seconds]
<azonenberg> guideX: lol, embedded and kernel debugging can be quite interesting
<azonenberg> i'm working on a platform now that randomly halves network throughput
<azonenberg> based on - my best guess - alignment of some critical code in flash
<heat_> have you tried the compiler alignment options?
<azonenberg> heat_: so that's the thing, i dont know what needs to be aligned how
<heat_> try aligning everything :)
<azonenberg> i have several binaries that are slow and some fast, based on apparently trivial changes like adding a print statement to the boot process
<azonenberg> i'm looking at all of the functions related to networking
<azonenberg> i suspect it's a specific inner loop - not even a function - that has to be aligned to L1D$ lines
<azonenberg> sorry I$
<azonenberg> the data layout doesnt change between fast and slow binaries
<azonenberg> i've tried dumping symbol tables and diffing them to see if anything jumps out
<azonenberg> so far i can tell they're different, but multiple fast images are also different
<azonenberg> and i havent found any commonality between all the fast images that distinguishes them from all the slow ones
<gorgonical> if this is embedded work all sorts of fun things could be wrong. could be i-cache pollution if the text sections are misaligned between versions, too
netbsduser has quit [Ping timeout: 264 seconds]
<azonenberg> gorgonical: yeah. I'm on a cortex-m7, single threaded event loop, 32 kB each I and D cache
<azonenberg> UDP iperf performance is fluctuating randomly between ~210 and ~528 Mbps as i make apparently unrelated changes to firmware
<azonenberg> it's consistently one of the two, so there's a binary good/bad state somewhere
<gorgonical> it's a long shot but maybe put it somewhere on the list to consult exactly how the i-cache is associative. Could be the text sections are being moved between compiler flag sets and the loop isn't fitting right into the i-cache
<azonenberg> which leads me to suspect it's related to either some loop missing in I$ or perhaps instructions in a loop are being misaligned and it fails to coalesce dual issues to TCM properly or something
<azonenberg> As of now code is all in flash (I'm not using ITCM), stack and ethernet frames in DTCM, and using axi sram for less speed critical stuff
<azonenberg> Experimenting with putting some of the code in ITCM is very much on the todo list, i just need to add some additional support code to my firmware to actually make that happen
<azonenberg> i.e. the linker script has to reserve ITCM space and my startup code has to memcpy the relevant routines from flash to ITCM
<azonenberg> The point is more that i want to understand the problem first, then fix it
<azonenberg> if i dont understand it, i risk doing what i did a couple days ago
<azonenberg> where i unrolled a loop and it perturbed the problem away
<azonenberg> i thought it was fixing it
<azonenberg> then a day later as i refactored some init code it came back
<azonenberg> So i want to identify the root cause and fix it permanently
<azonenberg> this is where i wish i had performance counters on the M7 so i could actually measure cache hit rates and such
<azonenberg> on intel and nvidia microarchitectures i make heavy use of platform instrumentation to determine how to optimize
<azonenberg> and on this arm i'm flying blind
<gorgonical> what a thought. I've only ever used the performance counters to provide debug functionality to a vm guest once
<mcrod> sockets are awful to fuck with
<mcrod> hi
<gorgonical> It never occurred to me until just now that you could totally sleuth that problem out with the perf counters
<nikolar> oi
<azonenberg> gorgonical: yeah thats what they're intended for
<azonenberg> arm -A processors have the PMU
<azonenberg> as does the cortex-m85
<azonenberg> the M7 does not as far as i can tell
<azonenberg> i use performance counters all over my hardware too
<gorgonical> i don't know what setup you have but if you can change cache parameters you might be able to get more info
<azonenberg> The overall system here is a stm32h735 paired with a xilinx spartan-7
<azonenberg> I have AXI on the MCU bridged through the parallel memory controller to APB on the FPGA
<azonenberg> there's a gigabit RGMII MAC on the FPGA with APB-mapped TX/RX FIFOs
<azonenberg> (throughput of the parallel memory controller is slow enough going to something faster like AHB/AXI would have bloated gate count with no performance gains)
<azonenberg> In the "good" state where the arm is running at the faster speed, i'm able to saturate this bus
<azonenberg> my FPGA-side perf counters show 75.02% of clock cycles actively moving data, 24.91% in the idle state with chip select deasserted between read/write bursts (minimum 2 clock idle period before you can start another burst)
<azonenberg> and only 0.06%, or 85K clock cycles out of 137.5 million per second, of clocks in which a bus transaction could be issued to the FPGA but was not
<azonenberg> while the APB bus is only about 50% loaded because it's 32 bits with parallel address and data, while the external memory bus is 16 bits with sequential address and data
xenos1984 has quit [Read error: Connection reset by peer]
<azonenberg> So you can't ever saturate the APB and thus it's not the bottleneck. As I expected
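The counter figures quoted above can be cross-checked with a few lines of arithmetic. This sketch only uses the numbers from the log (137.5 million cycles per second, the three percentages, and the 85K stall cycles); the dictionary keys are labels chosen here for readability:

```python
CLOCK_HZ = 137_500_000  # bus clock cycles per second, as quoted in the log

pct = {
    "moving_data": 75.02,
    "idle_between_bursts": 24.91,
    "could_issue_but_did_not": 0.06,
}

# Convert each percentage back into cycles per second.
cycles = {name: CLOCK_HZ * p / 100 for name, p in pct.items()}
accounted = sum(pct.values())

# The log also gives the stall bucket as an absolute count (about 85K cycles).
stall_pct = 85_000 / CLOCK_HZ * 100

for name, n in cycles.items():
    print(f"{name:24s} {n / 1e6:9.3f} M cycles/s")
print(f"accounted for: {accounted:.2f}%  stall bucket from 85K cycles: {stall_pct:.3f}%")
```

The three buckets add up to just under 100%, and 85K cycles out of 137.5 million is about 0.06%, so the quoted figures are self-consistent to within rounding.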
<azonenberg> But once i saw the bus was saturated, i knew i had optimized the arm side enough
<azonenberg> i called it mission accomplished then tweaked some unrelated code and it slowed down by 50% :P
<azonenberg> same fpga bitstream so i know the issue is on the arm side
<azonenberg> i'm >99% sure the problem is instruction related
<gorgonical> The only hint is that tweaking unrelated code surely changes the code and/or data layout and you have a cache alignment issue. Otherwise it sounds like you're just gonna have to slog through it
<azonenberg> because i dumped the memory map of good and bad binaries and the .bss, .tcmbss, .data,and .rodata layouts are identical
<azonenberg> .text changes (duh)
<gorgonical> You could, I guess, intentionally fuck with the data cache alignment in a known way
<azonenberg> but i have dumps of one slow and two fast binaries so far
<azonenberg> and i havent found what the two fast have in common, alignment wise, that is different from the slow
<gorgonical> Oh and the data is all the same. Well then yeah seems like a done deal that it's i-cache
<azonenberg> yeah i'm >99% sure it's instruction related
<azonenberg> but is it alignment to flash words or L1 lines
<azonenberg> and what code (or immediate data in .text)
<gorgonical> How big is a flash word here?
<azonenberg> That is what i'm stuck on right now
<gorgonical> Cause if it's large enough maybe you could do some compiler stuff and try to forcibly align sections to flash word sizes
<azonenberg> Flash supports AXI transactions of up to 64 bits in width but the physical array structure is i believe 256 bits
<azonenberg> because that's the ECC codeword and write block size
<gorgonical> oh, hmm I see how you mean
<guideX> I wonder how more complex os's like linux and windows can use devices it wasn't designed for like that
<guideX> I feel like all the devices I support are like the last devices I will ever support in my os xD
<gorgonical> by physical array structure you mean to suggest each flash cell needs a minimum of four transactions to read the whole content, right?
Matt|home has joined #osdev
<azonenberg> I mean that each physical flash word is 256 bits wide and you need a burst of four 64-bit axi beats to read it all
<azonenberg> There is a latency cost associated with doing this (the flash is clocked slower than the CPU)
<azonenberg> but i think it has readahead for linear code, i have to check the details
<azonenberg> but basically if you branch out of linear order and miss in L1 there's a penalty for reading a new flash line
<gorgonical> readahead still induces a latency for cross-word reads though right?
<azonenberg> I think so. These are the microarchitectural details i'm still reading up on
<Matt|home> hi..
<azonenberg> The CM7 D-cache is 4 way set associative
<azonenberg> instruction is 2 way set associative
<azonenberg> lines are 32 bytes (same as a flash word) for both
<azonenberg> only 2 ways of associativity is interesting. so perhaps what i have is several pieces of hot code competing for the same cache lines and thrashing
<azonenberg> and the alternate build shifts one relative to the other
<azonenberg> but again this is where i really want performance counters so i know if i'm having L1 misses or flash bottlenecks or both...
<azonenberg> Moving some of the hotter functions to ITCM is definitely an easy thing to test but knowing whether i've fixed the problem for good is another question
<azonenberg> especially since the act of moving code to ITCM is going to perturb the layout of .text
<azonenberg> lol
<gorgonical> yeah I was just reminding myself of how 2-way associativity impacts this
<gorgonical> I might suggest in response to your anxiety about solving the problem for good that if that solves it, it's not a problem you can even solve "for good"
<gorgonical> If you have 2-way set associative then the best you can do is fit two hot blocks in each set, and however many sets you have in the cache. Anything else is a compromise
<gorgonical> Some thrashing will be inevitable
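The thrashing hypothesis can be illustrated with a toy model. The sketch below simulates a 32 kB, 2-way cache with 32-byte lines and assumes LRU replacement and low-order index bits; both are assumptions made for the illustration (the real Cortex-M7 replacement policy and indexing may differ), and the addresses are invented:

```python
from collections import OrderedDict

LINE = 32                              # bytes per cache line
WAYS = 2
SETS = 32 * 1024 // (LINE * WAYS)      # 512 sets


def miss_rate(addresses, passes=100):
    """Replay a list of fetch addresses `passes` times through a toy
    2-way LRU cache and return the fraction of fetches that miss."""
    sets = [OrderedDict() for _ in range(SETS)]
    misses = total = 0
    for _ in range(passes):
        for addr in addresses:
            line = addr // LINE
            ways = sets[line % SETS]
            total += 1
            if line in ways:
                ways.move_to_end(line)        # mark as most recently used
            else:
                misses += 1
                if len(ways) >= WAYS:
                    ways.popitem(last=False)  # evict least recently used
                ways[line] = True
    return misses / total


# Three hot lines 16 kB apart all land in the same set: every fetch misses.
colliding = [0x08000000, 0x08004000, 0x08008000]
# Shifting one of them by a single line moves it to the next set.
shifted = [0x08000000, 0x08004000, 0x08008020]

print(miss_rate(colliding), miss_rate(shifted))
```

In this model a 32-byte shift of one block is the difference between missing on every fetch and missing only on the first pass, which is the kind of two-state behaviour described in the log. It does not show that this is what the real hardware is doing.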
<azonenberg> yeah but the cache is fairly large, 32 kB / 2 ways is 16 kB per way
xenos1984 has joined #osdev
<azonenberg> so if this is the case i just need to constrain hot stuff from a single code path to be at different addresses mod 16 kB
<azonenberg> it's also something i can understand and e.g. write pass/fail scripts for
<gorgonical> yes that's what I meant
<azonenberg> What i dont want is an unexplained performance issue
<zid> assuming the tag bits are there
<zid> what if the tag bit is on.. bit 0
<zid> and alternating addresses have opposite association
<azonenberg> zid: i'm assuming tags are on lower bits but we'll find out
<azonenberg> This is something i can dig into in more detail
<gorgonical> I used bad terminology there because I don't think about caches a lot. If it's 2-way then you have two cache lines per block, but how many blocks in the whole cache?
<azonenberg> gorgonical: 32 kB total size, 2 way, 256 byte lines
<azonenberg> sorry 256 bit
<azonenberg> so 32 kB / 32 bytes = 1024 lines
<azonenberg> so 512 unique addresses in the cache, two lines per address
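A pass/fail check along the lines azonenberg mentions could be built on this geometry. The sketch below flags any cache set that more than two distinct lines of "hot" code map to, assuming the set index is taken from the address bits just above the 32-byte line offset (the assumption made in the log, not a confirmed detail). The function names and addresses are hypothetical:

```python
from collections import defaultdict

CACHE_BYTES = 32 * 1024
WAYS = 2
LINE_BYTES = 32
SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # 512 sets
WAY_BYTES = CACHE_BYTES // WAYS             # 16 kB: addresses this far apart share a set


def oversubscribed_sets(hot_ranges):
    """hot_ranges maps name -> (start_address, size_in_bytes).

    Returns {set_index: [names]} for every set that more than WAYS distinct
    lines of hot code compete for.
    """
    lines_per_set = defaultdict(set)
    owners = defaultdict(set)
    for name, (start, size) in hot_ranges.items():
        first = start // LINE_BYTES
        last = (start + size - 1) // LINE_BYTES
        for line in range(first, last + 1):
            lines_per_set[line % SETS].add(line)
            owners[line % SETS].add(name)
    return {s: sorted(owners[s])
            for s, lines in lines_per_set.items() if len(lines) > WAYS}


# Hypothetical layout: three 64-byte hot loops spaced exactly 16 kB apart.
bad_layout = {
    "rx_loop": (0x08001000, 64),
    "tx_loop": (0x08005000, 64),
    "checksum": (0x08009000, 64),
}
# Same code with one loop moved down by 64 bytes.
good_layout = dict(bad_layout, checksum=(0x08009040, 64))

print(oversubscribed_sets(bad_layout))
print(oversubscribed_sets(good_layout))
```

Feeding this with addresses and sizes taken from the linker map would turn "hot code must differ mod 16 kB" into something a build script can assert on, provided the list of hot functions is known.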
<gorgonical> I should hope 512 slots for i-cache is sufficient or you have one hell of a large tight loop
<zid> (I assume in real chips, the tag bits are not on the low bits, because things tend to end up being aligned on the low bits already for the performance of OTHER features)
gog has joined #osdev
<clever> azonenberg: oh, something else from the rp2040, you can tag certain functions to live in ram rather than flash
<clever> it tells the linker to assign a ram addr, and adds it to the special .data init stuff, which copies from flash->ram on boot
dostoyevsky has quit [Quit: leaving]
<clever> on the rp2040, ram is far faster than flash, so you can increase perf with that
dostoyevsky has joined #osdev
netbsduser has joined #osdev
X-Scale has joined #osdev
Left_Turn has quit [Read error: Connection reset by peer]
<heat_> it's prudent to shadow the whole thing in RAM if you can
\Test_User has quit [Remote host closed the connection]
Stellacy has joined #osdev
\Test_User has joined #osdev
<clever> heat_: yep, pico has an extra linker mode called copytoram, that just shadows all of the flash
X-Scale has quit [Ping timeout: 256 seconds]
<azonenberg> clever: you can do that on stm32h7 too, it has 64 kB of hard wired ITCM (instruction tightly coupled memory, dedicated SRAM bus for instructions that doesn't compete with the other memory buses)
<azonenberg> plus an additional three 64 kB blocks that you can allocate to ITCM or general system RAM as you see fit
<azonenberg> i.e. you can have 64-256 kB of ITCM trading off that against data storage memory
<azonenberg> plus a fixed 128 kB of DTCM for data
<azonenberg> and the rest of the ram lives on the main system AXI bus
<azonenberg> I just havent set up the necessary directives in my linker script to make any code live in ITCM so right now that memory is unused
<clever> azonenberg: ive got the same problem on the RP1 as well
<clever> the RP1 has a dual-core cortex-M3, which has 3 bus ports leaving each core
<clever> tightly coupled iram, tightly coupled dram, and the system bus
<clever> 8kb of iram, 8kb of dram, and 64kb of shared sram (visible to dma, M3's, and the pcie master)