ChanServ changed the topic of #linux-rockchip to: Rockchip development discussion
vagrantc has quit [Quit: leaving]
kevery has joined #linux-rockchip
stikonas has quit [Remote host closed the connection]
archetyp has quit [Quit: Leaving]
alpernebbi has quit [Ping timeout: 252 seconds]
alpernebbi has joined #linux-rockchip
kevery1 has joined #linux-rockchip
kevery has quit [Ping timeout: 265 seconds]
kevery1 is now known as kevery
kevery1 has joined #linux-rockchip
kevery has quit [Ping timeout: 252 seconds]
kevery1 is now known as kevery
lurchi_ has joined #linux-rockchip
lurchi__ has quit [Ping timeout: 268 seconds]
<mriesch> mmind00: do you have minute?
kevery1 has joined #linux-rockchip
kevery has quit [Ping timeout: 260 seconds]
kevery1 is now known as kevery
<mmind00> mriesch: sure ... maybe even 2 minutes ;-)
<mriesch> mmind00: morning :-)
<mriesch> surely you heard about the issue with the stmmac on the rk3399 boards?
<mmind00> I remember it, yeah ... wasn't that supposed to get reverted?
<mriesch> i don't know what the plan is exactly and that is exactly what i wanted to ask you
<mriesch> from my point of view: the stmmac core handles pm and we can remove it from the rk glue code. historically, the rk glue code was first to handle pm and so i reverted this handling.
<mriesch> and i think in the long run we would like to have the stmmac core to handle it. unfortunately, it breaks things on the rk3399 for reasons (yet) unknown, which i naively attribute to a different issue that was simply masked
<mriesch> so i think we should investigate whether there is an underlying issue and not simply revert the revert. but what is your opinion here? or what is the usual procedure here?
<mmind00> I haven't looked at the dwmac for quite a while .... I spend yesterday fixing another 5.15-rc issue that made parts of my boardfarm unbootable: https://lore.kernel.org/all/20210913224926.1260726-1-heiko@sntech.de/ ;-)
<mmind00> I guess the normal way of operation is that if you don't have an immediate fix, to revert to the last "working" state and then do a proper new thingy
<mriesch> yeah i saw that. i bet no one saw that one coming either :-)
<mriesch> ok, as to the proper new thingy: i don't have any rk3399 hardware or time in the next week to investigate this further. realistically, can/will anyone else take a look at it?
<mps> mriesch: mmind00: my rk3399 gru-kevin chromebook boots fine with linux-5.15-rc1
<CounterPillow> mps: yes but does ethernet work
<mps> CounterPillow: it doesn't has ethernet, just wifi, so I don't know
<CounterPillow> then that's not really relating to the issue at hand. the problem is that ethernet is broken on rk3399
<mps> ah, ok and sorry. looks like I misread above messages
<CounterPillow> the obvious move is to revert the revert because it introduces a regression, any followup investigation can then be made when someone has the time and ability to do so. Technically, I have an rk3399 board and could look into it, but practically I don't even know where to start
<mriesch> CounterPillow: RockPro64?
<CounterPillow> mriesch: yep
<punit> I spent some time staring at the call traces with and without the commit from mriesch but nothing obvious jumped out
<mriesch> CounterPillow: i agree that this is the obvious move, but i am afraid the followup investigation is not going to happen. this is why i am making some racket here :-)
<punit> I thought I had narrowed it down to the clocks as the pm_sync leads to clock resume calls but that didn't do anything. The clocks it seems are already enabled
<CounterPillow> punit: I looked at the code a few days ago and checked for that as well, and yes, the probe function does enable the clocks
<punit> At which point I thought that it'll take more than the time I am able scrounge on evenings and weekends
<CounterPillow> mriesch: that is not good reasoning for introducing a regression
<mriesch> CounterPillow: punit: have you tried moving the pm_runtime_{enable,get_sync} around? in the end this is the difference the revert introduces
<CounterPillow> I haven't tested the code at all; I just read it to start thinking about where I would poke it
* mmind00 has at least found a board I have where it breaks (rk3399-puma)
<mriesch> CounterPillow: well don't blame me, i didn't expect to get this applied to 5.14 (i could stop the process to get it applied to 5.13-stable). this was intened to enter 5.15-rc1, then we would have some weeks to test it, look for a proper solution, or revert it if there is no proper solution available
<punit> I haven't moved the calls - so far I was looking to see what they actually do
<punit> Yeah, I didn't expect it to be applied so late in the cycle
<punit> Especially for something which is not a breakage
<mriesch> mmind00: punit: CounterPillow: i have to rush to a meeting, i'll catch you later. thanks for the chat!
<CounterPillow> mriesch: sorry if that came across the wrong way. I don't blame you for breaking it, I am in fact glad that someone is looking into rockchip related pm stuff. I do however think that keeping a regression instead of immediately reverting it is not a good motivator, it's just an annoyance to users.
<mriesch> CounterPillow: no worries. just wanted to say that this didn't go as intended, but now the situation is as it is and we have to find a reasonable way out.
<mriesch> anyway, maybe i can provide some starting points -- if you guys are willing to invest some time into testing/acquiring more information maybe we can work something out
<mriesch> ok, so before the calls to pm_runtime_enable and pm_runtime_get_sync happened in rk_gmac_powerup, which was called by rk_gmac_probe
<mriesch> now they are called (effectively) in stmmac_dvr_probe at a later point. so i would perform some kind of bisecting.
<mriesch> i.e., simply inserting the two calls at different points in stmmac_dvr_probe in order to identify at which point the mac must be powered on
stikonas has joined #linux-rockchip
<punit> It looks like the pm_*() calls in stmmac_dvr_probe() are not equivalent to what's happening in rk_gmac_powerup()
<punit> But if I stick a pm_runtime_get_sync() right after the pm_*() calls in stmmac, things work as expected
xalius has joined #linux-rockchip
xalius has joined #linux-rockchip
xalius has quit [Changing host]
xalius has quit [Client Quit]
<mriesch> punit: so basically it is get_sync() vs get_noresume() at this point?
<mriesch> do you see an error or something peculiar in the dmesg btw? e.g., MDIO bus registration failed?
matthias_bgg has joined #linux-rockchip
<punit> I didn't check the logs this time but in the previous instances there is no difference in logs between success / failure
<punit> Another hint, if I disable pm_runtime_put() at the end of stmmac_dvr_probe() everything works
<punit> It's almost like something that depends on the clocks is failing to take a reference and so they get turned off
warpme_ has joined #linux-rockchip
<mriesch> punit: huh ok? interesting...
<punit> mriesch: Also, get_sync() and get_noresume() are doing very different things. I am not sure if the get_sync() vs get_noresume() is the only difference
<CounterPillow> if running the resume handler fixes it then it probably means we're not enabling some clock or something in probe that we should be enabling there, but are instead relying on the resume callback to do this.
<punit> Right - but the resume handler does get called in stmmac_open()
archetyp has joined #linux-rockchip
kevery1 has joined #linux-rockchip
kevery has quit [Ping timeout: 268 seconds]
kevery1 is now known as kevery
<mriesch> do you get any "cannot get clock xyz" warnings?
<punit> Not seeing any of those
<punit> dwmac-rk.c has quite bad error handling - it seems to continue even after dev_err() instead of returning with failure
<punit> Another thing that worked - I replaced the pm_rumtime_*() calls in rk_gmac_powerup() with pm_runtime_get_noresume() and voila thing work. It goes on to prove that something is failing to take a reference without which the clocks turn off at the end of stmmac_dvr_probe().
kevery has quit [Quit: kevery]
<mriesch> do you think it has to be clocks or could the phy be suspended at the wrong time as well?
kevery has joined #linux-rockchip
<mriesch> just found that phy-handle is missing from rk3399-rockpro.dtsi
<punit> The phy-handle is not mandatory aiui. There are other paths for the probe to work
<punit> Regarding the phy, it gets detected without the phy-handle -
<punit> rk_gmac-dwmac fe300000.ethernet eth0: PHY [stmmac-0:00] driver [Generic PHY] (irq=POLL)
<mriesch> punit: yes, but (just theory here) if the handle is not specified, plat->phy_node might be NULL and the phy_node clock is not considered?!
<mriesch> cf. rk_gmac_clk_init, last part
<punit> How would your patch affect this?
<mriesch> well thanks to your tests we know that the initial pm_runtime_get_sync() solves our troubles. but my patch takes this away and reveals that something is not powered that should be powered, right?
<mriesch> then the question remains what this something is exactly
<punit> Hmm... if that is the case, then dropping the pm_runtime_put() at the end of stmmac_dvr_probe() wouldn't fix the issue, no?
<mriesch> punit: yes, i think that you tested dropping the _put and things worked
<mriesch> or did i misunderstand something there?
<mriesch> it might be a stab in the dark but can you try to put a phy-handle property in your dts and see whether it changes something?
<punit> Nope, that was it
<mriesch> (alternatively, you could comment out the if (plat->phy_node && bsp_priv->integrated_phy) at the end of rk_gmac_clk_init
<punit> That one is easier - a device tree change is a bit more involved in my setup
<punit> Let me give that a try before I step out
<punit> Though looking at the code, phy_node is not populated so I am not sure that's going to work
<mriesch> punit: you are right, if anything we need the clock
<punit> I will try the device tree change later and report if that helped
<punit> Gotta run for now!
<mriesch> punit: ok great! see you!
<punit> One thing to mull about, should the pm ops in dwmac-rk.c have been updated by your patch? It's part of the platform_driver structure at the end of the file
<mriesch> good question.
<mriesch> but i don't think so, as the rk_gmac_power{up,down} methods are called in the rk-specific handlers
<mriesch> which in turn call gmac_clk_enable -> here the phy clock would be {en,dis}abled if provided
kevery has quit [Quit: kevery]
alpernebbi has quit [Ping timeout: 268 seconds]
alpernebbi has joined #linux-rockchip
stikonas has quit [Remote host closed the connection]
stikonas has joined #linux-rockchip
stikonas has quit [Remote host closed the connection]
stikonas has joined #linux-rockchip
indy has quit [Ping timeout: 245 seconds]
indy has joined #linux-rockchip
<mriesch> punit: in non-functional case they leds on the rj45 jack are out, right?
<mriesch> but the voltage vcc3v3_s3 (or vcc_lan) are on, right?
<punit> vcc3v3_s3 seems to be on.
<punit> [ 1.162118] vcc3v3_s3: supplied by vcc3v3_sys
<punit> The board is remote - let me see if I can find out the state of the LEDs
<punit> mriesch: No change in leds between the working and non-working boot
<punit> It's orange in both the instance
<mriesch> punit: thanks for checking
matthias_bgg has quit [Read error: Connection reset by peer]
<mriesch> hm. so in the end the only thing we know is that we need to add an extra pm_runtime_get or to remove a pm_runtime_put. might there be another unmatched put/get pair?
Rathann has joined #linux-rockchip
<punit> I had a quick go at adding a phy-handle - since I don't know the actual thing I copied the nodes from orange-pi / nano-pi
<punit> Without looking too deeply, I get -
<punit> [ +0.015340] rk_gmac-dwmac fe300000.ethernet: Cannot register the MDIO bus
<punit> [ +0.003129] rk_gmac-dwmac fe300000.ethernet: stmmac_dvr_probe: MDIO bus (id: 0) registration failed
<punit> [ +0.003643] rk_gmac-dwmac: probe of fe300000.ethernet failed with error -16
<punit> ... and eth0 fails to probe.
<punit> Obviously, the phy-handle node has issues but I don't know what it should be.
<punit> I am leaning towards a missing reference on the clocks... will look into this
<mriesch> punit: ok, thanks. quite frankly i am not sure what the device tree should look like. the more recent socs seem to feature mdio nodes, maybe that's missing in your case
matthias_bgg has joined #linux-rockchip
<mriesch> punit: btw good point about pm_ops, just checked some of the other glue drivers and they seem to do things differently
<mriesch> not sure though whether it makes any difference, maybe just a matter of using the convenience helpers or not
<mriesch> punit: are you up for yet another crazy theory?
<mriesch> as you mentioned missing clocks, i had a look at gmac_clk_enable. in this function the handling of bsp_priv->clk_mac is commented out for reasons unknown (at least to me). could you uncomment them and try again?
<mriesch> this clock is not touched by other glue drivers, but the dwmac-rk does not use (yet) many of the helpers
wens has joined #linux-rockchip
<Esmil> mriesch: ethernet on my rk3399-rock-pi-4 has stopped working somewhere between 5.13.1 and 5.14.3. it detects link just fine, but no packages seems to be sent or received. is this the issue you're talking about?
<mriesch> Esmil: most probably, yes
<Esmil> cool, what do i need to revert to fix it?
<mriesch> try reverting 2d26f6e39afb88d32b8f39e76a51b542c3c51674 net: stmmac: dwmac-rk: fix unbalanced pm_runtime_enable warnings
<Esmil> building..
<mriesch> Esmil: care to test a hack afterwards?
<Esmil> mriesch: great succoss \o/
<Esmil> Sure
<mriesch> Esmil: good. and great! https://pastebin.com/dtmf9XGf
<mriesch> in theory with this patch applied the revert of 2d26f6e3 should not be necessary
<Esmil> hmm.. no. doesn't seem so
<Esmil> i only applied that paste without the revert
<mriesch> Esmil: well, would have been too good to be true. thanks for testing!
<Esmil> mriesch: let me know if you want to give it another shot. i have remote access to the serialport
<Esmil> ..so it's no problem to test
<mriesch> Esmil: thanks for the offer. i think i'll call it a day now.
<Esmil> sure np. thanks for the quick fix
<mriesch> i guess the commit will be reverted in 5.14 to provide a stable kernel.
<Esmil> i think it should be, yes
<mriesch> for 5.15 there is still the possibility to play around a bit, testing will be much appreciated then. i think this dwmac-rk glue code needs some cleanup anyway
<Esmil> i haven't tried 5.15-rc1. is broken there too?
<mriesch> anything else would surprise me. the patch was accepted quite late (5.14-rc7ish), so nothing really happened since
<mriesch> well, gotta run. evening, y'all
<Esmil> mriesch: o/
chewitt has joined #linux-rockchip
matthias_bgg has quit [Ping timeout: 268 seconds]
Rathann has quit [Remote host closed the connection]
superherointj has joined #linux-rockchip
archetyp has quit [Quit: Leaving]
archetyp has joined #linux-rockchip
archetyp has quit [Remote host closed the connection]
archetyp has joined #linux-rockchip
archetyp has quit [Remote host closed the connection]
warpme_ has quit [Quit: Connection closed for inactivity]
archetyp has joined #linux-rockchip
archetyp has quit [Remote host closed the connection]
archetyp has joined #linux-rockchip
<punit> I can confirm that ethernet on rk3399 is broken in v5.15-rc1
archetyp has quit [Remote host closed the connection]