ChanServ changed the topic of #openvswitch to: Open vSwitch, a Linux Foundation Collaborative Project || FAQ: http://docs.openvswitch.org/en/latest/faq/ || OVN meeting Thurs 9:15 am US Pacific || Use ovs-discuss@openvswitch.org for questions if you don't get an answer here. || Channel logs can be found at https://libera.irclog.whitequark.org/openvswitch
racosta_ has quit [Ping timeout: 240 seconds]
racosta_ has joined #openvswitch
<imaximets> racosta_, regarding the dynamic limit question: your limit has been reduced to 12K from the default 200K, which means the revalidators are not able to revalidate all the datapath flows. So either the revalidation process is too expensive or the revalidators do not get enough CPU time for some reason. Expensive revalidation can be caused by having way too many OpenFlow rules, e.g. by having ACLs with negative matches.
<racosta_> Hey imaximets, thanks for your response. Sure, I have a lot of OpenFlow rules, and the host is under a high CPU / RX IRQ load on the interfaces due to the high volume of flows on the datapath.
<racosta_> My question would be: could the dynamic calculation be more flexible, or even be disabled so that the value configured in other-config:flow-limit is used?
<racosta_> I've already increased max-revalidator to 10000 ms and the environment still hits the dynamic flow-limit.
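(For reference, the two knobs discussed here are other_config keys on the Open_vSwitch table; a minimal sketch, where 200000 is simply the upstream default for flow-limit and the exact semantics are described in ovs-vswitchd.conf.db(5):)
    # Static upper bound on datapath flows; the dynamic limit computed by
    # the revalidators can only shrink below this value, never exceed it.
    ovs-vsctl set Open_vSwitch . other_config:flow-limit=200000
    # Revalidator timing knob raised to 10 s in this discussion; see
    # ovs-vswitchd.conf.db(5) for the exact semantics.
    ovs-vsctl set Open_vSwitch . other_config:max-revalidator=10000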
<imaximets> If you're hitting the limit with just 12K datapath flows, increasing the limit is not a good solution. It means revalidation of 12K flows takes more than 2 seconds, which is bad.
<imaximets> Need to figure out why it takes so long and remove the cause.
<racosta_> yeah, it's really bad! I am evaluating disabling the dynamic calculation and using a static flow-limit value for testing.
<racosta_> I mean, put the maximum equal to the default 200k statically.
<racosta_> Do you see problems with this approach?
<racosta_> and let the revalidators take as long as they need, without lowering the flow-limit and dropping flows from the datapath because of it.
<imaximets> Your revalidators will be at 100% CPU at all times and datapath flows may not be updated for a long time.
<imaximets> How many OpenFlow rules do you have?
<racosta_> 1241303
<racosta_> ~1.2M
<imaximets> That's a lot. Are they distributed between tables or mostly concentrated in the same table?
<racosta_> At the moment the revalidators are already using a high % of CPU, but not 100%
<racosta_> Distributed across different tables
<racosta_> The deployment is an OpenStack environment, so it's using OVN and Neutron.
<racosta_> The 'datapath flows may not be updated for a long time' that you mentioned is bounded by the max-revalidator timeout, right? I mean, if I set max-revalidator to 5 or 10 seconds, that would be the maximum time a flow could sit in the datapath without being updated in a high-load case.
<imaximets> racosta_, it's true only if revalidators can actually revalidate flows within those 10 seconds. And since revalidation of 12K on your system takes more than 2 seconds, revalidation of 200K will take more than 40 seconds, I guess...
<racosta_> I know, but I don't have more than 20k flows in the datapath at the moment.
<racosta_> The flows are created and removed all the time, averaging around 14k. I see the upcall dump and sweep phases changing all the time.
<imaximets> They are getting removed because of the limit. Once you raise the limit the number of flows staying in the datapath will increase.
<imaximets> * likely
<racosta_> The dump phase is taking more than 1 second: https://paste.openstack.org/show/bBPf0uw3HG1Xc9Vd9RvG/
mmichelson has joined #openvswitch
<imaximets> OK, 62039 / 14796 * 1.1 = 4.6 seconds for the maximum seen number of flows. So, not that bad.
<imaximets> But it's still unclear why it takes 13ms per flow to revalidate. It may be just the amount of OF rules, but it's hard to tell.
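(The dump duration and flow counts being discussed come from the upcall statistics; a quick way to sample them on the affected node, assuming a standard ovs-vswitchd setup:)
    # Shows per-revalidator state: current/avg/max datapath flows, the
    # current dynamic flow limit, and the duration of the last dump phase.
    ovs-appctl upcall/show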
<mmichelson> Hi everyone, it's time to begin the upstream OVN development meeting. Looks like there's already some conversation happening in here.
<mmichelson> I can give a quick update.
<mmichelson> I put up a series about inclusive language in OVN. It's fairly simple, so if you have time to give it a review, please do.
zhouhan has joined #openvswitch
<mmichelson> I'm planning to make point releases of supported OVN versions today. I'll put the commits up shortly. If I have to wait until tomorrow to release them, that's fine.
<mmichelson> Generally, the next couple of weeks are going to be spotty for me because I'm moving house. I won't be in this meeting the next two weeks.
<mmichelson> That's all from me. Who would like to go next?
<_lore_> can I go next?
<mmichelson> _lore_, go right ahead
<_lore_> this week I worked on a ct problem related to asymmetric traffic
<_lore_> we have a use case where reply traffic enters a different lsp in ovn with respect to the original one
<_lore_> I was thinking about whether we could have a case where we assign a datapath-wide ct_zone id to reg13[0..15]
<_lore_> that we can use for this kind of use case
<_lore_> what do you think?
<mmichelson> So the idea is that the same ct zone can be used by multiple LSPs on a hypervisor?
<_lore_> on all the LSPs of the same logical switch
<zhouhan> what if the LSPs are on different HVs?
<_lore_> that will overwrite reg13 (on demand)
<_lore_> I think we need to get the id from northd
<_lore_> similar to snat-zone-id maybe?
<mmichelson> What are the risks of using the same CT zone for the LSPs?
<zhouhan> _lore_: "we have a use case where reply traffic is entering a different lsp" --> I am curious abou the use case? Could you share a link/description of the details?
<_lore_> sure
<imaximets> One risk is a shared zone limit.
<_lore_> imaximets: I think that use case will be limited in the number of ports
<_lore_> zhouhan: ovn is implementing just a logical switch
<_lore_> with lsp0, lsp1 and lsp2
<imaximets> what do you mean by limited in the number of ports?
<_lore_> lsp1 and lsp2 are connected to external entities that implement routing
<_lore_> so let's say the pod connected to lsp0 pings an external entity, and the request is exiting the cluster through lsp0
<_lore_> sorry I mean lsp1
<_lore_> the reply from the external routing enters lsp2, so in this case (if we have a per-port ct-zone) it will be marked as ct.inv
<_lore_> (assuming we have some rules to send the packet to connection tracking)
<_lore_> since we commit to a different zone
<_lore_> zhouhan: are we on the same page?
<zhouhan> but why would the reply come from a different lsp?
<_lore_> this is the user requirement
<zhouhan> is it a misconfiguration somewhere?
<_lore_> no, it is on purpose
<_lore_> imaximets: I do not think the system will be huge, so not a lot of connections
<zhouhan> ok, just curious about the motivation of the requirement. Maybe you could share offline
<_lore_> ack, I will do
<_lore_> I was wondering why the ct_zone is done per-port and not per-datapath
<_lore_> imaximets: zhouhan ...^
<_lore_> for scalability reasons?
<imaximets> _lore_, the size of the system doesn't matter. A single pod can create a million connections. Then the second pod will have DoS because we can not create more conntrack entries.
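(Side note on the per-port zones being discussed: on the OVN versions I have looked at, ovn-controller persists its per-port zone allocations in the integration bridge's external-ids, so they can be inspected as below; treat the exact key names as an assumption and verify on your build:)
    # Lists external-ids on br-int; per-port CT zone allocations show up
    # as ct-zone-<logical port name>=<zone id> entries.
    ovs-vsctl --columns=external_ids list bridge br-int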
<_lore_> imaximets: I got the point. In fact I am not suggesting enabling it by default
<_lore_> but it would be useful to fix this kind of configuration, right?
<_lore_> I can't see any other way of fixing it
<_lore_> if you think it is ok I can work on a PoC
<imaximets> The requirement seems a little unreasonable. :)
<imaximets> But IDK
<zhouhan> _lore_: does the requirement place the LSPs on the same HV? Otherwise it doesn't help even if the ct_zone is made per-LS
<_lore_> a workaround could be to allow ct.inv traffic
<_lore_> zhouhan: why not?
<_lore_> if we get the value from northd I think it will work
<zhouhan> because request comes out from one HV1, and the response goes into another HV2, so the response will still be marked as invalid by CT on HV2
<_lore_> anyway I do not want to take all the mtg time
<mmichelson> conntrack state isn't distributed.
<_lore_> why ?
<_lore_> ct is committed even on egress pipeline, right?
<zhouhan> On HV2 you don't see the request, right?
<_lore_> what I mean is
<_lore_> let's say for both lsp1 and lsp2 reg13[0,15] = X
<_lore_> since we commit on both lsp1 and lsp2 I think the reply will not be marked as inv
<_lore_> or am I missing something?
<zhouhan> CT commit for the request sent from lsp1 is on HV1, but the response goes into CT zone on HV2. CT entries are not synced between HVs
<mmichelson> Sorry someone came to my door, so I had to get up.
<_lore_> ok, but when we are on the egress pipeline (on hv2) we commit on the request, right?
<zhouhan> but the problem is the CT on HV2 would return status invalid, before the OVN pipeline tries to commit
<_lore_> let's say icmp request. We execute the egress pipeline on hv2, and we do ct_commit {} there, right?
<_lore_> so the icmp reply will not be invalid, if all ports on the logical switch have the same ct_zone
<_lore_> but maybe I am wrong :)
<zhouhan> usually we do commit for "new" status and if it is allowed by ACL. But for "invalid" status we would drop it directly. Unless you want to change the pipeline
<_lore_> ok, but the point is the request is new, right?
<_lore_> in other words:
<zhouhan> the request is on HV1, right?
<_lore_> yes, but then it goes on hv2 for egress pipeline, right?
<_lore_> icmp echo request -> HV1 (ingress pipeline) ct_commit --> HV2 (egress pipeline) ct_commit --> exit
<zhouhan> the scenario I am discussing is when request is sent on HV1 to some external endpoint, and the response comes back from the external endpoint to HV2. Are you talking about something different?
<_lore_> no, the same
<zhouhan> why would the request go through HV2 (LSP2)?
<_lore_> because lsp1 is on hv1
<_lore_> *hv2
<zhouhan> lsp1 on HV1, lsp2 on HV2
<_lore_> lsp0 on HV1, lsp1 and lsp2 on HV2
<_lore_> icmp echo request entering lsp0 -> HV1 (ingress pipeline) ct_commit --> HV2 (egress pipeline) ct_commit --> exit through lsp1
<imaximets> What guarantees that lsp1 and lsp2 are on the same node?
<_lore_> I can see your point
<_lore_> *now
<zhouhan> ok, sorry I was confused when I saw you typing "sorry I mean lsp1"
<_lore_> the issue is when lsp1 and lsp2 are on different HV
<zhouhan> yes, same question as imaximets
<_lore_> I think my solution works if lsp1 and lsp2 are on the same HV
<_lore_> but it does not if lsp0 and lsp1 are on hv1, while lsp2 is on hv2
<_lore_> right?
<mmichelson> Yes I think that was the scenario we were thinking of.
<zhouhan> right, I think we don't care much about lsp0 in your example, but lsp1 and lsp2 matter
<_lore_> yep
<zhouhan> probably we can have more detailed discussion in the ML regarding the requirement and then solution
<_lore_> I agree, I do not know if we can assume it or not
<_lore_> sure
<_lore_> thx a lot
<_lore_> btw zhouhan if you have any free cycles can you please take a look at my patch for BFD node IP in northd?
<zhouhan> _lore_ ok, let me check it
<mmichelson> _lore_ did you have anything else that you wanted to discuss?
<_lore_> nope, I am done, thx
<mmichelson> Thanks _lore_
<mmichelson> Who would like to go next?
<imaximets> May I?
<mmichelson> imaximets, go ahead
<imaximets> I don't really have much. I worked on a bug reported for OVS, where we overwrite ct tuple metadata in the kernel with zero bytes.
<imaximets> Posted a kernel fix for review today.
<imaximets> Not sure if it affects OVN, but it's kind of hard to track.
<imaximets> That's all I have.
<mmichelson> thanks imaximets . I also don't know if that has affected OVN.
<mmichelson> Who's next to give an update?
<zhouhan> I can go quickly
<zhouhan> I am waiting for a response from mmichelson and dumitru regarding backporting the address set I-P improvement requested by Ales
<mmichelson> zhouhan, ack, I can have a look this afternoon. Can you provide a link to the series?
<zhouhan> Other than this, I suspect a recent bug fix in OVN upstream is impacting HW offloading. I will dig more into it.
<zhouhan> that's it from me
<zhouhan> mmichelson: I am trying to find it
<mmichelson> zhouhan, ack
<mmichelson> Anybody else?
<mmichelson> zhouhan, thanks!
<mmichelson> And I guess nobody else wants to give an update. So that's all for today. Thanks
<mmichelson> Bye everyone
mmichelson has quit [Quit: Leaving]
<zhouhan> thanks, bye all
<imaximets> Thanks! Bye.
<_lore_> bye
<_lore_> thx
<imaximets> racosta_, back to our conversation, s/13ms/1ms/ per flow, but still a lot.
<racosta_> imaximets, yeah, I think it's because of the number of OpenFlow rules, because I have another ovs node with far fewer flows in the datapath and I see the high dump time and the dynamic calculation limiting flows there too (I still don't know exactly what makes the revalidators take so long; there are thousands of RX IRQs on the hosts, but the CPU usage is not very high at all).
<racosta_> What's killing me is the dynamic calculation updating the flow-limit value in 'small steps' and hitting the limit every time the revalidators run.
<imaximets> racosta_, is it the same issue that you were looking at a couple days ago (high jitter) or is it different?
<racosta_> No no, the other one is exclusively related to icmpv6.
<racosta_> The problem with revalidators we're talking about affects all traffic types
<imaximets> ack
<imaximets> racosta_, FWIW, I sent a kernel patch today for one issue specifically related to ICMPv6, but I'm having a hard time relating it to your ICMPv6 problem, so I'm not sure if they are related. But anyway: https://lore.kernel.org/netdev/20240509094228.1035477-1-i.maximets@ovn.org/T/#u
<racosta_> I see, thanks. I'll take a look to see if it's related.
<racosta_> We only started to notice this problem when the classifier was changed to apply a more comprehensive mask. -> https://github.com/openvswitch/ovs/commit/132fa24b656e1bc45b6ce8ee9ab0206fa6930f65
<imaximets> yeah, I saw that in the report. I'll try to get a deeper look tomorrow.
<imaximets> I have a guess that 'nd_target is redundant in the megaflow below' part is somehow related, but I need to check. The logic is tricky.
<racosta_> Sure, no rush! yeah, I think it may be related because the icmp flow doesn't match the previous one and is being re-added all the time (the counter stats are always 0)
<racosta_> So, back to the revalidator problem, do you know if there is any way to check where the thread is spending all this time?
<racosta_> Something I can watch in real time in a production environment
<racosta_> My point is still about the number of OpenFlow rules, because the dump phase takes more than 1 second, and I imagine that time is spent almost exclusively processing the more than 1.2M OpenFlow rules
<racosta_> Please correct me if I'm wrong.
<imaximets> racosta_, I agree that simply the total amount of OF rules is the main suspect at this point. I'm not sure what would be a good way to debug that though; the only thing I can think of is attaching 'perf' to the process and observing which functions consume most of the time, e.g. with 'perf top'.
<imaximets> racosta_, ^
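(A minimal profiling sketch for the above; it assumes perf is installed on the node and ovs-vswitchd runs as a single process:)
    # Live view of the functions ovs-vswitchd is burning CPU in;
    # revalidator threads show up under their own thread names.
    perf top -p $(pidof ovs-vswitchd)
    # Or record ~30 seconds with call graphs for offline analysis.
    perf record -g -p $(pidof ovs-vswitchd) -- sleep 30
    perf report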
<racosta_> ack
<racosta_> I tried increasing min-revalidate-pps to see if the dynamic calculation would improve, but there is no direct relationship, contrary to what I suspected.
<imaximets> yeah, the time check has a higher priority.
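(For reference, that knob is also an other_config key; the value 20 below is just an example, not a recommendation, and the precise behaviour is documented in ovs-vswitchd.conf.db(5):)
    # Packets-per-second threshold the revalidators use when deciding
    # whether a low-traffic flow is still worth keeping (default 5).
    ovs-vsctl set Open_vSwitch . other_config:min-revalidate-pps=20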
<racosta_> The only thing that I can simulate in the lab and that solves the problem is statically configuring the max-flows to a high value (not using the dynamic calculation).
<racosta_> But this requires a software upgrade and I'm not sure about the real side effects.
<racosta_> Are these values (2000, 1300 and 1000) empirical, or is there some math/science involved? https://github.com/openvswitch/ovs/blob/cbc54b2fe05440adbdb4a6980aa294924a555572/ofproto/ofproto-dpif-upcall.c#L1065
<imaximets> Mostly empirical.
<imaximets> a.k.a. "unreasonably long"
<racosta_> ack
<imaximets> Hitting these values means that the system is not doing great, i.e. the load should be reduced. In your case 14 cores are running at 100% for more than a second on each revalidation cycle. That's a lot of CPU time.
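(An easy way to see how busy the individual revalidator threads actually are, assuming a procps-style top; ovs-vswitchd names its threads, e.g. handlerNN and revalidatorNN:)
    # Per-thread CPU usage of ovs-vswitchd, updated live.
    top -H -p $(pidof ovs-vswitchd)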
<racosta_> I agree with you, but this host is a network node that centralizes all the ovs chassis traffic in the cloud. It's natural to have a lot of load and I have more than 50 colors to distribute the threads if necessary.
<racosta_> * cores
<racosta_> The CPU load isn't high at all: https://paste.openstack.org/show/bbiRbQYt4Zot1fHHxl7q/
zhouhan has quit [Quit: Client closed]
<racosta_> imaximets, ^
<imaximets> 9 cores used for OVS is quite high. Maybe not for the server overall, I agree. But on its own it's not good.
<racosta_> yep. I'm seeing a lot of 'over 4096 resubmit actions on bridge br-int while processing arp' as well. Could this be related?
<racosta_> Correct me if I'm wrong, but recirc for arp packets seems strange to me, right?
<imaximets> Some ARP traffic is not being delivered. That should not cause performance issues on its own, but may cause connectivity problems.
<imaximets> It's not a recirculation, these are resubmit() actions in the OF pipeline.
<racosta_> ohh, I see. Thanks
<imaximets> basically, OVN is trying to broadcast ARP on each port in the switch or router and there are too many ports.
<imaximets> So, the OF pipeline cannot be processed fully.
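(One way to see the resubmit explosion for a specific packet is ofproto/trace; the in_port and addresses below are made-up placeholders to be replaced with real values from the affected chassis:)
    # Traces an ARP request through the OpenFlow pipeline on br-int and
    # prints every resubmit; a truncated trace reproduces the
    # 'over 4096 resubmit actions' condition.
    ovs-appctl ofproto/trace br-int \
        'in_port=5,dl_src=aa:bb:cc:00:00:01,dl_dst=ff:ff:ff:ff:ff:ff,arp,arp_op=1,arp_spa=10.0.0.5,arp_tpa=10.0.0.200'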
<racosta_> I got it. So, I have approximately 2k arp packets per second and most of them are to unknown addresses that will not respond.
<imaximets> Can this be a reason why revalidation is slow? What do the datapath flows look like? Are they mostly for ARP or for some other traffic?
<imaximets> I mean, if we have to process so many resubmits per flow, revalidation can be slow.
<racosta_> For the 16k flows, we have 2k arp entries.
<racosta_> But we have ~2K logical routers on the external network where broadcasts occur...
<imaximets> racosta_, what is your OVN version?
<racosta_> OVN 22.03 and OVS 2.17
<racosta_> I think so, let me check
<imaximets> The patch was accepted within 21.12 time frame, so 22.03 should have it.
<racosta_> ack
<imaximets> FWIW, 23.06 seems to have a way to disable the broadcast: https://github.com/ovn-org/ovn/commit/37d308a2074515834692d442475a8e05310a152d
<racosta_> yeah, I had already seen this change. I wasn't planning to backport it for now. Do you think it could be related to the time spent by the revalidators?
<imaximets> racosta_, it definitely contributes, but it's hard to tell how much.