Constant high CPU usage on Raspberry Pi 4

Hmm. A couple more questions:

  • How quickly are the “learned new path” messages being generated? Once every few seconds, or many per second?
  • I didn’t see the output of bt (at the lldb> prompt) anywhere. Can you provide that for each thread?

I’m trying to replicate on my side. Thanks for your help.

EDIT: Are you using bridging at all? If so, can you also send us the output of ip a and brctl show?

There are many per second; I think I averaged over 1000 per second. I created a new debugger log in the support ticket and also provided a fresh zerotier-cli dump.

And, I am not using bridging.

Edit: Maybe it’s half that, about 500, since every “learned new path” log line is paired with a “trying unknown path” line.

Ok, got your logs. I see a problem. It’s caused by a condition on your network, but ZeroTier needs to handle it more gracefully. Here it is:

It looks like there are too many addresses available to reach your peer node. The path structure is completely full (64 address/port tuples), and I stopped counting after seeing a ton of different IPv6 and IPv4 addresses reported. ZeroTier counts something as a path if it has a unique tuple of <local socket, remote address, remote peer>. This many addresses may be required for what you’re doing, but here are some short-term mitigations you can try:

  • Remove unnecessary assigned addresses on local and remote interfaces
  • Add { "settings": { "allowSecondaryPort": false } } to your local.conf to allow only one local socket per address
  • If you must, you can increase ZT_MAX_PEER_NETWORK_PATHS from 64 to a larger positive integer (it is a compile-time constant, so this means rebuilding ZeroTier), but I suspect it would need to be a BIG number, so I don’t recommend it.
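To make the “unique tuple” rule concrete, here is a minimal Python sketch of a fixed-size path table. It is not ZeroTier’s code: the PathKey and fill_paths names are hypothetical, and the only values taken from the thread are the tuple shape and the cap of 64. It shows why the same remote address reached from several local sockets counts as several paths, and how quickly one peer can exhaust the table.

```python
from typing import Iterable, NamedTuple

# Cap quoted in the thread for ZT_MAX_PEER_NETWORK_PATHS.
MAX_PATHS = 64


class PathKey(NamedTuple):
    local_socket: int  # illustrative id for a local UDP socket (hypothetical)
    remote_addr: str   # remote "ip/port", e.g. "2001:db8::1/9993"
    peer: str          # 10-hex-digit ZeroTier node address


def fill_paths(candidates: Iterable[PathKey], cap: int = MAX_PATHS):
    """Store each unique tuple until the table is full.

    Returns (stored, rejected); rejected holds unique tuples that arrived
    after the table was already full.
    """
    seen = set()
    stored = []
    rejected = []
    for key in candidates:
        if key in seen:
            continue  # same <local socket, remote address, peer>: not a new path
        seen.add(key)
        if len(stored) < cap:
            stored.append(key)
        else:
            rejected.append(key)
    return stored, rejected


if __name__ == "__main__":
    # 3 local sockets x 30 remote endpoints = 90 unique tuples for one peer.
    addrs = [f"2001:db8::{i:x}/9993" for i in range(30)]
    candidates = [PathKey(s, a, "89e92ceee5") for s in range(3) for a in addrs]
    stored, rejected = fill_paths(candidates)
    print(len(stored), len(rejected))  # 64 26
```

In this toy model the 26 rejected tuples never find room, so if they keep being rediscovered, the node keeps doing work without ever settling, which matches the constant “trying unknown path” chatter described above.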

That all said, ZeroTier shouldn’t eat this much CPU when this happens so I’m going to add a learning rate backoff that will be in the next release.
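One way a learning-rate backoff could work is sketched below in Python. This is only an illustration of per-peer exponential backoff, assuming a caller that asks allow(peer) before acting on a newly learned path; the class name, parameters and reset rule are hypothetical and say nothing about how the actual fix in ZeroTier is implemented.

```python
import time
from typing import Callable, Dict, Tuple


class PathLearningBackoff:
    """Hypothetical per-peer backoff for "learned new path" events.

    After each accepted event for a peer, further events for that peer are
    refused for base * 2**n seconds (n = events accepted so far, delay capped
    at max_delay). If a peer stays quiet for max_delay past its window, its
    counter resets to the base delay.
    """

    def __init__(self, base: float = 0.5, max_delay: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.base = base
        self.max_delay = max_delay
        self.clock = clock
        # peer -> (time when the next event may be accepted, events accepted)
        self._state: Dict[str, Tuple[float, int]] = {}

    def allow(self, peer: str) -> bool:
        now = self.clock()
        next_allowed, accepted = self._state.get(peer, (0.0, 0))
        if now < next_allowed:
            return False  # still inside this peer's backoff window
        if now >= next_allowed + self.max_delay:
            accepted = 0  # quiet long enough: start over at the base delay
        delay = min(self.base * (2 ** accepted), self.max_delay)
        self._state[peer] = (now + delay, accepted + 1)
        return True
```

With base=1.0, a peer that keeps producing new paths is allowed through at t=0, 1, 3, 7 and then every 8 seconds (with max_delay=8.0), instead of hundreds of times per second; the injectable clock exists only to make the behavior easy to test.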

Let me know what you find, and thanks again for being so helpful.

I’ve noticed the many IPv6 addresses before, and honestly, I don’t know why there are so many. They’re not anything I’ve assigned. I know my ISP has IPv6 enabled, as does my personal router, so I don’t know whether they’re coming from my router or my ISP. And it’s not just the peer node, it’s also the Pi4. I looked at a handful of devices on my network and they all have about 4 IPv6 addresses assigned to them. So, embarrassingly, short of disabling IPv6 on my router and/or at my ISP, I’m not sure I can reduce the number of IPv6 addresses.

I agree that upping the max network paths doesn’t sound like a great option either, so I’ll be skipping that one.

That really only leaves the secondary port option. I can change that and see how it goes. But is there any downside to it? I assume I’d need to set it on the ZT devices on my local LAN, or should I set it on all my ZT devices?

Additionally, in the support ticket they suggested blacklisting the IPv6 Unique Local Addresses (ULAs). I’m trying that now, and while it has quieted the debugger log chatter considerably, I wonder whether it comes at a cost, as setting allowSecondaryPort might?

And to be honest about IPv6, I didn’t have it on originally, but I found ZeroTier documentation stating that IPv6 should have better routing capabilities to remote peers. That was something I wanted, so I enabled IPv6.

Hullah, how many IPv6 addresses have you got on each interface?

It’s not uncommon to get multiple ones; have a look at these:

Since this is pretty common behavior for IPv6, it’s something ZeroTier needs to take into account…

From what I’ve been searching and learning, I agree it doesn’t seem uncommon to have multiple IPv6 addresses.

The WiFi and LAN interfaces of the peer on my network each have:

  • 3 GUA
  • 2 ULA
  • 1 link-local

The WiFi and LAN of the Raspberry Pi 4 each have:

  • 1 GUA
  • 2 ULA
  • 1 link-local

Some other devices on my network:

Android Phone:

  • 2 GUA
  • 4 ULA
  • 1 link-local

A Linux NAS device:

  • 1 GUA
  • 2 ULA
  • 1 link-local
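The counts above already explain the overflow with a little arithmetic. The Python sketch below computes an upper bound on <local socket, remote address:port> tuples between the Pi and the peer. Its assumptions are mine, not confirmed by ZeroTier: both devices have two interfaces, each node listens on up to 3 ports (primary, secondary and the UPnP/port-mapping port mentioned later in the thread), every local address/port pair is a separate local socket, every combination is reachable, and IPv4 is ignored. Real numbers will be lower, but the point is the order of magnitude.

```python
# IPv6 address counts per interface, from the posts above.
PEER_PER_IFACE = {"GUA": 3, "ULA": 2, "link-local": 1}
PI_PER_IFACE = {"GUA": 1, "ULA": 2, "link-local": 1}
INTERFACES = 2       # WiFi and LAN on both devices
PORTS_PER_NODE = 3   # assumption: primary, secondary and port-mapping (UPnP) ports
MAX_PATHS = 64       # ZT_MAX_PEER_NETWORK_PATHS as quoted in the thread


def ipv6_addresses(per_iface, interfaces=INTERFACES, exclude=()):
    """Total IPv6 addresses on a device, optionally skipping some address kinds."""
    return interfaces * sum(n for kind, n in per_iface.items() if kind not in exclude)


def worst_case_paths(remote_addrs, local_addrs, ports=PORTS_PER_NODE):
    """Upper bound on <local socket, remote ip/port> tuples toward one peer,
    assuming every local socket can reach every remote endpoint."""
    remote_endpoints = remote_addrs * ports
    local_sockets = local_addrs * ports
    return remote_endpoints * local_sockets


if __name__ == "__main__":
    peer, pi = ipv6_addresses(PEER_PER_IFACE), ipv6_addresses(PI_PER_IFACE)
    print(peer, pi, worst_case_paths(peer, pi))  # 12 8 864
    print(worst_case_paths(peer, pi, ports=1))   # 96
    no_ula = (ipv6_addresses(PEER_PER_IFACE, exclude=("ULA",)),
              ipv6_addresses(PI_PER_IFACE, exclude=("ULA",)))
    print(worst_case_paths(*no_ula, ports=1), "vs cap", MAX_PATHS)  # 32 vs cap 64
```

Under these assumptions, even a single port per node leaves 96 candidate tuples, above the cap of 64, while dropping both the extra ports and the ULAs brings the bound down to 32.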

I don’t think so? The ULAs are local-LAN only (correct me if I’m wrong), so they wouldn’t help with NAT traversal. Nodes on the same physical LAN could use ULA addresses to peer, but they can use the global addresses and IPv4 too.

In any case, these are mitigations for something we need to solve properly, as Joseph mentioned.

In addition to the secondary port, there’s a third port, controlled by portMappingEnabled:

{ "settings": { "allowSecondaryPort": false, "portMappingEnabled": false }}

That’s the UPnP port. You can likely disable it, especially if you’re not using UPnP on your routers, or on any nodes that aren’t behind NAT (cloud VMs).

Maybe blocking the IPv6 ULA range (fc00::/7; addresses in use fall in its fd00::/8 half) is the most effective option for the least trouble. The local LAN is probably where most of the “paths” are being created.
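Before blacklisting anything, it can help to see which endpoints would actually be affected. The Python sketch below uses the standard ipaddress module to split ZeroTier-style “ip/port” strings into ULA and non-ULA endpoints; the function name and the “ip/port” input format are assumptions for illustration. Note that ipaddress.ip_network("fd00::/7") raises ValueError (host bits are set for a /7), which is why the range is written as fc00::/7 here.

```python
import ipaddress

# RFC 4193 unique local addresses; addresses seen in practice fall in fd00::/8, half of this range.
ULA_RANGE = ipaddress.ip_network("fc00::/7")


def split_ula(endpoints):
    """Split ZeroTier-style "ip/port" strings into (ula, other) lists."""
    ula, other = [], []
    for ep in endpoints:
        ip = ipaddress.ip_address(ep.rsplit("/", 1)[0])  # drop the "/port" suffix
        if ip.version == 6 and ip in ULA_RANGE:
            ula.append(ep)
        else:
            other.append(ep)
    return ula, other
```

For example, split_ula(["fd12:3456:789a::1/9993", "2001:db8::5/9993"]) puts the first endpoint in the ULA list and the global address in the other list; link-local (fe80::/10) and IPv4 endpoints also land in the other list.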

Have you seen a reduction in cpu usage?

I did block the ULAs for a short while, but then switched to disabling the secondary port so that ZeroTier could still use the ULAs in case those were better routes. That at least seems to keep the path structure from filling up and stops the constant new-path discovery I was seeing. I’m now getting the expected CPU usage: under 2% over a 10-minute sample (see picture).

I was wondering where the third port was coming from, so it makes sense that it’s from UPnP. My 2 peers in question on this thread are on my home network and do use UPnP, though. That raises the question of expired and renewed UPnP ports. I assume that as a UPnP-mapped port expires, a new one is issued, which adds to the number of paths while the old ones are removed as they expire. I assume ZeroTier already deals with that.

I do have 2 cloud VMs that I could disable port mapping on. However, I don’t think I’ve ever gotten a direct P2P connection to my cloud VMs (Azure and Oracle), even though I believe I’m setting everything up correctly in the config and opening UDP port 9993. I’d love to confirm or rule that out too.

[Screenshot: 2023-02-22_0904]

Thanks.

For the cloud nodes, check zerotier-cli peers for “direct” or “relay”.
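If you have many peers, a small script can pull out the link type for each one. The Python sketch below parses the plain-text output of zerotier-cli peers; it assumes, based on the layout of recent versions rather than any documented format, that each peer line starts with a 10-hex-digit node address and contains a DIRECT or RELAY column. Check your own output before relying on it.

```python
import re

ZT_ADDR = re.compile(r"^[0-9a-f]{10}$")  # 10-hex-digit ZeroTier node address


def link_status(peers_output: str) -> dict:
    """Map node address -> "DIRECT" or "RELAY" from `zerotier-cli peers` text output."""
    status = {}
    for line in peers_output.splitlines():
        tokens = line.split()
        if not tokens or not ZT_ADDR.match(tokens[0]):
            continue  # skips the "200 peers" status line and the column header
        for tok in tokens[1:]:
            if tok.upper() in ("DIRECT", "RELAY"):
                status[tokens[0]] = tok.upper()
                break
    return status
```

To feed it live data, you could pass it subprocess.run(["zerotier-cli", "peers"], capture_output=True, text=True).stdout; any cloud node that shows up as RELAY is not getting a direct path.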

You pretty much need to allow all outgoing UDP, since other nodes are behind NAT and get mapped to random ports.
I’m not sure what kind of options you have for the Azure or Oracle firewalls.

If it’s iptables, you can use -A OUTPUT -m owner --uid-owner zerotier-one -j ACCEPT to allow only zerotier-one to send anything.

I’m also having this high CPU usage issue. If I reboot my Pi 4 or restart the zerotier-one service, it goes away for about 6 hours.

Would disabling IPv6 help, perhaps? I am getting “200 peers”.

I’ve resorted to creating a cron job that restarts ZeroTier at 07:45 every morning. I suspect I’ll need to run it a few more times later in the day, as it starts eating all my CPU.

I updated my AlmaLinux servers and 2 Synology DSM 7 boxes to the latest version, and now they also have the issue: 100% CPU load from ZeroTier after a few pings. I rolled the Synologys back to 1.8.10 and all is OK.
Our backups and some internal services run over ZT, and although I’d love to keep running ZT, I can’t support this much longer.
How do I reinstall ZT 1.8.10 on CentOS / AlmaLinux with yum?
Thanks

Look through zerotier-cli peers -j to see if you have a ton of paths, and check zerotier-cli info -j to see if the list of surfaceAddresses is big or growing.
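Both checks can be scripted. The Python sketch below takes the JSON text from zerotier-cli peers -j and zerotier-cli info -j, counts paths per peer, flags peers close to the 64-path cap, and counts surfaceAddresses. Assumptions to verify against your own output: peers -j is a list of objects with "address" and "paths" keys, and surfaceAddresses may sit anywhere in the info -j document, so it is found by a recursive key search rather than a fixed path. The warn_at threshold of 48 is arbitrary.

```python
import json

MAX_PATHS = 64  # ZT_MAX_PEER_NETWORK_PATHS value quoted earlier in the thread


def path_counts(peers_json: str) -> dict:
    """Map peer address -> number of paths, from `zerotier-cli peers -j` output."""
    return {p.get("address", "?"): len(p.get("paths") or []) for p in json.loads(peers_json)}


def find_key(obj, key):
    """Recursively search parsed JSON for the first value stored under key."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def report(peers_json: str, info_json: str, warn_at: int = 48) -> dict:
    """Summarize total paths, peers at or above warn_at paths, and surface address count."""
    counts = path_counts(peers_json)
    surface = find_key(json.loads(info_json), "surfaceAddresses") or []
    return {
        "total_paths": sum(counts.values()),
        "near_cap": sorted(a for a, n in counts.items() if n >= warn_at),
        "surface_addresses": len(surface),
    }
```

Running report() periodically (for example, every few minutes from cron) and comparing the results over time would show whether the path totals or the surface address list keep growing, which is the pattern described in the posts below.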

Thanks, I found that yesterday as well.
I also had to remove the repo, as it updated back to 1.10.3 just as quickly last night.
Now running 1.8.10 and it seems fine again.

Even after disabling the secondary port, this is still an issue, not only on my Raspberry Pi 4 but also on other peers on my home network (both MacBook Pros). I’m once again getting constant 100% CPU usage on my Pi4, and now ~33% constant CPU usage on my MBPs.

I don’t really want to disable IPv6 on my network or block ZeroTier traffic on my IPv6 routes. Is a fix for this being worked on? Is there any timeline for it?

I looked into this further, and for one of my local peers (from the MBP to the Pi4) the paths were full again. So I looked at one IPv6/port combination open on the Pi4 that the MBP was connecting to. This one IPv6/port combo had 27 entries because there were 27 different local sockets on the MBP. Why are there so many different local sockets? Are they building up over time?

I’m also not sure what port this is (34029), because it’s not the standard 9993 port. Yet this same IPv6 address only has 2 paths for the standard 9993 port. I suppose that may be because the path structure is full and the other 9993 paths couldn’t fit. But another IPv6 address has 27 combinations for port 9993 and only 3 for 34029.
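The per-address tally described above can be reproduced from zerotier-cli peers -j. The Python sketch below groups one peer’s paths by remote “ip/port” and counts distinct local sockets for each. It assumes each path object carries "address" and "localSocket" fields, which is what 1.10.x output appears to include; treat the field names as an assumption and adjust them if your version differs.

```python
import json
from collections import defaultdict


def sockets_per_remote(peers_json: str, peer: str) -> dict:
    """For one peer, map each remote "ip/port" to the number of distinct
    local sockets that have a path to it."""
    sockets = defaultdict(set)
    for p in json.loads(peers_json):
        if p.get("address") != peer:
            continue
        for path in p.get("paths") or []:
            sockets[path.get("address")].add(path.get("localSocket"))
    return {addr: len(s) for addr, s in sockets.items()}
```

Sorting the result by count, for example sorted(result.items(), key=lambda kv: -kv[1]), puts remote endpoints like the one with 27 local sockets at the top, which makes it easier to see whether the count grows between runs.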

My personal workstation just reached 1500 - 1600 paths again with three peers at 64 addresses and a few more at ≥50.

We are working on it. :+1:

It doesn’t seem to be too big. Disabling IPv6 seems to let it run longer before the streaming gets choppy, but I can’t be sure.

My scenario is using TVHeadend remotely to stream video.

The main issue I’m having is that after a random number of hours streaming video over ZeroTier, it becomes unresponsive until I restart the zerotier-one service on the remote TVHeadend server, at which point it becomes responsive and flawless again. I can’t determine exactly what’s causing it (bandwidth certainly isn’t the issue), but it’s definitely ZeroTier-related, since performance returns every time I restart the zerotier-one service.

If I can provide logs or anything else, let me know and I’d be happy to help, because other than this issue ZeroTier is pretty magical in terms of ease of use (you’ve got a good thing going here).

We are considering bumping the max paths again, but we want to fully understand why people are ending up in this situation before we take the easy way out. Can you tell me more about the assigned addresses on your local node with ~1500 paths and on one of the remote peers with 64+ paths?

For instance, comment #19 was useful: “Constant high CPU usage on Raspberry Pi 4 - #19 by Hullah”.