Constant high CPU usage on Raspberry Pi 4

Hi
I am seeing exactly the same on two AlmaLinux machines. A reboot solves it for around 15-30 minutes, and then the CPU hits 100% and there’s no more connection to the ZT IP from any remote machine. From the local machine, the ZT IP is still pingable though.
Did you manage to find the source of the issue and solve it?
Thanks
Jan

No, I’m still dealing with it. I’ve sent zt-travis a few 10-minute process monitors where the CPU core is at 100% for most, if not all, of the time. But still no progress.

I am experiencing this as well.

Anyone here willing to jump through a few hoops to get us profiling data for your specific setups? If we get the right clues we might be able to issue a patch:

git clone https://github.com/zerotier/ZeroTierOne.git
cd ZeroTierOne
make one -j$(nproc) ZT_DEBUG=1

If compilation doesn’t work, you may need some or all of the following:

sudo apt install git make lldb libssl-dev
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Then once compilation succeeds:

NOTE: Stop your system ZeroTier instance before continuing to the next step. This new debug instance will use the same identity and port and will be functionally identical.

sudo lldb ./zerotier-one
lldb> run

Wait for high CPU utilization and then:

^C
lldb> bt

And then send us the output.

Optionally, send a bt for each thread:

^C
lldb> thread list
lldb> t 1
lldb> bt
lldb> t 2
lldb> bt
...
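
As a shortcut, recent lldb versions can dump every thread’s backtrace in one command:

^C
lldb> bt all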

If you do this a few times, it could help us pin down where ZT is spending most of its time. Alternatively, gprof is an option, but it’s a little harder to use.
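
For anyone who prefers gprof, a rough sketch (the -pg flag wiring is illustrative and may need adjusting to the project’s Makefile; note that gmon.out is only written on a clean exit):

make clean
make one -j$(nproc) ZT_DEBUG=1 CFLAGS="-pg" CXXFLAGS="-pg" LDFLAGS="-pg"
sudo ./zerotier-one        # reproduce the high CPU, then stop it gracefully
gprof ./zerotier-one gmon.out > profile.txt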

I was able to get this compiled and running, but I’m unsure whether you want just the output after issuing the bt command, or everything from run until after the bt command. Sorry, I’m not familiar with lldb. At first I thought you just wanted the output from bt, but after looking at the output of run, I’m guessing you want that too?

If you want the output of run as well, that may be difficult, as it just flies by with messages. I did briefly look at them and they seem to consist of two styles:

learned new path {IPv6 Address}/{Port} to {Peer on local LAN} (packet 14d974e3c9def66c local socket 366521568960 network 0000000000000000)
trying unknown path {IPv6 Address}/{Port} to {Peer on local LAN} (packet 359a42d256b1d44f verb 1 local socket 366521569312 network 0000000000000000)

There are LOTS of these, repeatedly. They all seem to be for the same peer, and there are about six IPv6 address/port combinations it’s working through. Let me know what parts of this output you want and I’ll get it to you. Also, this was a very small time sample, and of course it was right after the instance started for the first time. A very large number of those flew by at first, then it briefly paused, then more, pause, more. Because there are so many, I’m not sure if it’s repeating for other peers or just this one (my scroll buffer evidently isn’t big enough).
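
(To get around the scroll buffer, I could presumably tee the whole session into a file and grep it afterwards, e.g.:

sudo lldb ./zerotier-one 2>&1 | tee zt-debug.log
grep -c "learned new path" zt-debug.log    # count occurrences afterwards

zt-debug.log is just a placeholder name.)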


So, thinking about this other {Peer on local LAN} that it seems to be trying paths to: it has two network interfaces, wired and wireless. So I turned its WiFi off, and as soon as I did, the logs went quiet except for regular "requesting configuration for network xxxxxxxxxxx" messages, which I assume is normal. I then turned WiFi back on and it’s quieter than it was. The messages still come, but not as frequently or in as high a quantity.

I don’t know if any of this is helpful, but let me know what you need and I’ll work to get it to you.

@Hullah No problem. Yes, I’m interested in the backtrace output as well as a sample of your logs.

You can send unredacted logs privately to our team here: Jira Service Management

Some questions:

  • How was CPU usage when you turned off the remote peer’s WiFi?
  • What version is each side using?

Btw, we recently resolved an issue with how paths are learned; maybe try this all again with our upcoming 1.10.3 release. It might be related.

Thank you for the response and the suggestion to send the logs to the team. I have done that, linking it back to this discussion. The reference number is ZT-4851 in case you need it.

Honestly, I was worried that my problem had magically gone away (because I had made some network changes recently). I had the connection back up for a day or two and still had zero CPU time. But then I realized that I had never turned the peer’s WiFi back on from when I turned it off. After I turned the WiFi back on and let it sit for a bit, I came back to a fully consumed CPU core again and the same logs flying by for the same peer on my local LAN.

For your questions:

  1. As alluded to in the previous paragraph, when the WiFi was off on the peer, I had almost zero CPU time.
  2. The version on the Raspberry Pi was what was on HEAD while running the debugger, git commit: 666fb7ea2d. The version on the peer on the local LAN is v1.10.2.

Additional info I thought of later: the Pi also has both of its network interfaces active and connected (LAN and WiFi). So while I had high CPU, with both peers having active LAN and WiFi connections, I disabled the Pi’s WiFi to see if that mattered. The path logs kept coming. The key seems to be disabling the peer’s WiFi.

Ok, thank you. We received your logs. I would try upgrading both ends to 1.10.3 to see if that resolves it. Since this has to do with the path learning logic (the repeated learning and re-learning of paths), I wonder if it’s related to the duplicate path issue.

If that doesn’t fix it, there is a potential workaround using multipath’s active-backup mode. It will only use one link at a time, but the learning logic is different. You’d put this config in the troubled node’s local.conf:

{
  "settings": {
    "defaultBondingPolicy": "custom-active-backup",
    "policies": {
      "custom-active-backup": {
        "basePolicy": "active-backup",
        "failoverInterval": 60000
      }
    }
  }
}
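
After restarting the service, something like this should confirm the policy took effect (assuming a 1.10+ build where the bond subcommand is available):

sudo systemctl restart zerotier-one
sudo zerotier-cli bond list        # should show the active-backup policy per peer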

Thanks for looking at it. I have updated to 1.10.3 and will let them sit for a day or two, then check on the CPU usage. Since it’s an official release, I won’t have the debug logs. Do you think it’s worth pulling latest and running that instead of the official release on the Pi?

Either works. Feel free to install the official release; if you’re still seeing high CPU usage, I think we can assume it’s the same bug.

Unfortunately, it looks like it’s still occurring. I noticed it with the official release, so I pulled and ran the latest from the dev branch. Can I provide more information or test something to help figure this bug out?

Hmm. A couple more questions:

  • How quickly are the “learned new path” messages being generated? Like, once every few seconds, or many per second?
  • I didn’t see the output of lldb> bt anywhere. Can you provide that for each thread?

I’m trying to replicate on my side. Thanks for your help.

EDIT: Are you using bridging at all? If so, can you also send us the output of ip a and brctl show?

There are many per second; I think I averaged over 1000 per second. I created a new debugger log in the support ticket, as well as a fresh zerotier-cli dump.

And, I am not using bridging.

Edit: Maybe it’s half that, 500, since every “learned new path” log line has a matching “trying unknown path” line.

Ok, got your logs. I see a problem. It’s caused by a condition on your network, but ZeroTier needs to handle it more elegantly. Here it is:

It looks like there are too many addresses available to reach your peer node. I see that the path structure is completely full (64 address/port tuples). I stopped counting after I saw a ton of different IPv6 and IPv4 addresses reported. ZeroTier counts something as a path if it has a unique tuple of <local socket, remote address, remote peer>. This large number of addresses may be required for what you’re doing, but here are the short-term mitigations you can try:

  • Remove unnecessary assigned addresses on local and remote interfaces
  • Add { "settings": { "allowSecondaryPort": false } } to your local.conf to only allow one local socket per address (complete example after this list)
  • If you must, you can increase ZT_MAX_PEER_NETWORK_PATHS from 64 to some bigger positive integer, but my suspicion is that it would need to be a BIG number, so I don’t suggest doing this.
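
For reference, the allowSecondaryPort option from the second bullet as a complete local.conf (restart the service after editing):

{
  "settings": {
    "allowSecondaryPort": false
  }
}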

That all said, ZeroTier shouldn’t eat this much CPU when this happens, so I’m going to add a learning-rate backoff that will be in the next release.

Let me know what you find, and thanks again for being so helpful.

I’ve noticed the many IPv6 addresses before, and honestly, I don’t know why there are so many. They’re not anything I’ve assigned. I know my ISP has IPv6 enabled, as does my personal router, so I don’t know if they’re coming from my router or my ISP. And it’s not just the peer node; it’s also the Pi 4. I looked at a handful of devices on my network and they all have about four IPv6 addresses assigned. So, embarrassingly, short of disabling IPv6 on my router and/or at my ISP, I’m not sure I can reduce the number of IPv6 addresses.

I agree that upping the max network paths doesn’t sound like a great option either, so I’ll be skipping that one.

That really only leaves the secondary port option. I can change that and see how it goes. But is there any downside to it? I would assume I’d need to set it on the ZT devices on my local LAN; or should I set it on all my ZT devices?

Additionally, in the support ticket they suggested blacklisting the IPv6 Unique Local Addresses. I am trying that currently, and while it has very much quieted the debugger log chatter, I wonder if it comes at a cost like setting allowSecondaryPort might?

And to be honest about IPv6, I didn’t have it on originally, but I found ZeroTier documentation stating that IPv6 should give better routing to remote peers. That was something I wanted, so I enabled IPv6.

Hullah, how many IPv6 addresses have you got on each interface?
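
Something like this will list them per interface (eth0/wlan0 are placeholders for your interface names):

ip -6 addr show dev eth0 | grep inet6
ip -6 addr show dev wlan0 | grep -c inet6    # or just count them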

It’s not uncommon to get multiple ones; IPv6 privacy/temporary addresses are one common reason.

Since it’s pretty common behavior for IPv6, this is something ZeroTier needs to take into account…

From what I was searching/learning, I agree it doesn’t seem uncommon to have multiple IPv6 addresses.

The WiFi and LAN interfaces of the peer on my network each have:

  • 3 GUA
  • 2 ULA
  • 1 link-local

The WiFi and LAN of the Raspberry Pi 4 each have:

  • 1 GUA
  • 2 ULA
  • 1 link-local

Some other devices on my network:

Android Phone:

  • 2 GUA
  • 4 ULA
  • 1 link-local

A linux NAS device:

  • 1 GUA
  • 2 ULA
  • 1 link-local

I don’t think so? The ULAs are local-LAN only (correct me if I’m wrong), so they wouldn’t help with NAT traversal. Nodes on the same physical LAN could use ULA addresses to peer, but they can use the global addresses and IPv4 too.

In any case, these are mitigations for something we need to solve correctly, as joseph mentioned.

In addition to the secondary port, there’s a third port controlled by portMappingEnabled:

{ "settings": { "allowSecondaryPort": false, "portMappingEnabled": false }}

That’s the UPnP port. You can likely disable it, especially if you’re not using UPnP on your routers, or on any nodes that aren’t behind NAT (cloud VMs).

Maybe blocking the fd00::/7 addresses is the most effective option for the least trouble. It’s probably the local LAN where most of the “paths” are being created.
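
For reference, that blacklist would go in local.conf’s physical section, something like this (a sketch based on the documented physical blacklist option):

{
  "physical": {
    "fd00::/7": {
      "blacklist": true
    }
  }
}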

Have you seen a reduction in CPU usage?

I did block the ULAs for a short while, but then switched to disabling the secondary port so it could still use the ULAs in case those were better routes. Doing that seems to keep the path structure from filling up and prevents the constant new-path discovery I was seeing. I’m now getting expected CPU usage, with a sample of less than 2% CPU over 10 minutes (see picture).

I was wondering where the third port was coming from, so it makes sense that it’s UPnP. My two peers in question on this thread are on my home network and do utilize UPnP, which raises the question of expired and renewed UPnP ports. I would assume that as a UPnP-mapped port expires, a new one is issued, which in turn adds to the number of paths while the old ones age out. I assume ZeroTier deals with that already.

I do have two cloud VMs that I could disable port mapping on, though I don’t think I’ve ever gotten a direct P2P connection on them (Azure and Oracle), even though I feel like I’m setting everything up correctly in the config and opening UDP port 9993. I’d love to somehow confirm or deny that too.

[screenshot: CPU usage sample, 2023-02-22_0904]

Thanks.

For the cloud nodes, check zerotier-cli peers for “direct” or “relay”.
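
For example (column layout may vary a bit by version):

sudo zerotier-cli peers                  # the link column shows DIRECT or RELAY
sudo zerotier-cli peers | grep RELAY     # quickly list peers that are still relayed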

You kind of need to allow all outgoing UDP; other nodes are behind NAT and get mapped to random ports. I’m not sure what kind of options you have for the Azure or Oracle firewalls.

If it’s iptables, you can use -A OUTPUT -m owner --uid-owner zerotier-one -j ACCEPT to allow just zerotier-one to send anything.
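
Spelled out as a complete command (assuming the service runs as the zerotier-one user; add the ip6tables twin if you use IPv6):

sudo iptables -A OUTPUT -m owner --uid-owner zerotier-one -j ACCEPT
sudo ip6tables -A OUTPUT -m owner --uid-owner zerotier-one -j ACCEPT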