Constant high CPU usage on Raspberry Pi 4

I’ve noticed that the zerotier-one service on my Raspberry Pi 4 is consuming a full CPU core. If I restart the service, it’ll temporarily return to normal, but then after a while, it’ll start consuming a full CPU core again. I even uninstalled ZeroTier, removed everything from the /var/lib/zerotier-one directory (except the identity files) and reinstalled. That seemed to help for a longer time, but it still eventually returned to consuming a full CPU core.

Any help on this is greatly appreciated. I’ve tried other solutions I’ve found but nothing seems to help.

Thanks for posting. That’s not normal…
Maybe there is an IP address conflict or something? Can you run zerotier-cli dump and paste the file in a message to me?

I am seeing exactly the same on 2 AlmaLinux machines; a reboot solves it for around 15-30 minutes. Then CPU hits 100% and there’s no more connection to the ZT IP from any remote machine. From the local machine the ZT IP is still pingable, though.
Did you manage to find the source of the issue and solve it?

No, I’m still dealing with it. I’ve sent zt-travis a few 10 minute process monitors where the CPU core is at 100% for most if not all the time. But still no progress.

I am experiencing this as well.

Anyone here willing to jump through a few hoops to get us profiling data for your specific setups? If we get the right clues we might be able to issue a patch:

git clone
cd ZeroTierOne
make one -j$(nproc) ZT_DEBUG=1

If compilation doesn’t work you may need some or all of the following:

sudo apt install git make lldb libssl-dev
curl --proto '=https' --tlsv1.2 -sSf | sh

Then once compilation succeeds:

NOTE: Stop your system ZeroTier instance before continuing to the next step. This new debug instance will use the same identities and port and will be functionally identical.

sudo lldb ./zerotier-one
lldb> run

Wait for high CPU utilization and then:

lldb> bt

And then send us the output.

Optionally, send a bt for each thread:

lldb> thread list
lldb> t 1
lldb> bt
lldb> t 2
lldb> bt

If you did this a few times, it could help us pin down where ZT is spending most of its time. Alternatively, using gprof is an option, but it’s a little harder to use.

I was able to get this compiled and running. But I’m unsure if you want just the output after issuing the bt command, or if you’re wanting everything starting from run till after the bt command? Sorry, I’m not familiar with lldb. At first, I thought you just wanted the output from the bt command, but after looking at the output of run, I’m guessing you want that too?

If you’re wanting the output of run as well, that may be difficult as it just flies by with messages. I did briefly look at them and they seem to be consistent, falling into two styles:

learned new path {IPv6 Address}/{Port} to {Peer on local LAN} (packet 14d974e3c9def66c local socket 366521568960 network 0000000000000000)
trying unknown path {IPv6 Address}/{Port} to {Peer on local LAN} (packet 359a42d256b1d44f verb 1 local socket 366521569312 network 0000000000000000)

There’s LOTS of these, repeatedly. They all seem to be for the same peer, and there’s about 6 IPv6 addresses/ports it’s working through. Let me know what parts of this output you want and I’ll get it to you. Additionally, this was a very small time sample and of course it was right after it started for the first time. A very large number of those flew by at first, then it briefly paused, then more, pause, more. Because there’s so many, I’m not sure if it’s repeating for other peers or just this one (my scroll buffer evidently isn’t big enough).
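In case it helps quantify this, here’s one rough way to tally the learned new path lines by peer. This is just a sketch: the heredoc holds made-up sample lines (the peer ID and addresses are invented), and in practice you’d feed in the captured debug log instead.

```shell
# Tally "learned new path ... to <peer>" lines by peer ID ($6 is the peer field).
# The heredoc is stand-in sample data; replace it with the real debug log.
awk '/learned new path/ { n[$6]++ } END { for (p in n) print p, n[p] }' <<'EOF'
learned new path fe80::aaaa/9993 to 1234567890 (packet 14d974e3c9def66c local socket 1 network 0000000000000000)
trying unknown path fe80::bbbb/9993 to 1234567890 (packet 359a42d256b1d44f verb 1 local socket 2 network 0000000000000000)
learned new path fe80::cccc/9993 to 1234567890 (packet 24d974e3c9def66d local socket 1 network 0000000000000000)
EOF
# prints: 1234567890 2
```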

So, thinking about this other {Peer on local LAN} that it seems to be trying paths to: it has two network interfaces, wired and wireless. So I turned its WiFi off, and as soon as I did, the logs went quiet, with only the regular requesting configuration for network xxxxxxxxxxx messages, which I assume is normal. I then turned WiFi back on and it’s quieter than it was. The messages still come, but not as frequently or in as high a quantity.

I don’t know if any of this is helpful, but let me know what you need and I’ll work to get it to you.

@Hullah No problem. Yes, I’m interested in the backtrace output as well as a sample of your logs.

You can send unredacted logs here privately to our team: Jira Service Management

Some questions:

  • How was CPU usage when you turned off the remote peer’s Wifi?
  • What version is each side using?

Btw, we recently resolved an issue that dealt with how paths were learned, maybe try this all again with our upcoming 1.10.3 release. It might be related.

Thank you for the response and suggestion to send the logs to the team. I have done that linking it back to this discussion. The reference number is ZT-4851 in case you need that.

Honestly, I was concerned that my problem had magically gone away (because I had some network changes recently). I had the connection back up for a day or two and still had zero CPU time. But then I realized that I had never turned the peer’s WiFi back on from when I had it off. After I turned the WiFi back on on the peer and let it sit for a bit, I came back to full CPU core usage again and the same logs flying by for the same peer on my local LAN.

For your questions:

  1. As alluded to in the previous paragraph, when the WiFi was off on the peer, I had almost zero CPU time.
  2. The version on the Raspberry Pi was what was on HEAD while running the debugger, git commit: 666fb7ea2d. The version on the peer on the local LAN is v1.10.2.

Additional info I thought of later: the Pi also has both of its network interfaces active and connected (LAN and WiFi). So while I had high CPU and both peers had active LAN and WiFi connections, I disabled the Pi’s WiFi to see if that mattered. The path logs kept coming. The key seems to be disabling the peer’s WiFi.

Ok thank you. We received your logs, I would try to upgrade both ends to 1.10.3 to see if that resolves it. Since this has to do with the path learning logic (the repeated learning and re-learning of paths) I wonder if it’s related to the duplicate path issue.

If that doesn’t fix it, there is a potential workaround using multipath’s active-backup mode. It will only use one link at a time, but the learning logic is different. You’d put this config in the troubled node’s local.conf:

{
  "settings": {
    "defaultBondingPolicy": "custom-active-backup",
    "policies": {
      "custom-active-backup": {
        "basePolicy": "active-backup",
        "failoverInterval": 60000
      }
    }
  }
}
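Before restarting with a config like this, it may be worth sanity-checking the JSON, since a stray brace will make the daemon ignore the file. A quick sketch using Python’s bundled json.tool (the config body below is the same bonding snippet, closing braces included):

```shell
# Validate the local.conf snippet before installing it; exits non-zero on a syntax error.
python3 -m json.tool <<'EOF' > /dev/null && echo "valid JSON"
{
  "settings": {
    "defaultBondingPolicy": "custom-active-backup",
    "policies": {
      "custom-active-backup": {
        "basePolicy": "active-backup",
        "failoverInterval": 60000
      }
    }
  }
}
EOF
# prints: valid JSON
```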

Thanks for looking at it. I have updated to 1.10.3 and will let them sit for a day or two then check on the CPU usage. Since it’ll be an official release, I won’t have the debug logs. Do you think it’s worth me pulling latest and running that instead of the official release on the Pi?

Either works. Feel free to install the official release, if you’re still seeing high CPU usage I think we can assume it’s still the same bug.

Unfortunately it looks to still be occurring. I noticed it with the official release, and so I pulled latest and ran the latest from the dev branch. Can I provide you more information or test something that can help figure this bug out?

Hmm. A couple more questions:

  • How quickly are the learned new path messages being generated? Like, once every few seconds or many per second?
  • I didn’t see the output of lldb> bt anywhere. Can you provide that for each thread?

I’m trying to replicate on my side. Thanks for your help.

EDIT: Are you using bridging at all? If so, can you also send us the output of ip a and brctl show?

There are many per second, I think I averaged over 1000 per second. I created a new debugger log in the support ticket as well as provided a fresh zerotier-cli dump.

And, I am not using bridging.

Edit: Maybe it’s half that, 500, since every “learned new path” log line has a “trying unknown path” line.

Ok, got your logs. I see a problem. It’s caused by a condition on your network but ZeroTier needs to handle this more elegantly. Here it is:

It looks like there are too many addresses available to reach your peer node. I see that the path structure is completely filled (64 address/port tuples). I stopped counting after I saw a ton of different ipv6 and ipv4 addresses reported. ZeroTier will count something as a path if it has a unique tuple of <local socket, remote address, remote peer>. This large number of addresses may be required for what you’re doing but here are the short-term mitigations you can try:

  • Remove unnecessary assigned addresses on local and remote interfaces
  • Add { "settings": { "allowSecondaryPort": false } } to your local.conf to only allow one local socket per address
  • If you must, you can increase ZT_MAX_PEER_NETWORK_PATHS from 64 to some bigger positive integer, but my suspicion is that this would need to be a BIG number, so I don’t suggest doing this.
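For the second bullet, a complete local.conf would look roughly like this (assuming the default Linux path /var/lib/zerotier-one/local.conf; the service needs a restart to pick it up):

```json
{
  "settings": {
    "allowSecondaryPort": false
  }
}
```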

That all said, ZeroTier shouldn’t eat this much CPU when this happens so I’m going to add a learning rate backoff that will be in the next release.

Let me know what you find, and thanks again for being so helpful.

I’ve noticed the many IPv6 addresses before. And honestly, I don’t know why there are so many. They’re not anything I’ve assigned. I know my ISP has IPv6 enabled, as does my personal router, so I don’t know if they’re coming from my router or my ISP. But it’s not just the peer node, it’s also the Pi 4. I looked at a handful of my devices on my network and they all have about 4 IPv6 addresses assigned to them. So, embarrassingly, short of disabling IPv6 on my router and/or at my ISP, I’m not sure I can reduce the number of IPv6 addresses.

I agree that upping the max network paths doesn’t sound like a great option either, so I’ll be skipping that one.

That really only leaves the secondary port option. I can change that and see how it goes. But is there any downside to this? I would assume I’d need to do that on the ZT devices on my local LAN, or should I set it on all my ZT devices?

Additionally, in the support ticket they suggested blacklisting the IPv6 Unique Local Addresses. I am trying that currently, and while it has very much quieted the debugger log chatter, I wonder if it comes at a cost like setting allowSecondaryPort might?

And to be honest about IPv6: I didn’t have it on originally, but I found ZeroTier documentation stating that IPv6 should give better routing capabilities to remote peers. That was something I wanted, so I enabled IPv6.

Hullah, how many IPv6 addresses have you got on each interface?

It’s not uncommon to get multiple ones; have a look at these:

Since it’s a pretty common behavior for IPv6, this is something ZeroTier needs to take into account…
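If it helps answer the question, here’s one rough way to count inet6 addresses per interface. This is a sketch: the heredoc is stand-in sample output, and in practice you’d pipe in real ip -6 addr show output instead (assumes iproute2):

```shell
# Count inet6 lines per interface; replace the heredoc with real `ip -6 addr show` output.
awk '/^[0-9]+: / { iface = $2; sub(/:$/, "", iface) }
     /inet6/     { n[iface]++ }
     END         { for (i in n) print i, n[i] }' <<'EOF'
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP
    inet6 fd52:aaaa::12/64 scope global
    inet6 fe80::1/64 scope link
EOF
# prints: eth0 2
```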

From what I was searching/learning, I agree it doesn’t seem uncommon to have multiple IPv6 addresses.

The Wifi and LAN of the peer on my network each have:

  • 3 GUA
  • 2 ULA
  • 1 link-local

The WiFi and LAN of the Raspberry Pi 4 each have:

  • 1 GUA
  • 2 ULA
  • 1 link-local

Some other devices on my network:

Android Phone:

  • 2 GUA
  • 4 ULA
  • 1 link-local

A linux NAS device:

  • 1 GUA
  • 2 ULA
  • 1 link-local
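Purely illustrative arithmetic on those counts: with <local socket, remote address, remote peer> as the path key, the numbers above multiply quickly. The factor of 2 sockets below is an assumption standing in for the primary and secondary ports; the address counts are the peer’s from the list above.

```shell
# peer: 2 interfaces x 6 inet6 addrs (3 GUA + 2 ULA + 1 link-local) = 12 remote addresses
# Pi:   2 interfaces x 2 sockets (primary + secondary port, assumed) = 4 local sockets
echo $(( (2 * 6) * (2 * 2) ))   # candidate tuples toward one peer, before counting IPv4
# prints: 48
```

With a third (UPnP) socket in the mix, the same arithmetic lands above the 64-path cap, which fits what was observed.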

I don’t think so? The ULAs are local-LAN only (correct me if I’m wrong). They wouldn’t help with NAT traversal. Nodes on the same physical LAN could use ULA addresses to peer, but they can use the global addresses and IPv4 too.

In any case these are mitigations for something we need to solve correctly, as joseph mentioned.

In addition to secondary port, there’s a third port under portMappingEnabled:

{ "settings": { "allowSecondaryPort": false, "portMappingEnabled": false }}

That’s the UPnP port. You can likely disable that, especially if you’re not using UPnP in your routers, or if you have nodes that aren’t behind NAT (cloud VMs).

Maybe blocking the fd00::/7 addresses is the most effective for the least trouble. It’s probably the local LAN where the most “paths” are being created.
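For reference, blacklisting the ULA range in local.conf would look something like this (using ZeroTier’s physical-path blacklist config; fd00::/7 covers the Unique Local Addresses):

```json
{
  "physical": {
    "fd00::/7": { "blacklist": true }
  }
}
```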

Have you seen a reduction in cpu usage?