Poor ZeroTier performance with OPNsense

I’m having performance “bottlenecks” when using OPNsense with ZeroTier. This is a complicated one, as I’ve done a full day’s worth of testing to get where I am. I am a network engineer, so I’m not exactly new to networking, nor do I think I have something horribly misconfigured. But I’m definitely open to being told I have something wrong.

I have multiple sites, and I’ve been in a “testing” phase for years. Multiple customers are very interested in ZeroTier, but with the performance issues, it’s a no-go until this is resolved. I did previously post about this issue in 2022, but got no answers. I decided to redo my testing, and here are my results…

I’ve always been relatively disappointed in ZeroTier’s VPN performance despite using it for testing for almost 3 years. I’d always assumed ZT was “just slow” until I did some googling and more testing. A few websites showed ZT performing nearly identically to IPsec and other VPNs. When I saw some people doing 400Mb/sec, I started questioning whether my performance is expected or if something is actually wrong on my part. So I started digging deeper…

I have 2 different ZT networks across different sites. All are running OPNsense, and all are on the latest versions of OPNsense and the os-zerotier plugin as of 5/23/2024. I’ve chosen 2 sites close to each other by latency (and geographically); in this case, the 2 sites are about 5 miles apart. Of course, that doesn’t mean the traffic won’t take some ridiculously long path between them, but the latency is lowest between these 2 points. I’m doing these tests at 5AM local time for both sides to ensure that daytime workloads don’t affect the results.

At the time of this writing, OPNsense is on version 24.1.7_4-amd64 and the os-zerotier plugin is on version 1.3.2_4, which translates to the ZeroTier client software being version 1.14.0.

Let’s pretend I have 2 sites. Site-A has router-A and desktop-A (it’s a business-class internet account, but at a home) with internet speeds of 500Mb down and 35Mb up. My ISP actually gives 10% more than the rated speeds to make sure people don’t complain about “slow speeds”.

Site-B has router-B and server-B, with internet speeds of 1Gb down and 1Gb up (this is also a home, but with fiber to the home).

Here’s the specs for the two routers:

Site-A:
Intel Atom C2758 (8c/8t) with 32GB of RAM.

Site-B:
Intel Atom C3758 (8c/8t) with 16GB of RAM.

Despite their relatively low-powered CPUs, these would appear to be overkill. We spec’d these systems some years ago and they’ve never been a bottleneck before, even when I experimented with WireGuard. Sadly, I don’t have WireGuard configured, so I can’t easily compare speeds, but I may end up doing that next “just because”. However, WireGuard isn’t something my customers are looking for, so it would literally be only for testing. The load average on both systems is typically less than 0.30 over 15 minutes. I’ve tried to determine if I’m missing a specific CPU instruction set on one side (or both), but I haven’t really been able to prove it. The deeper I dug, the less likely this scenario seemed.

On to the testing…

I did some ping testing:
Site-A router to Site-B router via external IPs:
20 packets transmitted, 20 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 7.879/13.015/14.817/2.033 ms

Site-B router to Site-A router via external IPs:
20 packets transmitted, 20 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 10.220/13.795/16.772/1.349 ms

Site-A router to Site-B router via zerotier IPs:
20 packets transmitted, 20 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 11.397/12.501/14.094/0.699 ms

Site-B router to Site-A router via zerotier IPs:
20 packets transmitted, 20 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 11.411/12.788/14.402/0.888 ms

Site-A router to Site-B desktop:
20 packets transmitted, 20 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 10.951/13.201/18.739/1.685 ms

Site-B router to Site-A desktop:
20 packets transmitted, 20 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 11.186/12.699/13.960/0.786 ms

I performed speed tests using speedtest.net from a Windows desktop on each end:
Site-A - 480Mb/sec down and 33Mb/sec up. Seems pretty good against the expected 500Mb/sec down and 35Mb/sec up.
Site-B - 948Mb/sec down and 949Mb/sec up. Again, seems pretty good against the expected 1Gb/sec down and 1Gb/sec up.

My sites each have a ZeroTier local.conf that is similar to this:

{
  "physical": {
    "192.168.0.0/16": { "blacklist": true },
    "172.16.0.0/8": { "blacklist": true },
    "10.0.0.0/8": { "blacklist": true },
    "127.0.0.0/8": { "blacklist": true },
    "10.8.0.0/16": { "blacklist": true },
    "192.168.6.0/24": { "trustedPathId": 12345 }
  },
  "settings": {
    "primaryPort": 9992,
    "portMappingEnabled": false,
    "allowSecondaryPort": false,
    "interfacePrefixBlacklist": [ "ovpn" ],
    "allowTcpFallbackRelay": false
  }
}

The physical interfaces that are blocked vary from site to site depending on what subnets are local, etc.
All sites involved in testing also have only 1 zerotier network configured.
I then have the appropriate firewall rules to allow IPv4 TCP and UDP traffic on the chosen port.
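
As a sanity check, something like the following confirms the settings were actually picked up (a minimal sketch; /var/db/zerotier-one is the FreeBSD/OPNsense default home directory, and jq may need to be installed):

# show the settings as the running daemon sees them
zerotier-cli info -j | jq '.config.settings'

# confirm the daemon is only bound to the chosen primaryPort (9992 here)
sockstat -4l | grep zerotier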

Now for some iperf3 testing to see what kind of throughput I get:

Site-A router to Site-B router via external IPs:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 40.4 MBytes 33.9 Mbits/sec 21 sender
[ 5] 0.00-10.01 sec 39.5 MBytes 33.1 Mbits/sec receiver

Just a tad below 35Mb/sec, so roughly saturation for the uplink from Site-A.

Site-B router to Site-A router via external IPs:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 436 MBytes 365 Mbits/sec 154 sender
[ 5] 0.00-10.01 sec 435 MBytes 364 Mbits/sec receiver

Given that the receiving side is limited to 500Mb/sec down, this isn’t great, but it isn’t terrible either. I wish it were better, but like all ISPs, they oversubscribe their links. Moving on…

Site-A router to Site-B router via zerotier IPs:

[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 39.8 MBytes 33.3 Mbits/sec 30 sender
[ 5] 0.00-10.02 sec 38.9 MBytes 32.5 Mbits/sec receiver

Again, just a tad below saturation speed for one side, so I’m happy with the result.

Site-B router to Site-A router via zerotier IPs:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.02 sec 160 MBytes 134 Mbits/sec 575 sender
[ 5] 0.00-10.03 sec 159 MBytes 133 Mbits/sec receiver

Okay, so that’s very ugly. I just lost over half my throughput compared to the external IP test, with a LOT more retransmits. I then did a 60 second test to see if it was a fluke, and it is not. The performance is consistently poor, and I rack up 50-65 retransmits every second during the iperf test…

[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 956 MBytes 134 Mbits/sec 3445 sender
[ 5] 0.00-60.03 sec 956 MBytes 134 Mbits/sec receiver
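
For reference, all of the iperf3 runs above were plain TCP tests, roughly along these lines (a sketch; the addresses are placeholders):

# receiving side
iperf3 -s

# sending side, 10- and 60-second runs
iperf3 -c <remote-ip> -t 10
iperf3 -c <remote-ip> -t 60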

Now, on rare occasions I see that iperf3 tests don’t reflect reality, so I did some SMB transfers from one site to the other via the VPN. This was basically a test of what a real-world workload would look like for my customers if they were using ZeroTier. I’d love to do a test over the external internet, but I cannot for security reasons.

When pulling a large 4GB file from the Site-B server to the Site-A desktop, I get approximately 8.5MB/sec (roughly 68Mb/sec). That’s again about half the speed I saw with iperf3 over ZeroTier, which was already half of the iperf3 result over the external IPs. So I’m getting about a quarter of the speed I was somewhat hoping for.
CPU on the Site-B router is about 30% for the zerotier-one process, and the system is 93.8% idle.
CPU on the Site-A router is 40% for the zerotier-one process, and the system is 90% idle.
So I’m not inclined to think that I’m CPU bound.
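
For reference, a per-thread view makes a pinned core easy to spot during a transfer (a sketch; FreeBSD’s top with -H for threads, -S for system processes, and -P for per-CPU usage):

# look for a single zerotier-one thread stuck near 100% even though
# the box as a whole is mostly idle
top -HSP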

I then did a packet capture on desktop-A with Wireshark while the SMB transfer was running.

The SMB transfer itself seems fine, but approximately every second or so I get a sudden “storm” of over 60 duplicate ACKs (yikes).
So that indicates something is “not right”.
I also did a packet capture while running iperf3 from the Site-B server to the Site-A desktop, and got the rather poor speed of 64Mb/sec. I didn’t see the “storm” of duplicate ACKs like I did with SMB. But I’m left asking “what is going on?”
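
For anyone who wants to quantify that from a capture, tshark’s TCP analysis flags make it easy (a sketch; the capture filename is made up):

# count duplicate ACKs and retransmissions flagged in the capture
tshark -r smb-transfer.pcapng -Y "tcp.analysis.duplicate_ack" | wc -l
tshark -r smb-transfer.pcapng -Y "tcp.analysis.retransmission" | wc -l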

To sum all of this up: aside from the 35Mb/sec upload bottleneck on one side, transfers between sites that should be able to run at >250Mb/sec, and possibly as high as 500Mb/sec, can’t even reach 100Mb/sec. Router-to-router tests over ZeroTier come in at about half the performance of the same tests over the external IPs, and as soon as a desktop or server on either side is involved, the performance is cut in half again.

So where should I go from here? Anyone have anything they’d like me to test? I’m out of ideas, and I have multiple customers that would love to use Zerotier in their business if I can get this sorted out.

I have a box with a C3758 in it, and I get this for iPerf results between ZeroTier nodes:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   599 MBytes   502 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   597 MBytes   501 Mbits/sec                  receiver

This is between hosts on the LAN, so it’s a best-case scenario. I’m only doing firewall, and no NAT, on these boxes. Also, the other end isn’t a C3758, so it won’t be fully apples to apples. I do get >500Mbps both sending and receiving though, so that should generally rule out any remote-side performance variability. It is also on VyOS, so it’s Linux vs. BSD; Linux generally has slightly better routing performance than BSD, but certainly not 3x the performance.

It’s possible the C2758 is a bottleneck (I don’t have one to test). You can run separate iPerf tests between the C2758/C3758 and a more powerful host (both locally, and across your WAN) to see if either one is a bottleneck. The site local tests will also help you rule out the connection between sites as an issue (as well as help identify any loss that could be occurring on physical interfaces at the sites). Assuming you don’t observe any bottlenecks, we can look at a few things.

  • Ensure that the peers you’re talking between are directly connected, and not being tunneled or relayed. Given the speed of >100Mbps, I’m guessing it is direct.
  • Run a long-lived iPerf test (like 300 seconds) between the hosts, and monitor the peer state for changes (use the watch command, or see the sketch after this list). If there are a lot of path changes occurring, it could cause problems. You can also run: zerotier-cli info -j | jq '.config.settings.listeningOn' to see if ZeroTier is listening on any interfaces you weren’t aware of, then just add them to the blacklist scheme you already have.
  • Your throughput will be less than the physical path speed, so nothing queues the excess packets; they simply get dropped, and TCP does what it knows to do when it sees drops as congestion. The impact of this will only get worse as latency increases. You can apply a shaper to the sending side of your iPerf tests and see if that cleans up the throughput. Start at 200Mbps, and if your traffic cleans up, bump the value up in 50Mbps increments until you start seeing a limit, then keep the shaper at that level. You can also just do UDP iPerf tests and keep bumping up the target until you start seeing a large number of lost datagrams (see the sketch after this list).
  • Monitor the throughput on Site-A’s ZeroTier interface. It’s possible it is seeing congestion (>35Mbps or so based on what you wrote), and it’s actually the TCP ACKs that are causing TCP to back off.
  • Check for fragmentation. You may need to limit the MTU or MSS of the path. OPNsense should support MSS Clamping, so should be an easy one to test. Just drop it to 1350 or something to quickly see if that helps anything. You can also look for fragmentation in packet captures to fully verify if it’s happening.
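
A minimal sketch of the monitoring and rate-limited tests above (the addresses are placeholders, and the shell loop stands in for watch since GNU watch isn’t in FreeBSD/OPNsense base):

# watch the peer list for path changes during a long iperf3 run
while true; do clear; zerotier-cli peers; sleep 1; done

# list the interfaces/addresses zerotier is actually bound to
zerotier-cli info -j | jq '.config.settings.listeningOn'

# rate-limited TCP test from the sending side; bump -b in 50M steps
iperf3 -c <remote-zt-ip> -t 300 -b 200M

# UDP test; raise the target until lost datagrams start climbing
iperf3 -c <remote-zt-ip> -u -b 200M -t 60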

If you do determine that the hardware is a bottleneck, you have a couple of options:

  • The obvious one is to upgrade the hardware. I have a box with a 13700H that does 4.5Gbps with a single ZeroTier instance.
  • Scale out horizontally and run additional ZeroTier instances (see the sketch below). This would require you to enable ECMP routing to take full advantage of it.

I showed both of those in an article where I hit over 21Gbps using decent hardware and ECMP.
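
Roughly, a second instance looks like this (a sketch only; the home directory and port are arbitrary, and the new instance has its own identity that must be joined and authorized to the network separately):

# start a second zerotier-one daemon with its own home directory and port
mkdir -p /var/db/zerotier-two
zerotier-one -d -p9994 /var/db/zerotier-two

# point the CLI at that home directory to manage the second instance
zerotier-cli -D/var/db/zerotier-two join <network-id>

Each instance then shows up as its own member, and the router spreads traffic across them with ECMP routes.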

Other things to consider are:

  • What’s the latency between sites?
  • Are you running any intensive functions in OPNsense that would peg the CPU (like IDP/IDS)? This could decrease performance of ZeroTier.

Firstly, thank you for the post. Your details and data are far more than I ever thought I’d get!

Some of what you asked for requires testing, and I’m currently testing WireGuard again just to see what kind of performance numbers I get between these 2 test sites. Right now I get about 12-15MB/sec with SMB. That’s about 50-100% more than ZeroTier. However, it’s not my first choice, because if I tie a bunch of sites together, I have to either do hub-and-spoke (which comes with its own limitations on performance and reliability) or create a rather large set of WireGuard s2s connections. Not terrible to do, but a pain in the butt to manage. I just loved the simplicity of ZeroTier and how easy it is to manage all of the sites’ connections from one panel.

I did try to rule this out. I can’t say so with 100% certainty (and I have a high bar before I’ll draw that conclusion), but the zerotier-one thread never hits more than about 30% CPU usage on a single core, and no cores max out during testing. I was wondering if I’m missing some optimizing CPU instruction set, because the C2758 is rather old, but I’m not sure how to actually check that on a BSD executable. I will be going down this rabbit hole at some point, as I’d really like to learn how to check which CPU instruction sets a given app supports/requires.
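
A starting point for that rabbit hole, as a rough sketch (the binary path assumes the FreeBSD port’s default install location, and llvm-objdump is what ships in FreeBSD base these days):

# what the CPU itself advertises (look for AESNI, AVX, etc.)
grep -i features /var/run/dmesg.boot

# whether the zerotier-one binary actually contains AES-NI instructions
llvm-objdump -d /usr/local/sbin/zerotier-one | grep -c -i aesenc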

Yes, everything is directly connected. All network connections are wired, and 1Gb or greater. Most of these networks are fully 10Gb or 40Gb because “why not?”. There are no relays or tunnels set up.

You know, I thought about this 2 years ago, and I never actually tried that. I may have to give that a go for science!

I actually did play around with MTUs 2 years ago. With the same 2 sites being tested, I found that altering the MTU resulted in the same or worse performance. However, given that a lot has changed in 2 years, this may warrant retesting.
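
When I retest, a quick way to check the physical path is don’t-fragment pings between the external IPs (a sketch using FreeBSD’s ping flags; 1472 bytes of payload plus ICMP and IP headers makes a full 1500-byte packet):

# walk the size down until replies come back without "frag needed" errors
ping -D -s 1472 -c 4 <remote-external-ip>
ping -D -s 1400 -c 4 <remote-external-ip>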

I also played around with the new_reno and cc_cubic algorithms just to see if it would matter. From my experience, I expected cubic to be the better choice, but in this particular scenario it didn’t seem to matter. Quite a few other algorithms are supported on OPNsense, but they seemed very inappropriate for this particular scenario.
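
For anyone wanting to repeat that comparison, the switch is just a couple of sysctls on FreeBSD/OPNsense (a sketch; cc_cubic may need to be loaded as a module first depending on the release):

# list available congestion control modules and show the active one
sysctl net.inet.tcp.cc.available
sysctl net.inet.tcp.cc.algorithm

# load cubic if it isn't listed, then switch to it
kldload cc_cubic
sysctl net.inet.tcp.cc.algorithm=cubic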

I’ve actually been considering upgrading the hardware for the C2758 (that’s my home router). However, I prefer to stick with server-grade hardware: something that isn’t going to be a power hog since these run 24x7, isn’t too expensive, and is preferably Intel based.

I generally stick with Intel-based systems because they are more likely to “just work” with the FreeBSD drivers in OPNsense. AMD systems that are well supported on FreeBSD exist, but they’re harder to find; AMD just doesn’t put the kind of developer resources into FreeBSD support that Intel does. Also, right now the system is very small, physically the size of a shoe box. It’s hard to meet all of my requirements and also keep a small size. :wink:

I love the idea of scaling out on zerotier. I actually had never considered that idea. I may have to test it out!

I’ll report back what happens.

Thanks again for your rather long and informative post.

The hardware I referenced is Intel, inexpensive (can be <$500), and not a power hog (it actually uses a mobile chip instead of a desktop CPU). It is, however, not server grade, so if that’s a hard requirement then you could be stuck. Compared to buying something from a company like Supermicro, you can easily get 2 of these boxes (realistically a fair bit more, but we’ll say 2 for the sake of argument) for the price of just one of theirs. But just make the best choice for your environment and requirements.

WireGuard is able to leverage all cores, whereas ZeroTier (at least currently) only uses a single core. In my testing, ZeroTier generally has better per-core performance, but overall performance is better with WireGuard. This can be improved by using multiple ZT instances like I mentioned, along with ECMP (basically creating a multithreaded ZT solution).

The fact that you were only able to get 50-100% more performance with WireGuard would indicate that the same issue you’re seeing with ZeroTier is present there as well. Eight cores, even the lower-powered ones in those Atom processors, should be able to hit >1Gbps using WG. This is actually good, since it helps narrow your focus.

Are the hosts you’re conducting the testing from connected at 40Gbps? If so, that gives further credence to what I mentioned about shaping. You have an 80:1 oversubscription ratio (assuming you can do 500Mbps of ZT traffic) with no method of queuing. You’re effectively trying to shove a river through a straw, and TCP will keep bouncing against that limit and continuously dropping into slow-start when there’s no queuing mechanism to provide a graceful tail drop. That is also potentially evident in your packet captures with the duplicate ACKs. You’ll experience this at 10Gbps as well, just to a lesser degree. I’ve seen this countless times within customers’ networks, mostly when a provider has a difference between the port and access rates of a delivered circuit.

Hopefully that gives you some starting points to help narrow things down so you can finally get things working the way you want.