I’m having performance “bottlenecks” when using opnsense with zerotier. This is a long post, as I’ve done a full day’s worth of testing to get where I am. I am a network engineer, so I’m not exactly new to networking, nor do I think I have something horribly misconfigured. But I’m definitely open to being told I have something wrong.
I have multiple sites, and I’ve been in a “testing” phase for years. Multiple customers are very interested in zerotier, but with these performance issues it’s a no-go until this is resolved. I previously posted about this issue in 2022 but got no answers, so I decided to redo my testing. Here are my results…
I’ve always been relatively disappointed in zerotier VPN performance despite using it for testing for almost 3 years. I’d always assumed ZT was “just slow” until I did some googling and more testing. A few websites benchmarked ZT as nearly identical to IPsec and other VPNs, and when I saw people reporting 400Mb/sec, I started questioning whether my performance was expected or whether something was actually wrong on my end. So I started digging deeper…
I have 2 different ZT networks and different sites. All are using Opnsense and all are on the latest versions of opnsense and the os-zerotier plugin as of 5/23/2024. I’ve chosen 2 sites close to each other by latency (and geographically). In this case, the 2 sites are about 5 miles apart geographically. Of course, that doesn’t mean that the network won’t go through some ridiculously long path to get between each other, but the latency is lowest between these 2 points. I’m doing these tests at 5AM local time for both sides to ensure that daytime workloads wouldn’t affect my test.
At the time of this writing, opnsense is on version 24.1.7_4-amd64 and the os-zerotier plugin is on version 1.3.2_4. That translates to the zerotier client software being version 1.14.0.
Let’s pretend I have 2 sites. Site-A has router-A and desktop-A (it’s a business-class internet account, but at a home) with internet speeds of 500Mb down and 35Mb up. My ISP actually gives 10% more than the rated speed to make sure people don’t complain about “slow speeds”.
Site-B has router-B and Server-B, with internet speeds of 1Gb down and 1Gb up (this is also a home, but with fiber to the home).
Here’s the specs for the two routers:
Site-A:
Intel Atom C2758 (8c/8t) with 32GB of RAM.
Site-B:
Intel Atom C3758 (8c/8t) with 16GB of RAM.
Despite their relatively low-powered CPUs, these would appear to be overkill. We spec’d these systems some years ago and they’ve never been a bottleneck before, even when I experimented with wireguard. Sadly, I don’t have wireguard configured, so I can’t easily compare speeds; I may end up doing that next “just because”, but wireguard isn’t something my customers are looking for, so it would literally be only for testing. The 15-minute load average on both systems is typically less than 0.30. I’ve tried to determine if I’m missing a specific CPU instruction set on one side (or both), but haven’t been able to prove it. The deeper I dug, the less likely this scenario seemed.
On to the testing…
I did some ping testing:
Site-A router to Site-B router via external IPs:
20 packets transmitted, 20 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 7.879/13.015/14.817/2.033 ms
Site-B router to Site-A router via external IPs:
20 packets transmitted, 20 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 10.220/13.795/16.772/1.349 ms
Site-A router to Site-B router via zerotier IPs:
20 packets transmitted, 20 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 11.397/12.501/14.094/0.699 ms
Site-B router to Site-A router via zerotier IPs:
20 packets transmitted, 20 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 11.411/12.788/14.402/0.888 ms
Site-A router to Site-B server:
20 packets transmitted, 20 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 10.951/13.201/18.739/1.685 ms
Site-B router to Site-A desktop:
20 packets transmitted, 20 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 11.186/12.699/13.960/0.786 ms
I performed speed tests using speedtest.net from a Windows desktop on each end:
Site-A - 480Mb/sec down and 33Mb/sec up. Seems pretty good for expecting 500Mb/sec down and 35Mb/sec up.
Site-B – 948Mb/sec down and 949Mb/sec up. Again, seems pretty good for expecting 1Gb/sec down and 1Gb/sec up.
My sites have a config file that is similar to this:
{
  "physical": {
    "192.168.0.0/16": { "blacklist": true },
    "172.16.0.0/8": { "blacklist": true },
    "10.0.0.0/8": { "blacklist": true },
    "127.0.0.0/8": { "blacklist": true },
    "10.8.0.0/16": { "blacklist": true },
    "192.168.6.0/24": { "trustedPathId": 12345 }
  },
  "settings": {
    "primaryPort": 9992,
    "portMappingEnabled": false,
    "allowSecondaryPort": false,
    "interfacePrefixBlacklist": [ "ovpn" ],
    "allowTcpFallbackRelay": false
  }
}
The physical interfaces that are blocked vary from site to site depending on what subnets are local, etc.
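One side note on that blacklist: “172.16.0.0/8” isn’t a strictly valid CIDR (the network address for a /8 would be 172.0.0.0, and the RFC 1918 private range is 172.16.0.0/12). I don’t know whether zerotier’s local.conf parser rejects it, tolerates it, or widens it, but a quick stdlib check flags it:

```python
# Sanity-check the blacklist CIDRs from local.conf with Python's stdlib.
# Note: 172.16.0.0/8 has host bits set -- the RFC 1918 range is 172.16.0.0/12.
import ipaddress

for cidr in ["192.168.0.0/16", "172.16.0.0/8", "10.0.0.0/8",
             "127.0.0.0/8", "10.8.0.0/16", "192.168.6.0/24"]:
    try:
        ipaddress.ip_network(cidr)  # strict=True by default
        print(cidr, "ok")
    except ValueError as err:
        print(cidr, "INVALID:", err)
```

Whether or not zerotier cares, 172.16.0.0/12 is probably what was intended there.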
All sites involved in testing also have only 1 zerotier network configured.
I then have firewall rules in place to allow IPv4 TCP and UDP traffic on the appropriate port.
Now for some iperf3 testing to see what kind of throughput I get:
Site-A router to Site-B router via external IPs:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 40.4 MBytes 33.9 Mbits/sec 21 sender
[ 5] 0.00-10.01 sec 39.5 MBytes 33.1 Mbits/sec receiver
Just a tad below 35Mb/sec, so roughly saturation for the uplink from Site-A.
Site-B router to Site-A router via external IPs:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 436 MBytes 365 Mbits/sec 154 sender
[ 5] 0.00-10.01 sec 435 MBytes 364 Mbits/sec receiver
Given that the receiving side is limited to 500Mb/sec down, this isn’t great, but it isn’t terrible. I wish it were better, but like all ISPs, they oversubscribe their internet. Moving on…
Site-A router to Site-B router via zerotier IPs:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 39.8 MBytes 33.3 Mbits/sec 30 sender
[ 5] 0.00-10.02 sec 38.9 MBytes 32.5 Mbits/sec receiver
Again, just a tad below saturation speed for one side, so I’m happy with the result.
Site-B router to Site-A router via zerotier IPs:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.02 sec 160 MBytes 134 Mbits/sec 575 sender
[ 5] 0.00-10.03 sec 159 MBytes 133 Mbits/sec receiver
Okay, so that’s very ugly. I just lost over ½ of my performance versus the external IP test, with a LOT more retransmits. I then ran a 60-second test to see if it was a fluke, and it is not. The performance is consistently poor, and I rack up 50-65 retransmits every second during the iperf test…
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 956 MBytes 134 Mbits/sec 3445 sender
[ 5] 0.00-60.03 sec 956 MBytes 134 Mbits/sec receiver
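Some back-of-envelope numbers to put that in context (rough figures; a 1448-byte MSS is assumed): the TCP window needed to sustain a given rate at the ~12.8ms RTT measured above, and the segment loss rate implied by iperf3’s Retr counter.

```python
# Back-of-envelope: bandwidth-delay product at the measured RTT, and the
# approximate loss rate implied by the iperf3 retransmit counter.

def window_bytes(rate_mbps: float, rtt_ms: float) -> float:
    """TCP window (bytes) needed to fill rate_mbps at rtt_ms (the BDP)."""
    return rate_mbps * 1e6 / 8 * (rtt_ms / 1e3)

def loss_rate(retransmits: int, bytes_sent: float, mss: int = 1448) -> float:
    """Approximate segment loss rate from iperf3's Retr counter."""
    return retransmits / (bytes_sent / mss)

rtt = 12.8  # ms, roughly the zerotier-path average from the ping tests
print(f"BDP at 365 Mb/s: {window_bytes(365, rtt)/1024:.0f} KiB")  # external-IP run
print(f"BDP at 134 Mb/s: {window_bytes(134, rtt)/1024:.0f} KiB")  # zerotier run
print(f"Implied loss:    {loss_rate(575, 160e6)*100:.2f} %")      # 575 Retr / 160 MB
```

Roughly half a percent loss at this RTT would be enough to keep the congestion window well below the BDP, which would be consistent with the throughput collapse.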
Now, on rare occasions iperf3 tests don’t reflect reality, so I also did some SMB transfers from one site to the other via the VPN. This is basically a test of what a real-world workload would look like for my customers if they were using zerotier. I’d love to do a comparison test over the external internet, but I cannot for security reasons.
When pulling a large 4GB file from the Site-B server to the Site-A desktop, I get approximately 8.5MB/sec (roughly 70Mb/sec). That is again about ½ of the speed I saw with iperf3, which was already ½ of what the iperf3 test over external IPs showed. So I’m getting about ¼ of the speed I was hoping for.
CPU on the Site-B router is about 30% for the zerotier-one process, and the system is 93.8% idle.
CPU on the Site-A router is 40% for the zerotier-one process, and the system is 90% idle.
So I’m not inclined to think that I’m CPU bound.
I then did a packet capture on Desktop-A with wireshark while the SMB transfer was running.
The SMB transfer itself seems just fine, but approximately every second or so I get a sudden “storm” of over 60 duplicate ACKs (yikes).
So that indicates something is “not right”.
I also did a packet capture while running iperf3 from the Site-B server to the Site-A desktop, and got the rather poor speed of 64Mb/sec. I didn’t see the “storm” of duplicate ACKs like I did with SMB, but I’m left asking “what is going on?”
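For anyone following along in wireshark, the heuristic it flags is simple: an ACK that repeats the previous ACK number without advancing is a duplicate ACK, and a run of three normally triggers a fast retransmit at the sender. A toy sketch of that detection (purely illustrative, not my actual capture data):

```python
# Toy illustration of duplicate-ACK detection: an ACK repeating the previous
# ACK number is a "dup ACK"; three in a row trigger a fast retransmit.

def dup_ack_runs(acks):
    """Return the lengths of duplicate-ACK runs in a sequence of ACK numbers."""
    runs, streak = [], 0
    for prev, cur in zip(acks, acks[1:]):
        if cur == prev:
            streak += 1
        elif streak:
            runs.append(streak)
            streak = 0
    if streak:
        runs.append(streak)
    return runs

# A burst of 4 repeats of ACK 1000 amid otherwise normal progress:
print(dup_ack_runs([500, 1000, 1000, 1000, 1000, 1000, 2000]))  # prints [4]
```

A storm of 60+ of these per second points at recurring loss or reordering on the path, not at SMB itself.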
To sum all of this up: aside from the 35Mb/sec upload bottleneck on one side, transfers between sites that should run at >250Mb/sec (and possibly as high as 500Mb/sec) can’t even reach 100Mb/sec. Router-to-router tests over zerotier come in at about ½ the performance of the same tests over external IPs, and as soon as I put a desktop or server on either side, performance is cut in ½ again.
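One avenue I haven’t ruled out is MTU. ZeroTier’s default virtual network MTU is 2800 bytes, so when virtual frames exceed the physical path MTU each frame rides in multiple UDP packets, and losing any one of them loses the whole frame, amplifying the effective loss rate. A rough sketch of that arithmetic (the 80-byte header overhead figure is a guess on my part, not a measured value):

```python
# If the ZeroTier network MTU (default 2800) exceeds the physical path MTU,
# each virtual frame is carried in multiple UDP packets; losing any one of
# them loses the whole frame, amplifying the effective loss rate.

def fragments(zt_mtu: int = 2800, phys_mtu: int = 1500, overhead: int = 80) -> int:
    """UDP packets needed per virtual frame (overhead is a rough guess
    for IP/UDP/ZeroTier headers, not an exact figure)."""
    payload = phys_mtu - overhead
    return -(-zt_mtu // payload)  # ceiling division

def effective_loss(p: float, n: int) -> float:
    """Probability a frame is lost when each of n fragments is lost with prob p."""
    return 1 - (1 - p) ** n

n = fragments()
print(f"{n} packets per frame, 0.5% packet loss -> {effective_loss(0.005, n):.4%} frame loss")
```

Dropping the ZT network MTU to fit a single physical packet would be a cheap experiment to run here.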
So where should I go from here? Anyone have anything they’d like me to test? I’m out of ideas, and I have multiple customers that would love to use Zerotier in their business if I can get this sorted out.