Performance issues when using anything except the OPNsense nodes themselves

I’m hitting performance “bottlenecks” when using ZeroTier on OPNsense. This is a complicated one, as I’ve done a full day’s worth of testing to get where I am.

I’ve always been somewhat disappointed in ZeroTier’s VPN performance despite using it for almost a year. I’d always assumed ZT was “just slow” until I did some googling and more testing. A few websites benchmarked ZT as performing nearly identically to IPsec and other VPNs, and when I saw some people getting 400Mb/sec, I started questioning whether my performance was expected or whether something was actually wrong on my end. So I started digging deeper…

I have 2 different ZT networks across several sites. All are running OPNsense on the latest versions. I’ve chosen 2 sites that are close to each other by latency (and geographically) for testing, neither of which uses HA with OPNsense (High Availability adds another layer of complexity, so I’m trying to forgo dealing with it until I can get non-HA sites working better… baby steps).

Let’s pretend I have 2 sites. Site one has router-A and desktop-A (it’s a business internet account, but at a home) with internet speeds of 500Mb down and 35Mb up. Site two has router-B and server-B (also a home, but with fiber to the home) with internet speeds of 1Gb down and 1Gb up.

My sites have a config file that is like this:

{ "physical": { "192.168.0.0/16": { "blacklist": true }, "10.0.0.0/16": { "blacklist": true } }, "settings": { "primaryPort": 9993, "portMappingEnabled": false, "allowSecondaryPort": false, "allowTcpFallbackRelay": false } }

I then have the appropriate firewall rules to allow traffic on UDP port 9993.
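
Two quick sanity checks for that from the OPNsense (FreeBSD) shell, in case anyone wants to reproduce (the WAN interface name is a placeholder):

root@firewall:~ # sockstat -4l | grep 9993              # confirm zerotier-one is bound to UDP 9993
root@firewall:~ # tcpdump -ni <wan_if> udp port 9993    # confirm two-way ZT traffic on the WAN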

If I do iperf3 from router-A to router-B, I get 36Mbit/sec. That saturates the uplink at site A, so I’m okay with that.

If I do iperf3 from router-B to router-A, I get 312Mb/sec. I’m okay with that speed since there’s 14ms of latency between the two routers. More would be great, but I’m gonna take this issue as a bunch of baby steps instead of being upset that it doesn’t saturate 1Gb. (I’m expecting to need some very large buffers if I really want to do 1Gb with 14ms of latency. I’m no expert at adjusting buffers on OPNsense, and I haven’t found a link that would teach me enough to figure it out on my own, so I’m gonna work on the issues I do understand first.)
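
For reference, the buffer math itself is simple; the bandwidth-delay product for this path is:

BDP = 1 Gbit/sec × 0.014 sec = 14,000,000 bits ≈ 1.75 MB

so roughly a 2MB TCP window would be needed to fill the pipe. iperf3 can test a larger window directly without touching system-wide tunables (a sketch; the address is a placeholder):

root@firewall:~ # iperf3 -c <router-A-zt-ip> -w 2M -t 30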

For “the lolz” I also ran iperf3 from router-B to router-A using the external IPs to compare against the VPN. I got 443Mb/sec.

To summarize thus far:

  1. I can saturate the upstream at site A.
  2. Going from router-B to router-A, I get 443Mb/sec without the VPN and 312Mb/sec with it.

If I then connect to an SMB share on Server-B (TrueNAS) and start downloading a file to Desktop-A (Windows 10), I get a fluctuating 9-12MB/sec (I’ll just call it a 100Mb link for simplicity). That’s about 1/3 of what I got with iperf3 from router-B to router-A. So I did a bunch of things here because I wanted to see exactly what was going wrong:

  1. I did a packet capture on Desktop-A using Wireshark (display filters for isolating these events are sketched right after this list). Things go smoothly for the SMB transfer, but every 1 second or so I get inundated with a crapload of TCP DUP ACKs. I’ll get 100 to 120 DUP ACKs in less than 2ms. At the same time I’ll get a dozen or so TCP out-of-order packets and a fast retransmit. Then another second of what looks like normal SMB traffic, followed by another huge group of dup acks, out-of-order packets, and fast retransmits.
  2. I did an iperf3 test from Server-B to Desktop-A. I get 9-12MB/sec with 5-50 retransmits almost every second per iperf3. SMB and iperf3’s performance seem to be in line with each other, so I probably don’t have an SMB problem. I also checked with Wireshark and saw the same dup acks and such as mentioned in #1 above.
  3. I did iperf3 from Desktop-A to Router-A and vice versa and I got 931Mb/sec (basically a 1Gb connection).
  4. I did an iperf3 from Server-B to Router-B and vice versa and I got 930Mb/sec (basically a 1Gb connection).
  5. So I did iperf3 from Router-B to Desktop-A and I got about 100Mb/sec with 5-20 retransmits every second in the iperf3 output.
  6. I did an iperf3 from Server-B to Router-A and I got about 100Mb/sec with 5-20 retransmits every second in the iperf3 output.

So it seems if I go from router to router directly, all is good. I can also go from desktop on either side to its associated router, and all is good. I get good speeds and no retransmits. But as soon as I want to go from a desktop or server on one site to anything on the other (including the router on the other side), performance and reliability take a nasty nosedive.

To put it another way, I can go from router to router just fine, but if I want to actually use the VPN in any meaningful way using other servers and desktops, it’s unreliable and slow.
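
One thing I still need to rule out is a path-MTU/fragmentation problem on the ZT interface; ZeroTier networks default to a 2800-byte MTU, and traffic the routers source themselves doesn’t necessarily exercise the same path as routed LAN traffic. A way to probe for it (a sketch; addresses are placeholders, this is FreeBSD ping syntax on the routers, and ping -f -l <size> is the Windows equivalent):

root@firewall:~ # ping -D -s 1400 <desktop-A-zt-ip>    # -D sets don't-fragment, -s sets payload size
root@firewall:~ # ping -D -s 2700 <desktop-A-zt-ip>    # step the size up/down to find where it breaks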

Any ideas where to even start to investigate this issue?

I did try emailing the zerotier plugin maintainer for opnsense to start a conversation several weeks ago, but he didn’t respond. I’d really rather not bother him with additional emails unless I can really prove this is a zerotier issue.

As this is affecting business functionality, the company is open to the idea of paying someone to troubleshoot and identify the issue. But I don’t know where to start. Is it an OPNsense problem and I need an OPNsense expert? Is it a ZeroTier problem? If so, is it the plugin itself or the ZeroTier code? Is it just a configuration problem on my part? I know the ZeroTier documentation at OPNsense | ZeroTier Documentation makes it pretty clear that ZeroTier, Inc. doesn’t maintain the OPNsense implementation. From some of the “official” posts in the ZeroTier forums I get the impression ZeroTier, Inc. has the attitude of “we don’t do opnsense, so if it works, great, and if it doesn’t, don’t talk to us about it either”.

Thanks to whoever read this all the way to the end. I realize this is a lot to swallow.

Thanks for doing all that work and writing this up. We’re discussing it and asking around to see if anyone has any ideas.

Thanks for all of the info. Not sure what’s happening but can you test the latency between two troubled devices (over ZT) and also record the latency variance over time? I’d like to know if the variance is higher when over ZT. I’m wondering if something inside ZT is creating too noisy of an environment for the SMB protocol to function properly.
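
Something like this from the router shell would do it (the address is a placeholder); run it once over ZT and once against the remote WAN IP, then compare the stddev figures:

root@firewall:~ # ping -c 300 <remote-zt-ip>    # FreeBSD's summary line reports round-trip min/avg/max/stddev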

Can you blacklist 172.16.0.0/12 in local.conf too, just in case?
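
Something like this in the physical section, matching your existing format:

"physical": {
  "192.168.0.0/16": { "blacklist": true },
  "10.0.0.0/16": { "blacklist": true },
  "172.16.0.0/12": { "blacklist": true }
}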

Somehow I wasn’t aware of these updates. I’ll apply that blacklist right now as well as start collecting ping data.

Blacklist is applied. I did 5 minutes of pings between my desktop at my house and a server at the other site.

I had min/max/avg of 8ms/16ms/12ms.

All but 2 pings were less than 14ms.

That sounds better? Is iperf any better?

Not really. Here’s a 10-second iperf3 test (in theory, up to 500Mb/sec should be possible).

root@firewall:~ # iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.1.2, port 3970
[  5] local 192.168.6.1 port 5201 connected to 192.168.1.2 port 46109
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  19.5 MBytes   163 Mbits/sec   48    158 KBytes
[  5]   1.00-2.00   sec  19.9 MBytes   167 Mbits/sec   52    215 KBytes
[  5]   2.00-3.00   sec  20.4 MBytes   171 Mbits/sec   54    335 KBytes
[  5]   3.00-4.00   sec  18.3 MBytes   154 Mbits/sec   36    175 KBytes
[  5]   4.00-5.00   sec  10.8 MBytes  90.5 Mbits/sec    2    191 KBytes
[  5]   5.00-6.00   sec  11.1 MBytes  93.2 Mbits/sec    2    210 KBytes
[  5]   6.00-7.00   sec  12.0 MBytes   100 Mbits/sec    2    202 KBytes
[  5]   7.00-8.00   sec  9.76 MBytes  81.9 Mbits/sec    4    134 KBytes
[  5]   8.00-9.00   sec  10.7 MBytes  90.0 Mbits/sec    1    169 KBytes
[  5]   9.00-10.00  sec  11.0 MBytes  92.3 Mbits/sec    1    148 KBytes
[  5]  10.00-10.02  sec  26.8 KBytes  16.6 Mbits/sec    0    148 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.02  sec   143 MBytes   120 Mbits/sec  202             sender
-----------------------------------------------------------

That is from router to router over the VPN (which is now markedly worse).

I did an iperf3 over the open internet without the VPN between the sites and I got:

root@firewall:~ # iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from a.b.c.d, port 9459
[  5] local a.b.c.d port 5201 connected to e.f.g.h port 40789
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  23.5 MBytes   197 Mbits/sec   18    565 KBytes
[  5]   1.00-2.00   sec  28.5 MBytes   239 Mbits/sec   33    318 KBytes
[  5]   2.00-3.00   sec  18.8 MBytes   157 Mbits/sec    0    395 KBytes
[  5]   3.00-4.00   sec  26.1 MBytes   219 Mbits/sec    0    482 KBytes
[  5]   4.00-5.00   sec  24.6 MBytes   206 Mbits/sec    0    552 KBytes
[  5]   5.00-6.00   sec  33.9 MBytes   284 Mbits/sec    0    635 KBytes
[  5]   6.00-7.00   sec  25.2 MBytes   211 Mbits/sec    2    359 KBytes
[  5]   7.00-8.00   sec  21.4 MBytes   179 Mbits/sec    0    438 KBytes
[  5]   8.00-9.00   sec  28.3 MBytes   237 Mbits/sec    0    523 KBytes
[  5]   9.00-10.00  sec  29.9 MBytes   251 Mbits/sec    0    602 KBytes
[  5]  10.00-10.01  sec   597 KBytes   430 Mbits/sec    0    603 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.01  sec   261 MBytes   218 Mbits/sec   53             sender

So “better”, but still markedly worse than I was getting before. I’m not sure if that’s because of the time of day or the settings changes. I’ll test again tonight and tomorrow morning to see if it improves.

Thanks for your help so far.

Okay, so between the two sites there’s a lot of offline fiber. We had a really bad storm here 36 hours ago that took down a bunch of trees, etc., and some of my ISP’s backbone was affected. They’re expecting to be back up by the weekend.

In the meantime, I’ve done more comparisons, and zerotier is still significantly slower than iperf tests done across the open internet. Pings continue to be “about the same” between the sites regardless of whether you ping the WAN or the LAN interfaces.

Any other suggestions?

Edit: I did notice that the zerotier interfaces have an IPv6 address. I’m not using IPv6 anywhere, but OPNsense does some limited IPv6 out of the box, and I may not have fully disabled it. This could be a dead end, but is there a way to disable IPv6 for zerotier?
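
One test I may try in the meantime: FreeBSD can disable IPv6 processing on a single interface, so something like the following on the ZT interface should work (the interface name is a placeholder, and I don’t know whether the plugin re-enables it on restart):

root@firewall:~ # ifconfig <zt_if> inet6 ifdisabled    # turns off IPv6 on this interface only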
