Performance issues with any traffic except between the OPNsense nodes themselves

I’m hitting performance “bottlenecks” when using ZeroTier on OPNsense. This is a complicated one, as it took a full day’s worth of testing to get to where I am.

I’ve always been relatively disappointed in ZeroTier VPN performance, despite our using it for almost a year. I’d always assumed ZT was “just slow” until I did some googling and more testing. A few websites benchmarked ZT as nearly identical to IPsec and other VPNs. When I saw some people doing 400Mb/sec, I started questioning whether my performance was expected or whether something was actually wrong on my end. So I started digging deeper…

I have 2 different ZT networks across different sites. All are using OPNsense and all are on the latest versions. I’ve chosen 2 sites close to each other by latency (and geographically) that I can test on and that are not using High Availability with OPNsense (HA adds another layer of complexity, so I’m trying to forgo dealing with it until I can get the non-HA sites working better… baby steps).

Let’s pretend I have 2 sites. Site one has router-A and desktop-A (it’s a business internet account, but at a home) with internet speeds of 500Mb down and 35Mb up. Site two has router-B and Server-B with internet speeds of 1Gb down and 1Gb up (this is also a home, but with fiber to the home).

My sites each have a ZeroTier local.conf that looks like this:

{
  "physical": {
    "192.168.0.0/16": { "blacklist": true },
    "10.0.0.0/16": { "blacklist": true }
  },
  "settings": {
    "primaryPort": 9993,
    "portMappingEnabled": false,
    "allowSecondaryPort": false,
    "allowTcpFallbackRelay": false
  }
}
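One thing worth ruling out when editing local.conf by hand: as far as I can tell, ZeroTier falls back to defaults if the file isn’t valid JSON, so a quick parse check catches typos before you chase phantom network problems. A minimal sketch (with the config contents pasted in as a string):

```python
import json

# Paste your local.conf contents here (this is the config from above).
local_conf = """
{ "physical": { "192.168.0.0/16": { "blacklist": true },
                "10.0.0.0/16": { "blacklist": true } },
  "settings": { "primaryPort": 9993, "portMappingEnabled": false,
                "allowSecondaryPort": false, "allowTcpFallbackRelay": false } }
"""

# json.loads raises json.JSONDecodeError if there's a syntax error anywhere.
conf = json.loads(local_conf)

# Confirm the knobs we care about actually made it into the parsed config.
assert conf["settings"]["primaryPort"] == 9993
assert conf["physical"]["10.0.0.0/16"]["blacklist"] is True
print("local.conf parses OK, top-level keys:", list(conf))
```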

I then have the appropriate firewall rules to allow traffic on UDP port 9993.

If I do iperf3 from router-A to router-B, I get 36Mbit/sec. That saturates the uplink at site A so I’m okay with that.

If I do iperf3 from router-B to router-A, I get 312Mb/sec. I’m okay with that speed since there’s 14ms of latency between the two routers. More would be great, but I’m gonna take this issue in baby steps instead of being upset that it doesn’t saturate 1Gb. (I’m expecting to need some very large buffers if I really want to do 1Gb at 14ms of latency. I’m no expert at tuning buffers on OPNsense, and I haven’t found a link that would teach me enough to figure it out on my own, so I’m gonna work on the issues I do understand first.)
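For reference, the buffer question can be put into numbers with the bandwidth-delay product. This is just back-of-the-envelope math using the figures above, not anything measured on the routers:

```python
# Bandwidth-delay product: how many bytes must be "in flight" to fill the pipe.
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> int:
    return round(bandwidth_bps / 8 * rtt_s)

# Filling 1Gb/sec at 14ms RTT needs roughly a 1.75MB TCP window:
print(bdp_bytes(1e9, 0.014))    # 1750000 bytes
# The observed 312Mb/sec corresponds to about a 546KB window:
print(bdp_bytes(312e6, 0.014))  # 546000 bytes
```

So the observed ~312Mb/sec would be consistent with roughly a 512KB effective window somewhere in the path, which supports the “very large buffers needed” hunch.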

For “the lolz” I did do iperf3 from router-B to router-A using the external IPs to compare against the VPN. I got 443Mb/sec.

To summarize thus far:

  1. I can saturate the upstream at site A.
  2. Going from router-B to router-A, I get speeds of 443Mb without the VPN and 312Mb over it.

I then connected to an SMB share on Server-B (TrueNAS) and started downloading a file to Desktop-A (Windows 10), and I get a fluctuating 9-12MB/sec (I’ll just call it a 100Mb link for simplicity). That’s about 1/3 of what I got with iperf3 from router-B to router-A. So I did a bunch of things here, because I wanted to see exactly what was going wrong:

  1. I did a packet capture on Desktop-A using Wireshark. Things go smoothly for the SMB transfer, but every second or so I get inundated with TCP DUP ACKs: 100 to 120 of them in under 2ms. At the same time I’ll get a dozen or so TCP out-of-order packets and a fast retransmit. Then comes another second of what looks like normal SMB traffic, followed by another huge burst of DUP ACKs, out-of-order packets, and fast retransmits.
  2. I did an iperf3 test from Server-B to Desktop-A. I get 9-12MB/sec with 5-50 retransmits almost every second according to iperf3. SMB and iperf3 performance seem to be in line with each other, so I probably don’t have an SMB problem. I also checked with Wireshark and saw the same DUP ACKs and such as in #1 above.
  3. I did iperf3 from Desktop-A to Router-A and vice versa and I got 931Mb/sec (basically a 1Gb connection).
  4. I did an iperf3 from Server-B to Router-B and vice versa and I got 930Mb/sec (basically a 1Gb connection).
  5. I did iperf3 from Router-B to Desktop-A and I get about 100Mb/sec with 5-20 retransmits every second in the iperf3 output.
  6. I did iperf3 from Server-B to Router-A and I got about 100Mb/sec, again with 5-20 retransmits every second in the iperf3 output.
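To get a feel for how much those retransmits could matter, the Mathis et al. approximation for steady-state TCP throughput as a function of loss rate is a useful sanity check. This is a textbook model, with the MSS and RTT values assumed from the numbers above rather than measured:

```python
import math

# Mathis et al.: throughput ~ (MSS/RTT) * (C / sqrt(p)), with C = sqrt(3/2).
def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    return (mss_bytes * 8 / rtt_s) * (math.sqrt(1.5) / math.sqrt(loss_rate))

# Assuming a standard 1460-byte MSS and the 14ms RTT between the sites:
for p in (1e-2, 1e-3, 1e-4):
    mbps = mathis_throughput_bps(1460, 0.014, p) / 1e6
    print(f"loss rate {p:.0e} -> ~{mbps:.0f} Mb/sec")
```

Under these assumptions, even a 0.01% loss rate caps this path around 100Mb/sec, which is suspiciously close to what the desktop-to-remote tests show; the steady retransmits alone could plausibly explain the throughput drop.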

So it seems if I go from router to router directly, all is good. I can also go from desktop on either side to its associated router, and all is good. I get good speeds and no retransmits. But as soon as I want to go from a desktop or server on one site to anything on the other (including the router on the other side), performance and reliability take a nasty nosedive.

To put it another way, I can go from router to router just fine, but if I want to actually use the VPN in any meaningful way using other servers and desktops, it’s unreliable and slow.

Any ideas where to even start to investigate this issue?

I did try emailing the zerotier plugin maintainer for opnsense to start a conversation several weeks ago, but he didn’t respond. I’d really rather not bother him with additional emails unless I can really prove this is a zerotier issue.

As this is affecting business functionality, the company is open to the idea of paying someone to troubleshoot and identify the issue. But I don’t know where to start. Is it an OPNsense problem, and I need an OPNsense expert? Is it a ZeroTier problem? If so, is it the plugin itself or the ZeroTier code? Is it just a configuration problem on my part? I know the ZeroTier documentation at OPNsense | ZeroTier Documentation makes it pretty clear that ZeroTier, Inc. doesn’t maintain the OPNsense implementation. From some of the “official” posts in the ZeroTier forums I get the impression ZeroTier, Inc. has the attitude of “we don’t do OPNsense, so if it works, great, and if it doesn’t, don’t talk to us about it either”.

Thanks to whoever read this all the way to the end. I realize this is a lot to swallow.

Thanks for doing all that work and writing this up. We’re discussing and asking around if anyone has any ideas.

Thanks for all of the info. Not sure what’s happening, but can you test the latency between two troubled devices (over ZT) and also record the latency variance over time? I’d like to know if the variance is higher over ZT. I’m wondering if something inside ZT is creating too noisy an environment for the SMB protocol to function properly.
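Something like this would do for summarizing the samples; the RTT values below are made-up placeholders just to show the shape of the report (collect real ones with a ping per second against the far device’s ZT address):

```python
import statistics

# Placeholder RTT samples in ms; substitute real per-second ping results
# collected over the ZeroTier address of the far-side device.
rtt_ms = [14.1, 14.3, 55.0, 14.2, 14.4, 48.7, 14.2, 14.1, 14.3, 60.2]

print(f"samples: {len(rtt_ms)}")
print(f"mean   : {statistics.mean(rtt_ms):.1f} ms")
print(f"median : {statistics.median(rtt_ms):.1f} ms")
print(f"stdev  : {statistics.stdev(rtt_ms):.1f} ms")  # the variance signal
print(f"spread : {max(rtt_ms) - min(rtt_ms):.1f} ms")
```

A mean well above the median (as in these made-up numbers) would point at periodic latency spikes rather than uniformly higher latency, which is the kind of noise that could trip up SMB.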

Can you blacklist 172.16.0.0/12 in local.conf too, just in case?
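That is, so the physical blacklist covers all three RFC 1918 ranges. Keeping everything else from the existing config, local.conf would look something like:

```json
{
  "physical": {
    "192.168.0.0/16": { "blacklist": true },
    "10.0.0.0/16": { "blacklist": true },
    "172.16.0.0/12": { "blacklist": true }
  },
  "settings": {
    "primaryPort": 9993,
    "portMappingEnabled": false,
    "allowSecondaryPort": false,
    "allowTcpFallbackRelay": false
  }
}
```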