Intermittent packet loss makes zerotier unusable for myself and multiple other users in Japan

Over the last year or two, several colleagues and I have attempted to use ZeroTier to remotely access our PCs. All of us have experienced intermittent packet loss over a path listed as DIRECT. In some cases this packet loss has been observed over a local network connection with no wifi involved.

Here’s an example:

Machine 1:

❯ sudo zerotier-cli peers
200 peers
<ztaddr>   <ver>  <role> <lat> <link> <lastTX> <lastRX> <path>
1111111111 -      PLANET    93 DIRECT 8608     3365     ip2/9993
2222222222 -      PLANET   154 DIRECT 3458     3301     ip3/9993
3333333333 -      PLANET   231 DIRECT 8608     13526    ip4/9993
4444444444 -      PLANET   102 DIRECT 8608     3355     ip5/9993
5555555555 1.10.2 LEAF       8 DIRECT 39055    221897   ip1/57720
6666666666 1.10.2 LEAF     159 DIRECT 2852     2852     ip6/29740
7777777777 1.10.2 LEAF      10 DIRECT 19       7397     ip1/4906    <-- Machine 2

Machine 2:

❯ sudo zerotier-cli peers
200 peers
<ztaddr>   <ver>  <role> <lat> <link> <lastTX> <lastRX> <path>
5ae0cc8da8 1.8.10 LEAF       8 DIRECT 1205     1205     ip1/1027
1111111111 -      PLANET    96 DIRECT 1256     1407     ip2/9993
2222222222 -      PLANET   163 DIRECT 5788     5627     ip3/9993
3333333333 -      PLANET   259 DIRECT 5788     456      ip4/9993
4444444444 -      PLANET   107 DIRECT 5788     608      ip5/9993
5555555555 1.10.2 LEAF      -1 DIRECT 230362   230362   ip1/57720
6666666666 1.10.2 LEAF     240 DIRECT 1671     1502     ip6/21024
8888888888 1.10.2 LEAF      15 DIRECT 354      2723     ip7/1451  <-- Machine 1
9999999999 1.8.10 LEAF      -1 RELAY

Example ping with bouts of packet loss:

64 bytes from 172.22.131.39: icmp_seq=164 ttl=64 time=11.023 ms
64 bytes from 172.22.131.39: icmp_seq=165 ttl=64 time=15.906 ms
64 bytes from 172.22.131.39: icmp_seq=166 ttl=64 time=12.990 ms
Request timeout for icmp_seq 167
Request timeout for icmp_seq 168
Request timeout for icmp_seq 169
64 bytes from 172.22.131.39: icmp_seq=170 ttl=64 time=8.397 ms
64 bytes from 172.22.131.39: icmp_seq=171 ttl=64 time=13.644 ms
64 bytes from 172.22.131.39: icmp_seq=172 ttl=64 time=9.459 ms
64 bytes from 172.22.131.39: icmp_seq=173 ttl=64 time=8.479 ms
64 bytes from 172.22.131.39: icmp_seq=174 ttl=64 time=23.692 ms
64 bytes from 172.22.131.39: icmp_seq=175 ttl=64 time=10.655 ms
64 bytes from 172.22.131.39: icmp_seq=176 ttl=64 time=11.099 ms
64 bytes from 172.22.131.39: icmp_seq=177 ttl=64 time=8.291 ms
64 bytes from 172.22.131.39: icmp_seq=178 ttl=64 time=9.656 ms
64 bytes from 172.22.131.39: icmp_seq=179 ttl=64 time=10.141 ms
Request timeout for icmp_seq 180
64 bytes from 172.22.131.39: icmp_seq=181 ttl=64 time=8.235 ms
64 bytes from 172.22.131.39: icmp_seq=182 ttl=64 time=10.487 ms
64 bytes from 172.22.131.39: icmp_seq=183 ttl=64 time=8.062 ms
64 bytes from 172.22.131.39: icmp_seq=184 ttl=64 time=11.191 ms
64 bytes from 172.22.131.39: icmp_seq=185 ttl=64 time=10.657 ms
64 bytes from 172.22.131.39: icmp_seq=186 ttl=64 time=9.737 ms
Request timeout for icmp_seq 187
Request timeout for icmp_seq 188
Request timeout for icmp_seq 189
Request timeout for icmp_seq 190
64 bytes from 172.22.131.39: icmp_seq=191 ttl=64 time=18.199 ms
64 bytes from 172.22.131.39: icmp_seq=192 ttl=64 time=18.189 ms
64 bytes from 172.22.131.39: icmp_seq=193 ttl=64 time=8.917 ms
64 bytes from 172.22.131.39: icmp_seq=194 ttl=64 time=9.375 ms
64 bytes from 172.22.131.39: icmp_seq=195 ttl=64 time=14.346 ms
64 bytes from 172.22.131.39: icmp_seq=196 ttl=64 time=16.188 ms
Request timeout for icmp_seq 197
Request timeout for icmp_seq 198
^C
--- grater ping statistics ---
200 packets transmitted, 183 packets received, 8.5% packet loss
round-trip min/avg/max/stddev = 7.506/11.637/24.979/3.412 ms

Of note, the packet loss occurs roughly every 30-45 seconds and lasts for 3-10 seconds each time. The loss is per connection, not across all connections: if I have two pings running in parallel, they begin to fail at different points in time.
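To put numbers on how often the loss happens and how long it lasts, a saved ping log can be reduced to a list of loss bursts. This is only a rough sketch: it assumes the macOS-style output shown above ("Request timeout for icmp_seq N") and the default one-packet-per-second interval, and the name loss_bursts is made up for illustration.

```python
import re

# Patterns match the macOS-style ping output pasted above (an assumption;
# Linux ping does not print "Request timeout" lines by default).
TIMEOUT = re.compile(r"Request timeout for icmp_seq (\d+)")
REPLY = re.compile(r"icmp_seq=(\d+)")


def loss_bursts(ping_output):
    """Return (first_seq, last_seq) for each run of consecutive timeouts."""
    bursts = []
    first = last = None
    for line in ping_output.splitlines():
        timeout = TIMEOUT.search(line)
        if timeout:
            last = int(timeout.group(1))
            if first is None:
                first = last          # a new burst starts here
        elif REPLY.search(line) and first is not None:
            bursts.append((first, last))  # a reply ends the current burst
            first = last = None
    if first is not None:
        bursts.append((first, last))      # log ended in the middle of a burst
    return bursts
```

With one ping per second, last_seq - first_seq + 1 approximates the burst length in seconds, and the difference between consecutive first_seq values approximates the time between bursts.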

We’ve noticed this between:

  • macOS and macOS
  • Linux and macOS
  • macOS and iPadOS
  • macOS and Windows

The only common theme seems to be that we are all located in Japan. But because the routing appears to be direct, this likely shouldn’t be an issue? When we take a non-ZeroTier route between two test machines, there is zero packet loss.

Any ideas?

Adding extra information: I’ve tested dropping MTU down to 1280 just in case the higher 2800 has something to do with it, but that didn’t help.

One of my friends ran a ping session all day and the packet loss only started at a certain time (around 09:30 UTC). At that point in time, pinging the target system directly showed no packet loss.

Hi peppy,
Sorry we missed you. That is very strange.
Does this happen between devices that are on the same physical LAN?
What type of routers/firewalls are involved? It sounds like maybe a NAT mapping is breaking, and then fixing itself.

If you do zerotier-cli info -j on the newest versions of zerotier, it should have a “surfaceAddresses” field. Does that list keep growing?
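One way to see whether the list keeps growing is to poll that command and log every change with a timestamp. The following is a sketch, not an official tool: the function names are hypothetical, and because the exact position of "surfaceAddresses" inside the JSON is an assumption here, the code searches the whole document for the key instead of hard-coding a path.

```python
import json
import subprocess
import time


def find_surface_addresses(node):
    """Search parsed JSON for a 'surfaceAddresses' key; return None if absent."""
    if isinstance(node, dict):
        if "surfaceAddresses" in node:
            return node["surfaceAddresses"]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_surface_addresses(child)
        if found is not None:
            return found
    return None


def diff_addresses(old, new):
    """Return (added, removed) between two address lists, each sorted."""
    old_set, new_set = set(old or []), set(new or [])
    return sorted(new_set - old_set), sorted(old_set - new_set)


def watch(interval=1.0):
    """Poll `zerotier-cli info -j` and print changes (may need root)."""
    previous = None
    while True:
        raw = subprocess.run(["zerotier-cli", "info", "-j"],
                             capture_output=True, text=True, check=True).stdout
        current = find_surface_addresses(json.loads(raw))
        added, removed = diff_addresses(previous, current)
        if added or removed:
            print(time.strftime("%H:%M:%S"), "added:", added, "removed:", removed)
        previous = current
        time.sleep(interval)
```

Running watch() next to a ping in another terminal makes it easier to line up address changes with the dropped packets.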

Hi Travis, sorry for the delay in response - the issue is sometimes quite intermittent and hard to capture.

peppy and I share one endpoint “A” in reproducing this issue. We both reproduce the issue when the other endpoint is our individual home networks, OR when the other endpoint is the same local network “A”. All endpoints are on different ISPs.

Here’s a video showing what we’re seeing with pings and the output of “surfaceAddresses”. I’ve blurred out most of the details that are confidential or that I believe are not important to this issue.

https://drive.google.com/file/d/1rabLr2zui4ZNrd0uqziYmOMZbyi9mqhQ/view

  • On the left in white is a macOS client. It is on endpoint “A”. It is pinging the client on the right.
  • On the right in black is a Linux client that I’m connected to via a remote session. It is on my home network. It is pinging the client on the left, and you can see the remote session intermittently dropping in step with the packet losses.

Timeline of events:
00:00:09 - Linux client | surfaceAddresses change
00:00:39 - Linux client | surfaceAddresses change
00:00:53 - Both clients | Packets dropped
00:01:09 - Linux client | surfaceAddresses change
00:01:12 - macOS client | surfaceAddress added
00:01:13 - Linux client | surfaceAddress added
00:01:14 - Linux client | surfaceAddress added
00:01:24 - Linux client | surfaceAddress removed
00:01:39 - Linux client | surfaceAddresses change
00:01:40 - Both clients | Packets dropped
00:01:58 - Linux client | surfaceAddress added
00:02:09 - Linux client | surfaceAddresses change
00:02:24 - Linux client | surfaceAddresses change
00:02:39 - Linux client | surfaceAddresses change
00:02:40 - Both clients | Packets dropped
00:02:58 - Linux client | surfaceAddresses change

  • As far as I can tell, full cone NAT is enabled on the Linux client’s network.
  • Should the Linux client have a secondary port?

Interesting.
I’d look into the NAT on the Linux side. What type of router is it?

surfaceAddresses are the address:port that the Roots see for that node. The roots pass that on to the other node. If that mapping expires, then there is a hiccup while a new connection with a new port is made.
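To check for that kind of remapping, two snapshots of surfaceAddresses can be compared for IPs whose port changed. This is a minimal sketch under the assumption that each entry is a string of the form "ip/port", as in the peers output earlier in the thread; the function names and the sample addresses are made up.

```python
from collections import defaultdict


def ports_by_ip(addresses):
    """Group 'ip/port' strings into {ip: {port, ...}}."""
    grouped = defaultdict(set)
    for entry in addresses:
        # Split on the last '/' so IPv6 addresses stay intact.
        ip, _, port = entry.rpartition("/")
        grouped[ip].add(int(port))
    return grouped


def remapped(old, new):
    """Return {ip: (old_ports, new_ports)} for IPs seen in both snapshots
    whose set of external ports differs, i.e. likely NAT remappings."""
    before, after = ports_by_ip(old), ports_by_ip(new)
    return {
        ip: (sorted(before[ip]), sorted(after[ip]))
        for ip in before.keys() & after.keys()
        if before[ip] != after[ip]
    }


# Example with made-up documentation addresses:
print(remapped(["203.0.113.5/40001"], ["203.0.113.5/40002"]))
```

If the same public IP keeps showing up with a new port around the time packets drop, that would be consistent with the mapping expiring on the router.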

Sorry to jump into somebody else’s thread, but this issue they’re reporting is very similar to an issue we’ve encountered as well.

If there are no surfaceAddresses listed, what does that imply? What could prevent those from populating?

(If this is a bigger question, I’m happy to create a separate topic for it.)

Hello. The info wasn’t always exposed. I think it was added in the last release. Also, it appears to not have made it into the Windows version yet.
If surfaceAddresses is there but it’s an empty [] I would guess UDP is completely blocked.
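The three cases described here (field missing, field present but empty, field populated) can be told apart with a small check. This is a sketch only: it assumes the caller has already parsed the `zerotier-cli info -j` output and passes in whichever JSON object would hold the field, and the function name and return labels are hypothetical.

```python
def surface_address_state(settings):
    """Classify the surfaceAddresses field of a parsed settings object."""
    if "surfaceAddresses" not in settings:
        # The field is absent: this build may not expose it at all.
        return "not reported"
    if not settings["surfaceAddresses"]:
        # Present but []: per the guess above, UDP may be completely blocked.
        return "empty"
    return "populated"


print(surface_address_state({"surfaceAddresses": []}))
```

"not reported" says nothing about the network itself; only the "empty" case hints at blocked UDP.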