
VyOS 1.5, Segment Routing & GRE

I was reading about Segment Routing, and I wanted to give that technology a try.

I already had a GNS3 lab built with vSRX devices: a basic IS-IS+LDP topology, iBGP between the PE routers, and an L3VPN service on top. Unfortunately, after trying to enable SR on the vSRX devices I found that you cannot configure an SRGB on them. I'd have to redo the lab with vMX, and that's not fun for me; it takes it out of "weekend fun project" territory and into "homework".

So I thought: why not deploy Segment Routing on my own network? Not directly on AS203528 yet (I don't want to mess with it due to a current outage I have), but on my internal network. It all runs VyOS, with IS-IS as the IGP and iBGP (with 4 RRs). There's no MPLS deployed there yet. After all, this network is supposed to be a lab (even though it has taken on more of a "production" role).

Initial deployment

First of all, I updated all the nodes (except the RRs) on my internal VyOS network to the latest 1.5 nightly at the time (1.5-rolling-202312290919). Easy enough. All good so far.

Then it was just a matter of enabling SR within IS-IS (it already comes with default SRGB/SRLB values), assigning index values for the loopback prefixes on each router, and enabling MPLS on the internally-facing interfaces.

Here is an example: first the loopback IPs of one of my routers, then the config deployed to enable SR.

dum0         192.168.254.34/32                 xx:xx:xx:xx:xx:xx  default   1500  u/u    Loopback / Tunnel source
             2a0e:8f02:21d2:ffff::34/128
dum1         fc0e:8f02:21d2:ffff::34/128       xx:xx:xx:xx:xx:xx  default   1500  u/u    IPv6 iBGP next hop


set protocols isis segment-routing maximum-label-depth '15'
set protocols isis segment-routing prefix 2a0e:8f02:21d1:ffff::34/128 index value '341'
set protocols isis segment-routing prefix 192.168.254.34/32 index value '340'
set protocols isis segment-routing prefix fc0e:8f02:21d1:ffff::34/128 index value '342'

set protocols mpls interface 'eth0'
set protocols mpls interface 'eth1'

After deploying that everywhere, it just worked:

OSR2A1:~$ sh ip route 192.168.254.13
Routing entry for 192.168.254.13/32
  Known via "isis", distance 115, metric 45210, best
  Last update 1d00h32m ago
  * 172.27.19.17, via eth1, label 16130, weight 1

OSR2A1:~$ sh ipv6 route 2a0e:8f02:21d1:ffff::13
Routing entry for 2a0e:8f02:21d1:ffff::13/128
  Known via "isis", distance 115, metric 45210, best
  Last update 1d00h33m ago
  * fe80::9434:bdff:fe26:3f79, via eth1, label 16131, weight 1

OSR2A1:~$ sh ipv6 route fc0e:8f02:21d1:ffff::13
Routing entry for fc0e:8f02:21d1:ffff::13/128
  Known via "isis", distance 115, metric 45210, best
  Last update 1d00h33m ago
  * fe80::9434:bdff:fe26:3f79, via eth1, label 16132, weight 1
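
As a sanity check, the labels line up with FRR's default SRGB, which starts at 16000: the advertised label is simply the SRGB base plus the prefix's configured index. Assuming the .13 router follows the same index scheme as the 340/341/342 example above (i.e. 130/131/132), the math works out:

16000 (SRGB base) + 130 (index) = 16130  -> 192.168.254.13/32
16000 (SRGB base) + 131 (index) = 16131  -> 2a0e:8f02:21d1:ffff::13/128
16000 (SRGB base) + 132 (index) = 16132  -> fc0e:8f02:21d1:ffff::13/128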

Problems

It all worked nicely in the beginning, or so I thought. Later that day I found some really sporadic performance degradation on my network, to/from the IPv6 Internet.

My internal network is connected to the public v6 Internet through a redundant pair of firewalls (OSR1FW1/OSR1FW2), each running on a different server at home.

Each of these firewalls is connected to the two border routers I have at home, OSR1BR1/OSR1BR2. The connection between a firewall and the border router on the same server is a local VLAN. However, the connection to the FW/BR on the other server goes through a GRETAP tunnel traversing my core.

The active firewall does ECMP to/from both border routers, so some flows take the local VLAN while others take the GRETAP tunnel. That would partly explain the inconsistent experience and degradation.

Diagram

Below is an example configuration of such a GRETAP tunnel across my core, for the firewall-to-BR connection (the OSR1CR4 side looks pretty much the same):

OSR1CR2# sh interfaces ethernet eth12
 description "OSR1FW2 - OSR1BR1 VLAN 543"
 offload {
     gro
     gso
     sg
     tso
 }


OSR1CR2# sh interfaces bridge br2
 description "OSR1FW2 - OSR1BR1 VLAN 543"
 enable-vlan
 ipv6 {
     address {
         no-default-link-local
     }
 }
 member {
     interface eth12 {
         allowed-vlan 100
         native-vlan 100
     }
     interface tun2 {
         allowed-vlan 100
         native-vlan 100
     }
 }

OSR1CR2# sh interfaces tunnel tun2
 description "OSR1FW2 - OSR1BR1 VLAN 543"
 encapsulation gretap
 mtu 1600
 parameters {
     ip {
         key 543
     }
 }
 remote 192.168.254.13
 source-address 192.168.254.11
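
As a side note on the MTU values (my own back-of-the-envelope math, not taken from the configs): a GRETAP frame picks up roughly 42 bytes of overhead on the wire (14-byte inner Ethernet header + 8-byte GRE header with key + 20-byte outer IPv4 header), plus 4 bytes per MPLS label once SR forwarding kicks in, so the 1600-byte tunnel MTU still fits comfortably inside the 1800-byte core-facing MTU:

1600 (tunnel MTU) + 14 (inner Ethernet) + 8 (GRE + key) + 20 (outer IPv4) + 4 (one MPLS label) = 1646 <= 1800 (core MTU)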

I noticed that, on the core router where the tunnel is configured, whenever the performance issue was seen, the core-facing interface (in this case OSR1CR2 towards OSR1CR1) would drop packets on TX:

fabrizzio@OSR1CR2:~$ sh interfaces ethernet eth0
eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1800 qdisc mq state UP group default qlen 1000
    link/ether f6:65:87:96:70:22 brd ff:ff:ff:ff:ff:ff
    altname enp0s18
    altname ens18
    inet 172.27.16.38/30 brd 172.27.16.39 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 2a0e:8f02:21d1:feed:0:1:10:12/126 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::f465:87ff:fe96:7022/64 scope link 
       valid_lft forever preferred_lft forever
    Description: To OSR1CR1

    RX:       bytes  packets  errors  dropped  overrun       mcast
         1261617086  3428022       0       20        0           0
    TX:       bytes  packets  errors  dropped  carrier  collisions
         5598511155  6117186       0     5212        0           0 <<<<<<<<<<<<<<<

fabrizzio@OSR1CR2:~$ sh ip route 192.168.254.13
Routing entry for 192.168.254.13/32
  Known via "isis", distance 115, metric 1210, best
  Last update 14:42:50 ago
  * 172.27.16.37, via eth0, label 16130, weight 1

Here is an iperf3 test between OSR1FW2 and OSR1BR1 to measure performance across the GRETAP tunnel. The results are pretty bad; this should be close to 2 Gbit/s. Tons of retransmissions and SACKs were seen. The tcpdump was taken on the OSR1CR4 interface facing OSR1FW2: the traffic visibly enters OSR1CR4, yet the far end is missing some segments.

OSR1FW2:~$ iperf3 -c 172.27.1.17
Connecting to host 172.27.1.17, port 5201
[  5] local 172.27.1.18 port 39192 connected to 172.27.1.17 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   139 KBytes  1.13 Mbits/sec   43   8.48 KBytes       
[  5]   1.00-2.00   sec  45.2 KBytes   371 Kbits/sec   40   8.48 KBytes       
[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec   38   8.48 KBytes       
^C- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-3.09   sec   184 KBytes   487 Kbits/sec  121             sender
[  5]   0.00-3.09   sec  0.00 Bytes  0.00 bits/sec                  receiver
iperf3: interrupt - the client has terminated


16:39:41.862143 IP 172.27.1.18.39192 > 172.27.1.17.5201: Flags [S], seq 1766734757, win 64240, options [mss 1460,sackOK,TS val 2424729071 ecr 0,nop,wscale 7], length 0
16:39:41.862850 IP 172.27.1.17.5201 > 172.27.1.18.39192: Flags [S.], seq 3455293594, ack 1766734758, win 65160, options [mss 1460,sackOK,TS val 4290898765 ecr 2424729071,nop,wscale 7], length 0
16:39:41.863012 IP 172.27.1.18.39192 > 172.27.1.17.5201: Flags [.], ack 1, win 502, options [nop,nop,TS val 2424729072 ecr 4290898765], length 0
16:39:41.863051 IP 172.27.1.18.39192 > 172.27.1.17.5201: Flags [P.], seq 1:38, ack 1, win 502, options [nop,nop,TS val 2424729072 ecr 4290898765], length 37
16:39:41.863749 IP 172.27.1.17.5201 > 172.27.1.18.39192: Flags [.], ack 38, win 509, options [nop,nop,TS val 4290898766 ecr 2424729072], length 0
16:39:41.864481 IP 172.27.1.17.5201 > 172.27.1.18.39180: Flags [P.], seq 3:4, ack 166, win 508, options [nop,nop,TS val 4290898766 ecr 2424729070], length 1
16:39:41.864492 IP 172.27.1.17.5201 > 172.27.1.18.39180: Flags [P.], seq 4:5, ack 166, win 508, options [nop,nop,TS val 4290898766 ecr 2424729070], length 1
16:39:41.864616 IP 172.27.1.18.39180 > 172.27.1.17.5201: Flags [.], ack 5, win 502, options [nop,nop,TS val 2424729073 ecr 4290898764], length 0
16:39:41.864658 IP 172.27.1.18.39192 > 172.27.1.17.5201: Flags [P.], seq 38:7278, ack 1, win 502, options [nop,nop,TS val 2424729073 ecr 4290898766], length 7240
16:39:41.864700 IP 172.27.1.18.39192 > 172.27.1.17.5201: Flags [P.], seq 7278:14518, ack 1, win 502, options [nop,nop,TS val 2424729073 ecr 4290898766], length 7240
16:39:41.864773 IP 172.27.1.18.39192 > 172.27.1.17.5201: Flags [.], seq 14518:15966, ack 1, win 502, options [nop,nop,TS val 2424729074 ecr 4290898766], length 1448
16:39:41.865324 IP 172.27.1.17.5201 > 172.27.1.18.39192: Flags [.], ack 38, win 509, options [nop,nop,TS val 4290898767 ecr 2424729072,nop,nop,sack 1 {14518:15966}], length 0
16:39:41.865437 IP 172.27.1.18.39192 > 172.27.1.17.5201: Flags [.], seq 38:1486, ack 1, win 502, options [nop,nop,TS val 2424729074 ecr 4290898767], length 1448
16:39:41.866101 IP 172.27.1.17.5201 > 172.27.1.18.39192: Flags [.], ack 1486, win 498, options [nop,nop,TS val 4290898768 ecr 2424729074,nop,nop,sack 1 {14518:15966}], length 0
16:39:41.866222 IP 172.27.1.18.39192 > 172.27.1.17.5201: Flags [P.], seq 1486:7278, ack 1, win 502, options [nop,nop,TS val 2424729075 ecr 4290898768], length 5792
16:39:42.070774 IP 172.27.1.18.39192 > 172.27.1.17.5201: Flags [.], seq 1486:2934, ack 1, win 502, options [nop,nop,TS val 2424729280 ecr 4290898768], length 1448
16:39:42.071778 IP 172.27.1.17.5201 > 172.27.1.18.39192: Flags [.], ack 2934, win 490, options [nop,nop,TS val 4290898974 ecr 2424729280,nop,nop,sack 1 {14518:15966}], length 0
16:39:42.072008 IP 172.27.1.18.39192 > 172.27.1.17.5201: Flags [P.], seq 2934:7278, ack 1, win 502, options [nop,nop,TS val 2424729281 ecr 4290898974], length 4344
16:39:42.072042 IP 172.27.1.18.39192 > 172.27.1.17.5201: Flags [.], seq 7278:8726, ack 1, win 502, options [nop,nop,TS val 2424729281 ecr 4290898974], length 1448
16:39:42.072743 IP 172.27.1.17.5201 > 172.27.1.18.39192: Flags [.], ack 2934, win 490, options [nop,nop,TS val 4290898975 ecr 2424729280,nop,nop,sack 2 {7278:8726}{14518:15966}], length 0
16:39:42.072898 IP 172.27.1.18.39192 > 172.27.1.17.5201: Flags [.], seq 2934:4382, ack 1, win 502, options [nop,nop,TS val 2424729282 ecr 4290898975], length 1448
16:39:42.073582 IP 172.27.1.17.5201 > 172.27.1.18.39192: Flags [.], ack 4382, win 479, options [nop,nop,TS val 4290898975 ecr 2424729282,nop,nop,sack 2 {7278:8726}{14518:15966}], length 0
16:39:42.073776 IP 172.27.1.18.39192 > 172.27.1.17.5201: Flags [P.], seq 4382:7278, ack 1, win 502, options [nop,nop,TS val 2424729283 ecr 4290898975], length 2896
16:39:42.073817 IP 172.27.1.18.39192 > 172.27.1.17.5201: Flags [P.], seq 15966:18862, ack 1, win 502, options [nop,nop,TS val 2424729283 ecr 4290898975], length 2896


The odd thing is that if I repeat the capture on the OSR1CR4 interface facing the core (towards OSR1CR3), the segments from OSR1FW2 to OSR1BR1 are indeed already missing; they are never sent out towards the core.

The weird tcpdump filter is there because traffic from OSR1CR4 towards the GRE destination OSR1CR2 carries MPLS label 16110, while, due to penultimate hop popping (my hands keep typing pooping by themselves :D ), traffic in the reverse direction arrives without a label; I need to capture both directions.

fabrizzio@OSR1CR4:~$ sh ip route 192.168.254.11
Routing entry for 192.168.254.11/32
  Known via "isis", distance 115, metric 1210, best
  Last update 19:23:51 ago
  * 172.27.16.41, via eth0, label 16110, weight 1

root@OSR1CR4:~# tcpdump -i eth0 "(src 192.168.254.11 && dst 192.168.254.13) or (mpls 16110 && (src 192.168.254.13 && dst 192.168.254.11))" | grep 0x21f

<snipped>

16:45:49.477938 MPLS (label 16110, tc 0, [S], ttl 64) IP 192.168.254.13 > 192.168.254.11: GREv0, key=0x21f, length 82: IP 172.27.1.18.37834 > 172.27.1.17.5201: Flags [S], seq 1179634274, win 64240, options [mss 1460,sackOK,TS val 2425096687 ecr 0,nop,wscale 7], length 0
16:45:49.478705 IP 192.168.254.11 > 192.168.254.13: GREv0, key=0x21f, length 82: IP 172.27.1.17.5201 > 172.27.1.18.37834: Flags [S.], seq 3242204630, ack 1179634275, win 65160, options [mss 1460,sackOK,TS val 4291266381 ecr 2425096687,nop,wscale 7], length 0
16:45:49.478851 MPLS (label 16110, tc 0, [S], ttl 64) IP 192.168.254.13 > 192.168.254.11: GREv0, key=0x21f, length 74: IP 172.27.1.18.37834 > 172.27.1.17.5201: Flags [.], ack 1, win 502, options [nop,nop,TS val 2425096688 ecr 4291266381], length 0
16:45:49.478865 MPLS (label 16110, tc 0, [S], ttl 64) IP 192.168.254.13 > 192.168.254.11: GREv0, key=0x21f, length 111: IP 172.27.1.18.37834 > 172.27.1.17.5201: Flags [P.], seq 1:38, ack 1, win 502, options [nop,nop,TS val 2425096688 ecr 4291266381], length 37
16:45:49.479638 IP 192.168.254.11 > 192.168.254.13: GREv0, key=0x21f, length 74: IP 172.27.1.17.5201 > 172.27.1.18.37834: Flags [.], ack 38, win 509, options [nop,nop,TS val 4291266382 ecr 2425096688], length 0
16:45:49.480309 IP 192.168.254.11 > 192.168.254.13: GREv0, key=0x21f, length 75: IP 172.27.1.17.5201 > 172.27.1.18.37828: Flags [P.], seq 3:4, ack 166, win 508, options [nop,nop,TS val 4291266382 ecr 2425096686], length 1
16:45:49.480330 IP 192.168.254.11 > 192.168.254.13: GREv0, key=0x21f, length 75: IP 172.27.1.17.5201 > 172.27.1.18.37828: Flags [P.], seq 4:5, ack 166, win 508, options [nop,nop,TS val 4291266382 ecr 2425096686], length 1
16:45:49.480471 MPLS (label 16110, tc 0, [S], ttl 64) IP 192.168.254.13 > 192.168.254.11: GREv0, key=0x21f, length 74: IP 172.27.1.18.37828 > 172.27.1.17.5201: Flags [.], ack 5, win 502, options [nop,nop,TS val 2425096689 ecr 4291266379], length 0
16:45:49.480677 MPLS (label 16110, tc 0, [S], ttl 64) IP 192.168.254.13 > 192.168.254.11: GREv0, key=0x21f, length 1522: IP 172.27.1.18.37834 > 172.27.1.17.5201: Flags [.], seq 14518:15966, ack 1, win 502, options [nop,nop,TS val 2425096689 ecr 4291266382], length 1448
16:45:49.481335 IP 192.168.254.11 > 192.168.254.13: GREv0, key=0x21f, length 86: IP 172.27.1.17.5201 > 172.27.1.18.37834: Flags [.], ack 38, win 509, options [nop,nop,TS val 4291266383 ecr 2425096688,nop,nop,sack 1 {14518:15966}], length 0
16:45:49.481478 MPLS (label 16110, tc 0, [S], ttl 64) IP 192.168.254.13 > 192.168.254.11: GREv0, key=0x21f, length 1522: IP 172.27.1.18.37834 > 172.27.1.17.5201: Flags [.], seq 38:1486, ack 1, win 502, options [nop,nop,TS val 2425096690 ecr 4291266383], length 1448
16:45:49.482121 IP 192.168.254.11 > 192.168.254.13: GREv0, key=0x21f, length 86: IP 172.27.1.17.5201 > 172.27.1.18.37834: Flags [.], ack 1486, win 498, options [nop,nop,TS val 4291266384 ecr 2425096690,nop,nop,sack 1 {14518:15966}], length 0
16:45:49.686840 MPLS (label 16110, tc 0, [S], ttl 64) IP 192.168.254.13 > 192.168.254.11: GREv0, key=0x21f, length 1522: IP 172.27.1.18.37834 > 172.27.1.17.5201: Flags [.], seq 1486:2934, ack 1, win 502, options [nop,nop,TS val 2425096896 ecr 4291266384], length 1448
16:45:49.687786 IP 192.168.254.11 > 192.168.254.13: GREv0, key=0x21f, length 86: IP 172.27.1.17.5201 > 172.27.1.18.37834: Flags [.], ack 2934, win 490, options [nop,nop,TS val 4291266590 ecr 2425096896,nop,nop,sack 1 {14518:15966}], length 0
16:45:49.687950 MPLS (label 16110, tc 0, [S], ttl 64) IP 192.168.254.13 > 192.168.254.11: GREv0, key=0x21f, length 1522: IP 172.27.1.18.37834 > 172.27.1.17.5201: Flags [.], seq 2934:4382, ack 1, win 502, options [nop,nop,TS val 2425096897 ecr 4291266590], length 1448
16:45:49.688011 MPLS (label 16110, tc 0, [S], ttl 64) IP 192.168.254.13 > 192.168.254.11: GREv0, key=0x21f, length 1522: IP 172.27.1.18.37834 > 172.27.1.17.5201: Flags [.], seq 4382:5830, ack 1, win 502, options [nop,nop,TS val 2425096897 ecr 4291266590], length 1448


Troubleshooting

First of all, I disabled all the offloads that VyOS had enabled on the core-facing interface:

fabrizzio@OSR1CR4:~$ configure 
[edit]
fabrizzio@OSR1CR4# delete interfaces ethernet eth0 offload 
[edit]
fabrizzio@OSR1CR4# commit

This did not fix the issue; TX drops were still piling up:

OSR1CR4:~$ sh interfaces ethernet eth0
eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1800 qdisc mq state UP group default qlen 1000
    link/ether 3a:e6:63:fc:e9:68 brd ff:ff:ff:ff:ff:ff
    altname enp0s18
    altname ens18
    inet 172.27.16.42/30 brd 172.27.16.43 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 2a0e:8f02:21d1:feed:0:1:11:12/126 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::38e6:63ff:fefc:e968/64 scope link 
       valid_lft forever preferred_lft forever
    Description: To OSR1CR3

    RX:       bytes  packets  errors  dropped  overrun       mcast
         7400341272  9874881       0       28        0           0
    TX:       bytes  packets  errors  dropped  carrier  collisions
         1002401375  4853418       0    36577        0           0 <<<<

ethtool shows no queue drops either:

fabrizzio@OSR1CR4:~$ ethtool -S eth0
NIC statistics:
     rx_queue_0_packets: 9878837
     rx_queue_0_bytes: 7405521629
     rx_queue_0_drops: 0
     rx_queue_0_xdp_packets: 0
     rx_queue_0_xdp_tx: 0
     rx_queue_0_xdp_redirects: 0
     rx_queue_0_xdp_drops: 0
     rx_queue_0_kicks: 2066
     tx_queue_0_packets: 4853851
     tx_queue_0_bytes: 1002468305
     tx_queue_0_xdp_tx: 0
     tx_queue_0_xdp_tx_drops: 0
     tx_queue_0_kicks: 4595142
     tx_queue_0_tx_timeouts: 0

fabrizzio@OSR1CR4:~$ ethtool -k eth0
Features for eth0:
rx-checksumming: on [fixed]
tx-checksumming: on
	tx-checksum-ipv4: off [fixed]
	tx-checksum-ip-generic: on
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]
scatter-gather: off
	tx-scatter-gather: off
	tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
	tx-tcp-segmentation: off
	tx-tcp-ecn-segmentation: off
	tx-tcp-mangleid-segmentation: off
	tx-tcp6-segmentation: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: on [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
tx-gso-list: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: on
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]

Seeing this, I had to dig deeper and do some research. I found something related to MPLS + offloading drops on LKML and openvswitch, as well as a tangentially-related blog post from Cloudflare.

I decided that the best path forward would be to figure out if/why the kernel was dropping the packets. I added the Debian Bookworm repos onto OSR1CR2 (the far end of the GRE tunnel) and installed dropwatch.
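
For reference, the whole exercise boils down to a few commands (a sketch from memory; dropwatch's prompt and exact command wording may vary slightly between versions):

apt update && apt install dropwatch    # after adding the Debian Bookworm repo
dropwatch -l kas                       # resolve drop locations to kernel symbols
dropwatch> set alertmode packet        # report every dropped packet individually
dropwatch> start                       # begin monitoring, then re-run the iperf3 test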

I ran it with dropwatch -l kas, switched to per-packet alerts with set alertmode packet, observed what was normal (lots of ICMPv6 drops…) and re-ran the iperf3 test. This is what started popping out. Protocol 0x8847 (MPLS unicast) is a good hint that it's the traffic I care about. The length is big, maybe due to the various offloads doing their thing.

drop at: validate_xmit_skb+0x29c/0x320 (0xffffffff8d6b1a6c)
origin: software
timestamp: Mon Jan  1 11:52:45 2024 028417108 nsec
protocol: 0x8847
length: 3008
original length: 3008
drop reason: NOT_SPECIFIED

After digging around on Google, I found this, which pointed me to offloads as the potential culprit. The 3008-byte drops also fit that picture: they are larger than the 1800-byte MTU of eth0, which suggests GSO/TSO super-packets the kernel could no longer segment once MPLS got involved. I had already disabled the offloads on the core-facing interface that is actually dropping the packets.

I then tried disabling the offloads on the OSR1CR4 interface facing OSR1FW2. That made no difference to performance either.

The last option I had was to check both the bridge tying everything together and the tunnel interface. Neither has any offloads configured in VyOS.

fabrizzio@OSR1CR4# show interfaces bridge br2
 description "OSR1FW2 - OSR1BR1 VLAN 543"
 enable-vlan
 ipv6 {
     address {
         no-default-link-local
     }
 }
 member {
     interface eth10 {
         allowed-vlan 100
         native-vlan 100
     }
     interface tun2 {
         allowed-vlan 100
         native-vlan 100
     }
 }
[edit]

However, the tunnel interface on VyOS does have several offloads enabled by default:

fabrizzio@OSR1CR4:~$ ethtool -k tun2
Features for tun2:
rx-checksumming: off [fixed]
tx-checksumming: on
	tx-checksum-ipv4: off [fixed]
	tx-checksum-ip-generic: on
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: on
	tx-tcp-mangleid-segmentation: on
	tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: on
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
tx-gso-list: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]

After disabling them one by one, the issue was solved once TCP segmentation offload (TSO) was disabled: the iperf3 test now gets close to 2 Gbit/s. I was able to re-enable GSO & GRO without any issues.

fabrizzio@OSR1CR4:~$ ethtool -K tun2 gso off
fabrizzio@OSR1CR4:~$ ethtool -K tun2 gro off
fabrizzio@OSR1CR4:~$ ethtool -K tun2 tso off

fabrizzio@OSR1FW2:~$ iperf3 -c 172.27.1.17
Connecting to host 172.27.1.17, port 5201
[  5] local 172.27.1.18 port 57380 connected to 172.27.1.17 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   214 MBytes  1.79 Gbits/sec  370   2.14 MBytes       
[  5]   1.00-2.00   sec   206 MBytes  1.73 Gbits/sec    0   2.45 MBytes       
[  5]   2.00-3.00   sec   218 MBytes  1.82 Gbits/sec  121   1.16 MBytes       
[  5]   3.00-4.00   sec   218 MBytes  1.82 Gbits/sec    0   1.06 MBytes       
[  5]   4.00-5.00   sec   218 MBytes  1.82 Gbits/sec    0   1.23 MBytes       
[  5]   5.00-6.00   sec   214 MBytes  1.79 Gbits/sec  202   1.60 MBytes       
[  5]   6.00-7.00   sec   218 MBytes  1.82 Gbits/sec  471   2.19 MBytes       
[  5]   7.00-8.00   sec   206 MBytes  1.73 Gbits/sec  383   1.30 MBytes       
[  5]   8.00-9.00   sec   172 MBytes  1.45 Gbits/sec    0   2.62 MBytes       
[  5]   9.00-10.00  sec   194 MBytes  1.63 Gbits/sec    0   1.80 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.03 GBytes  1.74 Gbits/sec  1547             sender
[  5]   0.00-10.00  sec  2.03 GBytes  1.74 Gbits/sec                  receiver


Now, there is no option to disable these tunnel offloads via the VyOS config, so I'd have to set up a script that runs on boot and after commits to apply this automatically. A minimal sketch of what that could look like is below.
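
Something along these lines should do it; this is only a sketch, assuming the usual /config/scripts/vyos-postconfig-bootup.script boot hook (a copy could also go under /config/scripts/commit/post-hooks.d/ to re-apply it after commits), and the interface list would of course need to match your own tunnels:

#!/bin/sh
# Work around MPLS-over-GRETAP drops: turn off TSO on the affected
# GRETAP tunnel interfaces so the MPLS path is never handed
# oversized, unsegmentable skbs.
for ifname in tun2; do            # list your GRETAP tunnels here
    ethtool -K "$ifname" tso off
done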

But wait, there’s more.

So far the troubleshooting has covered GRETAP tunnels that originate and terminate within my private network (which runs IS-IS + SR).

I also noticed poor behavior on GRE tunnels that ride on top of my private network but are neither originated nor terminated within it. The symptoms were the same: TX drops on the ingress router's core-facing interface.

For those, the fix was much easier: disabling GRO on the router's ingress interface (not on the core-facing interface), as shown below.
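
Since GRO on Ethernet interfaces is exposed in the VyOS config (unlike the tunnel offloads above), this one can be handled with plain configuration. For example, assuming the transit GRE traffic enters on a hypothetical eth5 that currently has GRO enabled:

configure
delete interfaces ethernet eth5 offload gro
commit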

Changing GRETAP to L2TPv3

I gave this a try, even though it implies creating a dummy interface to work around VyOS bug T1080:

set interfaces dummy dum3 address '192.168.254.254/24'
set interfaces dummy dum3 description 'Bug T1080 Workaround'


set interfaces l2tpv3 l2tpeth10 description 'OSR1FW2 - OSR1BR1 VLAN 543'
set interfaces l2tpv3 l2tpeth10 encapsulation 'ip'
set interfaces l2tpv3 l2tpeth10 mtu '1700'
set interfaces l2tpv3 l2tpeth10 peer-session-id '543'
set interfaces l2tpv3 l2tpeth10 peer-tunnel-id '543'
set interfaces l2tpv3 l2tpeth10 remote '192.168.254.11'
set interfaces l2tpv3 l2tpeth10 session-id '543'
set interfaces l2tpv3 l2tpeth10 source-address '192.168.254.13'
set interfaces l2tpv3 l2tpeth10 tunnel-id '543'

set interfaces bridge br2 member interface l2tpeth10 allowed-vlan '100'
set interfaces bridge br2 member interface l2tpeth10 native-vlan '100'


delete interfaces bridge br2 member interface tun2


It works just fine! I'll have to do this with the rest.

Summary

In short:

  • GRETAP tunnels configured on a VyOS router that then sends the traffic MPLS-encapsulated (MPLS on top of GRE): either disable TSO on the tunnel interface with a script, or change the tunnel to L2TPv3.
  • GRE tunnels ingressing a VyOS router that then encapsulates them with MPLS: disable GRO on the router's ingress interface.