[Flaky test] TestPacketCapture e2e test #6815

Open
antoninbas opened this issue Nov 18, 2024 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test.

Comments

@antoninbas
Contributor

Describe the bug

The TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp-timeout e2e test has failed in CI:

=== RUN   TestPacketCapture/testPacketCaptureBasic
    connectivity_test.go:76: Waiting for Pods to be ready and retrieving IPs
    connectivity_test.go:90: Retrieved all Pod IPs: map[client:IPv4(10.244.0.54),IPstrings(10.244.0.54) tcp-server:IPv4(10.244.1.120),IPstrings(10.244.1.120) udp-server:IPv4(10.244.0.55),IPstrings(10.244.0.55)]
=== RUN   TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic
=== RUN   TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp-timeout
=== PAUSE TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp-timeout
=== RUN   TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/non-existing-pod
=== PAUSE TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/non-existing-pod
=== RUN   TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-tcp
=== PAUSE TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-tcp
=== RUN   TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-udp
=== PAUSE TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-udp
=== RUN   TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp
=== PAUSE TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp
=== CONT  TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp-timeout
=== CONT  TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-tcp
=== CONT  TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-udp
=== CONT  TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp
=== CONT  TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/non-existing-pod
=== NAME  TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp-timeout
    packetcapture_test.go:587: CR status not match, actual: {NumberCaptured:12 FilePath:sftp://172.18.0.3:30010/upload/ipv4-icmp-timeout.pcapng Conditions:[{Type:PacketCaptureStarted Status:True LastTransitionTime:2024-11-16 04:27:44 +0000 UTC Reason:Started Message:} {Type:PacketCaptureComplete Status:True LastTransitionTime:2024-11-16 04:27:59 +0000 UTC Reason:Timeout Message:context deadline exceeded} {Type:PacketCaptureFileUploaded Status:True LastTransitionTime:2024-11-16 04:27:59 +0000 UTC Reason:Succeed Message:}]}, expected: {NumberCaptured:10 FilePath:sftp://172.18.0.3:30010/upload/ipv4-icmp-timeout.pcapng Conditions:[{Type:PacketCaptureStarted Status:True LastTransitionTime:2024-11-16 04:27:44.385427248 +0000 UTC m=+2588.139709613 Reason:Started Message:} {Type:PacketCaptureComplete Status:True LastTransitionTime:2024-11-16 04:27:44.385427348 +0000 UTC m=+2588.139709703 Reason:Timeout Message:context deadline exceeded} {Type:PacketCaptureFileUploaded Status:True LastTransitionTime:2024-11-16 04:27:44.385427438 +0000 UTC m=+2588.139709793 Reason:Succeed Message:}]}
=== NAME  TestPacketCapture
    fixtures.go:353: Exporting test logs to '/home/runner/work/antrea/antrea/log/TestPacketCapture/beforeTeardown.Nov16-04-28-19'
    fixtures.go:524: Deleting 'testpacketcapture-b0nmyag3' K8s Namespace
I1116 04:28:22.104562   26823 framework.go:858] Deleting Namespace testpacketcapture-b0nmyag3 took 3.360803ms
--- FAIL: TestPacketCapture (50.42s)
    --- FAIL: TestPacketCapture/testPacketCaptureBasic (38.64s)
        --- FAIL: TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic (0.00s)
            --- PASS: TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-tcp (1.60s)
            --- PASS: TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/non-existing-pod (1.01s)
            --- PASS: TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp (10.27s)
            --- FAIL: TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp-timeout (15.26s)
            --- PASS: TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-udp (21.60s)
@antoninbas antoninbas added kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. labels Nov 18, 2024
@antoninbas
Contributor Author

cc @hangyan

@hangyan
Member

hangyan commented Nov 19, 2024

There is a small time window between when we start the capture and when we apply the filter, so the first few captured packets may be unrelated. This is not easy to address in the current architecture. Quan's last follow-up PR included some improvements here, and the issue seemed to be gone at that time; I haven't been able to reproduce it since then either.

A possible solution is to add an extra layer of checking after we receive each packet (see the sketch below), but that would make the code more complicated.
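
A minimal sketch of what that extra check could look like, assuming the compiled classic BPF program is available as a []bpf.Instruction (bpf.NewVM and VM.Run come from golang.org/x/net/bpf, which is already used to assemble the filter); the matchesFilter helper is hypothetical:

package capture

import (
	"github.com/gopacket/gopacket"
	"golang.org/x/net/bpf"
)

// matchesFilter re-runs the compiled classic BPF program in userspace against
// a captured packet, so that packets which slipped in before the kernel
// filter was attached can be discarded. A return value of 0 from the VM
// means "drop".
func matchesFilter(vm *bpf.VM, pkt gopacket.Packet) bool {
	n, err := vm.Run(pkt.Data())
	return err == nil && n > 0
}

The VM would be built once with bpf.NewVM(inst), and every packet read from the capture channel would be passed through matchesFilter before being written to the pcap file.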

Do you have any thoughts on this? Or should we temporarily disable this test case?

Meanwhile, I will post an issue to the gopacket repo.

@antoninbas
Contributor Author

@hangyan Thanks for the info. Is it possible to implement this workaround on our side: https://natanyellin.com/posts/ebpf-filtering-done-right/. Apparently, this is what libpcap does (or did at the time the post was written?)
We could:

  1. attach a "zero" BPF filter to the socket (doesn't match any packet)
  2. drain the socket
  3. swap to the correct BPF filter
  4. after that, all packets received are guaranteed to match the correct BPF filter

However, I am not 100% sure we have a good way to do the second step (drain the socket). When I look at https://github.com/gopacket/gopacket/blob/v1.3.1/pcapgo/capture.go, it seems that the socket is blocking, which is not ideal. That means that we may have to use the packet source as follows:

func (p *pcapCapture) Capture(ctx context.Context, device string, srcIP, dstIP net.IP, packet *crdv1alpha1.Packet) (chan gopacket.Packet, error) {
	// Compile the BPF filter in advance to reduce the time window between starting the capture and applying the filter.
	inst := compilePacketFilter(packet, srcIP, dstIP)
	klog.V(5).InfoS("Generated bpf instructions for PacketCapture", "device", device, "srcIP", srcIP, "dstIP", dstIP, "packetSpec", packet, "bpf", inst)
	rawInst, err := bpf.Assemble(inst)
	if err != nil {
		return nil, err
	}
	// Assemble a "zero" filter that matches no packets: a single "ret #0" instruction.
	rawInstForZeroFilter, err := bpf.Assemble([]bpf.Instruction{bpf.RetConstant{Val: 0}})
	if err != nil {
		return nil, err
	}

	eth, err := pcapgo.NewEthernetHandle(device)
	if err != nil {
		return nil, err
	}
	if err = eth.SetPromiscuous(false); err != nil {
		return nil, err
	}
	// Install the BPF filter that won't match any packets.
	if err = eth.SetBPF(rawInstForZeroFilter); err != nil {
		return nil, err
	}
	if err = eth.SetCaptureLength(maxSnapshotBytes); err != nil {
		return nil, err
	}

	packetSource := gopacket.NewPacketSource(eth, layers.LinkTypeEthernet, gopacket.WithNoCopy(true))
	packetCh := packetSource.PacketsCtx(ctx)

	// Drain the channel: discard any packets queued before the zero filter took effect.
	for {
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-packetCh:
			// Discard the stale packet and keep draining.
		case <-time.After(50 * time.Millisecond):
			// Timeout: the channel is drained, so the socket is drained.
			// Install the correct BPF filter.
			if err := eth.SetBPF(rawInst); err != nil {
				return nil, err
			}
			return packetCh, nil
		}
	}
}

It would be more elegant if we could call eth.ZeroCopyReadPacketData directly to drain the socket, but I think that would only be possible if the socket was non-blocking.
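
For reference, a hypothetical sketch of draining at the socket level with a non-blocking read, assuming we had access to the underlying AF_PACKET socket fd (pcapgo.EthernetHandle does not currently expose it, so this is illustrative only; unix.Recvfrom and MSG_DONTWAIT are from golang.org/x/sys/unix):

import "golang.org/x/sys/unix"

// drainSocket reads and discards any queued packets without blocking, using
// MSG_DONTWAIT. It returns once the receive queue is empty (EAGAIN).
func drainSocket(fd int) error {
	buf := make([]byte, 65536)
	for {
		_, _, err := unix.Recvfrom(fd, buf, unix.MSG_DONTWAIT)
		if err == unix.EAGAIN || err == unix.EWOULDBLOCK {
			return nil // receive queue is empty
		}
		if err != nil {
			return err
		}
	}
}

With something like this, the zero-filter/drain/swap sequence would not need the 50ms channel timeout heuristic.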

@hangyan
Member

hangyan commented Nov 19, 2024


I will take a look and try this out; it seems promising. Thanks.

libbpf still uses this.

hangyan added a commit to hangyan/antrea that referenced this issue Nov 19, 2024
In PacketCapture, packets which don’t match the target BPF can be
received after the socket is created and before the bpf filter is
applied. This patch uses a zero bpf filter (matches no packet), then
empties out any packets that arrived before the "zero-BPF" filter was
applied. At this point the socket is definitely empty and it can’t
fill up with junk because the zero-BPF is in place. Then we replace
the zero-BPF with the real BPF we want.

Signed-off-by: Hang Yan <[email protected]>
Co-authored-by: Antonin Bas <[email protected]>
@hangyan
Member

hangyan commented Nov 19, 2024

I have created a PR based on your suggestions. It works well, but the test results are confusing compared to the old ones: in roughly 10 out of 15 runs, 1-4 packets are discarded before we apply the 'real' filter. The rate is a bit high; I don't know why we didn't hit this so often before.

@antoninbas
Contributor Author

I have seen the issue quite often, even after Quan's patch.
As long as the fix is working, that's what matters?

antoninbas added a commit that referenced this issue Nov 19, 2024
In PacketCapture, packets which don’t match the target BPF can be
received after the socket is created and before the bpf filter is
applied. This patch uses a zero bpf filter (matches no packet), then
empties out any packets that arrived before the "zero-BPF" filter was
applied. At this point the socket is definitely empty and it can’t
fill up with junk because the zero-BPF is in place. Then we replace
the zero-BPF with the real BPF we want.

Signed-off-by: Hang Yan <[email protected]>
Co-authored-by: Antonin Bas <[email protected]>
@antoninbas
Contributor Author

@hangyan I assume we can close this now that the PR is merged?

@hangyan
Member

hangyan commented Nov 20, 2024

Yes!

@hangyan hangyan closed this as completed Nov 20, 2024
@antoninbas
Contributor Author

antoninbas commented Nov 20, 2024

@hangyan I just noticed https://github.com/antrea-io/antrea/actions/runs/11938753429/job/33277918607. It seems to be the same type of failure, even though we have merged the fix?

Edit: I guess not exactly the same type of failure, as we are actually missing a packet now, and the capture times out...

@hangyan
Member

hangyan commented Nov 21, 2024

I will take a look. My first guess is that it's caused by the time window between when we send the test packets and when we apply the real filter. It shouldn't be a real problem in a real-world case.

hangyan added a commit to hangyan/antrea that referenced this issue Nov 21, 2024
In PacketCapture, packets which don’t match the target BPF can be
received after the socket is created and before the bpf filter is
applied. This patch uses a zero bpf filter (matches no packet), then
empties out any packets that arrived before the "zero-BPF" filter was
applied. At this point the socket is definitely empty and it can’t
fill up with junk because the zero-BPF is in place. Then we replace
the zero-BPF with the real BPF we want.

Signed-off-by: Hang Yan <[email protected]>
Co-authored-by: Antonin Bas <[email protected]>
@antoninbas
Contributor Author

@hangyan I still see frequent e2e test failures. Should we reopen this issue or open a new one?

@hangyan hangyan reopened this Dec 10, 2024
@hangyan
Member

hangyan commented Dec 10, 2024

Reopened this one. I will create a PR to test this and see if there is a solution. Do you think we can temporarily remove this test case and add it back later, once we figure out the root cause?
