[Flaky test] TestPacketCapture e2e test #6815

Open
antoninbas opened this issue Nov 18, 2024 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test.

Comments

@antoninbas
Contributor

Describe the bug

The TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp-timeout e2e test has failed in CI:

=== RUN   TestPacketCapture/testPacketCaptureBasic
    connectivity_test.go:76: Waiting for Pods to be ready and retrieving IPs
    connectivity_test.go:90: Retrieved all Pod IPs: map[client:IPv4(10.244.0.54),IPstrings(10.244.0.54) tcp-server:IPv4(10.244.1.120),IPstrings(10.244.1.120) udp-server:IPv4(10.244.0.55),IPstrings(10.244.0.55)]
=== RUN   TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic
=== RUN   TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp-timeout
=== PAUSE TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp-timeout
=== RUN   TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/non-existing-pod
=== PAUSE TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/non-existing-pod
=== RUN   TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-tcp
=== PAUSE TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-tcp
=== RUN   TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-udp
=== PAUSE TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-udp
=== RUN   TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp
=== PAUSE TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp
=== CONT  TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp-timeout
=== CONT  TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-tcp
=== CONT  TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-udp
=== CONT  TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp
=== CONT  TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/non-existing-pod
=== NAME  TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp-timeout
    packetcapture_test.go:587: CR status not match, actual: {NumberCaptured:12 FilePath:sftp://172.18.0.3:30010/upload/ipv4-icmp-timeout.pcapng Conditions:[{Type:PacketCaptureStarted Status:True LastTransitionTime:2024-11-16 04:27:44 +0000 UTC Reason:Started Message:} {Type:PacketCaptureComplete Status:True LastTransitionTime:2024-11-16 04:27:59 +0000 UTC Reason:Timeout Message:context deadline exceeded} {Type:PacketCaptureFileUploaded Status:True LastTransitionTime:2024-11-16 04:27:59 +0000 UTC Reason:Succeed Message:}]}, expected: {NumberCaptured:10 FilePath:sftp://172.18.0.3:30010/upload/ipv4-icmp-timeout.pcapng Conditions:[{Type:PacketCaptureStarted Status:True LastTransitionTime:2024-11-16 04:27:44.385427248 +0000 UTC m=+2588.139709613 Reason:Started Message:} {Type:PacketCaptureComplete Status:True LastTransitionTime:2024-11-16 04:27:44.385427348 +0000 UTC m=+2588.139709703 Reason:Timeout Message:context deadline exceeded} {Type:PacketCaptureFileUploaded Status:True LastTransitionTime:2024-11-16 04:27:44.385427438 +0000 UTC m=+2588.139709793 Reason:Succeed Message:}]}
=== NAME  TestPacketCapture
    fixtures.go:353: Exporting test logs to '/home/runner/work/antrea/antrea/log/TestPacketCapture/beforeTeardown.Nov16-04-28-19'
    fixtures.go:524: Deleting 'testpacketcapture-b0nmyag3' K8s Namespace
I1116 04:28:22.104562   26823 framework.go:858] Deleting Namespace testpacketcapture-b0nmyag3 took 3.360803ms
--- FAIL: TestPacketCapture (50.42s)
    --- FAIL: TestPacketCapture/testPacketCaptureBasic (38.64s)
        --- FAIL: TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic (0.00s)
            --- PASS: TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-tcp (1.60s)
            --- PASS: TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/non-existing-pod (1.01s)
            --- PASS: TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp (10.27s)
            --- FAIL: TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-icmp-timeout (15.26s)
            --- PASS: TestPacketCapture/testPacketCaptureBasic/testPacketCaptureBasic/ipv4-udp (21.60s)
@antoninbas antoninbas added kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. labels Nov 18, 2024
@antoninbas
Contributor Author

cc @hangyan

@hangyan
Member

hangyan commented Nov 19, 2024

There is a small time window between when we start the capture and when we apply the filter, so the first few captured packets may be unrelated. This is not easy to address in the current architecture. Quan's last follow-up PR included some improvements here, and the issue seemed to be gone at that time; I haven't been able to reproduce it since then either.

A possible solution is to add an extra layer of checking after we receive each packet (see the sketch below), but that would make the code more complicated.
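
A minimal sketch of what that extra check could look like, assuming the compiled classic BPF program is available as a []bpf.Instruction (bpf.NewVM and VM.Run come from golang.org/x/net/bpf, which is already used to assemble the filter); the matchesFilter helper is hypothetical:

package capture

import (
	"github.com/gopacket/gopacket"
	"golang.org/x/net/bpf"
)

// matchesFilter re-runs the compiled classic BPF program in userspace against
// a captured packet, so that packets which slipped in before the kernel
// filter was attached can be discarded. A return value of 0 from the VM
// means "drop".
func matchesFilter(vm *bpf.VM, pkt gopacket.Packet) bool {
	n, err := vm.Run(pkt.Data())
	return err == nil && n > 0
}

The VM would be built once with bpf.NewVM(inst), and every packet read from the capture channel would be passed through matchesFilter before being written to the pcap file.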

Do you have any thoughts on this? Or should we temporarily disable this test case?

Meanwhile, I will post an issue to the gopacket repo.

@antoninbas
Contributor Author

@hangyan Thanks for the info. Is it possible to implement this workaround on our side: https://natanyellin.com/posts/ebpf-filtering-done-right/. Apparently, this is what libpcap does (or did at the time the post was written?)
We could:

  1. attach a "zero" BPF filter to the socket (doesn't match any packet)
  2. drain the socket
  3. swap to the correct BPF filter
  4. after that, all packets received are guaranteed to match the correct BPF filter

However, I am not 100% sure we have a good way to do the second step (drain the socket). When I look at https://github.com/gopacket/gopacket/blob/v1.3.1/pcapgo/capture.go, it seems that the socket is blocking, which is not ideal. That means that we may have to use the packet source as follows:

func (p *pcapCapture) Capture(ctx context.Context, device string, srcIP, dstIP net.IP, packet *crdv1alpha1.Packet) (chan gopacket.Packet, error) {
	// Compile the BPF filter in advance to reduce the time window between starting the capture and applying the filter.
	inst := compilePacketFilter(packet, srcIP, dstIP)
	klog.V(5).InfoS("Generated bpf instructions for PacketCapture", "device", device, "srcIP", srcIP, "dstIP", dstIP, "packetSpec", packet, "bpf", inst)
	rawInst, err := bpf.Assemble(inst)
	if err != nil {
		return nil, err
	}
	// Assemble a "zero" filter that matches no packets: a single "ret #0" instruction.
	rawInstForZeroFilter, err := bpf.Assemble([]bpf.Instruction{bpf.RetConstant{Val: 0}})
	if err != nil {
		return nil, err
	}

	eth, err := pcapgo.NewEthernetHandle(device)
	if err != nil {
		return nil, err
	}
	if err = eth.SetPromiscuous(false); err != nil {
		return nil, err
	}
	// Install the BPF filter that won't match any packets.
	if err = eth.SetBPF(rawInstForZeroFilter); err != nil {
		return nil, err
	}
	if err = eth.SetCaptureLength(maxSnapshotBytes); err != nil {
		return nil, err
	}

	packetSource := gopacket.NewPacketSource(eth, layers.LinkTypeEthernet, gopacket.WithNoCopy(true))
	packetCh := packetSource.PacketsCtx(ctx)

	// Drain the channel: discard any packets queued before the zero filter took effect.
	for {
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-packetCh:
			// Discard the stale packet and keep draining.
		case <-time.After(50 * time.Millisecond):
			// Timeout: the channel is drained, so the socket is drained.
			// Install the correct BPF filter.
			if err := eth.SetBPF(rawInst); err != nil {
				return nil, err
			}
			return packetCh, nil
		}
	}
}

It would be more elegant if we could call eth.ZeroCopyReadPacketData directly to drain the socket, but I think that would only be possible if the socket was non-blocking.
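
For reference, a hypothetical sketch of draining at the socket level with a non-blocking read, assuming we had access to the underlying AF_PACKET socket fd (pcapgo.EthernetHandle does not currently expose it, so this is illustrative only; unix.Recvfrom and MSG_DONTWAIT are from golang.org/x/sys/unix):

import "golang.org/x/sys/unix"

// drainSocket reads and discards any queued packets without blocking, using
// MSG_DONTWAIT. It returns once the receive queue is empty (EAGAIN).
func drainSocket(fd int) error {
	buf := make([]byte, 65536)
	for {
		_, _, err := unix.Recvfrom(fd, buf, unix.MSG_DONTWAIT)
		if err == unix.EAGAIN || err == unix.EWOULDBLOCK {
			return nil // receive queue is empty
		}
		if err != nil {
			return err
		}
	}
}

With something like this, the zero-filter/drain/swap sequence would not need the 50ms channel timeout heuristic.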

@hangyan
Member

hangyan commented Nov 19, 2024


I will take a look and try this out; it seems promising. Thanks.

libbpf still uses this.

hangyan added a commit to hangyan/antrea that referenced this issue Nov 19, 2024
In PacketCapture, packets which don’t match the target BPF can be
received after the socket is created and before the bpf filter is
applied. This patch uses a zero bpf filter (matches no packet), then
empties out any packets that arrived before the "zero-BPF" filter was
applied. At this point the socket is definitely empty and it can’t
fill up with junk because the zero-BPF is in place. Then we replace
the zero-BPF with the real BPF we want.

Signed-off-by: Hang Yan <[email protected]>
Co-authored-by: Antonin Bas <[email protected]>
@hangyan
Member

hangyan commented Nov 19, 2024

I have created a PR based on your suggestions. It works well, but the test results are confusing compared to the old ones: in roughly 10 out of 15 runs, 1-4 packets are discarded before we apply the 'real' filter. The rate is a bit high; I don't know why we didn't hit this so often before.

@antoninbas
Contributor Author

I have seen the issue quite often, even after Quan's patch.
As long as the fix is working, that's what matters?

antoninbas added a commit that referenced this issue Nov 19, 2024
In PacketCapture, packets which don’t match the target BPF can be
received after the socket is created and before the bpf filter is
applied. This patch uses a zero bpf filter (matches no packet), then
empties out any packets that arrived before the "zero-BPF" filter was
applied. At this point the socket is definitely empty and it can’t
fill up with junk because the zero-BPF is in place. Then we replace
the zero-BPF with the real BPF we want.

Signed-off-by: Hang Yan <[email protected]>
Co-authored-by: Antonin Bas <[email protected]>
@antoninbas
Contributor Author

@hangyan I assume we can close this now that the PR is merged?

@hangyan
Member

hangyan commented Nov 20, 2024

Yes!

@hangyan hangyan closed this as completed Nov 20, 2024
@antoninbas
Contributor Author

antoninbas commented Nov 20, 2024

@hangyan I just noticed https://github.com/antrea-io/antrea/actions/runs/11938753429/job/33277918607. It seems to be the same type of failure, even though we have merged the fix?

Edit: I guess not exactly the same type of failure, as we are actually missing a packet now, and the capture times out...

@hangyan
Member

hangyan commented Nov 21, 2024

I will take a look. My first guess is that it's caused by the time window between when we send the test packets and when we apply the real filter. It shouldn't be a real problem in a real-world case.

hangyan added a commit to hangyan/antrea that referenced this issue Nov 21, 2024
In PacketCapture, packets which don’t match the target BPF can be
received after the socket is created and before the bpf filter is
applied. This patch uses a zero bpf filter (matches no packet), then
empties out any packets that arrived before the "zero-BPF" filter was
applied. At this point the socket is definitely empty and it can’t
fill up with junk because the zero-BPF is in place. Then we replace
the zero-BPF with the real BPF we want.

Signed-off-by: Hang Yan <[email protected]>
Co-authored-by: Antonin Bas <[email protected]>
@antoninbas
Contributor Author

@hangyan I still see frequent e2e test failures. Should we reopen this issue or open a new one?

@hangyan hangyan reopened this Dec 10, 2024
@hangyan
Member

hangyan commented Dec 10, 2024

Reopened this one. I will create a PR to test this and see if there is a solution. Do you think we can temporarily remove this test case and add it back later, once we figure out the root cause?
