-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"SSH could not read data: Error waiting on socket" when using libgit2 #439
Comments
I've found a solution that worked for me, at least - there Why it worked, I'm not sure - but I happened to stumble across the difference between a working and a non-working cluster. |
I am getting this cloning from localhost git repo with openssh+gitolite. It's not transient every time error waiting on socket. Same GitRepository works perfectly from source-controller. In image-automation-controller this error:
|
I think this is a significant red herring; a few instances now with the Is it possible to get the |
For the record, this occurs with both |
Are there any suggestions on how to debug this further? Any flags to enable debug logging? Or maybe tcpdump? |
Hey everyone! {
"level": "error",
"ts": "2021-09-07T11:35:38.559Z",
"logger": "controller-runtime.manager.controller.imageupdateautomation",
"msg": "Reconciler error",
"reconciler group": "image.toolkit.fluxcd.io",
"reconciler kind": "ImageUpdateAutomation",
"name": "image-update-automation-myrepo",
"namespace": "test-ns",
"error": "unable to clone ssh://[email protected]/repo/myrepo, error: SSH could not read data: Error waiting on socket"
} Any thoughts , how to fix this ? |
I was taking a look at this, and it appears that only way to enable debug logging is to edit the deployment of the @@ -5235,7 +5235,7 @@ spec:
- args:
- --events-addr=http://notification-controller/
- --watch-all-namespaces=true
- - --log-level=info
+ - --log-level=debug
- --log-encoding=json
- --enable-leader-election
env: I'm now running this in a couple of clusters, so hopefully will get more information that can be contributed to this issue. |
I've noted a failure to clone, within only a few minutes of starting one of the pods: {"level":"info","ts":"2021-09-08T08:58:41.670Z","logger":"setup","msg":"starting manager"}
<..>
{"level":"debug","ts":"2021-09-08T09:02:02.075Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"ran updates to working dir","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","working":"/tmp/flux-system-flux-system244015386"}
{"level":"debug","ts":"2021-09-08T09:02:02.097Z","logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"ImageUpdateAutomation","namespace":"flux-system","name":"flux-system","uid":"c9b53014-a6e7-4e65-87f6-7d254c417baa","apiVersion":"image.toolkit.fluxcd.io/v1beta1","resourceVersion":"52286285"},"reason":"info","message":"no updates made"}
{"level":"debug","ts":"2021-09-08T09:02:02.098Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"no changes made in working directory; no commit","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system"}
{"level":"debug","ts":"2021-09-08T09:03:02.134Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"fetching git repository","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","gitrepository":{"namespace":"flux-system","name":"flux-system"}}
{"level":"debug","ts":"2021-09-08T09:03:02.134Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"using git repository ref from .spec.git.checkout","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","ref":{"branch":"main"}}
{"level":"debug","ts":"2021-09-08T09:03:02.134Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"using push branch from .spec.push.branch","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","branch":"main"}
{"level":"debug","ts":"2021-09-08T09:03:02.134Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"attempting to clone git repository","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","gitrepository":{"namespace":"flux-system","name":"flux-system"},"ref":{"branch":"main"},"working":"/tmp/flux-system-flux-system796571057"}
{"level":"debug","ts":"2021-09-08T09:03:03.695Z","logger":"controller-runtime.manager.events","msg":"Normal","object":{"kind":"ImageUpdateAutomation","namespace":"flux-system","name":"flux-system","uid":"c9b53014-a6e7-4e65-87f6-7d254c417baa","apiVersion":"image.toolkit.fluxcd.io/v1beta1","resourceVersion":"52286696"},"reason":"error","message":"unable to clone 'ssh://[email protected]/<GITHUB_GROUP>/<GITOPS_REPO>', error: SSH could not read data: Error waiting on socket"}
{"level":"error","ts":"2021-09-08T09:03:03.709Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"Reconciler error","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://[email protected]/<GITHUB_GROUP>/<GITOPS_REPO>', error: SSH could not read data: Error waiting on socket","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:214"} Interestingly, there were only two time windows during which this complaint was made:
|
image-automation-controller always uses libgit2 for cloning, mainly because the go-git implementation does a shallow clone (and this makes branching difficult to impossible). Did source-controller have any problems when you switched the implementation to libgit2? I gather not -- that's what makes this truly mysterious, because it is running the same code. |
I thought there were issues with both the image automation controller and the source controller, but I've dug through our logging server and found records at the time the change was made. It looks like the problems were with the image-automation-controller, which was immediately failing to clone the repository just minutes after the change was merged. It appears to be the same issue as before, just immediate rather than after some delay. |
Which change was merged -- did I miss a vital clue? |
Sorry, that's my own terrible wording - the change was from |
@squaremo If it's any use, I can provide debug-level logs from a controller pod - I'd prefer to do so directly via Slack, though, as I'm not entirely comfortable posting unredacted logs on a public issue. |
Worth a try! I'm |
OK, thanks to lots of digging through logs from @rjhenry, I think we have established that this happens for source-controller as well, when the GitRepository object has [1] the same "SSH could not read data: Error waiting on socket" error message appears in the |
It's started working for me.. debug logging was really helpful. Ran into next issue -have to put resources in same namespace as GitRepository fluxcd/image-automation-controller#159 |
What changed so that it started working? How was the debug logging helpful? Don't leave us hanging @morganchristiansson :-P |
I'm still sporadically getting the same error.
Because of my namespace issue linked above the automations were correct but not applying anything - with log level debug I was figured this out, I shoud've been able to see this with default info log level too.. I also looked with wireshark/tcpdump but I was only seeing successful ssh connections... So I haven't really solved the problem just discovered it's working with the error... |
Very similar situation here; on a different cluster I administer we had a successful automation. Taking a look at the logs shows (neatly formatted for your viewing pleasure): {
"level": "error",
"ts": "2021-10-04T07:12:30.488Z",
"logger": "controller-runtime.manager.controller.imageupdateautomation",
"msg": "Reconciler error",
"reconciler group": "image.toolkit.fluxcd.io",
"reconciler kind": "ImageUpdateAutomation",
"name": "flux-system",
"namespace": "flux-system",
"error": "unable to clone 'ssh://[email protected]/<redacted>/<redacted>-gitops', error: SSH could not read data: Error waiting on socket"
}
{
"level": "info",
"ts": "2021-10-04T07:56:18.308Z",
"logger": "controller-runtime.manager.controller.imageupdateautomation",
"msg": "pushed commit to origin",
"reconciler group": "image.toolkit.fluxcd.io",
"reconciler kind": "ImageUpdateAutomation",
"name": "flux-system",
"namespace": "flux-system",
"revision": "afc8f467fb59361c55a3f80c94e31cefe96e43d1",
"branch": "main"
}
{
"level": "error",
"ts": "2021-10-04T07:56:26.055Z",
"logger": "controller-runtime.manager.controller.imageupdateautomation",
"msg": "Reconciler error",
"reconciler group": "image.toolkit.fluxcd.io",
"reconciler kind": "ImageUpdateAutomation",
"name": "flux-system",
"namespace": "flux-system",
"error": "unable to clone 'ssh://[email protected]/<redacted>/<redacted>-gitops', error: SSH could not read data: Error waiting on socket"
} No changes were made to the controller, |
You all may want to try out the the latest release of the source-controller ( |
We're still seeing this with:
ssh destination is github.com |
@timja we have a suspicion this is due to underlying C objects not getting freed in time; but given the undocumented state of libgit2/git2go, this is very much guess work. I created #452, which most importantly frees the It would be great if you could give this a test run, as the issue is kind of hard to replicate for us. (If you prefer testing it using the image-automation-controller, please shout, as I can port the changes to such a test image as well). |
@hiddeco If you could port the same changes over to the image-automation-controller, that'd be great - I'd be keen to test this out for you. |
@rjhenry thanks a lot! 🙇🥇 Did a quick port, and the image details can be found at fluxcd/image-automation-controller#238 |
Same error:
Running:
Can we provide anything else to help? |
After having cloned the repository myself, I now have the suspicion it's due to size. Even on my 1Gbps connection, it seems to be relatively slow (but I think the default timeout is
|
Was also wondering if that's an issue, on flux v1 we had to increase some timeouts, will try that out, thanks! |
Hi, no errors for almost 2 days |
Over a time image-automation-controller:0.21.3, can't connect to GitLab with error: |
@uderik did it stop working after the error? IAC should be able to self-heal from such issue and just carry on business as usual. What's the interval you currently have set for the repository that had the issue? |
I had the same issue and tried the new version with the
|
@pjbgf on two clusters (us-east-1 and ap-southeast-1) same issue, many errors with transport close (potentially due to a timeout) and no updates
maybe it has something to do with the location of the clusters , gitlab location eu-central-1 gitrepo interval 1min, timeout 2min see attach for pod crash log update: |
Same for me as observed by @uderik |
problem still present |
@uderik would you mind testing the versions below and confirm whether that fixes your issue?
An user with a similar issue had this resolving. This is related to the in-flight PR: #713 Test images based on version 830771f. |
@pjbgf CrashLoopBackOff log in attach,It works but has already crashed 4 times (in 13 hours) with this errors, on all 3 envs, where i updated images for source/image-automation deployments |
@uderik thanks again for sharing. I managed to reproduce and on my environment the changes seem to have fixed the problem. The new version have some changes we recently merged into main, which would recover git2go/libgit2 panics. Therefore if this would happen again you would see errors on the logs, but no crashes/restarts. The PR is now updated and a new image created for source-controller: Please let me know how you get on. xref: #713 (comment) UPDATE: changed the image with an official source-controller release candidate. |
Using
on GKE results in the same error. Surprisingly it was working for some days without any change. Source is Gitlab. |
@Nosmoht occurrences of this issue tend be intermittent and resolve itself. We have a few changes that should be released soon which should decrease their likelihood. On they are merged/released I will share on this thread. |
@pjbgf any idea when a new release will be published? |
@Nosmoht We are aiming to have a release done between the end of this week and beginning of next. |
@pjbgf i have the error within the image-automation-controller. Does it help if i use another source-controller image? |
@pjbgf image-automation-controller (image-automation-controller:v0.22.2-cacheless and Enabling experimental managed transport")
|
quay.io/paulinhu/image-automation-controller:v0.22.2-cacheless seems to have resolved the issue for me. |
Just tried with unable to clone '#####'
SSH could not read data: Error waiting on socket |
Hi @Nosmoht, we are in the process of getting some improvements merged, which should fix this. Could you try this image |
Hi @aryan9600, tried, SSH error replaced by following: {"level":"error","ts":"2022-06-02T08:22:41.221Z","logger":"controller.imageupdateautomation","msg":"Reconciler error","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"app","namespace":"flux-system","error":"unable to fetch-connect to remote 'ssh://[email protected]/app/app.git': ssh: handshake failed: hostkey could not be verified"} Seems like i need Gitlab's SSH key inside the container. Any easy way to do that? I bootstrap Flux with Terraform using Flux provider. |
I'm not familiar with the terraform provider, but the |
Hi @aryan9600, as it is the |
hello Flux Maintainers!, Thanks a lot for your great work! I wanted to let you know that this issue can be closed and it is solved with the latest Flux 0.31.0 release. I have already tested the latest release and seems that I no longer experience the issue in all of my environments. |
Look's like it's really fixed now, many thanks |
https://cloud-native.slack.com/archives/CLAJ40HV3/p1625133279255100
This is reported in the logs:
.. though apparently not all the time, as
Source controller reportedly manages to clone the repo (all the time?) when set to use libgit2, and changing to an RSA key didn't stop the error messages. EDIT: no, not all the time -- source-controller also fails intermittently, indicating that the problem is in the code in source-controller/pkg that source-controller and image-automation-controller both use.
The text was updated successfully, but these errors were encountered: