Startup hang caused by the deployment of New Relic #2809
Comments
Places I can find that match to 120 seconds:- newrelic-dotnet-agent/src/Agent/NewRelic/Agent/Core/Commands/ThreadProfilerCommandArgs.cs Line 16 in 709ddb6
newrelic-dotnet-agent/src/Agent/NewRelic/Agent/Core/DataTransport/ConnectionManager.cs Line 33 in 709ddb6
newrelic-dotnet-agent/src/Agent/NewRelic/Agent/Core/DataTransport/DataStreamingService.cs Line 160 in 709ddb6
newrelic-dotnet-agent/src/Agent/NewRelic/Agent/Core/AgentHealth/AgentHealthReporter.cs Line 25 in 709ddb6
I don't know the architecture of the agent (yet), but I guess the most likely candidate is whichever code path is invoked during start-up. My suspicion initially falls on newrelic-dotnet-agent/src/Agent/NewRelic/Agent/Core/DataTransport/ConnectionManager.cs Line 79 in 709ddb6
As we keep the default settings, I think we would fall into the sync path (and would therefore block). |
Hmmm, https://docs.newrelic.com/docs/apm/agents/net-agent/configuration/net-agent-configuration/#service-syncStartup says it defaults to false. So does that mean it is https://docs.newrelic.com/docs/apm/agents/net-agent/configuration/net-agent-configuration/#service-requestTimeout which implies that it is 120 seconds as we keep |
Because you are not enabling send data on exit, or sync startup, the agent handles its connection logic on a separate thread pool thread via a timer that should be triggered without a delay. newrelic-dotnet-agent/src/Agent/NewRelic/Agent/Core/DataTransport/ConnectionManager.cs Line 82 in 709ddb6
I don't think any of the other duration settings that you found in the code are related to this problem, because those things with the exception of the connect step (which is happening on a different thread), all require the agent to complete the connect process first. In order to understand what is going on we will most likely need a memory dump from a time when the application fails to start, and we would also likely need Most of the startup hang/deadlock issues that I've seen are related to applications using a Configuration Manager that needs to reach out to an external system. Usually these calls to the external system are something that the agent also instruments. We have a default list of known instrumented methods that have caused this problem. newrelic-dotnet-agent/src/Agent/NewRelic/Agent/Core/AgentShim.cs Lines 43 to 49 in 709ddb6
We also allow extending this list using an environment variable.
The difficult part will be determining what to add to that list. In the linked aspnetcore issue you provided a sample startup.cs.txt file, and in that file it looks like you might be using a postgresql database with the configuration manager. If that's the case you may need to add the classes and methods defined in the sql instrumentation xml associated to the Postgres comments. Here is an example of one of those sections. newrelic-dotnet-agent/src/Agent/NewRelic/Agent/Extensions/Providers/Wrapper/Sql/Instrumentation.xml Lines 140 to 149 in 709ddb6
|
Hi @nrcventura, appreciate the detailed reply.
I would love to get a memory dump. Unfortunately, this is pretty hard because we only know it's hung after the deploy has failed and the process has been automatically killed off by IIS because it is deemed "unhealthy". I can change pretty much anything, but it's not something I can do quickly as it would affect the entire CI/CD pipeline. I did however push out the change to the logging late on Friday, and this morning it has happened twice already. <log level="all" maxLogFileSizeMB="100"/> Last log:-
Timeout log:-
First log:-
Thanks, that gives me a thread to pull on. We do store our config keys etc in PG and all apps reach out to PG at startup time to get various details, settings etc. I'll try adding each method one by one to I've attached the relevant logs from the most recent failure. newrelic_agent_Allium.WebApp_099_31444.zip Cheers, |
I've pushed out
[System.Environment]::SetEnvironmentVariable('NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST', 'Npgsql.NpgsqlCommand.ExecuteReader', [System.EnvironmentVariableTarget]::Machine)
to all of our DEMO hosts. I guess if we don't get another error for a week we can be somewhat confident it might be that. Hard to prove. |
Thanks for the update. Yes, these types of problems are very difficult to reproduce. You might be able to reproduce the problem more reliably if you enabled syncStartup which is one of the settings that you previously found. I suspect that you may also need the NpgSql.Connection.Open method. |
Just happened again, updated environment variable to:- [System.Environment]::SetEnvironmentVariable('NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST', 'Npgsql.NpgsqlCommand.ExecuteReader;Npgsql.NpgsqlConnection.Open', [System.EnvironmentVariableTarget]::Machine) |
Happened twice this morning, updated environment variable to:-
$methods = @(
"Npgsql.NpgsqlConnection.Open"
"Npgsql.NpgsqlCommand.ExecuteReader"
"Npgsql.NpgsqlDataReader.Read"
"Npgsql.NpgsqlConnection.OpenAsync"
"Npgsql.NpgsqlCommand.ExecuteReaderAsync"
"Npgsql.NpgsqlDataReader.ReadAsync"
);
$concat = $methods -join ';';
[System.Environment]::SetEnvironmentVariable("NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST", $concat, [System.EnvironmentVariableTarget]::Machine);
I'm not convinced it (the hang on start-up) is fetching the configuration from PG, so I'm including all methods (that can be used) on that path. This should help us either rule it out or in. Back to the waiting game! Cheers, |
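For anyone following along, the value built above is simply the method names joined with `;`. Below is a minimal Python sketch of the same join with a couple of sanity checks added. The helper name `build_delay_list` and the checks are illustrative assumptions; they are not part of the agent or its tooling, and the sketch does not set the environment variable itself.

```python
def build_delay_list(methods):
    """Join method names with ';' the way the PowerShell snippet does.

    Hypothetical helper: blank entries are skipped, duplicates are dropped
    (first occurrence wins), and a name containing ';' is rejected because
    it would silently split into two entries.
    """
    cleaned = []
    for name in methods:
        name = name.strip()
        if not name:
            continue
        if ";" in name:
            raise ValueError(f"method name must not contain ';': {name!r}")
        if name not in cleaned:
            cleaned.append(name)
    return ";".join(cleaned)


if __name__ == "__main__":
    # Same names as in the comment above, sync variants only for brevity.
    print(build_delay_list([
        "Npgsql.NpgsqlConnection.Open",
        "Npgsql.NpgsqlCommand.ExecuteReader",
        "Npgsql.NpgsqlDataReader.Read",
    ]))
```

Keeping the list in one place like this makes it easier to add or remove a method between test runs without hand-editing a long string.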
Tiny update from me;
Cheers, |
If you are not seeing the problem when syncStartup is enabled, then I'm guessing that you are not running into the deadlock problem that we've seen in the past. I opened an issue to add a feature to just disable the usage of the ConfigurationManager altogether so that we can more easily rule that problem in or out before dedicating more time to identify what to add to that delayed initialization list. I have seen other weird cases in certain environments where enabling agent logs higher than debug or info level has slowed down application startup enough (in those environments) to trigger the default aspnetcore app startup timeout. I don't think that this is likely to be the problem that you are experiencing, but https://docs.newrelic.com/docs/apm/agents/net-agent/configuration/net-agent-configuration/#log-enabled explains how to disable our logging system (this does not affect capturing and forwarding agent log data). Another situation where we have seen deadlocks occur is within the agent's usage of the .net Event Pipe which we use to collect garbage collection metrics for the application. The deadlock that we've seen occurs (only occasionally) when we try to unsubscribe our EventListener from the Event Pipe EventSource. The collection of this data can be disabled, and this has resolved some problems similar to this (usually noticed when the application is trying to end). To accomplish that you can add |
So, I don't want to jinx anything, but since I added:-
$methods = @(
"Npgsql.NpgsqlConnection.Open"
"Npgsql.NpgsqlCommand.ExecuteReader"
"Npgsql.NpgsqlDataReader.Read"
"Npgsql.NpgsqlConnection.OpenAsync"
"Npgsql.NpgsqlCommand.ExecuteReaderAsync"
"Npgsql.NpgsqlDataReader.ReadAsync"
);
$concat = $methods -join ';';
[System.Environment]::SetEnvironmentVariable("NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST", $concat, [System.EnvironmentVariableTarget]::Machine);
we've not had a start-up hang. I'm not saying that I think this is solved, but as time goes on my confidence does increase 🤞🏽 Cheers, |
Small update, not seen the exact same start-up hang. So I'm getting closer to saying it looks like this particular issue (getting external configuration from pg) is resolved. Right now NR is only deployed to our DEMO environment - the real test will be getting it to our LIVE environment (the hang was more frequent when we first attempted NR deployment last year). If we get to the end of the week without another start-up hang, I'll spend next week getting onto a LIVE environment and see what happens. Tangentially; we've had two instances of the same application hanging after the app had booted and warming up other external services (in this case BigQuery/GetTable). I don't think it is NR, but I don't really believe in coincidences so I thought I'd mention it here, just in case I need to keep the issue open a bit longer. Cheers, |
Thank you for the updates. I'm not aware of any problems with someone using BigQuery, but we also do not currently support out of the box instrumentation for BigQuery, so I do not know what could be the problem there. |
Argh! The start-up hang has happened again. From event viewer:-
I have a
Now looking into the other avenues of investigation. Cheers, |
The profiler log is only going to have data for the profiler, but we also will want to see the managed agent logs. Managed agent logs can contain information that is potentially sensitive, especially at higher log levels. The best way for us to get these files would be to work with our New Relic support team since they have the ability to receive these types of files without having them shared publicly on github. |
Hi @jaffinito, Can you clarify what you mean by managed agent logs? I believe I uploaded those over here. I do review the logs before I upload them. I can open a ticket through our NR account so that we can share logs if something sensitive is in them. Cheers, |
Happened again tonight at
There is only a profiler log, there aren't any agent logs that pertain to 7804. Edit* I've not had the opportunity to pursue the other avenues of investigation. Cheers, |
This is now done. I've modified
<?xml version="1.0"?>
<!-- Copyright (c) 2008-2020 New Relic, Inc. All rights reserved. -->
<!-- For more information see: https://docs.newrelic.com/docs/agents/net-agent/configuration/net-agent-configuration/ -->
<configuration xmlns="urn:newrelic-config" agentEnabled="true">
<appSettings>
<add key="NewRelic.EventListenerSamplersEnabled" value="false"/>
</appSettings>
<service licenseKey="REDACTED"/>
<application/>
<log level="all" maxLogFileSizeMB="100"/>
<allowAllHeaders enabled="true"/>
<attributes enabled="true">
<exclude>request.headers.cookie</exclude>
<exclude>request.headers.authorization</exclude>
<exclude>request.headers.proxy-authorization</exclude>
<exclude>request.headers.x-*</exclude>
<include>request.headers.*</include>
</attributes>
<transactionTracer enabled="false"/>
<distributedTracing enabled="false"/>
<errorCollector enabled="false"/>
<browserMonitoring autoInstrument="false"/>
<threadProfiling>
<ignoreMethod>System.Threading.WaitHandle:InternalWaitOne</ignoreMethod>
<ignoreMethod>System.Threading.WaitHandle:WaitAny</ignoreMethod>
</threadProfiling>
<applicationLogging enabled="false"/>
<utilization detectAws="false" detectAzure="false" detectGcp="false" detectPcf="false" detectDocker="false" detectKubernetes="false"/>
<slowSql enabled="false"/>
<distributedTracing enabled="false"/>
<codeLevelMetrics enabled="false"/>
</configuration>
I confirmed that changing the config did something by logs:-
On the topic of environment variables does Found via grokking the sdk src:- newrelic-dotnet-agent/src/Agent/NewRelic/Agent/Core/Configuration/DefaultConfiguration.cs Line 1666 in a77919e
Cheers, |
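Since the setting lives inside a namespaced XML file, it is easy to query it with a plain path and get nothing back. Below is a minimal Python sketch of reading the `appSettings` entries from a config shaped like the one above. The function names are hypothetical, and this only reports what is written in the file; it says nothing about how the agent interprets the value at runtime.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute on <configuration> in the config above.
NS = {"nr": "urn:newrelic-config"}


def read_app_settings(config_xml):
    """Return {key: value} for every <add> element under <appSettings>."""
    root = ET.fromstring(config_xml)
    return {
        add.get("key"): add.get("value")
        for add in root.findall("nr:appSettings/nr:add", NS)
    }


def samplers_disabled(config_xml):
    """True only if the appSetting is present and set to 'false' (any case)."""
    value = read_app_settings(config_xml).get("NewRelic.EventListenerSamplersEnabled")
    return value is not None and value.strip().lower() == "false"


if __name__ == "__main__":
    # Trimmed-down stand-in for the real file, used only for illustration.
    sample = (
        '<configuration xmlns="urn:newrelic-config" agentEnabled="true">'
        "<appSettings>"
        '<add key="NewRelic.EventListenerSamplersEnabled" value="false"/>'
        "</appSettings>"
        "</configuration>"
    )
    print(samplers_disabled(sample))
```

Without the `nr:` prefix and the namespace map, `findall("appSettings/add")` would return an empty list for this file, which can look like the setting is missing.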
The |
Happened again - I should explain that I've hijacked the CI/CD pipeline on weekday evenings ( What's interesting is that a Server Recycle is basically iisreset + hits the urls to warmup the apps. The binaries don't change. Before I added all the PG methods we used to get the same startup hang on actual deploys too (e.g. the binaries have changed and we don't use iisreset, we just kill the respective pid of that w3wp.exe). I'm beginning to think that we are playing whack-a-mole on a combination of our platform code + NR causes a start-up hang some of the time. I think our current state of affairs is that when we do an iisreset and then warmup the apps we run the risk of the start-up hang. To prove that I'll take out the iisreset in our Server Recycle and see what happens. Cheers, |
I finally replicated it on LOCAL. THANK GOD. We have a small console app that is invoked by our OMD/nagios setup every 1-5 seconds (I think), it looks like this (I've appended .txt to make it upload-able). I leave that running in the background using powershell:-
while($true) {Start-Sleep -Seconds 1; .\CWDotNetDiagClient.exe;}
And then I simulate what a deploy/server recycle does with this (this is copied directly from the scripts, a lot of it is many years old):-
$WebApplicationCreatorConfigurationDirectory = "D:\code\CodeweaversWorld\src\WebApplicationCreator\LocalConfiguration";
$WebApplicationCreatorExecutable = "D:\code\CodeweaversWorld\src\WebApplicationCreator\app\IISApplicationExecutor\bin\IISApplicationExecutor.exe";
$Timeout = 480;
$AbsoluteUrl = "http://datamining.localiis/services/application.svc/warmup";
while ($true) {
Get-WmiObject -Namespace 'root\WebAdministration' -class 'WorkerProcess' -Filter "NOT AppPoolName LIKE '%servicemonitor%'" `
| Select-Object -Property AppPoolName, ProcessId `
| ForEach-Object {
Write-Host "Killing '$($_.AppPoolName)' with pid '$($_.ProcessId)'";
Stop-Process -Id $_.ProcessId -Force -ErrorAction SilentlyContinue;
};
& "C:\Windows\System32\iisreset.exe" /stop ;
& "$WebApplicationCreatorExecutable" "ZZZ_PLACEHOLDER" "RESET" "$WebApplicationCreatorConfigurationDirectory" "NOPROMPT";
& "C:\Windows\System32\iisreset.exe" /start ;
Start-Sleep -Seconds 2;
& "$WebApplicationCreatorExecutable" "ZZZ_PLACEHOLDER" "ALL" "$WebApplicationCreatorConfigurationDirectory" "NOPROMPT";
Invoke-WebRequest -Uri $AbsoluteUrl -TimeoutSec $Timeout -UseBasicParsing -DisableKeepAlive -MaximumRedirection 0;
}
The clue from re-reading the entire issue thread was:-
Many thanks @nrcventura for that nugget, it took a while to percolate. I really should get to bed, maybe tonight I won't dream about this issue. Cheerio! |
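The loop above can also be expressed as a small harness that counts attempts until the warm-up request fails, which is handy for reporting "it took N tries". Below is a minimal Python sketch under stated assumptions: `restart` is a caller-supplied callable standing in for the kill/iisreset steps, `http_probe` and `attempts_until_hang` are hypothetical names, and a refused connection is treated the same as a timeout.

```python
import urllib.error
import urllib.request


def http_probe(url, timeout_s):
    """Return True if the server answered at all within timeout_s seconds.

    An HTTP error status still counts as "answered"; a timeout or refused
    connection counts as a failed warm-up in this sketch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s):
            return True
    except urllib.error.HTTPError:
        return True   # the app responded, so it is not hung
    except OSError:
        return False  # URLError and socket timeouts are OSError subclasses


def attempts_until_hang(restart, probe, max_attempts):
    """Call restart() then probe() repeatedly.

    Returns the 1-based attempt number of the first failed probe, or None
    if every attempt warmed up successfully.
    """
    for attempt in range(1, max_attempts + 1):
        restart()
        if not probe():
            return attempt
    return None
```

In a real run, `restart` would shell out to the recycle steps (for example with `subprocess.run`) and `probe` would be something like `lambda: http_probe(warmup_url, 480)`, matching the timeout used in the script. Stopping at the first failure leaves the hung process alive long enough to grab a memory dump.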
Managed to replicate it a 2nd & 3rd time (took 185 tries) on LOCAL, this time I was quick enough to grab a memory dump. I'll share them on Monday via the NR support ticket route. Meanwhile I had a quick poke around using dotnet dump:-
A quick Google around leads me to So it looks like when we connect to the event pipe just as the app is initialized it can cause a hang. What is weird is why the deployment of NR causes the start-up hang for us. That console app we've been using has been in place since 2023-03 and we've never seen the hang without NR agent(s) being deployed. Cheers, |
A lot of the code in the DiagnosticSource library tries to do as little work as possible if there are no subscribers to a particular source (DiagnosticSource, EventPipe, ActivitySource, Meter, etc.). The New Relic agent enables an EventListener by default to subscribe to the EventPipe datasource in order to capture Garbage Collection metrics (this was the primary way to capture that type of data for .net core < 3.0, which is when we started collecting that data). When .net standard 2.1 came out, and .net core 3.1 was released, more apis became available for collecting garbage collector data, but the agent still needed to support older applications. Similarly, .net 6 has a better way to collect this data, but those apis are not readily available unless the agent can stop targeting .net standard 2.0 and switch over to targeting .net 6 or .net 8 directly (which would be a breaking change requiring a major release). In the meantime, I've seen the eventpipe datasource (when subscribed to from within an application instead of out-of-process) have its stability increase and decrease with different .net versions. At this point, we do have plans to migrate away from the eventpipe for collecting garbage collection data, but it will result in a breaking change because the data we will be able to capture will be different. We also would like to align the data that we capture with OpenTelemetry which is what .net is increasing its alignment with in general. Hopefully, the problem you are experiencing goes away when the event listener samplers are disabled. |
Sorry for the delay in sharing the memory dump - we are a part of a much larger organisation and it turns out that our child accounts (4721360, 4721361) in NR don't have "case message support". Trying to get that sorted out. This may take a while 😬
Thanks, I'll try
Ah, we were wondering why dotnet counters weren't collected, so this is on the roadmap? As it may block the NR rollout. We need dotnet counters support. Cheers, |
dotnet counters was a dotnet tool that was released around the same time that the In order to understand your question about dotnet counters support we need a little more information.
|
Yeah, that was a painful deprecation for us. We did try and complain about it over here.
Currently we have around 30 custom
We would expect the standard set emitted by the Core CLR at the very least (and these).
Very much so, that is exactly what we want/need. The backstory here is that we tried NR a while back, it didn't go well (because of this issue). So we had to fall-back to otel/grafana, this has proved troublesome for unrelated reasons. Our parent company is huge, and they have Enterprise level contract with NR, so we decided to give NR another go; and that's when I realised it was the NR deployment causing the start-up hang. We very much want to align with the parent company but missing the counters from the Core CLR would very much delay the rollout until that was supported/collected by the NR agent. Hopefully that makes sense, I'm battling a cold 🤧 Cheers, |
Thank you for this feedback. This confirms my assumptions about the type of support that the New Relic agent should have. |
Quick (bad) update; I tried with I know there were other suggestions in this thread; I'll go through them later this week/next week. Cheers, |
Tried with The quest continues :) Edit* I did check but we still don't have access to case message support; I'll update as soon as I do. Cheers, |
I've opened a case under account Cheers, |
Thanks @indy-singh - We've received the memory dumps and we're analyzing them now. Initially I'm not seeing any smoking gun, but we'll keep digging and let you know what we find. |
@indy-singh We've looked at the memory dump and it doesn't look like the managed .NET agent ( Our next release (v10.33.0) will include an optional replacement for the current EventListener-based garbage collection sampler. If you're able, please re-test with that version (should be released later this week) and let us know if you're still seeing the deadlock. If it does happen, that would be a more definitive pointer to something in our profiler. One additional test you could try now would be to remove all of the instrumentation |
To be more specific about the profiler settings I'm wondering about, we have code at newrelic-dotnet-agent/src/Agent/NewRelic/Profiler/Profiler/CorProfilerCallbackImpl.h Lines 874 to 875 in 2460527
newrelic-dotnet-agent/src/Agent/NewRelic/Profiler/Profiler/CorProfilerCallbackImpl.h Line 385 in 2460527
In that memory dump, it looked like something other than the New Relic agent was trying to interact with the event pipe, and causing the deadlock, but it wasn't clear to us which code was trying to interact with the event pipe. As @tippmar-nr said, it did not look like it was the New Relic agent trying to interact with the event pipe, because the managed code part of the agent was not yet listed in the loaded modules list. So the only part of the agent that was running was the native code (which does not interact with the event pipe). We may need someone from Microsoft to assist with looking at that memory dump to see if they can figure out why there is a deadlock with that event pipe usage. Since the memory dump is from your application, it would be better for you to follow up with Microsoft on analyzing that memory dump, if you choose to do so. We're not trying to pass this off to them, we just can't open a case/issue with them on your behalf. We have seen some very weird behavior that was made reproducible by the runtime settings set by the profiler, or the increase in memory allocation leading to data being cleaned up sooner, that Microsoft was able to diagnose by looking at the memory dumps. |
Hi both, Thanks for taking the time to look at the dumps.
Will do both of these 👍🏽
Yes that other code will have been the CWDotNetDiagClient that we have on our DEMO/LIVE servers to collect the aforementioned dotnet counters. I'm of the opinion the deadlock is triggered when things line up just right between NR and CWDotNetDiagClient. It's entirely possible that CWDotNetDiagClient is doing something naughty and not being a nice neighbour. I've been chipping away at the reproduction harness to make the deadlock more prevalent but it is proving a tricky beast to tame. I'll chase up about Microsoft support; someone in the org chart will have a support contract I imagine. Updates from me might be a bit more sporadic - currently recovering from a short hospital stay. Cheers, |
Sorry for the huge delay. Just to let you know I'm back on this. Updated to
Config now looks like this:-
<?xml version="1.0"?>
<!-- Copyright (c) 2008-2020 New Relic, Inc. All rights reserved. -->
<!-- For more information see: https://docs.newrelic.com/docs/agents/net-agent/configuration/net-agent-configuration/ -->
<configuration xmlns="urn:newrelic-config" agentEnabled="true">
<appSettings>
<add key="NewRelic.EventListenerSamplersEnabled" value="false"/>
</appSettings>
<service licenseKey="REDACTED"/>
<application/>
<log level="off" enabled="false"/>
<allowAllHeaders enabled="true"/>
<attributes enabled="true">
<exclude>request.headers.cookie</exclude>
<exclude>request.headers.authorization</exclude>
<exclude>request.headers.proxy-authorization</exclude>
<exclude>request.headers.x-*</exclude>
<include>request.headers.*</include>
</attributes>
<transactionTracer enabled="false"/>
<distributedTracing enabled="false"/>
<errorCollector enabled="false"/>
<browserMonitoring autoInstrument="false"/>
<threadProfiling>
<ignoreMethod>System.Threading.WaitHandle:InternalWaitOne</ignoreMethod>
<ignoreMethod>System.Threading.WaitHandle:WaitAny</ignoreMethod>
</threadProfiling>
<applicationLogging enabled="false"/>
<utilization detectAws="false" detectAzure="false" detectGcp="false" detectPcf="false" detectDocker="false" detectKubernetes="false"/>
<slowSql enabled="false"/>
<distributedTracing enabled="false"/>
<codeLevelMetrics enabled="false"/>
</configuration>
Additional env vars:-
$methods = @(
"Npgsql.NpgsqlConnection.Open"
"Npgsql.NpgsqlCommand.ExecuteReader"
"Npgsql.NpgsqlDataReader.Read"
"Npgsql.NpgsqlConnection.OpenAsync"
"Npgsql.NpgsqlCommand.ExecuteReaderAsync"
"Npgsql.NpgsqlDataReader.ReadAsync"
);
$concat = $methods -join ';';
[System.Environment]::SetEnvironmentVariable("NEW_RELIC_DISABLE_SAMPLERS", "true", [System.EnvironmentVariableTarget]::Machine);
Regarding Microsoft support. I've still not heard back from the chain above me. I'll chase. Cheers, |
Edit* Invalid results, ignore, see reply below. |
Oh wait. I'm an idiot. I didn't read the PR. I didn't turn on Cheers, |
Bad news, still happens. This time it took 243 iterations.
This is with:-
$methods = @(
"Npgsql.NpgsqlConnection.Open"
"Npgsql.NpgsqlCommand.ExecuteReader"
"Npgsql.NpgsqlDataReader.Read"
"Npgsql.NpgsqlConnection.OpenAsync"
"Npgsql.NpgsqlCommand.ExecuteReaderAsync"
"Npgsql.NpgsqlDataReader.ReadAsync"
);
$concat = $methods -join ';';
[System.Environment]::SetEnvironmentVariable("NEW_RELIC_DISABLE_SAMPLERS", "true", [System.EnvironmentVariableTarget]::Machine);
[System.Environment]::SetEnvironmentVariable("NEW_RELIC_GC_SAMPLER_V2_ENABLED", "1", [System.EnvironmentVariableTarget]::Machine);
and:-
<?xml version="1.0"?>
<!-- Copyright (c) 2008-2020 New Relic, Inc. All rights reserved. -->
<!-- For more information see: https://docs.newrelic.com/docs/agents/net-agent/configuration/net-agent-configuration/ -->
<configuration xmlns="urn:newrelic-config" agentEnabled="true">
<appSettings>
<add key="NewRelic.EventListenerSamplersEnabled" value="false"/>
<add key="GCSamplerV2Enabled" value="true"/>
</appSettings>
<service licenseKey="REDACTED"/>
<application/>
<log level="off" enabled="false"/>
<allowAllHeaders enabled="true"/>
<attributes enabled="true">
<exclude>request.headers.cookie</exclude>
<exclude>request.headers.authorization</exclude>
<exclude>request.headers.proxy-authorization</exclude>
<exclude>request.headers.x-*</exclude>
<include>request.headers.*</include>
</attributes>
<transactionTracer enabled="false"/>
<distributedTracing enabled="false"/>
<errorCollector enabled="false"/>
<browserMonitoring autoInstrument="false"/>
<threadProfiling>
<ignoreMethod>System.Threading.WaitHandle:InternalWaitOne</ignoreMethod>
<ignoreMethod>System.Threading.WaitHandle:WaitAny</ignoreMethod>
</threadProfiling>
<applicationLogging enabled="false"/>
<utilization detectAws="false" detectAzure="false" detectGcp="false" detectPcf="false" detectDocker="false" detectKubernetes="false"/>
<slowSql enabled="false"/>
<distributedTracing enabled="false"/>
<codeLevelMetrics enabled="false"/>
</configuration>
Cheers, |
I bombed out Just tried this and it failed in 23 tries (I'm not sure the count is related at all, just mentioning it for the record).
Cheers, |
Hi @indy-singh -- thanks so much for your detailed analysis and thorough testing. If you're still seeing the hang without the Extensions folder, the .NET Agent is effectively doing nothing at that point, so that leaves us with only a few possibilities. We have seen issues before caused by our Profiler setting flags that are necessary for our Agent to run, such as disabling tiered compilation. Unfortunately, if this is the case, we won't be able to solve it on our own, and we would have to pull Microsoft in to investigate. If it's not too much trouble, could you capture a memory dump for the case where you've removed the Extensions folder and it still hangs? That would just be to confirm that the .NET Agent really isn't doing any work at all. |
@indy-singh -- On the very slim chance that it is the exact same issue we saw before, you can test that by setting the environment variable |
Apologies for the silence, I've been chasing down the org chart to get a case with Microsoft opened; that is now done under case id 2411220050001045 and the dumps have been shared with them. I don't know if you guys have a pre-existing relationship with Microsoft but happy to give you permissions to collaborate on the ticket if need be.
Yes, but this may take a while. I normally leave the test harness running in a loop, and I've only got the previous dumps by being extremely lucky.
Sure, I can give that try. Cheers, |
Hi @indy-singh -- no apology necessary. Please feel free to loop us in on any conversations with Microsoft, or let us know if we can help with their investigation. I don't think we can access your ticket directly, though. |
Small update; in the middle of radio silence from Microsoft. No update of note to share. We will chase on our side. I have not forgotten about the outstanding tasks. Cheers, |
Update! Been involved a little back and forth with Microsoft and we've arrived here:-
That was at Re: the outstanding tasks; I'm actually off for the rest of December but it seems Microsoft are finally in the same position we are in this thread. So I'm tempted to pause any work from my POV until after New Years. Cheers, |
Another update from Microsoft:-
That was at Cheers (have a great Christmas!), |
Thank you for all of the updates. |
Update from Microsoft at
Still on my todo list.
Not sure I fully understand this one. We already deploy our applications to all environments with:-
{
"configProperties": {
"System.GC.Concurrent": true,
"System.GC.Server": true,
"System.GC.DynamicAdaptationMode": 1,
"System.Threading.ThreadPool.UseWindowsThreadPool": true,
"System.Runtime.TieredCompilation": false
}
}
And our LIVE environment currently does not have any New Relic stuff installed. Given that we deploy around 100 times a day and no hang occurs, I would say this is already ticked off? Unless I misinterpreted this ask? Cheers, |
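A quick way to double-check what a deployed app actually carries is to read its runtime config and look at the tiered compilation property. Below is a minimal Python sketch, assuming the JSON above is the content of the file. The helper names are made up for illustration, and the nested `runtimeOptions` lookup reflects my understanding of the generated `*.runtimeconfig.json` layout rather than anything stated in this thread.

```python
import json


def config_properties(doc):
    """Return the configProperties mapping from a parsed runtime config.

    Accepts a top-level "configProperties" (as in the fragment above) and
    one nested under "runtimeOptions" (assumed layout of a generated file).
    """
    if "configProperties" in doc:
        return doc["configProperties"]
    return doc.get("runtimeOptions", {}).get("configProperties", {})


def tiered_compilation_disabled(config_text):
    """True only if System.Runtime.TieredCompilation is explicitly false."""
    props = config_properties(json.loads(config_text))
    return props.get("System.Runtime.TieredCompilation") is False


if __name__ == "__main__":
    fragment = '{"configProperties": {"System.Runtime.TieredCompilation": false}}'
    print(tiered_compilation_disabled(fragment))
```

This only inspects the file on disk; environment variables or a profiler can still change the effective setting at runtime, so a match here does not prove the two configurations behave identically.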
Thanks for the update, @indy-singh. Good to hear that Microsoft is still digging into the issue - hopefully they can find something soon. As relates to the Please keep us posted if there's more progress! |
Description
At a very high level; once we deploy New Relic (both infra agent and dotnet agent) we experience start-up hangs in our dotnet6/dotnet8 apps.
We ran into this problem last year: dotnet/aspnetcore#50317 but we didn't realise New Relic was the root cause until we tried to roll it out again this week (because I saw this fix) and the problem re-appeared.
Specifically the error that is logged at the highest level is
ASP.NET Core app failed to start after 120000 milliseconds
Expected Behavior
New Relic should not cause the app to hang on start-up.
Troubleshooting or NR Diag results
It's very difficult to establish a troubleshooting workflow. It's a very intermittent and sporadic hang. It almost feels like the app waits to connect to the NR backend and it then times out after 2 mins exactly.
Steps to Reproduce
Unable to reproduce at all, only happens in production very intermittently/sporadically.
Your Environment
Windows Server 2022 (was 2019) - experienced it on both OS versions.
dotnet8 (was dotnet6) - experienced it on both dotnet versions.
On-prem
Hosted via IIS
NewRelicDotNetAgent_10.30.0_x64.msi (deployed using DSC: ADDLOCAL=NETCoreSupport)
newrelic-infra-amd64.1.57.1.msi (deployed using DSC)
Additional context
I know it's very long and information overload, but I did a ton of work in this ticket: dotnet/aspnetcore#50317
Our newrelic-infra.yml looks like this:-
Our newrelic.config looks like this:-
Our appsettings.json looks like this:-