Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Startup hang caused by the deployment of New Relic #2809

Open
indy-singh opened this issue Oct 4, 2024 · 50 comments
Open

Startup hang caused by the deployment of New Relic #2809

indy-singh opened this issue Oct 4, 2024 · 50 comments
Labels
bug Something isn't working community To tag external issues and PRs

Comments

@indy-singh
Copy link

indy-singh commented Oct 4, 2024

Description

At a very high level; once we deploy New Relic (both infra agent and dotnet agent) we experience start-up hangs in our dotnet6/dotnet8 apps.

We ran into this problem last year: dotnet/aspnetcore#50317 but we didn't realise New Relic was the root cause until we tried to roll out it again this week (because I saw this fix) and the problem re-appeared.

Specifically the error that is logged at the highest level is ASP.NET Core app failed to start after 120000 milliseconds

Expected Behavior

New Relic should not cause the app to hang on start-up.

Troubleshooting or NR Diag results

It's very difficult to establish a troubleshooting workflow. It's a very intermittent and sporadic hang. It almost feels like the app waits to connect to the NR backend and it then times out after 2 mins exactly.

Steps to Reproduce

Unable to reproduce at all, only happens in production very intermittently/sporadically.

Your Environment

Windows Server 2022 (was 2019) - experienced it on both OS versions.
dotnet8 (was dotnet6) - experienced it on both dotnet versions.
On-prem
Hosted via IIS
NewRelicDotNetAgent_10.30.0_x64.msi (deployed using DSC: ADDLOCAL=NETCoreSupport)
newrelic-infra-amd64.1.57.1.msi (deployed using DSC)

Additional context

I know it's very long and information overload, but I did a ton of work in this ticket: dotnet/aspnetcore#50317

Our newrelic-infra.yml looks like this:-

# required
license_key: REDACTED
# to save cpu; we don't send process metrics, so we don't need the local agent to collect them
metrics_process_sample_rate: -1
# to save cpu; we don't use any of the plugins, so turn them off locally
disable_all_plugins: true
# to save cpu; we don't have removeable drives, so turn off detection locally
win_removable_drives: false
# to save cpu; we don't care to monitor the WU state of a host
windows_updates_refresh_sec: -1
# to save cpu; we don't care to monitor the services.msc of a host
windows_services_refresh_sec: -1

Our newrelic.config looks like this:-

<?xml version="1.0"?>
<!-- Copyright (c) 2008-2020 New Relic, Inc.  All rights reserved. -->
<!-- For more information see: https://docs.newrelic.com/docs/agents/net-agent/configuration/net-agent-configuration/ -->
<configuration xmlns="urn:newrelic-config" agentEnabled="true">
	<service licenseKey="REDACTED"/>
	<application/>
	<log level="off"/>
	<allowAllHeaders enabled="true"/>
	<attributes enabled="true">
		<exclude>request.headers.cookie</exclude>
		<exclude>request.headers.authorization</exclude>
		<exclude>request.headers.proxy-authorization</exclude>
		<exclude>request.headers.x-*</exclude>
		<include>request.headers.*</include>
	</attributes>
	<transactionTracer enabled="false"/>
	<distributedTracing enabled="false"/>
	<errorCollector enabled="false"/>
	<browserMonitoring autoInstrument="false"/>
	<threadProfiling>
		<ignoreMethod>System.Threading.WaitHandle:InternalWaitOne</ignoreMethod>
		<ignoreMethod>System.Threading.WaitHandle:WaitAny</ignoreMethod>
	</threadProfiling>
	<applicationLogging enabled="false"/>
	<utilization detectAws="false" detectAzure="false" detectGcp="false" detectPcf="false" detectDocker="false" detectKubernetes="false"/>
	<slowSql enabled="false"/>
	<distributedTracing enabled="false"/>
	<codeLevelMetrics enabled="false"/>
</configuration>

Our appsettings.json looks like this:-

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Log4Net": "Debug",
      "Microsoft": "Warning",
      "Microsoft.Hosting.Lifetime": "Information"
    }
  },
  "AllowedHosts": "*",
  "NewRelic.AppName": "REDACTED",
  "NewRelic.AgentEnabled": "true"
}
@indy-singh indy-singh added the bug Something isn't working label Oct 4, 2024
@workato-integration
Copy link

@github-actions github-actions bot added the community To tag external issues and PRs label Oct 4, 2024
@indy-singh
Copy link
Author

Places I can find that match to 120 seconds:-

public const float MinimumSamplingDurationSeconds = 120; // 2 minutes

private static readonly TimeSpan _timeBetweenExecutions = TimeSpan.FromMinutes(2);

I don't know the architecture of the agent (yet), but I guess the most likely candidate is which ever code path is invoked during start up.

My suspicion initially falls on

if (_configuration.CollectorSyncStartup || _configuration.CollectorSendDataOnExit)

As we keep the default settings and I think we would fall into the sync path (and therefore blocking).

@indy-singh
Copy link
Author

Hmmm, https://docs.newrelic.com/docs/apm/agents/net-agent/configuration/net-agent-configuration/#service-syncStartup says it defaults to false.

So does that mean it is https://docs.newrelic.com/docs/apm/agents/net-agent/configuration/net-agent-configuration/#service-requestTimeout which implies that it is 120 seconds as we keep sendDataOnExit to the default value (false).

@nrcventura
Copy link
Member

Because you are not enabling send data on exit, or sync startup, the agent handles its connection logic on a separate thread pool thread via a timer that should be triggered without a delay.

_scheduler.ExecuteOnce(Connect, TimeSpan.Zero);

I don't think any of the other duration settings that you found in the code are related to this problem, because those things with the exception of the connect step (which is happening on a different thread), all require the agent to complete the connect process first.

In order to understand what is going on we will most likely need a memory dump from a time when the application fails to start, and we would also likely need DEBUG or FINEST logs. Because these files can be potentially sensitive, the best way to get support on this issue to work with our New Relic support team because they have the ability to receive these types of files without having them shared publicly on github. If you are unable to send these files to New Relic support, you can attempt to analyze the memory dump, but we'll need to know which threads/tasks are blocked by which other threads/tasks.

Most of the startup hang/deadlock issues that I've seen are related to applications using a Configuration Manager that needs to reach out to an external system. Usually these calls to the external system are something that the agent also instruments. We have a default list of known instrumented methods that have caused this problem.

private static HashSet<string> _deferInitializationOnTheseMethods = new HashSet<string>
{
"System.Net.Http.HttpClient.SendAsync",
"System.Net.Http.SocketsHttpHandler.SendAsync",
"System.Net.HttpWebRequest.SerializeHeaders",
"System.Net.HttpWebRequest.GetResponse"
};

We also allow extending this list using an environment variable.
var methodsFromEnvVar = System.Environment.GetEnvironmentVariable("NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST");

The difficult part will be determining what to add to that list. In the linked aspnetcore issue you provided a sample startup.cs.txt file, and in that file it looks like you might be using a postgresql database with the configuration manager. If that's the case you may need to add the classes and methods defined in the sql instrumentation xml associated to the Postgres comments. Here is an example of one of those sections.
<!-- Postgres SQL Driver-->
<match assemblyName="Npgsql" className="Npgsql.NpgsqlCommand">
<exactMethodMatcher methodName="ExecuteDbDataReader" parameters="System.Data.CommandBehavior"/>
<exactMethodMatcher methodName="ExecuteNonQuery" />
<exactMethodMatcher methodName="ExecuteScalar" />
</match>
<match assemblyName="Npgsql" className="Npgsql.NpgsqlCommand" minVersion="4.1">
<exactMethodMatcher methodName="ExecuteReader" parameters="System.Data.CommandBehavior" />
</match>
</tracerFactory>

@indy-singh
Copy link
Author

Hi @nrcventura ,

Appreciate the detail reply.

In order to understand what is going on we will most likely need a memory dump from a time when the application fails to start, and we would also likely need DEBUG or FINEST logs.

I would love to get a memory dump. Unfortunately, this is pretty hard because we only know it's hung after the deploy has failed and the process has been automatically killed off by IIS because it is deemed "unhealthy". I can change pretty much anything, but it's not something I can do quickly as it would effect the entire CI/CD pipeline.

I did however push out the change to the logging late on Friday, and this morning it has happened twice already.

<log level="all" maxLogFileSizeMB="100"/>

Last log:-

[2024-10-07T09:38:00.405Z, PID: 31444] [aspnetcorev2.dll] ASPNET_CORE_GLOBAL_MODULE::Terminate

Timeout log:-

[2024-10-07T09:37:59.247Z, PID: 31444] [aspnetcorev2_inprocess.dll] InvalidOperationException 'Managed server didn't initialize after 120000 ms.' caught at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\inprocessapplication.cpp:489 

First log:-

[2024-10-07T09:35:59.161Z, PID: 31444] [aspnetcorev2.dll] Initializing logs for 'C:\Program Files\IIS\Asp.Net Core Module\V2\aspnetcorev2.dll'. Process Id: 31444. File Version: 18.0.24141.6. Description: IIS ASP.NET Core Module V2. Commit: .

Most of the startup hang/deadlock issues that I've seen are related to applications using a Configuration Manager that needs to reach out to an external system.

Thanks, that gives me a thread to pull on. We do store out config keys etc in PG and all apps reach out to PG at startup time to get various details, settings etc.

I'll try adding each method one by one to NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST and see if the failures stop (this may take some time given the intermittent nature of the problem).

I've attached the relevant logs from the most recent failure.

newrelic_agent_Allium.WebApp_099_31444.zip

Cheers,
Indy

@indy-singh
Copy link
Author

I've pushed out

[System.Environment]::SetEnvironmentVariable('NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST', 'Npgsql.NpgsqlCommand.ExecuteReader', [System.EnvironmentVariableTarget]::Machine)

To all of our DEMO hosts.

I guess if we don't get another error for a week we can be somewhat-confident it might be that.

Hard to prove.

@nrcventura
Copy link
Member

Thanks for the update. Yes, these types of problems are very difficult to reproduce. You might be able to reproduce the problem more reliably if you enabled syncStartup which is one of the settings that you previously found.

I suspect that you may also need the NpgSql.Connection.Open method.

@indy-singh
Copy link
Author

indy-singh commented Oct 7, 2024

Just happened again, updated environment variable to:-

[System.Environment]::SetEnvironmentVariable('NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST', 'Npgsql.NpgsqlCommand.ExecuteReader;Npgsql.NpgsqlConnection.Open', [System.EnvironmentVariableTarget]::Machine)

@indy-singh
Copy link
Author

Happened twice this morning, updated environment variable to:-

$methods = @(
    "Npgsql.NpgsqlConnection.Open"
    "Npgsql.NpgsqlCommand.ExecuteReader"
    "Npgsql.NpgsqlDataReader.Read"
    "Npgsql.NpgsqlConnection.OpenAsync"
    "Npgsql.NpgsqlCommand.ExecuteReaderAsync"
    "Npgsql.NpgsqlDataReader.ReadAsync"    
);

$concat = $methods -join ';';

[System.Environment]::SetEnvironmentVariable("NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST", $concat, [System.EnvironmentVariableTarget]::Machine);

I'm not convinced it (the hang on start-up) is fetching the configuration from PG, so I'm including all methods (that can be used) on that path. This should help us either rule it out or in.

Back to the waiting game!

Cheers,
Indy

@indy-singh
Copy link
Author

Tiny update from me;

  • I tried syncStartup for local-dev reproduction, got absolutely no where.
  • I tried to block the domains below using Windows HOSTS file (collected using ipconfig /displaydns) but the dot-net agent was fine, and simply logged retries with some sort of back off strategy so that rules out the NR backend <-> dot-net agent communication :-
0.0.0.0 collector-eu01.cell.eu.nr-data.net
0.0.0.0 collector.eu01.nr-data.net
0.0.0.0 fastly-tls12-collector.eu01.nr-data.net
0.0.0.0 fastly-tls12-infra-api.eu.newrelic.com
0.0.0.0 infra-api.cell.eu.nr-data.net
0.0.0.0 infra-api.eu.newrelic.com
0.0.0.0 infrastructure-command-api.eu.newrelic.com
0.0.0.0 newrelic.com
0.0.0.0 tls12.eu.newrelic.com.cdn.cloudflare.net
  • I did notice some HTTP 409 in the dot-net agent log, not sure if related/expected? It doesn't line up with any start-up hangs, but seeing a 409 in the wild is rare enough for me to note it.
  • I did notice some 5xx in the infra agent log, but I guess that is to be expected to some extent.
  • I peppered the CI/CD deploy logs with Get-Date -Format "o" and hope to get an exact timeline going. I've got a theory that the presence of the NR agent(s) is enough to cause the start-up time to balloon and it only tips over 120 seconds sometimes.
  • I'm basically ignoring the infra agent for now, because the only apps that are effected are the dotnet8 apps (we have some legacy 472 apps) but we only installed dotnet-new support.

Cheers,
Indy

@nrcventura
Copy link
Member

If you are not seeing the problem when syncStartup is enabled, then I'm guessing that you are not running into the deadlock problem that we've seen in the past. I opened an issue to add a feature to just disable the usage of the ConfigurationManager altogether so that we can more easily rule that problem in or out before dedicating more time to identify what to add to that delayed initialization list.

I have seen other weird cases in certain environments where enabling agent logs higher than debug or info level have slowed down application startup enough (in those environments) to trigger the default aspnetore app default startup timeout. I don't think that this is likely to be the problem that you are experiencing, but https://docs.newrelic.com/docs/apm/agents/net-agent/configuration/net-agent-configuration/#log-enabled explains how to disable our logging system (this does not affect capturing and forwarding agent log data).

Another situation where we have seen deadlocks occur is within the agent's usage of the .net Event Pipe which we use to collect garbage collection metrics for the application. The deadlock that we've seen occurs (only occasionally) when we try to unsubscribe our EventListener from the Event Pipe EventSource. The collection of this data can be disabled, and this has resolved some problems similar to this (usually noticed when the application is trying to end). To accomplish that you can add <add key="NewRelic.EventListenerSamplersEnabled" value="false" /> to the appsettings element in the new relic config file (we don't have an environment variable implemented for this). This is the thing that I would try next.

@indy-singh
Copy link
Author

So, I don't want to jinx anything, but since I added:-

$methods = @(
    "Npgsql.NpgsqlConnection.Open"
    "Npgsql.NpgsqlCommand.ExecuteReader"
    "Npgsql.NpgsqlDataReader.Read"
    "Npgsql.NpgsqlConnection.OpenAsync"
    "Npgsql.NpgsqlCommand.ExecuteReaderAsync"
    "Npgsql.NpgsqlDataReader.ReadAsync"    
);

$concat = $methods -join ';';

[System.Environment]::SetEnvironmentVariable("NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST", $concat, [System.EnvironmentVariableTarget]::Machine);

We've not had a start-up hang. I'm not saying that I think this is solved, but as time goes on my confidence does increase 🤞🏽

Cheers,
Indy

@indy-singh
Copy link
Author

indy-singh commented Oct 15, 2024

Small update, not seen the exact same start-up hang. So I'm getting closer to saying it looks like this particular issue (getting external configuration from pg) is resolved. Right now NR is only deployed to our DEMO environment - the real test will be getting it to our LIVE environment (the hang was more frequent when we first attempted NR deployment last year).

If we get to the end of the week without another start-up hang, I'll spend next week getting onto a LIVE environment and see what happens.

Tangentially; we've had two instances of the same application hanging after the app had booted and warming up other external services (in this case BigQuery/GetTable). I don't think it is NR, but I don't really believe in coincidences so I thought I'd mention it here, just in case I need to keep the issue open a bit longer.

Cheers,
Indy

@nrcventura
Copy link
Member

Thank you for the updates. I'm not aware of any problems with someone using BigQuery, but we also do not currently support out of the box instrumentation for BigQuery, so I do not know what could be the problem there.

@indy-singh
Copy link
Author

Argh!

The start-up hang has happened again.

From event viewer:-

[2024-10-16T06:11:56.536Z, PID: 8672] [aspnetcorev2_inprocess.dll] InvalidOperationException 'Managed server didn't initialize after 120000 ms.' caught at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\inprocessapplication.cpp:489 

I have a NewRelic.Profiler.8672.log but I don't have the rotation-based for [pid: 8672]. The last 10 lines in the rotation based log file (newrelic_agent_VehicleData.WebApp_029.log) are:-

2024-10-16 06:09:13,931 NewRelic  DEBUG: [pid: 19716, tid: 111] Request(cce3a86f-56b1-4ae7-8a55-6043b9477b3d): Invocation of "analytic_event_data" yielded response : {}
2024-10-16 06:09:13,931 NewRelic FINEST: [pid: 19716, tid: 111] Transaction Event harvest finished.
2024-10-16 06:09:15,720 NewRelic FINEST: [pid: 19716, tid: 96] Error Event harvest starting.
2024-10-16 06:09:15,720 NewRelic FINEST: [pid: 19716, tid: 95] Custom Event harvest starting.
2024-10-16 06:09:15,893 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: Attempting to execute NewRelic.Providers.Wrapper.Sql.OpenConnectionAsyncWrapper found from InstrumentedMethodInfo: Method: Npgsql.NpgsqlConnection, Npgsql, Version=8.0.5.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7:OpenAsync(System.Threading.CancellationToken), RequestedWrapperName: OpenConnectionTracerAsync, IsAsync: False, RequestedMetricName: , RequestedTransactionNamePriority: , StartWebTransaction: False
2024-10-16 06:09:15,893 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: No transaction, skipping method Npgsql.NpgsqlConnection.OpenAsync(System.Threading.CancellationToken)
2024-10-16 06:09:15,893 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: Attempting to execute NewRelic.Providers.Wrapper.Sql.OpenConnectionWrapper found from InstrumentedMethodInfo: Method: Npgsql.NpgsqlConnection, Npgsql, Version=8.0.5.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7:Open(System.Boolean,System.Threading.CancellationToken), RequestedWrapperName: OpenConnectionTracer, IsAsync: False, RequestedMetricName: , RequestedTransactionNamePriority: , StartWebTransaction: False
2024-10-16 06:09:15,893 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: No transaction, skipping method Npgsql.NpgsqlConnection.Open(System.Boolean,System.Threading.CancellationToken)
2024-10-16 06:09:15,918 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: Attempting to execute NewRelic.Providers.Wrapper.Sql.SqlCommandWrapper found from InstrumentedMethodInfo: Method: Npgsql.NpgsqlCommand, Npgsql, Version=8.0.5.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7:ExecuteReader(System.Data.CommandBehavior), RequestedWrapperName: SqlCommandTracer, IsAsync: False, RequestedMetricName: , RequestedTransactionNamePriority: , StartWebTransaction: False
2024-10-16 06:09:15,918 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: No transaction, skipping method Npgsql.NpgsqlCommand.ExecuteReader(System.Data.CommandBehavior)

NewRelic.Profiler.8672.zip

Now looking into the other avenues of investigation.

Cheers,
Indy

@jaffinito
Copy link
Member

The profiler log is only going to have data for the profiler, but we also will want to see the managed agent logs. Managed agent logs can contain information that is potentially sensitive, especially at higher log levels. The best way for us to get these files would to be work with our New Relic support team since they have the ability to receive these types of files without having them shared publicly on github.

@indy-singh
Copy link
Author

indy-singh commented Oct 16, 2024

Hi @jaffinito,

Can you clarify what you mean by managed agent logs? I believe I uploaded those over here.

I do review the logs before I upload them. I can open a ticket through out NR account so that we can share logs if something sensitive is in them.

Cheers,
Indy

@indy-singh
Copy link
Author

indy-singh commented Oct 17, 2024

Happened again tonight at

[2024-10-17T18:13:11.391Z, PID: 7804] [aspnetcorev2_inprocess.dll] InvalidOperationException 'Managed server didn't initialize after 120000 ms.' caught at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\inprocessapplication.cpp:489 

There is only a profiler log, there aren't any agent logs that pertain to 7804.

Edit* I've not had the opportunity to pursue the other avenues of investigation.

Cheers,
Indy

@indy-singh
Copy link
Author

To accomplish that you can add <add key="NewRelic.EventListenerSamplersEnabled" value="false" /> to the appsettings element in the new relic config file (we don't have an environment variable implemented for this). This is the thing that I would try next.

This is now done. I've modified "C:\ProgramData\New Relic\.NET Agent\newrelic.config" it now looks like this:-

<?xml version="1.0"?>
<!-- Copyright (c) 2008-2020 New Relic, Inc.  All rights reserved. -->
<!-- For more information see: https://docs.newrelic.com/docs/agents/net-agent/configuration/net-agent-configuration/ -->
<configuration xmlns="urn:newrelic-config" agentEnabled="true">
	<appSettings>
		<add key="NewRelic.EventListenerSamplersEnabled" value="false"/>
	</appSettings>
	<service licenseKey="REDACTED"/>
	<application/>
	<log level="all" maxLogFileSizeMB="100"/>
	<allowAllHeaders enabled="true"/>
	<attributes enabled="true">
		<exclude>request.headers.cookie</exclude>
		<exclude>request.headers.authorization</exclude>
		<exclude>request.headers.proxy-authorization</exclude>
		<exclude>request.headers.x-*</exclude>
		<include>request.headers.*</include>
	</attributes>
	<transactionTracer enabled="false"/>
	<distributedTracing enabled="false"/>
	<errorCollector enabled="false"/>
	<browserMonitoring autoInstrument="false"/>
	<threadProfiling>
		<ignoreMethod>System.Threading.WaitHandle:InternalWaitOne</ignoreMethod>
		<ignoreMethod>System.Threading.WaitHandle:WaitAny</ignoreMethod>
	</threadProfiling>
	<applicationLogging enabled="false"/>
	<utilization detectAws="false" detectAzure="false" detectGcp="false" detectPcf="false" detectDocker="false" detectKubernetes="false"/>
	<slowSql enabled="false"/>
	<distributedTracing enabled="false"/>
	<codeLevelMetrics enabled="false"/>
</configuration>

I confirmed that changing the config did something by logs:-

2024-10-17 22:46:22,361 NewRelic   INFO: [pid: 24308, tid: 48] Reconnecting due to configuration change
2024-10-17 22:46:22,363 NewRelic FINEST: [pid: 24308, tid: 48] Sampler NewRelic.Agent.Core.Samplers.GCSamplerNetCore has been requested to stop.
2024-10-17 22:46:22,363 NewRelic FINEST: [pid: 24308, tid: 48] Sampler NewRelic.Agent.Core.Samplers.CpuSampler has been requested to stop.
2024-10-17 22:46:22,363 NewRelic FINEST: [pid: 24308, tid: 48] Sampler NewRelic.Agent.Core.Samplers.MemorySampler has been requested to stop.
2024-10-17 22:46:22,363 NewRelic FINEST: [pid: 24308, tid: 48] Sampler NewRelic.Agent.Core.Samplers.ThreadStatsSampler has been requested to stop.

On the topic of environment variables does NEW_RELIC_DISABLE_SAMPLERS not do the same thing?

Found via grokking the sdk src:-

public bool DisableSamplers { get { return EnvironmentOverrides(_localConfiguration.application.disableSamplers, "NEW_RELIC_DISABLE_SAMPLERS"); } }

Cheers,
Indy

@nrcventura
Copy link
Member

The DisableSamplers setting disables all of our samplers including the cpu and memory samplers. The one I recommended disabling only disables a couple of samplers which are the EventListener samplers for Garbage Collection metrics and threading metrics.

@indy-singh
Copy link
Author

Happened again - I should explain that I've hijacked the CI/CD pipeline on weekday evenings (00 07,19-23 * * TUE-FRI) to basically deploy our entire application tier (we call this a 'Server Recycle') so I can have a constant amount of "deploys" happening.

What's interesting is that a Server Recycle is basically iisreset + hits the urls to warmup the apps. The binaries don't change.

Before I added all the PG methods we used to get the same startup hang on actual deploys too (e.g. the binaries have changed and we don't use iisreset, we just kill the respective pid of that w3wp.exe).

I'm beginning to think that we are playing whack-a-mole on a combination of our platform code + NR causes a start-up hang some of the time.

I think our current state of affairs is that when we do an iisreset and then warmup the apps we run the risk of the start-up hang.

To prove that I'll take out the iisreset in our Server Recycle and see what happens.

Cheers,
Indy

@indy-singh
Copy link
Author

I finally replicated it on LOCAL. THANK GOD.

We have a small console app that is invoked by our OMD/nagios setup every 1-5 seconds (I think), it looks like this (I've append .txt to make it upload-able).

program.cs.txt

I leave that running in the background using powershell:-

while($true) {Start-Sleep -Seconds 1; .\CWDotNetDiagClient.exe;}

And then I simulate what a deploy/server recycle does with this (this is copied directly from the scripts, a lot of it is many years old):-

$WebApplicationCreatorConfigurationDirectory = "D:\code\CodeweaversWorld\src\WebApplicationCreator\LocalConfiguration";
$WebApplicationCreatorExecutable = "D:\code\CodeweaversWorld\src\WebApplicationCreator\app\IISApplicationExecutor\bin\IISApplicationExecutor.exe";
$Timeout = 480;
$AbsoluteUrl = "http://datamining.localiis/services/application.svc/warmup";

while ($true) {    
    Get-WmiObject -Namespace 'root\WebAdministration' -class 'WorkerProcess' -Filter "NOT AppPoolName LIKE '%servicemonitor%'" `
    | Select-Object -Property AppPoolName, ProcessId `
    | ForEach-Object { 
        Write-Host "Killing '$($_.AppPoolName)' with pid '$($_.ProcessId)'";
        Stop-Process -Id $_.ProcessId -Force -ErrorAction SilentlyContinue;
    };
    & "C:\Windows\System32\iisreset.exe" /stop ;
    & "$WebApplicationCreatorExecutable" "ZZZ_PLACEHOLDER" "RESET" "$WebApplicationCreatorConfigurationDirectory" "NOPROMPT";
    & "C:\Windows\System32\iisreset.exe" /start ;
    Start-Sleep -Seconds 2;
    & "$WebApplicationCreatorExecutable" "ZZZ_PLACEHOLDER" "ALL" "$WebApplicationCreatorConfigurationDirectory" "NOPROMPT";    
    Invoke-WebRequest -Uri $AbsoluteUrl -TimeoutSec $Timeout -UseBasicParsing -DisableKeepAlive -MaximumRedirection 0;
}

The clue re-reading the entire issue thread was:-

The deadlock that we've seen occurs (only occasionally) when we try to unsubscribe our EventListener from the Event Pipe EventSource.

Many thanks @nrcventura for that nugget, it took a while to percolate.

I really should get to bed, maybe tonight I won't dream about this issue.

Cheerio!
Indy

@indy-singh
Copy link
Author

Manage to replicate it a 2nd & 3d time (took 185 tries) on LOCAL, this time I was quick enough to grab a memory dump.

I'll share them on Monday via the NR support ticket route.

Meanwhile I had a quick poke around using dotnet dump:-

> syncblk
Index         SyncBlock MonitorHeld Recursion Owning Thread Info          SyncBlock Owner
    2 000001E2B298B030            5         1 000001E2B3550430 413c  27   000001a235c153b0 System.Collections.Generic.List`1[[System.WeakReference`1[[System.Diagnostics.Tracing.EventSource, System.Private.CoreLib]], System.Private.CoreLib]]

> setthread 27
> clrstack
OS Thread Id: 0x413c (27)
        Child SP               IP Call Site
00000014A3AFF438 00007ffec3202f44 [HelperMethodFrame: 00000014a3aff438]
00000014A3AFF540 00007FFE141CC832 System.Threading.ThreadPool.get_CompletedWorkItemCount() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.Windows.cs @ 204]
00000014A3AFF570 00007FFE141CC79D System.Diagnostics.Tracing.RuntimeEventSource+<>c.<OnEventCommand>b__35_9() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/RuntimeEventSource.cs @ 110]
00000014A3AFF5A0 00007FFE141CC511 System.Diagnostics.Tracing.IncrementingPollingCounter.UpdateMetric() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/IncrementingPollingCounter.cs @ 57]
00000014A3AFF600 00007FFE141CC406 System.Diagnostics.Tracing.CounterGroup.ResetCounters() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 206]
00000014A3AFF670 00007FFE141CC0BD System.Diagnostics.Tracing.CounterGroup.EnableTimer(Single) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 158]
00000014A3AFF6D0 00007FFE141C6B18 System.Diagnostics.Tracing.CounterGroup.OnEventSourceCommand(System.Object, System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 73]
00000014A3AFF750 00007FFE1396CD5A System.Diagnostics.Tracing.EventSource.DoCommand(System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2738]
00000014A3AFF7B0 00007FFE1396C61E System.Diagnostics.Tracing.EventSource.SendCommand(System.Diagnostics.Tracing.EventListener, System.Diagnostics.Tracing.EventProviderType, Int32, System.Diagnostics.Tracing.EventCommand, Boolean, System.Diagnostics.Tracing.EventLevel, System.Diagnostics.Tracing.EventKeywords, System.Collections.Generic.IDictionary`2<System.String,System.String>) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2574]
00000014A3AFF830 00007FFE1396C45B System.Diagnostics.Tracing.EventSource+OverrideEventProvider.OnControllerCommand(System.Diagnostics.Tracing.ControllerCommand, System.Collections.Generic.IDictionary`2<System.String,System.String>, Int32) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2438]
00000014A3AFF890 00007FFE139694C9 System.Diagnostics.Tracing.EventPipeEventProvider.HandleEnableNotification(System.Diagnostics.Tracing.EventProvider, Byte*, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 58]
00000014A3AFF8E0 00007FFE13969077 System.Diagnostics.Tracing.EventProviderImpl.ProviderCallback(System.Diagnostics.Tracing.EventProvider, Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventProvider.cs @ 1310]
00000014A3AFF940 00007FFE13968E14 System.Diagnostics.Tracing.EventPipeEventProvider.Callback(Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*, Void*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 68]

> clrstack -all
OS Thread Id: 0x26c0
        Child SP               IP Call Site
00000014A3A7C738 00007ffec31fffa4 [HelperMethodFrame_1OBJ: 00000014a3a7c738] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00000014A3A7C880 00007FFE13902B8E System.Diagnostics.Tracing.EventSource.Initialize(System.Guid, System.String, System.String[]) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 1619]
00000014A3A7C930 00007FFE141C873C System.Data.DataCommonEventSource..ctor()
00000014A3A7C990 00007FFE141C862D System.Data.DataCommonEventSource..cctor() [/_/src/libraries/System.Data.Common/src/System/Data/Common/DataCommonEventSource.cs @ 12]
00000014A3A7D1F8 00007ffe7345c663 [HelperMethodFrame: 00000014a3a7d1f8]
00000014A3A7D300 00007FFE141C85DA System.Data.Common.DbConnectionStringBuilder.Clear() [/_/src/libraries/System.Data.Common/src/System/Data/Common/DbConnectionStringBuilder.cs @ 283]
00000014A3A7D340 00007FFE141BA91E Npgsql.NpgsqlConnectionStringBuilder.Init()
00000014A3A7D380 00007FFE141BA239 Npgsql.NpgsqlConnectionStringBuilder..ctor()
00000014A3A7D3B0 00007FFE141BA026 Npgsql.NpgsqlConnection..cctor()
00000014A3A7DC28 00007ffe7345c663 [HelperMethodFrame: 00000014a3a7dc28]
00000014A3A7DD30 00007FFE141B9FCC Npgsql.NpgsqlConnection..ctor()
00000014A3A7DD70 00007FFE13913900 System.RuntimeType.CreateInstanceDefaultCtor(Boolean, Boolean) [/_/src/coreclr/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs @ 3878]
00000014A3A7DDD0 00007FFE141B97DA log4net.Appender.AdoNetAppender.CreateConnection(System.Type, System.String)
00000014A3A7DE10 00007FFE141B8958 log4net.Appender.AdoNetAppender.InitializeDatabaseConnection()
00000014A3A7DE80 00007FFE13963502 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.ParseAppender(System.Xml.XmlElement)
00000014A3A7DF20 00007FFE13962CFF log4net.Repository.Hierarchy.XmlHierarchyConfigurator.FindAppenderByReference(System.Xml.XmlElement)
00000014A3A7DF90 00007FFE13963330 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.ParseAppender(System.Xml.XmlElement)
00000014A3A7E030 00007FFE13962CFF log4net.Repository.Hierarchy.XmlHierarchyConfigurator.FindAppenderByReference(System.Xml.XmlElement)
00000014A3A7E0A0 00007FFE139618E1 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.ParseChildrenOfLoggerElement(System.Xml.XmlElement, log4net.Repository.Hierarchy.Logger, Boolean)
00000014A3A7E160 00007FFE139606A5 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.ParseLogger(System.Xml.XmlElement)
00000014A3A7E1E0 00007FFE1395DC02 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.Configure(System.Xml.XmlElement)
00000014A3A7E260 00007FFE1395D177 log4net.Repository.Hierarchy.Hierarchy.XmlRepositoryConfigure(System.Xml.XmlElement)
00000014A3A7E2D0 00007FFE13953004 log4net.Config.XmlConfigurator.InternalConfigure(log4net.Repository.ILoggerRepository, System.IO.Stream)
00000014A3A7E340 00007FFE13951B9F log4net.Config.XmlConfigurator.InternalConfigure(log4net.Repository.ILoggerRepository, System.IO.FileInfo)
00000014A3A7E3D0 00007FFE139515D3 log4net.Config.XmlConfigurator.InternalConfigureAndWatch(log4net.Repository.ILoggerRepository, System.IO.FileInfo)
00000014A3A7E440 00007FFE13911316 log4net.Config.XmlConfigurator.ConfigureAndWatch(System.IO.FileInfo)
00000014A3A7E4B0 00007FFE1390FEFD Codeweavers.Logging.Logger..cctor()
00000014A3A7ED68 00007ffe7345c663 [HelperMethodFrame: 00000014a3a7ed68]
00000014A3A7EE70 00007FFE1390FCA5 Codeweavers.Logging.Logger.GetLogger(System.String)
00000014A3A7EED0 00007FFE1390F451 DataMining.WebApp.Program.Main(System.String[])
OS Thread Id: 0x3180
        Child SP               IP Call Site
00000014A427F8F0 00007ffec31fffa4 [DebuggerU2MCatchHandlerFrame: 00000014a427f8f0]
OS Thread Id: 0x367c
        Child SP               IP Call Site
00000014A52CD8F8 00007ffec31fffa4 [HelperMethodFrame_1OBJ: 00000014a52cd8f8] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00000014A52CDA40 00007FFE13902B8E System.Diagnostics.Tracing.EventSource.Initialize(System.Guid, System.String, System.String[]) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 1619]
00000014A52CDAF0 00007FFE1396E434 System.Diagnostics.Tracing.FrameworkEventSource..ctor() [/_/src/coreclr/System.Private.CoreLib/System.Private.CoreLib.Generators/Generators.EventSourceGenerator/FrameworkEventSource.g.cs @ 12]
00000014A52CDB40 00007FFE1396E34D System.Diagnostics.Tracing.FrameworkEventSource..cctor() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/FrameworkEventSource.cs @ 14]
00000014A52CE3A8 00007ffe7345c663 [HelperMethodFrame: 00000014a52ce3a8]
00000014A52CE4B0 00007FFE1396DFC5 System.Threading.ThreadPoolWorkQueue..ctor() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs @ 442]
00000014A52CE570 00007FFE1396DB01 System.Threading.ThreadPool..cctor() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs @ 1430]
00000014A52CEDE8 00007ffe7345c663 [HelperMethodFrame: 00000014a52cede8]
00000014A52CEEF0 00007FFE1396D940 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib]].AwaitUnsafeOnCompleted[[System.Runtime.CompilerServices.ConfiguredValueTaskAwaitable`1+ConfiguredValueTaskAwaiter[[System.Boolean, System.Private.CoreLib]], System.Private.CoreLib]](ConfiguredValueTaskAwaiter<Boolean> ByRef, System.Runtime.CompilerServices.IAsyncStateMachineBox)
00000014A52CEF50 00007FFE1396B338 Codeweavers.Logging.Async.ChannelAsyncForwardingAppender+<ForwardLoggingEventsFromBuffer>d__25.MoveNext()
00000014A52CF130 00007FFE13968EF7 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs @ 38]
00000014A52CF1A0 00007FFE139695A8 Codeweavers.Logging.Async.ChannelAsyncForwardingAppender.ForwardLoggingEventsFromBuffer()
00000014A52CF210 00007FFE13969199 Codeweavers.Logging.Async.ChannelAsyncForwardingAppender+<ForwardingThreadExecute>d__24.MoveNext()
00000014A52CF300 00007FFE13968EF7 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs @ 38]
00000014A52CF370 00007FFE13968C38 Codeweavers.Logging.Async.ChannelAsyncForwardingAppender.ForwardingThreadExecute()
00000014A52CF3E0 00007FFE13968AED System.Threading.Tasks.Task`1[[System.__Canon, System.Private.CoreLib]].InnerInvoke() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Future.cs @ 501]
00000014A52CF420 00007FFE13968990 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs @ 179]
00000014A52CF490 00007FFE13968655 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2345]
00000014A52CF780 00007ffe7345c663 [DebuggerU2MCatchHandlerFrame: 00000014a52cf780]
OS Thread Id: 0x413c
        Child SP               IP Call Site
00000014A3AFF438 00007ffec3202f44 [HelperMethodFrame: 00000014a3aff438]
00000014A3AFF540 00007FFE141CC832 System.Threading.ThreadPool.get_CompletedWorkItemCount() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.Windows.cs @ 204]
00000014A3AFF570 00007FFE141CC79D System.Diagnostics.Tracing.RuntimeEventSource+<>c.<OnEventCommand>b__35_9() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/RuntimeEventSource.cs @ 110]
00000014A3AFF5A0 00007FFE141CC511 System.Diagnostics.Tracing.IncrementingPollingCounter.UpdateMetric() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/IncrementingPollingCounter.cs @ 57]
00000014A3AFF600 00007FFE141CC406 System.Diagnostics.Tracing.CounterGroup.ResetCounters() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 206]
00000014A3AFF670 00007FFE141CC0BD System.Diagnostics.Tracing.CounterGroup.EnableTimer(Single) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 158]
00000014A3AFF6D0 00007FFE141C6B18 System.Diagnostics.Tracing.CounterGroup.OnEventSourceCommand(System.Object, System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 73]
00000014A3AFF750 00007FFE1396CD5A System.Diagnostics.Tracing.EventSource.DoCommand(System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2738]
00000014A3AFF7B0 00007FFE1396C61E System.Diagnostics.Tracing.EventSource.SendCommand(System.Diagnostics.Tracing.EventListener, System.Diagnostics.Tracing.EventProviderType, Int32, System.Diagnostics.Tracing.EventCommand, Boolean, System.Diagnostics.Tracing.EventLevel, System.Diagnostics.Tracing.EventKeywords, System.Collections.Generic.IDictionary`2<System.String,System.String>) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2574]
00000014A3AFF830 00007FFE1396C45B System.Diagnostics.Tracing.EventSource+OverrideEventProvider.OnControllerCommand(System.Diagnostics.Tracing.ControllerCommand, System.Collections.Generic.IDictionary`2<System.String,System.String>, Int32) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2438]
00000014A3AFF890 00007FFE139694C9 System.Diagnostics.Tracing.EventPipeEventProvider.HandleEnableNotification(System.Diagnostics.Tracing.EventProvider, Byte*, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 58]
00000014A3AFF8E0 00007FFE13969077 System.Diagnostics.Tracing.EventProviderImpl.ProviderCallback(System.Diagnostics.Tracing.EventProvider, Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventProvider.cs @ 1310]
00000014A3AFF940 00007FFE13968E14 System.Diagnostics.Tracing.EventPipeEventProvider.Callback(Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*, Void*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 68]

A quick Google around leads me to
dotnet/runtime#105682
dotnet/runtime#104322

So it looks like when we connect to the event pipe just as the app is initialized it can cause a hang.

What is weird is, is why the deployment of NR causes the start-up hang for us.

That console app we've been using has been in place since 2023-03 and we've never seen the hang without NR agent(s) being deployed.

Cheers,
Indy

@nrcventura
Copy link
Member

A lot of the code in the DiagnosticSource library tries to do as little work as possible if there are no subscribers to a particular source (DiagnosticSource, EventPipe, ActivitySource, Meter, etc.). The New Relic agent enables an EventLIstener by default to subscribe to the EventPipe datasource in order to capture Garbage Collection metrics (this was the primary way to capture that type of data for .net core < 3.0, which is when we started collecting that data). When .net standard 2.1 came out, and .net core 3.1 was released, more apis became available for collecting garbage collector data, but the agent still needed to support older applications. Similar .net 6 has a better way to collect this data, but those apis are not readily available unless the agent can stop targeting .net standard 2.0 and switch over to targeting .net 6 or .net 8 directly (which would be a breaking change requiring a major release). In the meantime, I've seen the eventpipe datasource (when subscribed to from within an application instead of out-of-process) have its stability increase and decrease with different .net versions. At this point, we do have plans to migrate away from the eventpipe for collecting garbage collection data, but it will result in a breaking change because the data we will be able to capture will be different. We also would like to align the data that we capture with OpenTelemetry which is what .net is increasing its alignment with in general.

Hopefully, the problem you are experiencing goes away when the event listener samplers are disabled.

@indy-singh
Copy link
Author

Sorry for the delaying in sharing the memory dump - we are a part of a much larger organisation and it turns out that our child accounts (4721360, 4721361) in NR doesn't have "case message support". Trying to get that sorted out. This may take a while 😬

Hopefully, the problem you are experiencing goes away when the event listener samplers are disabled.

Thanks, I'll try DisableSamplers next.

We also would like to align the data that we capture with OpenTelemetry which is what .net is increasing its alignment with in general.

Ah, we were wondering why dotnet counters weren't collected, so this is on the roadmap? As it may block the NR rollout. We need dotnet counters support.

Cheers,
Indy

@nrcventura
Copy link
Member

Ah, we were wondering why dotnet counters weren't collected, so this is on the roadmap? As it may block the NR rollout. We need dotnet counters support.

dotnet counters was a dotnet tool that was released around the same time that the EventCounter API was added to the DiagnosticSource library. It was a tool that allowed out-of-process collection of metric telemetry data. Previously the main way to get this telemetry information was via performance counters, which was a Windows-only API. The EventCounter API still exists, but a new Meter API, and that is the new OpenTelemetry compatible API that is recommended. Both the EventCounter and Meter APIs are compatible with dotnet counters and dotnet monitor (I believe). From the New Relic agent perspective, we wanted to focus less on enabling the capture of arbitrary sources of metric data (at least initially), and wanted to instead focus on a curated set of sources and metrics that we have found to be the most helpful and are consistently recommended in Microsoft and 3rd troubleshooting recommendations. The New Relic agent also collected these metrics before they became standardized. Now that they are more standardized, and some of them will be available out of the box with .NET 9, it is a good time for us to also standardize on that data (but these migrations can be difficult for our customers). There is a lot of overlap (of the most commonly referenced metrics) between the metrics that the New Relic agent automatically collects, and the data available in the dotnet counters tool.

In order to understand your question about dotnet counters support we need a little more information.

  • Are you generating custom metrics with the EventCounter or Meter APIs? If so, which APIs and metric types are you using?
  • Are there specific Metrics that you are hoping to have collected for every application? Will the list of metrics vary per application?
  • If the agent collected a standard set of metrics out of the box, but allowed the ability to configure the collection of additional Meters would that be helpful (this is basically what OpenTelemetry Automatic Instrumentation project supports)?

@indy-singh
Copy link
Author

indy-singh commented Oct 22, 2024

Previously the main way to get this telemetry information was via performance counters, which was a Windows-only API.

Yeah, that was a painful deprecation for us. We did try and complain about it over here.

Are you generating custom metrics with the EventCounter or Meter APIs? If so, which APIs and metric types are you using?

Currently we have around 30 custom Meters, we don't have any EventCounter(s) yet. For Meter's, we use CreateCounter, CreateHistogram, CreateObservableGauge (this is from a quick grep, so not exhaustive).

Are there specific Metrics that you are hoping to have collected for every application? Will the list of metrics vary per application?

We would expect the standard set emitted by the Core CLR at the very least (and these).

If the agent collected a standard set of metrics out of the box, but allowed the ability to configure the collection of additional Meters would that be helpful (this is basically what OpenTelemetry Automatic Instrumentation project supports)?

Very much so, that is exactly what we want/need.

The backstory here is that we tried NR a while back, it didn't go well (because of this issue). So we had to fall-back to otel/grafana, this has proved troublesome for unrelated reasons. Our parent company is huge, and they have Enterprise level contract with NR, so we decided to give NR another go; and that's when I realised it was the NR deployment causing the start-up hang.

We very much want to align with the parent company but missing the counters from the Core CLR would very much delay the rollout until that was supported/collected by the NR agent.

Hopefully that makes sense, I'm battling a cold 🤧

Cheers,
Indy

@nrcventura
Copy link
Member

Thank you for this feedback. This confirms my assumptions about the type of support that the New Relic agent should have.

@indy-singh
Copy link
Author

Quick (bad) update; I tried with NEW_RELIC_DISABLE_SAMPLERS=true and I still get the start-up hang. Took 130 tries. I didn't get a memory dump this time - just left it running and checking it periodically.

I know there were other suggestions in this thread; I'll go through them in later this week/next week.

Cheers,
Indy

@indy-singh
Copy link
Author

indy-singh commented Oct 23, 2024

Tried with <application disableSamplers="true"/> in "C:\ProgramData\New Relic\.NET Agent\newrelic.config" and I still got the start-up hang. This time after 109 tries.

The quest continues :)

Edit* I did check but we still don't have access to case message support; I'll update as soon as I do.

Cheers,
Indy

@indy-singh
Copy link
Author

I've opened a case under account 3110122 with case reference #00231542. There isn't much on the ticket except a link to this thread and zipped memory dumps.

Cheers,
Indy

@tippmar-nr
Copy link
Member

Thanks @indy-singh - We've received the memory dumps and we're analyzing them now. Initially I'm not seeing any smoking gun, but we'll keep digging and let you know what we find.

@tippmar-nr
Copy link
Member

tippmar-nr commented Oct 29, 2024

@indy-singh We've looked at the memory dump and it doesn't look like the managed .NET agent (NewRelic.Agent.Core.dll) is loaded at the time of the deadlock but the native Profiler (NewRelic.Profiler.dll) is loaded. So we don't think the deadlock is related our usage of EventListeners, but we're not 100% certain. There is also a possibility that there's something going on in our native profiler that is causing the deadlock.

Our next release (v10.33.0) will include an optional replacement for the current EventListener-based garbage collection sampler. If you're able, please re-test with that version (should be released later this week) and let us know if you're still seeing the deadlock. If it does happen, that would be a more definitive pointer to something in our profiler.

One additional test you could try now would be to remove all of the instrumentation .xml files from the extensions folder where the .NET agent is installed (save them so you can put them back later). That should stop the profiler from doing anything, which might further confirm where the issue is coming from.

@nrcventura
Copy link
Member

To be more specific about the profiler settings I'm wondering about, we have code at

DWORD _eventMask = OverrideEventMask(
COR_PRF_MONITOR_JIT_COMPILATION | COR_PRF_MONITOR_MODULE_LOADS | COR_PRF_USE_PROFILE_IMAGES | COR_PRF_MONITOR_THREADS | COR_PRF_ENABLE_STACK_SNAPSHOT | COR_PRF_ENABLE_REJIT | (DWORD)COR_PRF_DISABLE_ALL_NGEN_IMAGES);
and
ThrowOnError(_corProfilerInfo5->SetEventMask2, _eventMask, COR_PRF_HIGH_DISABLE_TIERED_COMPILATION);
that are used to configure the .net runtime to behave differently when the agent runs, allows us to be able to instrument methods automatically, or to dynamically add or remove instrumented methods without requiring the application to restart. It may be possible that this code has nudged things in such a way to make this race condition triggering the deadlock more likely.

In that memory dump, it looked like something else other than the New Relic agent was trying to interact with the event pipe, and causing the deadlock, but it wasn't clear to us which code was trying to interact with the event pipe. As @tippmar-nr said, it did not look like it was the New Relic agent trying to interact with the event pipe, because the managed code part of the agent was not yet listed in the loaded modules list. So the only part of the agent that was running was the native code (which does not interact with the event pipe). We may need someone from Microsoft to assist with looking at that memory dump to see if they can figure out why there is a deadlock with that event pipe usage. Since the memory dump, is from your application, it would be better for you to follow up with Microsoft on analyzing that memory, if you choose to do so. We're not trying to pass this off to them, we just can't open a case/issue with them on your behalf. We have seen some very weird behavior that was made reproducible by the runtime settings set by the profiler, or the increase in memory allocation leading to data being cleaned up sooner, that Microsoft was able to diagnose by looking at the memory dumps.

@indy-singh
Copy link
Author

Hi both,

Thanks for taking the time to look at the dumps.

Our next release (v10.33.0) will include an #2838 for the current EventListener-based garbage collection sampler. If you're able, please re-test with that version (should be released later this week) and let us know if you're still seeing the deadlock. If it does happen, that would be a more definitive pointer to something in our profiler.

One additional test you could try now would be to remove all of the instrumentation .xml files from the extensions folder where the .NET agent is installed (save them so you can put them back later). That should stop the profiler from doing anything, which might further confirm where the issue is coming from.

Will do both of these 👍🏽

In that memory dump, it looked like something else other than the New Relic agent was trying to interact with the event pipe, and causing the deadlock, but it wasn't clear to us which code was trying to interact with the event pipe.

Yes that other code will have been the CWDotNetDiagClient that we have on our DEMO/LIVE servers to collect the aforementioned dotnet counters. I'm of the opinion the deadlock is triggered when things line up just right between NR and CWDotNetDiagClient. It's entirely possible that CWDotNetDiagClient is doing something naughty and not being a nice neighbour.

I've been chipping away at the reproduction harness to make the deadlock more prevalent but it is proving a tricky beast to tame.

I'll chase up about Microsoft support; someone in the org chart will have a support contract I imagine.

Updates from me might be a bit more sporadic - currently recovering from a short hospital stay.

Cheers,
Indy

@indy-singh
Copy link
Author

Sorry for the huge delay. Just to let you know I'm back on this.

Updated to

NewRelicDotNetAgent_10.33.1_x64.msi
newrelic-infra-amd64.1.57.2.msi

Config now looks like this:-

<?xml version="1.0"?>
<!-- Copyright (c) 2008-2020 New Relic, Inc.  All rights reserved. -->
<!-- For more information see: https://docs.newrelic.com/docs/agents/net-agent/configuration/net-agent-configuration/ -->
<configuration xmlns="urn:newrelic-config" agentEnabled="true">
	<appSettings>
		<add key="NewRelic.EventListenerSamplersEnabled" value="false"/>
	</appSettings>
	<service licenseKey="REDACTED"/>
	<application/>
	<log level="off" enabled="false"/>
	<allowAllHeaders enabled="true"/>
	<attributes enabled="true">
		<exclude>request.headers.cookie</exclude>
		<exclude>request.headers.authorization</exclude>
		<exclude>request.headers.proxy-authorization</exclude>
		<exclude>request.headers.x-*</exclude>
		<include>request.headers.*</include>
	</attributes>
	<transactionTracer enabled="false"/>
	<distributedTracing enabled="false"/>
	<errorCollector enabled="false"/>
	<browserMonitoring autoInstrument="false"/>
	<threadProfiling>
		<ignoreMethod>System.Threading.WaitHandle:InternalWaitOne</ignoreMethod>
		<ignoreMethod>System.Threading.WaitHandle:WaitAny</ignoreMethod>
	</threadProfiling>
	<applicationLogging enabled="false"/>
	<utilization detectAws="false" detectAzure="false" detectGcp="false" detectPcf="false" detectDocker="false" detectKubernetes="false"/>
	<slowSql enabled="false"/>
	<distributedTracing enabled="false"/>
	<codeLevelMetrics enabled="false"/>
</configuration>

Additional envs vars:-

$methods = @(
    "Npgsql.NpgsqlConnection.Open"
    "Npgsql.NpgsqlCommand.ExecuteReader"
    "Npgsql.NpgsqlDataReader.Read"
    "Npgsql.NpgsqlConnection.OpenAsync"
    "Npgsql.NpgsqlCommand.ExecuteReaderAsync"
    "Npgsql.NpgsqlDataReader.ReadAsync"    
);

$concat = $methods -join ';';

[System.Environment]::SetEnvironmentVariable("NEW_RELIC_DISABLE_SAMPLERS", "true", [System.EnvironmentVariableTarget]::Machine);

Regarding Microsoft support. I've still not heard back from the chain above me. I'll chase.

Cheers,
Indy

@indy-singh
Copy link
Author

indy-singh commented Nov 20, 2024

I can confirm the crash still happens. This time is took 56 tries.

~ ~[2024-11-20T11:38:00.007Z, PID: 6544] [aspnetcorev2_inprocess.dll] InvalidOperationException 'Managed server didn't initialize after 120000 ms.' caught at ~D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\inprocessapplication.cpp:489 ~ ~

Cheers,
Indy

Edit* Invalid results, ignore, see reply below.

@indy-singh
Copy link
Author

indy-singh commented Nov 20, 2024

Oh wait. I'm an idiot. I didn't read the PR. I didn't turn on GCSamplerV2Enabled. I'll do that now and retest. 🤦🏽

Cheers,
Indy

@indy-singh
Copy link
Author

Bad news, still happens. This time it took 243 iterations.

[2024-11-20T13:11:43.496Z, PID: 26084] [aspnetcorev2_inprocess.dll] InvalidOperationException 'Managed server didn't initialize after 120000 ms.' caught at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\inprocessapplication.cpp:489 

This is with:-

$methods = @(
    "Npgsql.NpgsqlConnection.Open"
    "Npgsql.NpgsqlCommand.ExecuteReader"
    "Npgsql.NpgsqlDataReader.Read"
    "Npgsql.NpgsqlConnection.OpenAsync"
    "Npgsql.NpgsqlCommand.ExecuteReaderAsync"
    "Npgsql.NpgsqlDataReader.ReadAsync"    
);

$concat = $methods -join ';';

[System.Environment]::SetEnvironmentVariable("NEW_RELIC_DISABLE_SAMPLERS", "true", [System.EnvironmentVariableTarget]::Machine);
[System.Environment]::SetEnvironmentVariable("NEW_RELIC_GC_SAMPLER_V2_ENABLED", "1", [System.EnvironmentVariableTarget]::Machine);

and:-

<?xml version="1.0"?>
<!-- Copyright (c) 2008-2020 New Relic, Inc.  All rights reserved. -->
<!-- For more information see: https://docs.newrelic.com/docs/agents/net-agent/configuration/net-agent-configuration/ -->
<configuration xmlns="urn:newrelic-config" agentEnabled="true">
	<appSettings>
		<add key="NewRelic.EventListenerSamplersEnabled" value="false"/>
		<add key="GCSamplerV2Enabled" value="true"/>
	</appSettings>
	<service licenseKey="REDACTED"/>
	<application/>
	<log level="off" enabled="false"/>
	<allowAllHeaders enabled="true"/>
	<attributes enabled="true">
		<exclude>request.headers.cookie</exclude>
		<exclude>request.headers.authorization</exclude>
		<exclude>request.headers.proxy-authorization</exclude>
		<exclude>request.headers.x-*</exclude>
		<include>request.headers.*</include>
	</attributes>
	<transactionTracer enabled="false"/>
	<distributedTracing enabled="false"/>
	<errorCollector enabled="false"/>
	<browserMonitoring autoInstrument="false"/>
	<threadProfiling>
		<ignoreMethod>System.Threading.WaitHandle:InternalWaitOne</ignoreMethod>
		<ignoreMethod>System.Threading.WaitHandle:WaitAny</ignoreMethod>
	</threadProfiling>
	<applicationLogging enabled="false"/>
	<utilization detectAws="false" detectAzure="false" detectGcp="false" detectPcf="false" detectDocker="false" detectKubernetes="false"/>
	<slowSql enabled="false"/>
	<distributedTracing enabled="false"/>
	<codeLevelMetrics enabled="false"/>
</configuration>

Cheers,
Indy

@indy-singh
Copy link
Author

One additional test you could try now would be to remove all of the instrumentation .xml files from the extensions folder where the .NET agent is installed (save them so you can put them back later). That should stop the profiler from doing anything, which might further confirm where the issue is coming from.

I bombed out C:\ProgramData\New Relic\.NET Agent\Extensions\netcore.

Just tried this and it failed in 23 tries (I'm not sure the count is related at all, just mentioning it for the record).

[2024-11-20T15:52:06.102Z, PID: 28068] [aspnetcorev2_inprocess.dll] InvalidOperationException 'Managed server didn't initialize after 120000 ms.' caught at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\inprocessapplication.cpp:489 

Cheers,
Indy

@chynesNR
Copy link
Member

Hi @indy-singh -- thanks so much for your detailed analysis and thorough testing. If you're still seeing the hang without the Extensions folder, the .NET Agent is effectively doing nothing at that point, so that leaves us with only a few possibilities. We have seen issues before caused by our Profiler setting flags that are necessary for our Agent to run, such as disabling tiered compilation. Unfortunately, if this is the case, we won't be able to solve it on our own,, and we would have to pull Microsoft in to investigate.

If it's not too much trouble, could you capture a memory dump for the case where you've removed the Extensions folder and it still hangs? That would just be to confirm that the .NET Agent really isn't doing any work at all.

@chynesNR
Copy link
Member

@indy-singh -- On the very slim chance that it is the exact same issue we saw before, you can test that by setting the environment variable COMPlus_TieredCompilation to 0 and running without our agent installed. If it still hangs, then we may have our answer.

@indy-singh
Copy link
Author

Apologies for the silence, I've been chasing down the org chart to get a case with Microsoft opened; that is now done under case id 2411220050001045 and the dumps have been shared with them. I don't know if you guys have a pre-existing relationship with Microsoft but happy to give you permissions to collaborate on the ticket if need be.

If it's not too much trouble, could you capture a memory dump for the case where you've removed the Extensions folder and it still hangs?

Yes, but this may take a while. I normally leave the test harness running in a loop, and I've only got the previous dumps by being extremely luckily.

On the very slim chance that it is the exact same issue we saw before, you can test that by setting the environment variable COMPlus_TieredCompilation to 0 and running without our agent installed. If it still hangs, then we may have our answer.

Sure, I can give that try.

Cheers,
Indy

@chynesNR
Copy link
Member

Hi @indy-singh -- no apology necessary. Please feel free to loop us in on any conversations with Microsoft, or let us know if we can help with their investigation. I don't think we can access your ticket directly, though.

@indy-singh
Copy link
Author

Small update; in the middle of radio silence from Microsoft. No update of note to share. We will chase on our side.

I have not forgotten about the outstanding tasks.

Cheers,
Indy

@indy-singh
Copy link
Author

Update! Been involved a little back and forth with Microsoft and we've arrived here:-

Hello Indy,

Thanks for your patience here.

I reviewed the dumps again and I do see a potential deadlock situation during startup.

I am working on this issue with our .NET escalations and will get back to you soon.

Regards,
REDACTED

That was at Mon, 16 Dec 2024 21:37:35 +0000.

Re: the outstanding tasks; I'm actually off for the rest of December but it seems Microsoft are finally in the same position we are in this thread. So I'm tempted to pause any work from my POV until after New Years.

Cheers,
Indy

@indy-singh
Copy link
Author

Another update from Microsoft:-

Hello Indy,

Just a quick update here, we are working with our Product Group to establish a pattern of behavior on this.

Will keep you posted.

Regards,
REDACTED

That was at Mon, 23 Dec 2024 16:14:54 +0000.

Cheers (have a great Christmas!),
Indy

@nrcventura
Copy link
Member

Thank you for all of the updates.

@indy-singh
Copy link
Author

Update from Microsoft at Mon, 13 Jan 2025 21:33:50 +0000:-

Hello Indy,

My apologies for the delay in getting back to you on this request.

Our Product Group is looking into this actively. It is taking some time as this is an edge case specifically with New Relic

I will share an update as soon as I can.

Regards,
REDACTED

If it's not too much trouble, could you capture a memory dump for the case where you've removed the Extensions folder and it still hangs?

Still on my todo list.

On the very slim chance that it is the exact same issue we saw before, you can test that by setting the environment variable COMPlus_TieredCompilation to 0 and running without our agent installed. If it still hangs, then we may have our answer.

Not sure I fully understand this one. We already deploy our applications to all environments with:-

{
  "configProperties": {
    "System.GC.Concurrent": true,
    "System.GC.Server": true,
    "System.GC.DynamicAdaptationMode": 1,
    "System.Threading.ThreadPool.UseWindowsThreadPool": true,
    "System.Runtime.TieredCompilation": false
  }
}

And our LIVE enviroment currently does not have any New Relic stuff installed. Given that we deploy around 100 times a day and now hang occurs, I would say this is already ticked off? Unless I misinterpreted this ask?

Cheers,
Indy

@tippmar-nr
Copy link
Member

Thanks for the update, @indy-singh. Good to hear that Microsoft is still digging into the issue - hopefully they can find something soon.

As relates to the COMPlus_TierCompliation=0 environment variable, I think if you're already deploying with that setting (or its equivalent), you've probably covered that scenario already.

Please keep us posted if there's more progress!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working community To tag external issues and PRs
Projects
None yet
Development

No branches or pull requests

5 participants