
Observing delays in otel collector extension #1497

bhaskarbanerjee opened this issue Aug 23, 2024 · 19 comments
Labels
bug Something isn't working

Comments

@bhaskarbanerjee

Describe the bug
We are using the otel collector in decouple mode and observing that the extension takes > 2 s to start. On further investigation, we noticed a delay of ~1.5-2 s between the log statements "Everything is ready. Begin running and processing data." and "EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]".

Is this much delay expected?

@serkan-ozal
Contributor

Hi @bhaskarbanerjee, is it possible to set the OTEL_LOG_LEVEL env var to true and share the debug logs with us?

@bhaskarbanerjee
Author

Thanks for the quick response @serkan-ozal. Here it is:
otel-lambda-collector.log

@bhaskarbanerjee
Author

2024-08-26T13:39:55.810-05:00 | {"level":"warn","ts":1724697595.8106759,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}

THIS IS WHERE WE SEE A DELAY OF ~2-4 seconds.

2024-08-26T13:39:57.228-05:00 | EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]
2024-08-26T13:39:57.229-05:00 | START RequestId: 6a59d6bb-e5ce-46d7-9b70-37069cf7f12a Version: $LATEST

@bhaskarbanerjee
Author

Here is my config.

receivers:
  otlp:
    protocols:
      http:
        endpoint: "localhost:4318"

exporters:
  logging:
    loglevel: info
  otlphttp:
    endpoint: ${OTEL_GATEWAY_ENDPOINT}
    retry_on_failure:
      initial_interval: ${INITIAL_INTERVAL}
      max_interval: ${MAX_INTERVAL}
      max_elapsed_time: ${MAX_ELAPSED_TIME}
    timeout: ${TRANSPORT_TIMEOUT}
    sending_queue:
      queue_size: 5000

processors:
  decouple:
    max_queue_size: 5000

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging, otlphttp]
      processors: [decouple]
    metrics:
      receivers: [otlp]
      exporters: [logging, otlphttp]
      processors: [decouple]

@serkan-ozal
Contributor

serkan-ozal commented Aug 26, 2024

@bhaskarbanerjee
While I am not fully sure at the moment, I think there is Lambda runtime and function init (with OTEL SDK init) between the "Everything is ready. Begin running and processing data." and "EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]" logs.

In my simple example, I have seen runtime and function init logs between those collector extension logs (~1.5s).

To check this, you may also put debug print statements on the first and last lines (and at some other important points) of the handler.py file to measure the load start/end timings of your handler module/file (not the handler function, but the handler file).
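For instance, a minimal sketch of what such timing statements could look like (the module layout and handler body here are hypothetical, just to illustrate measuring module load time):

```python
# handler.py
import time

# First line of the module: record when the handler file starts loading.
print(f"[init-timing] handler module load started at {time.time():.3f}")

# ... your imports and other module-level initialization go here ...

def lambda_handler(event, context):
    # The handler function itself; it runs only after module load is complete.
    return {"statusCode": 200, "body": "ok"}

# Last line of the module: record when the handler file finished loading.
print(f"[init-timing] handler module load finished at {time.time():.3f}")
```

The difference between the two timestamps in the logs approximates the handler module load time during init.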

@bhaskarbanerjee
Author

Thanks @serkan-ozal. I have other Lambdas too which are not using OpenTelemetry, and in those cases there is no such wait time. Here is an example using the New Relic layer and another layer from HashiCorp in the same Lambda.

2024-08-28T09:42:32.881-05:00 EXTENSION Name: newrelic-lambda-extension State: Ready Events: [INVOKE, SHUTDOWN]

2024-08-28T09:42:32.881-05:00 EXTENSION Name: vault-lambda-extension State: Ready Events: [INVOKE, SHUTDOWN]

2024-08-28T09:42:32.886-05:00 START RequestId: 81648590-a693-4abb-b0e4-26b0a067a0f8 Version: $LATEST

In this case the Lambda lifecycle API shows it took 5 ms to start after the extensions reached the ready state, which is quite acceptable.

@serkan-ozal
Contributor

serkan-ozal commented Aug 28, 2024

@bhaskarbanerjee

Here are the logs from my sample Nodejs Lambda:

With OTEL SDK + User Handler Init

2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.7748003,"caller":"[email protected]/otlp.go:102","msg":"Starting GRPC server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4317"}
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.7753932,"caller":"[email protected]/otlp.go:152","msg":"Starting HTTP server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4318"}
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.775681,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.7757266,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}
2024-08-28T20:55:32.249Z

Registering OpenTelemetry

...

2024-08-28T20:55:33.288Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]
2024-08-28T20:55:33.290Z START RequestId: 5ef7f288-ba75-4637-b44b-3ae9cd4ccf13 Version: $LATEST
2024-08-28T20:55:33.296Z 2024-08-28T20:55:33.296Z 5ef7f288-ba75-4637-b44b-3ae9cd4ccf13 INFO Serving lambda request.

Time diff between collector logs: 1,513 ms

2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.775681,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:55:33.288Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]

With OTEL SDK + No User Handler Init

2024-08-28T20:33:34.018Z {"level":"info","ts":1724877214.018901,"caller":"[email protected]/otlp.go:102","msg":"Starting GRPC server","kind":"receiver","name":"otlp","data_type":"metrics","endpoint":"localhost:4317"}
2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.0191104,"caller":"[email protected]/otlp.go:152","msg":"Starting HTTP server","kind":"receiver","name":"otlp","data_type":"metrics","endpoint":"localhost:4318"}
2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.019237,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.0193567,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}

2024-08-28T20:33:34.447Z Registering OpenTelemetry

...

2024-08-28T20:33:34.617Z 2024-08-28T20:33:34.617Z undefined DEBUG @opentelemetry/instrumentation-http Applying instrumentation patch for nodejs core module on require hook { module: 'http' }
2024-08-28T20:33:34.623Z 2024-08-28T20:33:34.623Z undefined DEBUG @opentelemetry/instrumentation-aws-lambda Applying instrumentation patch for nodejs module file on require hook { module: '/var/task/index.js', version: '0.0.0', fileName: 'index', baseDir: '/var/task' }
2024-08-28T20:33:34.623Z 2024-08-28T20:33:34.623Z undefined DEBUG patch handler function
2024-08-28T20:33:34.637Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]
2024-08-28T20:33:34.638Z START RequestId: 6fe1e4b8-c70c-4c31-9ff9-084fe62f8897 Version: $LATEST
2024-08-28T20:33:34.641Z 2024-08-28T20:33:34.641Z 6fe1e4b8-c70c-4c31-9ff9-084fe62f8897 INFO Serving lambda request.

Time diff between collector logs: 618 ms

2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.019237,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:33:34.637Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]

No OTEL SDK + No User Handler Init

2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7347512,"caller":"[email protected]/otlp.go:102","msg":"Starting GRPC server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4317"}
2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7348504,"caller":"[email protected]/otlp.go:152","msg":"Starting HTTP server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4318"}
2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7349138,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:45:43.735Z {"level":"info","ts":1724877943.7349257,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}
2024-08-28T20:45:43.863Z EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]
2024-08-28T20:45:43.864Z START RequestId: 8e450b50-32d2-42aa-8f38-34ffebf46032 Version: $LATEST
2024-08-28T20:45:43.866Z 2024-08-28T20:45:43.866Z 8e450b50-32d2-42aa-8f38-34ffebf46032 INFO Serving lambda request.

Time diff between collector logs: 129 ms

2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7349138,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:45:43.863Z EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]

So, in summary, as you can see from the logs, the time diff between the collector logs depends heavily on the OTEL SDK init and user handler init delays. In your example, since the handler itself is just a simple hello-world function without initializing any 3rd-party dependency, I think the 2-4 second diff between the collector logs is caused by OTEL Python SDK initialization.

I will also rerun this benchmark with a sample Python Lambda and share the results. I am not a Python expert, but maybe this type of auto instrumentation is heavyweight for the Python runtime:

exec python3 $LAMBDA_LAYER_PKGS_DIR/bin/opentelemetry-instrument "$@"

@bhaskarbanerjee
Author

@serkan-ozal Many thanks, I really appreciate your help. Quick question: what config of the collector are you running with? Is it the default one?

I am gathering evidence on my end. Will share soon.

@serkan-ozal
Contributor

@bhaskarbanerjee The default one.

@serkan-ozal
Contributor

@bhaskarbanerjee I am not sure, but this type of auto instrumentation (it checks the dependencies and forks another process from the main one to run the real app) does not seem lightweight to me: https://github.com/open-telemetry/opentelemetry-python-contrib/blob/main/opentelemetry-instrumentation/src/opentelemetry/instrumentation/auto_instrumentation/__init__.py

Maybe this is a question to ask the OTEL Python folks.

@serkan-ozal
Contributor

@bhaskarbanerjee I think my theory about the "EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]" log (which I mentioned in the SIG meeting) is correct. As far as I can see, that log is not printed when the extension is ready, but when both the runtime and the function are initialized. To verify this, I think we need to add some more debug logs to the collector, especially right after it calls /next.

@bhaskarbanerjee
Author

Agreed on both your points. I have already added logs to find out more about several other inits and calls. I will share those tests after I run the scenarios we discussed this morning.
But so far, with your results and mine, the culprit seems to be the SDK, especially AUTO instrumentation.

@bhaskarbanerjee
Author

@serkan-ozal Here are my test results with a Lambda auto instrumented in Node.js v18.
Conclusion is the same as yours.

| SDK Added | SDK Enabled | DecoupledProcessor | OTLPHTTP Exporter | Wait Time (ms) | Total Cold Start (ms) |
| --- | --- | --- | --- | --- | --- |
|  |  |  |  | 145 | 918 |
|  |  |  |  | 174 | 1030 |
|  |  |  |  | 1684 | 1740 |
|  |  |  |  | 184 | 1440 |
|  |  |  |  | 147 | 803 |
|  |  |  |  | 176 | 844 |
|  |  |  |  | 149 | 746 |

Inference:

  • Addition of the SDK introduces high cold start time

  • Enabling the SDK in AUTO INSTRUMENTATION mode further increases the cold start.

@serkan-ozal
Contributor

serkan-ozal commented Aug 29, 2024

Thanks for the detailed benchmark @bhaskarbanerjee.

I have also reproduced similar cold start delays with a Python example and verified that Python auto instrumentation is the reason. I will look into it further tomorrow by profiling the Python auto instrumentation script.

But initially, as far as I can see from the Python auto instrumentation, it (see the sketch after this list):

  • starts the user handler in a child process
  • scans the entire Python path on disk to discover auto instrumentation libraries (there are many)
  • checks version conflicts between OTEL dependencies
  • then initializes the instrumenters along with their pre and post hooks
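A rough sketch of what that discovery step amounts to (simplified and hedged; the entry-point group name and the conflict-check helper below reflect my reading of the linked auto_instrumentation code, not an exact copy of it):

```python
from importlib.metadata import entry_points


def has_dependency_conflict(entry_point) -> bool:
    # Placeholder for the real check, which compares the instrumentation's
    # declared target-library requirements against the installed versions.
    return False


def load_instrumentors():
    # Every installed opentelemetry-instrumentation-* package registers an
    # entry point; enumerating them means walking the installed distributions,
    # which is what makes cold starts expensive when many packages are present.
    for entry_point in entry_points(group="opentelemetry_instrumentor"):
        if has_dependency_conflict(entry_point):
            continue
        instrumentor_class = entry_point.load()
        instrumentor_class().instrument()  # wires up pre/post hooks


load_instrumentors()
```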

@dyladan
Member

dyladan commented Aug 29, 2024

Autoinstrumentation is definitely a cold start penalty in JS. The more instrumentations are enabled, the higher the penalty. This is because we wrap require and intercept all module loads to determine whether a module needs to be monkeypatched. It used to be much worse, when each instrumentation maintained its own wrapper; now there is a single global wrapper that instrumentations register with. There may be additional optimizations that can be done, but the mechanism will always have some module load penalty.

Edit: tl;dr: because Node autoinstrumentation uses require-in-the-middle, each module load is slightly slower than it would be without autoinstrumentation enabled.

@bhaskarbanerjee
Author

Thanks @dyladan. Would choosing specific instrumentations lower the cold starts?
Ref:
OTEL_NODE_RESOURCE_DETECTORS from https://opentelemetry.io/docs/zero-code/js/
OTEL_PYTHON_DISABLED_INSTRUMENTATIONS from https://opentelemetry.io/docs/zero-code/python/configuration/#disabling-specific-instrumentations

Or are there any other ways apart from manual instrumentation?

Do we have any tests conducted/documented by the community to determine the performance difference between AUTO instrumentation and MANUAL instrumentation?
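For reference, a minimal sketch of what manual (code-based) instrumentation of a Python handler could look like, as the alternative being compared against here (the endpoint, span name, and handler body are illustrative assumptions, not taken from this issue):

```python
# Manual setup: initialize the SDK yourself and instrument only what you need,
# avoiding the auto-discovery scan at cold start.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)


def lambda_handler(event, context):
    # One hand-written span around the request instead of auto-generated ones.
    with tracer.start_as_current_span("handler"):
        return {"statusCode": 200}
```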

@serkan-ozal
Contributor

@bhaskarbanerjee

For the Python runtime, I have analyzed the OTEL Python auto instrumentation script and verified that the following parts have a significant effect on cold start (for a 512 MB Lambda function):

  • auto discovery process: 750-1000 ms
  • botocore instrumentation: 350-400 ms

I am not sure whether there is anything we can do for the botocore instrumentation, because it is caused by the import of some botocore modules by wrapt, which is the auto instrumentation library for Python.

But instead of auto discovery of the OTEL packages, I have switched to a static list of instrumentation packages, like we already do in the Node.js layer, and it helped me cut 750-1000 ms from the cold start delay.
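A minimal sketch of the static-list approach (the particular instrumentor set below is illustrative, not the exact list used in the experimental layer):

```python
# Instead of scanning entry points on disk, instantiate a known, fixed set of
# instrumentors directly; only the libraries actually used need to be listed.
from opentelemetry.instrumentation.botocore import BotocoreInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor


def instrument_statically():
    # Each instrumentor patches its target library; no package discovery runs.
    BotocoreInstrumentor().instrument()
    RequestsInstrumentor().instrument()


instrument_statically()
```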

If you are also interested in testing my optimized layer (which is experimental, so please don't use it in prod :) ), I can share its ARN so you can benchmark it too.

@serkan-ozal
Contributor

serkan-ozal commented Aug 31, 2024

@bhaskarbanerjee
Author

Hi @serkan-ozal. Sorry, I was out for a few days.

Can you please share which instrumentations you have included? I tried setting OTEL_PYTHON_DISABLED_INSTRUMENTATIONS to none, but it didn't yield much savings, so I would like to try out what you have tried.

> But instead of auto discovery of the OTEL packages, I have switched to static list of instrumentation packages like we already do in Node.js layer and it helped me to cut 750-1000 ms from coldstart delay.
