
Observing delays in otel collector extension #1497

bhaskarbanerjee opened this issue Aug 23, 2024 · 19 comments
Labels
bug Something isn't working

Comments

@bhaskarbanerjee

Describe the bug
We are using the otel collector in decouple mode and observing that the extension takes > 2 s to start. On further investigation, we noticed a delay of ~1.5-2 s between the log statements "Everything is ready. Begin running and processing data." and "EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]".

Is this much delay expected?

@serkan-ozal
Contributor

Hi @bhaskarbanerjee, is it possible to set the OTEL_LOG_LEVEL env var to true and share the debug logs with us?

@bhaskarbanerjee
Author

Thanks for the quick response @serkan-ozal. Here it is:
otel-lambda-collector.log

@bhaskarbanerjee
Author

2024-08-26T13:39:55.810-05:00 | {"level":"warn","ts":1724697595.8106759,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}

THIS IS WHERE WE SEE A DELAY OF ~2-4 seconds.

2024-08-26T13:39:57.228-05:00 | EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]
2024-08-26T13:39:57.229-05:00 | START RequestId: 6a59d6bb-e5ce-46d7-9b70-37069cf7f12a Version: $LATEST

@bhaskarbanerjee
Author

Here is my config.

receivers:
  otlp:
    protocols:
      http:
        endpoint: "localhost:4318"

exporters:
  logging:
    loglevel: info
  otlphttp:
    endpoint: ${OTEL_GATEWAY_ENDPOINT}
    retry_on_failure:
      initial_interval: ${INITIAL_INTERVAL}
      max_interval: ${MAX_INTERVAL}
      max_elapsed_time: ${MAX_ELAPSED_TIME}
    timeout: ${TRANSPORT_TIMEOUT}
    sending_queue:
      queue_size: 5000

processors:
  decouple:
    max_queue_size: 5000

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging, otlphttp]
      processors: [decouple]
    metrics:
      receivers: [otlp]
      exporters: [logging, otlphttp]
      processors: [decouple]

@serkan-ozal
Contributor

serkan-ozal commented Aug 26, 2024

@bhaskarbanerjee
While I am not fully sure at the moment, I think there is Lambda runtime and function init (with OTEL SDK init) between the "Everything is ready. Begin running and processing data." and "EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]" logs.

In my simple example, I have seen runtime and function init logs between those collector extension logs (~1.5s).

To check this, you may also put debug print statements on the first and last lines (and at some other important points) of the handler.py file to measure the load start/end timings of your handler module/file (not the handler function, but the handler file).
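For instance, a minimal sketch of what such timing statements could look like (the module layout and handler body here are hypothetical, just to illustrate measuring module load time):

```python
# handler.py
import time

# First line of the module: record when the handler file starts loading.
print(f"[init-timing] handler module load started at {time.time():.3f}")

# ... your imports and other module-level initialization go here ...

def lambda_handler(event, context):
    # The handler function itself; it runs only after module load is complete.
    return {"statusCode": 200, "body": "ok"}

# Last line of the module: record when the handler file finished loading.
print(f"[init-timing] handler module load finished at {time.time():.3f}")
```

The difference between the two timestamps in the logs approximates the handler module load time during init.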

@bhaskarbanerjee
Author

Thanks @serkan-ozal. I have other Lambdas too which are not using OpenTelemetry, and in those cases there is no such wait time. Here is an example using the New Relic layer and another layer from HashiCorp in the same Lambda.

2024-08-28T09:42:32.881-05:00 EXTENSION Name: newrelic-lambda-extension State: Ready Events: [INVOKE, SHUTDOWN]

2024-08-28T09:42:32.881-05:00 EXTENSION Name: vault-lambda-extension State: Ready Events: [INVOKE, SHUTDOWN]

2024-08-28T09:42:32.886-05:00 START RequestId: 81648590-a693-4abb-b0e4-26b0a067a0f8 Version: $LATEST

In this case the Lambda lifecycle API shows it took 5 ms to start after the extensions reached the ready state, which is quite acceptable.

@serkan-ozal
Contributor

serkan-ozal commented Aug 28, 2024

@bhaskarbanerjee

Here are the logs from my sample Nodejs Lambda:

With OTEL SDK + User Handler Init

2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.7748003,"caller":"[email protected]/otlp.go:102","msg":"Starting GRPC server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4317"}
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.7753932,"caller":"[email protected]/otlp.go:152","msg":"Starting HTTP server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4318"}
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.775681,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.7757266,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}
2024-08-28T20:55:32.249Z

Registering OpenTelemetry

...

2024-08-28T20:55:33.288Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]
2024-08-28T20:55:33.290Z START RequestId: 5ef7f288-ba75-4637-b44b-3ae9cd4ccf13 Version: $LATEST
2024-08-28T20:55:33.296Z 2024-08-28T20:55:33.296Z 5ef7f288-ba75-4637-b44b-3ae9cd4ccf13 INFO Serving lambda request.

Time diff between collector logs: 1,513 ms

2024-08-28T20:55:31.775Z {"level":"info","ts":1724878531.775681,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:55:33.288Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]

With OTEL SDK + No User Handler Init

2024-08-28T20:33:34.018Z {"level":"info","ts":1724877214.018901,"caller":"[email protected]/otlp.go:102","msg":"Starting GRPC server","kind":"receiver","name":"otlp","data_type":"metrics","endpoint":"localhost:4317"}
2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.0191104,"caller":"[email protected]/otlp.go:152","msg":"Starting HTTP server","kind":"receiver","name":"otlp","data_type":"metrics","endpoint":"localhost:4318"}
2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.019237,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.0193567,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}

2024-08-28T20:33:34.447Z Registering OpenTelemetry

...

2024-08-28T20:33:34.617Z 2024-08-28T20:33:34.617Z undefined DEBUG @opentelemetry/instrumentation-http Applying instrumentation patch for nodejs core module on require hook { module: 'http' }
2024-08-28T20:33:34.623Z 2024-08-28T20:33:34.623Z undefined DEBUG @opentelemetry/instrumentation-aws-lambda Applying instrumentation patch for nodejs module file on require hook { module: '/var/task/index.js', version: '0.0.0', fileName: 'index', baseDir: '/var/task' }
2024-08-28T20:33:34.623Z 2024-08-28T20:33:34.623Z undefined DEBUG patch handler function
2024-08-28T20:33:34.637Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]
2024-08-28T20:33:34.638Z START RequestId: 6fe1e4b8-c70c-4c31-9ff9-084fe62f8897 Version: $LATEST
2024-08-28T20:33:34.641Z 2024-08-28T20:33:34.641Z 6fe1e4b8-c70c-4c31-9ff9-084fe62f8897 INFO Serving lambda request.

Time diff between collector logs: 618 ms

2024-08-28T20:33:34.019Z {"level":"info","ts":1724877214.019237,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:33:34.637Z EXTENSION Name: collector State: Ready Events: [SHUTDOWN, INVOKE]

No OTEL SDK + No User Handler Init

2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7347512,"caller":"[email protected]/otlp.go:102","msg":"Starting GRPC server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4317"}
2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7348504,"caller":"[email protected]/otlp.go:152","msg":"Starting HTTP server","kind":"receiver","name":"otlp","data_type":"traces","endpoint":"localhost:4318"}
2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7349138,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:45:43.735Z {"level":"info","ts":1724877943.7349257,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}
2024-08-28T20:45:43.863Z EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]
2024-08-28T20:45:43.864Z START RequestId: 8e450b50-32d2-42aa-8f38-34ffebf46032 Version: $LATEST
2024-08-28T20:45:43.866Z 2024-08-28T20:45:43.866Z 8e450b50-32d2-42aa-8f38-34ffebf46032 INFO Serving lambda request.

Time diff between collector logs: 129 ms

2024-08-28T20:45:43.734Z {"level":"info","ts":1724877943.7349138,"caller":"[email protected]/service.go:221","msg":"Everything is ready. Begin running and processing data."}
2024-08-28T20:45:43.863Z EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]

So, in summary, as you can see from the logs, the time diff between the collector logs depends heavily on the OTEL SDK init and user handler init delays. In your example, since the handler itself is just a simple hello-world function without initializing any 3rd-party dependency, I think the 2-4 second diff between the collector logs is caused by OTEL Python SDK initialization.

I will also rerun this benchmark with a sample Python Lambda and share the results. I am not a Python expert, but maybe this type of auto instrumentation is heavyweight for the Python runtime:

exec python3 $LAMBDA_LAYER_PKGS_DIR/bin/opentelemetry-instrument "$@"

@bhaskarbanerjee
Author

@serkan-ozal Many thanks, I really appreciate your help. Quick question: what config of the collector are you running with? Is it the default one?

I am gathering evidence on my end. Will share soon.

@serkan-ozal
Contributor

@bhaskarbanerjee The default one.

@serkan-ozal
Contributor

@bhaskarbanerjee I am not sure, but this type of auto instrumentation (it checks the dependencies and forks another process from the main one to run the real app) does not seem lightweight to me: https://github.com/open-telemetry/opentelemetry-python-contrib/blob/main/opentelemetry-instrumentation/src/opentelemetry/instrumentation/auto_instrumentation/__init__.py

Maybe this is a question to ask the OTEL Python folks.

@serkan-ozal
Contributor

@bhaskarbanerjee I think my theory about the "EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]" log (which I mentioned in the SIG meeting) is correct. As far as I can see, that log is not printed when the extension is ready, but when both the runtime and the function are initialized. To verify this, I think we need to add some more debug logs to the collector, especially right after it calls /next.

@bhaskarbanerjee
Author

Agreed on both your points. I have already added logs to find out more about several other inits and calls. I will share those tests after I run the scenarios we discussed this morning.
But so far, with your results and mine, the culprit seems to be the SDK, especially AUTO instrumentation.

@bhaskarbanerjee
Author

@serkan-ozal Here are my test results with a Lambda auto instrumented in Node.js v18.
Conclusion is the same as yours.

| SDK Added | SDK Enabled | DecoupledProcessor | OTLPHTTP Exporter | Wait Time (ms) | Total Cold Start (ms) |
| --- | --- | --- | --- | --- | --- |
|  |  |  |  | 145 | 918 |
|  |  |  |  | 174 | 1030 |
|  |  |  |  | 1684 | 1740 |
|  |  |  |  | 184 | 1440 |
|  |  |  |  | 147 | 803 |
|  |  |  |  | 176 | 844 |
|  |  |  |  | 149 | 746 |

Inference:

  • Addition of the SDK introduces high cold start time

  • Enabling the SDK in AUTO INSTRUMENTATION mode further increases the cold start.

@serkan-ozal
Contributor

serkan-ozal commented Aug 29, 2024

Thanks for the detailed benchmark @bhaskarbanerjee.

I have also reproduced similar cold start delays with a Python example and verified that Python auto instrumentation is the reason. I will look into it further tomorrow by profiling the Python auto instrumentation script.

But initially, as far as I can see from the Python auto instrumentation, it (see the sketch after this list):

  • starts the user handler in a child process
  • scans the entire Python path on disk to discover auto instrumentation libraries (there are many)
  • checks version conflicts between OTEL dependencies
  • then initializes the instrumenters along with their pre and post hooks
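A rough sketch of what that discovery step amounts to (simplified and hedged; the entry-point group name and the conflict-check helper below reflect my reading of the linked auto_instrumentation code, not an exact copy of it):

```python
from importlib.metadata import entry_points


def has_dependency_conflict(entry_point) -> bool:
    # Placeholder for the real check, which compares the instrumentation's
    # declared target-library requirements against the installed versions.
    return False


def load_instrumentors():
    # Every installed opentelemetry-instrumentation-* package registers an
    # entry point; enumerating them means walking the installed distributions,
    # which is what makes cold starts expensive when many packages are present.
    for entry_point in entry_points(group="opentelemetry_instrumentor"):
        if has_dependency_conflict(entry_point):
            continue
        instrumentor_class = entry_point.load()
        instrumentor_class().instrument()  # wires up pre/post hooks


load_instrumentors()
```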

@dyladan
Member

dyladan commented Aug 29, 2024

Autoinstrumentation is definitely a cold start penalty in JS. The more instrumentations are enabled, the higher the penalty. This is because we wrap require and intercept all module loads to determine whether a module needs to be monkeypatched. It used to be much worse, when each instrumentation maintained its own wrapper; now there is a single global wrapper that instrumentations register with. There may be additional optimizations that can be done, but the mechanism will always have some module load penalty.

Edit: tl;dr: because Node autoinstrumentation uses require-in-the-middle, each module load is slightly slower than it would be without autoinstrumentation enabled.

@bhaskarbanerjee
Author

Thanks @dyladan. Would choosing specific instrumentations lower the cold starts?
Ref:
OTEL_NODE_RESOURCE_DETECTORS from https://opentelemetry.io/docs/zero-code/js/
OTEL_PYTHON_DISABLED_INSTRUMENTATIONS from https://opentelemetry.io/docs/zero-code/python/configuration/#disabling-specific-instrumentations

Or are there any other ways apart from manual instrumentation?

Do we have any tests conducted/documented by the community to determine the performance difference between AUTO instrumentation and MANUAL instrumentation?
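For reference, a minimal sketch of what manual (code-based) instrumentation of a Python handler could look like, as the alternative being compared against here (the endpoint, span name, and handler body are illustrative assumptions, not taken from this issue):

```python
# Manual setup: initialize the SDK yourself and instrument only what you need,
# avoiding the auto-discovery scan at cold start.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)


def lambda_handler(event, context):
    # One hand-written span around the request instead of auto-generated ones.
    with tracer.start_as_current_span("handler"):
        return {"statusCode": 200}
```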

@serkan-ozal
Contributor

@bhaskarbanerjee

For the Python runtime, I have analyzed the OTEL Python auto instrumentation script and verified that the following parts have a significant effect on cold start (for a 512 MB Lambda function):

  • auto discovery process: 750-1000 ms
  • botocore instrumentation: 350-400 ms

I am not sure whether there is anything we can do for the botocore instrumentation, because it is caused by the import of some botocore modules by wrapt, which is the auto instrumentation library for Python.

But instead of auto discovery of the OTEL packages, I have switched to a static list of instrumentation packages, like we already do in the Node.js layer, and it helped me cut 750-1000 ms from the cold start delay.
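A minimal sketch of the static-list approach (the particular instrumentor set below is illustrative, not the exact list used in the experimental layer):

```python
# Instead of scanning entry points on disk, instantiate a known, fixed set of
# instrumentors directly; only the libraries actually used need to be listed.
from opentelemetry.instrumentation.botocore import BotocoreInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor


def instrument_statically():
    # Each instrumentor patches its target library; no package discovery runs.
    BotocoreInstrumentor().instrument()
    RequestsInstrumentor().instrument()


instrument_statically()
```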

If you are also interested in testing my optimized layer (which is experimental, so please don't use it in prod :) ), I can share its ARN so you can benchmark it too.

@serkan-ozal
Contributor

serkan-ozal commented Aug 31, 2024

@bhaskarbanerjee
Author

Hi @serkan-ozal. Sorry, I was out for a few days.

Can you please share which instrumentations you have included? I tried setting OTEL_PYTHON_DISABLED_INSTRUMENTATIONS to none, but it didn't yield much savings, so I would like to try out what you have tried.

> But instead of auto discovery of the OTEL packages, I have switched to static list of instrumentation packages like we already do in Node.js layer and it helped me to cut 750-1000 ms from coldstart delay.
