Describe the bug
When the aws-otel-python layer is used, Lambda timeouts cause high initialization times on the next Lambda invocation.
When our database is under high load, it can cause our Lambdas to time out. In these high-load situations we are also seeing very high Lambda initialization times of 19 seconds. Importantly, this initialization time is charged for and is deducted from our remaining runtime: with a 30 second Lambda timeout, context.get_remaining_time_in_millis() reports only 11 seconds when the handler is called.
I have tested that these high initialization times are only present when the aws-otel-python layer is deployed and configured. I don't have experience with other extensions, so I can't confirm whether this is specifically an OpenTelemetry layer issue or something inherent to extensions.
Steps to reproduce
This can be reproduced with a time.sleep call in the Lambda handler, and by invoking the Lambda with enough load that execution environments are reused.
Normal initialization times for this example function are 1.2 seconds, with the outliers being around 9 seconds.
CloudWatch also shows that the function, configured with a 15 second timeout, only had 6 seconds left to run.
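A minimal handler along these lines is enough to trigger the timeouts (a sketch only; the function actually deployed by the CloudFormation template may differ, and the sleep duration is just a stand-in for a slow database call):

import time

def lambda_handler(event, context):
    # How much of the configured timeout is actually left once the handler starts.
    print(f"remaining ms at handler entry: {context.get_remaining_time_in_millis()}")
    # Sleep past the function timeout to simulate a database call that never returns in time.
    time.sleep(60)
    return {"statusCode": 200}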
CloudFormation and Python code is shared below which can be used to reproduce. Warnings:
The code deploys a public URL.
The Python script will generate 100 requests to the URL against a 256 MB Lambda with a 15 second timeout.
You should expect to see printed results like this:
count inits: 11
count number of inits over 6 seconds: 7
min init duration: 1.2568621635437012
max init duration: 9.623291969299316
mean init duration: 6.339779875495217
mean duration of inits under 6 seconds: 1.2770129442214966
inits trace ids over 6 second ['1-6546c293-xxxxxxxxxx', '1-6546c273-xxxxxx', '1-6546c293-xxxxxxxxx', '1-6546c293-xxxxxx', '1-6546c274-xxxxxx', '1-6546c273-xxxxx', '1-6546c273-xxxxxx']
from datetime import datetime, timedelta
import boto3
import json
from statistics import mean
import time
import concurrent.futures
import urllib.request
function_arn = "your function arn"
url = "your function url"
threshold = 6  # initialization duration in seconds considered bad
print("Sending web requests for 1 minute")
worker_count = 20
requests_per_worker = 4
client = boto3.client("xray")
# Each worker sends a few requests to the function URL; the requests are
# expected to time out, which is what triggers the long re-initializations.
def load_url():
    for _ in range(0, requests_per_worker):
        try:
            httprequest = urllib.request.Request(url)
            with urllib.request.urlopen(httprequest) as response:
                pass
        except Exception as e:
            print(e)  # request is expected to time out

with concurrent.futures.ThreadPoolExecutor(max_workers=worker_count) as executor:
    jobs = []
    for worker in range(0, worker_count):
        job = executor.submit(load_url)
        jobs.append(job)
    for future in concurrent.futures.as_completed(jobs):
        print("completed")
        future.result()
print("Waiting for traces..")
time.sleep(20)
enddate = datetime.utcnow() - timedelta(minutes=3)
now = datetime.utcnow()
token = ''
ids = []
# Page through the X-Ray trace summaries for this function.
while True:
    response = client.get_trace_summaries(
        StartTime=enddate,
        EndTime=now,
        Sampling=True,
        NextToken=token,
        FilterExpression=f'resource.ARN = "{function_arn}"',
    )
    for trace in response["TraceSummaries"]:
        ids.append(trace["Id"])
        print(trace["Id"])
    if response.get("NextToken"):
        token = response["NextToken"]
    else:
        break
print(f"found {len(ids)}")
traces = []
# batch_get_traces accepts at most 5 trace IDs per call.
while True:
    batch = []
    for i in range(0, 5):
        if ids:
            batch.append(ids.pop())
    if batch:
        response = client.batch_get_traces(
            TraceIds=batch,
        )
        traces.extend(response["Traces"])
    else:
        break
init_durations = []
long_init_durations = []
standard_init_durations = []
long_init_durations_ids = []
# Walk each segment document looking for the "Initialization" subsegment
# and record its duration.
def recursive(node, trace):
    if node.get("name") == "Initialization":
        duration = node["end_time"] - node["start_time"]
        init_durations.append(duration)
        print(f"duration: {duration} s")
        if duration > threshold:
            long_init_durations.append(duration)
            long_init_durations_ids.append(trace["Id"])
        if duration < threshold:
            standard_init_durations.append(duration)
    elif node.get("subsegments"):
        for subsegment in node.get("subsegments"):
            recursive(subsegment, trace)

for trace in traces:
    for seg in trace["Segments"]:
        recursive(json.loads(seg["Document"]), trace)
print(f"count inits: {len(init_durations)}")
print(f"count number of inits over {threshold} seconds: {len(long_init_durations)}")
print(f"min init duration: {min(init_durations)}")
print(f"max init duration: {max(init_durations)}")
print(f"mean init duration: {mean(init_durations)}")
print(f"mean duration of inits under {threshold} seconds: {mean(standard_init_durations)}")
print(f"inits trace ids over {threshold} second {long_init_durations_ids}")
What version of collector/language SDK version did you use?
arn:aws:lambda:us-east-1:901920570463:layer:aws-otel-python-amd64-ver-1-20-0:2
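For reference, the layer actually attached to the function can be confirmed with boto3 (a sketch; function_arn is assumed to be the same ARN used in the script above):

import boto3

lambda_client = boto3.client("lambda")
config = lambda_client.get_function_configuration(FunctionName=function_arn)
for layer in config.get("Layers", []):
    print(layer["Arn"])  # expect the aws-otel-python-amd64 layer ARN and version here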
What language layer did you use?
Python 3.11
Additional context
We attempted to mitigate this issue by setting lower statement timeouts on our DB calls, but under high DB load this timeout does not fire accurately.
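One way to set a lower per-statement timeout at the connection level is shown below (a sketch assuming PostgreSQL via psycopg2; the issue does not state the database engine, driver, or how the timeout was applied):

import psycopg2

# Connection parameters are placeholders; statement_timeout is in milliseconds.
# Under heavy database load we still saw queries run well past this limit.
conn = psycopg2.connect(
    host="your-db-host",
    dbname="your-db",
    user="your-user",
    password="your-password",
    options="-c statement_timeout=5000",
)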
Hi @keithpeck, I suspect that the cold start delay might be caused by the collector. I'm not sure how much of the delay it accounts for, but as far as I can see the collector initially fetches its config from S3, so with 256 MB of memory this might add 1-2 seconds of additional delay.
Is it possible to try with a local collector config file?
And can you also share your collector config file itself?
Any leads on this? I have opened this ticket where a local config seems to be good and fast until "Everything is ready. Begin running and processing data", and then there is a wait of about 1-2 seconds (a black box). After that the Lambda Lifecycle API shows EXTENSION Name: collector, State: Ready, Events: [INVOKE, SHUTDOWN].