Django StreamingHttpResponse View not working as expected #3736

Open

marcoaleixo opened this issue Nov 5, 2024 · 13 comments · Fixed by #3798

marcoaleixo commented Nov 5, 2024

How do you use Sentry?

Sentry Saas (sentry.io)

Version

2.18.0

Steps to Reproduce

  1. Have a view like this (simplified):

import sentry_sdk
from django.http import StreamingHttpResponse
from rest_framework.views import APIView

class StreamingView(APIView):

    def post(self, request, *args, **kwargs):
        # record, parameters and dummy_function come from my application code
        span = sentry_sdk.start_span(name=f"Log: {record.msg}", op="log.info", sampled=False)
        print(span)
        span.set_data("value", record.msg)
        span.set_data("filename", f"{record.pathname} at line {record.lineno}")
        span.finish()

        generator_streaming = dummy_function.execute(parameters=parameters)

        return StreamingHttpResponse(generator_streaming, content_type='text/event-stream')

Here the span's trace_id will be, for instance, "1234".

  2. Inside dummy_function:

def execute(parameters=None):
    # record comes from my application code; the rest of the function
    # yields the chunks that are streamed in the response
    span = sentry_sdk.start_span(name=f"Log: {record.msg}", op="log.info", sampled=False)
    print(span)
    span.set_data("value", record.msg)
    span.set_data("filename", f"{record.pathname} at line {record.lineno}")
    span.finish()

In this function, the trace_id will be totally different.

I think this is related to StreamingHttpResponse, because Sentry works fine for all my other views.

Feel free to ask for more details to help with the investigation.

FROM python:3.9.18-slim-bullseye
Django==4.1
djangorestframework==3.14

I'm running my server with python manage.py runserver 0.0.0.0:8000
In production we use something like gunicorn api.wsgi -b 0.0.0.0:8081

Expected Result

The trace_id should stay the same in StreamingHttpResponse views and underlying function calls.

Actual Result

The trace_id is different between the view and underlying function calls.

marcoaleixo changed the title from "Django StreamingHttpResponse not working as expected" to "Django StreamingHttpResponse View not working as expected" on Nov 5, 2024

marcoaleixo commented Nov 5, 2024

Do we have a way to set the trace_id?
Maybe an easy fix is to get the trace_id at the view level, pass it as a parameter to my function, and do a transaction.set_trace_id(trace_id=trace_id).

But it seems that we don't have a set_trace_id thing at transaction/event level, right?

I know that we can do a Span(trace_id=myid), but passing this to ALL calls downstream is bad. 😅
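For illustration, one possible workaround (not an official recommendation) is to capture the current trace headers in the view and continue the trace manually inside the generator, using the SDK's distributed tracing helpers (get_traceparent, get_baggage, continue_trace). This is only a sketch; stream_chunks, the op/name values, and the chunk contents are made-up examples:

import sentry_sdk
from django.http import StreamingHttpResponse
from rest_framework.views import APIView

def stream_chunks(trace_headers):
    # Continue the trace that was active in the view. The generator body runs
    # after the request transaction has finished, so it starts its own
    # transaction that shares the original trace_id.
    transaction = sentry_sdk.continue_trace(
        trace_headers, op="http.server", name="streaming body"
    )
    with sentry_sdk.start_transaction(transaction):
        with sentry_sdk.start_span(op="log.info", name="chunk"):
            yield b"data: hello\n\n"

class StreamingView(APIView):
    def post(self, request, *args, **kwargs):
        # Capture the trace context while the request transaction is still active.
        trace_headers = {
            "sentry-trace": sentry_sdk.get_traceparent(),
            "baggage": sentry_sdk.get_baggage(),
        }
        return StreamingHttpResponse(
            stream_chunks(trace_headers), content_type="text/event-stream"
        )

With this, the spans created in the generator share the view's trace_id (they land in a second transaction within the same trace), at the cost of one extra transaction per streaming request.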

@antonpirker (Member)

Hey @marcoaleixo thanks for raising this, I will have a look!


antonpirker commented Nov 7, 2024

Indeed!

I have now reproduced this with this demo application:
https://github.com/antonpirker/testing-sentry/blob/main/test-django-rest-framework-streaming/mysite/api/views.py

The first span "Streaming View" is present in the trace, but the second span "Stream Response Generator" is missing (because it has the wrong trace id):
[screenshot of the trace]

@antonpirker (Member)

This is a bug. I guess it has something to do with how Django handles streaming views and what we monkey patch in Django. Sentry is probably not patching the functions in Django that are handling the streaming responses.


marcoaleixo commented Nov 12, 2024

Hey @antonpirker, thanks for taking a look, but I have a quick question: how do you prioritize these fixes?
This bug is creating a hole in the observability of one of my main endpoints :/

I'm just trying to understand whether I'll need to use another observability tool until this is fixed.
I have some bandwidth to work on the fix myself, but I'm not sure I have enough context for it.

@antonpirker (Member)

Hey! I will have some time today and tomorrow to look at this. I will keep you posted on the progress.

@antonpirker (Member)

I have now tested with Uvicorn and Gunicorn.

Using Uvicorn (ASGI) it works. The parent-child relationship is not correct, but at least the spans are there:
[screenshot of the trace]

Using Gunicorn (WSGI) it does not work:
[screenshot of the trace]

@antonpirker (Member)

Small update: the problem is that StreamingHttpResponse() returns right away, which causes the transaction to be closed immediately. The spans created in the generator fall outside the transaction's time span, so they are not part of that trace and are lost.
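For illustration, a minimal sketch of that timing (the view and generator names here are made up):

import sentry_sdk
from django.http import StreamingHttpResponse

def event_stream():
    # This body only runs once the WSGI server iterates the response,
    # i.e. after the view has returned and the request transaction has
    # already been finished.
    with sentry_sdk.start_span(op="log.info", name="generate chunk"):
        yield b"data: hello\n\n"

def streaming_view(request):
    # The transaction wrapping this view is finished as soon as the response
    # object is returned, not when the generator is exhausted.
    return StreamingHttpResponse(event_stream(), content_type="text/event-stream")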

antonpirker added a commit that referenced this issue Nov 25, 2024
Fixes spans in streaming responses when using WSGI-based frameworks. Only close the transaction once the response has been consumed. This way, all the spans created while the response is being generated are recorded with the transaction:

- The transaction stays open until all the streaming blocks have been sent to the client. (Because of this I had to update the tests to make sure they consume the response, since the Werkzeug test client, used by Flask, Django, and our Strawberry tests, does not close the WSGI response.)
- A maximum runtime of 5 minutes for transactions is enforced (like the JavaScript SDK does).
- When a generator produces the streaming response, the correct scopes are used, so the spans created in the generator get the correct parent-child relationship.

People using Sentry in a streaming application will:
- See their transaction duration increase, up to 5 minutes
- Get the correct span tree for streaming responses produced by a generator

Fixes #3736
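For illustration, a rough sketch of the approach described in that commit (not the actual SDK patch; the class name is made up): wrap the WSGI response iterable and finish the transaction only after the streamed body has been fully consumed or closed.

class StreamingTransactionWrapper:
    """Wrap a WSGI response iterable and defer transaction.finish()
    until the body has been fully streamed."""

    def __init__(self, response_iter, transaction):
        self._response_iter = iter(response_iter)
        self._transaction = transaction

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._response_iter)
        except StopIteration:
            # Last chunk sent: now it is safe to close the transaction.
            # (finish() ignores repeated calls on an already-finished span.)
            self._transaction.finish()
            raise

    def close(self):
        # WSGI servers call close() on the response; propagate it and make sure
        # the transaction is finished even if the client disconnects early.
        close = getattr(self._response_iter, "close", None)
        if close is not None:
            close()
        self._transaction.finish()

The actual change also enforced the five-minute maximum runtime and restored the correct scopes inside the generator, which this sketch leaves out.
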
@sentrivana (Contributor)

Reopening since we ultimately reverted the change.


marcoaleixo commented Dec 16, 2024

Is the fix for this bug complicated?
I'm asking because this is really making it harder for us to debug issues in our LLM-based endpoints.

Do you think it's too risky to deploy the reverted PR? Is performance the only thing that made you revert it?
I don't have the mooost popular (5M spans/day) service out there (yet), so maybe I can monitor closely and roll back if needed.

@sentrivana (Contributor)

Hey @marcoaleixo, it's unfortunately trickier than we realized, and there likely won't be a proper fix until after the holidays.

The PR was reverted for performance reasons: it spawns an additional thread per request. It's hard to tell at what point this starts being a problem, since it depends on your setup etc.; it could work fine for you. As long as you have the option to monitor it for some time, I think you could go for it.


marcoaleixo commented Dec 20, 2024

Just sharing that we deployed the branch on Monday, and there has been no major impact on our latency/CPU usage.
We are actually sending only < 3M spans daily.

@szokeasaurusrex (Member)

Awesome, thanks for sharing @marcoaleixo! As @sentrivana mentioned, we hope to continue working on a proper fix after the holidays.
