
Images are not visible in the console and are invalid in the ChatML / string representation. #1077

Open
dchichkov opened this issue Nov 15, 2024 · 3 comments

@dchichkov

The bug
Images render correctly in Jupyter, but they are not visible in the console, and the ChatML/string representation becomes invalid once the runtime stops.

<|im_start|>user
What is the capital of <|_image:94590504830032|>?<|im_end|>
<|im_start|>assistant
I'm unable to view or interpret images directly. However, if you provide a description of the image or additional context, I'd be happy to help you with any information you need!<|im_end|>

The current image representation is a custom ChatML tag, <|_image:94590504830032|>, where the number is a runtime object id that is only valid while the process that created it is alive. As a result, the ChatML log produced by str(lm) is effectively invalid once the session ends. Note also that <|_image:xxx|> is not valid Markdown or HTML, so regular Markdown viewers that can render the rest of the ChatML exchange cannot display the images.

It would be good to normalize the representation and use standard HTML <img> tags instead of the newly introduced <|_image: |> tag, e.g. <img src="https://...">, <img src="file://...">, or <img src="data:image/jpeg;base64,...">.
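For illustration, a normalized representation could embed the raw image bytes as a standard data URI, which any HTML or Markdown viewer can render and which survives across runtimes. This is a sketch of the idea, not Guidance's actual API; the function name is hypothetical:

```python
import base64


def image_to_img_tag(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    # Base64-encode the raw image bytes into a data URI wrapped in an
    # HTML <img> tag, so the serialized prompt stays self-contained.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f'<img src="data:{mime};base64,{encoded}">'
```

Unlike a runtime object id, the resulting tag is valid wherever the log is viewed, at the cost of a larger serialized prompt.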

To Reproduce

with user():
    lm = self.gpt + "What is the capital of " + image("france.jpg") + "?"

with assistant():
    lm += gen("capital")

System info (please complete the following information):

  • OS: Ubuntu 22.04 (VS Code)
  • Guidance Version: 0.1.16
@dchichkov (Author)

If this helps: as a workaround for vLLM / OpenAI-hosted VLMs, I've implemented an httpx request hook that expands <img> URLs into the OpenAI image_url message format:

import json
import re

import httpx


def hook(request):
    # Rewrite the outgoing request body: expand <img src="..."> tags embedded
    # in string message contents into OpenAI image_url content parts.
    j = json.loads(request.content)

    def process_content(input_str):
        # Split on <img> tags, capturing the full tag (group 1) and the src
        # URL (group 2); [hfd] matches http(s)://, file://, and data: URLs.
        parts = re.split(r'(<img\s+[^>]*src=[\'"]([hfd].*?)[\'"][^>]*>)', input_str)
        content, image_url = [], None

        for i in range(len(parts)):
            if i % 3 == 0:  # Text part (regular text before or after <img> tags)
                text_part = parts[i].strip()
                if text_part:
                    content.append({"type": "text", "text": text_part})
            elif i % 3 == 2:  # Image URL part (captured inside the <img> tag)
                image_url = parts[i].strip()
                content.append({"type": "image_url", "image_url": {"url": image_url}})
                
        if not image_url:
            return input_str

        return content

    # Split the content of the message into text and image_url parts
    for message in j['messages']:
        if isinstance(message["content"], str):
            message["content"] = process_content(message["content"])

    modified_content = json.dumps(j).encode('utf-8')
    request.stream = httpx.ByteStream(modified_content)
    request.headers['Content-Length'] = str(len(modified_content))

It can be used with:

...
import httpx, json

llm = models.OpenAI(...,
                    http_client=httpx.Client(event_hooks={'request': [hook]}))

llm = llm + "Is there a dog in this image? " + "<img src=' ... '>"
...
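For reference, the split inside the hook above can be exercised standalone. This is a sketch with a hypothetical input string, using the same regex (group 1 is the whole <img> tag, group 2 is the src URL):

```python
import re


def split_text_and_images(input_str):
    # Same pattern as the hook: alternate plain-text and <img> segments,
    # turning each into an OpenAI-style content part.
    parts = re.split(r'(<img\s+[^>]*src=[\'"]([hfd].*?)[\'"][^>]*>)', input_str)
    content = []
    for i, part in enumerate(parts):
        if i % 3 == 0 and part.strip():  # plain text between tags
            content.append({"type": "text", "text": part.strip()})
        elif i % 3 == 2:                 # captured src URL from the tag
            content.append({"type": "image_url", "image_url": {"url": part.strip()}})
    return content
```

Because the pattern has two capture groups, re.split yields parts in a repeating (text, full tag, url) cycle, which is why the hook indexes them modulo 3.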

@nking-1 (Collaborator) commented Nov 15, 2024

Thank you for the feedback. We are actively working on a full-stack rework of multimodal support in Guidance. Part of this rework involves reformatting the way that prompt data is represented internally, which should enable us to have more flexibility in how we present the data to users.

Can you share more info on how you are trying to use the output of str(lm)? I would like to reproduce the error you're getting. It sounds like you expect str(lm) to produce a valid HTML string with the image data encoded in an <img> tag. Currently, str(lm) is meant to output the internal string prompt representation, which is a Guidance-specific format. With more information about your use case, we can consider adding a function that produces the output you want. There is also ongoing work to rework the Jupyter UX alongside the multimodal improvements, so this is a good opportunity to take your feedback into account.

@dchichkov (Author)

dchichkov commented Nov 16, 2024 via email
