Make detection of binary files more robust#515
Conversation
| (if (and mime-type | ||
| (not (or (string-prefix-p "text/" mime-type) | ||
| (string-match-p "\\(?:json\\|xml\\|yaml\\)$" mime-type)))) | ||
| ;; Read contents as base-64 encoded binary |
There was a problem hiding this comment.
Won't small non-image binary files (PDFs, .docx, etc.) that pass the size check will now be embedded as "resource" blocks with base64-encoded content in the text field? Before this PR, json-serialize would error on non-UTF-8 bytes and the condition-case would catch it, falling back to a plain text mention, somewhat accidentally the right behavior.
Maybe resource embed branch in the caller probably needs a (not (map-elt file :base64-p)) guard so binary non-image files fall through to resource_link instead?
There was a problem hiding this comment.
I worked on this, because I @-mentioned a small PDF file that was then embedded in the message but not base64 encoded. agent-shell then refused any further cooperation, when sending the request failed on json-serialize in acp.
| (push `((type . "resource") | ||
| (resource . ((uri . ,(concat "file://" resolved-path)) | ||
| (text . ,(map-elt file :content)) | ||
| (,(if (map-elt file :base64-p) 'blob 'text) . ,(map-elt file :content)) |
There was a problem hiding this comment.
This embeds small non-text files as binary resources now as documented here. And I think at least claude would also support embedded PDFs, if these API docs also apply to ACP.
Regarding and agent-shell-text-file-capabilities above, why is this necessary to work with embedded resources? From the naming and the pattern in the image case, I would think that supports-embedded-context would be the right test to check if the agent can deal with embedded context.
0727958 to
934e3c1
Compare
934e3c1 to
388ac64
Compare
The previous heuristic would try to embed non-image files as text, which would fail, for example, for small PDFs. For those files, which contain non-UTF-8 bytes,
json-serializewould raise an error, because it would be an invalid UTF-8 string.The new function has more robust MIME type detection based on Emacs'
mailcapand ensures that only files with valid multi-byte content will not be base64 encoded.Checklist
M-x checkdocandM-x byte-compile-file.