Parse PDF context
Extract text from PDF inputs supplied via widget data and append it to the model context. Optionally add citation highlights to reference quotes within the PDF.
Reference implementation here.

Architecture
Handle PDF data delivered by the UI through the widget data tool. Support both URL and base64 PDFs and add text to the LLM context.
agents.json
configuration with widget-dashboard-select
feature enabled:
return JSONResponse(content={
"vanilla_agent_pdf": {
"endpoints": {"query": "http://localhost:7777/v1/query"},
"features": {
"widget-dashboard-select": True,
"widget-dashboard-search": False,
},
}
})
Query flow
- Check for human message with
widgets.primary
containing PDF data - Early exit: yield
get_widget_data()
for UI execution - On subsequent tool message:
- Iterate through
DataContent
items - Detect
PdfDataFormat
usingisinstance()
check - Handle both
SingleDataContent
(base64) andSingleFileReference
(URL) - Extract text using
pdfplumber.open()
withio.BytesIO()
- Append extracted text to context string
- Process with LLM and stream response
- Iterate through
- Optionally create
cite()
withCitationHighlightBoundingBox
for text highlighting
OpenBB AI SDK
PdfDataFormat
: Identifies PDF content in widget dataSingleDataContent
: Contains base64-encoded PDF dataSingleFileReference
: Contains URL reference to PDFDataContent
/DataFileReferences
: Containers for data itemsCitationHighlightBoundingBox
: Defines text highlighting coordinatescite(widget, input_arguments, extra_details)
: Creates citations with bounding boxesget_widget_data()
: Requests PDF data from widgets
Core logic
Detect PDF data, extract text, and accumulate as context:
import base64
import io
import pdfplumber
import httpx
from openbb_ai import get_widget_data, cite, citations
from openbb_ai.models import (
QueryRequest, WidgetRequest, PdfDataFormat,
SingleDataContent, SingleFileReference,
DataContent, DataFileReferences,
CitationHighlightBoundingBox
)
async def _download_file(url: str) -> bytes:
async with httpx.AsyncClient() as client:
response = await client.get(url)
return response.content
async def _get_pdf_text(item) -> str:
if isinstance(item, SingleDataContent):
# Handle base64 PDF
file_content = base64.b64decode(item.content)
elif isinstance(item, SingleFileReference):
# Handle URL PDF
file_content = await _download_file(str(item.url))
with pdfplumber.open(io.BytesIO(file_content)) as pdf:
document_text = ""
for page in pdf.pages:
document_text += page.extract_text() + "\n\n"
return document_text
async def handle_widget_data(data: list[DataContent | DataFileReferences]) -> str:
result_str = "--- PDF Content ---\n"
for result_item in data:
for item in result_item.items:
if isinstance(item.data_format, PdfDataFormat):
result_str += f"===== {item.data_format.filename} =====\n"
result_str += await _get_pdf_text(item)
result_str += "------\n"
else:
result_str += str(item.content) + "\n"
return result_str
Add citation highlights with bounding boxes:
# Create citations with text highlighting
citations_list = []
for widget in request.widgets.primary:
citation = cite(
widget=widget,
input_arguments={p.name: p.current_value for p in widget.params},
extra_details={"filename": "document.pdf"}
)
# Add bounding boxes for specific text regions
citation.quote_bounding_boxes = [
[
CitationHighlightBoundingBox(
text="Key financial metrics",
page=1,
x0=72.0, top=117, x1=259, bottom=135
),
CitationHighlightBoundingBox(
text="Revenue increased 15%",
page=1,
x0=110.0, top=140, x1=275, bottom=160
)
]
]
citations_list.append(citation)
if citations_list:
yield citations(citations_list).model_dump()