v2.15.0 — context-prefixed embedding text builder ("late-chunking-style"
context windowing). Prepends the document title + heading breadcrumb,
then includes a tail of the previous chunk + the chunk itself + a head
of the next chunk, all bounded so the multilingual model's 128-token
context budget isn't blown.
Why: short standalone chunks ("Use Adam β=0.9, β=0.999") embed
identically across documents, losing the surrounding context that
disambiguates them. Adding ~50-100 chars of neighbor text + the
doc title + breadcrumb gives the bi-encoder enough signal to keep
cross-document semantic separation. Per Chroma 2024 + Jina AI's late
chunking blog: typically +2-5 NDCG@10 with no new dependencies.
Returns the concatenated text. When contextChars ≤ 0, returns the
legacy v2.1.0 form (just breadcrumb + chunk text), preserving
bit-for-bit behavior for users who don't opt in.
v3.8.0-rc.6 ARCH-1 — moved here from server.ts to break circular import.
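
A minimal sketch of the windowing described above, not the actual implementation. The names `ChunkContext` and `buildEmbeddingText`, the separators, and the exact layout of the legacy form are assumptions made for illustration; the character cap is used here only as a rough proxy for the 128-token budget.

```typescript
// Hypothetical shapes and names; the real signature lives in the source module.
interface ChunkContext {
  docTitle: string;
  breadcrumb: string[]; // heading path, outermost first
  prevText?: string;    // text of the previous chunk, if any
  nextText?: string;    // text of the next chunk, if any
}

function buildEmbeddingText(
  chunkText: string,
  ctx: ChunkContext,
  contextChars: number,
): string {
  const crumb = ctx.breadcrumb.join(" > ");

  // Opt-out path: breadcrumb + chunk text only (assumed layout of the legacy form).
  if (contextChars <= 0) {
    return crumb.length > 0 ? `${crumb}\n${chunkText}` : chunkText;
  }

  // Title + breadcrumb header, skipping whichever part is empty.
  const header = [ctx.docTitle, crumb].filter((s) => s.length > 0).join(" | ");

  // Tail of the previous chunk and head of the next, each capped at contextChars.
  const tail = (ctx.prevText ?? "").slice(-contextChars);
  const head = (ctx.nextText ?? "").slice(0, contextChars);

  return [header, tail, chunkText, head].filter((s) => s.length > 0).join("\n");
}
```

Capping each neighbor separately keeps the chunk itself intact; a real implementation would presumably also need to bound the header so the total stays inside the model's budget.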