High-level architectural overview of jeeves-watcher for contributors and advanced users.

The watcher consists of several layered components, described in the sections below.

The system maintains three distinct stores:

- Filesystem — the watched files themselves (the source of truth)
- Metadata store (`.meta.json` sidecars) — enrichment metadata added via API
- Qdrant — embedded vectors and payloads

Either derived store can be rebuilt:

- Qdrant lost → rebuild from filesystem + metadata store:
  `jeeves-watcher reindex`
  Reads files from disk, reads `.meta.json` sidecars, re-embeds content, upserts to Qdrant.
- Metadata store lost → rebuild from Qdrant:
  `jeeves-watcher rebuild-metadata`
  Scrolls Qdrant points, extracts enrichment metadata from payloads, writes `.meta.json` files.

- Debounce: wait `watch.debounceMs` after the last change event (prevents re-embedding during rapid edits).
- Stability check: file size must be unchanged for `watch.stabilityThresholdMs` (prevents embedding partial writes).
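The debounce step can be sketched as a plain timer reset. This is an illustrative helper, not the watcher's actual implementation:

```typescript
// Illustrative debounce: run `fn` only after `debounceMs` of quiet.
// Each new event cancels the pending timer and starts a fresh one.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  debounceMs: number
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), debounceMs);
  };
}
```

A burst of change events therefore triggers a single processing pass once the file goes quiet.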

The core processing flow transforms file changes into searchable vectors:

1. Extract text with the extension-specific extractor
2. Apply matching inference rules — resolve `set` templates and coerce to declared types (v0.5.0+)
3. Record which rules fired in a `matched_rules` array (v0.5.0+)
4. Merge enrichment metadata from the `.meta.json` sidecar, if it exists
5. Chunk, embed, and upsert to Qdrant (payload includes `matched_rules`)
The enrichment API allows external callers to add metadata without re-embedding:
On an enrichment call, the watcher:

- Loads the existing `.meta.json` sidecar (if it exists) and merges the new metadata
- Writes the updated `.meta.json` — atomic write via temp file + rename
- Updates the Qdrant payload for the file's points

Key principle: enrichment updates metadata only. No text extraction, no re-embedding, no hash check. Fast metadata updates without changing vectors.
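The temp-file-plus-rename write can be sketched as below. This is a minimal sketch; `writeMetadataAtomic` is a hypothetical name, not the watcher's API:

```typescript
import { promises as fs } from 'node:fs';
import path from 'node:path';

// Hedged sketch of the "temp file + rename" write: rename() replaces the
// destination in one step on the same volume, so readers never observe a
// partially written sidecar.
async function writeMetadataAtomic(
  sidecarPath: string,
  metadata: Record<string, unknown>
): Promise<void> {
  const tmpPath = sidecarPath + '.tmp';
  await fs.mkdir(path.dirname(sidecarPath), { recursive: true });
  await fs.writeFile(tmpPath, JSON.stringify(metadata, null, 2), 'utf8');
  await fs.rename(tmpPath, sidecarPath);
}
```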

Chunk deduplication: Search returns individual chunks. Callers group by file_path to get unique documents.
Documents exceeding embedding.chunkSize (default: 1000 characters) are split into chunks.
Splitters:

- Markdown (`.md`, `.markdown`) — MarkdownTextSplitter (splits on heading boundaries, preserves structure)
- Everything else — RecursiveCharacterTextSplitter (splits on `\n\n`, then `\n`, then `. `, then characters)

Both splitters use:

- `chunkSize` — max characters per chunk
- `chunkOverlap` — overlap between consecutive chunks (helps preserve context at boundaries)

Chunk Points:
Each chunk becomes a separate Qdrant point with:
- ID: `pointId(filePath, chunkIndex)` (deterministic UUID)
- `file_path`, `domain`, metadata (same across all chunks)
- `chunk_index`, `total_chunks`, `chunk_text`

Example:
File with 3 chunks → 3 Qdrant points:
| Point ID | chunk_index | total_chunks | chunk_text |
|---|---|---|---|
| uuid-0 | 0 | 3 | "First chunk..." |
| uuid-1 | 1 | 3 | "Second chunk..." |
| uuid-2 | 2 | 3 | "Third chunk..." |
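To make the `chunkSize`/`chunkOverlap` arithmetic concrete, here is a naive character-offset sketch. The watcher itself delegates to @langchain/textsplitters, which also respects separator boundaries; this version splits on raw offsets only:

```typescript
// Naive illustration of chunkSize/chunkOverlap semantics: each chunk starts
// (chunkSize - chunkOverlap) characters after the previous one.
function chunkText(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (text.length <= chunkSize) return [text];
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

With chunkSize 1000 and chunkOverlap 100, a 2500-character file yields three chunks of 1000, 1000, and 700 characters.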
The v0.5.0 inference rules system uses declarative JSON Schemas with type coercion. When a rule matches a file, the watcher merges schema references and applies type coercion:

Key steps:

1. Resolve named schema references against the top-level `schemas` collection
2. Coerce values to their declared `type`
3. Resolve `set` templates — interpolate `{{...}}` Handlebars expressions against file attributes

When config files change, the watcher can trigger different reindex modes based on `configWatch.reindex`:

Modes:

- `issues` (default) — re-process only files in issues.json (cheap, targeted)
- `full` — re-process all watched files (expensive, comprehensive)
- `none` — no automatic reindex (manual POST /reindex required)

Every processed file gets a content hash (sha256 of extracted text) stored in the Qdrant payload.
On file modify events, the watcher hashes the newly extracted text; if the stored `content_hash` matches, the file is skipped — no re-embed, no API call.
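The hash computation is straightforward with Node's built-in crypto (sketch; the function name is illustrative):

```typescript
import { createHash } from 'node:crypto';

// sha256 of the extracted text, hex-encoded, as stored in the Qdrant payload.
function contentHash(extractedText: string): string {
  return createHash('sha256').update(extractedText, 'utf8').digest('hex');
}
```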
Qdrant point IDs are deterministic UUIDs derived from file paths:
```typescript
import { v5 as uuidv5 } from 'uuid';

const NAMESPACE = '6a6f686e-6761-6c74-2d6a-656576657321'; // fixed

function pointId(filePath: string, chunkIndex: number): string {
  const normalized = filePath.toLowerCase().replace(/\\/g, '/');
  return uuidv5(`${normalized}:${chunkIndex}`, NAMESPACE);
}
```
Why deterministic: re-processing a file upserts over its existing points instead of creating duplicates, and a file's points can be addressed without any lookup.
Trade-off: Renaming/moving files changes their identity in the index. Acceptable — the watcher handles it transparently.
File events are queued and processed sequentially to maintain consistency.
- Concurrency: embedding API calls are parallelized (up to `embedding.concurrency`), bounded by a rate limiter (`embedding.rateLimitPerMinute`).
- Same-path serialization: events for the same file path are always processed in order (prevents race conditions).
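Same-path serialization can be sketched as a per-path promise chain (illustrative; the names are hypothetical, not the watcher's internals):

```typescript
// Chain each path's tasks so they run in order; different paths run freely.
const tails = new Map<string, Promise<void>>();

function enqueue(filePath: string, task: () => Promise<void>): Promise<void> {
  const prev = tails.get(filePath) ?? Promise.resolve();
  // Run the task after the previous one settles, even if it failed.
  const next = prev.then(task, task);
  tails.set(filePath, next.catch(() => undefined));
  return next;
}
```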
On service start:

- Scan the filesystem for files matching the `watch.paths` globs
- Compare against Qdrant state and re-process files that changed while the service was down

This ensures consistency after service downtime or crashes.
When configWatch.enabled is true, the watcher monitors its own config file.
On config change:
- Debounce for `configWatch.debounceMs` (default: 1000ms)
- Scoped reindex: if a rule matching `d:/meetings/**` changes, only files under `d:/meetings/` are re-evaluated

Metadata sidecars mirror the watched filesystem hierarchy:
File: D:\projects\readme.md
Sidecar: {metadataDir}/d/projects/readme.md.meta.json
Algorithm (import added; separator normalization shown to match the sidecar layout above):

```typescript
import path from 'node:path';

function metadataPath(filePath: string, metadataDir: string): string {
  // Lowercase the drive letter (dropping the colon) and normalize separators,
  // e.g. "D:\projects\readme.md" → "d/projects/readme.md".
  const normalized = filePath
    .replace(/^([A-Z]):/i, (_, d: string) => d.toLowerCase())
    .replace(/\\/g, '/');
  return path.join(metadataDir, normalized + '.meta.json');
}
```
Lookup: O(1) — `fs.existsSync(metadataPath(filePath, metadataDir))`. No index required.
Each `.meta.json` contains only enrichment metadata (not content or embeddings):

```json
{
  "title": "Project Overview",
  "labels": ["documentation", "important"],
  "author": "jeeves",
  "enriched_at": "2026-02-20T08:00:00Z"
}
```
The metadata store is watcher-owned: only the watcher process writes to it, and only via the POST /metadata API.
Enforcement: Architectural policy (no filesystem permissions, cross-platform consistency).
Rules are JSON Schema match + schema arrays, evaluated in order against file attributes.
Attributes:

```json
{
  "file": {
    "path": "d:/meetings/2026-02-20/transcript.md",
    "directory": "d:/meetings/2026-02-20",
    "filename": "transcript.md",
    "extension": ".md",
    "sizeBytes": 4523,
    "modified": "2026-02-20T08:15:00Z"
  },
  "frontmatter": { "title": "Meeting Notes", "author": "jeeves" },
  "json": { "participants": ["Jason", "Devin"] }
}
```
Rule example (v2):

```json
{
  "name": "meetings-classifier",
  "description": "Classify meeting transcripts and notes",
  "match": {
    "properties": {
      "file": {
        "properties": {
          "path": { "type": "string", "glob": "d:/meetings/**" }
        }
      }
    }
  },
  "schema": [
    "base",
    {
      "properties": {
        "domain": { "set": "meetings" },
        "title": { "type": "string", "set": "{{frontmatter.title}}" }
      }
    }
  ],
  "map": "extractProject"
}
```
Processing order for each matching rule:
1. `match` — JSON Schema validation against file attributes
2. `set` resolution — Handlebars template interpolation (`{{...}}`) resolves against attributes, with type coercion
3. `map` — JsonMap transformation (inline or named reference)
4. On field conflicts, `map` output overrides `set` output

Custom `glob` format: Picomatch glob matching registered as an ajv custom keyword. This is the only custom format — everything else is pure JSON Schema.
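For intuition, here is a drastically simplified stand-in for the glob matching. It supports only `**` and `*`; the real keyword delegates to picomatch:

```typescript
// Simplified glob-to-regex conversion (illustrative only):
// "**" matches across path segments, "*" matches within one segment.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  const pattern = escaped
    .replace(/\*\*/g, '\u0000')  // placeholder so "*" handling skips "**"
    .replace(/\*/g, '[^/]*')     // "*" stays within a segment
    .replace(/\u0000/g, '.*');   // "**" crosses segments
  return new RegExp(`^${pattern}$`);
}
```

So the rule pattern `d:/meetings/**` matches any file under `d:/meetings/`, at any depth.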
JsonMap transformations (map):
Rules can reference named maps from the top-level maps config, or include inline JsonMap definitions. The watcher provides lib functions for path manipulation: split, slice, join, toLowerCase, replace, get.
See Inference Rules Guide for full details.
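The `set`-resolution step can be illustrated with a minimal resolver for a single bare `{{path.to.value}}` expression. The watcher uses real Handlebars; this sketch omits type coercion and composite templates:

```typescript
// Resolve a "set" value like "{{frontmatter.title}}" against file attributes.
// Non-template strings are returned as literal values.
function resolveSet(template: string, attrs: Record<string, unknown>): unknown {
  const m = /^\{\{\s*([\w.]+)\s*\}\}$/.exec(template);
  if (!m) return template; // literal value, no interpolation
  return m[1].split('.').reduce<unknown>(
    (obj, key) => (obj as Record<string, unknown> | undefined)?.[key],
    attrs
  );
}
```

For the rule above, `{{frontmatter.title}}` resolves to the document's frontmatter title, while `"meetings"` passes through unchanged.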
| Extension | Extractor | Strategy |
|---|---|---|
| `.md`, `.markdown` | markdown | Strip YAML frontmatter (extract as metadata), return body text |
| `.txt`, `.text` | plaintext | Return as-is |
| `.json` | json-content | Extract string values from known content fields (content, body, text, subject) |
| `.pdf` | pdf-parse | Extract text via unpdf library |
| `.docx` | docx-extract | Extract text via mammoth library |
| `.html`, `.htm` | html-to-text | Strip tags via cheerio library |
Binary files without extractors (e.g., .png, .svg) are skipped — no embedding, no Qdrant entry.
Transient failures (Gemini API, Qdrant) are handled with exponential backoff:
| Failure Type | Retry Policy | Max Retries |
|---|---|---|
| Gemini 429 (rate limit) | Backoff from 1s, respect Retry-After | 5 |
| Gemini 500/503 | Backoff from 2s | 3 |
| Qdrant connection refused | Backoff from 5s | 10 |
| Text extraction error | No retry (log and skip) | 0 |
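A generic backoff helper matching the table's shape might look like this (a sketch; the real retry logic also honors Retry-After, which is omitted here):

```typescript
// Retry with exponential backoff: delay doubles each attempt (base, 2x, 4x...).
async function withRetry<T>(
  fn: () => Promise<T>,
  opts: { baseMs: number; maxRetries: number }
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= opts.maxRetries) throw err; // retries exhausted
      const delayMs = opts.baseMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

For example, the Gemini 429 policy corresponds to `{ baseMs: 1000, maxRetries: 5 }`.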
Issues file: Files that fail all retries are recorded in the issues file (persisted to {stateDir}/issues.json). Surfaced via GET /issues and retried on next config reload or reindex.
On SIGTERM/SIGINT:

- Stop accepting new events and drain in-flight work (bounded by `shutdownTimeoutMs`)

Partially processed files are safe — startup behavior re-processes them by comparing filesystem state against Qdrant.
On first startup, if the Qdrant collection doesn't exist, the watcher creates it:
```typescript
await qdrant.createCollection(collectionName, {
  vectors: {
    size: config.embedding.dimensions,
    distance: 'Cosine'
  }
});
```
Dimension mismatch: If the collection exists with different dimensions (e.g., after switching embedding providers), the watcher logs an error and refuses to start.
Recovery: Delete the collection manually (or rename collectionName in config), then restart.
Structured JSON logging via pino:
| Level | Events |
|---|---|
| info | File indexed, file deleted, reindex started/completed, config reloaded |
| warn | Extraction failed, issues file entry, dimension mismatch |
| error | Embedding API failure (after retries), Qdrant write failure, startup failure |
| debug | Hash match (skip), debounce, queue depth, chunk processing |
Log format: JSON lines (parseable by standard log aggregators).
| Component | Library | Rationale |
|---|---|---|
| Filesystem watcher | chokidar | Cross-platform, glob support, battle-tested |
| HTTP framework | fastify | Lightweight, fast, schema validation |
| Text splitting | @langchain/textsplitters | Markdown-aware + recursive splitting |
| JSON Schema validation | ajv | Fast, extensible (custom glob format) |
| Embedding client | Direct HTTP (Gemini REST API) | No SDK bloat |
| Qdrant client | @qdrant/js-client-rest | Official client, typed |
| Logging | pino | Structured JSON, low overhead |
In `src/extractors/index.ts` (imports added so the sketch is runnable; the `yaml` package is an assumption):

```typescript
import { promises as fs } from 'node:fs';
import YAML from 'yaml'; // assumption: any YAML parser with a parse() works

async function extractYaml(filePath: string): Promise<ExtractionResult> {
  const content = await fs.readFile(filePath, 'utf8');
  const parsed = YAML.parse(content);
  return {
    text: JSON.stringify(parsed, null, 2),
    frontmatter: undefined,
    json: parsed
  };
}
```
Then register the extension in config:

```json
{
  "extractors": {
    ".yaml": "yaml-content"
  }
}
```
In `src/embedding/index.ts` (constructor and stub body added so the sketch type-checks):

```typescript
class MyEmbeddingProvider implements EmbeddingProvider {
  constructor(private config: EmbeddingConfig) {}

  async embed(texts: string[]): Promise<number[][]> {
    // Call your provider's API; return one vector per input text
    throw new Error('not implemented');
  }
}

export function createEmbeddingProvider(config: EmbeddingConfig): EmbeddingProvider {
  if (config.provider === 'my-provider') return new MyEmbeddingProvider(config);
  // ...
}
```
Then select it in config:

```json
{
  "embedding": {
    "provider": "my-provider",
    "model": "my-model",
    "apiKey": "${MY_API_KEY}"
  }
}
```