Configuration
pgEdge Vectorizer can be configured through PostgreSQL's GUC (Grand Unified Configuration) system. These settings control how the extension connects to embedding providers, manages background workers, processes text chunks, and maintains the processing queue. Most settings can be changed by any user and take effect after reloading the configuration with pg_reload_conf(), though some require a server restart.
Provider Settings
These settings configure the connection to your embedding provider, including the API endpoint, authentication, and model selection.
| Parameter | Default | Description | Reload | Restart | Superuser |
|---|---|---|---|---|---|
pgedge_vectorizer.provider |
openai |
Embedding provider (openai, voyage, ollama) | No | No | No |
pgedge_vectorizer.api_key_file |
~/.pgedge-vectorizer-llm-api-key |
API key file path (not needed for Ollama) | No | No | No |
pgedge_vectorizer.api_url |
https://api.openai.com/v1 |
API endpoint | No | No | No |
pgedge_vectorizer.model |
text-embedding-3-small |
Model name | No | No | No |
Worker Settings
These settings control the background workers that process the embedding queue, including concurrency, batch sizes, and retry behavior.
| Parameter | Default | Description | Reload | Restart | Superuser |
|---|---|---|---|---|---|
pgedge_vectorizer.num_workers |
2 |
Number of workers | No | Yes | Yes |
pgedge_vectorizer.databases |
(empty) | Comma-separated list of databases to monitor | Yes | Yes | No |
pgedge_vectorizer.batch_size |
10 |
Batch size for embeddings | Yes | No | No |
pgedge_vectorizer.max_retries |
3 |
Max retry attempts | Yes | No | No |
pgedge_vectorizer.worker_poll_interval |
1000 |
Poll interval in ms | Yes | No | No |
Chunking Settings
These settings determine how text content is split into chunks before embedding generation.
| Parameter | Default | Description | Reload | Restart | Superuser |
|---|---|---|---|---|---|
pgedge_vectorizer.auto_chunk |
true |
Enable auto-chunking | Yes | No | No |
pgedge_vectorizer.default_chunk_strategy |
token_based |
Chunking strategy | Yes | No | No |
pgedge_vectorizer.default_chunk_size |
400 |
Chunk size in tokens | Yes | No | No |
pgedge_vectorizer.default_chunk_overlap |
50 |
Overlap in tokens | Yes | No | No |
pgedge_vectorizer.strip_non_ascii |
true |
Strip non-ASCII characters (emoji, box-drawing, etc.) | Yes | No | No |
Chunking Strategies
The default_chunk_strategy parameter accepts the following values:
| Strategy | Description |
|---|---|
token_based |
Fixed token count chunking with overlap. Simple and fast. Default strategy. |
markdown |
Structure-aware chunking that respects markdown boundaries. Preserves heading context but without merge/split refinement. Good balance of structure awareness and simplicity. |
hybrid |
Full structure-aware chunking inspired by Docling. Parses markdown structure, preserves heading context, and applies two-pass refinement (split oversized, merge undersized). Best for RAG with structured documents. |
Automatic Fallback for Plain Text
Both markdown and hybrid strategies include automatic fallback detection. If the content doesn't appear to be markdown (no headings, code fences, lists, etc.), the chunker automatically falls back to token_based chunking. This ensures:
- No unnecessary overhead for plain text documents
- Consistent behavior regardless of content type
- Optimal chunking strategy is always used
Detection criteria (content is treated as markdown if it has):
- At least one heading (#, ##, etc.)
- At least one code fence (``` or ~~~)
- Or two or more of: lists, blockquotes, tables, links
Markdown Chunking Strategy
The markdown strategy provides structure-aware chunking with heading context:
- Parses markdown structure: Recognizes headings, code blocks, lists, blockquotes, tables, and paragraphs
- Preserves heading context: Each chunk includes its heading hierarchy (e.g.,
[Context: # Chapter 1 > ## Section 1.1]) - Respects structure boundaries: Doesn't split in the middle of code blocks or tables
This is simpler and faster than hybrid but may produce less optimal chunk sizes.
Hybrid Chunking Strategy
The hybrid strategy provides superior chunking for structured documents by:
- Parsing markdown structure: Recognizes headings, code blocks, lists, blockquotes, tables, and paragraphs
- Preserving heading context: Each chunk includes its heading hierarchy (e.g.,
[Context: # Chapter 1 > ## Section 1.1]) - Two-pass refinement:
- Pass 1: Splits chunks that exceed the token limit at natural boundaries
- Pass 2: Merges consecutive undersized chunks that share the same heading context
This approach significantly improves RAG retrieval accuracy by maintaining semantic context that would be lost with naive text splitting.
Choosing a Strategy
| Use Case | Recommended Strategy |
|---|---|
| Mixed content (markdown + plain text) | hybrid or markdown (auto-fallback handles plain text) |
| Structured documentation | hybrid (best retrieval quality) |
| Simple documents, speed priority | token_based |
| Code-heavy content | markdown or hybrid (preserves code blocks) |
Example usage:
-- Enable vectorization with hybrid chunking
SELECT pgedge_vectorizer.enable_vectorization(
'documents',
'content',
chunk_strategy := 'hybrid',
chunk_size := 400,
chunk_overlap := 50
);
-- Or chunk text directly
SELECT * FROM unnest(
pgedge_vectorizer.chunk_text(
'# Introduction
This is the introduction.
## Background
More content here...',
'hybrid',
200,
20
)
);
-- Plain text automatically falls back to token-based
SELECT * FROM unnest(
pgedge_vectorizer.chunk_text(
'This plain text document will use token-based chunking automatically.',
'hybrid',
100,
10
)
);
Queue Management
These settings control automatic cleanup of completed queue items to prevent unbounded growth.
| Parameter | Default | Description | Reload | Restart | Superuser |
|---|---|---|---|---|---|
pgedge_vectorizer.auto_cleanup_hours |
24 |
Automatically delete completed queue items older than this many hours. Set to 0 to disable. Workers clean up once per hour. | Yes | No | No |