Wenku 文库: Structured document library

Wenku is a local-first architecture designed to manage large collections of documents alongside structured bibliographic metadata. It prioritises performance, offline access, and seamless integration with terminal-based workflows. It is currently delivered as a command-line tool, so it can be used directly in a shell or from any text editor with terminal access, such as Vim, Neovim, or Helix. The library design separates raw document storage, plain-text access, and structured metadata into distinct but linked components, all unified through a shared UUID-based identifier scheme.

Core design principles

Local-first: all data is stored and accessed without requiring network connectivity
Separation of concerns: plain-text access and metadata are independent layers; the AST (abstract syntax tree) is packaged in the compact CBOR format for use with other software
Zero-copy access: memory mapping lets documents be read in place, without first copying them into intermediate buffers
Stable identifiers: UUIDs provide consistent identity across all components
Editor-agnostic terminal workflow: optimised for terminal usage; easily integrated with any editor that can access the terminal. Use Vim, Neovim, Helix, VSCode, Emacs, Sublime, Zed, nano, micro, etc.

The software is designed to be simple, consisting of only three main components: document storage (memory-mapped plain text), a metadata database (SQLite), and the logic that searches documents and metadata and prints results to stdout. The components are linked through a shared UUID.

Document storage

engine: mmap
purpose: store plain-text documents for direct access
key type: UUID

This component stores extracted or authored plain-text versions of documents (e.g., document.txt, document.md, document.toml, etc.). Files are memory-mapped to allow fast, zero-copy access.

With memory-mapped access, text is read directly from the file as the operating system maps it into memory, without first being copied into the temporary in-memory data structures that most applications build. This can make reads considerably faster than in libraries that load and parse each document before use. It also enables efficient handling of large text corpora, with immediate usability in the terminal and text editor, and suits offline workflows.

This layer serves as the primary interface for reading documents, enabling fast searching, viewing, and serving data to your text editor without the need to decode binary formats.
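
As a rough illustration of the memory-mapped access described above, the following Python sketch searches a plain-text file without reading it into memory first. The function names and the idea of one file per UUID are assumptions made for the example, not Wenku's actual layout. Note that in Python, slicing the map does copy the requested bytes, so only the search itself is copy-free here.

```python
import mmap
from pathlib import Path


def find_in_document(path: Path, needle: bytes) -> int:
    """Return the byte offset of the first match, or -1 if there is none."""
    with open(path, "rb") as fh:
        # Map the whole file read-only; the OS pages data in on demand.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)


def read_span(path: Path, start: int, length: int) -> str:
    """Decode only the requested byte range of the mapped file."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing copies just these bytes, not the whole document.
            return mm[start:start + length].decode("utf-8")
```

A caller would pass the path of a stored text file together with a UTF-8 encoded search term, then use the returned offset to pull out the surrounding passage with read_span. Empty files cannot be mapped, so a real implementation would need to handle that case separately.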

Metadata storage

engine: SQLite
purpose: store structured bibliographic metadata
key type: UUID (linked to the document storage key)

The metadata layer maintains structured information about each document. It uses SQLite for portability and performance. SQLite is designed for exactly this kind of local use: the database is a single file on your machine, so reads involve no server process or inter-process messaging, which for local workloads can make them faster than client-server databases such as PostgreSQL or MySQL. Because everything lives in one file, users can send their whole database to a colleague as an email attachment, and it is immediately usable on the other end. SQLite also supports full-text search via FTS5 and can be extended to support semantic search.

This layer acts as the primary entry point for all queries, enabling structured filtering that narrows the search space before more expensive operations are performed. All document discovery flows through the metadata database, which reduces the candidate set of documents prior to semantic vector search and full-text search. This significantly improves both performance and accuracy when retrieving relevant documents and specific passages. In addition, it provides citation metadata for export and links all storage layers via UUID.
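
The metadata-first flow described above might look like the sketch below: a cheap SQL filter produces candidate UUIDs, and only those documents are scanned for a phrase. The table and column names, and the plain dictionary standing in for the document store, are assumptions for illustration only.

```python
import sqlite3


def candidate_uuids(conn, year_from, classification):
    """Narrow the search space with structured metadata only."""
    rows = conn.execute(
        "SELECT uuid FROM documents "
        "WHERE year >= ? AND classification = ? ORDER BY uuid",
        (year_from, classification),
    )
    return [row[0] for row in rows]


def search(conn, texts, year_from, classification, phrase):
    """Scan text only for documents that survived the metadata filter.

    `texts` maps uuid -> plain text and stands in for the document store.
    """
    return [
        uuid
        for uuid in candidate_uuids(conn, year_from, classification)
        if phrase in texts.get(uuid, "")
    ]
```

The more expensive step (here a substring test, in practice full-text or vector search) never sees documents that the metadata query has already excluded.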

Database schema

Field	Description
UUID	Primary key; links document storage and metadata records
Author	Individual / organisation
Title	Document title
Document classification	Document typology: 令 (order), 公告 (public notice), etc.
Year	Publication year
Issuing body	Organisation that issued or released the document
Identifier	External identifier: DOI, government document number, etc.
Keywords	Searchable keyword tags
Attributes	Extended set of metadata not suitable for core fields (JSON)
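
One way the fields above could be expressed as a SQLite table is sketched below. The column names, types, and the choice to store keywords as a single text column are assumptions; the actual schema may differ.

```python
import json
import sqlite3

COLUMNS = ("uuid", "author", "title", "classification", "year",
           "issuing_body", "identifier", "keywords", "attributes")

SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    uuid            TEXT PRIMARY KEY,  -- shared with the document store
    author          TEXT,
    title           TEXT NOT NULL,
    classification  TEXT,              -- document typology
    year            INTEGER,
    issuing_body    TEXT,
    identifier      TEXT,              -- DOI, government document number, ...
    keywords        TEXT,              -- assumed: comma-separated tags
    attributes      TEXT               -- JSON object serialised as text
);
"""


def open_library(path=":memory:"):
    """Open (or create) the metadata database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_document(conn, record):
    """Insert one record; fields missing from `record` are stored as NULL."""
    row = {col: record.get(col) for col in COLUMNS}
    row["attributes"] = json.dumps(record.get("attributes") or {}, ensure_ascii=False)
    placeholders = ", ".join(":" + col for col in COLUMNS)
    conn.execute(
        f"INSERT INTO documents ({', '.join(COLUMNS)}) VALUES ({placeholders})",
        row,
    )
    conn.commit()
```

Keeping the extended attributes as serialised JSON leaves the core columns stable while still allowing per-document extras.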

Citation format

standard: BibLaTeX | CSL JSON
purpose: ensure compatibility with document processing tools

The program supports widely used citation formats to integrate with tools such as Pandoc, Zotero, Google Scholar, Scopus, university library databases, etc. Metadata from the database can be exported into these formats for interoperability.

Plain text human-readable formats like BibLaTeX and CSL JSON enable interoperability with external tools and support easy inspection, editing, import and export.
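
To make the export step concrete, here is a minimal sketch that turns one metadata record into a CSL JSON item and a BibLaTeX entry. The record keys, the chosen entry types ("report" and @misc), and the treatment of the author as a literal name are assumptions for illustration; special LaTeX characters are not escaped.

```python
import json


def to_csl_json(record: dict) -> dict:
    """Build a CSL JSON item from a metadata record (assumed field mapping)."""
    item = {
        "id": record["uuid"],
        "type": "report",  # assumed mapping from the document classification
        "title": record["title"],
        "author": [{"literal": record["author"]}],
        "issued": {"date-parts": [[record["year"]]]},
    }
    if record.get("issuing_body"):
        item["publisher"] = record["issuing_body"]
    return item


def to_biblatex(record: dict, key: str) -> str:
    """Render a BibLaTeX @misc entry; values are inserted without escaping."""
    fields = [
        # Extra braces keep the author as one literal name (e.g. an organisation).
        ("author", "{" + record["author"] + "}"),
        ("title", record["title"]),
        ("date", str(record["year"])),
    ]
    if record.get("issuing_body"):
        fields.append(("organization", record["issuing_body"]))
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@misc{{{key},\n{body},\n}}"
```

A list of items from to_csl_json can be written out with json.dumps and passed to tools that read CSL JSON, while the strings from to_biblatex can be concatenated into a .bib file.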

Text editor integration

editor: any editor with access to a terminal
purpose: provide fast, accurate citation workflow

Integration with text editors is achieved through external command hooks and CLI tools, allowing users to search, insert, and navigate citations directly in the terminal or from within the editor.

Custom command-line tools enable querying of metadata, while the editor integrates via external command hooks and keybindings for @citekey lookup and insertion. This makes the database directly usable during writing and eliminates context switching between tools. Currently, citation tools are often tied to word processors such as Word, or rely on slow, mouse-heavy GUI workflows, as in Zotero, Mendeley, and EndNote, where the user must click through options and fill in boxes. They are not designed to integrate well with command-line use, and in many cases they are proprietary, paid applications, which locks users into a specific workflow and makes interoperability difficult.
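
A command-line lookup of this kind could be as small as the sketch below, which prints one tab-separated line per match so that an editor hook or shell pipeline can consume it. The command name wenku-search, its arguments, and the table layout are hypothetical and only illustrate the pattern.

```python
import argparse
import contextlib
import sqlite3
import sys


def build_parser():
    # "wenku-search" is a hypothetical command name used for illustration.
    parser = argparse.ArgumentParser(prog="wenku-search")
    parser.add_argument("term", help="substring to look for in titles")
    parser.add_argument("--db", default="wenku.db", help="path to the SQLite metadata file")
    return parser


def run(conn, term, out=sys.stdout):
    """Print uuid, author, year, and title for each match; return the match count."""
    rows = conn.execute(
        "SELECT uuid, author, year, title FROM documents "
        "WHERE title LIKE ? ORDER BY year, uuid",
        (f"%{term}%",),
    )
    count = 0
    for row in rows:
        print("\t".join(str(col) for col in row), file=out)
        count += 1
    return count


def main(argv=None):
    args = build_parser().parse_args(argv)
    with contextlib.closing(sqlite3.connect(args.db)) as conn:
        # Exit status 0 when something matched, 1 otherwise (grep-like).
        return 0 if run(conn, args.term) else 1
```

An editor would typically bind a key to run such a command and insert the selected line's identifier at the cursor. Calling sys.exit(main()) from a script entry point would turn the sketch into an executable tool.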

Citekey

Format: human-readable identifier (optional)
Example: @zhang_2014_environmental_politics

The UUID is the canonical identifier used across the entire system and may be cited directly when desired (e.g. JPXJZUA5JLHBOZCA). Citekeys are optional, human-readable aliases created locally on a per-document basis (e.g. @zhang_2014_environmental_politics). They are introduced only when the author wants them for convenience during writing: authors can search for documents in their local metadata, automatically generate a citekey, and insert it into a document, all through the terminal. Citekeys follow a structured pattern (e.g., author_year_keyword) and are generated with awareness of the existing bibliography. When a naming conflict is detected, the keyword is replaced with the next keyword in the document's list to ensure uniqueness within the local collection. Note that citekeys are designed to be local to a manuscript only; they are a convenience for the author, and the UUID remains the canonical identifier. This approach preserves global uniqueness and stability via UUIDs while allowing flexible, readable references. It is especially beneficial where authors write in a plain-text format such as Markdown or Typst.
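
The citekey scheme described above can be sketched as follows. The function names are hypothetical, the author is assumed to be given as a romanised family name, and raising an error when every keyword is taken is one possible policy that the text does not specify.

```python
import re


def slug(text: str) -> str:
    """Lower-case and collapse anything outside a-z and 0-9 into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def make_citekey(family_name: str, year: int, keywords: list, existing: set) -> str:
    """Return author_year_keyword, moving to the next keyword on a conflict."""
    base = f"{slug(family_name)}_{year}"
    for keyword in keywords:
        key = f"{base}_{slug(keyword)}"
        if key not in existing:
            return key
    # Assumed policy: fail loudly rather than invent a suffix.
    raise ValueError("every keyword candidate is already in use")
```

Given the family name "Zhang", the year 2014, and the keywords "environmental politics" and "governance", the first call yields zhang_2014_environmental_politics; if that key already exists in the local bibliography, the result is zhang_2014_governance instead.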