An RSID, or Revision Save Identifier, is an internal Microsoft Word metadata mechanism used to identify editing and saving activity inside Word documents. It was introduced in Microsoft Word 2007 and is found in formats such as DOCX, DOCM, DOC, and RTF. In practical terms, an RSID is a 32-bit integer identifier, usually represented as a hexadecimal value such as 008C0F1F. Microsoft Word uses RSIDs internally for editing-session awareness, merge operations, change tracking, conflict resolution, and revision-related behaviour. Although RSIDs are not normally visible to a user editing a document, they are embedded in the document structure and can be extracted for document analysis.
In DOCX and DOCM files, the document is a ZIP container made up of XML parts. RSID values are found in word/settings.xml and word/document.xml, as well as in revision-tracking structures. In word/settings.xml, the w:rsids element holds the document's RSID table: a w:rsidRoot child element followed by w:rsid child elements, each carrying its value in a w:val attribute. In word/document.xml, RSIDs appear as XML attributes on content elements such as paragraphs, runs, and table rows, including rsidR, rsidP, rsidRPr, rsidDel, and rsidTr. These identifiers do not all mean exactly the same thing, because their meaning depends on the XML context in which they appear. For genealogy and provenance analysis, however, rsidRoot and rsidR are usually the most important.
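The extraction step described above can be sketched in Python using only the standard library. This is a minimal illustration rather than a complete forensic extractor: the function names extract_rsids and normalise_rsid are hypothetical, only the two parts named in the text are read by default, and matching on local XML names (ignoring the namespace) is a simplifying assumption. For untrusted files, a hardened XML parser would be preferable to xml.etree.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

_HEX = re.compile(r"[0-9A-Fa-f]{1,8}")


def normalise_rsid(value):
    """Return the RSID as 8 upper-case hex digits, or None if it is not valid hex."""
    value = value.strip()
    if not _HEX.fullmatch(value):
        return None
    return value.upper().zfill(8)


def _local(name):
    """Strip the '{namespace}' prefix that ElementTree puts on tags and attributes."""
    return name.rsplit("}", 1)[-1]


def extract_rsids(docx_source, parts=("word/settings.xml", "word/document.xml")):
    """Collect RSIDs from a DOCX/DOCM file given as a path or file-like object.

    Returns {"root": <rsidRoot or None>, "rsids": <set of normalised RSIDs>}.
    """
    root_rsid = None
    rsids = set()
    with zipfile.ZipFile(docx_source) as zf:
        names = set(zf.namelist())
        for part in parts:
            if part not in names:
                continue
            for elem in ET.fromstring(zf.read(part)).iter():
                elem_name = _local(elem.tag)
                for attr, raw in elem.attrib.items():
                    attr_name = _local(attr)
                    # <w:rsid w:val="..."/> and <w:rsidRoot w:val="..."/> in settings.xml
                    is_table_entry = elem_name in ("rsid", "rsidRoot") and attr_name == "val"
                    # w:rsidR, w:rsidP, w:rsidDel, w:rsidTr, ... on content elements
                    is_rsid_attr = attr_name.startswith("rsid")
                    if not (is_table_entry or is_rsid_attr):
                        continue
                    rsid = normalise_rsid(raw)
                    if rsid is None:
                        continue
                    rsids.add(rsid)
                    if is_table_entry and elem_name == "rsidRoot":
                        root_rsid = rsid
    return {"root": root_rsid, "rsids": rsids}
```

Calling extract_rsids("report.docx") on a hypothetical file would return the root RSID and the combined set of RSIDs seen in the two parts. Keeping the settings table and the content attributes in separate sets is a reasonable refinement when the analysis needs to distinguish them.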
The rsidRoot value is typically associated with the original root of the document. It is assigned when the document is first created and should remain stable across later versions created through ordinary editing, copying, or Save As operations. This makes it valuable for detecting documents that may share a common origin, and it is considerably more stable than the metadata found in Document Properties, which may be reset by operations such as Save As. If several files contain the same rsidRoot, this suggests that they belong to the same document family, were derived from the same original document, or were created from a related source file.
The rsidR values are associated with editing sessions and save operations. A Word document may accumulate multiple rsidR values over time as it is revised, saved, merged, copied, or otherwise modified. This accumulation is what makes RSIDs useful for ancestry analysis. A later descendant document may inherit most or all of the RSIDs from an earlier ancestor document while also introducing additional RSIDs from later editing sessions. The resulting RSID set can therefore act as a kind of edit-lineage residue left inside the file.
The core genealogy assumption is that document descendants inherit RSIDs from ancestor versions while adding new RSIDs during later editing. If the RSID set of document A is a proper subset of the RSID set of document B, then document A is likely to be an ancestor of document B. If two documents share one or more RSIDs, they may belong to the same genealogical tree. If two documents contain identical RSID sets, they may be duplicate, equivalent, or near-equivalent versions. If documents A and B share some RSIDs but neither set is a subset of the other, this may indicate that both documents descend from a common ancestor, such as a missing intermediate version or a shared earlier file that is not available for analysis, and that each was edited separately afterwards.
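The four cases above map directly onto set comparisons. The following Python sketch is illustrative only: the function name classify_relationship and the label strings are hypothetical, and each label is a hypothesis to be weighed against other evidence, not a conclusion.

```python
def classify_relationship(rsids_a, rsids_b):
    """Classify two RSID sets under the inheritance assumption.

    The returned label describes what the sets are consistent with;
    it does not prove ancestry.
    """
    a, b = frozenset(rsids_a), frozenset(rsids_b)
    if not a or not b:
        return "no-data"            # nothing to compare (e.g. stripped metadata)
    if a == b:
        return "identical"          # duplicate or near-equivalent versions
    if a < b:
        return "a-ancestor-of-b"    # A's RSIDs are a proper subset of B's
    if b < a:
        return "b-ancestor-of-a"
    if a & b:
        return "common-ancestor"    # overlap, but neither contains the other
    return "unrelated"              # no shared RSIDs observed
```

For example, classify_relationship({"008C0F1F"}, {"008C0F1F", "00A12B34"}) returns "a-ancestor-of-b", because the first set is a proper subset of the second.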
This set-based model makes RSIDs useful for document genealogy reconstruction, version ancestry analysis, Save As lineage detection, template attribution, workstation attribution, collaborative editing analysis, document clustering, timeline reconstruction, fraud investigation, and insider leak investigation. In implementation terms, an RSID analysis pipeline normally extracts all RSIDs from each document, normalises hexadecimal values, builds a per-document RSID set, performs set intersection and subset analysis, and then constructs an ancestry graph. Useful storage structures include a document-to-RSID map, an RSID-to-document map, an rsidRoot index, a template RSID index, and a document relationship graph. SQLite, graph databases, columnar stores, or compressed bitmap indexes can all be used depending on scale.
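The indexing and graph-building stages of such a pipeline can be sketched with plain dictionaries and sets. This is a small in-memory illustration, assuming the per-document RSID sets have already been extracted and normalised; the names build_rsid_index and ancestry_edges are hypothetical. The RSID-to-document map is used to limit comparisons to documents that share at least one RSID, and edges implied by an intermediate version are dropped so that only the most direct observed links remain.

```python
from collections import defaultdict
from itertools import combinations


def build_rsid_index(doc_rsids):
    """Invert a document-to-RSID map into an RSID-to-document map."""
    index = defaultdict(set)
    for doc, rsids in doc_rsids.items():
        for rsid in rsids:
            index[rsid].add(doc)
    return index


def ancestry_edges(doc_rsids):
    """Return sorted (ancestor, descendant) pairs based on proper-subset tests.

    doc_rsids maps a document name to a set of RSIDs. Documents with no
    RSIDs never appear in the index and so produce no edges.
    """
    index = build_rsid_index(doc_rsids)
    candidates = set()
    for docs in index.values():
        # Only documents sharing at least one RSID are compared.
        for a, b in combinations(sorted(docs), 2):
            if doc_rsids[a] < doc_rsids[b]:
                candidates.add((a, b))
            elif doc_rsids[b] < doc_rsids[a]:
                candidates.add((b, a))
    # Drop an edge a -> b when some observed document m sits between them.
    direct = {
        (a, b)
        for (a, b) in candidates
        if not any((a, m) in candidates and (m, b) in candidates for m in doc_rsids)
    }
    return sorted(direct)
```

The pairwise comparison and the reduction step are quadratic or worse in the number of documents, so this form suits small collections; at larger scale the same logic would sit on top of one of the storage options mentioned above.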
RSIDs are also useful for provenance analysis because they may reveal shared editing environments. Microsoft Word templates, including Normal.dotm, can contribute RSIDs to newly created documents. Documents sharing RSIDs from the same template may indicate shared workstation usage, a shared user profile, or a shared organisational template. This can be useful evidence, but it can also create false-positive relationships between documents that are not directly related. For this reason, known template RSIDs should be tracked separately, ignored where appropriate, or given lower evidential weight. A stronger analysis should require multiple RSID matches, weight rare RSIDs more heavily, and distinguish template-derived identifiers from edit-derived identifiers.
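One way to implement the weighting described above is to borrow inverse document frequency from text retrieval, so that an RSID seen in many documents counts for less than one seen in only a few. This scheme is an assumption chosen for illustration, not a method prescribed by Word or by the text; the names rsid_weights and overlap_score, the smoothing term, and the default threshold of two shared RSIDs are all hypothetical.

```python
import math
from collections import Counter


def rsid_weights(doc_rsids, template_rsids=frozenset()):
    """Assign each RSID a rarity weight; known template RSIDs get zero weight."""
    n = len(doc_rsids)
    doc_freq = Counter(r for rsids in doc_rsids.values() for r in set(rsids))
    return {
        # Smoothed inverse document frequency: rarer RSIDs weigh more.
        r: 0.0 if r in template_rsids else math.log((n + 1) / count)
        for r, count in doc_freq.items()
    }


def overlap_score(rsids_a, rsids_b, weights, template_rsids=frozenset(), min_shared=2):
    """Score the overlap of two RSID sets, ignoring template-derived RSIDs.

    Returns 0.0 unless at least min_shared non-template RSIDs are shared.
    """
    shared = (set(rsids_a) & set(rsids_b)) - set(template_rsids)
    if len(shared) < min_shared:
        return 0.0
    return sum(weights.get(r, 0.0) for r in shared)
```

Setting template weights to zero corresponds to ignoring them outright; giving them a small positive weight instead would implement the "lower evidential weight" option.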
Several edge cases make RSID analysis difficult. Template pollution is a high-severity problem because corporate templates or shared Normal.dotm files may inject identical RSIDs into unrelated documents. Metadata stripping is also high-severity, because tools such as LibreOffice, Google Docs export, sanitisation tools, and metadata removal tools may strip or rewrite Word metadata structures, partially or completely destroying RSID lineage. Non-Microsoft editors may not preserve RSID semantics correctly, making analysis unreliable across mixed editing environments.
Copy-paste behaviour can also complicate interpretation. Content copied from one Word document into another may carry RSID-linked structures with it, causing two documents to appear genealogically related even when there is no direct ancestor-descendant relationship. Document merging creates an even more complex problem because it can combine multiple RSID histories into a single file. In those cases, genealogy may no longer be a simple tree; it may be better modelled as a directed acyclic graph, or DAG, where different branches of document history have been combined.
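A simple heuristic for spotting where a tree model breaks down is to look for documents whose RSID set contains two or more other documents' sets that do not contain one another. The sketch below is a hypothetical illustration (the name multi_parent_candidates is not from any library), and a flagged document is only a candidate: merging, copy-paste from another file, and reunited Save As branches would all produce the same pattern.

```python
def multi_parent_candidates(doc_rsids):
    """Find documents with two or more distinct maximal proper-subset documents.

    doc_rsids maps a document name to a set of RSIDs. Returns a dict from
    each flagged document to the sorted names of its candidate parents.
    """
    flagged = {}
    for doc, rsids in doc_rsids.items():
        # Every other non-empty document whose RSIDs are all contained in this one.
        subsets = [
            other for other, other_rsids in doc_rsids.items()
            if other != doc and other_rsids and other_rsids < rsids
        ]
        # Keep only the maximal ones, i.e. those not contained in another subset.
        maximal = [
            d for d in subsets
            if not any(doc_rsids[d] < doc_rsids[o] for o in subsets)
        ]
        # Exact duplicates count as one parent, so compare distinct sets.
        if len({frozenset(doc_rsids[d]) for d in maximal}) >= 2:
            flagged[doc] = sorted(maximal)
    return flagged
```

In a graph representation, each flagged document becomes a node with more than one incoming edge, which is exactly what distinguishes a DAG from a tree.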
Save As operations create another expected form of complexity. A single document may be saved into several sibling versions that all share the same early RSIDs but then diverge as each version is edited separately. Because no sibling's RSID set contains another's, subset analysis alone cannot order them, and several observed documents may appear equally plausible as the ancestor of a later version unless additional evidence is available. Partial observation creates a similar problem: in real investigations, only some versions in a genealogy chain may be available, so missing ancestors often have to be inferred probabilistically rather than directly observed.
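Under the inheritance assumption, the RSIDs shared by every sibling give a rough estimate of what the unobserved common ancestor contained at the point of divergence. The Python sketch below illustrates that idea; the function name infer_common_ancestor is hypothetical, and the result is an estimate that template-derived or copy-pasted RSIDs can distort.

```python
def infer_common_ancestor(sibling_rsid_sets):
    """Estimate a missing ancestor's RSID set from its suspected siblings.

    Returns (common, divergent): common is the set of RSIDs present in
    every sibling, and divergent lists, per sibling and in input order,
    the RSIDs that sibling added after the presumed split.
    """
    siblings = [frozenset(s) for s in sibling_rsid_sets]
    if not siblings:
        return frozenset(), []
    common = frozenset.intersection(*siblings)
    divergent = [s - common for s in siblings]
    return common, divergent
```

If an observed document has exactly the inferred common set, it is a candidate for the shared ancestor; if none does, the ancestor remains a hypothesis.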
RSID analysis also has technical limits. RSIDs are not cryptographic identifiers, and they should not be treated as globally unique. A collision between any two given RSIDs is low-probability but possible, and because the value space is only 32 bits, chance collisions become more likely as the number of RSIDs in a corpus grows. Different Word versions, repair operations, or format conversions may regenerate or remove RSIDs, breaking lineage continuity. Not all edits necessarily generate persistent RSIDs. Documents may also be laundered through other formats, manually edited, stripped of metadata, or injected with synthetic RSID values. These adversarial possibilities mean that RSIDs are not secure provenance markers.
For these reasons, RSID evidence should be treated as probabilistic forensic inference rather than cryptographic proof of ancestry. Strong indicators include large RSID overlap, proper subset relationships, a shared rsidRoot, consistent edit chronology, matching authorship metadata, and matching template lineage. Weak indicators include a single shared RSID, template-only overlap, small overlap sets, or documents generated across different applications. RSIDs should therefore be combined with supporting signals such as document hashes, text similarity, embedded timestamps, author metadata, filesystem timelines, email provenance, change-tracking records, and content fingerprinting.
A useful conceptual comparison is with Git. Git stores explicit commit ancestry and records formal parent-child relationships between versions. RSIDs do not provide that kind of explicit version-control structure. Instead, they provide implicit ancestry signals left behind by Word’s editing and saving behaviour. They are best understood as lightweight edit-session fingerprints or accumulated editing residue rather than formal commits, secure signatures, or immutable provenance records.