The aim is to provide markers with a simple, automatically generated report indicating the likelihood that unusual processes were involved in producing a submission. Outputs include a JSON report containing a detailed log of metadata observations and heuristic signals, alongside a simplified .txt summary intended for rapid review by markers and module leaders. The system is designed as a triage mechanism rather than an automated judgement system. Its purpose is not to determine guilt or innocence, but to assist academic staff in identifying submissions whose document history may warrant closer inspection.

The core epistemological principle behind the system is conservative rather than accusatory. Metadata extracted from Microsoft Word documents is not treated as proof of misconduct; instead, it is treated as evidence of possible drafting behaviour. Word documents contain many forms of embedded metadata generated during ordinary writing activity, including editing-session history, revision counts, timestamps, authorship fields, structural XML traces, and internal revision identifiers. The presence of such metadata can provide positive evidence that a document underwent an organic drafting process inside a normal editing environment.

However, metadata can be lost, rewritten, stripped, damaged, or partially destroyed through many entirely legitimate workflows. For example, metadata may be altered through file conversion, cloud editors, non-Microsoft office software, document repair operations, sanitisation tools, copying between systems, or ordinary user behaviour. As a result, the absence of metadata cannot be treated as positive evidence that no drafting process took place. A student may genuinely have written a document normally while still producing an unusually sparse metadata footprint due to technical circumstances outside their awareness.

The important asymmetry is therefore that metadata may be lost, but it cannot spontaneously appear without some generating process having existed at some point in the document’s history. In practical terms, the presence of rich drafting metadata can increase confidence that a document evolved through ordinary iterative writing behaviour, while the absence of such metadata merely increases uncertainty rather than proving misconduct. This distinction is fundamental to the intended use of the system.

The software is therefore designed primarily as a mechanism for excluding large numbers of low-risk submissions from further scrutiny, rather than positively identifying misconduct. In a large assessment cohort, many students will display extensive ordinary metadata consistent with prolonged drafting activity. The system can rapidly classify these cases as low concern, allowing markers to focus attention more efficiently on unusual or ambiguous submissions. Cases flagged by the system are not assumed to involve misconduct; they are merely cases where the metadata profile differs sufficiently from ordinary drafting behaviour to justify human review.

The method itself automates the detection of unusual editing behaviour in Word submissions using embedded metadata and document structures. It applies both to individual .docx submissions and to zip archives containing multiple drafts. The pipeline extracts metadata from each document, performs heuristic checks, analyses structural patterns within the XML document representation, and generates a summary conclusion indicating whether unusual behaviour indicators are present. Because the system operates heuristically, it cannot be expected to capture every edge case or sophisticated workflow. Real-world writing behaviour has a long tail of uncommon but legitimate scenarios.

For this reason, the professional judgement of the lecturer or marker must remain authoritative in all cases. The software cannot determine whether plagiarism, contract cheating, or LLM-assisted writing actually occurred. It can only provide probabilistic indicators derived from document behaviour. Any disciplinary or academic integrity decision must therefore depend upon broader contextual evidence, including the quality of the student’s work, prior performance, oral explanation, drafting consistency, and formal misconduct procedures such as vivas or interviews.

Student instructions: Students must draft their work in Microsoft Word. All drafts and the final document must be kept in a single folder. At submission, students must zip all drafts and upload the zip file, while also uploading the final document unchanged. Students must not use “Save As” or create new documents externally in a way that overwrites metadata.

Automated processing pipeline: First, unzip the student’s submission. Then, for each .docx file, extract metadata from docProps/core.xml and docProps/app.xml. Next, evaluate the data using heuristic checks. Optionally, aggregate the findings into a JSON report. Generate a plain text conclusion file containing a single line: “no evidence of unusual behaviour” or “evidence of unusual behaviour”. Finally, append the .txt file to the submission archive or post the result to Canvas.
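The first two pipeline steps can be sketched with the Python standard library alone. This is a minimal illustration rather than the system's actual code: the function names iter_docx and extract_metadata and the keys of the returned dictionary are assumptions. It relies on the fact that a .docx file is itself a zip package containing [Content_Types].xml, which lets the same entry point accept either a single document or a zip of drafts.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the OOXML document-property parts.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dcterms": "http://purl.org/dc/terms/",
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
}


def iter_docx(upload_bytes):
    """Yield (name, docx_bytes) for a single .docx or a zip archive of drafts."""
    with zipfile.ZipFile(io.BytesIO(upload_bytes)) as zf:
        names = zf.namelist()
        if "[Content_Types].xml" in names:
            # The upload is itself an OOXML package, i.e. one .docx file.
            yield "(single document)", upload_bytes
            return
        for name in names:
            # Skip the resource-fork entries that macOS adds to zip archives.
            if name.lower().endswith(".docx") and not name.startswith("__MACOSX/"):
                yield name, zf.read(name)


def _text(root, path):
    """Return the stripped text of the first matching child, or None."""
    node = root.find(path, NS)
    return node.text.strip() if node is not None and node.text else None


def extract_metadata(docx_bytes):
    """Read document properties from one .docx supplied as raw bytes.

    Missing parts or fields are reported as None instead of raising,
    because absent metadata is an observation to log, not an error.
    """
    meta = {"created": None, "modified": None, "last_modified_by": None,
            "revision": None, "total_editing_minutes": None}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        names = set(zf.namelist())
        if "docProps/core.xml" in names:
            core = ET.fromstring(zf.read("docProps/core.xml"))
            meta["created"] = _text(core, "dcterms:created")
            meta["modified"] = _text(core, "dcterms:modified")
            meta["last_modified_by"] = _text(core, "cp:lastModifiedBy")
            meta["revision"] = _text(core, "cp:revision")
        if "docProps/app.xml" in names:
            app = ET.fromstring(zf.read("docProps/app.xml"))
            meta["total_editing_minutes"] = _text(app, "ep:TotalTime")
    return meta
```

A caller would loop over iter_docx(upload) and pass each document's bytes to extract_metadata. Values are returned as raw strings so that the JSON log can record exactly what the file contained.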

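Metadata fields: The system uses Created, Modified, TotalEditingTime, RevisionNumber, and LastModifiedBy. In the .docx package these correspond to dcterms:created, dcterms:modified, cp:revision, and cp:lastModifiedBy in docProps/core.xml, and to TotalTime (recorded in minutes) in docProps/app.xml. These fields are used to infer editing duration, revision activity, and authorship consistency.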
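The heuristic checks themselves are not specified in this document, so the following is only a sketch of how three of them might look while respecting the asymmetry described earlier: a present, rich value counts as supporting evidence, a present but implausible value counts as unusual, and an absent value is recorded as missing, never as unusual. The thresholds MIN_EDIT_MINUTES and MIN_REVISIONS, the function names, and the dictionary keys are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical thresholds for illustration only; real values would need
# calibrating against a cohort of ordinary submissions.
MIN_EDIT_MINUTES = 30
MIN_REVISIONS = 3


def parse_timestamp(value):
    """Parse a timestamp such as 2024-01-01T10:00:00Z; return None if unusable."""
    if not value:
        return None
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    # Treat timestamps without an offset as UTC so comparisons never fail.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def _check_minimum(signals, value, minimum, label):
    """Append one (kind, message) signal for a numeric metadata field."""
    if value is None or not str(value).strip().isdecimal():
        # Absence only increases uncertainty; it is not flagged as unusual.
        signals.append(("missing", f"{label} not available"))
    elif int(value) >= minimum:
        signals.append(("supporting", f"{label} is {int(value)}"))
    else:
        signals.append(("unusual", f"{label} is {int(value)}, below {minimum}"))


def heuristic_signals(meta):
    """Return a list of (kind, message) tuples for one document's metadata.

    kind is 'supporting', 'unusual', or 'missing'.
    """
    signals = []
    _check_minimum(signals, meta.get("total_editing_minutes"),
                   MIN_EDIT_MINUTES, "editing time in minutes")
    _check_minimum(signals, meta.get("revision"), MIN_REVISIONS, "revision count")
    created = parse_timestamp(meta.get("created"))
    modified = parse_timestamp(meta.get("modified"))
    if created is None or modified is None:
        signals.append(("missing", "created or modified timestamp not available"))
    elif modified < created:
        signals.append(("unusual", "modified timestamp precedes created timestamp"))
    return signals
```

Keeping "missing" distinct from "unusual" means the JSON log can show a reviewer why a file was not classed as low concern without implying that sparse metadata is itself suspicious.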

Output: The JSON report is a full structured record intended for review by integrity officers if needed. The TXT summary is a single-line verdict for markers or LMS integration, reading either “no evidence of unusual behaviour” or “evidence of unusual behaviour”.
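The two outputs could be produced along the following lines. This is a sketch under stated assumptions: signals are taken to be (kind, message) pairs with kind equal to "supporting", "unusual", or "missing", and the rule for choosing the verdict line is hypothetical. Here a file is classed as low concern only when it shows at least one supporting signal and no unusual ones, so a file with nothing but missing metadata is passed to a human reviewer. The JSON layout is likewise illustrative.

```python
import json

LOW_CONCERN = "no evidence of unusual behaviour"
REVIEW = "evidence of unusual behaviour"


def verdict_line(signals):
    """Pick the single-line verdict from a list of (kind, message) signals."""
    kinds = [kind for kind, _ in signals]
    # Assumed policy: positive drafting evidence is required for low concern.
    if "unusual" in kinds or "supporting" not in kinds:
        return REVIEW
    return LOW_CONCERN


def build_reports(filename, metadata, signals):
    """Return (json_text, txt_text) for one submission."""
    report = {
        "file": filename,
        "metadata": metadata,
        "signals": [{"kind": kind, "message": msg} for kind, msg in signals],
        "verdict": verdict_line(signals),
    }
    json_text = json.dumps(report, indent=2, ensure_ascii=False)
    return json_text, report["verdict"] + "\n"
```

The caller decides where the two strings are written, for example to summary.txt next to the essay or to an attachment on the submission record.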

Integration: In Canvas, the summary.txt or JSON file can be attached to the submission record for staff visibility. Alternatively, the summary.txt can be kept in the same folder as the student essay for audit purposes. The final document should remain uploaded separately outside the zip file so that markers can use the native SpeedGrader interface without disruption. The metadata check then runs on the zip archive and requires no input from the marker.

Limitations: Metadata can be altered using third-party tools, so this system is heuristic only. It does not perform text-level analysis or AI content detection. It is intended for triage rather than disciplinary proof.

Implementation status: Fully deployable today with minimal requirements, needing only a quick script and no administrative privileges.

At the moment, the software primarily analyses metadata already exposed by Microsoft Word and OpenDocument formats, including fields such as total editing time, revision count, creation timestamps, modification timestamps, and authorship metadata. This provides a basic heuristic foundation for identifying unusual writing behaviour. The system is not intended to function as proof of plagiarism or proof of LLM use. Instead, it acts as a triage mechanism designed to identify submissions whose editing history appears statistically or structurally unusual, allowing those cases to be reviewed more carefully by academic staff. The final determination must always remain with human judgement through established academic integrity procedures such as vivas or misconduct interviews. However, there is a deeper layer of metadata embedded inside Word documents that may significantly improve this process.

Modern Microsoft Word documents are internally structured as ZIP containers containing XML files. The visible essay shown to the marker is therefore only the rendered output of a much larger underlying document structure. The actual document text is primarily stored inside word/document.xml, where Word preserves information about formatting, editing sessions, revision structures, paragraph boundaries, language settings, and internal document state. This underlying XML structure can preserve traces of how text entered the document, even when those traces are not visible in the final rendered version seen by the user.

For example, when a sentence is typed normally inside Word, the XML structure is often relatively clean and internally consistent. A sentence such as “This is a sentence.” may appear as a single contiguous text run:


        <w:p>
          <w:r><w:t>This is a sentence.</w:t></w:r>
        </w:p>
        

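If the sentence is then edited through ordinary typing in the same editing session, Word will often store the paragraph as a single coherent run, although edits made in a later session may be split into separate runs carrying their own rsidR attributes: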


        <w:p>
          <w:r><w:t>This is a longer sentence.</w:t></w:r>
        </w:p>
        

However, text inserted through copy-paste operations frequently produces more fragmented XML structures. The visible sentence may appear identical to the marker, but internally Word may preserve separate formatting runs, imported style fragments, spacing directives, language metadata, or inconsistent structural boundaries:


        <w:r><w:t>This is a </w:t></w:r>
        <w:r><w:t>longer </w:t></w:r>
        <w:r><w:t>sentence.</w:t></w:r>
        

More complex imported content may contain additional metadata or formatting residue inherited from another application, browser session, AI interface, or external document:


        <w:r><w:t xml:space="preserve">This is a </w:t></w:r>
        <w:r><w:rPr><w:lang w:val="en-US"/></w:rPr><w:t>longer </w:t></w:r>
        <w:r><w:t>sentence.</w:t></w:r>
        

Or formatting fragments may be introduced during external composition workflows:


        <w:r><w:t>This is a </w:t></w:r>
        <w:r>
          <w:rPr><w:b/></w:rPr>
          <w:t>longer</w:t>
        </w:r>
        <w:r><w:t> sentence.</w:t></w:r>
        

These structural traces do not automatically prove misconduct, because ordinary workflows can also generate fragmented XML. Nevertheless, large-scale structural inconsistency across a document may indicate that substantial portions of text originated outside the visible drafting environment. This creates the possibility of combining traditional metadata analysis with low-level document structure analysis in order to identify unusual composition behaviour more accurately.
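One rough way to quantify this kind of fragmentation is to count the text-bearing runs in each paragraph of word/document.xml. The sketch below is illustrative only: the function name run_fragmentation and the use of characters per run as the metric are assumptions rather than part of the system described here, and no threshold is implied.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace (the "w:" prefix in the samples above).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def run_fragmentation(document_xml):
    """Return one dict per non-empty paragraph with run and character counts.

    Paragraphs nested inside runs (for example in text boxes) are counted
    both on their own and as part of their parent, which is acceptable for
    a rough indicator but not for exact statistics.
    """
    root = ET.fromstring(document_xml)
    stats = []
    for para in root.iter(f"{{{W}}}p"):
        runs = 0
        chars = 0
        for run in para.iter(f"{{{W}}}r"):
            text = "".join(t.text or "" for t in run.iter(f"{{{W}}}t"))
            if text:
                runs += 1
                chars += len(text)
        if runs:
            stats.append({"runs": runs, "chars": chars,
                          "chars_per_run": chars / runs})
    return stats
```

A consistently low characters-per-run figure across many paragraphs would be a prompt for closer reading, not a finding, since later-session edits, spell-check markers, and formatting changes also split runs.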

An especially important part of this deeper metadata layer is Microsoft Word’s internal Revision Save Identifier system, commonly known as RSIDs. RSIDs are internal editing-session identifiers embedded throughout Word documents and used by Microsoft Word for merge operations, revision tracking, conflict resolution, and session-aware editing behaviour. As documents are edited and saved, they accumulate RSID values over time. Because descendant documents typically inherit earlier RSIDs while adding new ones, RSIDs can sometimes be used to reconstruct document genealogy, editing history, and ancestry relationships between drafts.

For example, if two student drafts share substantial RSID overlap, they may belong to the same editing lineage. If one draft contains all the RSIDs of another draft plus additional editing-session identifiers, this may suggest a chronological ancestor-descendant relationship between the files. Conversely, if a supposedly independent final submission contains little or no RSID continuity with earlier drafts, this may indicate that the final text was generated externally and pasted into a new document environment.
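The comparison described above can be sketched as a set operation. Word lists a document's accumulated RSIDs as w:rsid elements in word/settings.xml, so the sketch reads that part. The function names, the relationship labels, and the decision to report "unknown" when either file has no RSIDs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/settings.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_rsids(settings_xml):
    """Return the set of RSID values listed in a word/settings.xml part."""
    root = ET.fromstring(settings_xml)
    values = (el.get(f"{{{W}}}val") for el in root.iter(f"{{{W}}}rsid"))
    return {v for v in values if v}


def compare_lineage(earlier, later):
    """Describe how two RSID sets relate; the labels are illustrative."""
    if not earlier or not later:
        # Stripped or absent RSIDs increase uncertainty and prove nothing.
        return "unknown"
    if earlier <= later:
        return "consistent with ancestor-descendant"
    if earlier & later:
        return "partial overlap"
    return "no shared RSIDs"
```

Documents created from the same template can share RSIDs without sharing any drafting history, so even a full subset match is only consistent with common ancestry and does not establish it.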

RSID analysis is not cryptographic proof and should never be treated as definitive evidence of plagiarism or AI use. RSIDs can be altered, stripped, regenerated, or contaminated by templates, copy-paste behaviour, non-Microsoft editors, document repair operations, or metadata sanitisation tools. Nevertheless, when combined with content analysis, authorship metadata, revision timelines, and XML structure analysis, RSIDs may provide unusually detailed insight into how a document evolved over time.

A dedicated technical explanation of RSID construction, inheritance behaviour, genealogy reconstruction, template contamination, and forensic limitations is available here: RSID technical analysis.