Raw Data

Data in Open Research

Research must expose its raw data in formats that are durable, transparent, and machine-readable. Proprietary spreadsheet files such as Excel (.xlsx) are not suitable as canonical research data formats.

Why Not Excel?

Spreadsheet software is useful for a quick look at raw data, but excel file formats are organised in exactly the same way as Microsoft Word; they are complex zip files of XML, styling, metadata, formulas, and interface instructions. So it has all the same problems as other bloated MS software does: it's slow, opaque, proprietary software, with a rigid workflow and binary file format that is not searchable. But it also has several other problems important to storing and analysing raw numerical and 'string' text types:

  • Hidden formatting and hidden data mutations
  • Automatic type conversion (dates, scientific notation errors)
  • Embedded formulas that are difficult to audit
  • Proprietary dependencies
  • Difficult version comparison

There are a few good alternatives to a rigid, proprietary, opaque binary file format. I won't pretend that there is a 'perfect' format to store and analyse raw numerical and string data types; they each do a different job, and have different advantages and disadvantages. But at the very least, they should be 'open standards' software, which is independently implementable and open source. Here are a few good rules of thumb about what raw data file formats should do:-

Data should be stored independently of presentation and interface logic. Raw research data must be editable without proprietary software.

Plain Text Data Formats

Plain text data formats separate data from display. They are human-readable, version-controllable, and interoperable across systems.

1. CSV (Comma-Separated Values)

A simple tabular format where rows are lines and columns are separated by commas.


          name,age,mark
          Alice,29,88
          Bob,31,92
          
    Advantages
  • Universally supported, tabular format
  • Diff-friendly in version control
  • No hidden formatting
Disadvantages
  • Not suitable for complex or accurate computation, e.g. statistics
  • Not suitable for large, complex data storage, especially text with commas
  • Parsing plain text is slow and error prone for large files
  • Windows can have problems with utf-8 text format

2. TSV (Tab-Separated Values)

Similar to CSV but uses tab characters as separators. Often safer when textual fields contain commas.


          name  age  mark
          Alice  29  88
          Bob  31  92
          

3. JSON (JavaScript Object Notation)

Suitable for hierarchical or nested data structures, same problems as csv.

{
          "students": [
            {"name": "Alice", "age": 29, "mark": 88},
            {"name": "Bob", "age": 31, "mark": 91}
          ]
        }
        
  • Supports structured, nested objects
  • Widely used in APIs and databases
  • Machine-native format

4. YAML

A human-friendly structured format often used for configuration and metadata.


          students:
            - name: Alice
              age: 29
              mark: 88
            - name: Bob
              age: 31
              mark: 91
          

5. TOML

A structural, nested plain text structure that is often more readable than YAML.


            [[students]]
            name = "Alice"
            age = 29
            mark = 88

            [[students]]
            name = "Bob"
            age = 31
            mark = 91
            

Advantages of Plain Text Data

  • Transparent — no hidden types: everything is a character!
  • Version-controllable — line-by-line comparison
  • Platform-independent
  • Archivable long-term
  • Scriptable and automatable
  • Compatible with open repositories

Plain text data can be validated, hashed, cited, and reproduced deterministically. It integrates directly with structured research documents and computational workflows.

Disadvantages of Plain Text data for scientific work

  • No intrinsic type system — there are no integers, floats, booleans, or date-time objects
  • All values are stored as characters; 4 is "4", not a number
  • No distinction between data and representation — formatting and meaning are conflated
  • Ambiguity is unavoidable without external rules (e.g. 01, 1.0, 1e0)
  • No enforcement of constraints, ranges, or validity
  • Requires parsing into a typed system before any real computation can occur
  • Different tools will interpret the same file differently unless strictly controlled
  • Highly inefficient for large-scale numerical or scientific workloads

Plain text is fundamentally a presentation format. It is designed for human readability, not for computation. It describes data, but it does not embody it in a form that machines can operate on safely. This means that plain text cannot serve as a complete format for serious scientific or computational work. Before any analysis can begin, the data must be interpreted, parsed, and cast into a typed system. This step is unavoidable, and it is a major source of error if not explicitly controlled. In other words, plain text data is not the data itself — it is a serialisation of data. It is a way of writing data down, not a way of working with it. For rigorous research, work must occur in systems that enforce types, constraints, and deterministic behaviour. The strength of plain text is transparency and durability. Its weakness is that it guarantees neither correctness nor meaning without an additional layer of structure.

Data Types and Ambiguity

One of the most overlooked problems in research data is the question of data types. Every value in a dataset has an implied type: number, string (text), boolean, date, or null. If these types are not handled explicitly, software will guess — and those guesses can silently corrupt data.

Spreadsheet software like Excel aggressively infers types. This may seem convenient, but it introduces serious risks:

  • Gene names such as SEPT2 are automatically converted into dates
  • Leading zeros are removed from identifiers (00123 → 123)
  • Large numbers are converted into scientific notation and truncated
  • Mixed-type columns (numbers + text) are inconsistently interpreted

These transformations are often invisible and irreversible once saved. The data is no longer the original observation — it has been rewritten by the software.

Plain text formats such as CSV avoid hidden mutation, but they introduce a different problem: they do not encode types at all.


      id,value
      00123,10
      00124,NA
      00125,12.5
      

In this example:

  • Is 00123 a string identifier or a number?
  • Is NA a string, a missing value, or a special token?
  • Should 10 be treated as integer or float?

Different software will interpret the same CSV file in different ways. This means that CSV is transparent but ambiguous, while Excel is opaque but coercive.

For reliable research workflows, this ambiguity must be resolved explicitly. There are several approaches:

  • Provide a separate schema file describing column types
  • Use formats that encode types directly (e.g. JSON, TOML)
  • Enforce type validation in analysis code
  • Avoid mixed-type columns entirely

The key principle is that data must not depend on software guesses. Types should be defined, documented, and reproducible across systems.

Typed Binary Formats for Scientific Work

For serious scientific and computational work, data must exist in a strictly typed format. Systems that operate on numbers, dates, and structured objects require those types to be explicitly defined and enforced — not guessed from text.

Modern data systems solve this using columnar, typed, binary formats. The most widely used are:

  • Apache Parquet — for storage, exchange, and long-term datasets
  • Apache Arrow — for in-memory computation and data transfer between systems

These formats directly address the core failures of both Excel and plain text:

  • Data types are explicit and enforced (integers, floats, booleans, timestamps)
  • No silent mutation or automatic coercion
  • Efficient storage and compression for large datasets
  • Deterministic behaviour across tools and programming languages
  • Schema is embedded alongside the data

In this model, a Parquet file is not a “document” — it is the authoritative representation of the dataset. It is the format in which computation is performed, results are validated, and data is exchanged between researchers. Arrow extends this further by providing a shared in-memory representation, allowing different tools (Python, R, databases, query engines) to operate on the same data without conversion.

Trade-offs

The primary disadvantage of these formats is that they are binary:

  • Not human-readable
  • Require specialised tools to inspect (e.g. DuckDB, Arrow tools, Parquet viewers)
  • Cannot be meaningfully versioned line-by-line

This makes them unsuitable for direct presentation or casual inspection. A Parquet file cannot be “read” in the same way as a CSV file — it must be queried or rendered through software.

Separation of Roles

A robust research workflow separates data into distinct layers:

  • Binary typed data (Parquet, Arrow) — source of truth for computation
  • Plain text (CSV, JSON) — transport, inspection, and lightweight interchange
  • Presentation (tables, figures) — human-readable summaries in papers

When sharing data with collaborators or publishing to repositories, the typed binary dataset should be treated as the canonical version. Plain text exports can be provided alongside it, but they are secondary representations. In publications, data is necessarily reduced. Tables in papers — whether CSV snippets or Markdown tables — provide a representative sample, not the full dataset. They give readers a sample of the data, while the complete, typed dataset remains available separately for verification and reuse.

Data and Reproducibility

Raw data should either be included directly in the research repository or linked to an external open repository. Data files must be:

In open research, data cannot be a screenshot of a spreadsheet. It is a structured, machine-readable artifact that can be independently verified.