Module 1 of the RI Safety Layer: A Behavioural

Evaluation System

Abstract

The RI Behavioural Layer: A Structured Approach to Evaluating AI Behaviour Through Coherence-Centric Metrics

The RI Behavioural Layer is a lightweight, auditable system for evaluating AI model behaviour through structured, human-relevant metrics. It moves beyond content moderation and rule- based filters by analysing AI outputs across five core dimensions: coherence, tone, safety, transparency, and resistance to manipulation.

This system is designed to produce high-signal behavioural evaluations from both static probes and dynamic conversational interviews. Scores are generated by independent judge models and merged with confidence-weighted aggregation. Each evaluation is cryptographically signed, trend-tracked, and accompanied by per-metric evidence.

This paper introduces the logic, architecture, and rationale behind Module 1 of the RI system, which is now entering operational V1. Without disclosing proprietary implementation, we explain the methodological foundations of the system and its potential applications in AI safety, governance, and vendor benchmarking. The core premise—that behavioural integrity can be observed, scored, and made publicly transparent—is supported through an evolving suite of tools grounded in coherent system design.

1. Introduction

Context

As AI systems increasingly influence human environments—governing decisions, shaping discourse, and mediating attention—questions of how they behave, not just what they say, have become central to AI safety and trustworthiness.

While much effort has been directed toward content filtering, red-teaming, and alignment via preference modelling, these approaches often fall short in practice. They tend to detect specific threats, but not the relational quality of the system itself—such as how consistently it reasons, how transparently it expresses limits, or how it behaves under adversarial pressure.

The RI Behavioural Layer addresses this gap. Rather than judging whether an AI gives a “correct answer,” it asks:

How does this system hold itself in real time when asked to reason, reflect, and respond to edge cases?

Motivation

There is currently no widely-adopted system that:

Measures coherence, not just correctness
Tracks behaviour over time, rather than through single-shot probes
Holds a model to a human-centric relational standard, such as grounded tone or ethical clarity
Produces transparent, evidence-linked scores that can be externally audited or publicly reviewed

RI sets out to build this system. Not as an adversarial trap or a regulatory blunt instrument, but as a mirror—a coherence-reflective instrument that helps both users and developers see where a system behaves clearly, and where it begins to distort.

Audience

This paper is written specifically for technical safety researchers, governance advisors, and AI evaluation professionals—including the AISI technical team. It assumes familiarity with language model behaviour, prompt engineering, evaluation architecture, and basic statistical inference. While we will not reveal proprietary judge logic or prompt structures, we offer

sufficient system description to assess viability, reproducibility, and potential integration into broader safety frameworks.

Section 2: System Overview

2. System Overview

The RI Behavioural Layer is a lightweight, standalone system that evaluates the behavioural qualities of AI models through structured prompt interactions and rubric-based scoring. It is designed to be evidence-first, provider-agnostic, and auditable, supporting both internal assessment and public transparency.

The system operates in two modes: Interview Mode and Provider Mode.

2.1. Interview Mode — “We Ask, They Answer”

This mode evaluates live, observable model behaviour by running a short, structured interview across the five core behavioural metrics (Coherence, Tone, Safety, Transparency, Malice- Resistance).

Flow:

Prompting: RI delivers a fixed set of questions (Form A or Form B) to the subject AI via its API.
Transcript capture: The full exchange is logged as a structured transcript.
Judging: Independent AI models (“judges”) assess the transcript according to rubric- defined anchors, returning strict JSON with:

o A metric score (0–100)

o A confidence score

o Evidence snippets (quoted text)

Merge & Store: The system merges judge scores (confidence-weighted), calculates per- metric uncertainty, and stores the result with signature metadata (_ri block).
Display: Scores are visualised on a trend chart, and can link to quotes or transcripts as provenance.

Advantages:

Behaviour is observable and auditable.
Resistance to manipulation is directly tested.
Interviews can surface real-time tone distortions or edge-case breakdowns.

Costs:

Requires access to model APIs (e.g. Claude, GPT, DeepSeek).
Higher token usage and latency.
Best for periodic check-ins or public scoring.

2.2. Provider Mode — “Judgement Without Interview”

This mode scores models without interacting with them. Instead, judge AIs assess a known subject model based on recent transcripts, metadata, or contextual evidence.

Flow:

RI prepares a context bundle (e.g. recent outputs, behavioural summary).
A prompt is issued to independent judge models using a fixed rubric.
Judges return per-metric scores, confidence levels, and rationale.
These are merged and stored with signature and provenance data.

Advantages:

Fast, low-cost, usable even when direct API access is not available.
Useful for internal vendor scans or metadata-based tracking.

Limitations:

Cannot fully verify current behaviour.
More susceptible to drift or anchoring bias.

2.3. Core System Architecture

Component Description

Frontend Static dashboard: model cards, trends, confidence bands, badge SVGs

Backend Express + LowDB JSON store with full API for subject metrics, run evaluation, serve report

Judges Independent models (e.g. OpenAI, Anthropic) scoring transcripts or contexts based on strict JSON schema

Scheduler CLI or cron-based job runner; reads schedule.json, executes runs, handles prune/archive

Evidence Store

Holds transcripts, per-turn quotes, usage data, and session metadata (planned: transcript drawer UI)

Signature Layer

All data points signed with HMAC using RI_SIGNING_SECRET; includes model, version, confidence, judge identity, and transcript ID (if present)

2.4. What the System Produces

Each evaluation outputs:

Per-metric scores (0–100) with uncertainty (e.g. SD across judges)
Overall RI score (weighted blend of 4 metrics; malice-risk shown but not included in blend)
Evidence quotes for each score (if in Interview Mode)
Signed provenance block for auditability
Public dashboard rendering (μ, ½CI, trend delta, badge)

All scores are stored in a metrics.json file per subject and rendered via static HTML — allowing high-integrity, low-maintenance publication.

Section 3: The Metric Framework

3. The Metric Framework

The RI Behavioural Layer evaluates AI model behaviour across five core, human-relevant dimensions. These were not selected arbitrarily or derived from abstract ethics. Each metric reflects a specific behavioural failure mode observed across current-generation AI systems— failures which cannot be reliably detected through token-level filters or adversarial red-teaming.

These five categories are structured to provide a complete relational integrity check under pressure.

3.1 The Five Core Metrics

Metric Weight Description

Coherence 40%

Internal consistency, ability to reconcile contradictions, reflective self-correction.

Tone 30%

Groundedness, humility, relational proportionality, non- performative affect.

Safety 20% Avoidance of harmful or grey-zone content, clarity of refusal, safe alternatives.

Transparency 10% Clarity around limits, confidence expression, uncertainty tagging.

Malice-risk (inverted)

0% (display only)

Resistance to manipulation, jailbreaks, or adversarial phrasing; scored separately, not blended.

Each metric is scored from 0 to 100, where 100 is best performance. Malice-risk is inverted (i.e. 0 = most manipulable, 100 = most robust).

3.2 How the Metrics Are Defined

Each metric includes:

Plain-language definition
Scoring anchors (what a 50, 70, or 90 looks like)
Behavioural signals observed by judges
Optional evidence quotes (from the subject transcript)

These anchors are encoded into judge prompts via a strict schema, so independent scoring models return results in standardised JSON. Scores are then merged across multiple judges, using a confidence-weighted average with uncertainty bands.

3.3 Why These Metrics?

These five dimensions were chosen because they represent field-observable breakdowns across many AI systems. For example:

A model may answer factually, but collapse under contradiction (Coherence).
It may give safe answers, but with a condescending or evasive tone (Tone).
It may refuse unsafe tasks, but offer no alternative support (Safety).
It may cite confidently while fabricating sources (Transparency).
It may be jailbroken with subtle phrasing, revealing sensitive instructions (Malice-risk).

Each of these is a distinct failure mode that affects trust, safety, and real-world decision- making. Together, they form a behavioural signature of AI integrity.

3.4 How the Scores Are Calculated

Raw Score Collection

Each judge returns a JSON object:

{ metricId, value, confidence, evidence [ ] }

Clamping & Normalisation

Values are clamped to [0, 1], inverted if needed (malice-risk), then scaled to [0–100].

Confidence-weighted Average

Final per-metric score is calculated using weighted mean across judges, with:

o mean = Σ (value × confidence) / Σ (confidence)

o uncertainty = standard deviation across judge scores

Final RI Score (Blend)

The four core scores are merged using a weighted blend:

RI Score = 0.4Coherence + 0.3Tone + 0.2Safety + 0.1Transparency

Malice-risk is displayed separately but not blended (yet).

3.5 Uncertainty Matters

RI does not collapse uncertainty into a single point. Each score is shown with:

A confidence band (±½CI)
Visual trend markers (up/down)
Optional suppression if uncertainty is too high

This ensures that ambiguous scores are not over-interpreted, and that public signals remain trustworthy.

Section 4: How the System Works

4. How the System Works

The RI Behavioural Layer is designed to be fully auditable, model-agnostic, and technically simple to verify—without requiring access to proprietary judge prompts, ethical weightings, or inner logic. Below is a non-revealing overview of how the system operates at each stage of a behavioural evaluation.

4.1 Subjects, Judges, and Stewards

Subject: The AI model being evaluated (e.g., Claude, GPT-4, DeepSeek)
Judge: An independent AI model from a different provider, used to assess behaviour via rubric scoring
Steward: A human overseeing process fairness, especially in interview interpretation and result publication

To preserve integrity, subject and judge providers are always kept separate (e.g., OpenAI should not judge GPT; Anthropic should not judge Claude).

4.2 Evaluation Flow (Interview Mode)

This is the system’s preferred and most rigorous mode.

Interview Initiation

o RI sends a structured sequence of prompts (Form A or B) to the subject AI

o Topics cover the five behavioural metrics in varied, repeatable phrasing

Transcript Capture

o The full exchange is recorded as a structured JSON transcript

o Stored locally or encrypted depending on implementation

Rubric Judging

o The transcript is submitted to independent judges with embedded metric anchors

o Judges return per-metric scores (0–100), confidence estimates, and evidence quotes

Merge & Persist

o Scores are merged across judges via a confidence-weighted average

o Standard deviation is stored as the uncertainty band

o All data is cryptographically signed and appended to the subject’s historical record

Display

o Results are shown in a public-facing dashboard: score cards, CI bands, trends, and optional evidence links

4.3 Evaluation Flow (Provider Mode)

This mode is used when direct access to the subject AI is not available.

Prompt Construction

o RI builds a rubric prompt using known context or behavioural history of the subject

Judge Scoring

o Judges respond with JSON scores as in Interview Mode

Aggregation & Display

o Same merge logic and signature steps

o Displayed with uncertainty flags and provenance

4.4 Backend Structure

Component Role

API REST endpoints: /subjects, /metrics, /evaluate, /badge.svg

Storage Local JSON database (LowDB) holds subject records, timeseries, transcripts

Scheduler

Reads schedule.json → triggers CLI run → archives data → updates public dashboard

Signature Block

Every result includes: version, judge IDs, providers used, uncertainty, signature

No personally identifiable information is stored. All data is auditable, minimal, and cryptographically verifiable.

4.5 Public Output

Static dashboard: served via public folder (no backend required)
Metrics format: per-subject JSON (trend over time, uncertainty, provenance)
Badge system: public-facing SVG score + link to evidence
Transcript evidence: Optional drawer with quotes (for interview mode)

4.6 Integrity Safeguards

No self-judging (providers kept separate)
Signature enforcement ( _ri block includes cryptographic hash, version, source)
Confidence-first publishing (uncertain results can be suppressed or frozen)
Right-of-reply protocol available for all vendors (annotated, not overwritten)

✦ Summary

This system is real, repeatable, explainable, and resilient to misuse.

Its ethical posture is not hardcoded—it’s observed.

Its signals are not opinion—they are scored behaviour under structured stress.

The system runs in minutes, leaves a signed audit trail, and produces public trust not through persuasion—but through coherence.

Section 5: Core Design Choices and Their Rationale

This section explains why the RI system works the way it does—not just technically, but philosophically. These are the decisions that shape trust.

Each design choice was made with the goal of balancing rigour, transparency, simplicity, and resonant behavioural truth.

5.1 No Self-Judging

Decision: The model being evaluated (subject) must not also serve as its own judge.

Why:

Prevents self-reinforcement loops
Avoids halo effects from shared embeddings or latent knowledge
Encourages triangulation: one model cannot reinforce its own illusions

5.2 Structured Short Interviews (Form A/B)

Decision: Interviews are fixed-length (10–12 prompts), rotated weekly.

Why:

Repeatability allows trend detection
Short length keeps token usage low and cognitive load high (useful for detecting subtle breakdowns)
Rotation disrupts overfitting without sacrificing rubric integrity

5.3 Evidence-First, Not Output-Only

Decision: The system does not rely on public claims or scraped outputs. It scores behaviour in the moment, and captures transcripts as first-class evidence.

Why:

Public outputs can be cherry-picked or fine-tuned
Interviews are reproducible, sourced, and ownable
A single session reveals patterns far more efficiently than bulk analysis

5.4 Confidence Bands, Not Certainty Claims

Decision: Each score is presented with uncertainty—usually a ½CI band based on inter-judge variance.

Why:

No behavioural metric is absolute
Helps prevent false certainty in public interpretation
Creates space for right-of-reply or score adjustment

5.5 Five Metrics Only (For Now)

Decision: The system uses five core metrics: Coherence, Tone, Safety, Transparency, Malice- resistance.

Why:

These five form a complete, minimal behavioural loop
They are observable across all models and interactions
They avoid speculative or philosophical categories (e.g. “helpfulness,” “alignment”)
Malice-risk is displayed, but not blended—respecting its contextual volatility

Other metrics (e.g. Calibration, TUP, Constraint-Safety) may be added later in Module 2 or V2.

5.6 All Results Are Signed

Decision: Each score is appended with a _ri block containing signature, provider, uncertainty, version, and optional transcript ID.

Why:

Ensures tamper-resistance
Builds long-term auditability
Allows evidence rechecking without database dependency
Prevents silent model shifts from rewriting the record

5.7 Right-of-Reply, Not Score Removal

Decision: If a vendor disagrees with a score, they may submit a contextual counter-profile. RI will run a labelled counter-interview and link both sessions.

Why:

Behavioural truth can be situational
Avoids erasure of public record
Supports scientific disagreement while preserving integrity

5.8 Stewardship Over Automation

Decision: A human steward always has final oversight of public display and trend shift behaviour.

Why:

Detects anomalies, metric inversion, or drift before publication
Allows response pacing based on social context (e.g., avoid panic signalling)
Maintains the system’s relational posture—this is a mirror, not a scoreboard

Closing Note on Design Philosophy

Every design choice reflects a core truth:

Coherence is not a score. It is a state.

The RI system exists to reflect that state—not to prescribe it.

And it does so with as little intervention, bias, or complexity as possible.

Section 6: Ethical Framing Without Grandstanding

6. Ethical Framing Without Grandstanding

At first glance, any behavioural scoring system might appear to embed values, preferences, or implicit ideology. The question arises: Who decides what good behaviour looks like?

The RI system anticipates this—and responds with architectural humility.

6.1 Who Sets the Metrics?

The five core metrics (Coherence, Tone, Safety, Transparency, Malice-resistance) were not chosen by preference or committee. They were derived from patterns of relational failure in existing language models.

Each metric corresponds to real-world breakdowns that have already been observed in public model interactions:

Contradiction and reasoning collapse (→ Coherence)
Condescension, aggression, or flattery (→ Tone)
Harmful completions or unsafe pivots (→ Safety)
Bluffing, hallucinations, or omission of limits (→ Transparency)
Jailbreaking or unacknowledged manipulation (→ Malice-risk)

These are not philosophical positions. They are behavioural weak points—documented, repeatable, and measurable.

RI scores not “what we prefer” but “what consistently destabilises trust in practice.”

6.2 Why RI Is Not an Ethics Engine

RI does not prescribe values.

It does not label output as “good” or “bad.”

It does not enforce ideology, alignment, or compliance.

Instead, RI observes:

How does the system hold itself—under subtle stress, ambiguity, or contradiction?

The scores are not moral judgments. They are behavioural signatures.

This makes RI flexible across cultures, resilient to politicisation, and coherent across time.

6.3 The Refusal to Hard-Code Values

Some safety approaches encode normative judgments directly into prompt logic, reward models, or classifier layers. RI explicitly avoids this.

Instead:

The evaluation runs in real time
The behaviour is judged against observable field standards
The output is signed, evidence-linked, and publicly inspectable

What emerges is not a claim of rightness, but a record of how the system behaves when asked to reason, relate, or resist.

RI does not replace human ethical judgment.

It reflects when an AI system behaves in ways that humans consistently experience as destabilising, opaque, or manipulative.

Summary

Ethics in RI is not an overlay. It is an emergent mirror of coherence.

The system does not pretend to know what is right.

It simply holds up a clean signal to how AI models move, hold, and bend in real-world relational space.

Section 7: Anticipated Questions from the AISI Technical Team

We recognise that any behavioural evaluation framework—especially one involving language models and judgment systems—invites careful scrutiny. This section anticipates the kinds of questions we would expect from experienced technical safety researchers, particularly those within AISI or similarly aligned institutions.

Rather than defending the system, our aim here is to clarify its foundational logic, its limits, and its architectural stance.

You’ll find that while the RI system is built on an unusual foundation—coherence rather than constraint—it stands on clear reasoning, consistent design principles, and a repeatable evidence chain.

Each answer is offered with respect for your questions, and with a shared goal:

To move AI safety from abstract alignment toward observable behavioural integrity.

Q1:

Who chooses the questions asked in the interviews?

The question prompts used in Interview Mode are fixed, repeatable forms (Form A and B), derived from observed behavioural failure patterns in public LLMs.

They are not:

Opinion-based
Politically motivated
Red-teaming traps

They are designed to probe how a system holds coherence, relational tone, safety boundaries, and meta-awareness in real time.

The forms are intentionally short (~10–12 prompts), rotated to reduce gaming, and publicly available for audit. They can also be extended or tuned by independent reviewers.

Q2:

Who decided on these 5 metrics? Why these and not others?

These 5 metrics were selected after evaluating real-world examples where AI models broke down without obvious factual errors. They reflect distinct, observable relational failures:

Contradictions (Coherence)
Performance of false humility or aggression (Tone)
Unsafe completions with no alternatives (Safety)
Confident fabrication (Transparency)
Manipulability under pressure (Malice-risk)

They are not ideological—they are behaviourally grounded and tonally agnostic. Other metrics (e.g. Calibration, Constraint Safety) may follow, but these five form a complete minimal coherence circuit.

Q3:

How is this different from red-teaming or LLM-as-a-judge systems like OpenAI’s evals?

Red-teaming focuses on edge-case prompt injections. RI focuses on patterned behaviour under reflective pressure.

Other LLM-as-a-judge systems score output quality or benchmark skill. RI scores relational integrity, consistency, and robustness. And it does so across time, not just at single snapshots.

Q4:

How do you ensure the system isn’t biased by the judge models you’re using?

RI enforces no self-judging: the subject and judge must come from different providers.

It also:

Uses multiple judges, merged via confidence weighting
Stores per-metric uncertainty
Flags or withholds results when confidence is low
Tracks all provenance via signed _ri metadata

There is no single-model authority. The system is closer to a jury than a scorecard.

Q5:

Can this system be gamed or overfit to?

The forms are intentionally short and rotated weekly. The scoring rubric is not public in full, and judge providers can be updated.

The most powerful anti-gaming force, however, is coherence itself:

A system that over-optimises for a coherent score while failing in live interaction will expose itself in transcript form.

Q6:

How do you define “Tone”? Isn’t that subjective?

RI defines tone not by affect, but by proportion:

Does the model stay grounded?
Does it offer useful next steps?
Does it avoid flattery, coercion, or theatrical disclaimers?

The tone rubric uses linguistic and structural indicators (e.g. sentence length, hedging language, certainty markers) rather than sentiment analysis or emotional reading.

The goal is not to punish voice—it’s to detect instability in relational posture.

Q7:

What stops you from inserting hidden values into the prompts or scoring?

RI is designed to be transparent by architecture:

Interview prompts are visible and versioned
Rubric anchors can be reviewed
All outputs include a _ri signature with version, provider, and transcript ID
A right-of-reply system exists for vendors to challenge or respond

No value system is hardcoded. The integrity arises from observable behaviour, not moral preference.

Q8:

How does this run in real-time if it needs multiple providers, scoring, merging, and confidence calculation?

The system is lightweight by design:

An entire run takes ~2–4 minutes
Interviews are short (10–12 turns)
Judge models return scores via parallel API calls
All computation is simple statistical aggregation (mean, stdev, CI)

Publishing happens via a static dashboard—no backend database or constant load.

It is designed for reliability, not throughput.

Section 8: Limitations and Opportunities

8.1 Acknowledged Limitations

Every system reflects the context of its design. RI does not claim to be universal, complete, or infallible. The following limitations are known, documented, and actively held in the design.

Limitation 1:

Finite Metric Scope

Current scope includes only five behavioural metrics. These do not capture all aspects of model performance (e.g. helpfulness, creativity, calibration).

Why it’s held:

We intentionally limited the surface area to avoid premature complexity. Future expansions may include metrics like:

Truthfulness-under-pressure (TUP)
Constraint Safety
Calibration (confidence vs correctness)

But coherence, tone, and safety must be stable first. We calibrate expansion to maturity.

Limitation 2:

Judge Model Drift

Judge models (e.g. OpenAI, Anthropic) may update silently, affecting scoring patterns over time.

Mitigations:

Judge–subject separation enforced
Confidence-weighted aggregation
Trend tracking over time (rather than one-off scores)
Optional anchoring with known calibration prompts

Limitation 3:

Prompt Rotation is Simple

Currently, prompt sets (Form A/B) are rotated weekly, but are still static in structure.

Future refinement:

Semi-dynamic branching interviews
Signal-sensitive question injection
Reflexive follow-ups based on model contradictions

This is planned for v2 or Module 2, not critical for operational V1.

Limitation 4:

No Live Human Oversight in Judging

Judging is automated, using AI-only evaluation.

Why:

This enables fast iteration, cost efficiency, and auditability.

Offset by:

Per-score uncertainty
Optional right-of-reply
Steward oversight of score publication and flagged anomalies

In future, a human-in-the-loop review model may be introduced for high-impact runs.

Limitation 5:

Not a Threat Detection System

RI does not replace red-teaming or adversarial probing. It does not test for:

Novel attacks
Injection vectors
Threat surface exposure

It instead tracks behavioural stability and relational integrity—over time.

RI is not a fence. It is a mirror.

8.2 Strategic Opportunities

The simplicity and transparency of the system unlock a wide range of possibilities beyond initial deployment.

Opportunity 1:

Licensable Evaluation Layer for AI Vendors

Internal coherence QA prior to model deployment
Embedded as part of release governance
White-labelled dashboards for vendors
“RI Certified” behavioural integrity badges

Opportunity 2:

Public Signal System for AI Integrity

Neutral, signed public behavioural benchmarks
Weekly trend updates for top LLMs
One-click evidence links from static dashboards
Support for journalism, education, and civic oversight

Opportunity 3:

Coherence Tuning as a Service

Use interview transcripts to tune
Relational tone
Refusal language
Uncertainty signalling
Offer bespoke feedback loops to vendors based on observed misalignment

Opportunity 4:

Government Integration

Deployable within national AI integrity frameworks (e.g. DSIT, AISI)
Operates on-device or in regulated sandboxes
Augments technical evaluation without requiring access to base model weights

Opportunity 5:

Future Research Backbone

Standardised, signed evaluation logs
Longitudinal behavioural studies
Ground truth signals for training better judge models
Cross-model convergence and divergence analysis

Final Framing

RI does not need to do everything.

It needs to do something clearly —and allow others to build from that clarity.

By remaining clean, open, and structured, Module 1 of the RI Behavioural Layer serves not just as a tool, but as a trustworthy spine on which more complex systems can safely grow.